python, pyspark, parquet, mlrun

Valid parquet file, but error with parquet schema


I have a valid Parquet file (I am 100% sure) and it is the only file in the directory v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/. During the read operation I get the generic error AnalysisException: Unable to infer schema .... Full error detail:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-26-5beebfd65378> in <module>
      1 #error
----> 2 new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
      3 new_DF.show()
      4 
      5 spark.close()

/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths, **options)
    299                        int96RebaseMode=int96RebaseMode)
    300 
--> 301         return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
    302 
    303     def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,

/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1320         answer = self.gateway_client.send_command(command)
   1321         return_value = get_return_value(
-> 1322             answer, self.gateway_client, self.target_id, self.name)
   1323 
   1324         for temp_arg in temp_args:

/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

I used this code:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
new_DF.show()

The strange thing is that it worked correctly when I used the full path to the Parquet file:

new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/")
new_DF.show()

Has anyone had a similar issue?


Solution

  • The error happens because the Parquet file is not directly in the "v3io://projects/risk/FeatureStore/ptp/parquet/" folder, but in the nested "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/" folder. By default, spark.read.parquet only looks at files immediately inside the given path (plus partitioned subdirectories), so it finds no data files there and cannot infer a schema.

    This will work:

    new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/*/*/*")
    new_DF.show()
    

    Each `*` wildcard matches one directory level, so `*/*/*` descends exactly three levels (sets/ptp/1681296898546_70/) and picks up the Parquet files stored there.

    For more info about mass reading files with spark.read, check out this question: Regex for date between start- and end-date
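
  • On Spark 3.x there is also the `recursiveFileLookup` reader option, which scans all subdirectories regardless of depth (note that it disables partition discovery). A minimal, self-contained sketch of the idea, using a local temporary directory to stand in for the v3io path from the question:

    ```python
    import tempfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("recursive-read").getOrCreate()

    # Write a small DataFrame into a nested subdirectory, mimicking the
    # sets/ptp/1681296898546_70/ layout from the question.
    base = tempfile.mkdtemp()
    nested = f"{base}/sets/ptp/1681296898546_70"
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.parquet(nested)

    # Reading `base` alone would fail with "Unable to infer schema";
    # recursiveFileLookup makes Spark look for data files in every subdirectory.
    new_DF = spark.read.option("recursiveFileLookup", "true").parquet(base)
    new_DF.show()

    spark.stop()
    ```

    Compared to the `*/*/*` wildcard, this does not require knowing how many directory levels deep the files are.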