I had a correct parquet file (I am 100% sure) and only one file in this directory: v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/
I got this generic error AnalysisException: Unable to infer schema ...
during the read operation; see the full error detail:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-26-5beebfd65378> in <module>
1 #error
----> 2 new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
3 new_DF.show()
4
5 spark.close()
/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths, **options)
299 int96RebaseMode=int96RebaseMode)
300
--> 301 return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
302
303 def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,
/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py in __call__(self, *args)
1320 answer = self.gateway_client.send_command(command)
1321 return_value = get_return_value(
-> 1322 answer, self.gateway_client, self.target_id, self.name)
1323
1324 for temp_arg in temp_args:
/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
I used this code:
new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/")
new_DF.show()
What is strange is that it worked correctly when I used the full path to the parquet file:
new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/")
new_DF.show()
Has anyone had a similar issue?
The error happens because the parquet files are not directly in the "v3io://projects/risk/FeatureStore/ptp/parquet/"
folder, but in the nested "v3io://projects/risk/FeatureStore/ptp/parquet/sets/ptp/1681296898546_70/"
folder.
This will work:
new_DF=spark.read.parquet("v3io://projects/risk/FeatureStore/ptp/parquet/*/*/*")
new_DF.show()
Each * wildcard matches one directory level, so */*/* expands to sets/ptp/1681296898546_70/ and Spark reads the parquet files inside that directory.
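If the nesting depth can vary, a more general approach (a sketch, assuming Spark 3.0+ and that the v3io connector is already configured in your session) is to let Spark walk the whole tree with the recursiveFileLookup option instead of spelling out each level with wildcards:

from pyspark.sql import SparkSession

# Assumes a session is available; in a notebook, reuse the existing `spark` object.
spark = SparkSession.builder.appName("read-featurestore-parquet").getOrCreate()

# recursiveFileLookup makes Spark scan every subdirectory for data files,
# so the exact sets/ptp/<id>/ layout does not need to be known in advance.
new_DF = (spark.read
          .option("recursiveFileLookup", "true")
          .parquet("v3io://projects/risk/FeatureStore/ptp/parquet/"))
new_DF.show()

Note that recursiveFileLookup disables partition discovery, so partition-style directory names (e.g. year=2023) would not be turned into columns; the wildcard form above is preferable when you rely on that.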
For more info about reading multiple files with spark.read,
check out this question:
Regex for date between start- and end-date