I have multiple ORC files in HDFS with the following directory structure:
orc/
├─ data1/
│ ├─ 00.orc
│ ├─ 11.orc
├─ data2/
│ ├─ 22.orc
│ ├─ 33.orc
I am reading these files using Spark:
spark.sqlContext.read.format("orc").load("/orc/data*/")
The problem is that one of the files is corrupted, so I want to skip/ignore it.
The only way I see is to list all the ORC files and validate each one (by reading it) before passing them to Spark, roughly as sketched below. But that way I end up reading the same files twice.
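For illustration, this pre-validation workaround might look roughly like the following (just a sketch; the glob pattern, the use of Hadoop's FileSystem API, and validating by forcing a count() are my assumptions, not a recommendation):

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List every ORC file under /orc/data*/.
val allFiles = fs.globStatus(new Path("/orc/data*/*.orc")).map(_.getPath.toString)

// Validation pass: try a full read of each file and keep only the ones that succeed.
val goodFiles = allFiles.filter(p => Try(spark.read.format("orc").load(p).count()).isSuccess)

// Second pass: every surviving file is read again here.
val df = spark.read.format("orc").load(goodFiles: _*)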
Is there any way I can avoid reading the files twice? Does Spark provide anything regarding this?
This should help: Spark can be told to skip files it cannot read instead of failing the whole job:
spark.sql("set spark.sql.files.ignoreCorruptFiles=true")