apache-spark pyspark emr amazon-emr apache-spark-sql

Pyspark - Load file: Path does not exist


I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error:

pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv

Then I found out that I have to prefix the path with file:// so that Spark reads the file from the local filesystem:

df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)

But this time, the above approach raised a different error:

Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1): java.io.FileNotFoundException: File file:/home/hadoop/observations_temp.csv does not exist

I think this is because the file:// scheme only reads the file on the local node and does not distribute it to the other nodes.

Do you know how I can read the csv file and make it available to all the other nodes?


Solution

  • You are right: the file is missing from your worker nodes, and that is what raises the error you got.

    The official documentation covers this case under External Datasets:

    If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
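To see why the first attempt looked in HDFS at all, it can help to sketch how a path without a scheme gets qualified. The helper below is purely illustrative: the name `qualify_path` is made up here, and it only mimics the behaviour with the standard library. Spark and Hadoop do this internally, using the cluster's `fs.defaultFS` setting, which on your cluster appears to be the `hdfs://...:8020` address from the first error message.

```python
from urllib.parse import urlparse


def qualify_path(path, default_fs):
    """Illustrative only: mimic how a path without a scheme is resolved
    against the cluster's default filesystem (fs.defaultFS)."""
    if urlparse(path).scheme:
        # An explicit scheme (file://, hdfs://, s3://) is kept as-is.
        return path
    # No scheme: the path is interpreted on the default filesystem.
    return default_fs.rstrip("/") + "/" + path.lstrip("/")


# Default filesystem taken from the error message in the question.
default_fs = "hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020"

print(qualify_path("/home/hadoop/observations_temp.csv", default_fs))
print(qualify_path("file:///home/hadoop/observations_temp.csv", default_fs))
```

The first call yields the HDFS path from the "Path does not exist" error. The second keeps the file:// path, which every executor then tries to open on its own local disk.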

    So basically you have two options:

    Copy the file to the same path on every worker node before starting the job;

    Or upload it to HDFS (the recommended solution), with something like:

    hadoop fs -put localfile /user/hadoop/hadoopfile.csv
    

    Now you can read it with:

    df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
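Both steps can also be driven from the Python script itself. This is a minimal sketch, assuming the `hadoop` CLI is on the PATH of the node where the driver runs and that the `/user/hadoop` directory exists in HDFS. The helper names are made up for illustration.

```python
import subprocess


def build_put_command(local_path, hdfs_path, overwrite=True):
    """Build the `hadoop fs -put` command as an argument list."""
    cmd = ["hadoop", "fs", "-put"]
    if overwrite:
        cmd.append("-f")  # replace the HDFS file if it already exists
    cmd += [local_path, hdfs_path]
    return cmd


def upload_and_read(spark, local_path, hdfs_path):
    """Copy a local file into HDFS, then read it as a DataFrame."""
    # check=True raises CalledProcessError if the upload fails.
    subprocess.run(build_put_command(local_path, hdfs_path), check=True)
    return spark.read.csv(hdfs_path, header=True)


# Usage, with `spark` being the SparkSession from the question:
# df = upload_and_read(spark,
#                      "/home/hadoop/observations_temp.csv",
#                      "/user/hadoop/observations_temp.csv")
```

Once the file is in HDFS, every executor can reach it, so the read no longer depends on which node a task lands on.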
    

    It seems that you are also using AWS S3, so you can also read the file directly from S3 without downloading it first (with the proper credentials, of course).
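Reading from S3 looks the same as reading from HDFS; only the URI changes. Here is a sketch with a hypothetical bucket and key (`my-bucket` and `data/observations_temp.csv` are placeholders, not names from the question). On EMR the `s3://` scheme is handled by EMRFS, while a plain Apache Spark installation would normally use `s3a://` instead.

```python
def s3_uri(bucket, key, scheme="s3"):
    """Build an S3 URI; bucket and key are supplied by the caller."""
    return "{}://{}/{}".format(scheme, bucket, key.lstrip("/"))


def read_csv_from_s3(spark, bucket, key):
    """Read a CSV object from S3 into a DataFrame.

    Assumes the cluster's role (or configured credentials) is allowed
    to read the object.
    """
    return spark.read.csv(s3_uri(bucket, key), header=True)


# Usage, with `spark` being the SparkSession from the question and
# hypothetical bucket/key names:
# df = read_csv_from_s3(spark, "my-bucket", "data/observations_temp.csv")
```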

    Some suggest that the --files option of spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, and in that case you don't really need Spark.

    Otherwise, I would stick with HDFS (or any other distributed file system).