python, python-2.7, apache-spark

No FileSystem for scheme: s3 with pyspark


I'm trying to read a txt file from S3 with Spark, but I'm getting this error:

No FileSystem for scheme: s3

This is my code:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)
data = sc.textFile("s3://"+AWS_ACCESS_KEY+":" + AWS_SECRET_KEY + "@/aaa/aaa/aaa.txt")

header = data.first()

This is the full traceback:

An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

How can I fix this?


Solution

  • If you are using a local machine, you can use boto3:

    import boto3

    s3 = boto3.resource('s3')
    # get a handle on the bucket that holds your file
    bucket = s3.Bucket('yourBucket')
    # get a handle on the object you want (i.e. your file)
    obj = bucket.Object(key='yourFile.extension')
    # get the object
    response = obj.get()
    # read the contents of the file and split it into a list of lines
    lines = response[u'Body'].read().split('\n')
    

    (do not forget to set up your AWS credentials).
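    If you prefer to pass credentials explicitly rather than relying on environment variables or ~/.aws/credentials, boto3 also accepts them as keyword arguments; a minimal sketch, where AWS_ACCESS_KEY and AWS_SECRET_KEY are hypothetical variables holding your keys:

    import boto3

    # explicit credentials; AWS_ACCESS_KEY / AWS_SECRET_KEY are placeholders
    s3 = boto3.resource(
        's3',
        aws_access_key_id=AWS_ACCESS_KEY,
        aws_secret_access_key=AWS_SECRET_KEY,
    )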

    Another clean solution, if you are using an AWS virtual machine (EC2), is to grant S3 permissions to your EC2 instance and launch pyspark with this command:

    pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2
    

    If you add other packages, make sure the format is: 'groupId:artifactId:version' and the packages are separated by commas.
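    Once the shell starts with those packages on the classpath, read your file with the s3a:// scheme, which is the filesystem hadoop-aws 2.7.x actually registers, rather than s3://. If you are not relying on an IAM role, you can supply credentials through the Hadoop configuration; a minimal sketch (note that sc._jsc is technically a private attribute, though this is a common pyspark idiom, and AWS_ACCESS_KEY / AWS_SECRET_KEY are placeholders for your keys):

    # run inside the pyspark shell launched above, where sc already exists
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY)
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY)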

    If you are using pyspark from a Jupyter notebook, this will work:

    import os
    import pyspark
    # PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    filePath = "s3a://yourBucket/yourFile.parquet"
    df = sqlContext.read.parquet(filePath)  # Parquet file read example
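    The same context can then read the plain text file from the question; for example (the bucket name is a placeholder, since the original path does not show one):

    data = sc.textFile("s3a://yourBucket/aaa/aaa/aaa.txt")
    header = data.first()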