amazon-web-services hadoop amazon-emr mapr

Getting Amazon EMR to use S3 for input and output


How can I get Amazon EMR (Hadoop 0.20.205, MapR) to use S3 buckets for input and output?

I tried adding the following to the core configuration XML file (through bootstrap actions):

<property>
        <name>fs.default.name</name>
        <value>s3n://</value>
</property>

<property>
        <name>dfs.name.default</name>
        <value>s3n://</value>
</property>

But I always get something like:

Caused by: java.io.IOException: Could not resolve path: s3n://some_out_bucket/out
        at com.mapr.fs.MapRFileSystem.lookupClient(MapRFileSystem.java:219)
        at com.mapr.fs.MapRFileSystem.delete(MapRFileSystem.java:385)
        at cc.mrlda.ParseCorpus.run(ParseCorpus.java:192)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at cc.mrlda.ParseCorpus.main(ParseCorpus.java:675)
        ... 10 more

Hadoop newbie here. Please help!


Solution

  • In addition to the configuration changes described in the question, I changed how the code obtains its FileSystem:

    FileSystem fs = FileSystem.get(URI.create(outputPath), new JobConf(SomeClass.class));

    Here outputPath points to a resource on S3, e.g. s3n://some_bucket. Passing a URI makes Hadoop choose the filesystem from the path's scheme instead of returning the default one (MapRFileSystem in the stack trace above), so I can now access files directly on S3.
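Why passing a URI helps can be pictured with a small stand-alone sketch. This is not Hadoop's code and does not need Hadoop on the classpath: the class SchemeLookupSketch, the resolve method, the maprfs:/// default, and the scheme-to-class table are all assumptions made for illustration (on a real cluster the mapping comes from the Hadoop configuration, and EMR may register a different class for s3n). It only shows the idea that a path with a scheme is routed by that scheme, while a path without one falls back to the default filesystem.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative model of scheme-based filesystem lookup; not a Hadoop API.
final class SchemeLookupSketch {

    // Hypothetical table: URI scheme -> implementation class name.
    private static final Map<String, String> IMPL_BY_SCHEME = new HashMap<>();
    static {
        IMPL_BY_SCHEME.put("maprfs", "com.mapr.fs.MapRFileSystem");
        IMPL_BY_SCHEME.put("s3n", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
    }

    // Picks an implementation from the target's scheme; a target without
    // a scheme (e.g. "/user/out") uses the default filesystem's scheme.
    static String resolve(URI target, URI defaultFs) {
        String scheme = target.getScheme();
        if (scheme == null) {
            scheme = defaultFs.getScheme();
        }
        String impl = IMPL_BY_SCHEME.get(scheme);
        if (impl == null) {
            throw new IllegalArgumentException(
                    "No filesystem registered for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("maprfs:///"); // assumed cluster default

        // No scheme: resolved against the default filesystem.
        System.out.println(resolve(URI.create("/user/out"), defaultFs));

        // Explicit s3n scheme: routed to the S3 entry instead.
        System.out.println(resolve(URI.create("s3n://some_out_bucket/out"), defaultFs));
    }
}
```

In the original job, the failing call was a delete on the output path issued through the default filesystem object, which is why MapRFileSystem was asked to resolve an s3n:// path it does not own.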