amazon-s3 apache-crunch

How to write Apache Crunch output to an Amazon S3 bucket


Is there a way to write Apache Crunch output to an S3 bucket? The Crunch pipeline has a write method that takes a Target as its parameter. Is there a way to pass S3 as the Target to that write method?


Solution

  • Couldn't you just call the write method on your PCollection and pass it a Target that points at your S3 location?

    PCollection<String> items = ...;
    items.write(To.avroFile("s3://bucket/prefix"));
    pipeline.done();
    

    This is essentially how we do it, although we are running within EMR. To migrate data from our on-prem cluster, we use the Hadoop distcp command.
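For completeness, here is a minimal sketch of a full pipeline that reads text and writes it to S3. It is an illustration rather than the answerer's actual job. It assumes the Crunch and Hadoop libraries are on the classpath, that the cluster can resolve the chosen URI scheme (on EMR, `s3://` is handled by EMRFS; elsewhere you would typically need the `hadoop-aws` module and the `s3a://` scheme), and that credentials are already configured. The class name `CrunchToS3`, the helper `outputUri`, and the argument layout are hypothetical.

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;

public class CrunchToS3 {

  // Hypothetical helper: builds a URI such as "s3a://my-bucket/output".
  static String outputUri(String scheme, String bucket, String prefix) {
    return scheme + "://" + bucket + "/" + prefix;
  }

  public static void main(String[] args) {
    // Assumed arguments: args[0] = input path, args[1] = bucket, args[2] = key prefix.
    Configuration conf = new Configuration();
    Pipeline pipeline = new MRPipeline(CrunchToS3.class, conf);

    // Read the input as lines of text.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Any Target built from a path accepts an S3 URI, as long as Hadoop
    // has a FileSystem implementation registered for the scheme.
    lines.write(To.textFile(outputUri("s3a", args[1], args[2])));

    // Run the pipeline and release its resources.
    pipeline.done();
  }
}
```

The point is that Crunch does not need a dedicated S3 Target: the path is handed to Hadoop's FileSystem layer, so whether the write succeeds depends on the cluster's S3 configuration rather than on Crunch itself.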