hadoop, amazon-web-services, elastic-map-reduce, cascading, scalding

How can I make my Scalding job operate recursively on its input bucket?


I have a Scalding job which runs on EMR. It runs on an S3 bucket containing several files. The source looks like this:

MultipleTextLineFiles("s3://path/to/input/").read
  /* ... some data processing ... */
  .write(Tsv("s3://paths/to/output/"))

I want to make it run on a nested bucket, i.e. one whose top-level entries are themselves directories that contain the files. It should process all the files in those inner directories. If I run it as-is on such a bucket, I get this error:

java.io.IOException: Not a file: s3://path/to/innerbucket

How can I alter this job to make it run on a nested bucket?


Solution

  • Use a wildcard:

    s3://path/to/input/*
    

    Each `*` matches one directory level, so if you have multiple levels of nesting, chain one wildcard per level to reach the files:

    s3://path/to/input/*/*/*
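
    Applied to the job in the question, the only change is the glob in the source tap. This is a sketch: the processing step and the output path are placeholders carried over from the question, and the glob assumes the files sit exactly two levels below the input prefix:

    ```scala
    // Each "*" matches exactly one directory level, so this glob
    // reads files two levels below s3://path/to/input/.
    // Hadoop expands the pattern before listing, so only matching
    // paths are opened as input splits.
    MultipleTextLineFiles("s3://path/to/input/*/*").read
      /* ... some data processing ... */
      .write(Tsv("s3://paths/to/output/"))
    ```

    Note that a glob like `*/*` matches only that exact depth; if files live at several different depths, you would need a separate glob (or source) per depth.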