I have a Scalding job which runs on EMR. It runs on an S3 bucket containing several files. The source looks like this:
MultipleTextLineFiles("s3://path/to/input/").read
/* ... some data processing ... */
.write(Tsv("s3://paths/to/output/"))
I want to make it run on a nested bucket, i.e. a bucket containing buckets which themselves contain files. It should process all the files in the inner buckets. If I try to do this without altering the source, I get this error:
java.io.IOException: Not a file: s3://path/to/innerbucket
How can I alter this job to make it run on a nested bucket?
Add a wildcard to the end of the input path:
s3://path/to/input/*
If you have multiple levels of nesting, use multiple wildcards to get to the files:
s3://path/to/input/*/*/*
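If the number of wildcard segments varies between runs, the path can be built rather than hard-coded. The following is only a minimal sketch: `InputGlob` and `globFor` are hypothetical names that are not part of Scalding, and `depth` is assumed to be the number of `*` segments you want appended to the base path.

```scala
object InputGlob {
  // Hypothetical helper, not part of Scalding: appends `depth` wildcard
  // segments to the base path, dropping one trailing slash first so the
  // result never contains a doubled separator.
  def globFor(base: String, depth: Int): String = {
    require(depth >= 1, "depth must be at least 1")
    base.stripSuffix("/") + "/*" * depth
  }

  def main(args: Array[String]): Unit = {
    println(globFor("s3://path/to/input/", 1)) // s3://path/to/input/*
    println(globFor("s3://path/to/input/", 3)) // s3://path/to/input/*/*/*
  }
}
```

The resulting string can then be passed to MultipleTextLineFiles in place of the literal path used in the question.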