java apache-spark amazon-s3 common-crawl

How to read multiple gzipped files from S3 into a single RDD with an HTTP request?


I have to download many gzipped files stored on S3 like this:

crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz
crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz

To download them, you must add the prefix https://commoncrawl.s3.amazonaws.com/

I have to download and decompress the files, then assemble the content into a single RDD.

Something similar to this:

JavaRDD<String> text = 
    sc.textFile("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz");

I want to do the equivalent of this code with Spark:

    for (String key : keys) {
        object = s3.getObject(new GetObjectRequest(bucketName, key));

        gzipStream = new GZIPInputStream(object.getObjectContent());
        decoder = new InputStreamReader(gzipStream);
        buffered = new BufferedReader(decoder);

        sitemaps = new ArrayList<>();

        String line = buffered.readLine();

        while (line != null) {
            if (line.matches("Sitemap:.*")) {
                sitemaps.add(line);
            }
            line = buffered.readLine();
        }
    }

Solution

  • To read something from S3, you can do this (on Hadoop 3 and later, where s3n was removed, use the s3a:// scheme instead):

    sc.textFile("s3n://path/to/dir")
    

    If dir contains your gzip files, they will be gunzipped and combined into one RDD. If your files are not directly at the root of the directory, like this:

    /root
      /a
        f1.gz
        f2.gz
      /b
        f3.gz
    

    or even this:

    /root
      f3.gz
      /a
        f1.gz
        f2.gz
    

    then you should use a wildcard, like this: sc.textFile("s3n://path/to/dir/*"), and Spark will find the files in dir and in its immediate subdirectories.

    Beware, though: the wildcard will work, but you may run into latency issues on S3 in production, and you may prefer to use the AmazonS3Client to retrieve the paths yourself.
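As a rough sketch of that last suggestion: once the object keys are known (for example from a listing made with the AmazonS3Client, which is not shown here), they can be joined into a single argument, on the assumption that textFile accepts a comma-separated list of paths. The class and method names (S3Paths, toPathList) are made up for illustration, and the bucket name commoncrawl is inferred from the URL prefix in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark or the AWS SDK.
final class S3Paths {

    // Turns object keys into one comma-separated string of URIs,
    // e.g. "s3n://bucket/key1,s3n://bucket/key2".
    static String toPathList(String scheme, String bucket, List<String> keys) {
        return keys.stream()
                .map(key -> scheme + "://" + bucket + "/" + key)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // The two keys from the question; in practice this list would come
        // from wherever the keys are retrieved.
        List<String> keys = Arrays.asList(
                "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00000.warc.gz",
                "crawl-data/CC-MAIN-2018-43/segments/1539583508988.18/robotstxt/CC-MAIN-20181015080248-20181015101748-00001.warc.gz");

        // "s3n" matches the scheme used above; swap in "s3a" if that is
        // what your Hadoop version supports.
        String paths = toPathList("s3n", "commoncrawl", keys);
        System.out.println(paths);
    }
}
```

The resulting string could then be passed as sc.textFile(paths), followed by something like .filter(line -> line.startsWith("Sitemap:")) to keep only the sitemap lines, assuming sc is the JavaSparkContext from the question.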