Tags: dataset, information-retrieval, corpus, common-crawl

Download small sample of AWS Common Crawl to local machine via http


I'm interested in downloading the raw text of a tiny subset of the AWS Common Crawl (tens of megabytes at most) as a corpus for information retrieval tests.

The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'd be sifting through hundreds of gigabytes of data when all I need is a few dozen megabytes.

There's some code here, but it requires an S3 account and access (although I do like Python).

Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.

Thanks!


Solution

  • It's quite easy: just randomly choose a single WARC (or WAT or WET) file from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/

    1. take the latest crawl (e.g. April 2019)
    2. navigate to the WARC file list and download it (same for WAT or WET)
    3. unzip the file and randomly select one line (file path)
    4. prefix the path with https://commoncrawl.s3.amazonaws.com/ (or since spring 2022: https://data.commoncrawl.org/ - there is a description in the blog post) and download it

    You're done, because every WARC/WAT/WET file is a random sample on its own. Need more data? Just pick more files at random. A minimal Python sketch of these steps follows below.
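If it helps, here is a rough Python sketch of those four steps. It assumes you want WET (extracted plain text) files, that the per-crawl listing is named wet.paths.gz, and that the crawl ID in the snippet is only a placeholder for whichever crawl is current:

```python
import gzip
import random
import urllib.request

# NOTE: the crawl ID below is just an example; substitute the latest crawl
# announced on the Common Crawl blog. The listing file name (wet.paths.gz)
# and the data.commoncrawl.org prefix follow the steps described above.
CRAWL_ID = "CC-MAIN-2019-18"   # e.g. the April 2019 crawl
BASE_URL = "https://data.commoncrawl.org/"
LISTING_URL = f"{BASE_URL}crawl-data/{CRAWL_ID}/wet.paths.gz"

# Steps 2-3: download the WET path list, unzip it, pick one path at random.
with urllib.request.urlopen(LISTING_URL) as resp:
    wet_paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()
wet_path = random.choice(wet_paths)

# Step 4: prefix the path with the public HTTP endpoint and download it.
# A single WET file is a few hundred MB compressed, so stream it to disk.
file_url = BASE_URL + wet_path
out_name = wet_path.rsplit("/", 1)[-1]
with urllib.request.urlopen(file_url) as resp, open(out_name, "wb") as out:
    while True:
        chunk = resp.read(1 << 20)
        if not chunk:
            break
        out.write(chunk)

print("downloaded", out_name)
```

The downloaded file is gzip-compressed plain-text records; a library such as warcio can iterate over them, and you can stop after however many records you need to stay within a few dozen megabytes.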