I'm interested in downloading the raw text of a tiny subset (tens of megabytes at most) of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'd be sifting through hundreds of gigabytes of data when all I need is a few dozen megabytes.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.
Thanks!
It's quite easy: just pick a single WARC (or WAT/WET) file at random from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
You're done, because every WARC/WAT/WET file is a random sample in its own right. Need more data? Just pick more files at random.
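For example, here's a minimal Python sketch (standard library only) of the approach: fetch the list of WET file paths for one crawl, pick one at random, and download it over plain HTTPS with no S3 account. The crawl ID CC-MAIN-2024-10 is just an assumed example, and data.commoncrawl.org is the public HTTPS endpoint as I understand it; substitute whichever crawl you pick from the blog.

```python
import gzip
import random
import urllib.request

# Assumed example crawl ID; use any crawl announced on the Common Crawl blog.
CRAWL = "CC-MAIN-2024-10"
BASE = "https://data.commoncrawl.org/"

# 1. Fetch the listing of WET file paths for that crawl (a small gzipped text file).
paths_url = f"{BASE}crawl-data/{CRAWL}/wet.paths.gz"
with urllib.request.urlopen(paths_url) as resp:
    paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

# 2. Pick one WET file at random -- each file is already a random sample of the crawl.
wet_path = random.choice(paths)
print("Downloading", BASE + wet_path)

# 3. Download it over plain HTTPS (no S3 account or Java needed).
urllib.request.urlretrieve(BASE + wet_path, "sample.warc.wet.gz")
```

Each WET file is on the order of 100 MB gzipped, and since the records are (as far as I know) compressed as separate gzip members, you can usually send an HTTP Range request for just the first few dozen megabytes and still decode complete records if a whole file is more than you want.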