python, boto, common-crawl

Download the complete Common Crawl index file


The Common Crawl index file used in the project below

https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy

mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')

is a partial one.

I want the complete index file (April 2015 crawl data) for my own project, which uses the above project as a base.

Where can I download the entire index file?

Here Tom Morris states that

The index files which are used by the index service are also available for download.


Solution

  • Common Crawl index files are publicly available at s3://commoncrawl/cc-index/collections/

    You can list all available crawl indexes with the AWS command line: aws s3 ls s3://commoncrawl/cc-index/collections/

    Index files for April 2015 are at s3://commoncrawl/cc-index/collections/CC-MAIN-2015-18/indexes/

    If you want to download the index *.gz files over HTTP, use URLs of this form:

    https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-18/indexes/cdx-00000.gz

    For most crawls the cdx files run from cdx-00000.gz to cdx-00299.gz, so the complete index for a crawl is contained in 300 files.
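The HTTP download described above could be scripted roughly as follows. This is a minimal sketch, not a tested downloader: it assumes the URL layout shown in the answer and the count of 300 files, and the helper names `cdx_urls` and `download` are made up for illustration.

```python
import shutil
import urllib.request
from pathlib import Path

# URL prefix taken from the answer above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/cc-index/collections"


def cdx_urls(collection="CC-MAIN-2015-18", count=300):
    """Build the URLs cdx-00000.gz .. cdx-(count-1).gz for one crawl.

    count=300 is an assumption based on the answer; check the actual
    listing for the crawl you need.
    """
    return [
        f"{BASE_URL}/{collection}/indexes/cdx-{i:05d}.gz"
        for i in range(count)
    ]


def download(url, dest_dir="cc-index"):
    """Stream one index file into dest_dir, skipping files already present."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted transfer
    # is not mistaken for a finished file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)
    return dest
```

Calling `download(url)` for each entry of `cdx_urls()` would fetch the whole April 2015 index. The files are large, so it may be worth trying a single file first.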