The Common Crawl index file used in the project below
https://github.com/trivio/common_crawl_index/blob/master/bin/remote_copy
mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')
is a partial one.
I want the complete index file (April 2015 crawl data) for my own project, which is based on the project above.
Where can I download the entire index file?
Here, Tom Morris states that:
"The index files which are used by the index service are also available for download."
The Common Crawl index files are publicly available at s3://commoncrawl/cc-index/collections/
You can list all the available crawl indexes with the AWS command line: aws s3 ls s3://commoncrawl/cc-index/collections/
Index files for April 2015 are at s3://commoncrawl/cc-index/collections/CC-MAIN-2015-18/indexes/
If you want to download the index *.gz files over HTTP, you can use a URL such as:
https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-18/indexes/cdx-00000.gz
The cdx files generally run from cdx-00000.gz to cdx-00299.gz, so the complete index is contained in 300 files.
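As a rough sketch, the HTTP URLs for all 300 files can be generated in Python and fetched one by one. This assumes the URL pattern and the 00000 to 00299 numbering described above hold for the collection you want. The names cdx_urls and download_all are made up for illustration and are not part of the common_crawl_index project.

```python
import os
from urllib.request import urlretrieve

# Base of the HTTP URL shown above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/cc-index/collections"


def cdx_urls(collection="CC-MAIN-2015-18", count=300):
    """Build the URLs cdx-00000.gz .. cdx-<count-1>.gz for one collection.

    count=300 is an assumption taken from the numbering described above;
    other collections may use a different number of files.
    """
    return [
        f"{BASE_URL}/{collection}/indexes/cdx-{i:05d}.gz"
        for i in range(count)
    ]


def download_all(dest_dir=".", collection="CC-MAIN-2015-18", count=300):
    """Download every cdx file into dest_dir (hypothetical helper)."""
    for url in cdx_urls(collection, count):
        filename = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urlretrieve(url, filename)
```

Calling cdx_urls() only builds the list of URLs; download_all() performs the actual transfer, which may take a long time since the index files are large.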