web-crawler common-crawl

Common Crawl requirements to power a decent search engine


Common Crawl releases a massive data payload every month, each weighing in at hundreds of terabytes. This has been going on for the last 8-9 years.

Are these snapshots independent (probably not)? Or do we have to combine all of them to power a decent search engine whose results draw from a wide spectrum of webpages? The total size of all payloads in Common Crawl's release history (they have not specified sizes for most of the 2016 payloads) is around 20 PB; adding an approximation for 2016 brings it to around 22 PB. How much of that is likely duplicate data? And if we stripped all HTML tags and other non-content markup from the pages, how large would the remaining data (just the raw text content) be?
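
To put a rough number on that markup overhead yourself, you can sample one WARC file from any snapshot and compare raw HTML bytes against the extracted plain text. Below is a minimal sketch, assuming the `warcio` and `beautifulsoup4` packages and a WARC file you have already downloaded (the filename is a placeholder):

```python
# Minimal sketch: estimate how much crawl data shrinks once HTML markup is stripped.
# Assumes `pip install warcio beautifulsoup4` and one WARC file already downloaded
# from any Common Crawl snapshot; the filename below is only a placeholder.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup


def html_vs_text_bytes(warc_path, max_records=1000):
    """Return (raw HTML bytes, extracted plain-text bytes) over a sample of records."""
    raw_bytes, text_bytes, seen = 0, 0, 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":  # skip request/metadata records
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "html" not in ctype:            # only compare HTML pages
                continue
            payload = record.content_stream().read()
            raw_bytes += len(payload)
            text = BeautifulSoup(payload, "html.parser").get_text(" ", strip=True)
            text_bytes += len(text.encode("utf-8"))
            seen += 1
            if seen >= max_records:            # a sample is enough for a ratio
                break
    return raw_bytes, text_bytes


if __name__ == "__main__":
    raw, text = html_vs_text_bytes("CC-MAIN-sample.warc.gz")  # placeholder path
    if raw:
        print(f"raw HTML: {raw:,} B, plain text: {text:,} B ({text / raw:.1%} of raw)")
```

Extracted text is typically a small fraction of the raw HTML, but the exact ratio depends heavily on the pages sampled, so treat the output as an estimate only.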

If a webpage from the New York Times was present in the March 2015 payload, what are the odds that it has since appeared in multiple payloads (I have read the Jaccard similarity reports, but they don't paint a very clear picture), and that a massive number of such pages are duplicated across all payloads, needing a fair amount of pruning?
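
If you do end up merging several monthly payloads, one straightforward way to prune such repeats is to key every capture on its content digest and keep only the first occurrence. The sketch below assumes JSON-lines index records with `url` and `digest` fields (roughly the shape the CDX index API returns with `output=json`); the file names are placeholders:

```python
# Minimal sketch: prune duplicate captures across snapshots by content digest.
# Assumes JSON-lines index records with "url" and "digest" fields (roughly the
# shape the CDX index API returns with output=json); file names are placeholders.
import json


def dedup_by_digest(index_files):
    """Yield only the first capture seen for each payload digest across all files."""
    seen_digests = set()
    for path in index_files:
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                digest = record["digest"]
                if digest in seen_digests:  # identical payload already kept
                    continue
                seen_digests.add(digest)
                yield record


if __name__ == "__main__":
    # At real crawl scale the in-memory set would not fit; you would shard by
    # digest prefix or switch to a Bloom filter / external sort instead.
    files = ["cdx-2015-14.jsonl", "cdx-2015-18.jsonl"]  # placeholder snapshot indexes
    kept = sum(1 for _ in dedup_by_digest(files))
    print(f"unique payloads kept: {kept:,}")
```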


Solution

  • The information below is licensed under the Hippocratic License HL3-ECO-MIL-SV. Please use the information responsibly and avoid harming others and the environment (including engaging in military, mass-surveillance, or ecocide activities) using the tools you build.

    Paraphrased from https://news.ycombinator.com/item?id=26598044 :

    tl;dr: The crawl is an ongoing process, and each snapshot dataset is somewhat independent (up to 20% URL overlap and up to 2% content-digest overlap with any other dataset), but downloading just one snapshot will not give you a comprehensive index. If you combine multiple Common Crawl snapshots, the vast majority of the data added by each snapshot will be new.

    If you compare the size of each crawl against how many new URLs are added per crawl here: https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize then you will see that the vast majority of content digests in each crawl are new.

    [Plots: size of each crawl; new items added per crawl]

    You can trade off space against comprehensiveness by combining only the last N crawls, but then you lose historical/stale content (which may be beneficial or harmful depending on your goal). [Plot: URLs cumulative over the last N crawls]

    You can learn more about the overlap between crawls on this page: both how many URLs each crawl has in common with other crawls and how similar the crawled content digests are between pairs of snapshots (typically < 2% overlap, Jaccard similarity less than 0.02): https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap
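
    If you want to reproduce a number like that yourself, the Jaccard similarity between two snapshots is just |A ∩ B| / |A ∪ B| over their sets of URLs or content digests. A minimal sketch, again assuming JSON-lines index records with `url` and `digest` fields and placeholder file names:

    ```python
    # Minimal sketch: Jaccard similarity of two snapshots over URLs or content digests.
    import json


    def load_field(path, field):
        """Collect the set of values of `field` ("url" or "digest") from a JSON-lines index."""
        values = set()
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                values.add(json.loads(line)[field])
        return values


    def jaccard(a, b):
        """|A intersect B| / |A union B|; 0.0 when both sets are empty."""
        return len(a & b) / len(a | b) if (a or b) else 0.0


    if __name__ == "__main__":
        crawl_a = load_field("cdx-2015-14.jsonl", "digest")  # placeholder index files
        crawl_b = load_field("cdx-2015-18.jsonl", "digest")
        print(f"content-digest Jaccard similarity: {jaccard(crawl_a, crawl_b):.4f}")
    ```

    Loading full index dumps into memory will not scale to a whole crawl; for real snapshots you would sample index shards or use an approximate set sketch such as MinHash.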