python, amazon-web-services, web-crawler, amazon-athena, common-crawl

Querying HTML Content in Common Crawl Dataset Using Amazon Athena


I am currently exploring the massive Common Crawl dataset hosted on Amazon S3 and am attempting to use Amazon Athena to query this dataset. My objective is to search within the HTML content of the web pages to identify those that contain specific strings within their tags. Essentially, I want to find the websites whose HTML content matches particular criteria.

I am aware that Athena is capable of querying large datasets on S3 using standard SQL. However, I am not entirely sure about the feasibility and the approach to directly query inside the HTML content of the web pages in the Common Crawl dataset.

Here's a simplified version of what I am looking to achieve:

sql

SELECT * 
FROM "common_crawl_dataset" 
WHERE html_content LIKE '%specific-string%';

Is it possible to directly query the HTML content of the web pages in the Common Crawl dataset using Athena? If yes, what would be the best approach to accomplish this, considering efficiency and cost-effectiveness? Are there any limitations or challenges that I should be aware of?


Solution

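  • This is not easily possible, because the HTML content is not part of the schema of the index that you would be querying with Athena. The columnar index describes each web capture (its URL and where its record is stored, for example) but does not contain the page body.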

    Please see the Common Crawl Columnar Index blog post for further details.

    The most common use of this index is to select a small subset of the crawl, for example "all webpages with a Swiss domain name (*.ch) classified as being in the Romansh ('roh') language". Accessing the HTML for these selected web captures is a second step.

    There is a large list of examples of columnar index usage (in many programming languages) in the cc-index-table GitHub repo.
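
To make the two steps concrete, here is a minimal Python sketch. It is not taken from the blog post or the cc-index-table repo. It makes several assumptions: the Athena table is registered as "ccindex"."ccindex" with the column names shown, the crawl ID is only a placeholder, the exact-match filter on content_languages is one possible way to select Romansh pages, and the helper names (byte_range_header, fetch_record, html_from_warc_record, record_contains) are made up for illustration. Step 1 is the Athena query that narrows the crawl down to a few captures and returns where each one is stored. Step 2 fetches only that record with an HTTP range request and searches its HTML locally.

```python
import gzip
import urllib.request

# Step 1 (run in Athena): select a small subset and get each record's location.
# Assumption: table "ccindex"."ccindex" and these column names; the crawl ID
# below is a placeholder, replace it with a crawl you actually want to search.
INDEX_QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2023-50'
  AND subset = 'warc'
  AND url_host_tld = 'ch'
  AND content_languages = 'roh'
"""

# Assumption: public HTTPS endpoint that serves the WARC files.
DATA_PREFIX = "https://data.commoncrawl.org/"


def byte_range_header(offset: int, length: int) -> str:
    """Build the HTTP Range value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Step 2a: download a single gzipped WARC record (performs network I/O)."""
    request = urllib.request.Request(
        DATA_PREFIX + warc_filename,
        headers={"Range": byte_range_header(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def html_from_warc_record(raw_gzip: bytes) -> str:
    """Step 2b: decompress one record and return the HTTP payload as text.

    Simplification: splits on the first two blank lines (WARC headers, then
    HTTP headers) and decodes as UTF-8 with replacement characters.
    """
    record = gzip.decompress(raw_gzip)
    _warc_headers, _, rest = record.partition(b"\r\n\r\n")
    _http_headers, _, body = rest.partition(b"\r\n\r\n")
    return body.decode("utf-8", errors="replace")


def record_contains(raw_gzip: bytes, needle: str) -> bool:
    """Step 2c: the string search that the index itself cannot do."""
    return needle in html_from_warc_record(raw_gzip)
```

For each row returned by the Athena query, you would call fetch_record(warc_filename, warc_record_offset, warc_record_length) and pass the result to record_contains. The cost trade-off is that Athena only scans index metadata, while the HTML is downloaded for the selected captures only, so the tighter the index filter, the cheaper the second step. For anything beyond a quick experiment, a dedicated WARC parser such as the warcio library is probably a safer choice than the hand-rolled header splitting used here.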