node.js  axios  node-fetch  common-crawl

Common crawl request with node-fetch, axios or got


I am trying to port my C# Common Crawl code to Node.js, and I get an error with every HTTP library I have tried (node-fetch, axios, or got) when fetching a single page's HTML from the Common Crawl S3 archive.

  const offset = 994879995;
  const length = 27549;
  const offsetEnd = offset + length + 1;
  const url = `https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-43/segments/1539583511703.70/warc/CC-MAIN-20181018042951-20181018064451-00103.warc.gz`;
  const response = await fetch(
    url, //'https://httpbin.org/get',
    {
      method: "GET",
      timeout: 10000,
      compress: true,
      headers: {
        Range: `bytes=${offset}-${offsetEnd}"`,
        'Accept-Encoding': 'gzip'
      },
    }
  );

  console.log(`status`, response.status);
  console.log(`headers`, response.headers);
  console.log(await response.text());

The status is 200, but none of the packages is able to read the gzip-compressed body.

My C# code, by contrast, works fine: it reads the body as a byte array and decompresses it.


Solution

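  • The code below fetches a single WARC record and extracts the HTML payload. It logs every status line and header (the HTTP response to the range request, the WARC record header, and the HTTP header stored inside the WARC record) as well as the HTML payload. Compared with the code in the question, the following points are changed:
      - the last byte of the range is offset + length - 1, because HTTP byte ranges are inclusive at both ends;
      - the stray double quote at the end of the Range header value is removed;
      - the Accept-Encoding header and the compress option are dropped, since the WARC record is already gzip-compressed;
      - the response body is passed to warcio's WARCParser, which reads the WARC record from it.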

    // node-fetch v2 is assumed here; v3 is ESM-only and cannot be loaded with require().
    const fetch = require("node-fetch");
    const warcio = require("warcio");
    
    class WarcRecordFetcher {
    
        async run() {
            const offset = 994879995;
            const length = 27549;
            const offsetEnd = offset + length - 1;
            const url = `https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-43/segments/1539583511703.70/warc/CC-MAIN-20181018042951-20181018064451-00103.warc.gz`;
    
            const response = await fetch(
                url, //'https://httpbin.org/get',
                {
                    method: "GET",
                    timeout: 10000,
                    headers: {
                        Range: `bytes=${offset}-${offsetEnd}`
                    },
                }
            );
    
            console.log(`status`, response.status);
            console.log(`headers`, response.headers);
    
            const warcParser = new warcio.WARCParser(response.body);
            const warcRecord = await warcParser.parse();
    
            console.log(warcRecord.warcHeaders.statusline);
            console.log(warcRecord.warcHeaders.headers);
    
            console.log(warcRecord.httpHeaders.statusline);
            console.log(warcRecord.httpHeaders.headers);
    
            const warcPayload = await warcRecord.contentText();
            console.log(warcPayload)
        }
    
    }
    
    new WarcRecordFetcher().run();
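
The range arithmetic and the decompression step can also be checked offline, without warcio or a network request. The following is only a minimal sketch under some assumptions: byteRangeHeader and gunzipRecord are hypothetical helper names that do not come from any library, and the two small gzip buffers stand in for a real .warc.gz file, in which each record is stored as its own gzip member. That is why a slice cut at the right offset and length can be decompressed independently of the rest of the file.

```javascript
const zlib = require("zlib");

// Hypothetical helper: build a Range header value from a record's offset and
// length. HTTP byte ranges are inclusive at both ends, hence the "- 1".
function byteRangeHeader(offset, length) {
    const offsetEnd = offset + length - 1;
    return `bytes=${offset}-${offsetEnd}`;
}

// Hypothetical helper: decompress one gzip member cut out of a larger buffer.
// subarray() takes an exclusive end index, so no "- 1" is needed here.
function gunzipRecord(archive, offset, length) {
    const member = archive.subarray(offset, offset + length);
    return zlib.gunzipSync(member).toString("utf8");
}

// Stand-in for a .warc.gz file: two independently compressed gzip members,
// concatenated back to back.
const first = zlib.gzipSync(Buffer.from("record one"));
const second = zlib.gzipSync(Buffer.from("record two"));
const archive = Buffer.concat([first, second]);

// Offset and length taken from the question.
const rangeHeader = byteRangeHeader(994879995, 27549);
console.log(rangeHeader); // bytes=994879995-994907543

// The second member starts right after the first one.
const secondText = gunzipRecord(archive, first.length, second.length);
console.log(secondText); // record two
```

With a real response, the same gunzipSync call could be applied to the bytes returned for the range request, which would yield the raw WARC record (WARC header, HTTP header, and payload) as one block of text. Splitting those parts is what warcio does in the answer above.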