common-crawl warc heritrix

Which block represents a WARC-Block-Digest?


Line 09 of the record below reads: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ

Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
Line 16: Connection: Keep-Alive

The WARC spec says that the WARC-Block-Digest is "an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record".

I've been trying to figure out what "full block of the record" refers to. Is it lines 11 to 16? Lines 12 to 16? Or lines 1 to 16 (without line 9)? I've tried hashing each of those possibilities but can't reproduce the SHA-1 (base32) value above.


Solution

  • A WARC record of an HTTP GET request has three parts (cf. the WARC spec):

    1. the WARC header
    2. the HTTP request header
    3. the payload, which is empty (note: a POST request would include a non-empty payload)

    The payload digest of the record is therefore the base32-encoded SHA-1 of the empty string. This can be checked with Linux command-line tools:

    $> echo -n "" | openssl dgst -binary -sha1 | base32
    3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
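
The same check can be sketched in Python using only the standard library. The helper name `warc_sha1_digest` is an assumption made for this illustration, not part of any WARC library; it produces the `sha1:<base32>` form seen in the WARC header.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return the digest of `data` in the 'sha1:<base32>' form used in WARC headers."""
    raw = hashlib.sha1(data).digest()  # 20 raw bytes
    # 20 bytes encode to exactly 32 base32 characters, so there is no '=' padding.
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Hypothetical helper applied to the empty payload of a GET request.
empty_payload_digest = warc_sha1_digest(b"")
print(empty_payload_digest)  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

The printed value is the same one the `openssl ... | base32` pipeline above returns for an empty input.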
    

    A WARC record has the form:

    warc-record  = header CRLF
                   block CRLF CRLF
    

    (see WARC spec: record model)
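
To make the record model concrete, here is a minimal Python sketch that splits one record into its header and block. The function name `split_warc_record` and the toy record `demo` are assumptions for illustration only; the toy record omits fields a real WARC record needs, and the sketch assumes a single, uncompressed record held fully in memory.

```python
from typing import Tuple


def split_warc_record(record: bytes) -> Tuple[bytes, bytes]:
    """Split one record into (header, block), following: header CRLF block CRLF CRLF."""
    # The header runs up to the first empty line and keeps its own final CRLF.
    header_end = record.index(b"\r\n\r\n") + 2
    header = record[:header_end]

    # Content-Length gives the size of the block in bytes.
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("no Content-Length field in the WARC header")

    block_start = header_end + 2  # skip the CRLF that separates header and block
    block = record[block_start:block_start + length]
    if record[block_start + length:block_start + length + 4] != b"\r\n\r\n":
        raise ValueError("record is not terminated by CRLF CRLF")
    return header, block


# Toy record (hypothetical, incomplete field set) with a 5-byte block.
demo = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
header, block = split_warc_record(demo)
print(block)  # b'hello'
```

The point of the sketch is that neither the CRLF after the header nor the two CRLFs after the block belong to the block itself.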

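    The "full" block is everything between the blank line that ends the WARC header (line 11) and the trailing \r\n\r\n that closes the record. In the listing above this means lines 12 to 17, where line 17 is the empty line that terminates the HTTP request and is not shown in the question. With CRLF line endings these lines add up to 141 bytes, which matches Content-Length. Note that the HTTP GET request itself also ends with \r\n\r\n (a trailing blank line):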

    $> cat request 
    GET /vital-signs/carbon-dioxide/ HTTP/1.1
    User-Agent: Wget/1.15 (linux-gnu)
    Accept: */*
    Host: climate.nasa.gov
    Connection: Keep-Alive
    
    $> tail -n2 request | hexdump -C
    00000000  43 6f 6e 6e 65 63 74 69  6f 6e 3a 20 4b 65 65 70  |Connection: Keep|
    00000010  2d 41 6c 69 76 65 0d 0a  0d 0a                    |-Alive....|
    0000001a
    $> cat request | openssl dgst -binary -sha1 | base32
    CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
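
As a cross-check without a `request` file, the block can be rebuilt in Python from the lines shown in the question. This is a sketch under the assumption that the archived request used exactly these bytes with CRLF line endings; the variable names are illustrative.

```python
import base64
import hashlib

# Lines 12 to 16 of the record, plus the empty line that ends the HTTP request.
lines = [
    "GET /vital-signs/carbon-dioxide/ HTTP/1.1",
    "User-Agent: Wget/1.15 (linux-gnu)",
    "Accept: */*",
    "Host: climate.nasa.gov",
    "Connection: Keep-Alive",
    "",
]
# Every line, including the empty one, is terminated by CRLF.
block = "".join(line + "\r\n" for line in lines).encode("ascii")

print(len(block))  # 141, the Content-Length of the record

digest = "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
expected = "sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ"  # value from line 09
# The comparison holds only if the rebuilt bytes equal the archived ones.
print(digest, digest == expected)
```

Dropping the final empty line, or using LF instead of CRLF, changes the byte count and therefore the digest, which is a likely reason the attempts in the question did not match.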