At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 01: WARC/1.0
Line 02: WARC-Type: request
Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/
Line 04: Content-Type: application/http;msgtype=request
Line 05: WARC-Date: 2018-11-03T17:20:02Z
Line 06: WARC-Record-ID: <urn:uuid:e44bc1ea-61a1-4200-b94f-60042456f638>
Line 07: WARC-IP-Address: 54.230.195.16
Line 08: WARC-Warcinfo-ID: <urn:uuid:6d14bf1d-0ef7-4f03-9de2-e578d105d3cb>
Line 09: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ
Line 10: Content-Length: 141
Line 11:
Line 12: GET /vital-signs/carbon-dioxide/ HTTP/1.1
Line 13: User-Agent: Wget/1.15 (linux-gnu)
Line 14: Accept: */*
Line 15: Host: climate.nasa.gov
Line 16: Connection: Keep-Alive
WARC's specs say that The WARC-Block-Digest is an optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record.
I've been trying to figure out what full block of the record
refers to. Is it line 11 to 16? Or Line 12 to 16? Or Line 1 to 16 (without line 9)? I've tried hashing those possibilities but can't get the sha1 (base 32) value above.
A WARC record of a HTTP GET requests has three parts (cf. the WARC spec):
The payload digest of the record is the base32-encoded SHA-1 of the empty string. A proof using Linux command-line tools:
$> echo -n "" | openssl dgst -binary -sha1 | base32
3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
A WARC record has the form:
warc-record = header CRLF
block CRLF CRLF
(see WARC spec: record model)
The "full" block should include everything up to the trailing \r\n\r\n
. This means lines 11 to 17. Note: also the HTTP GET request ends with \r\n\r\n
(a trailing blank line):
$> cat request
GET /vital-signs/carbon-dioxide/ HTTP/1.1
User-Agent: Wget/1.15 (linux-gnu)
Accept: */*
Host: climate.nasa.gov
Connection: Keep-Alive
$> tail -n2 request | hexdump -C
00000000 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20 4b 65 65 70 |Connection: Keep|
00000010 2d 41 6c 69 76 65 0d 0a 0d 0a |-Alive....|
0000001a
$> cat request | openssl dgst -binary -sha1 | base32
CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ