I'm trying to open a WARC file with Python using the toolbox from the following link: http://warc.readthedocs.org/en/latest/
When opening the file with:
import warc
f = warc.open("00.warc.gz")
Everything is fine and the f object is:
<warc.warc.WARCFile instance at 0x1151d34d0>
However when I'm trying to read everything in the file using:
for record in f:
    print record['WARC-Target-URI'], record['Content-Length']
The following error appears:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
record = self.read_record()
File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
header = self.read_header(fileobj)
File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/0.18\n'
Is this because my WARC file version is not supported by the warc toolbox I'm using, or is it something else?
The ClueWeb09 dataset is distributed in the WARC 0.18 format, but it has several issues: some records are malformed.
The most prevalent problem is an extra newline in the WARC header, and there are a few cases of other malformed headers as well.
Moreover, it does not use the standard \r\n end-of-line markers, and that is your actual problem: the version line in your error message, 'WARC/0.18\n', ends in a bare \n.
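If you want to confirm this on your own file before switching libraries, you can look at the raw first line of the archive. The following is only a minimal sketch using the standard gzip module; the helper names version_line_ending and check_warc are made up for illustration and are not part of the warc library.

```python
import gzip

def version_line_ending(first_line):
    """Classify the line terminator of a WARC version line (bytes)."""
    if first_line.endswith(b"\r\n"):
        return "CRLF"   # what the WARC format expects
    if first_line.endswith(b"\n"):
        return "LF"     # bare newline, as in the error message above
    return "none"

def check_warc(path):
    """Return the raw first line of a gzipped WARC file and its line ending."""
    # Only the first line is read, so this is cheap even for large archives.
    with gzip.open(path, "rb") as fh:
        first_line = fh.readline()
    return first_line, version_line_ending(first_line)
```

Calling check_warc("00.warc.gz") on your file should report "LF" if the file has the line-ending issue described above.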
The warc-clueweb library can handle this. It is a Python library written specifically for working with ClueWeb09 WARC files. According to its documentation:
"Only minor modifications to the original library were made. The original documentation of the warc library still holds."