python python-3.x warc

Retrieving records from a WARC file based on URL


I have to retrieve records from a *.warc.gz file based on Target-URI. The documentation says that this requires external CDXJ index files to be created.

I've tried opening the file with gzip.open() and calling seek(offset), but the seek operation takes quite some time (seconds).

Is there another, correct way to retrieve the records?

Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.


Solution

  • Do the seek on the compressed file, before decompressing. A seek() on a file opened with gzip.open() is slow because it has to decompress everything up to the target position. WARC files are usually compressed record by record, so the offset and length stored in the CDXJ index let you cut a single compressed WARC record out of the file and then decompress only that record with gzip. When in doubt, use a library. warcio even provides a command-line tool to extract a single record by offset: warcio extract xyz.warc.gz offset.
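The seek-then-decompress approach can be sketched with the standard library alone. This is a minimal illustration, not the warcio implementation. It assumes the file is compressed per record and that the offset (and, for the first function, the length) comes from the CDXJ index. The function names `read_record` and `read_record_unknown_length` are hypothetical.

```python
import gzip
import zlib


def read_record(path, offset, length):
    """Return the decompressed bytes of one WARC record (one gzip member)."""
    with open(path, "rb") as f:   # plain open(): nothing is decompressed yet
        f.seek(offset)            # cheap seek on the compressed bytes
        member = f.read(length)   # clip out exactly one compressed record
    return gzip.decompress(member)


def read_record_unknown_length(path, offset, chunk_size=64 * 1024):
    """Same idea when only the offset is known: stop at the end of the gzip member."""
    decomp = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:              # eof turns True at the member's trailer
            chunk = f.read(chunk_size)
            if not chunk:                  # file ended before the member did
                break
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

Both functions return the raw record (WARC headers followed by the payload), so the WARC-Target-URI header can be checked on the result. For parsing the record properly, a library such as warcio is still the safer choice.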