I have to retrieve records from a *.warc.gz file based on Target-URI. The documentation says that this requires external CDXJ index files to be created.
I've tried opening the file with gzip.open() and then calling seek(offset), but the seek operation takes quite some time (seconds).
Is there another, correct way to retrieve the records?
Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.
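You should do the seek on the file before decompressing. A seek() on a gzip.open() handle works on the decompressed stream, so it has to decompress everything up to that position, which is why it takes seconds. Usually, WARC files are compressed record by record, and the offset and length in the CDXJ refer to the compressed file: they let you clip out a single WARC record and then decompress only that record with gzip. When in doubt, it is better to use a library. Warcio even provides a command-line tool to extract a single record by offset: warcio extract xyz.warc.gz offset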
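The seek-then-decompress approach could look roughly like the sketch below. It assumes the WARC file really is compressed record by record and that offset and length were taken from the CDXJ entry for the Target-URI you want; the function name read_record is made up for illustration and is not part of any library.

```python
import gzip


def read_record(path, offset, length):
    """Return the decompressed bytes of one WARC record.

    Assumes the file is gzipped record by record, so that
    [offset, offset + length) covers exactly one gzip member.
    """
    # Open as a plain binary file, NOT with gzip.open(): the offset
    # refers to a position in the compressed file.
    with open(path, "rb") as f:
        f.seek(offset)          # cheap seek on the raw file
        chunk = f.read(length)  # the compressed bytes of a single record
    # Decompress only this one record.
    return gzip.decompress(chunk)


# Hypothetical usage, with offset and length read from the CDXJ index:
# record = read_record("xyz.warc.gz", offset, length)
# print(record.decode("utf-8", errors="replace")[:500])
```

The result is the raw record (WARC headers followed by the payload), so a WARC library is still the more convenient choice if you need the parsed headers or the HTTP body.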