python-3.x lzma xz warc

How to compress WARC records with lzma (*.warc.xz) in Python 3?



I have a list of WARC records. Each item in the list is created like this:

import warc

header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
data = "Some string"
record = warc.WARCRecord(header, data.encode('utf-8','replace'))

Currently I store my records in a *.warc.gz file, opened like this:

output_file = warc.open("my_file.warc.gz", 'wb')

And I write records like this:

output_file.write_record(record) # type of record is WARCRecord

But how can I compress with lzma as *.warc.xz? I have tried replacing gz with xz when calling warc.open, but warc in Python 3 does not support this format. I have also found the following approach, but I was not able to save a WARCRecord with it:

output_file = lzma.open("my_file.warc.xz", 'ab', preset=9)
header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
data = "Some string"
record = warc.WARCRecord(header, data.encode('utf-8','replace'))
output_file.write(record)

The error message is:

TypeError: a bytes-like object is required, not 'WARCRecord'

Thanks for any help.


Solution

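  • The WARCRecord class has a write_to method, which writes the record to a file object.

    You could use that to write records to a file created with lzma.open(): call record.write_to(output_file) instead of output_file.write(record), which fails because write() expects bytes rather than a WARCRecord.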
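
    The suggestion above can be sketched as a small helper. This is a minimal sketch, not a tested recipe: write_records_xz is a hypothetical name introduced here, and it assumes that each record's write_to() emits bytes (as a Python 3 port of warc would need to) and accepts any binary file object.

```python
import lzma


def write_records_xz(path, records, preset=9):
    """Write records to an xz-compressed file via each record's write_to().

    Hypothetical helper: `records` is any iterable of objects exposing
    write_to(fileobj), such as the WARCRecord instances from the question.
    """
    # In binary mode, lzma.open() returns an LZMAFile, which behaves like
    # an ordinary binary file object and compresses whatever is written to it.
    with lzma.open(path, "wb", preset=preset) as xz_file:
        for record in records:
            # Let the record serialize itself instead of passing the
            # record object to xz_file.write(), which only accepts bytes.
            record.write_to(xz_file)
```

    With the records from the question, the call would look like this:

    write_records_xz("my_file.warc.xz", [record])

    Opening with 'wb' overwrites an existing file; the 'ab' mode used in the question appends instead.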