I am trying to sift through a big database that is compressed in a .zst file. I am aware that I can simply decompress it and then work on the resulting file, but that uses up a lot of space on my SSD and takes 2+ hours, so I would like to avoid that if possible.
Often when I work with large files, I stream them line by line with code like:
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
I know gzip has this
with gzip.open(filename, 'rt') as f:
    for line in f:
        do_something(line)
but it doesn't seem to work with .zst, so I am wondering if there are any libraries that can decompress and stream the decompressed data in a similar way. For example:
with zstlib.open(filename) as f:
    for line in f.zstreadlines():
        do_something(line)
Knowing which package to use and what the corresponding docs are can be a bit confusing, as there appear to be several Python bindings to the actual Zstandard library.
Below, I am referring to the library by Gregory Szorc, which I installed from conda's default channel with:
conda install zstd
# check:
conda list zstd
# # Name Version Build Channel
# zstd 1.5.5 hc292b87_0
(even though the docs say to install with pip, which I don't do unless there is no other way, as I like my conda environments to remain usable).
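For reference, the pip route described in the docs is, as far as I can tell, just:
pip install zstandard
but, as said, I would rather stay within conda here.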
I am only inferring that this version is the one from G. Szorc, based on the comments in the __init__.py file:
# Copyright (c) 2017-present, Gregory Szorc
# All rights reserved.
#
# This software may be modified and distributed under the terms
# of the BSD license. See the LICENSE file for details.
"""Python interface to the Zstandard (zstd) compression library."""
from __future__ import absolute_import, unicode_literals
# This module serves 2 roles:
#
# 1) Export the C or CFFI "backend" through a central module.
# 2) Implement additional functionality built on top of C or CFFI backend.
Thus, I think that the corresponding documentation is here.
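Another way to double-check, at runtime, which bindings and backend are actually in use is a small sketch like the one below (assuming a recent version of the package, which exposes these attributes):
import zstandard as zstd

print(zstd.__version__)   # version of the Python bindings themselves
print(zstd.backend)       # which backend got loaded, e.g. "cext" or "cffi"
print(zstd.ZSTD_VERSION)  # version tuple of the bundled zstd C library, e.g. (1, 5, 5)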
In any case, quick test after install:
import zstandard as zstd

with zstd.open('test.zstd', 'w') as f:
    for i in range(10_000):
        f.write(f'foo {i} bar\n')

with zstd.open('test.zstd', 'r') as f:
    for i, line in enumerate(f):
        if i % 1000 == 0:
            print(f'line {i:4d}: {line}', end='')
Produces:
line 0: foo 0 bar
line 1000: foo 1000 bar
line 2000: foo 2000 bar
line 3000: foo 3000 bar
line 4000: foo 4000 bar
line 5000: foo 5000 bar
line 6000: foo 6000 bar
line 7000: foo 7000 bar
line 8000: foo 8000 bar
line 9000: foo 9000 bar
Notes:
1. The default mode is mode='rb', the same as for a regular file object. The underlying file is always written in binary mode, but if we use text mode for open, then, according to open's doc, we get "(...) an io.TextIOWrapper if opened for reading or writing in text mode".
2. Iterate over f, not over readlines().
3. From the inline docstring, they make it sound like readlines() returns a list of all the lines in the file, i.e. the whole thing is slurped into memory. With the iterator, it is more likely that only portions of the file are in memory at any moment (in zstd's buffer).
About notes 2 and 3 above: I tested this empirically, by changing the number of lines to 100 million and comparing the memory usage of the two versions (using htop):
with zstd.open('test.zstd', 'r') as f:
    for i, line in enumerate(f):
        if i % 10_000_000 == 0:
            print(f'line {i:8d}: {line}', end='')
-- no bump in memory usage.
with zstd.open('test.zstd', 'r') as f:
    for i, line in enumerate(f.readlines()):
        if i % 10_000_000 == 0:
            print(f'line {i:8d}: {line}', end='')
-- bump in memory usage by a few GB.
This may be specific to the version installed (1.5.5).
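Finally, coming back to the original question of streaming a huge .zst file line by line: the pattern above (for line in f after zstd.open(filename, 'r')) should be all that is needed, but here is an equivalent, slightly lower-level sketch using the library's stream_reader API wrapped in io.TextIOWrapper; the filename and do_something are placeholders standing in for the question's actual data and processing:
import io
import zstandard as zstd

def do_something(line):
    pass  # stand-in for whatever per-line processing is needed

filename = 'big_database.zst'  # hypothetical path to the large compressed dump

dctx = zstd.ZstdDecompressor()
with open(filename, 'rb') as raw:              # compressed bytes on disk
    with dctx.stream_reader(raw) as reader:    # streaming decompression, no temp file
        text = io.TextIOWrapper(reader, encoding='utf-8')
        for line in text:                      # one decoded text line at a time
            do_something(line)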