I have a file enwiktionary_namespace_0.tar.gz that contains 86 .ndjson files
enwiktionary_namespace_0_0.ndjson
enwiktionary_namespace_0_1.ndjson
enwiktionary_namespace_0_2.ndjson
...
enwiktionary_namespace_0_85.ndjson
My goal is to process the .ndjson files in parallel without decompressing them to disk. Each .ndjson file will be processed line by line, so the whole archive never has to fit in memory.
If the file .tar.gz contains a single .ndjson file, then a solution from this answer is:
# Source - https://stackoverflow.com/a/79811790
# Posted by furas, modified by community. See post 'Timeline' for change history
# Retrieved 2025-11-07, License - CC BY-SA 4.0
import tarfile
import json

with tarfile.open("data.tar.gz", "r:gz") as tar:
    data_file = tar.extractfile("data.ndjson")
    for json_line in data_file:
        html = json.loads(json_line)
        print("html:", html)
        # html = process_htm(html)
Is it possible to adopt the above solution to process in parallel multiple .ndjson files inside the .tar.gz file?
Yes, you can stream each .ndjson member from the tarball and process them in parallel without extracting anything to disk.
Open the .tar.gz in read mode, collect the member names, and hand each one to a worker that reads its stream line by line.
tarfile.extractfile() returns a file-like object that is decompressed on the fly, and ThreadPoolExecutor keeps the code simple while avoiding loading everything at once.
One caveat: all file objects returned by a single TarFile share one underlying compressed stream, so they must not be read concurrently; the safe pattern is for each worker to open its own handle on the archive.
import tarfile
import json
import concurrent.futures

ARCHIVE = "enwiktionary_namespace_0.tar.gz"

def process_member(name):
    # Workers open their own handles: members of a shared TarFile
    # read from one stream, which is not safe concurrently.
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        with tar.extractfile(name) as f:
            for line in f:
                obj = json.loads(line)
                # do something with obj here

with tarfile.open(ARCHIVE, "r:gz") as tar:
    ndjson_names = [m.name for m in tar if m.name.endswith(".ndjson")]

with concurrent.futures.ThreadPoolExecutor() as executor:
    list(executor.map(process_member, ndjson_names))  # consume so errors surface
Each worker reads and processes its stream independently, so nothing is written to disk, and memory stays bounded because lines are parsed as they are read.
In conclusion: stream the .ndjson entries with tarfile.extractfile() and process them concurrently with ThreadPoolExecutor. Note that json.loads holds the GIL, so if the per-line work is CPU-heavy, a ProcessPoolExecutor may scale better.
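To see the pattern end to end, here is a self-contained sketch that builds a tiny two-member archive (the file names and contents are made up for the demo) and then streams it in parallel with the approach above, counting lines instead of doing real work:

```python
import concurrent.futures
import io
import json
import tarfile

def make_archive(path):
    # Build a small .tar.gz with two .ndjson members, 3 lines each.
    with tarfile.open(path, "w:gz") as tar:
        for i in range(2):
            payload = b"".join(
                json.dumps({"id": i * 10 + j}).encode() + b"\n" for j in range(3)
            )
            info = tarfile.TarInfo(name=f"part_{i}.ndjson")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def count_lines(archive, name):
    # Each worker opens its own handle: file objects from one shared
    # TarFile all read the same stream and cannot be used concurrently.
    n = 0
    with tarfile.open(archive, "r:gz") as tar:
        with tar.extractfile(name) as f:
            for line in f:
                json.loads(line)  # replace with real per-line processing
                n += 1
    return n

def process_archive(archive):
    with tarfile.open(archive, "r:gz") as tar:
        names = [m.name for m in tar if m.name.endswith(".ndjson")]
    with concurrent.futures.ThreadPoolExecutor() as ex:
        return sum(ex.map(lambda n: count_lines(archive, n), names))
```

With the demo archive above, process_archive returns the total line count across both members; swapping count_lines for your own per-line function gives the full pipeline.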