I have a file enwiktionary_namespace_0.tar.gz that contains 86 .ndjson files
enwiktionary_namespace_0_0.ndjson
enwiktionary_namespace_0_1.ndjson
enwiktionary_namespace_0_2.ndjson
...
enwiktionary_namespace_0_85.ndjson
My goal is to process the .ndjson files in parallel without decompressing them to disk. Each .ndjson file will be processed line by line, so the whole archive never has to fit in memory.
If the file .tar.gz contains a single .ndjson file, then a solution from this answer is:
# Source - https://stackoverflow.com/a/79811790
# Posted by furas, modified by community. See post 'Timeline' for change history
# Retrieved 2025-11-07, License - CC BY-SA 4.0
import tarfile
import json

with tarfile.open("data.tar.gz", "r:gz") as tar:
    data_file = tar.extractfile("data.ndjson")
    for json_line in data_file:
        html = json.loads(json_line)
        print("html:", html)
        # html = process_htm(html)
Is it possible to adopt the above solution to process in parallel multiple .ndjson files inside the .tar.gz file?
Yes, you can stream each .ndjson member from the tarball and process them in parallel without extracting anything to disk.
Open the .tar.gz in read mode, collect the member names, and hand each one to a worker that reads its stream line by line.
tarfile.extractfile() returns a file-like object that is decompressed on the fly, and ThreadPoolExecutor keeps the code simple while avoiding loading everything at once.
One caveat: all file objects returned by a single TarFile share one underlying compressed stream, so they must not be read concurrently; the safe pattern is for each worker to open its own handle on the archive.
import tarfile
import json
import concurrent.futures

ARCHIVE = "enwiktionary_namespace_0.tar.gz"

def process_member(name):
    # Workers open their own handles: members of a shared TarFile
    # read from one stream, which is not safe concurrently.
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        with tar.extractfile(name) as f:
            for line in f:
                obj = json.loads(line)
                # do something with obj here

with tarfile.open(ARCHIVE, "r:gz") as tar:
    ndjson_names = [m.name for m in tar if m.name.endswith(".ndjson")]

with concurrent.futures.ThreadPoolExecutor() as executor:
    list(executor.map(process_member, ndjson_names))  # consume so errors surface
Each worker reads and processes its stream independently, so nothing is written to disk, and memory stays bounded because lines are parsed as they are read.
In conclusion: stream the .ndjson entries with tarfile.extractfile() and process them concurrently with ThreadPoolExecutor. Note that json.loads holds the GIL, so if the per-line work is CPU-heavy, a ProcessPoolExecutor may scale better.
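To see the pattern end to end, here is a self-contained sketch that builds a tiny two-member archive (the file names and contents are made up for the demo) and then streams it in parallel with the approach above, counting lines instead of doing real work:

```python
import concurrent.futures
import io
import json
import tarfile

def make_archive(path):
    # Build a small .tar.gz with two .ndjson members, 3 lines each.
    with tarfile.open(path, "w:gz") as tar:
        for i in range(2):
            payload = b"".join(
                json.dumps({"id": i * 10 + j}).encode() + b"\n" for j in range(3)
            )
            info = tarfile.TarInfo(name=f"part_{i}.ndjson")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def count_lines(archive, name):
    # Each worker opens its own handle: file objects from one shared
    # TarFile all read the same stream and cannot be used concurrently.
    n = 0
    with tarfile.open(archive, "r:gz") as tar:
        with tar.extractfile(name) as f:
            for line in f:
                json.loads(line)  # replace with real per-line processing
                n += 1
    return n

def process_archive(archive):
    with tarfile.open(archive, "r:gz") as tar:
        names = [m.name for m in tar if m.name.endswith(".ndjson")]
    with concurrent.futures.ThreadPoolExecutor() as ex:
        return sum(ex.map(lambda n: count_lines(archive, n), names))
```

With the demo archive above, process_archive returns the total line count across both members; swapping count_lines for your own per-line function gives the full pipeline.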