
Extract Data Dump From Freebase in Python


Given the Freebase Triples data dump (freebase-rdf-latest.gz) downloaded from the website, what is the best way to open and read this file in Python in order to extract information, for example data about companies and businesses?

As far as I can tell, there are packages for the two steps involved, opening a gz file in Python and reading an RDF file, but I'm not sure how to put them together.

My failed attempt in Python 3.6:

import gzip

with gzip.open('freebase-rdf-latest.gz','r') as uncompressed_file:
       for line in uncompressed_file.read():
           print(line)

After that I could parse the XML structure to get the information, but I cannot even read the file.


Solution

  • The problem is that calling read() on the gzip file object decompresses the whole file at once and holds the result in memory (and iterating over the returned bytes yields single byte values, not lines). For a file this large, the more practical approach is to decompress the file a little bit at a time, streaming the results.

    #!/usr/bin/env python3
    import zlib
    
    def stream_unzipped_bytes(filename):
        """
        Generator function, reads gzip file `filename` and yields
        uncompressed bytes.
    
        This function answers your original question, how to read the file,
        but its output is a generator of bytes so there's another function
        below to stream these bytes as text, one line at a time.
        """
        with open(filename, 'rb') as f:
            wbits = zlib.MAX_WBITS | 16  # 16 requires gzip header/trailer
            decompressor = zlib.decompressobj(wbits)
            fbytes = f.read(16384)
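            while fbytes:
                yield decompressor.decompress(decompressor.unconsumed_tail + fbytes)
                fbytes = f.read(16384)
            # emit anything the decompressor is still holding at end of input
            tail = decompressor.flush()
            if tail:
                yield tail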
    
    
    def stream_text_lines(gen):
        """
        Generator wrapper function, `gen` is a bytes generator.
        Yields one line of text at a time.
        """
        buf = b''
        for chunk in gen:
            buf += chunk
            lines = buf.splitlines(keepends=True)
            if not lines:
                # nothing buffered yet (e.g. an empty chunk), wait for more data
                continue
            # yield all but the last line, because this may still be incomplete
            # and waiting for more data from gen
            for line in lines[:-1]:
                yield line.decode()
            # carry the last (possibly partial) line over to the next chunk
            buf = lines[-1]
        # gen is exhausted: yield the final data so the last output is not lost
        if buf:
            yield buf.decode()
    
    
    # Sample usage, using the stream_text_lines generator to stream
    # one line of RDF text at a time
    bytes_generator = stream_unzipped_bytes('freebase-rdf-latest.gz')
    for line in stream_text_lines(bytes_generator):
        # do something with `line` of text
        print(line, end='')
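
Once lines are streaming, the remaining step from the question is picking out the triples about companies. The sketch below is one possible way to do that, not a definitive implementation. It assumes each line of the dump is a tab-separated triple ending in a lone `.`, that identifiers live under the `http://rdf.freebase.com/ns/` prefix, and that businesses are marked with a `type.object.type` triple pointing at a type such as `business.business_operation`. Check all three assumptions against a few real lines of the dump first. The helper names (`iter_lines`, `parse_triple`, `subjects_of_type`) are made up for illustration. `iter_lines` uses the standard library's `gzip.open` in text mode, which also reads the file incrementally when iterated line by line.

```python
import gzip

# Assumption: namespace prefix used by the dump; verify against real lines.
NS = "http://rdf.freebase.com/ns/"


def iter_lines(path):
    """Yield decoded text lines from a gzip file without loading it whole."""
    # Text mode ('rt') makes the gzip file object iterable line by line.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from f


def parse_triple(line):
    """Split one tab-separated triple line; return None if it does not fit."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4 or parts[3] != ".":
        return None
    return parts[0], parts[1], parts[2]


def subjects_of_type(lines, type_id):
    """Yield subjects that have a type.object.type triple naming type_id."""
    predicate = "<" + NS + "type.object.type>"
    wanted = "<" + NS + type_id + ">"
    for line in lines:
        triple = parse_triple(line)
        if triple and triple[1] == predicate and triple[2] == wanted:
            yield triple[0]


# Hypothetical usage on the real dump (type name is an assumption):
# for subject in subjects_of_type(iter_lines('freebase-rdf-latest.gz'),
#                                 'business.business_operation'):
#     print(subject)
```

Because `subjects_of_type` accepts any iterable of text lines, it can be fed from `stream_text_lines` above just as well as from `iter_lines`. A second pass (or a set of collected subjects) would then be needed to pull the other triples, such as names, for the matching subjects.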