python bz2

Reading first lines of bz2 files in Python


I am trying to extract the first 10,000 lines from a bz2 file.

   import bz2
   import codecs
   file = "file.bz2"
   file_10000 = "file.txt"

   output_file = codecs.open(file_10000,'w+','utf-8')

   source_file = bz2.open(file, "r")
   count = 0
   for line in source_file:
       count += 1
       if count < 10000:
           output_file.writerow(line)

But I get the error "'module' object has no attribute 'open'". Do you have any ideas? Or maybe I could save the first 10,000 lines to a txt file some other way? I am on Windows.


Solution

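  • Here is a fully working example. It writes and then reads back a test file that is much smaller than your 10,000 lines. It's nice to have working examples in questions so we can test easily. The error itself occurs because bz2.open was only added in Python 3.3; on older versions, use bz2.BZ2File instead.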

    import bz2
    import itertools
    import codecs
    
    file = "file.bz2"
    file_10000 = "file.txt"
    
    # write test file with 9 lines
    with bz2.BZ2File(file, "w") as fp:
        # BZ2File works with bytes; encoding keeps this valid on Python 3 too
        fp.write('\n'.join('123456789').encode('ascii'))
    
    # the original script using BZ2File ... and 3 lines for test
    # ...and fixing bugs:
    #     1) it only writes 9999 instead of 10000
    #     2) files don't do writerow
    #     3) close the files
    
    output_file = codecs.open(file_10000,'w+','utf-8')
    
    source_file = bz2.BZ2File(file, "r")
    count = 0
    for line in source_file:
        count += 1
        if count <= 3:
            # BZ2File yields bytes; decode them so the utf-8 writer can re-encode
            output_file.write(line.decode('utf-8'))
    source_file.close()
    output_file.close()
    
    # show what you got
    print('---- Test 1 ----')
    print(repr(open(file_10000).read()))   
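
For readers on Python 3.3 or later, where bz2.open does exist, the same test can be sketched in text mode. This is only an illustration under that version assumption: the temporary directory and the names src, dst and limit are made up for the demo, and utf-8 is assumed because the original script writes with that encoding.

```python
import bz2
import os
import tempfile

limit = 3  # use 10000 for the real task

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file.bz2")  # hypothetical paths for the demo
    dst = os.path.join(tmp, "file.txt")

    # write a 9-line test file; "wt" lets bz2.open accept str
    with bz2.open(src, "wt", encoding="utf-8") as fp:
        fp.write("\n".join("123456789"))

    # same counting loop as above, but lines arrive already decoded
    with bz2.open(src, "rt", encoding="utf-8") as source_file, \
            open(dst, "w", encoding="utf-8") as output_file:
        count = 0
        for line in source_file:
            count += 1
            if count <= limit:
                output_file.write(line)

    # read the result back before the temporary directory is removed
    with open(dst, encoding="utf-8") as fp:
        result = fp.read()

print(repr(result))  # '1\n2\n3\n'
```

In text mode no codecs wrapper is needed, since bz2.open handles the decoding itself.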
    

    A more efficient way is to break out of the for loop once you have read the lines you want; otherwise the loop keeps decompressing the rest of the file for nothing. You can even use iterators to slim the code down, like so:

    # a faster way to read first 3 lines
    with bz2.BZ2File(file) as source_file,\
            codecs.open(file_10000,'w+','utf-8') as output_file:
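        # decode each bytes line so the utf-8 writer accepts it on Python 3 too
        output_file.writelines(
            line.decode('utf-8') for line in itertools.islice(source_file, 3))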
    
    # show what you got
    print('---- Test 2 ----')
    print(repr(open(file_10000).read()))
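
The "break out of the for loop" variant mentioned above is not shown in the answer, so here is a minimal sketch of it. The helper name copy_head and the temporary paths are hypothetical, and copying raw bytes in binary mode is a choice made for this sketch (it sidesteps decoding entirely), not something the original script does.

```python
import bz2
import os
import tempfile


def copy_head(src_path, dst_path, limit):
    """Copy at most `limit` lines from a bz2 file; return how many were written."""
    written = 0
    with bz2.BZ2File(src_path) as source_file, \
            open(dst_path, "wb") as output_file:
        for line in source_file:
            if written >= limit:
                break  # stop here so the rest of the file is never decompressed
            output_file.write(line)  # bytes in, bytes out
            written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file.bz2")  # hypothetical paths for the demo
    dst = os.path.join(tmp, "file.txt")

    # 9-line test file, as in the examples above
    with bz2.BZ2File(src, "w") as fp:
        fp.write("\n".join("123456789").encode("ascii"))

    written = copy_head(src, dst, 3)
    with open(dst, "rb") as fp:
        head = fp.read()

    # a limit larger than the file simply copies every line
    total = copy_head(src, dst, 100)

print(written, repr(head), total)
```

For the real task the call would be copy_head("file.bz2", "file.txt", 10000).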