
Limit on bz2 file decompression using python?


I have a large number of bz2-compressed files that I am trying to decompress into a temporary directory with Python so that I can analyze them. There are hundreds of thousands of files, so decompressing them manually isn't feasible, and I wrote the following script.

My issue is that the decompressed output never exceeds about 900 KB, even though decompressing the same files manually gives about 6 MB each. I am not sure whether this is a flaw in my code (for example, in how I read the data as a string and then copy it to the file) or a problem with something else. I have tried this with different files, and I know that it works for files smaller than 900 KB. Has anyone had a similar problem and found a solution?

My code is below:

import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed

    '''


    cpath = os.getcwd() #get current path
    filenames_ = []  #list to add filenames to for future use

    for zipped_file in glob.glob(filepath):  #loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file,'rb') as zipfile:   #Read in the bz2 files
            newfilepath = cpath +'/temp/'+zipped_file[-47:-4]     #create a temporary file
            with open(newfilepath, "wb") as tmpfile: #open the temporary file
                for i,line in enumerate(zipfile.readlines()):
                    tmpfile.write(line) #write the data from the compressed file to the temporary file



            filenames_.append(newfilepath)
    return filenames_


path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)   

It returns the correct file paths, but the files have the wrong sizes, capped at about 900 KB.


Solution

  • It turns out the issue is that the files are multi-stream bz2 archives, which the bz2 module in Python 2.7 does not support: it stops after the first stream, which is why the output was truncated at about 900 KB. There is more info here, as mentioned by jasonharper, and here. Below is a workaround that calls the Unix bzip2 command to decompress each file and then moves the result into the temporary directory I want. It is not as pretty, but it works.

    import glob
    import os
    import shutil
    import subprocess

    def unzip_f(filepath):
        '''
        Input a glob pattern matching a group of Himawari .bz2 files with common names.
        Returns the paths of the temporary files that have been decompressed.
        '''
        tmpdir = os.path.join(os.getcwd(), 'temp')  # directory for the decompressed files
        if not os.path.isdir(tmpdir):
            os.makedirs(tmpdir)
        filenames_ = []  # list to add filenames to for future use

        for zipped_file in glob.glob(filepath):  # loop over the files that match the pattern
            # bzip2 writes its output next to the input, with '.bz2' stripped
            unzipped_file = zipped_file[:-4]
            newfilename = os.path.join(tmpdir, os.path.basename(unzipped_file))

            # -k keeps the original .bz2 file, -d decompresses; check_call waits for
            # bzip2 to finish (os.popen does not) and raises an error if it fails
            subprocess.check_call(['bzip2', '-kd', zipped_file])
            shutil.move(unzipped_file, newfilename)

            filenames_.append(newfilename)
        return filenames_


    path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'

    unzip_f(path_)
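
For anyone who would rather stay in Python than shell out, here is a minimal sketch of a pure-Python alternative. It assumes each file is simply several bz2 streams concatenated back to back, and it starts a fresh `bz2.BZ2Decompressor` whenever one stream ends. The names `decompress_multistream` and `chunk_size` are made up for this illustration and are not part of the bz2 module. On Python 3.3 and later this should not be needed, since `bz2.BZ2File` and `bz2.open` read multi-stream files directly.

```python
import bz2


def decompress_multistream(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a (possibly multi-stream) .bz2 file from src_path to dst_path.

    Hypothetical helper: reads the compressed file in chunks and switches to a
    new BZ2Decompressor each time the current bz2 stream is finished.
    """
    decompressor = bz2.BZ2Decompressor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            while chunk:
                try:
                    dst.write(decompressor.decompress(chunk))
                except EOFError:
                    # The previous stream ended exactly at a chunk boundary,
                    # so this chunk starts a new stream: retry it from scratch.
                    decompressor = bz2.BZ2Decompressor()
                    continue
                # Bytes left over after the end of a stream belong to the next one.
                chunk = decompressor.unused_data
                if chunk:
                    decompressor = bz2.BZ2Decompressor()
    return dst_path
```

In the function above, a call such as `decompress_multistream(zipped_file, newfilename)` could stand in for the `bzip2` and `shutil.move` steps. I have not timed it against the command-line tool, so treat it as a starting point rather than a drop-in replacement.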