python pandas data-science file-conversion python-zipfile

zipfile and pandas failure mid-loop


I'm writing this on my phone, so a full code example is sorta out of the question at the moment, but I need some help.

I'm working on parsing a set of .csv files from a zipped infile, pulling out specific columns from each file, generating a new .csv with the chosen columns, and then exporting the new dataframes to a zipped outfile.

I'm doing this through a series of loops, but I can't get beyond a 78% success rate on the parse step, or 73% on the parse and compression steps combined.

Somewhere along the way either zipfile.ZipFile or pandas' DataFrame.to_csv is breaking, and I'm not sure why. I've been trying to figure it out for two weeks, and I'm finally breaking down and asking for assistance.

Brief code snippets for now:

Export function:

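 import os
 import zipfile

 def export(new_filename):

     os.chdir([import_file location])
     try:
         # 'a' appends to the archive; the keyword is compression=, not zipfile=
         with zipfile.ZipFile(outfile_name, 'a', compression=zipfile.ZIP_DEFLATED, allowZip64=True) as outfile:
             try:
                 outfile.write(new_filename)
                 # random errors at runtime saying the writing handle is still open... Not sure why.
             except Exception as exc:
                 # alert of failure at this step. I have tried NameError
                 # and ValueError exceptions, but they don't help.
                 print(f"Failed to add {new_filename}: {exc}")
     except Exception as exc:
         # alert of failure opening the archive
         print(f"Failed to open {outfile_name}: {exc}")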

Pandas function:

 import pandas as pd

 def infile_parser(filename, new_filename):

     # excluding code beyond making the dataframe and file generation;
     # data and useful_columns come from code not shown here
     df = pd.DataFrame(data, columns=useful_columns)
     df.to_csv(new_filename, index=False)

Thank you in advance. I can add more context if requested.


Solution

  • I figured out where it was breaking. Sorry, I forgot to update this question with the solution.

    The issue was in the data of some of the files. The files causing issues had only 1 or 2 rows, all in column A, whereas the good files had full tables of many rows. Pandas was assigning the string in the first cell to the header, and everything broke from there because the columns used from the other files did not exist in the bad files. I added automated bad-file checking based on the length of the dataframe.

    Verifying the data before parsing, and thereby omitting the bad files from the process, solved all issues.
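
    For anyone hitting the same thing, here is a minimal sketch of that kind of pre-parse check. It is not my exact code: it uses the standard csv module to look at each file before pandas ever sees it, rather than checking the dataframe length afterwards. The names is_usable_csv and split_good_and_bad and the min_rows=3 threshold are made up for illustration (my bad files had 1 or 2 rows), so adjust them to your data.

```python
import csv
import io
import zipfile


def is_usable_csv(raw_bytes, useful_columns, min_rows=3):
    """Return True if the CSV has every needed column and enough data rows.

    min_rows is an assumed threshold, not a value from the original post.
    """
    # utf-8-sig drops a leading byte-order mark so the first header name stays clean
    text = raw_bytes.decode("utf-8-sig", errors="replace")
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return False  # completely empty file
    # In a bad file the "header" is just a stray string, so the needed columns are missing
    if not set(useful_columns).issubset(header):
        return False
    # Count non-blank data rows after the header
    data_rows = sum(1 for row in reader if row)
    return data_rows >= min_rows


def split_good_and_bad(infile_name, useful_columns, min_rows=3):
    """Sort the .csv members of a zip archive into (good, bad) name lists."""
    good, bad = [], []
    with zipfile.ZipFile(infile_name) as infile:
        for name in infile.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore anything that is not a CSV
            ok = is_usable_csv(infile.read(name), useful_columns, min_rows)
            (good if ok else bad).append(name)
    return good, bad
```

    Only the names in the good list then go through the parse and export loop, and the bad list can be printed or logged so you know which files were skipped.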