I wrote a python
program to rename file names (around 500k files) in a specific format. To track the status, I have created an array of array status [[]]
to which I am appending filename, file path, status (whether renaming is successful or not) and error (if any). Once the process is completed, exporting that array to a csv
using pandas
.
status_file = # path to csv file
status_file_stream = open(status_file, 'w')
df_status = pandas.DataFrame(status)
df_status.to_csv(status_file, index=False)
Generated CSV
file size is 228 MB
, but when I open it with openoffice
application and save as xlsx
then the new xlsx
file size is only 15.6 MB
without loosing any data.
Why there is a huge size difference between CSV
and XSLX
or is this is the issue with pandas
to get large file size?
for completeness sake ill write my comment as an answer:
compression! csv is just a fancy way to say this is a txt with commas, its a plain text file. xlsx on the other hand is a binary format and programs that generate that format compress the data, if your data is mostly the same, which I'm assuming it is because the paths are mostly the same, then its highly compressible, which is what you are seeing
xlsx uses zip (at least according to google)
to test this claim I have generated a ~110MB file containing random file names with the same path and used zlib
to compress it and the output was ~20MB
xlsx probably has additional optimizations but I did not dig into the format