python r feather

2 .feather files with same data, completely different sizes?


I have 2 feather files based on the same data. The only difference is the way the data is obtained.

File 1 comes from a list of queries, broken out by month, that are each saved as individual files. Each file is then read into a dictionary and concatenated with pd.concat(dict.values()) in Python.

File 2 is another list of queries, broken out into quarters, that are each saved as individual files. Each file is then concatenated through some process in R that I'm not familiar with.

Upon reading both files, I can see that the data is the same. Same number of rows, sums, etc.

But File 1 is 3GB and File 2 is 6GB. Why is that?


Solution

  • This happens because the 6GB file contains more blocks than the 3GB one. The fewer blocks a file is split into, the better the compression that can be achieved, since each block is compressed on its own. Compare WinRAR compression with and without the "create solid archive" option. It is worth mentioning that the 6GB file may be better optimized for random reads.
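The block effect described above can be reproduced with a rough sketch using only the standard library. This is an illustration of the general principle, not of Feather itself: zlib stands in for whatever codec the files were actually written with, and the helper `compressed_size` and the sample `rows` data are hypothetical names made up for this example.

```python
import zlib


def compressed_size(data: bytes, block_size: int) -> int:
    """Total size when `data` is cut into blocks that are compressed independently."""
    total = 0
    for start in range(0, len(data), block_size):
        # Each block gets a fresh compressor, so it cannot reuse
        # patterns already seen in earlier blocks.
        total += len(zlib.compress(data[start:start + block_size]))
    return total


# Hypothetical stand-in for table data: many similar, repetitive rows.
rows = b"".join(
    b"2021-%02d,store_%03d,OK\n" % (i % 12 + 1, i % 50)
    for i in range(200_000)
)

one_block = compressed_size(rows, len(rows))  # everything in a single block
many_blocks = compressed_size(rows, 1024)     # same bytes, many small blocks

print("raw bytes:        ", len(rows))
print("one block:        ", one_block)
print("1 KiB blocks:     ", many_blocks)
```

The same bytes end up larger when they are compressed in many small independent blocks, which is the trade-off the answer describes: smaller blocks can be read individually, at the cost of a worse overall compression ratio.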