
RDS file is larger than the CSV file for the same data frame


I saved a data frame in both CSV and RDS formats, but the RDS file is significantly larger than the CSV file (40 GB vs. 10 GB). According to this blog:

[RDs format] creates a serialized version of the dataset and then saves it with gzip compression
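For reference, here is a minimal sketch of how the on-disk sizes can be compared on a small toy data frame. The toy data, the temporary file paths, and the variable names are assumptions for illustration and are not the real dataset; `saveRDS()` uses gzip compression by default, and `compress = FALSE` turns it off.

```r
# Toy data frame of 0/1 integers (an assumption; the real data is far larger).
set.seed(1)
df <- data.frame(matrix(rbinom(1000 * 50, 1, 0.5), nrow = 1000))

# Temporary paths, used only for this illustration.
csv_path <- tempfile(fileext = ".csv")
rds_gz   <- tempfile(fileext = ".rds")
rds_raw  <- tempfile(fileext = ".rds")

write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_gz)                      # default: compress = TRUE (gzip)
saveRDS(df, rds_raw, compress = FALSE)   # no compression

# Compare the sizes on disk, in bytes.
sizes <- file.size(c(csv_path, rds_gz, rds_raw))
names(sizes) <- c("csv", "rds_gzip", "rds_uncompressed")
print(sizes)
```

Running the same comparison on a sample of the real data would show whether the RDS file was actually written with compression.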

So, if the RDS data is compressed while the CSV data is not, why is the RDS file so much larger? I would understand the difference if the dataset were small, but it is 140,000 by 42,000, so it is large enough that small-file overhead should not be the explanation.
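To put the numbers above in perspective, here is a rough back-of-the-envelope calculation. It assumes that 1 GB means 1e9 bytes and, for the raw-size estimate, that every column is stored as an 8-byte double; both are assumptions, since the column types are not stated here.

```r
# Dimensions and file sizes as stated in the question.
n_rows <- 140000
n_cols <- 42000
n_cells <- n_rows * n_cols                    # 5.88e9 cells

# Assumption: every cell is a double (8 bytes) and nothing is compressed.
bytes_per_double <- 8
raw_gb <- n_cells * bytes_per_double / 1e9    # about 47 GB

# Average bytes per cell actually observed on disk.
csv_bytes_per_cell <- 10e9 / n_cells          # about 1.7
rds_bytes_per_cell <- 40e9 / n_cells          # about 6.8

print(c(raw_gb = raw_gb,
        csv_bytes_per_cell = csv_bytes_per_cell,
        rds_bytes_per_cell = rds_bytes_per_cell))
```

This only quantifies the gap between the two files; it does not by itself say what causes it.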


Solution

  • I believe this is caused by integer overflow in R when computing the indices of the new data frame. I could not find any reference in the documentation to overflow as a possible cause of such problems, but I ran into similar issues in Python, whose docs do name overflow as a possible cause. I could not find any other fix, so I reduced the size of my dataset, after which everything worked fine.
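As a small illustration of the overflow mentioned above, the sketch below multiplies the dimensions from the question as R integers. It only shows that the cell count exceeds what a 32-bit R integer can hold; it does not confirm that this is what happens inside the data frame code, and the variable names are made up for the example.

```r
# Dimensions from the question, as R integers (the L suffix).
n_rows <- 140000L
n_cols <- 42000L

# Largest value an R integer can hold (2147483647).
int_max <- .Machine$integer.max

# Integer arithmetic: the product does not fit, so R returns NA
# (with a warning, suppressed here).
cells_int <- suppressWarnings(n_rows * n_cols)

# Double arithmetic: the same product is computed without overflow.
cells_dbl <- as.numeric(n_rows) * as.numeric(n_cols)

print(int_max)              # 2147483647
print(cells_int)            # NA
print(cells_dbl > int_max)  # TRUE
```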