juliadataframes.jl

Convert Julia DataFrame to an array of bytes for compression


I loaded two datasets from CSV files and then merged them using a leftjoin:

using CSV
using DataFrames
using CodecZstd

df1 = CSV.read(joinpath(root, "data", "raw", "df1.csv"), DataFrame)
df2 = CSV.read(joinpath(root, "data", "raw", "df2.csv"), DataFrame)

merged = leftjoin(df1, df2, on=:id)

Now I want to write the merged dataframe to disk as a .zst compressed file (Zstandard compression).

I managed to do this by first writing to .csv, reading that file back, and then writing it again as .zst, but is there a way to directly convert a DataFrame into an array of bytes that I can save to disk?


Solution

  • There are several options. The one built into Julia is to serialize the data frame using the Serialization standard library. It offers two functions: serialize, which writes an object to a stream, and deserialize, which reads it back. You can then use CodecZstd.jl to compress the serialized bytes and save them to disk.

    Note that when you use serialization, it is your responsibility to ensure that the Julia and package versions are consistent between the Julia session where you write the data and the one where you read it, since the serialized format is not guaranteed to be readable across versions.
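
The approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the small `merged` data frame is a stand-in for the one in the question, and the output path and the `.jls.zst` file name are assumptions made only for illustration.

```julia
using DataFrames
using Serialization
using CodecZstd

# Stand-in for the `merged` data frame from the question (hypothetical data).
merged = DataFrame(id = 1:3, value = ["a", "b", missing])

# Serialize the data frame into an in-memory buffer and take its bytes.
io = IOBuffer()
serialize(io, merged)
raw_bytes = take!(io)                  # Vector{UInt8}

# Compress the bytes with Zstandard and write them to disk.
compressed = transcode(ZstdCompressor, raw_bytes)
path = joinpath(mktempdir(), "merged.jls.zst")   # hypothetical output path
write(path, compressed)

# Reading back: decompress the file contents, then deserialize.
restored = deserialize(IOBuffer(transcode(ZstdDecompressor, read(path))))
```

Here `raw_bytes` is the array of bytes the question asks for, so the intermediate .csv file is not needed. The file produced this way holds Julia's serialization format, not CSV, so it is meant to be read back only from Julia, with the version caveat mentioned above.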