
How to define an empty DataFrame with dynamically typed Column Names and Column Types in Julia?


Given column names and column types like these:

col_names = ["A", "B", "C"]
col_types = ["String", "Int64", "Bool"]

I want to create an empty DataFrame like this:

desired_DF = DataFrame(A = String[], B = Int64[], C = Bool[]) #But I cannot specify every column name and type like this every time.

How do I do this?

I am looking for either a code snippet that does this or, if you think the solution I've copied below is a good one, an explanation of how it works.

I've seen a solution here. It works, but I do not understand it, especially the third line, in particular the semicolon at the beginning and the three dots at the end.

col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)

df = DataFrame(named_tuple) # 0×2 DataFrame

Also, is there perhaps an even more elegant way to do this?


Solution

  • Let us start with the input:

    julia> col_names = ["A", "B", "C"]
    3-element Vector{String}:
     "A"
     "B"
     "C"
    
    julia> col_types = [String, Int64, Bool]
    3-element Vector{DataType}:
     String
     Int64
     Bool
    

    Note the difference: col_types needs to hold types, not strings. col_names is fine the way you proposed.
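
    If the type names arrive as strings, as in the question, they have to be turned into types first. One cautious option is an explicit lookup table, sketched below. The name `TYPE_LOOKUP` and the set of supported types are assumptions made for illustration; neither is part of DataFrames.jl.

    ```julia
    # Hypothetical lookup table from type names to types; extend as needed.
    const TYPE_LOOKUP = Dict("String" => String, "Int64" => Int64, "Bool" => Bool)

    type_names = ["String", "Int64", "Bool"]

    # Indexing the Dict throws a KeyError for names that are not listed,
    # which avoids evaluating arbitrary strings as code.
    col_types = [TYPE_LOOKUP[name] for name in type_names]
    ```

    A lookup table is more restrictive than parsing and evaluating the strings, but it only ever yields the types you listed.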

    Now there are many ways to solve your problem. Let me show you the simplest one in my opinion:

    First, create a vector of vectors that will be columns of your data frame:

    julia> [T[] for T in col_types]
    3-element Vector{Vector}:
     String[]
     Int64[]
     Bool[]
    

    Now you just need to pass it to the DataFrame constructor, with this vector of vectors as the first argument and the column names as the second:

    julia> DataFrame([T[] for T in col_types], col_names)
    0×3 DataFrame
     Row │ A       B      C
         │ String  Int64  Bool
    ─────┴─────────────────────
    

    and you are done.

    If you do not have column names, you can have them generated automatically by passing :auto as the second argument:

    julia> DataFrame([T[] for T in col_types], :auto)
    0×3 DataFrame
     Row │ x1      x2     x3
         │ String  Int64  Bool
    ─────┴─────────────────────
    

    This is a simple way to get what you want.
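
    The constructor call above can be wrapped in a small helper, sketched here. The function name `empty_dataframe` and the length check are assumptions added for illustration; only the `DataFrame(columns, names)` call comes from the steps above.

    ```julia
    using DataFrames

    # Hypothetical helper: builds a 0-row DataFrame from names and types.
    function empty_dataframe(col_names, col_types)
        length(col_names) == length(col_types) ||
            throw(ArgumentError("need exactly one type per column name"))
        return DataFrame([T[] for T in col_types], col_names)
    end

    df = empty_dataframe(["A", "B", "C"], [String, Int64, Bool])

    # Values pushed later must be convertible to the declared column types.
    push!(df, ("x", 1, true))
    ```

    After the `push!`, `df` holds one row, and the columns keep the element types they were created with.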


    Now let us decompose the approach you mentioned above:

    (; zip(col_names, type[] for type in col_types )...)
    

    To understand it you need to know how keyword arguments can be passed to functions. See this:

    julia> f(; kwargs...) = kwargs
    f (generic function with 1 method)
    
    julia> f(; [(:a, 10), (:b, 20), (:c, 30)]...)
    pairs(::NamedTuple) with 3 entries:
      :a => 10
      :b => 20
      :c => 30
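
    As a small sketch of the same mechanism, the code below splats both (name, value) tuples and name => value pairs after the semicolon and compares the results with ordinary keyword syntax. The function name `collect_kwargs` is a hypothetical one chosen for this example.

    ```julia
    # Collect whatever keyword arguments are passed into a NamedTuple.
    collect_kwargs(; kwargs...) = (; kwargs...)

    as_tuples = [(:a, 10), (:b, 20)]   # (name, value) tuples
    as_pairs  = [:a => 10, :b => 20]   # name => value pairs

    from_tuples  = collect_kwargs(; as_tuples...)
    from_pairs   = collect_kwargs(; as_pairs...)
    from_literal = collect_kwargs(a = 10, b = 20)
    ```

    All three calls pass the same keyword arguments, so the three results compare equal.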
    

    The example you quoted:

    (; zip(col_names, type[] for type in col_types )...)
    

    uses exactly this trick. Since no function name precedes the parentheses, Julia creates a NamedTuple instead of calling a function (this is how the syntax works). Keyword names must be Symbols, which is why the quoted snippet defines col_names as a vector of Symbols. The zip part just produces the (name, value) tuples, like the ones passed to f above:

    julia> collect(zip(Symbol.(col_names), type[] for type in col_types))
    3-element Vector{Tuple{Symbol, Vector}}:
     (:A, String[])
     (:B, Int64[])
     (:C, Bool[])
    

    So the example is the same as passing:

    julia> (; [(:A, String[]), (:B, Int64[]), (:C, Bool[])]...)
    (A = String[], B = Int64[], C = Bool[])
    

    Which is, given what we have said, the same as passing:

    julia> (; :A => String[], :B => Int64[], :C => Bool[])
    (A = String[], B = Int64[], C = Bool[])
    

    Which is, in turn, the same as just writing:

    julia> (; A = String[], B = Int64[], C = Bool[])
    (A = String[], B = Int64[], C = Bool[])
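
    The chain of equivalences above can be checked with plain Julia, without DataFrames. This is a minimal sketch; the variable names `nt`, `literal`, `same_names` and `same_types` are chosen for illustration, and the String column names are converted with `Symbol.` because keyword names must be Symbols.

    ```julia
    col_names = ["A", "B", "C"]
    col_types = [String, Int64, Bool]

    # Keyword names must be Symbols, so the String names are converted first.
    nt = (; zip(Symbol.(col_names), (T[] for T in col_types))...)

    # The same NamedTuple written out by hand.
    literal = (; A = String[], B = Int64[], C = Bool[])

    same_names = keys(nt) == keys(literal)
    same_types = typeof(nt) == typeof(literal)
    ```

    The resulting `nt` can be passed to `DataFrame(nt)` just like the hand-written NamedTuple.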
    

    So this explains how and why the example you quoted works. However, I believe the approach I propose is simpler.