Tags: dataframe, tuples, julia, apache-arrow, julia-dataframe

Saving a DataFrame with Arrow.jl fails with "ArgumentError: type does not have a definite number of fields" for a column of tuples of tuples of Ints


I have a dataframe that I'd like to save using Arrow.write().

I can save a subframe of it by omitting one column. But if I leave the column in, I get this error:

ArgumentError: type does not have a definite number of fields

The objects in this column are all 4-Tuples, and their elements are all either empty Tuples or 1- or 2-Tuples of Int64s. Typical examples would be ((1,), (), (2,), ()) and ((1, 2), (), (), ()). If I use Arrays of Arrays rather than Tuples of Tuples, it works just fine. I prefer to use tuples, and I would prefer not to have to process data before writing and after reading it. Note that this also rules out things like using four separate columns; besides, I suspect that having 2-tuples, 1-tuples, and empty tuples in the same column would produce the same error.

I don't really understand the meaning of the error here, so I'm not sure how to fix it. Is there an easy fix? Or do I need to use arrays instead?

Here is a minimal working example which gives me this error:

using Arrow, DataFrames

x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
df = DataFrame(col = [x, y]);
Arrow.write("test.arrow", df)

If I use col=[x] or col=[y], it works, so the problem stems from having both tuple shapes in the same vector. Maybe this is a fundamental limitation of Arrow?

More details on the error message: The error is thrown from reflection.jl, line 764, in fieldcount(@nospecialize t). This function is called by Arrow's arrowvector method for StructKind (in arraytypes/struct.jl). Here is the full function definition:

function arrowvector(::StructKind, x, i, nl, fi, de, ded, meta; kw...)
    len = length(x)
    validity = ValidityBitmap(x)
    T = Base.nonmissingtype(eltype(x))
    data = Tuple(arrowvector(ToStruct(x, j), i, nl + 1, j, de, ded, nothing; kw...) for j = 1:fieldcount(T))
    return Struct{withmissing(eltype(x), namedtupletype(T, data)), typeof(data)}(validity, data, len, meta)
end

fieldcount is called on line 5 of this function (the line that builds data), but I don't know what T will be for my use case.


Solution

  • The problem is fixed by giving the vector an explicit element type, a Union of the two concrete tuple types, before constructing the DataFrame. Here is a fixed working example:

    using Arrow, DataFrames
    
    x = ((1,), (1,), (), ());
    y = ((1, 2), (), (), ());
    T = Union{
        Tuple{Tuple{Int64}, Tuple{Int64}, Tuple{}, Tuple{}},
        Tuple{Tuple{Int64, Int64}, Tuple{}, Tuple{}, Tuple{}}
    };
    C = T[x, y];
    df = DataFrame(col = C);
    Arrow.write("test.arrow", df)
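
To see why the explicit element type matters, it can help to inspect what Julia infers for the untyped literal [x, y]. The following is a small sketch that uses only Base functions (eltype, fieldtypes, fieldcount); has_definite_fields is a hypothetical helper name introduced here, not part of Base or Arrow.jl. A likely explanation, not verified against Arrow.jl's source beyond the arrowvector method quoted in the question, is that the inferred element type contains variable-length (Vararg) inner tuple types, and fieldcount throws this ArgumentError for such types.

```julia
x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

# Element type Julia infers for the untyped vector literal.
inferred = eltype([x, y])
println("inferred eltype: ", inferred)

# Hypothetical helper: true if fieldcount(T) succeeds, false if it throws
# the ArgumentError seen in the question.
function has_definite_fields(T)
    try
        fieldcount(T)
        return true
    catch err
        err isa ArgumentError || rethrow()
        return false
    end
end

println("outer type definite: ", has_definite_fields(inferred))

# Inspect each inner tuple type, but only if the outer type can be enumerated.
if has_definite_fields(inferred)
    for S in fieldtypes(inferred)
        println(S, " => ", has_definite_fields(S))
    end
end
```

If some of the inner types print with Vararg and report false, that would match the error: the StructKind method recurses into each field of the outer tuple and calls fieldcount on a tuple type whose length is not fixed. Annotating the vector with a Union of the concrete tuple types keeps every alternative fixed-length, which is presumably why the example above writes successfully.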