I have a dataframe that I'd like to save using Arrow.write().
I can save a subframe of it by omitting one column. But if I leave the column in, I get this error:
ArgumentError: type does not have a definite number of fields
The objects in this column are all 4-Tuples, and their elements are all either empty Tuples or 1- or 2-Tuples of Int64s. Typical examples would be ((1,), (), (2,), ()) and ((1, 2), (), (), ()). If I use Arrays of Arrays rather than Tuples of Tuples, it works just fine. I prefer to use tuples, and I would prefer not to have to process data before writing and after reading it (note that this also rules out things like using four separate columns -- plus I suspect having 2-tuples and 1-tuples and empty tuples in the same column would produce the same error).
I don't really understand the meaning of the error here, so I'm not sure how to fix it. Is there an easy fix? Or do I need to use arrays instead?
Here is a minimal working example which gives me this error:
using Arrow, DataFrames
x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
df = DataFrame(col = [x, y]);
Arrow.write("test.arrow", df)
If I use col=[x] or col=[y], it works, so the problem stems from having both tuple shapes in the same vector. Maybe this is a fundamental limitation of Arrow?
More details on the error message: The error message comes from reflection.jl on line 764, in fieldcount(@nospecialize t). This function is called by Arrow's arrowvector (in `arraytypes/struct.jl`). Here is the full function definition:
function arrowvector(::StructKind, x, i, nl, fi, de, ded, meta; kw...)
    len = length(x)
    validity = ValidityBitmap(x)
    T = Base.nonmissingtype(eltype(x))
    data = Tuple(arrowvector(ToStruct(x, j), i, nl + 1, j, de, ded, nothing; kw...) for j = 1:fieldcount(T))
    return Struct{withmissing(eltype(x), namedtupletype(T, data)), typeof(data)}(validity, data, len, meta)
end
fieldcount is called on line 5, but I don't know what T will be for my use case.
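One way to find out what T is would be to inspect the element types directly, without Arrow. The following is a minimal sketch using only Base; the helper name fieldcount_or_error is made up for illustration. For the mixed column, the element type should be a non-concrete tuple type (its exact printed form may differ between Julia versions), and fieldcount throws the same ArgumentError for any tuple type with a variable-length tail.

```julia
# Hypothetical helper: return the field count of a type as a string,
# or the error message if fieldcount throws.
function fieldcount_or_error(S)
    try
        return string(fieldcount(S))
    catch err
        return sprint(showerror, err)
    end
end

x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

# Compare the single-shape columns with the mixed-shape column.
for col in ([x], [y], [x, y])
    T = Base.nonmissingtype(eltype(col))
    println(T)
    println("  concrete:   ", isconcretetype(T))
    println("  fieldcount: ", fieldcount_or_error(T))
end

# A tuple type with a Vararg tail has no definite number of fields.
println(fieldcount_or_error(Tuple{Int64, Vararg{Int64}}))
```

If the inner tuple types of the mixed column show a Vararg tail, that would explain why the error appears only when both shapes share a vector.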
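The problem is fixed by explicitly typing the array before constructing the DataFrame, so that the column's element type is a Union of the two concrete tuple types instead of whatever non-concrete type Julia infers for the mixed vector. Here is a fixed working example: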
using Arrow, DataFrames
x = ((1,), (1,), (), ());
y = ((1, 2), (), (), ());
T = Union{
    Tuple{Tuple{Int64}, Tuple{Int64}, Tuple{}, Tuple{}},
    Tuple{Tuple{Int64, Int64}, Tuple{}, Tuple{}, Tuple{}}
};
C = T[x, y];
df = DataFrame(col = C);
Arrow.write("test.arrow", df)
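Writing the Union by hand gets tedious when more tuple shapes appear. As a sketch, the Union could instead be derived from the values themselves; the names rows, U, and typed_col are illustrative only. I have only confirmed that Arrow.write works for the two-shape case above, so treat larger unions as an assumption.

```julia
x = ((1,), (1,), (), ())
y = ((1, 2), (), (), ())

# Keep the rows in a tuple so each value retains its concrete type.
rows = (x, y)

# Build a Union of the concrete types of all rows (duplicates collapse).
U = Union{typeof.(rows)...}

# Construct a vector whose element type is exactly that Union.
typed_col = U[rows...]

println(eltype(typed_col))
```

typed_col can then be passed to DataFrame(col = typed_col) in place of C.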