I want to write a simple dataframe as an ORC file. The only column is of an integer type. If I set all values to None, an exception is raised on to_orc.
I understand that pyarrow cannot infer the datatype from None values, but what can I do to fix the datatype for output? Attempts to use .astype() only raised TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'.
Bonus points if the solution also works for
Script:
import pandas as pd

data = {'a': [1, 2]}
df = pd.DataFrame(data=data)
print(df)
df.to_orc('a.orc')  # OK
df['a'] = None
print(df)
df.to_orc('a.orc')  # fails
Output:
   a
0  1
1  2
      a
0  None
1  None
Traceback (most recent call last):
  File ... line 9, in <module>
    ...
  File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: null
This is a known issue; see https://github.com/apache/arrow/issues/30317. The ORC writer does not yet support writing an all-null column that has no specific dtype. Such a column has the generic object dtype in pandas, and pyarrow converts it to the Arrow null type, which is what the error message refers to. If you first cast the column to a concrete dtype, for example float, the write succeeds.
Using the df from your example:
>>> df.dtypes
a    object
dtype: object
# the column has generic object dtype, cast to float
>>> df['a'] = df['a'].astype("float64")
>>> df.dtypes
a    float64
dtype: object
# now writing to ORC and reading back works
>>> df.to_orc('a.orc')
>>> pd.read_orc('a.orc')
    a
0 NaN
1 NaN
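If the dataframe has several columns and you do not know in advance which ones are entirely null, the same cast can be applied in a loop. The following is only a minimal sketch: the helper name cast_all_null_columns and its dtype parameter are made up for illustration and are not part of pandas or pyarrow. It defaults to float64 because that is the cast shown above; whether another target dtype (for example the nullable "Int64") round-trips through the ORC writer is an assumption you would need to check against your pandas and pyarrow versions.

```python
import pandas as pd


def cast_all_null_columns(df: pd.DataFrame, dtype: str = "float64") -> pd.DataFrame:
    """Return a copy in which every all-null object column is cast to `dtype`.

    Hypothetical helper, not a pandas or pyarrow API.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Only touch columns that have the generic object dtype and contain
        # nothing but missing values; those are the ones pyarrow would map
        # to the Arrow null type.
        if col.dtype == object and col.isna().all():
            out[name] = col.astype(dtype)
    return out


# Example: "a" is all None (object dtype), "b" holds real values.
df = pd.DataFrame({
    "a": pd.Series([None, None], dtype=object),
    "b": pd.Series(["x", "y"], dtype=object),
})
fixed = cast_all_null_columns(df)
print(fixed.dtypes)
```

After this step, fixed.to_orc('a.orc') should behave like the float64 example above, since the all-null column now carries a concrete dtype. Columns that contain at least one real value are left unchanged.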