First SO Question. I hope this is descriptive enough.
Pandas 0.25, Oracle 11g
I have a dataframe read from a csv. It contains a mix of numeric, string and date data.
I force data types within the dataframe using `.astype(str)`, `.astype(int)` and `.to_datetime`.
I then create a dtype dictionary to select the data types I want.
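For context, here is a minimal sketch of the setup described above. The column names and sample data are hypothetical, and I'm importing the SQLAlchemy types explicitly (`NUMBER` comes from the Oracle dialect; the others from `sqlalchemy.types`):

```python
import pandas as pd
from sqlalchemy.types import INTEGER, VARCHAR, DATE
from sqlalchemy.dialects.oracle import NUMBER

# Hypothetical data standing in for the CSV
df = pd.DataFrame({
    "key": [1, 2, 3],
    "pref1": [10, None, 30],          # numeric column with some nulls -> NaN -> float64
    "comment": [None, None, None],    # all-null string column
    "created": ["2019-01-01", "2019-02-01", None],
})

# Force the dataframe dtypes
df["key"] = df["key"].astype(int)
df["comment"] = df["comment"].astype(str)   # beware: this turns None into the string 'None'
df["created"] = pd.to_datetime(df["created"])

# dtype dictionary passed to to_sql to select the Oracle column types
dtype = {
    "key": NUMBER,
    "pref1": INTEGER,
    "comment": VARCHAR(300),
    "created": DATE,
}

# df.to_sql("my_table", engine, if_exists="append", dtype=dtype, index=False)
```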
When there are some nulls in the numeric columns, `types.NUMBER` and `types.INTEGER` create a `FLOAT` in the Oracle table. It should be `NUMBER(38,0)`, especially if I use `types.INTEGER`. The key column, which is defined as `types.NUMBER` and contains all non-null integers, is created as a `NUMBER(38,0)` as expected.
When there are columns that are entirely null, but have had `.astype(str)` applied and a dtype of `types.VARCHAR(300)`, the columns are also created as `FLOAT` in Oracle.
I need to use `if_exists='append'` in `to_sql` as the table collects history, so I can't wait for the VARCHAR columns to receive data. Though I have been using `if_exists='replace'` during testing to ensure the table is dropped and recreated.
Is there a way to resolve these issues, where nulls in the data cause the wrong data type to be selected? I shouldn't need to use a blank (i.e. '') for strings and 0 for integers; I need nulls to come through as nulls.
Nulls in date columns work, even when the entire column is null, and create a `DATE` in Oracle as requested.
EDIT: The string-to-VARCHAR issue was actually caused by a trapped and incorrectly handled exception. The numbers were still an issue that had to be handled separately; I will add an answer with the solution.
The issue with numeric fields containing a mix of null and non-null values was due to pandas using NaN for null, and numpy treating NaN as a float. `.astype(int)` doesn't handle NaNs and actually raises an exception on them (which a try block had caught and handled incorrectly in my case).
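You can reproduce the exception with a small example (the values here are illustrative):

```python
import pandas as pd
import numpy as np

# Nulls come through as NaN, so the column dtype is float64
s = pd.Series([10, np.nan, 30])

try:
    s.astype(int)               # raises: NaN cannot be represented as an int
except ValueError as exc:
    print(type(exc).__name__)   # ValueError
```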
The solution is: `df['pref1'] = df['pref1'].astype('Int64')`
Note that `'Int64'` needs the capitalised 'I'; `.astype('int64')` also doesn't work.
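`'Int64'` (capital I) is pandas' nullable integer extension dtype, which stores integers alongside a missing-value marker instead of falling back to float. A quick demonstration with hypothetical values:

```python
import pandas as pd

s = pd.Series([10, None, 30])   # float64 with NaN at this point
s = s.astype("Int64")           # capital I: pandas nullable integer dtype

print(s.dtype)                  # Int64
print(s.isna().sum())           # 1 -- the null survives as <NA>, values stay integers
```

With the column held as `Int64`, the null reaches Oracle as a null and the `types.INTEGER`/`types.NUMBER` mapping produces `NUMBER(38,0)` rather than `FLOAT`.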