One problem with a pandas DataFrame is that it infers its structure (column names and dtypes) from the data it is given. Hence, the no-row case is awkward to represent.
For example, suppose I have a function that returns a list of records represented as dictionaries: get_data() -> list[dict[str, Any]]
and I want to have a function that returns a DataFrame of the same data:
def get_dataframe() -> pd.DataFrame:
    l = get_data()
    df = pd.DataFrame(l)
    return df
This works well except when len(l) == 0, because pandas needs at least one record to infer the column names and dtypes. Returning None in this case is not great, because you would likely need to write a lot of if/else checks downstream to handle the zero-record case. Ideally, the function would return an empty DataFrame with the correct columns and dtypes, so that the downstream code needs no special treatment for the no-record case. But that is tedious to do: in get_dataframe(), I need to specify the columns and dtypes to create an empty DataFrame, even though that information is already specified somewhere else. It is tedious to specify the same things twice.
One idea to remove the redundancy is to represent the raw data as a list of dataclass instances instead of a list of dicts, which lets me annotate the type of each field. I can then use the annotations to derive the column dtypes. This is not ideal either, because type annotations are optional, and the mapping from Python types to pandas dtypes is not one-to-one.
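The dataclass idea could be sketched roughly as follows. The Record class, the PY_TO_DTYPE mapping, and the empty_frame helper are hypothetical names introduced only for illustration, and the mapping itself is an assumption (for example, int could just as well map to the nullable "Int64" dtype):

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

import pandas as pd


# Hypothetical record type; the field names are illustrative only.
@dataclass
class Record:
    name: str
    age: int
    score: float


# Assumed mapping from Python annotations to pandas dtypes.
# This is a design choice, not a rule: the mapping is not one-to-one.
PY_TO_DTYPE = {str: "object", int: "int64", float: "float64", bool: "bool"}


def empty_frame(record_type) -> pd.DataFrame:
    """Build a zero-row DataFrame whose columns mirror the dataclass fields."""
    hints = get_type_hints(record_type)  # resolves string annotations too
    return pd.DataFrame(
        {
            f.name: pd.Series(dtype=PY_TO_DTYPE[hints[f.name]])
            for f in fields(record_type)
        }
    )


df = empty_frame(Record)
print(df.dtypes)
```

A field whose annotation is missing from the mapping raises a KeyError here, which at least makes the gap visible instead of silently guessing a dtype.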
I wonder how the no-data case is usually handled.
You can initialize an empty pandas DataFrame with typed columns like this:
df = pd.DataFrame({
    'Name': pd.Series(dtype='str'),
    'Age': pd.Series(dtype='int'),
    'Salary': pd.Series(dtype='float'),
    'Date': pd.Series(dtype='datetime64[ns]')
})
This creates an empty DataFrame with the specified dtype for each column. Is this what you were looking for?
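As a quick check of why this helps downstream, here is a minimal sketch (the filter and the aggregation are made-up examples, not part of the question) showing that ordinary operations run on the typed empty frame without any if/else:

```python
import pandas as pd

# Same construction as above, reduced to three columns for brevity.
df = pd.DataFrame({
    "Name": pd.Series(dtype="str"),
    "Age": pd.Series(dtype="int"),
    "Salary": pd.Series(dtype="float"),
})

# Typical downstream steps: a boolean filter and an aggregation.
adults = df[df["Age"] >= 18]
total_salary = adults["Salary"].sum()

print(len(adults), total_salary)
```

The filter returns another zero-row frame with the same columns, and the sum of an empty float column is 0.0, so the calling code does not need to special-case the empty result.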
Building on that, you can also define a schema once and apply it in both the empty and the non-empty case:
import pandas as pd
from typing import Any, Dict

def get_dataframe_schema() -> Dict[str, Any]:
    # Single place where column names and dtypes are declared
    return {
        'name': str,
        'age': int,
        'score': float,
    }

def get_dataframe() -> pd.DataFrame:
    schema = get_dataframe_schema()
    l = get_data()  # get_data() is the function from the question
    if not l:
        # No records: create the columns from the schema, then apply the dtypes
        return pd.DataFrame(columns=list(schema)).astype(schema)
    df = pd.DataFrame(l)
    return df.astype(schema)
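One possible simplification, offered as a sketch rather than the only way to do it: passing columns= to the DataFrame constructor fixes the column set even for an empty list, so both cases can go through a single code path. SCHEMA and frame_from_records are hypothetical names, and the sketch assumes every record carries the schema's keys with no missing values (astype(int) would fail on NaN):

```python
import pandas as pd

# Hypothetical schema: column name -> dtype accepted by DataFrame.astype.
SCHEMA = {"name": str, "age": int, "score": float}


def frame_from_records(records: list[dict]) -> pd.DataFrame:
    # columns= fixes the column set even when records is empty;
    # astype then applies the declared dtypes in both cases.
    return pd.DataFrame(records, columns=list(SCHEMA)).astype(SCHEMA)


empty = frame_from_records([])
full = frame_from_records([{"name": "Ann", "age": 30, "score": 1.5}])

print(empty.dtypes)
print(full.dtypes)
```

Both frames end up with the same columns and dtypes, which is the property the downstream code relies on.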