pandas dataframe

Is there any idiomatic way to return an empty pandas DataFrame when there is no data?


One problem with a pandas DataFrame is that it infers its structure (column names and dtypes) from the data it is given. Hence, the no-row case is awkward to represent.

For example, suppose I have a function that returns a list of records represented as dictionaries, get_data() -> list[dict[str, Any]], and I want a function that returns the same data as a DataFrame:

def get_dataframe() -> pd.DataFrame:
    l = get_data()
    df = pd.DataFrame(l)
    return df

This works well except when len(l) == 0, because pandas needs at least one record to infer the column names and column types. Returning None in this case is not great, because downstream code would likely need a lot of if/else statements to handle the zero-record case. Ideally, the function would return an empty DataFrame with the correct columns and column types, so that downstream code needs no special treatment for the no-record case. But this is tedious to do, because:

  1. In get_dataframe(), I need to specify the column names and column types to create an empty DataFrame, but that information is already specified somewhere else. It is tedious to specify the same things twice.
  2. Because the same information is specified twice, the two copies may become inconsistent, so I would need to add code to check consistency.
  3. Believe it or not, the DataFrame constructor does not take a list of dtypes; its dtype argument accepts a single dtype for the whole frame. There are workarounds to specify a type for each column, but they are not convenient.
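
To make the problem concrete, here is a small sketch of what happens with an empty list. The column names and the sample record are made up for illustration; they are not part of any real data source.

```python
import pandas as pd

records = [{"name": "Ann", "age": 31}]  # hypothetical sample record

full = pd.DataFrame(records)
empty = pd.DataFrame([])  # what get_dataframe() builds when there is no data

print(full.shape)   # one row, two columns
print(empty.shape)  # no rows and no columns: the structure is lost

# Downstream code that assumes the columns exist fails on the empty frame.
try:
    empty["age"].mean()
except KeyError as exc:
    print("KeyError:", exc)

# The constructor's dtype argument applies one dtype to every column,
# so it cannot express "name is a string, age is an integer".
uniform = pd.DataFrame(columns=["name", "age"], dtype="float64")
print(uniform.dtypes)
```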

One idea to remove the redundancy is to represent the raw data as a list of dataclass instances instead of a list of dicts, which allows me to annotate the type of each field. I can then use the annotations to derive the column types. This is not ideal either, because type annotations are optional and the mapping from Python types to dtypes is not one-to-one.
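
A rough sketch of that dataclass idea might look like the following. The Person class, the PY_TO_DTYPE mapping, and the frame_from_records helper are all hypothetical names introduced here, and the mapping is exactly the part that is not one-to-one: it has to be chosen by hand.

```python
import dataclasses
import pandas as pd

# Assumed mapping from Python annotations to pandas dtypes; extend as needed.
PY_TO_DTYPE = {str: "object", int: "int64", float: "float64", bool: "bool"}

@dataclasses.dataclass
class Person:  # hypothetical record type
    name: str
    age: int
    score: float

def frame_from_records(records, record_type):
    """Build a DataFrame whose columns and dtypes come from the dataclass fields."""
    # f.type is the annotation itself here; with
    # `from __future__ import annotations` it would be a string instead.
    schema = {f.name: PY_TO_DTYPE[f.type] for f in dataclasses.fields(record_type)}
    rows = [dataclasses.asdict(r) for r in records]
    # columns= preserves the structure even when rows is empty.
    return pd.DataFrame(rows, columns=list(schema)).astype(schema)

empty = frame_from_records([], Person)
print(empty.dtypes)
```

A lookup failure in PY_TO_DTYPE (for example for an Optional[int] field) raises KeyError, which at least makes the gaps in the mapping visible.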

How is the no-data case usually handled?


Solution

  • You can initialize an empty pandas DataFrame like this:

    df = pd.DataFrame({
        'Name': pd.Series(dtype='str'),
        'Age': pd.Series(dtype='int'),
        'Salary': pd.Series(dtype='float'),
        'Date': pd.Series(dtype='datetime64[ns]')
    })
    

    This creates an empty DataFrame with the specified dtype for each column. Is this what you were looking for?

    Building on that, you can also define a schema once and apply it in both the empty and the non-empty case:

    import pandas as pd
    from typing import Any, Dict
    
    def get_dataframe_schema() -> Dict[str, Any]:
        # Single place where column names and dtypes are defined.
        return {
            'name': str,
            'age': int,
            'score': float
        }
    
    def get_dataframe() -> pd.DataFrame:
        schema = get_dataframe_schema()
        l = get_data()  # get_data() is the function from the question
        if not l:
            # No records: create the columns from the schema, then cast them.
            return pd.DataFrame(columns=list(schema)).astype(schema)
        df = pd.DataFrame(l)
        return df.astype(schema)
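
As a quick check of the schema approach, the sketch below stubs out get_data() (the stub, its empty flag, and the sample record are assumptions made for illustration) and compares the empty and non-empty results. It also passes columns= in both cases, which is one possible way to avoid the separate branch for the empty list.

```python
import pandas as pd

SCHEMA = {"name": str, "age": int, "score": float}

def get_data(empty):
    # Hypothetical stand-in for the real data source.
    return [] if empty else [{"name": "Ann", "age": 31, "score": 4.5}]

def get_dataframe(empty):
    # Passing columns= keeps the structure even when the list is empty,
    # so a single code path covers both cases.
    return pd.DataFrame(get_data(empty), columns=list(SCHEMA)).astype(SCHEMA)

no_rows = get_dataframe(empty=True)
one_row = get_dataframe(empty=False)

# Both frames expose the same columns, whatever the row count.
print(list(no_rows.columns) == list(one_row.columns))

# Downstream code runs unchanged on the empty frame.
print(no_rows["age"].sum())
print(len(no_rows[no_rows["score"] > 3]))
```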