python pandas dataframe pandasql modin

How to append a Modin pandas dataframe to another?


I am running calculations on large files, around 6 GB each, and came across Modin pandas, which I have heard is better optimized than pandas.

I need to read a CSV file in chunks, perform calculations on each chunk, append the result to a big dataframe, and then write the big dataframe back out as a CSV file.

This works fine with pandas, but it takes too long even on small files, and I can't imagine running it on 6 GB files.

However, when I try the same thing with Modin pandas, it fails to append each dataframe to the big dataframe that I want to convert to a CSV file.

Can anyone suggest a solution or an alternative?

Python - 3.6
Pandas - 0.24.2
Modin Pandas - 0.5.2

Code.

import modin.pandas as pd

def calculate_visit_prioritization(df):
    # calculations here
    return df

def get_all_data():
    big_df = pd.DataFrame()
    for df in pd.read_csv('./samp.csv', chunksize=50):
        big_df = big_df.append(calculate_visit_prioritization(df))
    big_df.to_csv('samps3.csv', index=False)

def main():
    get_all_data()

if __name__ == '__main__':
    main()

Error when using Modin pandas to append dataframes.

UserWarning: DataFrame.append for empty DataFrame defaulting to pandas implementation.

File "/home/tony/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 289, in __init__ raise TypeError(msg)

TypeError: cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

I have gone through this link, which says that the pandas .append() function is only partially (P) implemented in Modin.


Solution

  • Modin is meant to be a drop-in replacement for pandas, so the only difference should be the import statement. To concatenate multiple DataFrames, collect them in a list and call pd.concat once instead of calling append N times: each append copies the data accumulated so far, so a single concat performs better.

    df_list = []
    for df in pd.read_csv('./samp.csv', chunksize=50):
        df_list.append(calculate_visit_prioritization(df))
    
    big_df = pd.concat(df_list, ignore_index=True)
    big_df.to_csv('samps3.csv', index=False)
    

    This should gracefully handle empty sub-DataFrames as well.
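
If the combined result is too large to hold in memory, another option is to skip the big dataframe and append each processed chunk straight to the output file. The following is only a sketch under some assumptions: it uses plain pandas rather than Modin, it assumes the calculation can be done on each chunk independently, and process_in_chunks is a hypothetical helper name, not something from either library.

```python
import pandas as pd  # assumption: plain pandas for the chunked read


def calculate_visit_prioritization(df):
    # placeholder for the real calculations (must work on one chunk at a time)
    return df


def process_in_chunks(src_path, dst_path, chunksize=50):
    """Hypothetical helper: process src_path chunk by chunk and
    append each result to dst_path without building one big dataframe."""
    rows_written = 0
    for i, chunk in enumerate(pd.read_csv(src_path, chunksize=chunksize)):
        result = calculate_visit_prioritization(chunk)
        result.to_csv(
            dst_path,
            mode="w" if i == 0 else "a",  # overwrite on first chunk, append afterwards
            header=(i == 0),              # write the header row only once
            index=False,
        )
        rows_written += len(result)
    return rows_written
```

Calling process_in_chunks('./samp.csv', 'samps3.csv') would then replace get_all_data(). Only one chunk is in memory at a time, so a larger chunksize than 50 is probably worth trying for 6 GB files.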