I am performing calculations on large files (around 6 GB each) and came across Modin pandas, which I've heard is optimized compared to pandas.
I need to read a CSV file in chunks, perform calculations on each chunk, append the results to a big DataFrame, and finally write that big DataFrame back out as a CSV file.
This works absolutely fine with Pandas, but it takes too much time even for small files, and I can't imagine running it on 6 GB files.
However, when I try the same thing with Modin pandas, it is unable to append a chunk to the big DataFrame that I want to write out as CSV.
Can anyone suggest a solution or an alternative?
Python - 3.6
Pandas - 0.24.2
Modin Pandas - 0.5.2
Code:
import modin.pandas as pd

def calculate_visit_prioritization(df):
    # calculations here
    return df

def get_all_data():
    big_df = pd.DataFrame()
    for df in pd.read_csv('./samp.csv', chunksize=50):
        big_df = big_df.append(calculate_visit_prioritization(df))
    big_df.to_csv('samps3.csv', index=False)

def main():
    get_all_data()

if __name__ == '__main__':
    main()
Error when using Modin pandas to append DataFrames:

UserWarning: `DataFrame.append` for empty DataFrame defaulting to pandas implementation.
File "/home/tony/.local/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 289, in __init__
    raise TypeError(msg)
TypeError: cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I have gone through this link, where it says that pandas's .append() function is only partially (P) implemented in Modin.
Modin's selling point is that the only difference from pandas should be the import statement. To concatenate multiple DataFrames, collect them in a list and make a single pd.concat call instead of N append calls; this is also faster, because each append call copies all the data accumulated so far.
df_list = []
for df in pd.read_csv('./samp.csv', chunksize=50):
    df_list.append(calculate_visit_prioritization(df))
big_df = pd.concat(df_list, ignore_index=True)
big_df.to_csv('samps3.csv', index=False)
This should gracefully handle empty sub-DataFrames as well.
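If even the concatenated result is too large to hold in memory for a 6 GB input, another option is to skip building a big DataFrame entirely and write each processed chunk straight to the output file. This is a sketch with plain pandas (the same pattern should work with Modin's read_csv); it assumes the calculation only needs one chunk at a time, and the function and file names are placeholders:

```python
import pandas as pd

def calculate_visit_prioritization(df):
    # placeholder for the real per-chunk calculation (assumed row-wise)
    return df

def stream_to_csv(in_path, out_path, chunksize=50):
    # Write the header only with the first chunk, then append the rest
    # without headers, so the full result never has to fit in memory.
    first = True
    for chunk in pd.read_csv(in_path, chunksize=chunksize):
        out = calculate_visit_prioritization(chunk)
        out.to_csv(out_path, mode='w' if first else 'a',
                   header=first, index=False)
        first = False
```

With this approach, memory use stays bounded by the chunk size regardless of how large the input file is.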