pythonexcelpandascsvglob

How can I combine multiple CSV files into one Excel worksheet using Python?


I am trying to combine multiple CSV files into an Excel worksheet so I can analyse my data.

So far I have been able to create the worksheet with the names of the CSV files as the tabs, but in the sheets themselves, I can't get the data I want.

I am using read.csv from pandas, but from what I can see and have been able to get working, the syntax only gives me the option to read the data of one CSV file and it then copies that data to all of the tabs which is not what I want.

import pandas as pd 
import os 

os.chdir('C:/Users/user3482176/correct_document_location) 
csv_files = ['doc1.csv', 'doc2.csv', 'doc3.csv'] 
reading = pd.read_csv('C:/Users/user3482176/correct_document_location/doc1.csv') 
with pd.ExcelWriter('combined_csv_files.xlsx', engine='openpyxl') as writer:
    for csv_file in csv_files:
        sheet_name = csv_file.split('.')[0]
        df.to_excel(writer,sheet_name=sheet_name, index=False)
    if_sheet_exists:'overlay'

I have seen that you can use the glob package to get the csv files and then display the data using that but I get an array error. I believe this is because my data has columns of different sizes, which they are meant to be, so not sure how to get around this.

import pandas as pd 
import os 
import glob 

os.chdir('C:/Users/user3482176/correct_document_location) 
csv_files = ['doc1.csv', 'doc2.csv', 'doc3.csv'] 
xlsx_files = glob.glob(os.path.join(directory_path, '*.csv')) 
df_list = [pd.read_csv(file) for file in xlsx_files] 
combined_df = pd.concat(df_list,join='outer', ignore_index=True) 
df = pd.DataFrame(df_list) 

with pd.ExcelWriter('combined_csv_files.xlsx', engine='openpyxl') as writer:
    for csv_file in csv_files:
        sheet_name = csv_file.split('.')[0]
        df.to_excel(writer,sheet_name=sheet_name, index=False)
    if_sheet_exists:'overlay'

Traceback (most recent call last): File "C:\Users\user3482176\PycharmProjects\combination_worksheet\collating sheets.py", line 53, in <module> df = pd.DataFrame(df_list) File "C:\Users\user3482176\PycharmProjects\combination_worksheet\venv\lib\site-packages\pandas\core\frame.py", line 867, in __init__ mgr = ndarray_to_mgr( File "C:\Users\user3482176\PycharmProjects\combination_worksheet\venv\lib\site-packages\pandas\core\internals\construction.py", line 319, in ndarray_to_mgr values = _prep_ndarraylike(values, copy=copy_on_sanitize) File "C:\Users\user3482176\PycharmProjects\combination_worksheet\venv\lib\site-packages\pandas\core\internals\construction.py", line 575, in _prep_ndarraylike values = np.array([convert(v) for v in values]) ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (39,) + inhomogeneous part. 

What I am trying to understand is how I can get around this error to get my multiple CSV files into 1 Excel Worksheet.


Solution

  • I got it working like this:

    import pandas as pd
    import os
    
    csv_files = ['doc1.csv', 'doc2.csv']
    
    excel_writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
    
    for csv_file in csv_files:
    
        df = pd.read_csv(csv_file)
    
        sheet_name = os.path.splitext(os.path.basename(csv_file))[0]
        df.to_excel(excel_writer, sheet_name=sheet_name, index=False)
    
    excel_writer.close()
    

    You could switch the xlsxwriter engine with your openpyxl

    This creates an Excel file with a sheet for each .csv file.