pythonpandaspandas-explode

Separate excel columns with different lengths into multiple rows using Python


I am attempting to use python to loop through rows in an excel spreadsheet and separate data in a row into multiple rows if the column in question has multiple newlines or carriage returns.

In the example data set below, I would be using Col 3 to determine if a row needed to be expanded. The "current data set" shown is one row in excel where Col 3 - 8 have multiple lines within the cell. This would just be one row of multiple on my spreadsheet that would need expanded, and some may not have multiple newlines or carriage returns that would require a split. Col 1 and Col 2 will always be one line and will need to be added into the new rows. There is a potential for blank spots in the data that will just be empty newline-carriage returns that need to be accounted for in the goal data set. Col 3 - 8 will have the same amount of lines and the lines will represent the new rows.

I have tried to use pandas explode function but I am left with an absurd amount of copies of some of the rows. On one attempt, turning approximately 50 rows into 1,500 rows.

Current data set                            
Col 1   Col 2   Col 3   Col 4   Col 5   Col 6   Col 7   Col 8
Num 1   Num 2   "Num 3  "Text 1 "Num 7  "Text 5 "Num 11 "Text 9
                 Num 4   Text 2  Num 8   Text 6          Text 10
                 Num 5   Text 3  Num 9   Text 7  Num 12
                 Num 6"  Text 4" Num 10" Text 8" Num 13" Text 11"
        
                                            
                            
Goal data set                           
Col 1   Col 2   Col 3   Col 4   Col 5   Col 6   Col 7   Col 8
Num 1   Num 2   Num 3   Text 1  Num 7   Text 5  Num 11  Text 9
Num 1   Num 2   Num 4   Text 2  Num 8   Text 6          Text 10
Num 1   Num 2   Num 5   Text 3  Num 9   Text 7  Num 12  
Num 1   Num 2   Num 6   Text 4  Num 10  Text 8  Num 13  Text 11
data = pd.read_excel(f'Output\\spreadSheet.xlsx')
columns = data.columns.tolist()
for column in columns:
    if column == "Col 3":
        continue

    for idx, status in enumerate(data[column]):
        try:
            if '\n' in status:
                    data[column] = data[column].str.split('\n \n')
                    data = data.explode(column)
        except:
            continue

data.to_excel(f'Output\\spreadSheet.xlsx')

Solution

  • I have created simple example to reproduce:

    df = pd.DataFrame( [[1, 2, 'A\n2\n1\n3\n7', 
                         'A\n9\n9\n9\n9', 
                         'A\n2\nV\n4\n1', 
                         'A\n2\n1\n\n7', 
                         'A\n3\n1\n3\n3', 
                         'A\n\n\nB\nB']], columns =[f'Col {i}' for i in range(1,9)])
    df
    
    Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
    0 1 2 A\n2\n1\n3\n7 A\n9\n9\n9\n9 A\n2\nV\n4\n1 A\n2\n1\n\n7 A\n3\n1\n3\n3 A\n\n\nB\nB

    I think that it's easier to create second data frame than modify presented one. Namely,

    df_rep = pd.DataFrame(index = range(len(df['Col 3'][0].split('\n'))), #number of indexes defined by col3
    columns =[f'Col {i}' for i in range(1,9)])
    df_rep
    
        
    
    Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
    0 NaN NaN NaN NaN NaN NaN NaN NaN
    1 NaN NaN NaN NaN NaN NaN NaN NaN
    2 NaN NaN NaN NaN NaN NaN NaN NaN
    3 NaN NaN NaN NaN NaN NaN NaN NaN
    4 NaN NaN NaN NaN NaN NaN NaN NaN

    Then you can use following code:

    df_rep['Col 1'] = df.iat[0,0]
    df_rep['Col 2'] = df.iat[0,1]
    for i in range(3,9):
        df_rep['Col {}'.format(i)] = df['Col {}'.format(i)][0].split('\n')
    df_rep
    
    Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
    0 1 2 A A A A A A
    1 1 2 2 9 2 2 3
    2 1 2 1 9 V 1 1
    3 1 2 3 9 4 3 B
    4 1 2 7 9 1 7 3 B

    Update: multiple rows

    For the multi-row data frame:

    df = pd.DataFrame( [[1, 2, 'A\n2\n1\n3\n7', 
                         'A\n9\n9\n9\n9', 
                         'A\n2\nV\n4\n1', 
                         'A\n2\n1\n\n7', 
                         'A\n3\n1\n3\n3', 
                         'A\n\n\nB\nB'], [2, 3, 'B\n2\n1\n3\n7', 
                         'B\n9\n9\n9\n9', 
                         'B\n2\nV\n4\n1', 
                         'B\n2\n1\n\n7', 
                         'B\n3\n1\n3\n3', 
                         'B\n\n\nB\nB']], columns =[f'Col {i}' for i in range(1,9)])
    
    Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
    0 1 2 A\n2\n1\n3\n7 A\n9\n9\n9\n9 A\n2\nV\n4\n1 A\n2\n1\n\n7 A\n3\n1\n3\n3
    1 2 3 B\n2\n1\n3\n7 B\n9\n9\n9\n9 B\n2\nV\n4\n1 B\n2\n1\n\n7 B\n3\n1\n3\n3

    We can use following code:

    df_rep = pd.DataFrame(columns =[f'Col {i}' for i in range(1,9)])
    
    for i, row in enumerate(df.index):
        single_rep = pd.DataFrame(index = range(len(df['Col 3'][i].split('\n'))), columns =[f'Col {i}' for i in range(1,9)]) #number of indexes defined by col3
        single_rep['Col 1'] = df.iat[i,0]
        single_rep['Col 2'] = df.iat[i,1]
        for j in range(3,9):
            single_rep['Col {}'.format(j)] = df['Col {}'.format(j)][i].split('\n')
        df_rep = pd.concat([df_rep, single_rep])
    df_rep.reset_index(drop=True, inplace=True)
    

    df_rep stores all information about the loop, whereas single_rep is created independently for each row. It's overwritten inside the loop as well as df_rep by adding new rows. As output, we have:

    Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
    0 1 2 A A A A A
    1 1 2 2 9 2 2 3
    2 1 2 1 9 V 1 1
    3 1 2 3 9 4 3
    4 1 2 7 9 1 7 3
    5 2 3 B B B B B
    6 2 3 2 9 2 2 3
    7 2 3 1 9 V 1 1
    8 2 3 3 9 4 3
    9 2 3 7 9 1 7 3