I am attempting to use python to loop through rows in an excel spreadsheet and separate data in a row into multiple rows if the column in question has multiple newlines or carriage returns.
In the example data set below, I would be using Col 3 to determine if a row needed to be expanded. The "current data set" shown is one row in excel where Col 3 - 8 have multiple lines within the cell. This would just be one row of multiple on my spreadsheet that would need expanded, and some may not have multiple newlines or carriage returns that would require a split. Col 1 and Col 2 will always be one line and will need to be added into the new rows. There is a potential for blank spots in the data that will just be empty newline-carriage returns that need to be accounted for in the goal data set. Col 3 - 8 will have the same amount of lines and the lines will represent the new rows.
I have tried to use pandas explode function but I am left with an absurd amount of copies of some of the rows. On one attempt, turning approximately 50 rows into 1,500 rows.
Current data set
Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
Num 1 Num 2 "Num 3 "Text 1 "Num 7 "Text 5 "Num 11 "Text 9
Num 4 Text 2 Num 8 Text 6 Text 10
Num 5 Text 3 Num 9 Text 7 Num 12
Num 6" Text 4" Num 10" Text 8" Num 13" Text 11"
Goal data set
Col 1 Col 2 Col 3 Col 4 Col 5 Col 6 Col 7 Col 8
Num 1 Num 2 Num 3 Text 1 Num 7 Text 5 Num 11 Text 9
Num 1 Num 2 Num 4 Text 2 Num 8 Text 6 Text 10
Num 1 Num 2 Num 5 Text 3 Num 9 Text 7 Num 12
Num 1 Num 2 Num 6 Text 4 Num 10 Text 8 Num 13 Text 11
data = pd.read_excel(f'Output\\spreadSheet.xlsx')
columns = data.columns.tolist()
for column in columns:
if column == "Col 3":
continue
for idx, status in enumerate(data[column]):
try:
if '\n' in status:
data[column] = data[column].str.split('\n \n')
data = data.explode(column)
except:
continue
data.to_excel(f'Output\\spreadSheet.xlsx')
I have created simple example to reproduce:
df = pd.DataFrame( [[1, 2, 'A\n2\n1\n3\n7',
'A\n9\n9\n9\n9',
'A\n2\nV\n4\n1',
'A\n2\n1\n\n7',
'A\n3\n1\n3\n3',
'A\n\n\nB\nB']], columns =[f'Col {i}' for i in range(1,9)])
df
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7 | Col 8 | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | A\n2\n1\n3\n7 | A\n9\n9\n9\n9 | A\n2\nV\n4\n1 | A\n2\n1\n\n7 | A\n3\n1\n3\n3 | A\n\n\nB\nB |
I think that it's easier to create second data frame than modify presented one. Namely,
df_rep = pd.DataFrame(index = range(len(df['Col 3'][0].split('\n'))), #number of indexes defined by col3
columns =[f'Col {i}' for i in range(1,9)])
df_rep
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7 | Col 8 | |
---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Then you can use following code:
df_rep['Col 1'] = df.iat[0,0]
df_rep['Col 2'] = df.iat[0,1]
for i in range(3,9):
df_rep['Col {}'.format(i)] = df['Col {}'.format(i)][0].split('\n')
df_rep
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7 | Col 8 | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | A | A | A | A | A | A |
1 | 1 | 2 | 2 | 9 | 2 | 2 | 3 | |
2 | 1 | 2 | 1 | 9 | V | 1 | 1 | |
3 | 1 | 2 | 3 | 9 | 4 | 3 | B | |
4 | 1 | 2 | 7 | 9 | 1 | 7 | 3 | B |
Update: multiple rows
For the multi-row data frame:
df = pd.DataFrame( [[1, 2, 'A\n2\n1\n3\n7',
'A\n9\n9\n9\n9',
'A\n2\nV\n4\n1',
'A\n2\n1\n\n7',
'A\n3\n1\n3\n3',
'A\n\n\nB\nB'], [2, 3, 'B\n2\n1\n3\n7',
'B\n9\n9\n9\n9',
'B\n2\nV\n4\n1',
'B\n2\n1\n\n7',
'B\n3\n1\n3\n3',
'B\n\n\nB\nB']], columns =[f'Col {i}' for i in range(1,9)])
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7 | Col 8 |
---|---|---|---|---|---|---|---|
0 | 1 | 2 | A\n2\n1\n3\n7 | A\n9\n9\n9\n9 | A\n2\nV\n4\n1 | A\n2\n1\n\n7 | A\n3\n1\n3\n3 |
1 | 2 | 3 | B\n2\n1\n3\n7 | B\n9\n9\n9\n9 | B\n2\nV\n4\n1 | B\n2\n1\n\n7 | B\n3\n1\n3\n3 |
We can use following code:
df_rep = pd.DataFrame(columns =[f'Col {i}' for i in range(1,9)])
for i, row in enumerate(df.index):
single_rep = pd.DataFrame(index = range(len(df['Col 3'][i].split('\n'))), columns =[f'Col {i}' for i in range(1,9)]) #number of indexes defined by col3
single_rep['Col 1'] = df.iat[i,0]
single_rep['Col 2'] = df.iat[i,1]
for j in range(3,9):
single_rep['Col {}'.format(j)] = df['Col {}'.format(j)][i].split('\n')
df_rep = pd.concat([df_rep, single_rep])
df_rep.reset_index(drop=True, inplace=True)
df_rep
stores all information about the loop, whereas single_rep
is created independently for each row. It's overwritten inside the loop as well as df_rep by adding new rows. As output, we have:
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7 | Col 8 |
---|---|---|---|---|---|---|---|
0 | 1 | 2 | A | A | A | A | A |
1 | 1 | 2 | 2 | 9 | 2 | 2 | 3 |
2 | 1 | 2 | 1 | 9 | V | 1 | 1 |
3 | 1 | 2 | 3 | 9 | 4 | 3 | |
4 | 1 | 2 | 7 | 9 | 1 | 7 | 3 |
5 | 2 | 3 | B | B | B | B | B |
6 | 2 | 3 | 2 | 9 | 2 | 2 | 3 |
7 | 2 | 3 | 1 | 9 | V | 1 | 1 |
8 | 2 | 3 | 3 | 9 | 4 | 3 | |
9 | 2 | 3 | 7 | 9 | 1 | 7 | 3 |