Suppose I have a dataframe Old with columns A, B, and C. I want a new dataframe New with two columns, D and E. For each cell in Old, I want a corresponding row in New where D holds the cell's value and E holds the name of the column the cell was in.
I know that iterating directly over a dataframe is bad, but that's how I did it. Here, I only cared about some of the column names in the Old dataframe, so if a cell wasn't under a column I cared about, I just assigned it the label "other". But the principle is the same.
    entries, labels = [], []
    for column in df.columns:
        for entry in df[column]:
            entries.append(entry)
            # Assign label based on column; unmapped columns get "other"
            labels.append(column_labels.get(column, "other"))
My question is what are some better ways to do this? Running this will become exceedingly slow as the dataset grows.
You must be looking for stack():
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list("ABC"))

       A   B   C
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11
    res = (df.stack()
             .reset_index(level=1)
             .sort_values(by="level_1")
             .reset_index(drop=True)
             .rename(columns={"level_1": "labels", 0: "entries"})
          )
       labels  entries
    0       A        0
    1       A        3
    2       A        6
    3       A        9
    4       B        1
    5       B        4
    6       B        7
    7       B       10
    8       C        2
    9       C        5
    10      C        8
    11      C       11
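If you also need the "other" fallback from the question, one possible variation is melt() followed by map(). This is only a sketch: the column_labels dict and its values below are made up for illustration, since the question doesn't show the real mapping. melt() already emits the values column by column, so no sort is needed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list("ABC"))

# Hypothetical mapping (not from the question): only A and B are "cared about".
column_labels = {"A": "label_a", "B": "label_b"}

# melt() with no id_vars unpivots every column: all of A, then B, then C.
res = df.melt(var_name="labels", value_name="entries")

# Translate column names to labels; names missing from the dict become NaN,
# which fillna() then replaces with "other".
res["labels"] = res["labels"].map(column_labels).fillna("other")
```

With this mapping, the rows that came from column C end up labelled "other", mirroring column_labels.get(column, "other") in the loop version.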