I have a dataframe, with duplicated index, which I'm trying to reindex according to rowname of another dataframe. However, due to the duplicated index problem it's telling me cannot reindex on an axis with duplicate labels
when I do df2 = df2.reindex(df.index)
.
Below is what my df2 looks like
index | column1 | column2 |
---|---|---|
A | sample_a | 1 |
B | sample_a | 3 |
B | sample_b | 2 |
C | sample_c | 3 |
So I want to make my index unique by adding prefix to it by values from column1. If column1 is sample_a, I want to make index as a_index; if column1 is sample_b, I want to change index as b_index, to change my df2 as below:
index | column1 | column2 |
---|---|---|
a_A | sample_a | 1 |
a_B | sample_a | 3 |
b_B | sample_b | 2 |
c_C | sample_c | 3 |
I do see ways of adding prefix to index but I'm not sure how to selectively add different prefix?
Thank you!
If your other DataFrame also has both "index" and "column1", why not use both as index?
Alternatively, you can de-duplicate "index" using str.extract
to get the id from "column1" that is after the underscore:
df['index'] = (df['column1']
.str.extract('_([^_]+)$', expand=False)
.add('_'+df['index'])
)
Updated dataframe:
index column1 column2
0 a_A sample_a 1
1 a_B sample_a 3
2 b_B sample_b 2
3 c_C sample_c 3
regex:
_ # match underscore
([^_]+) # capture non-underscore characters
$ # match end of line