I have two dataframes as follows:
leader
:
0 11
1 8
2 5
3 9
4 8
5 6
[6065 rows x 2 columns]
DatasetLabel
:
0 1 .... 7 8 9 10 11 12
0 A J .... 1 2 5 NaN NaN NaN
1 B K .... 3 4 NaN NaN NaN NaN
[4095 rows x 14 columns]
The Information dataset column names 0 to 6 are DatasetLabel
about data and 7 to 12 are indexes that refer to the first column of leader
Dataframe.
I want to create dataset where instead of the indexes in DatasetLabel
dataframe, I have the value of each index from the leader
dataframe, which is leader.iloc[index,1]
.
How can I do it using python features?
The output should look like:
DatasetLabel
:
0 1 .... 7 8 9 10 11 12
0 A J .... 8 5 6 NaN NaN NaN
1 B K .... 9 8 NaN NaN NaN NaN
I have come up with the following, but I get an error:
for column in DatasetLabel.ix[:, 8:13]:
DatasetLabel[DatasetLabel[column].notnull()] = leader.iloc[DatasetLabel[DatasetLabel[column].notnull()][column].values, 1]
Error:
ValueError: Must have equal len keys and value when setting with an iterable
You can use apply
to index into leader
and exchange values with DatasetLabel
, although it's not very pretty.
One issue is that Pandas won't let us index with NaN
. Converting to str
provides a workaround. But that creates a second issue, namely, column 9
is of type float
(because NaN
is float
), so 5
becomes 5.0
. Once it's a string, that's "5.0"
, which will fail to match the index values in leader
. We can remove the .0
, and then this solution will work - but it's a bit of a hack.
With DatasetLabel
as:
Unnamed:0 0 1 7 8 9 10 11 12
0 0 A J 1 2 5.0 NaN NaN NaN
1 1 B K 3 4 NaN NaN NaN NaN
And leader
as:
0 1
0 0 11
1 1 8
2 2 5
3 3 9
4 4 8
5 5 6
Then:
cols = ["7","8","9","10","11","12"]
updated = DatasetLabel[cols].apply(
lambda x: leader.loc[x.astype(str).str.split(".").str[0], 1].values, axis=1)
updated
7 8 9 10 11 12
0 8.0 5.0 6.0 NaN NaN NaN
1 9.0 8.0 NaN NaN NaN NaN
Now we can concat
the unmodified columns (which we'll call original
) with updated
:
original_cols = DatasetLabel.columns[~DatasetLabel.columns.isin(cols)]
original = DatasetLabel[original_cols]
pd.concat([original, updated], axis=1)
Output:
Unnamed:0 0 1 7 8 9 10 11 12
0 0 A J 8.0 5.0 6.0 NaN NaN NaN
1 1 B K 9.0 8.0 NaN NaN NaN NaN
Note: It may be clearer to use concat
here, but here's another, cleaner way of merging original
and updated
, using assign
:
DatasetLabel.assign(**updated)