I've been working on a large genomics data set that contains multiple reads of every sample (to make sure we captured the data). When analyzing it, we need to reduce each sample to a single row so we don't skew the results (e.g. counting a gene as present six times when it was really one instance read multiple times). Every row has an ID, so I used the pandas df.groupby() function on the ID. Here's a table to try and illustrate what I want to do:
# ID  | functionality | v_region_score | constant_region
# ----|---------------|----------------|----------------
# 123 | productive    | 820            | NaN
#     | unknown       | 720            | NaN
#     | unknown       | 720            | IgM
# 456 | unknown       | 690            | NaN
#     | unknown       | 670            | NaN
# 789 | productive    | 780            | IgM
#     | productive    | 780            | NaN
(Edit) Here is the code for an example dataframe:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])
And this would be the final output with correct rows chosen:
df2 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])
So after grouping, for each group: if it has a "productive" value in functionality, I want to keep that row; if every row is "unknown", I take the one with the highest v_region_score; and if there are multiple "productive" rows, I take the one that has some value in its constant_region.
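To make the rules above concrete, here is a minimal sketch of one way to encode them as a plain function applied to each group (the helper name pick_row is my own; iterating over the groups avoids the Series-truth-value problem entirely, because every comparison produces a boolean mask rather than an if on a Series):

```python
import numpy as np
import pandas as pd

def pick_row(group):
    # Prefer "productive" rows; among those, prefer one with a
    # non-null constant_region; otherwise take the highest score.
    productive = group[group["functionality"] == "productive"]
    if len(productive) > 0:
        with_region = productive[productive["constant_region"].notna()]
        if len(with_region) > 0:
            return with_region.iloc[0]
        return productive.iloc[0]
    return group.loc[group["v_region_score"].idxmax()]

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])

rows = []
for gid, group in df1.groupby("ID"):
    rows.append(pick_row(group))
result = pd.DataFrame(rows).reset_index(drop=True)
```

With the example dataframe this keeps productive/820 for 123, the 690 row for 456, and the productive/IgM row for 789.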
I've tried several ways of accessing these values:
id, frame = next(iter(df_grouped))
if frame["functionality"].equals("productive"):
    # do something
Looking at just one group:
x = df_grouped.get_group("1:1101:10897:22442")
for index, value in x["functionality"].items():
    print(value)
# returns the correct value and type "str"
Even putting each group into a list:
new_groups = []
for id, frame in df_grouped:
    new_groups.append(frame)
# accessing a specific index returns a dataframe
new_groups[30]
The error I get with all of these is "The truth value of a Series is ambiguous". I now understand why this happens, but I can't use a.any(), a.all(), or a.bool() because of how complicated the conditional is.
Is there any way I can choose a specific row in each group based on the values of its columns? Sorry for such a complicated question and thanks in advance! :)
You can approach your problem from a different angle: sort the values so that the row you want comes first within each ID group, then group by ID and take the first row of each group.
For example:
df1 = df1.sort_values(
    ['ID', 'functionality', 'v_region_score', 'constant_region'],
    ascending=[True, True, False, True], na_position='last')
df1.groupby('ID').first().reset_index()
Out[0]:
ID functionality v_region_score constant_region
0 123 productive 820 IgM
1 456 unknown 690 NaN
2 789 productive 780 IgM
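Note that GroupBy.first() returns the first non-null value in each column, which is why ID 123 ends up with IgM here even though its "productive" row has NaN in constant_region. A small sketch illustrating that behavior:

```python
import numpy as np
import pandas as pd

g = pd.DataFrame({"k": [1, 1], "v": [np.nan, "IgM"]}).groupby("k")
first_vals = g["v"].first()  # skips the NaN and returns "IgM" for group 1
```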
Additionally, if you want to merge values from constant_region when it's null, you can use fillna(method='ffill') so you keep the value that exists:
## df1 already sorted as above
df1['constant_region'] = df1.groupby('ID')['constant_region'].fillna(method='ffill')
df1
Out[1]:
ID functionality v_region_score constant_region
4 123 productive 820 NaN
2 123 unknown 720 IgM
1 123 unknown 720 IgM
5 456 unknown 690 NaN
6 456 unknown 670 NaN
0 789 productive 780 IgM
3 789 productive 780 IgM
## Now group by as before to keep one row per ID
df1.groupby('ID').first().reset_index()
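Putting the whole recipe together, here is a self-contained sketch of the sort / forward-fill / group-by-first pipeline. I use .ffill() here, the non-deprecated equivalent of fillna(method='ffill') in recent pandas; everything else follows the steps above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])

# Step 1: sort so the preferred row comes first within each ID
# ("productive" sorts before "unknown", higher scores first,
# non-null constant_region before NaN).
df1 = df1.sort_values(
    ["ID", "functionality", "v_region_score", "constant_region"],
    ascending=[True, True, False, True], na_position="last")

# Step 2: propagate constant_region values forward within each group
df1["constant_region"] = df1.groupby("ID")["constant_region"].ffill()

# Step 3: take the first value per column for each ID
# (first() skips nulls, so a later IgM still gets picked up)
out = df1.groupby("ID").first().reset_index()
```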