I want to add aliases to consecutive occurrences of the same gene name in column gene_id. If the gene_id value is unique, it should be unchanged.
Here is my example input:
import pandas as pd

df_genes_data = {'gene_id': ['g0', 'g1', 'g1', 'g2', 'g3', 'g4', 'g4', 'g4']}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())
gene_id
0 g0
1 g1
2 g1
3 g2
4 g3
5 g4
6 g4
7 g4
and here is the desired output:
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
Any ideas on how to do this? I've been looking for solutions but found only ways to count consecutive occurrences, not to label them with aliases.
EDIT:
I've tried to find gene_id values which occur more than once in a row in my data:
rep = []
gene_list = df_genes['gene_id']
for idx in range(0, len(gene_list) - 1):
    if gene_list[idx] == gene_list[idx + 1]:
        rep.append(gene_list[idx])
rep = list(set(rep))
print("Consecutive identical gene names are : " + str(rep))
but I have no idea how to add the desired aliases to them.
Use shift + ne + cumsum to group the consecutive values, then groupby.transform('size') to identify the groups with more than one value, and groupby.cumcount to number the rows within each group and build the suffix:
# Series as name for shorter reference
s = df_genes['gene_id']
# group consecutive occurrences
group = s.ne(s.shift()).cumsum()
# form group and save as "g" for efficiency
g = s.groupby(group)
# identify groups with more than 1 value
m = g.transform('size').gt(1)
# increment values
df_genes.loc[m, 'gene_id'] += '_TE'+g.cumcount().add(1).astype(str)
Output:
gene_id
0 g0
1 g1_TE1
2 g1_TE2
3 g2
4 g3
5 g4_TE1
6 g4_TE2
7 g4_TE3
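If you need this in more than one place, the same steps can be wrapped in a small helper. This is only a sketch: the function name add_consecutive_aliases and its column and tag parameters are my own (hypothetical) names, and it returns a modified copy instead of changing df_genes in place.

```python
import pandas as pd


def add_consecutive_aliases(df, column="gene_id", tag="_TE"):
    """Return a copy of df where runs of repeated values get numbered suffixes.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out = df.copy()
    s = out[column]
    # a new run starts wherever a value differs from the previous row
    run_id = s.ne(s.shift()).cumsum()
    g = s.groupby(run_id)
    # only runs longer than one row should receive a suffix
    in_run = g.transform("size").gt(1)
    # 1-based position of each row within its run, as text
    position = g.cumcount().add(1).astype(str)
    out.loc[in_run, column] = s[in_run] + tag + position[in_run]
    return out


df_genes = pd.DataFrame(
    {"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]}
)
result = add_consecutive_aliases(df_genes)
print(result.to_string())
```

Note that only consecutive repeats are labelled: if g1 appeared again later in the column after other genes, that later run would be numbered separately, starting from 1.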
Intermediates:
  gene_id  group      m  cumcount+1 suffix
0      g0      1  False           1
1      g1      2   True           1   _TE1
2      g1      2   True           2   _TE2
3      g2      3  False           1
4      g3      4  False           1
5      g4      5   True           1   _TE1
6      g4      5   True           2   _TE2
7      g4      5   True           3   _TE3
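To inspect these intermediates yourself, you can collect them into a temporary DataFrame. A minimal sketch, starting from the original (unmodified) input; the names count, suffix and steps are just for illustration, and the empty string for rows without a suffix is my own choice:

```python
import pandas as pd

df_genes = pd.DataFrame(
    {"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]}
)
s = df_genes["gene_id"]

# run identifier: increases by one each time the value changes
group = s.ne(s.shift()).cumsum()
g = s.groupby(group)
# True for rows that belong to a run of more than one value
m = g.transform("size").gt(1)
# 1-based position of each row inside its run
count = g.cumcount().add(1)
# suffix only where m is True; empty string elsewhere (illustrative choice)
suffix = ("_TE" + count.astype(str)).where(m, "")

steps = pd.DataFrame(
    {"gene_id": s, "group": group, "m": m, "cumcount+1": count, "suffix": suffix}
)
print(steps.to_string())
```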