python pandas

How to add aliases to consecutive occurrences in a column?


I want to add aliases to consecutive occurrences of the same gene name in the gene_id column. If a gene_id value is not part of a consecutive run, it should be left unchanged.

Here is my example input:

import pandas as pd

df_genes_data = {'gene_id': ['g0', 'g1', 'g1', 'g2', 'g3', 'g4', 'g4', 'g4']}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())

  gene_id
0      g0
1      g1
2      g1
3      g2
4      g3
5      g4
6      g4
7      g4

and here is the desired output:

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3

Any ideas on how to do this? I've been looking for solutions but have found only ways to count consecutive occurrences, not to label them with aliases.

EDIT:

I've tried to find the gene_id values that occur more than once in a row in my data:

rep = []
gene_list = df_genes['gene_id']
for idx in range(0, len(gene_list) - 1):
    if gene_list[idx] == gene_list[idx + 1]:
        rep.append(gene_list[idx])
rep = list(set(rep))
print("Consecutive identical gene names are : " + str(rep))

but I have no idea how to add desired aliases to them.


Solution

  • Use shift+ne+cumsum to group the consecutive values, then groupby.transform('size') to identify the groups with more than 1 value, and groupby.cumcount to number the rows within each group:

    # Series as name for shorter reference
    s = df_genes['gene_id']
    # group consecutive occurrences
    group = s.ne(s.shift()).cumsum()
    # form group and save as "g" for efficiency
    g = s.groupby(group)
    # identify groups with more than 1 value
    m = g.transform('size').gt(1)
    # append "_TE" plus the 1-based position within the group (only where m is True)
    df_genes.loc[m, 'gene_id'] += '_TE'+g.cumcount().add(1).astype(str)


    Output:

      gene_id
    0      g0
    1  g1_TE1
    2  g1_TE2
    3      g2
    4      g3
    5  g4_TE1
    6  g4_TE2
    7  g4_TE3
    

    Intermediates:

      gene_id  group      m  cumcount+1 suffix
    0      g0      1  False           1       
    1      g1      2   True           1   _TE1
    2      g1      2   True           2   _TE2
    3      g2      3  False           1       
    4      g3      4  False           1       
    5      g4      5   True           1   _TE1
    6      g4      5   True           2   _TE2
    7      g4      5   True           3   _TE3
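
The same steps can also be wrapped in a small reusable function. This is only a sketch under a few assumptions: the name add_run_aliases and its tag parameter are made up for illustration, and Series.where is used in place of the in-place .loc update so that the input Series is left untouched.

```python
import pandas as pd


def add_run_aliases(s: pd.Series, tag: str = "_TE") -> pd.Series:
    """Suffix every value in a consecutive run of length > 1 with tag + position.

    Hypothetical helper: the name and the `tag` parameter are illustrative.
    """
    # A new run starts wherever a value differs from the previous row.
    run_id = s.ne(s.shift()).cumsum()
    runs = s.groupby(run_id)
    # True for rows that belong to a run of two or more identical values.
    in_run = runs.transform("size").gt(1)
    # 1-based position of each row within its run, as text.
    position = runs.cumcount().add(1).astype(str)
    # Keep the original value outside runs, append the suffix inside them.
    return s.where(~in_run, s + tag + position)


df_genes = pd.DataFrame(
    {"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]}
)
result = add_run_aliases(df_genes["gene_id"])
print(result.tolist())
```

Because the grouping key is built from shift/ne/cumsum rather than from the values themselves, a gene name that reappears later but not on adjacent rows (for example a, b, a) forms separate single-row groups and is left unchanged.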