pandas, list, bioinformatics, rna-seq, scanpy

Gseapy: how to get gene list used for each pathway


I am running an enrichment analysis with gseapy enrichr on a list of genes. I am using the following code:

import gseapy

enr_res = gseapy.enrichr(gene_list = glist[:5000],
                         organism = 'Mouse',
                         gene_sets = ['GO_Biological_Process_2021'],
                         description = 'pathway',
                         #cutoff = 0.5
                         )

The result looks like this:

enr_res.results.head(10)

[Image: output of enr_res.results.head(10)]

My question is: how do I get the full set of genes (the rightmost column, Genes, in the picture) used for each individual pathway?

If I try the following code, I only get the genes that are already displayed. I added some cleanup so that I end up with a list I can use for further analysis.

x = 'fatty acid beta-oxidation (GO:0006635)'

g_list = enr_res.results[enr_res.results.Term == x]['Genes'].to_string()

deliminator = ';'
g_list = [section + deliminator for section in g_list.split(deliminator) if section]

g_list = [s.replace(';', '') for s in g_list]
g_list = [s.replace(' ', '') for s in g_list]
g_list = [s.replace('.', '') for s in g_list]

first_gene = g_list[0:1]
first_gene = [sub[1 : ] for sub in first_gene]

g_list[0:1] = first_gene
for i in range(len(g_list)):
    g_list[i] = g_list[i].lower()
for i in range(len(g_list)):
    g_list[i] = g_list[i].capitalize()

g_list

I think my approach to getting all the genes might be wrong, since I only get the displayed ones. Does anybody have an idea how to get all of them?


Solution

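  • pd.set_option('display.max_colwidth', 3000)

    This raises the maximum number of characters pandas renders per cell (the default is 50). Because to_string() formats the column using the display settings, the higher limit means the Genes string is no longer cut off with "...", and that solved the problem for me. :)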
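
    As a rough sketch of why the option above helps, and of an alternative that does not depend on display settings at all: the example below uses a small made-up DataFrame as a stand-in for enr_res.results (the gene names GENE0..GENE29 are hypothetical, not real Enrichr output). It assumes the Genes column holds one ';'-separated string per term, as in the question.

```python
import pandas as pd

# Hypothetical stand-in for enr_res.results: one term with 30 made-up genes
genes = ";".join(f"GENE{i}" for i in range(30))
results = pd.DataFrame({
    "Term": ["fatty acid beta-oxidation (GO:0006635)"],
    "Genes": [genes],
})

term = "fatty acid beta-oxidation (GO:0006635)"
row = results.loc[results["Term"] == term, "Genes"]

# to_string() renders the Series for display, so a long cell is cut
# at display.max_colwidth (set explicitly to the default of 50 here)
with pd.option_context("display.max_colwidth", 50):
    truncated = row.to_string()

# Lifting the limit, only inside this block, gives the full text
with pd.option_context("display.max_colwidth", None):
    full_text = row.to_string()

# Alternative: read the cell value itself instead of its printed form,
# then split it; no index prefix or "..." to clean up afterwards
gene_list = [g.capitalize() for g in row.iloc[0].split(";")]
print(len(gene_list), gene_list[:3])
```

    Reading the value with .iloc[0] returns the stored Python string rather than its display rendering, so the result is the same whatever display.max_colwidth is set to. The capitalize() step only mirrors the lower/capitalize loops in the question and can be dropped if the original casing is needed.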