python pandas crosstab scanpy

Pandas crosstab: getting cell counts per condition


I'm not sure the title is well chosen, sorry for that. If this has already been covered, please point me to it; I couldn't find it. For an analysis I am doing, I am working in JupyterLab, mainly with scanpy. I want to count the cells that co-express certain genes within each Leiden cluster. So far I have been using the pandas crosstab function, which gives me the count for each cluster. However, I have two conditions, and I am struggling to separate the samples so that I get the cell counts for each one separately.

This is the code I use to get the total cell count, which works fine:

pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'])

This is the code where I am struggling to get the counts per sample. I know that aggfunc=','.join is not the correct way, but it illustrates the problem:

pd.crosstab(adata_proc.obs['leiden_r05'], adata_proc.obs['CoEx'], adata_proc.obs['sample'], aggfunc = ','.join)

I can get the names of the conditions into the table, but that is not what I want. I want the counts for the two conditions. How is this possible? Maybe there is a way to do this with a separate function?



Solution

  • Edit: With crosstab, add the 'CoEx' column to the index and pass 'sample' as the columns, so that each sample gets its own count column:

    pd.crosstab(index=[adata_proc.obs['leiden_r05'],adata_proc.obs['CoEx']], columns=[adata_proc.obs['sample']])
    

    Alternatively, I suggest using groupby, which returns the same counts in long format (one row per cluster, CoEx, and sample combination):

    adata_proc.obs.groupby(['leiden_r05','CoEx'])["sample"].value_counts()
    

    Another option (a bit of an abuse) is the pivot_table interface. In your case it would be:

    pd.pivot_table(adata_proc.obs, index=["leiden_r05"], columns=["CoEx", "sample"], values='barcode', aggfunc=len, fill_value=0)
    

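    *The 'values' argument is here only to reduce the number of columns, an artifact of using a method that was not designed for this. It assumes that obs has a 'barcode' column; since the values are only counted, any other column should serve the same purpose.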
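
To check the crosstab and groupby approaches above without loading an AnnData object, here is a minimal sketch on a toy table. The DataFrame `obs` and its values ("ctrl", "treated", "yes", "no") are made up for illustration and only stand in for `adata_proc.obs`; the column names follow the question.

```python
import pandas as pd

# Hypothetical stand-in for adata_proc.obs: one row per cell.
obs = pd.DataFrame(
    {
        "leiden_r05": ["0", "0", "0", "1", "1", "1"],
        "CoEx": ["yes", "yes", "no", "no", "yes", "no"],
        "sample": ["ctrl", "treated", "ctrl", "ctrl", "treated", "treated"],
    }
)

# crosstab: cluster and CoEx form the row index, one count column per sample.
by_crosstab = pd.crosstab(
    index=[obs["leiden_r05"], obs["CoEx"]],
    columns=obs["sample"],
)

# groupby: count rows per (cluster, CoEx, sample), then move 'sample'
# into the columns so the layout matches the crosstab result.
by_groupby = (
    obs.groupby(["leiden_r05", "CoEx", "sample"])
    .size()
    .unstack("sample", fill_value=0)
)

print(by_crosstab)
print(by_groupby)
```

Both tables should contain the same counts. Note that `obs` columns produced by scanpy are often categorical; in that case it may be worth passing `observed=True` to `groupby` explicitly so that only combinations that actually occur are listed.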