Tags: python, pandas, dataframe, scanpy

Unable to iterate over pandas dataframe loaded from tabular data


I can load my tabular data into a DataFrame using scanpy, but I can't figure out how to iterate over it to access selected rows and columns.

This is single-cell genomics data, where each row is a gene and each column is the expression value for a specific cell. Both rows and columns have labels. The tabular raw data looks like:

Gene_symbol Cancer--Cell_1  Cancer--Cell_10 Cancer--Cell_100
A2M.AS1 0.0 0.0 0.0
A2MP1   0.0 0.0 0.0
AADACL2 0.0 0.0 0.0
AAGAB   154.561226827488    0.0 0.0
AAR2    295.875190529996    299.455534712676    0.0
AATF    546.792205537953    323.38381204192996  0.0
AATK    0.0 0.0 0.0
AATK.AS1    0.0 0.0 0.0
ABAT    0.0 0.0 0.0

This was pretty easily converted to h5ad like this:

import pandas as pd
import scanpy as sc  # the older `scanpy.api` module is deprecated

adata = sc.read('fig1.tab', ext='txt', first_column_names=True).transpose()
adata.write('fig1.h5')

I can load it but am having trouble accessing all parts of it again. How could I, for example, select two gene rows and get all columns and their corresponding values? What if I only wanted certain columns?

Here is my attempt, with the output noted in the comments:

adata = sc.read_h5ad('fig1.h5')

# this is for the cancer dataset
selected = adata[:, adata.var_names.isin({'AAR2', 'ECT2'})]

## this line spews information on the columns like:
#  Empty DataFrameView
#  Columns: []
# Index: [Cancer--Cell_1, Cancer--Cell_10, Cancer--Cell_100, Cancer--Cell_1000, Cancer--Cell_1001
print(selected.obs)

## this line gives the row information:
# Empty DataFrameView
# Columns: []
#Index: [AAR2, ECT2]
print(selected.var)

# Nothing happens here at all
#for i, row in selected.obs.items():
#    print(i, row)

# items() is the current pandas name for iteritems()
for gene_name, row in selected.var.T.items():
    # this prints like: Series([], Name: AAR2, dtype: float64)
    print(row)

    # Nothing happens here
    for cell_name, val in row.items():
        print("{0}\t{1}\t{2}".format(gene_name, cell_name, val))

In case it's helpful, here's a Dropbox link for the fig1.h5 file


Solution

  • You're iterating the variable (gene) metadata for each gene, not the data matrix.

    Your genes don't have any metadata associated with them apart from their names, which are stored in the index of the var metadata DataFrame. What ends up in the row variable is therefore the empty metadata for an individual gene; the expression values themselves are stored in adata.X.
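
To see why the loop prints empty Series, here is a minimal pandas-only sketch. It assumes that selected.var behaves like a DataFrame with the gene names as its index and no columns, which matches the "Columns: []" output in the question. The frame below is a hypothetical stand-in, not the real object, and it uses items(), the current pandas name for iteritems().

```python
import pandas as pd

# Hypothetical stand-in for selected.var: gene names in the index, no columns.
var = pd.DataFrame(index=["AAR2", "ECT2"])
print(var.shape)  # two genes, zero metadata columns

# Transposing gives zero rows and one column per gene, so each "row"
# yielded here is an empty Series named after the gene.
lengths = {}
for gene_name, row in var.T.items():
    lengths[gene_name] = len(row)
    print(gene_name, len(row))
```

Every Series has length zero, so an inner loop over it never runs, which is the "nothing happens" behaviour described in the question.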

    From your comment I infer that you want to iterate over the matrix. You can do that like so:

    # COO format exposes the row index, column index and value of each stored entry
    cx = adata.X.tocoo()
    for cell, gene, value in zip(adata.obs_names[cx.row], adata.var_names[cx.col], cx.data):
        print(cell, gene, value)

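    This only works if adata.X is a SciPy sparse matrix, since a dense NumPy array has no tocoo() method. It also visits only the entries that are actually stored, which are typically the non-zero ones.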
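
If you don't know in advance whether the matrix is sparse or dense, something along these lines may help. This is only a sketch: iter_nonzero is a hypothetical helper name, not part of scanpy or anndata, and the small matrix uses rounded values from the question's sample data. Note that a sparse matrix can contain explicitly stored zeros, which the sparse branch would still yield.

```python
import numpy as np
from scipy import sparse


def iter_nonzero(X, obs_names, var_names):
    """Yield (cell, gene, value) for each non-zero entry of X (cells x genes)."""
    if sparse.issparse(X):
        # Sparse case: COO format lists the coordinates of every stored value.
        coo = X.tocoo()
        rows, cols, vals = coo.row, coo.col, coo.data
    else:
        # Dense case: find the non-zero coordinates with NumPy instead.
        dense = np.asarray(X)
        rows, cols = np.nonzero(dense)
        vals = dense[rows, cols]
    for r, c, v in zip(rows, cols, vals):
        yield obs_names[r], var_names[c], v


# Toy data (assumed for illustration): 2 cells x 2 genes.
cells = ["Cancer--Cell_1", "Cancer--Cell_10"]
genes = ["AAGAB", "AAR2"]
dense = np.array([[154.56, 295.88], [0.0, 299.46]])

from_dense = list(iter_nonzero(dense, cells, genes))
from_sparse = list(iter_nonzero(sparse.csr_matrix(dense), cells, genes))
for cell, gene, value in from_dense:
    print(cell, gene, value)
```

With a real object, the call would be iter_nonzero(adata.X, adata.obs_names, adata.var_names).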

    If it's dense and you really want to iterate over every value, including zeros, I'd recommend this:

    for g, gene in enumerate(adata.var_names):
        for c, cell in enumerate(adata.obs_names):
            print(cell, gene, adata.X[c, g])