Tags: python, bioinformatics, valueerror, scanpy

ValueError with file extension


I downloaded a raw data set from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92332), which contains single-cell analysis data. It comes as three different files: matrix.mtx.gz, barcodes.tsv.gz and genes.tsv.gz.

I then tried to load the data with this code:

    import scanpy as sc

    # Load data
    data_file = "/Users/---/desktop/single-cell-tutorial/latest_notebook/GSE92332_RAW"
    adata = sc.read(data_file, cache=True)
    adata = adata.transpose()
    adata.X = adata.X.toarray()

But I always get the following ValueError:

ValueError: Reading with filekey '/Users/---/desktop/single-cell-tutorial/latest_notebook/GSE92332_RAW/MTX/mtx.gv' failed, the inferred filename PosixPath('/Users/---/desktop/single-cell-tutorial/latest_notebook/GSE92332_RAW/MTX/mtx.gv.h5ad') does not exist. If you intended to provide a filename, either use a filename ending on one of the available extensions {'csv', 'data', 'tab', 'h5ad', 'anndata', 'h5', 'tsv', 'xlsx', 'loom', 'txt', 'mtx.gz', 'soft.gz', 'mtx'} or pass the parameter ext.

I understand that I need to add an extension, but whichever extension I add, I still get the same error.

I tried every listed extension that matches one of my file types (mtx.gz etc.), and I also made a separate folder containing only the MTX data and pointed the call at that, but nothing works.


Solution

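  • The scanpy.read method expects a single file and, when the path has no recognized extension, assumes .h5ad (hence the inferred .h5ad filename in the error). To load raw CellRanger MTX output, use the scanpy.read_10x_mtx method instead, which takes the folder containing the files. E.g.,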

    import scanpy as sc
    
    data_file = "path/to/GSE92332_RAW"
    adata = sc.read_10x_mtx(data_file, cache=True)
    

    As noted in the comments, the .mtx and .tsv files likely need to be decompressed first (run gzip -d *.gz from the command line while in the folder). This is a quirk of scanpy, which requires data with genes.tsv (pre-v3 CellRanger output) to be uncompressed, whereas data with features.tsv (v3+ CellRanger output) can stay gzipped. At least that's what the code shows.

    Since this data set appears to contain many runs, you may also need the prefix argument to specify which particular run you want to load.
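For anyone who would rather do the decompression and prefix lookup from Python, here is a minimal sketch of the two steps above. The helper names (gunzip_all, find_run_prefixes, load_run) are made up for illustration and are not part of scanpy. The sketch also assumes each run's files are named `<prefix>matrix.mtx` (optionally with `.gz`), which may not match how this particular GEO archive is laid out.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(folder):
    """Decompress every *.gz file in `folder`, roughly like `gzip -d *.gz`."""
    extracted = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # gzip -d also removes the compressed original
        extracted.append(out_path)
    return extracted


def find_run_prefixes(folder):
    """Return whatever precedes 'matrix.mtx' in each matrix file name.

    Assumption: files follow the '<prefix>matrix.mtx[.gz]' naming pattern.
    An unprefixed 'matrix.mtx' yields the empty string.
    """
    prefixes = set()
    for path in Path(folder).iterdir():
        for suffix in ("matrix.mtx", "matrix.mtx.gz"):
            if path.name.endswith(suffix):
                prefixes.add(path.name[: -len(suffix)])
    return sorted(prefixes)


def load_run(folder, prefix=None):
    """Load one run with scanpy's 10x reader (hypothetical wrapper).

    scanpy is imported here so the two helpers above work without it.
    """
    import scanpy as sc

    return sc.read_10x_mtx(folder, prefix=prefix, cache=True)
```

Usage would be along the lines of calling gunzip_all("path/to/GSE92332_RAW"), printing find_run_prefixes(...) to see which runs are present, and then passing one of the returned strings to load_run as prefix. Check the prefixes against the actual file names before relying on them.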