python pandas bioinformatics pyranges

How to convert chromosome names to the same format in pyranges before performing a join


I have multiple .bed files and I want to perform operations such as join and intersection on them. I am using the pyranges library to read the .bed files and perform these operations. Since .bed files allow chromosomes to be named with or without the "chr" prefix, I would like to convert the chromosome names in all the .bed files to the same format before performing the operations, so that the operations produce the expected output.

I tried:

>>> import pandas as pd
>>> import pyranges as pr
>>> df1 = pd.DataFrame({"Chromosome": ["chr1", "chr2"], "Start": [100, 200],
...                    "End": [150, 201]})
>>> py1 = pr.PyRanges(df1)
>>> df2 = pd.DataFrame({"Chromosome": ["1", "2"], "Start": [1000, 2000],
...                    "End": [1500, 20010]})
>>> py2 = pr.PyRanges(df2)
>>> def modify_chrom_series(df):
...    df.Chromosome = df.Chromosome.apply(lambda val: val.replace("chr", ""))
...    return df
>>> def fix_chrom(regions):
...    return regions.apply(modify_chrom_series)
>>> py1 = fix_chrom(py1)
>>> py1
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |       100 |       150 |
|            2 |       200 |       201 |
+--------------+-----------+-----------+
>>> py2 = fix_chrom(py2)
>>> py2

+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |      1000 |      1500 |
|            2 |      2000 |     20010 |
+--------------+-----------+-----------+

>>> py1["1"]    
Empty PyRanges
>>> py1["chr1"]
+--------------+-----------+-----------+
|   Chromosome |     Start |       End |
|   (category) |   (int32) |   (int32) |
|--------------+-----------+-----------|
|            1 |       100 |       150 |
+--------------+-----------+-----------+

>>> py1.join(py2)
Empty PyRanges

With the above code, the chromosome names are reformatted, but the chromosome keys that pyranges uses internally remain the same. Therefore, operations like join or a query such as py1["1"] do not work as expected.

Is there a way to get the desired behavior using pyranges?


Solution

  • The data in a PyRanges object are stored in multiple places. Apart from .Chromosome, there is .dfs, which is a dict. The keys of this dict are what is used when you call py1["1"].

    You also need to update the dict:

    >>> df1 = pd.DataFrame({"Chromosome": ["chr1", "chr2"], "Start": [100, 200],
    ...                    "End": [150, 201]})
    >>> py1 = pr.PyRanges(df1)
    >>> py1.dfs["1"] = py1.dfs['chr1']
    >>> del py1.dfs['chr1']
    >>> py1["1"]
    
    +--------------+-----------+-----------+
    | Chromosome   |     Start |       End |
    | (category)   |   (int32) |   (int32) |
    |--------------+-----------+-----------|
    | chr1         |       100 |       150 |
    +--------------+-----------+-----------+
    Unstranded PyRanges object has 1 rows and 3 columns from 1 chromosomes.
    For printing, the PyRanges was sorted on Chromosome.
    

    Note that the name of the chromosome did not change in the table. This is because, as stated above, the data are stored in multiple places.

    To be honest, I don't understand PyRanges deeply and I have no idea whether it is safe to update the data like this.

    I strongly suggest pre-processing your data while they are still in .bed format. This will ensure that the data are imported correctly into pyranges.

    Edit 1/8/20: The answer is based on pre-bugfix behavior and may not be needed in the future.
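
The pre-processing suggested above can be sketched as follows. This is only a minimal sketch and not part of the original answer: the helper names strip_chr_prefix and normalize_bed are hypothetical, and it assumes tab-separated BED lines and that only a leading "chr" should be removed (the code in the question removes "chr" anywhere in the name).

```python
import io


def strip_chr_prefix(chrom: str) -> str:
    """Return the chromosome name without a leading 'chr' (hypothetical helper)."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def normalize_bed(src, dst):
    """Copy BED lines from src to dst, normalizing the chromosome column.

    Assumption: fields are tab-separated and the chromosome is the first field.
    """
    for line in src:
        # Pass blank lines and header/comment lines through unchanged.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            dst.write(line)
            continue
        # Split off the first field only; the rest of the line is kept verbatim.
        chrom, sep, rest = line.partition("\t")
        dst.write(strip_chr_prefix(chrom) + sep + rest)


# In-memory example; file objects returned by open() can be used the same way.
src = io.StringIO("chr1\t100\t150\nchr2\t200\t201\n")
dst = io.StringIO()
normalize_bed(src, dst)
print(dst.getvalue(), end="")
```

Normalizing every input file the same way before loading it means the chromosome names should already agree when pyranges reads them, so nothing inside the PyRanges object has to be modified afterwards.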