Having segmented a genome into adjacent non-overlapping bins, e.g. by tileGenome
, I have computed some property by some means for each bin (say either 1 or 2).
Now I want to merge adjacent with the same property. An minimal example is illustrated below:
library(GenomicRanges)
chrSizes <- c(chr1 = 1000, chr2 = 500)
bins <- tileGenome(chrSizes, tilewidth = 200, cut.last.tile.in.chrom = T)
bins$property <- rep(1:2, each = 4)
bins
GRanges object with 8 ranges and 1 metadata column:
seqnames ranges strand | property
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 1-200 * | 1
[2] chr1 201-400 * | 1
[3] chr1 401-600 * | 1
[4] chr1 601-800 * | 1
[5] chr1 801-1000 * | 2
[6] chr2 1-200 * | 2
[7] chr2 201-400 * | 2
[8] chr2 401-500 * | 2
-------
seqinfo: 2 sequences from an unspecified genome
The first 4 bins have property 1, so should be combined into one bin.
I have looked through the GRanges
documentation and can't find an obvious native solution.
Note, seqname
boundaries have to be taken into account (e.g. chr1 and chr2 remain separated irrespective of the property)
Obviously, I one could use a loop, but I'd rather use a native GRange solutions e.g. using union
which I might have overseen.
The desired output should look something like this:
seqnames ranges strand | property
<Rle> <IRanges> <Rle> | <integer>
[1] chr1 1-800 * | 1
[2] chr1 801-1000 * | 2
[3] chr2 1-500 * | 2
R GenomicRanges:
result <- unlist(reduce(split(bins, ~property)))
result$property <- names(result)
# GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | property
# <Rle> <IRanges> <Rle> | <character>
# 1 chr1 1-800 * | 1
# 2 chr1 801-1000 * | 2
# 2 chr2 1-500 * | 2
# -------
# seqinfo: 2 sequences from an unspecified genome
Python PyRanges:
import pandas as pd
from io import StringIO
import pyranges as pr
c = """Chromosome Start End Value
chr1 1 200 Python
chr1 201 400 Python
chr1 401 600 Python
chr1 601 800 Python
chr1 801 1000 R
chr2 1 200 R
chr2 201 400 R
chr2 401 500 R"""
df = pd.read_table(StringIO(c), sep=" ")
gr = pr.PyRanges(df)
gr.merge(by="Value", slack=1)
# +--------------+-----------+-----------+------------+
# | Chromosome | Start | End | Value |
# | (category) | (int32) | (int32) | (object) |
# |--------------+-----------+-----------+------------|
# | chr1 | 1 | 800 | Python |
# | chr1 | 801 | 1000 | R |
# | chr2 | 1 | 500 | R |
# +--------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 4 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.