rbioinformaticsintervalsbioconductorgenomicranges

Merging adjacent bins with same property in GenomicRanges object


Having segmented a genome into adjacent non-overlapping bins, e.g. by tileGenome, I have computed some property by some means for each bin (say either 1 or 2).

Now I want to merge adjacent with the same property. An minimal example is illustrated below:

library(GenomicRanges)
chrSizes <- c(chr1 = 1000, chr2 = 500)
bins   <- tileGenome(chrSizes, tilewidth = 200, cut.last.tile.in.chrom = T)
bins$property <- rep(1:2, each = 4)
bins
GRanges object with 8 ranges and 1 metadata column:
      seqnames    ranges strand |  property
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1     1-200      * |         1
  [2]     chr1   201-400      * |         1
  [3]     chr1   401-600      * |         1
  [4]     chr1   601-800      * |         1
  [5]     chr1  801-1000      * |         2
  [6]     chr2     1-200      * |         2
  [7]     chr2   201-400      * |         2
  [8]     chr2   401-500      * |         2
  -------
  seqinfo: 2 sequences from an unspecified genome

The first 4 bins have property 1, so should be combined into one bin.

I have looked through the GRanges documentation and can't find an obvious native solution. Note, seqname boundaries have to be taken into account (e.g. chr1 and chr2 remain separated irrespective of the property) Obviously, I one could use a loop, but I'd rather use a native GRange solutions e.g. using union which I might have overseen.

The desired output should look something like this:

      seqnames    ranges strand |  property
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1     1-800      * |         1
  [2]     chr1  801-1000      * |         2
  [3]     chr2     1-500      * |         2

Solution

  • R GenomicRanges:

    result <- unlist(reduce(split(bins, ~property)))
    result$property <- names(result)
    
    # GRanges object with 3 ranges and 1 metadata column:
    # seqnames    ranges strand |    property
    # <Rle> <IRanges>  <Rle> | <character>
    # 1     chr1     1-800      * |           1
    # 2     chr1  801-1000      * |           2
    # 2     chr2     1-500      * |           2
    # -------
    # seqinfo: 2 sequences from an unspecified genome
    

    Python PyRanges:

    import pandas as pd
    from io import StringIO
    import pyranges as pr
    
    c = """Chromosome Start End Value
    chr1 1 200 Python
    chr1 201 400 Python
    chr1 401 600 Python
    chr1 601 800 Python
    chr1 801 1000 R
    chr2 1 200 R
    chr2 201 400 R
    chr2 401 500 R"""
    
    df = pd.read_table(StringIO(c), sep=" ")
    gr = pr.PyRanges(df)
    gr.merge(by="Value", slack=1)
    
    # +--------------+-----------+-----------+------------+
    # | Chromosome   |     Start |       End | Value      |
    # | (category)   |   (int32) |   (int32) | (object)   |
    # |--------------+-----------+-----------+------------|
    # | chr1         |         1 |       800 | Python     |
    # | chr1         |       801 |      1000 | R          |
    # | chr2         |         1 |       500 | R          |
    # +--------------+-----------+-----------+------------+
    # Unstranded PyRanges object has 3 rows and 4 columns from 2 chromosomes.
    # For printing, the PyRanges was sorted on Chromosome.