r dataframe binning

(R) Bin a numeric column to count occurrences after group by


Apologies if the title of the post is a bit confusing. Let's say I have the following data frame:

set.seed(123)
test <- data.frame(chr = rep("chr1", 30),
                   position = sample(1:50, 30, replace = FALSE),
                   info = sample(c("X", "Y"), 30, replace = TRUE),
                   condition = sample(c("soft", "stiff"), 30, replace = TRUE))

## head(test)
   chr position info condition
1 chr1       31    Y      soft
2 chr1       15    Y      soft
3 chr1       14    X      soft
4 chr1        3    X      soft
5 chr1       42    X     stiff
6 chr1       43    X     stiff

I want to bin the position column into bins of width 10. Then, for each bin and each condition (soft or stiff), I would like to count the occurrences of each value in the info column. The result should look something like this (not the actual result for the data above):

   chr start end condition count_Y count_X
1 chr1     1  10      soft       2       3
2 chr1     1  10     stiff       0       2
3 chr1    11  20      soft       2       5
4 chr1    11  20     stiff       1       2
5 chr1    21  30      soft       2       0
6 chr1    21  30     stiff       0       4

To make it easier, it is probably better to create two data frames based on condition and then apply the binning and counting, but I am stuck on this part. Any help is appreciated. Many thanks.


Solution

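  • Bin with cut or, even easier, with integer division %/% (thanks to @MrFlick for the hint), then count with dplyr::count and reshape with tidyr::pivot_wider. Note that position %/% 10 + 1 assigns exact multiples of 10 (10, 20, ...) to the next bin; use (position - 1) %/% 10 + 1 if the bins should be exactly 1-10, 11-20, and so on. With the formula as posted, you could do: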

    library(dplyr, warn.conflicts = FALSE)
    library(tidyr)
    
    test |>
      mutate(
        bin = position %/% 10 + 1,
        start = (bin - 1) * 10 + 1,
        end = bin * 10
      ) |>
      count(chr, start, end, condition, info) |>
      tidyr::pivot_wider(
        names_from = info, 
        values_from = n, 
        names_prefix = "count_",
        values_fill = 0
      )
    #> # A tibble: 9 × 6
    #>   chr   start   end condition count_X count_Y
    #>   <chr> <dbl> <dbl> <chr>       <int>   <int>
    #> 1 chr1      1    10 soft            4       0
    #> 2 chr1      1    10 stiff           2       1
    #> 3 chr1     11    20 soft            3       3
    #> 4 chr1     21    30 soft            1       1
    #> 5 chr1     21    30 stiff           3       1
    #> 6 chr1     31    40 soft            0       2
    #> 7 chr1     31    40 stiff           2       1
    #> 8 chr1     41    50 soft            0       1
    #> 9 chr1     41    50 stiff           4       1
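
To see how the binning arithmetic behaves at the bin edges, here is a minimal sketch in base R. The toy vector `position`, the `width` variable, and the names `bin_as_posted`, `bin_shifted`, `bin_cut`, and `edges` are made up for illustration and are not part of the original answer. It compares the formula used above with a variant that subtracts 1 first, and with `cut()` using right-closed breaks.

```r
# Hypothetical toy positions chosen to hit the bin edges.
position <- c(1, 9, 10, 11, 20, 50)
width <- 10

# Formula from the answer: exact multiples of `width` land in the next bin.
bin_as_posted <- position %/% width + 1

# Shifted variant: subtracting 1 first keeps 10 in 1-10, 20 in 11-20, ...
bin_shifted <- (position - 1) %/% width + 1

# cut() alternative: right-closed intervals (0,10], (10,20], ..., (40,50];
# labels = FALSE returns the integer bin index instead of a factor.
bin_cut <- cut(position, breaks = seq(0, 50, by = width), labels = FALSE)

# Side-by-side view, with start/end derived from the shifted bin index.
edges <- data.frame(
  position,
  bin_as_posted,
  bin_shifted,
  bin_cut,
  start = (bin_shifted - 1) * width + 1,
  end = bin_shifted * width
)
print(edges)
```

In this sketch, position 10 falls in bin 2 under the posted formula but in bin 1 under the shifted formula and under `cut()`. If the shifted version matches the bins you want, replacing the `bin = ...` line in the `mutate()` call should be the only change needed; the counts for the example data may then differ from the output shown above.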