rtidytable

Splitting the group which contains the bottom 25th percentile of volume into 2 rows


library(tidytable)

my_dt <- tidytable(
     price = c(4.5, 4.6, 4.7, 4.8, 5.1),
     volume = c(0.4,1.2,0.3,8.7,6.5)
)

total = my_dt %>% summarize(total = sum(volume)) %>% pull(total)

trim = total*0.25

What I would like to do is find the price value where 25% of the total volume occurs. And then splitting that row into a 'trimmed' portion and a non trimmed portion. So in this example I would end up with something that looks like this:

post_trim <- tidytable(
         price = c(4.5, 4.6, 4.7, 4.8, 4.8, 5.1),
         volume = c(0.4, 1.2, 0.3, 2.375, 6.325, 6.5),
         trim = c(True, True, True, True, False, False)
    )

The way I had initially thought of how to do this would be to iterate through the rows of the dataframe, where the rows would be returned as lists and tracking the remainder after subtracting the quantity until I find a row where the remainder is less then the volume:

remainder = trim
pivot = 0

for (row in my_dt){
    if (remainder <= row[2]){
        pivot = row[1]
        break
    } else { 
        remainder = remainder - row[2]
    }

And then from here I just delete the row entry for where price == pivot and re-enter two rows one where volume is the remainder and the other where the volume is original volume - remainder.

I know loops over rows are generally not a good idea though so I'm wondering how I would accomplish this without doing that?


Solution

  • This kind of problem is a bit easier to work with in base R:

    trim <- sum(my_dt$volume) * 0.25
    csum <- cumsum(my_dt$volume)
    # identify 25th %ile row
    trow <- min(which(csum >= trim))
    
    # duplicate 25th %ile row
    my_dt <- my_dt[c(1:trow, trow:nrow(my_dt)), ]
    # trim value, with safeguard if first row
    my_dt$volume[[trow]] <- if (trow == 1) trim else trim - csum[[trow - 1]]
    # remainder in next row
    my_dt$volume[[trow + 1]] <- csum[[trow]] - trim
    # add logical flag
    my_dt$trim <- seq(nrow(my_dt)) <= trow
    

    Result:

    # A tibble: 6 × 3
      price volume trim 
      <dbl>  <dbl> <lgl>
    1   4.5   0.4  TRUE 
    2   4.6   1.2  TRUE 
    3   4.7   0.3  TRUE 
    4   4.8   2.37 TRUE 
    5   4.8   6.32 FALSE
    6   5.1   6.5  FALSE