rgroup-bycumulative-frequency

How do I figure cumulative distribution within groups (as cume_dist does) but separate tied values in R?


Here is a data set where YEAR is a grouping variable.

dat <- data.frame(YEAR = c(rep(1999,4),rep(2002,3)), VALUE = c(1,2,3,2,1,2,3))

I would like to add a column that looks at the VALUE column and says "where, within its year, does this value sit?" I'm having trouble phrasing it concisely, but pretty much exactly what cume_dist does except that cume_dist lumps tied values and I want them separated. cume_dist takes 1,2,2,3 and gives back 0.25, 0.75, 0.75, 1.00 and I need the ties separated: 1,2,2,4 should give 0.25, 0.50, 0.75, 1.0.

Here's a line based on cume_dist that works except for the tied values:

dat %>% group_by(YEAR) %>% mutate(cumdist = cume_dist(VALUE))

I tried to deconstruct cume_dist (which is rank in group divided by size of group) and use row_number, which separates ties, for the numerator and divide it by the number of rows in each year. This gives me the correct numerator:

dat %>% group_by(YEAR) %>% mutate(rownumber = row_number(VALUE))

But how do I divide those ranks by the number of values in each year (that is, divide all the ranks in 1999 by 4, and the ranks from 2002 by 3)?


Solution

  • Is this what you are after?

    dat %>%
      mutate(cumdist = seq.int(n())[order(VALUE)] / n(), .by = YEAR)
    

    which gives

      YEAR VALUE   cumdist
    1 1999     1 0.2500000
    2 1999     2 0.5000000
    3 1999     3 1.0000000
    4 1999     2 0.7500000
    5 2002     1 0.3333333
    6 2002     2 0.6666667
    7 2002     3 1.0000000