r dplyr percentile

How to find the percentile for each observation in a data frame in R?


Suppose we have a simple data frame:

df <- structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L, 2L),
                dimnames = list(NULL, c("lo", "li")))

How can I find the percentile for each observation for both variables?


Solution

  • The most idiomatic tidyverse approach is to (i) convert the matrix to a data frame (or tibble), (ii) reshape the data into long format, (iii) group by the variable name (lo or li), and (iv) calculate the percent rank within each group.

    Here's the code:

    library(dplyr)
    library(tidyr)

    df %>%
      as_tibble() %>%                              # convert the matrix to a tibble
      gather(key = variable, value = val) %>%      # reshape into long form
      group_by(variable) %>%                       # one group per original column (lo, li)
      mutate(percentile = percent_rank(val) * 100) # percent rank within each group
    
       variable   val percentile
       <chr>    <dbl>      <dbl>
     1 lo           2         20
     2 lo           4         40
     3 lo           5         60
     4 lo           6         80
     5 lo           8        100
     6 lo           1          0
     7 li           2          0
     8 li           4         20
     9 li           6         40
    10 li          67        100
    11 li           8         60
    12 li          11         80
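
    To see where these numbers come from, here is a minimal base R sketch of the percent rank, assuming the usual definition (rank - 1) / (n - 1) with ties given the minimum rank. The helper name pct_rank is made up for illustration and is not part of dplyr.

    ```r
    # Hypothetical helper: percent rank scaled to 0-100.
    pct_rank <- function(x) {
      r <- rank(x, ties.method = "min", na.last = "keep")  # smallest value gets rank 1
      n <- sum(!is.na(x))                                  # number of non-missing values
      (r - 1) / (n - 1) * 100
    }

    lo <- c(2, 4, 5, 6, 8, 1)
    li <- c(2, 4, 6, 67, 8, 11)

    pct_rank(lo)  # 20 40 60 80 100 0
    pct_rank(li)  # 0 20 40 100 60 80
    ```

    With this definition the smallest value always gets 0 and the largest gets 100, which is why 67 in li lands at 100 even though it is far from the other values.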
    

    If you don't want to reshape the data frame into long format, just handle the two columns separately:

    df%>%
      as_tibble()%>%
      mutate(lo_pr=percent_rank(lo)*100)%>%
      mutate(li_percentile=percent_rank(li)*100)
    
    
         lo    li lo_pr li_percentile
      <dbl> <dbl> <dbl>         <dbl>
    1     2     2    20             0
    2     4     4    40            20
    3     5     6    60            40
    4     6    67    80           100
    5     8     8   100            60
    6     1    11     0            80
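
    With more than a couple of columns, writing one mutate() per column gets tedious. One possible variant, assuming dplyr 1.0.0 or later (where across() is available), applies the same calculation to every column at once. The "_percentile" suffix and the name wide are only choices made for this sketch.

    ```r
    library(dplyr)

    # Same data as in the question, stored under the assumed name `df`
    df <- structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L, 2L),
                    dimnames = list(NULL, c("lo", "li")))

    wide <- df %>%
      as_tibble() %>%
      # apply percent_rank() to every column; each result is stored in a
      # new column named after the original plus a "_percentile" suffix
      mutate(across(everything(), ~ percent_rank(.x) * 100,
                    .names = "{.col}_percentile"))

    wide
    ```

    The result keeps lo and li and adds lo_percentile and li_percentile, so the data stays in wide format.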