Tags: r, dplyr, range, percentile

How to create a variable that shows the percentile range an observation falls in


Say that I have the iris data.

I know that I can create a variable that shows the range of values of the quintile each observation falls into:

library(tidyverse)
iris %>% mutate(Range = cut(Sepal.Length, quantile(Sepal.Length, probs=c(0,.2,.4,.6,.8,1)),include.lowest=TRUE))

This produces:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   Range
1           4.3         3.0          1.1         0.1  setosa [4.3,4.6]
2           4.4         2.9          1.4         0.2  setosa [4.3,4.6]
3           4.6         3.1          1.5         0.2  setosa [4.3,4.6]
4           4.6         3.4          1.4         0.3  setosa [4.3,4.6]
5           4.7         3.2          1.3         0.2  setosa (4.6,4.8]
6           4.8         3.4          1.6         0.2  setosa (4.6,4.8]
7           4.8         3.0          1.4         0.1  setosa (4.6,4.8]
8           4.9         3.0          1.4         0.2  setosa   (4.8,5]
9           4.9         3.1          1.5         0.1  setosa   (4.8,5]
10          5.0         3.6          1.4         0.2  setosa   (4.8,5]
11          5.0         3.4          1.5         0.2  setosa   (4.8,5]
12          5.1         3.5          1.4         0.2  setosa   (5,5.4]
13          5.4         3.9          1.7         0.4  setosa   (5,5.4]
14          5.4         3.7          1.5         0.2  setosa   (5,5.4]
15          5.7         4.4          1.5         0.4  setosa (5.4,5.8]
16          5.8         4.0          1.2         0.2  setosa (5.4,5.8]

How can I also create a variable that shows the percentile range each observation falls in? I would rather not build it manually with ifelse statements; I am hoping there is a function that will create it automatically.

I am looking for something that would produce a table like this:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     Range  Percent
1           4.3         3.0          1.1         0.1  setosa [4.3,4.6]  [0,.2]
2           4.4         2.9          1.4         0.2  setosa [4.3,4.6]  [0,.2]
3           4.6         3.1          1.5         0.2  setosa [4.3,4.6]  [0,.2]
4           4.6         3.4          1.4         0.3  setosa [4.3,4.6]  [0,.2]
5           4.7         3.2          1.3         0.2  setosa (4.6,4.8]  (.2,.4]
6           4.8         3.4          1.6         0.2  setosa (4.6,4.8]  (.2,.4]
7           4.8         3.0          1.4         0.1  setosa (4.6,4.8]  (.2,.4]
8           4.9         3.0          1.4         0.2  setosa   (4.8,5]  (.4,.6]
9           4.9         3.1          1.5         0.1  setosa   (4.8,5]  (.4,.6]
10          5.0         3.6          1.4         0.2  setosa   (4.8,5]  (.4,.6]
11          5.0         3.4          1.5         0.2  setosa   (4.8,5]  (.4,.6]
12          5.1         3.5          1.4         0.2  setosa   (5,5.4]  (.6,.8]
13          5.4         3.9          1.7         0.4  setosa   (5,5.4]  (.6,.8]
14          5.4         3.7          1.5         0.2  setosa   (5,5.4]  (.6,.8]
15          5.7         4.4          1.5         0.4  setosa (5.4,5.8]  (.8,1]
16          5.8         4.0          1.2         0.2  setosa (5.4,5.8]  (.8,1]

Solution

  • Yes, the See Also section of the ?quantile help page will point you to the ecdf function, "for empirical distributions of which quantile is an inverse".

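    Note that ecdf() is a function factory: it returns a function rather than a vector of values. So we first build the empirical CDF from the data and then call that function on the same data to get each observation's cumulative proportion. We can then cut the result just as you did with the quantiles.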

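    library(dplyr)

    iris %>%
      mutate(
        # Value range of the quintile each observation falls in
        Range = cut(Sepal.Length, quantile(Sepal.Length, probs = c(0, .2, .4, .6, .8, 1)), include.lowest = TRUE),
        # ecdf(x) builds the empirical CDF; calling it on x gives each value's cumulative share
        ecdf = cut(ecdf(Sepal.Length)(Sepal.Length), breaks = c(0, .2, .4, .6, .8, 1), include.lowest = TRUE)
      )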
    
    #     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species      Range      ecdf
    # 1            5.1         3.5          1.4         0.2     setosa    (5,5.6] (0.2,0.4]
    # 2            4.9         3.0          1.4         0.2     setosa    [4.3,5]   [0,0.2]
    # 3            4.7         3.2          1.3         0.2     setosa    [4.3,5]   [0,0.2]
    # 4            4.6         3.1          1.5         0.2     setosa    [4.3,5]   [0,0.2]
    # 5            5.0         3.6          1.4         0.2     setosa    [4.3,5] (0.2,0.4]
    # 6            5.4         3.9          1.7         0.4     setosa    (5,5.6] (0.2,0.4]
    # 7            4.6         3.4          1.4         0.3     setosa    [4.3,5]   [0,0.2]
    # 8            5.0         3.4          1.5         0.2     setosa    [4.3,5] (0.2,0.4]
    # ...
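
One thing to notice in the output above: rows with Sepal.Length 5.0 sit in Range [4.3,5] but in ecdf (0.2,0.4]. That happens because ecdf() counts ties, so the cumulative share at 5.0 is a little above 0.2, while quantile() puts the 20% break exactly at 5.0. If the two columns need to agree row by row, one possible variation (a sketch, not part of the original answer) is to reuse the quantile breaks and only swap the labels. The names `probs`, `Fn`, `pct_labels`, `result` and the `Percent` column are made up for this illustration.

```r
library(dplyr)

probs <- c(0, .2, .4, .6, .8, 1)

# ecdf() returns a function; calling it gives the share of values <= its argument.
Fn <- ecdf(iris$Sepal.Length)
Fn(5.0)  # slightly above 0.2 because several observations are tied at 5.0

# Percentile labels in cut()'s own interval notation: "[0,0.2]", "(0.2,0.4]", ...
pct_labels <- levels(cut(probs, breaks = probs, include.lowest = TRUE))

result <- iris %>%
  mutate(
    Range   = cut(Sepal.Length, quantile(Sepal.Length, probs), include.lowest = TRUE),
    # Same breaks as Range, only relabelled, so both columns always pick the same bin.
    Percent = cut(Sepal.Length, quantile(Sepal.Length, probs),
                  labels = pct_labels, include.lowest = TRUE)
  )

head(result)
```

This gives up the data-driven ecdf value in exchange for guaranteed agreement with Range. The ecdf() version is still the better fit if tied values should go into the bin that their actual cumulative share indicates.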