Say that I have the iris data.
I know that I can create a variable that shows the values that fall into a certain percentile:
library(tidyverse)
iris %>% mutate(Range = cut(Sepal.Length, quantile(Sepal.Length, probs=c(0,.2,.4,.6,.8,1)),include.lowest=TRUE))
This produces:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Range
1 4.3 3.0 1.1 0.1 setosa [4.3,4.6]
2 4.4 2.9 1.4 0.2 setosa [4.3,4.6]
3 4.6 3.1 1.5 0.2 setosa [4.3,4.6]
4 4.6 3.4 1.4 0.3 setosa [4.3,4.6]
5 4.7 3.2 1.3 0.2 setosa (4.6,4.8]
6 4.8 3.4 1.6 0.2 setosa (4.6,4.8]
7 4.8 3.0 1.4 0.1 setosa (4.6,4.8]
8 4.9 3.0 1.4 0.2 setosa (4.8,5]
9 4.9 3.1 1.5 0.1 setosa (4.8,5]
10 5.0 3.6 1.4 0.2 setosa (4.8,5]
11 5.0 3.4 1.5 0.2 setosa (4.8,5]
12 5.1 3.5 1.4 0.2 setosa (5,5.4]
13 5.4 3.9 1.7 0.4 setosa (5,5.4]
14 5.4 3.7 1.5 0.2 setosa (5,5.4]
15 5.7 4.4 1.5 0.4 setosa (5.4,5.8]
16 5.8 4.0 1.2 0.2 setosa (5.4,5.8]
How can I also create another variable that shows the percentile range the observation falls in? I do not want to manually create the variable with ifelse statements, etc., but hope that there is a function that will create it automatically.
I am looking for something that would produce a table like this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Range Percent
1 4.3 3.0 1.1 0.1 setosa [4.3,4.6] [0,.2]
2 4.4 2.9 1.4 0.2 setosa [4.3,4.6] [0,.2]
3 4.6 3.1 1.5 0.2 setosa [4.3,4.6] [0,.2]
4 4.6 3.4 1.4 0.3 setosa [4.3,4.6] [0,.2]
5 4.7 3.2 1.3 0.2 setosa (4.6,4.8] (.2,.4]
6 4.8 3.4 1.6 0.2 setosa (4.6,4.8] (.2,.4]
7 4.8 3.0 1.4 0.1 setosa (4.6,4.8] (.2,.4]
8 4.9 3.0 1.4 0.2 setosa (4.8,5] (.4,.6]
9 4.9 3.1 1.5 0.1 setosa (4.8,5] (.4,.6]
10 5.0 3.6 1.4 0.2 setosa (4.8,5] (.4,.6]
11 5.0 3.4 1.5 0.2 setosa (4.8,5] (.4,.6]
12 5.1 3.5 1.4 0.2 setosa (5,5.4] (.6,.8]
13 5.4 3.9 1.7 0.4 setosa (5,5.4] (.6,.8]
14 5.4 3.7 1.5 0.2 setosa (5,5.4] (.6,.8]
15 5.7 4.4 1.5 0.4 setosa (5.4,5.8] (.8,1]
16 5.8 4.0 1.2 0.2 setosa (5.4,5.8] (.8,1]
Yes, the See Also section of the ?quantile help page will point you to the ecdf function, "for empirical distributions of which quantile is an inverse".
Interestingly, ecdf() is a function factory: it takes the data and returns a function (the empirical CDF), so we first create that function and then call it on the input. We can then cut the result just as you did with the quantiles.
iris %>%
  mutate(
    Range = cut(Sepal.Length, quantile(Sepal.Length, probs = c(0, .2, .4, .6, .8, 1)), include.lowest = TRUE),
    ecdf = cut(ecdf(Sepal.Length)(Sepal.Length), breaks = c(0, .2, .4, .6, .8, 1), include.lowest = TRUE)
  )
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Range ecdf
# 1 5.1 3.5 1.4 0.2 setosa (5,5.6] (0.2,0.4]
# 2 4.9 3.0 1.4 0.2 setosa [4.3,5] [0,0.2]
# 3 4.7 3.2 1.3 0.2 setosa [4.3,5] [0,0.2]
# 4 4.6 3.1 1.5 0.2 setosa [4.3,5] [0,0.2]
# 5 5.0 3.6 1.4 0.2 setosa [4.3,5] (0.2,0.4]
# 6 5.4 3.9 1.7 0.4 setosa (5,5.6] (0.2,0.4]
# 7 4.6 3.4 1.4 0.3 setosa [4.3,5] [0,0.2]
# 8 5.0 3.4 1.5 0.2 setosa [4.3,5] (0.2,0.4]
# ...
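Note that the two columns need not agree row by row: in the output above, a Sepal.Length of 5.0 lands in the first Range bin but in the second ecdf bin. That is because ecdf() reports the proportion of observations at or below each value, which for tied values can exceed the probability used for the break. If you want a percentile column that lines up exactly with Range, as in the table in the question, one possible sketch is to cut on the same quantile breaks a second time and only swap the labels. The helper pct_labels() is a hypothetical name made up for this example, not part of any package, and the approach assumes the quantile breaks are unique (cut() errors on duplicated breaks).

```r
library(dplyr)

probs <- c(0, .2, .4, .6, .8, 1)

# Hypothetical helper (not from any package): turn a vector of probabilities
# into interval labels such as "[0,0.2]", "(0.2,0.4]", ..., "(0.8,1]".
pct_labels <- function(p) {
  n <- length(p) - 1
  paste0(c("[", rep("(", n - 1)), p[-length(p)], ",", p[-1], "]")
}

res <- iris %>%
  mutate(
    Range   = cut(Sepal.Length, quantile(Sepal.Length, probs = probs),
                  include.lowest = TRUE),
    # Same breaks as Range, so every row gets the same bin; only the
    # labels differ (assumes the quantile breaks are unique).
    Percent = cut(Sepal.Length, quantile(Sepal.Length, probs = probs),
                  labels = pct_labels(probs), include.lowest = TRUE)
  )

head(res)
```

Which version is preferable depends on what "percentile range" should mean for tied values: the ecdf() column reflects each observation's actual empirical rank, while this relabelled column simply mirrors the Range bins.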