I have two data frames, each with the same groups. The first holds the base data; the second holds an independent set of break points for each group. I want to use those break points to divide the data into bins. Is there a way to use group_by() %>% cut(), or some similar sequence, to do this programmatically, without manually specifying the break points?
Here is a reproducible example using iris.
library(dplyr)
data("iris")
boundaries <- iris %>%
  group_by(Species) %>%
  summarize(mean = mean(Sepal.Length),
            upper = mean(Sepal.Length + 0.2),
            lower = mean(Sepal.Length - 0.2)) %>%
  data.frame()
boundaries
     Species  mean upper lower
1     setosa 5.006 5.206 4.806
2 versicolor 5.936 6.136 5.736
3  virginica 6.588 6.788 6.388
It's easy to apply a single set of break points across the data set:
boundsSample <- c(-Inf, as.numeric(boundaries[1,2:4]), Inf)
binned <- iris %>%
  mutate(bin = cut(Sepal.Length, breaks = boundsSample, labels = c('1', '2', '3', '4')))
What I can't figure out is how to do this by group, without manually extracting and specifying the break points for each group. I'd like to do something like this, but am not sure how to get the group_by() to also apply to the breaks:
binned <- iris %>%
  group_by(Species) %>%
  mutate(bin = cut(Sepal.Length, breaks = boundaries, labels = c('1', '2', '3', '4')))
I tried to work around this by joining the boundaries data frame into the main data, so that each point would be associated with its specific boundaries, but this gives me an error that the breaks are not unique:
joined <- left_join(iris, boundaries, by = "Species") %>%
  mutate(bin = cut(Sepal.Length,
                   breaks = c(-Inf, lower, mean, upper, Inf),
                   labels = c('1', '2', '3', '4')))
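Your 'joined' approach was very nearly correct. Without grouping, lower, mean and upper are full-length columns, so the break vector is full of repeated values. You can get around the problem by grouping by Species (here with .by, which needs dplyr 1.1.0 or later) and using unique() on the variables which were duplicated by the join: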
joined <- left_join(iris, boundaries, by = "Species") %>%
  mutate(bin = cut(Sepal.Length,
                   breaks = c(-Inf, unique(lower), unique(mean), unique(upper), Inf),
                   labels = c('1', '2', '3', '4')),
         .by = Species)
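To see why the grouping matters, here is a small check, offered as a sketch rather than part of the fix itself. It assumes dplyr 1.1.0 or later for `.by`; the names `all_breaks` and `per_group` are introduced here only for illustration. Ungrouped, the three boundary columns contribute 150 values each to the break vector; within each species, every boundary column holds a single distinct value, which is what `unique()` extracts.

```r
library(dplyr)

# Per-species boundaries, built as in the question.
boundaries <- iris %>%
  group_by(Species) %>%
  summarize(mean = mean(Sepal.Length),
            upper = mean(Sepal.Length + 0.2),
            lower = mean(Sepal.Length - 0.2)) %>%
  data.frame()

joined <- left_join(iris, boundaries, by = "Species")

# Ungrouped: the break vector is built from whole columns,
# so it has 1 + 3 * 150 + 1 = 452 entries, most of them duplicates.
all_breaks <- c(-Inf, joined$lower, joined$mean, joined$upper, Inf)
length(all_breaks)
anyDuplicated(all_breaks) > 0

# Grouped: count the distinct boundary values within each species.
per_group <- joined %>%
  summarize(n_lower = n_distinct(lower),
            n_mean  = n_distinct(mean),
            n_upper = n_distinct(upper),
            .by = Species)
per_group
```

Each of the `n_*` columns should come out as 1 for every species, which is why the grouped `unique()` calls yield exactly three finite break points per group.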
You could get a similar result without joining, by looking up each group's row in boundaries:
binned <- iris %>%
  group_by(Species) %>%
  mutate(bin = cut(Sepal.Length,
                   # as.numeric() turns the one-row data frame slice into a plain numeric vector
                   breaks = c(-Inf, as.numeric(boundaries[boundaries$Species == unique(Species), 2:4]), Inf),
                   labels = c('1', '2', '3', '4')))
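A variation on the same idea, as a hedged sketch: build the break vectors once as a named list and index that list by the current group. `breaks_by_species` is a name made up for this example, and the code assumes dplyr 1.0.0 or later for `cur_group()`.

```r
library(dplyr)

# Per-species boundaries, built as in the question.
boundaries <- iris %>%
  group_by(Species) %>%
  summarize(mean = mean(Sepal.Length),
            upper = mean(Sepal.Length + 0.2),
            lower = mean(Sepal.Length - 0.2)) %>%
  data.frame()

# One break vector per species, named by species (hypothetical helper object).
breaks_by_species <- split(boundaries[, c("lower", "mean", "upper")],
                           boundaries$Species)
breaks_by_species <- lapply(breaks_by_species,
                            function(b) c(-Inf, sort(unlist(b)), Inf))

binned <- iris %>%
  group_by(Species) %>%
  mutate(bin = cut(Sepal.Length,
                   # cur_group() is a one-row tibble holding the current group's key
                   breaks = breaks_by_species[[as.character(cur_group()$Species)]],
                   labels = c("1", "2", "3", "4"))) %>%
  ungroup()

# Tabulate how many observations fall into each bin per species.
count(binned, Species, bin)
```

Keeping the breaks in a list separates the lookup from the binning, which may be easier to read when the boundaries come from somewhere other than a summary of the same data.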