I'm looking for a faster way to calculate a group mean with multiple grouping variables while excluding own group values. A thought experiment would be finding average value (e.g. price) for a county from the counties in the same state in the same year excluding own county's value. Here's a toy data set.
df <- data_frame(
state = rep(c("AL", "CA"), each = 6),
county = rep(letters[1:6], each = 2),
year = rep(c(2011:2012), 6),
value = sample.int(100, 12)
)
df %>%
group_by(state, county, year) %>%
summarise(q = mean(df$value[df$state == state & df$county != county & df$year == year]))
# Groups: state, county [6]
state county year q
<chr> <chr> <int> <dbl>
1 AL a 2011 56
2 AL a 2012 46
3 AL b 2011 50.5
4 AL b 2012 52
5 AL c 2011 55.5
6 AL c 2012 29
7 CA d 2011 52.5
8 CA d 2012 32
9 CA e 2011 68.5
10 CA e 2012 31.5
11 CA f 2011 32
12 CA f 2012 42.5
The above code gives me a desired result, but when I apply this to a larger dataset (with more grouping variables) it gets really slow. Do you have any suggestion on how to speed this up?
If the original approach is incorrect, please point that out as well.
library(dplyr)
df %>%
group_by(state, year) %>%
mutate(q = (sum(value) - value) / (n()-1))
#> # A tibble: 12 x 5
#> # Groups: state, year [4]
#> state county year value q
#> <chr> <chr> <int> <int> <dbl>
#> 1 AL a 2011 68 30.5
#> 2 AL a 2012 63 42
#> 3 AL b 2011 53 38
#> 4 AL b 2012 56 45.5
#> 5 AL c 2011 8 60.5
#> 6 AL c 2012 28 59.5
#> 7 CA d 2011 7 40
#> 8 CA d 2012 69 41
#> 9 CA e 2011 39 24
#> 10 CA e 2012 79 36
#> 11 CA f 2011 41 23
#> 12 CA f 2012 3 74
Data:
#data_frame is deprecate!
df <- tibble(
state = rep(c("AL", "CA"), each = 6),
county = rep(letters[1:6], each = 2),
year = rep(c(2011:2012), 6),
value = sample.int(100, 12)
)