In Foundations of Linear and Generalized Linear Models by Alan Agresti, the author points out that there is a difference between grouped and ungrouped data for binary data modeling. The format does not matter for inference, but it does matter for goodness-of-fit. I am having difficulty converting ungrouped data to grouped data efficiently with dplyr.
library(dplyr)

# ungrouped data: one row per binary trial
x = c(rep(0,4),rep(1,4),rep(2,4))
y = c(1,0,0,0,1,1,0,0,1,1,1,1)
data = tibble(x = x, y = y)
> data
# A tibble: 12 × 2
       x     y
   <dbl> <dbl>
 1     0     1
 2     0     0
 3     0     0
 4     0     0
 5     1     1
 6     1     1
 7     1     0
 8     1     0
 9     2     1
10     2     1
11     2     1
12     2     1
Now, to get grouped data, the result should look like the following:
x ntrials nsuccesses
0       4          1
1       4          2
2       4          4
I have tried the following:
data %>%
  group_by(x,y) %>%
  tally()
      x     y     n
  <dbl> <dbl> <int>
1     0     0     3
2     0     1     1
3     1     0     2
4     1     1     2
5     2     1     4
The problem is that y is being split into separate rows for successes and failures, instead of giving one row per value of x.
You can just group by column x and then summarize based on column y:
data %>%
  group_by(x) %>%
  summarise(ntrials = n(), nsuccesses = sum(y))
# the number of successes is the sum of y if y is binary
# A tibble: 3 x 3
#      x ntrials nsuccesses
#  <dbl>   <int>      <dbl>
#1     0       4          1
#2     1       4          2
#3     2       4          4
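If it helps to see the grouped versus ungrouped distinction from the question in action, here is a minimal sketch that fits the same logistic regression to both formats. The names grouped, fit_ungrouped and fit_grouped are made up for illustration, and the linear model y ~ x is only an assumption chosen to have something to fit. The grouped fit uses the two-column cbind(successes, failures) response that glm accepts for the binomial family.

```r
library(dplyr)

# Ungrouped data from the question: one row per Bernoulli trial
data <- tibble(
  x = rep(0:2, each = 4),
  y = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1)
)

# Grouped data: one row per distinct x (same summarise call as above)
grouped <- data %>%
  group_by(x) %>%
  summarise(ntrials = n(), nsuccesses = sum(y))

# Same model on both formats (y ~ x is an illustrative assumption)
fit_ungrouped <- glm(y ~ x, family = binomial, data = data)
fit_grouped <- glm(cbind(nsuccesses, ntrials - nsuccesses) ~ x,
                   family = binomial, data = grouped)

# Coefficient estimates should agree up to numerical tolerance
coef(fit_ungrouped)
coef(fit_grouped)

# Residual deviance and its degrees of freedom depend on the format
c(deviance(fit_ungrouped), df.residual(fit_ungrouped))
c(deviance(fit_grouped), df.residual(fit_grouped))
```

The two coefficient vectors should match, while the residual deviance and its degrees of freedom differ, which is why the grouped form is the one to use when checking goodness-of-fit.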