
Grouped data in dplyr


In Foundations of Linear and Generalized Linear Models, Alan Agresti points out that there is a difference between grouped and ungrouped data when modeling binary data. The format does not matter for inference, but it does matter for goodness-of-fit. I am having difficulty converting ungrouped data to grouped data efficiently in dplyr.

# ungrouped data
library(dplyr)  # provides %>%, group_by(), and re-exports as_tibble()
x = c(rep(0,4),rep(1,4),rep(2,4))
y = c(1,0,0,0,1,1,0,0,1,1,1,1)
data = as_tibble(list(x=x,y=y))
> data
# A tibble: 12 × 2
       x     y
   <dbl> <dbl>
1      0     1
2      0     0
3      0     0
4      0     0
5      1     1
6      1     1
7      1     0
8      1     0
9      2     1
10     2     1
11     2     1
12     2     1

The grouped form of this data should look like the following:

x    ntrials   nsuccesses
0      4           1
1      4           2
2      4           4

I have tried the following:

data %>% 
  group_by(x,y) %>% 
  tally()
      x     y     n
  <dbl> <dbl> <int>
1     0     0     3
2     0     1     1
3     1     0     2
4     1     1     2
5     2     1     4

The problem is that grouping by y as well splits each value of x into separate rows for successes and failures, instead of giving one row per x with both counts.


Solution

  • You can just group by column x and then summarize based on column y:

    data %>% group_by(x) %>% summarise(ntrials = n(), nsuccesses = sum(y))
    # the number of successes is the sum of y if y is binary
    
    # A tibble: 3 x 3
    #      x ntrials nsuccesses
    #  <dbl>   <int>      <dbl>
    #1     0       4          1
    #2     1       4          2
    #3     2       4          4
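
As a minimal sketch of how the grouped table might then be used, the code below rebuilds the question's data, adds a failures column, and fits a logistic regression to both formats with base R's glm(). The names grouped, nfailures, fit_ungrouped and fit_grouped are made up for this illustration, and the model y ~ x is only an assumed example, not something taken from the question.

```r
library(dplyr)

# Ungrouped data from the question: one row per binary trial
data <- data.frame(
  x = rep(0:2, each = 4),
  y = c(1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1)
)

# Grouped data: one row per x with trial, success, and failure counts
grouped <- data %>%
  group_by(x) %>%
  summarise(ntrials = n(), nsuccesses = sum(y)) %>%
  mutate(nfailures = ntrials - nsuccesses)

# Ungrouped fit: binary response
fit_ungrouped <- glm(y ~ x, family = binomial, data = data)

# Grouped fit: two-column response of (successes, failures)
fit_grouped <- glm(cbind(nsuccesses, nfailures) ~ x,
                   family = binomial, data = grouped)

# Compare coefficient estimates and residual deviances
print(coef(fit_ungrouped))
print(coef(fit_grouped))
print(deviance(fit_ungrouped))
print(deviance(fit_grouped))
```

The two sets of coefficients should agree up to numerical tolerance, while the residual deviances differ because each format implies a different saturated model. That is the inference versus goodness-of-fit distinction mentioned in the question.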