Tags: r, dataset, aggregation, categorical-data

Aggregating Dataset to "ignore" categorical variable


I have a dataset structured like this:

Neighborhood,  var1,   var2,   COUNTRY, DAY, categ 1, categ 2
     1          700     724      AL      0      YES    YES
     1          500     200      FR      0      YES     NO
    ....
     1          701     659      IT      1      NO      YES
     1          791     669      IT      1      NO      YES
    ....
     2          239     222      GE      0      YES      NO

and so on...

So the hierarchy is "Neighborhood > DAY > COUNTRY": for every neighborhood, every day, and every country, I have an observation of var1, var2, categ1 and categ2.

I'm not interested in analyzing the country for the moment, so I want to aggregate over it: sum var1 and var2 across the COUNTRY field (the categorical variables categ1 and categ2 are not influenced by the country) and get a dataset that gives var1, var2, categ1 and categ2 for each Neighborhood and each DAY.

I'm quite new to R programming and don't know many packages (I would write a program in C++, but I'm forcing myself to learn R). Do you have any idea how to do this?

Data

df1 <- structure(list(Neighborhood = c(1L, 1L, 1L, 1L, 2L),
                      var1 = c(700L, 500L, 701L, 791L, 239L),
                      var2 = c(724L, 200L, 659L, 669L, 222L),
                      COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
                      DAY = c(0L, 0L, 1L, 1L, 0L),
                      `categ 1` = c("YES", "YES", "NO", "NO", "YES"), 
                      `categ 2` = c("YES", "NO", "YES", "YES", "NO")),
                 .Names = c("Neighborhood", "var1", "var2", "COUNTRY", "DAY", "categ 1", "categ 2"),
                 class = "data.frame", row.names = c(NA, -5L))

EDIT: @akrun

when I try your command, the result is:

aggregate(.~Neighborhood+DAY+COUNTRY, data= df1[!grepl("^categ", names(df1))], mean)

     Neighborhood, DAY, COUNTRY, var1, var2

1            1      0      AL     700  724
2            1      0      FR     500  200
3            2      0      GE     239  222
4            1      1      IT     746  664

But (in this example) what I would like to have is:

         Neighborhood, DAY,  var1, var2

1            1          0     1200  924           // where var1 = 700 + 500, ...
2            1          1     1492  1328
3            2          0     239  222
           

Solution

  • If we are not interested in the 'categ' and 'COUNTRY' columns, we can drop them with grepl and use aggregate with sum, grouping by Neighborhood and DAY

    aggregate(.~Neighborhood+DAY, data= df1[!grepl("^(categ|COUNTRY)", names(df1))], sum)
    #   Neighborhood DAY var1 var2
    #1            1   0 1200  924
    #2            2   0  239  222
    #3            1   1 1492 1328
    

    Or using dplyr

    library(dplyr)   # across() requires dplyr >= 1.0.0
    df1 %>%
       group_by(Neighborhood, DAY) %>%
       summarise(across(matches("^var"), sum), .groups = "drop")
    #  Neighborhood   DAY  var1  var2
    #1            1     0  1200   924
    #2            1     1  1492  1328
    #3            2     0   239   222
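
The question also asks to keep `categ 1` and `categ 2` in the result. A minimal base R sketch of one way to do that is below. It assumes the categorical values are constant within each Neighborhood/DAY group, as the question states, and simply takes the first value of each group. In the sample data that assumption does not fully hold (Neighborhood 1, DAY 0 has both YES and NO for `categ 2`), so picking the first value is an arbitrary choice here, not part of the original answer. The names `sums`, `cats` and `res` are only illustrative.

```r
# Sample data from the question (check.names = FALSE keeps the spaces in "categ 1")
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1 = c(700L, 500L, 701L, 791L, 239L),
  var2 = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY = c(0L, 0L, 1L, 1L, 0L),
  `categ 1` = c("YES", "YES", "NO", "NO", "YES"),
  `categ 2` = c("YES", "NO", "YES", "YES", "NO"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Sum the numeric columns over COUNTRY, per Neighborhood and DAY
sums <- aggregate(cbind(var1, var2) ~ Neighborhood + DAY, data = df1, FUN = sum)

# Assumption: the categorical columns are constant within a group,
# so the first value of each group is taken as its representative
cats <- aggregate(df1[c("categ 1", "categ 2")],
                  by = df1[c("Neighborhood", "DAY")],
                  FUN = function(x) x[1])

# Combine both summaries and sort by the grouping columns
res <- merge(sums, cats, by = c("Neighborhood", "DAY"))
res <- res[order(res$Neighborhood, res$DAY), ]
res
```

If the categorical values can differ within a group, `function(x) x[1]` could be swapped for something like `function(x) paste(unique(x), collapse = "/")` to make the disagreement visible instead of silently dropping it.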