I have a dataset structured like this:
Neighborhood, var1, var2, COUNTRY, DAY, categ 1, categ 2
1 700 724 AL 0 YES YES
1 500 200 FR 0 YES NO
....
1 701 659 IT 1 NO YES
1 791 669 IT 1 NO YES
....
2 239 222 GE 0 YES NO
and so on...
The hierarchy is "Neighborhood > DAY > COUNTRY": for every neighborhood, every day, and every country I have observations of var1, var2, categ1 and categ2.
For the moment I'm not interested in analyzing the country, so I want to aggregate over it: sum var1 and var2 across the COUNTRY field (the categorical variables categ1 and categ2 do not depend on the country) and end up with a dataset that gives var1, var2, categ1 and categ2 for each Neighborhood and each DAY.
I'm quite new to R programming and don't know many packages (I would write this in C++, but I'm forcing myself to learn R). Do you have any idea how to do this?
Data
df1 <- structure(list(Neighborhood = c(1L, 1L, 1L, 1L, 2L),
var1 = c(700L, 500L, 701L, 791L, 239L),
var2 = c(724L, 200L, 659L, 669L, 222L),
COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
DAY = c(0L, 0L, 1L, 1L, 0L),
`categ 1` = c("YES", "YES", "NO", "NO", "YES"),
`categ 2` = c("YES", "NO", "YES", "YES", "NO")),
.Names = c("Neighborhood", "var1", "var2", "COUNTRY", "DAY", "categ 1", "categ 2"),
class = "data.frame", row.names = c(NA, -5L))
EDIT: @akrun
When I try your command, this is the call and its result:
aggregate(.~Neighborhood+DAY+COUNTRY, data= df1[!grepl("^categ", names(df1))], mean)
Neighborhood, DAY, COUNTRY, var1, var2
1 1 0 AL 700 724
2 1 0 FR 500 200
3 2 0 GE 239 222
4 1 1 IT 746 664
But (in this example) what I would like to have is:
Neighborhood, DAY, var1, var2
1 1 0 1200 924 // where var1 = 700 + 500, and so on
2 1 1 1492 1328
3 2 0 239 222
If we are not interested in the 'categ' and COUNTRY columns, we can drop them with grepl and then use aggregate to sum the remaining variables by Neighborhood and DAY:
aggregate(.~Neighborhood+DAY, data= df1[!grepl("^(categ|COUNTRY)", names(df1))], sum)
# Neighborhood DAY var1 var2
#1 1 0 1200 924
#2 2 0 239 222
#3 1 1 1492 1328
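The same aggregation can also be written without the grepl step by naming the columns to sum on the left-hand side of the formula. This is only a sketch of an alternative spelling, assuming the sample data from the question; df1 is rebuilt here so the snippet runs on its own, and res is just an illustrative name.

```r
# Sample data copied from the question
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1 = c(700L, 500L, 701L, 791L, 239L),
  var2 = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY = c(0L, 0L, 1L, 1L, 0L),
  `categ 1` = c("YES", "YES", "NO", "NO", "YES"),
  `categ 2` = c("YES", "NO", "YES", "YES", "NO"),
  check.names = FALSE  # keep the space in the 'categ' column names
)

# Sum var1 and var2 within each Neighborhood/DAY pair. COUNTRY and the
# 'categ' columns are not mentioned in the formula, so they are left out.
res <- aggregate(cbind(var1, var2) ~ Neighborhood + DAY, data = df1, FUN = sum)
res
```

This should give the same sums as the grepl version above; naming the columns in cbind() is simply more explicit when only a few variables need to be summed.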
Or using dplyr:
library(dplyr)
df1 %>%
group_by(Neighborhood, DAY) %>%
summarise_each(funs(sum), matches("^var"))
# Neighborhood DAY var1 var2
# (int) (int) (int) (int)
#1 1 0 1200 924
#2 1 1 1492 1328
#3 2 0 239 222
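In current dplyr releases summarise_each() and funs() are deprecated, so the call above may emit warnings. A minimal sketch of the same grouping with summarise() and across(), assuming dplyr 1.0.0 or later and the sample data from the question (df1 is rebuilt here so the snippet is self-contained; res is an illustrative name):

```r
library(dplyr)

# Sample data copied from the question
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1 = c(700L, 500L, 701L, 791L, 239L),
  var2 = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY = c(0L, 0L, 1L, 1L, 0L),
  `categ 1` = c("YES", "YES", "NO", "NO", "YES"),
  `categ 2` = c("YES", "NO", "YES", "YES", "NO"),
  check.names = FALSE  # keep the space in the 'categ' column names
)

res <- df1 %>%
  group_by(Neighborhood, DAY) %>%
  # across() applies sum to every column whose name starts with "var";
  # .groups = "drop" returns an ungrouped result
  summarise(across(matches("^var"), sum), .groups = "drop")
res
```

The result should contain one row per Neighborhood/DAY pair with the same sums as shown above.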