I am working with a huge data frame without column names that I need to aggregate into a specific format. The original data frame has 4915 observations of 1140 variables. I need to aggregate the dataset to a sum of every five variables in the set and drop the initial columns.
I created a small sample dataset that looks similar to my data
v1.x <- c("AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "ARM", "ABW", "AUS", "AUT", "AZE", "BHS", "BHR", "BGD", "BRB", "BLR", "BEL", "BLZ", "BEN")
v2.x <- c("Agriculture","Fishing","Mining and Quarrying","Food & Beverages","Textiles and Wearing Apparel","Wood and Paper",
"Petroleum, Chemical and Non-Metallic Mineral Products", "Metal Products", "Electrical and Machinery", "Transport Equipment",
"Other Manufacturing","Recycling" ,"Electricity, Gas and Water" ,"Construction" ,"Maintenance and Repair" ,"Wholesale Trade",
"Retail Trade", "Hotels and Restraurants", "Transport","Post and Telecommunications")
v1.y <- c(1:20)
v2.y <- c(12:31)
v3 <- c(5:24)
v4 <- c(2:21)
v5 <- c(1:20)
v6 <- c(7:26)
v7 <- c(5:24)
v8 <- c(1:20)
v9 <- c(2:21)
v10 <- c(4:23)
v11 <- c(2:21)
v12 <- c(3:22)
v13 <- c(6:25)
v14 <- c(3:22)
v15 <- c(8:27)
df <- data.frame(v1.x, v2.x, v1.y, v2.y, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15)
I retrieved the results for the first batch of rows by applying rowSum()
df1 <- df %>%
mutate(row_sum=rowSums(
select(., "v1.y", "v2.y", "v3", "v4", "v5"))) %>%
mutate(row_sum2=rowSums(
select(., "v6", "v7", "v8", "v8", "v10"))) %>%
mutate(row_sum3=rowSums(
select(., "v11", "v12", "v13", "v14", "v15"))) %>%
select(v1.x, v2.x, "row_sum", "row_sum2", "row_sum3")
the results
v1.x v2.x row_sum row_sum2 row_sum3
1 AFG Agriculture 21 17 22
2 ALB Fishing 26 21 27
3 DZA Mining and Quarrying 31 25 32
4 AND Food & Beverages 36 29 37
5 AGO Textiles and Wearing Apparel 41 33 42
6 ATG Wood and Paper 46 37 47```
However, because the original data frame contains more than 1000 variables it would be very cumbersum to apply mutate() or rowmsum() for every batch of variables.
We could use split.default
to split up every 5 numeric columns and get the rowwise sum with rowSums
and bind with the original data
library(dplyr)
library(purrr)
library(stringr)
df %>%
select(where(is.numeric)) %>%
split.default(as.integer(gl(ncol(.), 5, ncol(.)))) %>%
map_dfc(rowSums, na.rm = TRUE) %>%
rename_with(~ str_c("row_sum", .x)) %>%
bind_cols(df %>%
select(where(negate(is.numeric))), .)
-output
v1.x v2.x row_sum1 row_sum2 row_sum3
1 AFG Agriculture 21 19 22
2 ALB Fishing 26 24 27
3 DZA Mining and Quarrying 31 29 32
4 AND Food & Beverages 36 34 37
5 AGO Textiles and Wearing Apparel 41 39 42
6 ATG Wood and Paper 46 44 47
7 ARG Petroleum, Chemical and Non-Metallic Mineral Products 51 49 52
8 ARM Metal Products 56 54 57
9 ABW Electrical and Machinery 61 59 62
10 AUS Transport Equipment 66 64 67
11 AUT Other Manufacturing 71 69 72
12 AZE Recycling 76 74 77
13 BHS Electricity, Gas and Water 81 79 82
14 BHR Construction 86 84 87
15 BGD Maintenance and Repair 91 89 92
16 BRB Wholesale Trade 96 94 97
17 BLR Retail Trade 101 99 102
18 BEL Hotels and Restraurants 106 104 107
19 BLZ Transport 111 109 112
20 BEN Post and Telecommunications 116 114 117