I have multiple Excel files per year that I need to read in. Each file has multiple spreadsheets that are relevant for my analysis and I'd like to keep only those. The problem is that the first 3 rows are headers and sub-headers that I need collapsed into a single column name. I have been unable to find a solution so far since the sheets may be of varying column lengths and I need to capture all the data. I would like to combine all the sheets/dfs in a file to one dataframe as well.
My goals are:
This is how I would like the final output for the data. I have attached a sample spreadsheet.
subheading1_id | subheading1_name | subheading1_... | TeamA1_subheading1_a1 | etc_etc_etc |
---|---|---|---|---|
1 | blue | etc | 500 | blah |
2 | orange | etc | 700 | blah |
3 | purple | etc | 900 | blah |
Sample code - 1st attempt, works great as long as the columns for each spreadsheet are of the same length
library(tidyverse)
library(readxl)
path <- "C:/Example Spreadsheet.xlsx"
#read the sheets and only keep the cost share sheets
sheets <- excel_sheets(path)[grep("Data of interest",excel_sheets(path) )] # simplified to a single line
#read the data, as a list
excel_data <- lapply(sheets, read_excel, path = path, skip=0)
#convert to DF
temp_df <- excel_data %>%
bind_rows()
2nd attempt - this gets the appropriate output but, I end up having to repeat this same code over and over for each individual df
#Loop below converts each list element to a df
for (i in 1:length(excel_data)) {
assign(paste0("group_", i), as.data.frame(excel_data[[i]]))
}
#extract each row element
names1 <- str_remove(names(group_1), "All fields .+|[.]+[:digit:]+") %>% na_if("")
names2 <- str_remove(group_1[1,], "[.]+[:digit:]+") %>% na_if("")
names3 <- str_remove(group_1[2,], "[.]+[:digit:]+") %>% na_if("")
#combine each row element
temp_names <- tibble(n1 = c(names1, rep(NA, length(names3) - length(names1))),
n2 = c(names2, rep(NA, length(names3) - length(names2))),
n3 = names3) %>%
fill(n1, n2) %>%
replace_na(list(n1 = "")) %>%
mutate(full_name = paste(n1,n2,n3, sep = "_"))
#add the full name
temp_names <- temp_names$full_name
#convert to df and combine elements
names(group_1) <- temp_names[1:ncol(group_1)]
group_1 <- group_1[-c(1,2),]
Update: To add a little more context, the source data has a bunch of headers with empty data fields that don’t actually align with data that has been provided. The updated code restricts the column names to represent only those fields that would have data in them. The code below subsets valid columns based on the number of data columns with information. I updated the images as well to better show the problem.
Working answer This solves the problem and trims the excess data for the real dataset.
library(tidyverse)
library(openxlsx)
file <- "Example Spreadsheet.xlsx"
read_all_sheets = function(xlsxFile, ...) {
sheet_names = openxlsx::getSheetNames(xlsxFile)
sheet_names = sheet_names[grepl("Data of interest", sheet_names)]
sheet_list = as.list(rep(NA, length(sheet_names)))
names(sheet_list) = sheet_names
for (sn in sheet_names) {
sheet_list[[sn]] = openxlsx::read.xlsx(xlsxFile, sheet=sn, startRow =4, skipEmptyCols = FALSE, colNames = FALSE, ...)
}
return(sheet_list)
}
read_all_headers = function(xlsxFile, ...) {
sheet_names = openxlsx::getSheetNames(xlsxFile)
sheet_names = sheet_names[grepl("Data of interest", sheet_names)]
sheet_list = as.list(rep(NA, length(sheet_names)))
names(sheet_list) = sheet_names
for (sn in sheet_names) {
sheet_list[[sn]] = openxlsx::read.xlsx(xlsxFile, sheet=sn, rows = 1:3, skipEmptyCols = FALSE, skipEmptyRows = FALSE, colNames = FALSE, fillMergedCells = TRUE, ...)
}
return(sheet_list)
}
test <- read_all_sheets(file)
test2 <- read_all_headers(file)
col_new <- map(test2, ~ .x %>%
t() %>%
as_tibble() %>%
replace(is.na(.), '') %>%
mutate(col_name = paste(`1`, `2`, `3`, sep='_')) %>%
select(col_name))
for (i in 1:length(test)) {
cols <- list()
cols[[i]] <- col_new[[i]]$col_name[1:ncol(test[[i]])]
col_names <- cols[[i]]
colnames(test[[i]]) <- col_names
}
clean_names <- map_df(test, ~ .x %>%
rename_with(.fn = ~ str_replace_all(.x, "^_", ""), .x, .col = starts_with("_")) %>%
mutate(across(everything(), ~ as.character(.))))