rdplyrreadxl

Update: Automating manual Excel read and clean for multiple spreadsheets with headers and subheaders in R; prefer to use dplyr to address messy data


I have multiple Excel files per year that I need to read in. Each file has multiple spreadsheets that are relevant for my analysis and I'd like to keep only those. The problem is that the first 3 rows are headers and sub-headers that I need collapsed into a single column name. I have been unable to find a solution so far since the sheets may be of varying column lengths and I need to capture all the data. I would like to combine all the sheets/dfs in a file to one dataframe as well.

My goals are:

This is how I would like the final output for the data. I have attached a sample spreadsheet.

subheading1_id subheading1_name subheading1_... TeamA1_subheading1_a1 etc_etc_etc
1 blue etc 500 blah
2 orange etc 700 blah
3 purple etc 900 blah

Sample code - 1st attempt, works great as long as the columns for each spreadsheet are of the same length

library(tidyverse)
library(readxl)
path <- "C:/Example Spreadsheet.xlsx"
#read the sheets and only keep the cost share sheets
sheets <- excel_sheets(path)[grep("Data of interest",excel_sheets(path) )] # simplified to a single    line
#read the data, as a list
excel_data <- lapply(sheets, read_excel, path = path, skip=0)

#convert to DF
temp_df <- excel_data %>%
 bind_rows()

2nd attempt - this gets the appropriate output but, I end up having to repeat this same code over and over for each individual df

#Loop below converts each list element to a df
for (i in 1:length(excel_data)) {
assign(paste0("group_", i), as.data.frame(excel_data[[i]]))
}

#extract each row element 
names1 <- str_remove(names(group_1), "All fields .+|[.]+[:digit:]+") %>% na_if("")
names2 <- str_remove(group_1[1,], "[.]+[:digit:]+") %>% na_if("")
names3 <- str_remove(group_1[2,], "[.]+[:digit:]+") %>% na_if("") 
#combine each row element
temp_names <- tibble(n1 = c(names1, rep(NA, length(names3) - length(names1))), 
                     n2 = c(names2, rep(NA, length(names3) - length(names2))), 
                     n3 = names3) %>%
  fill(n1, n2) %>%
  replace_na(list(n1 = "")) %>%
  mutate(full_name = paste(n1,n2,n3, sep = "_"))

#add the full name
temp_names <- temp_names$full_name
#convert to df and combine elements
names(group_1) <- temp_names[1:ncol(group_1)]

group_1 <- group_1[-c(1,2),]

Update: To add a little more context, the source data has a bunch of headers with empty data fields that don’t actually align with data that has been provided. The updated code restricts the column names to represent only those fields that would have data in them. The code below subsets valid columns based on the number of data columns with information. I updated the images as well to better show the problem.

sample file sample file 2


Solution

  • Working answer This solves the problem and trims the excess data for the real dataset.

    library(tidyverse)
    library(openxlsx)
    
    file <- "Example Spreadsheet.xlsx"
    
    read_all_sheets = function(xlsxFile, ...) {
      sheet_names = openxlsx::getSheetNames(xlsxFile)
      sheet_names = sheet_names[grepl("Data of interest", sheet_names)]
      sheet_list = as.list(rep(NA, length(sheet_names)))
      names(sheet_list) = sheet_names
      for (sn in sheet_names) {
        sheet_list[[sn]] = openxlsx::read.xlsx(xlsxFile, sheet=sn, startRow =4, skipEmptyCols = FALSE, colNames = FALSE, ...)
      }
      return(sheet_list)
    }
    
    read_all_headers = function(xlsxFile, ...) {
      sheet_names = openxlsx::getSheetNames(xlsxFile)
      sheet_names = sheet_names[grepl("Data of interest", sheet_names)]
      sheet_list = as.list(rep(NA, length(sheet_names)))
      names(sheet_list) = sheet_names
      for (sn in sheet_names) {
        sheet_list[[sn]] = openxlsx::read.xlsx(xlsxFile, sheet=sn, rows = 1:3, skipEmptyCols = FALSE, skipEmptyRows = FALSE, colNames = FALSE, fillMergedCells = TRUE, ...)
      }
      return(sheet_list)
    }
    
    test <- read_all_sheets(file)
    
    test2 <- read_all_headers(file)
    
    col_new <- map(test2, ~ .x %>%
      t() %>%
      as_tibble() %>%
      replace(is.na(.), '') %>%
      mutate(col_name = paste(`1`, `2`, `3`, sep='_')) %>%
        select(col_name))
    
    
    for (i in 1:length(test)) {
      cols <- list()
      cols[[i]] <- col_new[[i]]$col_name[1:ncol(test[[i]])] 
      col_names <- cols[[i]]
      colnames(test[[i]]) <- col_names
    }
    
    clean_names <- map_df(test, ~ .x %>%
                            rename_with(.fn = ~ str_replace_all(.x, "^_", ""), .x, .col = starts_with("_")) %>%
                            mutate(across(everything(), ~ as.character(.))))