rxlsxpdf-scraping

Naming mutliple xlsx files with TRUE of FALSE if character string is present in a particular sheet


This code reads a xlsx file and creates individualy named files based on sheet number and a value found at a particular location (in this case temp[2,1]). However because each file and sheet is slightly different the names are inconsistant.

sheet_to_read <- c(11,12,13,14)
for( excelsheet in files) {
for (sheet in sheet_to_read) {
temp <- read_excel( path = excelsheet, sheet = sheet, col_names = FALSE)
write.csv( temp, file = paste0( "./", gsub("./", "", excelsheet), temp[2,1], sheet,".csv") )
}}

I would like is a way of naming the files with TRUE or FALSE if a specific character string is present any where within the sheet, in this case 'vivax'. So in this case:

GBD2016_2_915_Boletin Epidemiologico_2016_37sheet21true.xls
GBD2016_2_915_Boletin Epidemiologico_2016_37sheet22false.xls`

Example file: https://drive.google.com/file/d/1p4HAuFl7Codine1Vvb8SzA7OHTzraaHz/view?usp=sharing


Solution

  • Since you have a tibble, and it is not known which column has which type, I created this:

    isWordInTibble <- function(word, tibble) {
      # case insensitive
      any(unlist(
        sapply(1:ncol(tibble), 
               function(i) tolower(word) %in% tolower(as.character(unlist(tibble[, i]))))))
    }
    

    It looks whether the word is in any of the column vectors - looks through all columns.

    Replace your file argument in your write.csv command by:

    file = gsub(".xls", 
                paste0(substr(temp[2, 1], 
                              1, 
                              5), # just first 5 letters
                       gsub("sheet", "", sheet), 
                       substr(tolower(as.character(isWordInTibble("vivax", tmp))),
                              1, 
                              1), # just first letter ("t" or "f") 
                       ".csv"), 
                excelsheet)
    

    Then it might work!

    I tried to shorten the names using substr(, start, end) and gsub().

    Appendix

    Since you asked how to print then only the files containing the word but not those which don't contain it:

    In your example, instead of the write.csv() command, replace it with:

    if (isWordInTibble("vivax", tmP)) {
      write.csv(temp, 
                file = gsub(".xls", 
                            paste0(substr(temp[2, 1], 
                                          1, 
                                          5), # just first 5 letters
                                   gsub("sheet", "", sheet), 
                                   substr(tolower(as.character(isWordInTibble("vivax", temp))),
                                          1, 
                                          1), # just first letter ("t" or "f") 
                                   ".csv"), 
                            excelsheet))
    }
    

    Then it prints out only if the isWordInTibble returns TRUE.