rexcelfilesplitting

How to split large Excel file into multiple Excel files using R


I'm looking for a way to split up a large Excel file into a number of smaller Excel files using R.

Specifically, there are three things I would like to do:

  1. I have a large data set consisting of information regarding students (their school, the area in which the school is located, test score A, test score B) that I would like to split up into individual files, one file per school containing all of the students attending that specific school.
  2. I would also like all of the individual Excel files to contain an image covering the first row and columns A, B, C & D of every Excel file. The image will be the same for all the schools in the data set.
  3. Lastly, I would also like the Excel files, after being created, to end up in individual folders on my desktop. The folders name would be the area in which the schools are located. An area has about 3-5 schools so the folder would contain 3-5 Excel files, 1 for each school.

My data is structured like this:

Area School Student ID Test score A Test score B
North A 134 24 31
North A 221 26 33
South B 122 22 21
South B 126 25 25

I have data covering roughly 200 schools located in 5 different areas.

Any guidance on how to do this would be greatly appreciated!


Solution

  • As some of the comments have referenced, this will be hard to solve without knowing your specific operating environment & folder structure, I solved this using Windows 10/ C drive user folder but you can customize to your system. You're going to need a folder with all the images from the school saved by the name (or the ID I created) of the school and they will all need to be the same format (JPG or PNG). Plus, you need folders created for each Area you want to output to (openxlsx can write the files but not create the folders for you). Once you have those setup, something similar to this should work for you, but I would highly recommend referring to the openxlsx documentation for more info:

    library(dplyr)
    library(openxlsx)
    
    # Load your excel file into a df
    # g0 = openxlsx::read.xlsx(<your excel file & sheet..see openxlsx documentation>)
    
    # Replace this tibble with your actual excel file, this was just for an example
    g0 = tibble(
      Area = c("North","North","North","North"),
      School = c("A","A","B","B"),
      Student_ID = c(134,221,122,126),
      test_score_a = c(24,26,22,25),
      test_score_b = c(31,33,21,25))
    
    # Need a numeric school id for the loop
    g0$school_id = as.numeric(as.factor(g0$School))
    
    # Loop through schools, filter using dplyr and create a sheet per school
    for (i in 1:n_distinct(g0$school_id)){
      g1 = g0 %>% 
      filter(school_id == i)
      
      ## Create a new workbook
      wb <- createWorkbook(as.character(g1$School))
      
      ## Add some worksheets
      addWorksheet(wb, as.character(g1$School))
      
      ## Insert images
      ## I left the image as a direct path for example but you can change to a
      ## relative path once you get it working
      img <- system.file("C:","Users","your name","Documents","A","A.jpg", package = "openxlsx")
      insertImage(wb, as.character(g1$School), img, startRow = 5, startCol = 3, width = 6, height = 5)
    
      ## Save workbook
      saveWorkbook(wb, paste0("C://Users//your name//Documents//",g0$Area,"//",g0$school,".xlsx"), overwrite = TRUE)
    }