Tags: r, loops, csv, large-files

Split large file in R into smaller files with a loop


I have a CSV file with 12,626,756 rows that I need to split into smaller files so a colleague can open them in Excel. I want to create a loop that splits the file into pieces that fit within Excel's row limit and exports each one as a CSV file until it reaches the end of the data (it should produce 13 files).

#STEP 1: load data
data <- read.csv(".../Desktop/Data/file.csv", header = TRUE)

#STEP 2: count rows
totalrows <- nrow(data)

#STEP 3: determine how many splits you need
excelrowlimit <- 1048576 - 5   # leave a little headroom under Excel's 1,048,576-row limit
filesrequired <- ceiling(totalrows / excelrowlimit)

For example:

csvfile 1 should contain rows 1:1048571
csvfile 2 should contain rows 1048572:2097142
csvfile 3 should contain rows 2097143:3145713
csvfile 4 should contain rows 3145714:4194284
... and so on
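
In other words, the row range for chunk i ought to come out of the arithmetic below (just a sketch, using excelrowlimit and totalrows from Step 3):

# Sketch: row range for chunk i, using excelrowlimit and totalrows from Step 3
i <- 2                                       # e.g. the second output file
start <- (i - 1) * excelrowlimit + 1         # 1048572
end   <- min(i * excelrowlimit, totalrows)   # 2097142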

How can I write a loop that (1) splits the data into the number of files needed and (2) gives each CSV export a different file name?


Solution

  • Here's a solution expanding on my comment above. It should have a smaller memory requirement than the other approaches, since it does not require copying all or part of the original data frame into new objects.

    library(tidyverse)

    rowCount <- 1048571  # rows per output file, just under Excel's limit
    data %>%
      # assign each row to a chunk of at most rowCount rows
      mutate(Group = ceiling(row_number() / rowCount)) %>%
      group_by(Group) %>%
      # write each chunk as it is visited; .x is the chunk's data
      # (without the Group column) and .y holds the group key
      group_walk(
        function(.x, .y) {
          write.csv(.x, file = paste0("file", .y$Group, ".csv"))
        }
      )
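
    For comparison, a plain base-R version of the explicit loop the question asks for could look roughly like this. It is only a sketch, and it assumes the data, totalrows, excelrowlimit, and filesrequired objects from Steps 1-3 of the question.

    # Sketch: one CSV per chunk, with a numbered file name
    # (assumes data, totalrows, excelrowlimit, filesrequired from the question)
    for (i in seq_len(filesrequired)) {
      start <- (i - 1) * excelrowlimit + 1
      end   <- min(i * excelrowlimit, totalrows)
      chunk <- data[start:end, , drop = FALSE]
      # row.names = FALSE keeps write.csv from adding an extra index column
      write.csv(chunk, file = paste0("file", i, ".csv"), row.names = FALSE)
    }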