Tags: r, csv, matrix, dataframe, pre-allocation

Progressive appending of data from read.csv


I want to construct a data frame by reading in a csv file for each day in the month. My daily csv files contain columns of characters, doubles, and integers, all with the same number of rows. I know the maximum number of rows for any given month, and the number of columns is the same in every csv file. I loop through each day of a month with fileListing, which contains the list of csv file names (say, for January):

# pre-allocate: up to 31 daily files x 96 rows each = 2976 rows
output <- matrix(ncol = 18, nrow = 2976)
for (i in 1:length(fileListing)) {
    df <- read.csv(fileListing[i], header = FALSE, sep = ",",
                   stringsAsFactors = FALSE, row.names = NULL)
    # each df is a data frame with 96 rows and 18 columns

    # insert the ith day's 96 rows into the corresponding block of rows
    rows <- ((i - 1) * 96 + 1):(i * 96)
    for (j in 1:18) {
        output[rows, j] <- df[[j]]
    }
}

Sorry for having revised my question as I figured out part of it (duh), but should I use rbind to progressively insert data at the bottom of the data frame, or is that slow?

Thank you.

BSL


Solution

  • If the data is fairly small relative to your available memory, just read the data in and don't worry about it. After you have read in all the data and done some cleaning, save the file using save() and have your analysis scripts read in that file using load(). Separating reading/cleaning scripts from analysis scripts is a good way to reduce this problem.
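A minimal sketch of that read/clean-then-save workflow; the file names here are illustrative (two small daily files written to a temp directory stand in for your month of csv files):

```r
# Illustrative: write two small "daily" files, then read, combine, and save once.
dir <- tempdir()
files <- file.path(dir, c("day01.csv", "day02.csv"))
write.csv(data.frame(x = 1:3, y = letters[1:3]), files[1], row.names = FALSE)
write.csv(data.frame(x = 4:6, y = letters[4:6]), files[2], row.names = FALSE)

# read all files and stack them into one data frame
all_data <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
# ... cleaning steps would go here ...

# save the cleaned result once
clean_file <- file.path(dir, "month_clean.RData")
save(all_data, file = clean_file)

# the analysis script then only needs:
load(clean_file)   # restores `all_data`
```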

    A feature to speed up the reading of read.csv is to use the nrows and colClasses arguments. Since you say that you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using

    colClasses <- sapply(read.csv(file, nrows=100), class)
    

    then give the result to the colClasses argument.
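Putting the two steps together; the file below is a throwaway example written to a temp directory so the sketch is self-contained:

```r
# Illustrative daily file: 96 rows, mixed column types
df <- data.frame(a = 1:96, b = runif(96), c = letters[1:4],
                 stringsAsFactors = FALSE)
file <- file.path(tempdir(), "day.csv")
write.csv(df, file, row.names = FALSE)

# Step 1: infer column classes from a short preview read
colClasses <- sapply(read.csv(file, nrows = 5, stringsAsFactors = FALSE), class)

# Step 2: full read, telling R the row count and column classes up front
full <- read.csv(file, nrows = 96, colClasses = colClasses)
```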

    If the data is getting close to being too large, you may consider processing individual files and saving intermediate versions. There are a number of related discussions on managing memory on this site that cover this topic.

    On memory usage tricks: Tricks to manage the available memory in an R session

    On using the garbage collector function: Forcing garbage collection to run in R with the gc() command
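A sketch of the process-individual-files-and-save-intermediates approach, combined with gc(); the function and file names are illustrative, and saveRDS() is used here for the per-file intermediates:

```r
# Process one file at a time, save an intermediate .rds, and release memory
# before moving on to the next file.
process_file <- function(path, out_dir) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  # ... per-file cleaning would go here ...
  out <- file.path(out_dir, paste0(basename(path), ".rds"))
  saveRDS(df, out)
  rm(df)
  gc()          # ask R to reclaim memory before the next file
  out
}

# Illustrative usage with a throwaway 96-row daily file:
src <- file.path(tempdir(), "day01.csv")
write.csv(data.frame(v = 1:96), src, row.names = FALSE)
saved <- process_file(src, tempdir())
```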