dataframe, csv, julia

efficient way to join 65,000 .csv files


I have, say, 65,000 .csv files that I need to work with in the Julia language.

The goal is to perform basic statistics on the data set.

I have considered a couple of ways of joining all the data sets:

#1 - set a common index and leftjoin() the frames - then perform statistics row-wise
#2 - vcat() the dataframes on top of each other - then use groupby() on the vertically stacked result (both are sketched below)
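
For concreteness, a minimal sketch of the two options with two toy frames standing in for two loaded files, assuming Date and TMAX columns as in the code further down (mean is just a stand-in for whatever statistics are needed):

using DataFrames, Statistics

# two toy frames standing in for two loaded .csv files
df1 = DataFrame(Date = ["2020-01-01", "2020-01-02"], TMAX = [10.0, 12.0])
df2 = DataFrame(Date = ["2020-01-01", "2020-01-02"], TMAX = [11.0, 13.0])

# option #1: join on the common Date index, then work across the TMAX columns row-wise
joined = leftjoin(df1, df2, on = :Date, makeunique = true)  # columns: Date, TMAX, TMAX_1

# option #2: stack the frames vertically, then aggregate with groupby/combine
stacked = vcat(df1, df2)
stats = combine(groupby(stacked, :Date), :TMAX => mean => :TMAX_mean)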

Either way, the final data frames are very large and become slow to process.

Is there an efficient way of doing this?

I thought of performing either #1 or #2 but splitting the joining operations into thirds: let's say after 20,000 joins, save the result to .csv, operate in chunks, and then at the end join all 3 in one last operation.
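
A rough sketch of that chunked idea (the chunk size, the intermediate file names tmax_chunk_*.csv, and the directory path are placeholders, and this assumes each chunk still fits in memory on its own):

using CSV, DataFrames

dir = "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax"
files = readdir(dir)

# process the files in batches of roughly 20,000, saving each batch to disk
for (i, batch) in enumerate(Iterators.partition(files, 20_000))
    dfs = [CSV.read(joinpath(dir, f), DataFrame) for f in batch]
    CSV.write("tmax_chunk_$i.csv", vcat(dfs...))
end

# last operation: stack the intermediate chunk files into one frame
nchunks = ceil(Int, length(files) / 20_000)
tmax_all = vcat([CSV.read("tmax_chunk_$i.csv", DataFrame) for i in 1:nchunks]...)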

I am not sure how to replicate making 65k .csv files here, but basically, in the code below I loop through the files in the directory, load each .csv, and vcat() it onto one DataFrame. The question is really whether there is a better way to manage the size of the operation, since vcat() keeps growing (and copying) the output. Ahead of time, maybe I could cycle through the .csv files, obtain the dimensions of each one, initialize the full DataFrame to the final output size, then cycle through each .csv and populate the initialized DataFrame (a sketch of that idea follows after the code).

using CSV
using DataFrames

# directory containing the .csv files
tmax_dir = "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax"

# read all file names in the directory
csv_dir_tmax = readdir(tmax_dir)

# initialize the output
tmax_all = DataFrame(Date = [], TMAX = [])

for (c, file) in enumerate(csv_dir_tmax)
    println("Starting csv file ", file, " - Iteration ", c)
    csv_tmax = CSV.read(joinpath(tmax_dir, file), DataFrame, header = true)
    # reassigning a global inside the loop needs `global` when run as a script
    global tmax_all = vcat(tmax_all, csv_tmax)
end
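
On the pre-sizing idea above, a rough sketch: count the data rows per file first, preallocate the output columns at their final size, then copy each file's columns into the matching slice (whole columns per file rather than row by row). The column types (String dates, Float64 temperatures), the absence of missing values, and the one-line header per file are assumptions here:

using CSV, DataFrames

tmax_dir = "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax"
paths = [joinpath(tmax_dir, f) for f in readdir(tmax_dir)]

# first pass: count data rows per file (assuming a one-line header in each)
rows = [countlines(p) - 1 for p in paths]

# preallocate the output columns at their final size
date_col = Vector{String}(undef, sum(rows))
tmax_col = Vector{Float64}(undef, sum(rows))

# second pass: read each file and copy its columns into the matching slice
starts = cumsum(rows) .- rows           # zero-based start offset of each file
for (p, n, s) in zip(paths, rows, starts)
    df = CSV.read(p, DataFrame; types = Dict(:Date => String, :TMAX => Float64))
    date_col[s+1:s+n] = df.Date
    tmax_col[s+1:s+n] = df.TMAX
end

tmax_all = DataFrame(Date = date_col, TMAX = tmax_col)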

Solution

The following approach should be relatively efficient (assuming that the data fits into memory):

    tmax_all = reduce(vcat, [CSV.read("YOUR_DIR$x", DataFrame) for x in csv_dir_tmax])
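
This is faster than growing tmax_all inside the loop because every file is read once and all the frames are concatenated in a single call, instead of copying an ever-larger frame on each iteration. A usage sketch with the directory from the question (joinpath builds the full paths; describe is just one way to get basic summary statistics):

    using CSV, DataFrames

    dir = "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax"
    csv_dir_tmax = readdir(dir)

    # read every file and concatenate once
    tmax_all = reduce(vcat, [CSV.read(joinpath(dir, x), DataFrame) for x in csv_dir_tmax])

    # basic summary statistics per column of the stacked data
    describe(tmax_all)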