python, pandas, large-data, median, standard-deviation

Best way to calculate cursory statistics from very large CSV


I have some data in CSV format (16 billion rows, 170 columns).

I can extract each column using cut and load "just" one column from a file into Pandas using pd.read_csv(), but it is painfully slow: it uses about 228 GB of RAM while loading, then settles back to 46 GB for one of the columns, while for some others my system with 256 GB of RAM starts swapping and grinds to a halt.

Is there some way which is reasonably fast and requires less RAM to calculate standard stats like mean, median, standard deviation, and standard error on each column?

System(s) are all running Ubuntu 20.04.3 LTS and I can install any package available through standard repos.

NOTE: Some columns contain u for unknown/missing data while others are simply empty for the same, but otherwise all the columns are either integers or floats.


Solution

  • If anyone is looking for an answer, the comments have some good suggestions for not using CSV files.

    In almost all cases, using something other than CSV is best, but sometimes (as in my case) it's what you have to work with. There are a couple of solutions that work reasonably well depending on your constraints.

    I was unable to find a solution, so I just wrote my own.

    Calculating the standard deviation and standard error (and confidence intervals) does not require holding all values in RAM; however, if you opt not to hold them in RAM, you will have to read the file twice: once to calculate the mean, and a second time to accumulate the sum of the squared differences between the mean and the values (sometimes referred to as the sum of squares). With those two numbers and the count of values, you can calculate most of the common statistics.

    Example code:

    #!/usr/bin/env python3
    import csv
    import math


    def calc_some_stats(infile, col_idx):
        # First pass: count the values and accumulate their total for the mean.
        n, tot = 0, 0
        with open(infile, 'r') as fh:
            for row in csv.reader(fh):
                try:
                    val = float(row[col_idx])
                except ValueError:
                    continue  # skip empty cells, 'u', 'nan', and header text
                n += 1
                tot += val
        mean = tot / n
        # Second pass: accumulate the sum of squared differences from the mean.
        sum_mean_sq = 0
        with open(infile, 'r') as fh:
            for row in csv.reader(fh):
                try:
                    val = float(row[col_idx])
                except ValueError:
                    continue
                sum_mean_sq += (mean - val) ** 2
        variance = sum_mean_sq / n  # population variance
        standard_deviation = math.sqrt(variance)
        standard_error = standard_deviation / math.sqrt(n)
        return n, mean, standard_deviation, standard_error


    n, mean, stdev, sem = calc_some_stats("somefile.csv", 12)
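
    If reading the file twice is a problem (for example, if the data is streamed rather than sitting on disk), a single-pass alternative is Welford's online algorithm, which maintains a running mean and a running sum of squared differences. Below is a minimal sketch following the same conventions as the function above; the function name calc_stats_single_pass is my own, and it computes the same population standard deviation as the two-pass version.

```python
import csv
import math


def calc_stats_single_pass(infile, col_idx):
    """Single-pass mean/stdev/SEM via Welford's online algorithm."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared differences from the current mean
    with open(infile, 'r') as fh:
        for row in csv.reader(fh):
            try:
                val = float(row[col_idx])
            except ValueError:
                continue  # skip empty cells, 'u', and other non-numeric data
            n += 1
            delta = val - mean
            mean += delta / n
            m2 += delta * (val - mean)
    standard_deviation = math.sqrt(m2 / n)  # population stdev, as above
    standard_error = standard_deviation / math.sqrt(n)
    return n, mean, standard_deviation, standard_error
```

    Welford's update is also more numerically stable than accumulating raw sums when the values are large relative to their spread.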