I have some data in CSV format (16 billion rows, 170 columns).
I can extract each column using cut and load "just" one column from a file into Pandas using pd.read_csv(), but it is painfully slow: loading uses about 228GB of RAM and then settles back to 46GB for one of the columns, while for some other columns I tested, my system with 256GB of RAM starts swapping and grinds to a halt.
Is there a reasonably fast way that requires less RAM to calculate standard stats like mean, median, standard deviation, and standard error on each column?
System(s) are all running Ubuntu 20.04.3 LTS and I can install any package available through standard repos.
NOTE: Some columns use u for unknown/missing data while others just have an empty field for the same; otherwise, all of the columns are either integers or floats.
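For reference, a minimal sketch of the single-column Pandas load described above (the file name, column position, and header assumption are placeholders; na_values handles the u markers mentioned in the note):

import pandas as pd

# Load only one column; blank fields become NaN by default, and 'u' is
# added as an extra missing-value marker. Adjust or drop header=None if
# the file actually has a header row.
col = pd.read_csv(
    "somefile.csv",
    usecols=[12],
    na_values=["u"],
    header=None,
)
print(col.describe())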
If anyone is looking for an answer, the comments have some good suggestions about alternatives to CSV files.
In almost all cases, using something other than CSV is best, but sometimes (as in my case) it's what you have to work with. There are a couple of solutions that work reasonably well, depending on your constraints.
I was unable to find a solution, so I just wrote my own.
Calculating the standard deviation and standard error (and confidence intervals) does not require holding all of the values in RAM; however, if you opt not to hold them in RAM, you will have to read the data twice: once to calculate the mean, and a second time to accumulate the sum of the squared differences between each value and the mean (the sum of squared deviations). With those two numbers and the count of values, you can calculate most of the common stats.
Example code:
#!/usr/bin/env python3
import csv
import math

def calc_some_stats(infile, col_idx):
    # First pass: count the usable values and total them for the mean.
    n, tot = 0, 0
    with open(infile, 'r', newline='') as fh:
        reader = csv.reader(fh)
        for row in reader:
            try:
                val = float(row[col_idx])
            except ValueError:
                # Skip 'u', empty fields, and any other non-numeric text.
                continue
            if math.isnan(val):
                # float('nan') parses successfully, so skip NaN explicitly.
                continue
            n += 1
            tot += val
    mean, sum_mean_sq = tot / n, 0
    # Second pass: accumulate the squared deviations from the mean.
    with open(infile, 'r', newline='') as fh:
        reader = csv.reader(fh)
        for row in reader:
            try:
                val = float(row[col_idx])
            except ValueError:
                continue
            if math.isnan(val):
                continue
            sum_mean_sq += (mean - val) ** 2
    variance = sum_mean_sq / n  # population variance (divides by n)
    standard_deviation = math.sqrt(variance)
    standard_error = standard_deviation / math.sqrt(n)
    return n, mean, standard_deviation, standard_error

n, mean, stdev, sem = calc_some_stats("somefile.csv", 12)
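If reading a 16-billion-row file twice is too slow, the same quantities can be computed in a single pass with Welford's online algorithm. This is not part of the solution above, just a sketch of the alternative under the same assumptions about missing values (it uses the same csv and math imports):

def calc_some_stats_one_pass(infile, col_idx):
    # Welford's online algorithm: one pass over the file, constant memory.
    n, mean, m2 = 0, 0.0, 0.0
    with open(infile, 'r', newline='') as fh:
        for row in csv.reader(fh):
            try:
                val = float(row[col_idx])
            except ValueError:
                continue  # skip 'u', blanks, and other non-numeric text
            if math.isnan(val):
                continue
            n += 1
            delta = val - mean
            mean += delta / n
            m2 += delta * (val - mean)
    standard_deviation = math.sqrt(m2 / n)  # population variance, as above
    standard_error = standard_deviation / math.sqrt(n)
    return n, mean, standard_deviation, standard_error

This trades the second read for a running-mean update on every row, at the cost of a little more floating-point work per value.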