Tags: python, numpy, sorting, bigdata, percentile

Calculating percentile over multiple large files without holding them all in memory (Python)


I'm trying to calculate the 99th percentile of a certain value in a climate dataset. I have hourly observations spread across a lat-lon grid of 361 x 576 points for 43 years (1980-2022).

Is there a way to calculate the 99th percentile iteratively, loading each year one at a time and discarding some of the previous data so that the memory needed stays limited? I've done some reading on algorithms but haven't found anything yet that suits my needs.

Thanks in advance!


Solution

  • This is based on Stef's observation that we only need to keep track of the top 1%, but I do it with NumPy arrays and partitioning (the ~k slicing trick used below is illustrated with a small example after the output):

    import numpy as np
    
    # Simulated files (3x5 grid instead of 361x576)
    def files():
        np.random.seed(123)
        for year in range(43):
            yield np.random.random((8760, 3, 5))
    
    # Just for testing: compute the expected result with np.percentile
    def reference(files):
        whole = np.vstack(files)
        return np.percentile(whole, 99, 0, method='closest_observation')
    
    # The actual solution
    def solution(files):
    
        # Preparation for getting the top 1% = top k values per grid cell
        k = round(43 * 8760 / 100)
        def top_k(a):
            # ~k == -(k + 1): keep the k + 1 largest values along the time
            # axis; the smallest of them is guaranteed to land at index 0
            return np.partition(a, ~k, 0)[~k:]
    
        # Compute the result: merge the running top values with each new
        # file's top values, keeping only the top k + 1 rows overall
        result = top_k(next(files))
        for other in files:
            other = top_k(other)
            both = np.vstack((result, other))
            result = top_k(both)
        # Smallest kept value per grid cell = the 99th percentile
        return result[0]
    
    expect = reference(files())
    result = solution(files())
    
    print('expect:')
    print(expect)
    
    print('\nresult:')
    print(result)
    
    print('\ndifference:')
    print(result - expect)
    

    Output (Attempt This Online!):

    expect:
    [[0.98989099 0.98997275 0.99009619 0.98969514 0.99034828]
     [0.9898165  0.99010689 0.98995486 0.99006558 0.98968308]
     [0.98996179 0.98979965 0.9896849  0.98996829 0.99012452]]
    
    result:
    [[0.98989099 0.98997275 0.99009619 0.98969514 0.99034828]
     [0.9898165  0.99010689 0.98995486 0.99006558 0.98968308]
     [0.98996179 0.98979965 0.9896849  0.98996829 0.99012452]]
    
    difference:
    [[0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0.]]
    

    With a simulated 36x58 grid (about 1/100 of the real number of grid points), the "Compute the result" part took less than 30 seconds on the ATO site. So for your real 361x576 grid on your own computer, it should take less than an hour. A sketch of how to plug in your real per-year files follows below.
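
    To make the ~k indexing concrete, here is a minimal illustration (my addition, not part of the original answer): ~k equals -(k + 1), so np.partition moves the k + 1 largest values to the end of the array and guarantees that the smallest of them sits at the start of that slice, which is why result[0] above is the percentile value.

    import numpy as np

    a = np.array([4, 9, 1, 7, 3, 8])
    k = 2
    # Keep the k + 1 = 3 largest values; only the first of them is
    # guaranteed to be sorted into place
    top = np.partition(a, ~k)[~k:]
    print(top[0])  # 7, the third-largest value of a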
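
    For the real dataset, a generator that yields one year at a time can be passed to solution() in place of the simulated files(). The sketch below assumes one NetCDF file per year readable with xarray; the filename pattern and the variable name T2M are placeholders to adjust to your data. Also note that if some years have 8784 hourly steps (leap years), k should be computed from the actual total number of time steps rather than 43 * 8760.

    import xarray as xr  # assumption: yearly NetCDF files, xarray available

    def real_files():
        for year in range(1980, 2023):
            # Hypothetical filename pattern and variable name
            ds = xr.open_dataset(f'data_{year}.nc')
            yield ds['T2M'].values  # shape (hours_in_year, 361, 576)

    # result = solution(real_files())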