Tags: python, pandas, dictionary

Assign multi-index variable values based on the number of elements in a dataframe that match a selection criterion


I have a large CSV dataset that looks like the following:

id,x,y,z
34295,695.117,74.0177,70.6486
20915,800.784,98.5225,19.3014
30369,870.428,98.742,23.9953
48151,547.681,53.055,174.176
34026,1231.02,73.7678,203.404
34797,782.725,73.9831,218.592
15598,983.502,82.9373,314.081
34076,614.738,86.3301,171.316
20328,889.016,98.9201,13.3068
...

If I consider each of these lines an element, I would like a data structure that lets me divide space into x, y, z ranges (3-D blocks of space) and easily determine how many elements fall within a given block.

For instance, if I divided space into cubes of 100 x 100 x 100:

counts[900][100][100] = 3

because ids 20915, 30369, and 20328 from the CSV excerpt above all fall within the ranges x = 800-900, y = 0-100, and z = 0-100.
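To make the example concrete, that block can be checked directly with a boolean mask. A minimal sketch, using just a few of the rows from the excerpt (sample and in_block are illustrative names, not part of my pipeline):

import pandas as pd

# A few rows from the CSV excerpt above, rebuilt inline for illustration
sample = pd.DataFrame({
    "id": [34295, 20915, 30369, 20328],
    "x": [695.117, 800.784, 870.428, 889.016],
    "y": [74.0177, 98.5225, 98.742, 98.9201],
    "z": [70.6486, 19.3014, 23.9953, 13.3068],
})

# Rows falling in the x = 800-900, y = 0-100, z = 0-100 block
in_block = (
    (sample["x"] > 800) & (sample["x"] <= 900)
    & (sample["y"] > 0) & (sample["y"] <= 100)
    & (sample["z"] > 0) & (sample["z"] <= 100)
)
print(sample.loc[in_block, "id"].tolist())  # [20915, 30369, 20328]
print(int(in_block.sum()))                  # 3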

The brute-force way to create something like this is a multi-level dictionary, as follows:

import numpy
import pandas

df = pandas.read_csv("test.csv")

# Bin edges every 100 units along each axis
xs = numpy.linspace(0, 1300, 14, endpoint=True)
ys = numpy.linspace(0, 1000, 11, endpoint=True)
zs = numpy.linspace(0, 1000, 11, endpoint=True)

# Nested dict keyed by the upper edge of each bin: c[x_hi][y_hi][z_hi] = count
c = {}
for x_index, x in enumerate(xs[:-1]):
    x_hi = xs[x_index + 1]
    c[x_hi] = {}
    for y_index, y in enumerate(ys[:-1]):
        y_hi = ys[y_index + 1]
        c[x_hi][y_hi] = {}
        for z_index, z in enumerate(zs[:-1]):
            z_hi = zs[z_index + 1]
            # Count rows whose coordinates fall in the half-open block (lo, hi]
            count = df[(df["x"] > x) & (df["x"] <= x_hi)
                       & (df["y"] > y) & (df["y"] <= y_hi)
                       & (df["z"] > z) & (df["z"] <= z_hi)]["id"].count()
            c[x_hi][y_hi][z_hi] = count
            if count > 0:
                print("c[" + str(x_hi) + "][" + str(y_hi) + "][" + str(z_hi) + "] = " + str(count))

This gives the expected output of:

c[600.0][100.0][200.0] = 1
c[700.0][100.0][100.0] = 1
c[700.0][100.0][200.0] = 1
c[800.0][100.0][300.0] = 1
c[900.0][100.0][100.0] = 3
c[1000.0][100.0][400.0] = 1
c[1300.0][100.0][300.0] = 1

But since the actual production CSV file is very large, this is quite slow. Any suggestions for how to make it faster and a little less clunky?


Solution

  • You could use cut and value_counts. One catch: np.arange excludes its stop value, so the range has to extend one step past the maximum, or the top bin edge (1300 here) is lost and the row with x = 1231.02 is silently dropped:

    tmp = df[['x', 'y', 'z']]
    # np.arange excludes its stop value, so go one step past the max
    # to keep the top edge (1300 here) as a usable bin boundary
    bins = np.arange(0, np.ceil(tmp.to_numpy().max()/100)*100 + 100, 100)

    # label each coordinate with the upper edge of its bin, then count rows
    tmp.apply(lambda s: pd.cut(s, bins, labels=bins[1:])).value_counts().to_dict()
    

    Output:

    {(900.0, 100.0, 100.0): 3,
     (600.0, 100.0, 200.0): 1,
     (700.0, 100.0, 100.0): 1,
     (700.0, 100.0, 200.0): 1,
     (800.0, 100.0, 300.0): 1,
     (1000.0, 100.0, 400.0): 1,
     (1300.0, 100.0, 300.0): 1}
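
    Note that the keys here are flat tuples rather than nested dict levels, so if you store this result in a variable, say counts, the lookup is counts[(900.0, 100.0, 100.0)] rather than counts[900][100][100].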
    

    Or round up to the nearest 100 before value_counts:

    # ceil(v/100)*100 maps each coordinate to the upper edge of its 100-wide bin
    (np.ceil(df[['x', 'y', 'z']].div(100))
       .mul(100).astype(int)
       .value_counts(sort=False)
       .to_dict()
    )
    

    Output:

    {(600, 100, 200): 1,
     (700, 100, 100): 1,
     (700, 100, 200): 1,
     (800, 100, 300): 1,
     (900, 100, 100): 3,
     (1000, 100, 400): 1,
     (1300, 100, 300): 1}
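
    Both versions replace the triple Python loop (and its ~1,300 DataFrame filters) with a few vectorized operations, which is why they scale much better on a large file. If a dense 3-D array of counts is more convenient than a dictionary, numpy.histogramdd is another option. A sketch, assuming the same test.csv; one caveat is that its bins are closed on the left ([lo, hi), except the last bin) rather than on the right like pd.cut, which only matters for points lying exactly on an edge:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("test.csv")

    # One array of bin edges per axis
    edges = [np.arange(0, 1301, 100),   # x: 0, 100, ..., 1300
             np.arange(0, 1001, 100),   # y: 0, 100, ..., 1000
             np.arange(0, 1001, 100)]   # z: 0, 100, ..., 1000
    counts, _ = np.histogramdd(df[["x", "y", "z"]].to_numpy(), bins=edges)

    # counts[i, j, k] is the number of points in block (i, j, k), e.g. the
    # x = 800-900, y = 0-100, z = 0-100 block is counts[8, 0, 0]
    print(counts[8, 0, 0])  # 3.0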