I have a large CSV dataset that looks like the following:
id,x,y,z
34295,695.117,74.0177,70.6486
20915,800.784,98.5225,19.3014
30369,870.428,98.742,23.9953
48151,547.681,53.055,174.176
34026,1231.02,73.7678,203.404
34797,782.725,73.9831,218.592
15598,983.502,82.9373,314.081
34076,614.738,86.3301,171.316
20328,889.016,98.9201,13.3068
...
If I consider each of these lines an element, I would like a data structure that lets me easily divide space into x, y, z ranges (3-D blocks of space) and determine how many elements fall within a given block.
For instance if I divided into cubes of 100 x 100 x 100:
counts[900][100][100] = 3
because ids 20915, 30369, and 20328 from the CSV excerpt above all fall within the ranges x = 800-900, y = 0-100, and z = 0-100.
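To make that arithmetic concrete (just an illustrative check on a single row, not the solution I'm after), the upper edge of the 100 x 100 x 100 block containing a point can be found with ceiling division:

import math

# id 20915 from the excerpt above: x=800.784, y=98.5225, z=19.3014
x, y, z = 800.784, 98.5225, 19.3014

# upper edge of the enclosing 100-unit block along each axis,
# using the same (lower, upper] convention as the ranges above
block = tuple(math.ceil(v / 100) * 100 for v in (x, y, z))
print(block)  # (900, 100, 100)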
The brute-force way to build something like this is a multi-level dictionary, as follows:
import numpy
import pandas

df = pandas.read_csv("test.csv")

xs = numpy.linspace(0, 1300, 14, endpoint=True)
ys = numpy.linspace(0, 1000, 11, endpoint=True)
zs = numpy.linspace(0, 1000, 11, endpoint=True)

c = {}
for x_index, x in enumerate(xs[:-1]):
    c[xs[x_index + 1]] = {}
    for y_index, y in enumerate(ys[:-1]):
        c[xs[x_index + 1]][ys[y_index + 1]] = {}
        for z_index, z in enumerate(zs[:-1]):
            c[xs[x_index + 1]][ys[y_index + 1]][zs[z_index + 1]] = df[
                (df["x"] > xs[x_index]) & (df["x"] <= xs[x_index + 1])
                & (df["y"] > ys[y_index]) & (df["y"] <= ys[y_index + 1])
                & (df["z"] > zs[z_index]) & (df["z"] <= zs[z_index + 1])
            ]["id"].count()
            if c[xs[x_index + 1]][ys[y_index + 1]][zs[z_index + 1]] > 0:
                print("c[" + str(xs[x_index + 1]) + "][" + str(ys[y_index + 1]) + "]["
                      + str(zs[z_index + 1]) + "] = "
                      + str(c[xs[x_index + 1]][ys[y_index + 1]][zs[z_index + 1]]))
This gives the expected output of:
c[600.0][100.0][200.0] = 1
c[700.0][100.0][100.0] = 1
c[700.0][100.0][200.0] = 1
c[800.0][100.0][300.0] = 1
c[900.0][100.0][100.0] = 3
c[1000.0][100.0][400.0] = 1
c[1300.0][100.0][300.0] = 1
But since the actual production CSV file is very large, this is quite slow. Any suggestions for how to make it faster and a little less clunky?
You could use pd.cut and value_counts (extending the bin edges one step past the rounded-up maximum so the largest values still fall inside the last bin):
import numpy as np
import pandas as pd

tmp = df[['x', 'y', 'z']]
bins = np.arange(0, np.ceil(np.max(tmp) / 100) * 100 + 100, 100)
tmp.apply(lambda s: pd.cut(s, bins, labels=bins[1:])).value_counts().to_dict()
Output:
{(900.0, 100.0, 100.0): 3,
(600.0, 100.0, 200.0): 1,
(700.0, 100.0, 100.0): 1,
(700.0, 100.0, 200.0): 1,
(800.0, 100.0, 300.0): 1,
(1000.0, 100.0, 400.0): 1,
(1300.0, 100.0, 300.0): 1}
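A nice side effect of going through pd.cut (a sketch that assumes the xs, ys and zs edge arrays from the question are still in scope): the edges do not have to form a uniform 100-unit grid, so you can feed it your exact per-axis edges:

edges = {'x': xs, 'y': ys, 'z': zs}
(df[['x', 'y', 'z']]
 .apply(lambda s: pd.cut(s, edges[s.name], labels=edges[s.name][1:]))
 .value_counts()
 .to_dict())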
Or round up to the nearest 100 before value_counts:
(np.ceil(df[['x', 'y', 'z']].div(100))
.mul(100).astype(int)
.value_counts(sort=False)
.to_dict()
)
Output:
{(600, 100, 200): 1,
(700, 100, 100): 1,
(700, 100, 200): 1,
(800, 100, 300): 1,
(900, 100, 100): 3,
(1000, 100, 400): 1,
(1300, 100, 300): 1}
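If you then want the counts[900][100][100]-style lookup from your question, the tuple-keyed dict already supports it (a small usage sketch on top of the second result; counts here is just that to_dict() result bound to a name, and blocks with no points are simply absent):

counts = (np.ceil(df[['x', 'y', 'z']].div(100))
          .mul(100).astype(int)
          .value_counts(sort=False)
          .to_dict())

print(counts[(900, 100, 100)])         # 3
print(counts.get((100, 100, 100), 0))  # 0 for a block with no points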