I'm working with 2D data, and I'm aware of how to bin the data to form a 2D histogram using np.histogram2d, and also how to find the bin-location of a particular element using np.digitize.
The code I use to find which histogram bin a particular element is located in looks something like this:
import numpy as np

bins = [[0, 0.3, 0.5, 0.7, 1.1], [0, 0.3, 0.7, 1.1]]
values = np.random.random((10, 2))
digitised = []
for i in range(len(bins)):
    # bin index of every point along dimension i (x for i=0, y for i=1)
    digitised.append(np.digitize(values[:, i], bins[i], right=True))
digitised = np.concatenate(digitised).reshape(2, 10)
where the first row of the 'digitised' array corresponds to the x-direction and the second row to the y-direction, i.e. if digitised[0][0] = 4 and digitised[1][0] = 2, then the first element in my 'values' array is in the 4th x-bin and 2nd y-bin.
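To make the indexing concrete, here is a minimal sketch of what np.digitize returns for a few hand-picked points. The names edges and samples are made up for illustration. Note that right=True closes each bin on its right edge, whereas np.histogram2d uses bins that are closed on the left (except the last one), so the default right=False is the setting that matches the histogram for points lying exactly on an edge.

```python
import numpy as np

# Illustrative inputs (not from the original post)
edges = np.array([0, 0.3, 0.5, 0.7, 1.1])
samples = np.array([0.1, 0.3, 0.45, 1.0])

# right=True: index i means edges[i-1] < x <= edges[i]
idx_right = np.digitize(samples, edges, right=True)

# right=False (default): index i means edges[i-1] <= x < edges[i]
idx_left = np.digitize(samples, edges)

print(idx_right)  # [1 1 2 4]
print(idx_left)   # [1 2 2 4]
```

The only difference is the sample sitting exactly on the 0.3 edge, which moves from bin 1 to bin 2.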
The code I use to compute the overall 2D histogram is:
bins_x = np.array([0, 0.3, 0.5, 0.7, 1.1])
bins_y = np.array([0, 0.3, 0.7, 1.1])
H, edge_x, edge_y = np.histogram2d(values[:, 0], values[:, 1], bins=(bins_x, bins_y))
H = H.T
and the output of the above code block would look something like this:
H:
array([[0., 3., 0., 0.],
       [1., 0., 0., 1.],
       [1., 0., 1., 3.]])
I'm interested in extracting a list of lists of elements within each overall bin. For example, in the H[0][1] entry, where there are three values, I would like to extract a list of which elements in values go into this entry, but in a more general sense, extract a list for every bin in this 2D histogram.
This would be possible using a double for-loop, e.g. sorting through the x-values of the 'digitised' list first, then finding the y-values, and grouping them together. However, to the best of my knowledge, this would require a copious number of if statements to sort through all the individual bins, which would get quite inefficient for a larger dataset (e.g. an 8 x 7 grid compared to the 4 x 3 example here).
I would be super grateful for any advice or suggestions as to how to go about doing this, thank you!
If you are not restricted to NumPy, you can use SciPy to calculate both the histogram and the bin numbers for each element of the source 2D array.
import scipy.stats

H, edge_x, edge_y, binnumber = scipy.stats.binned_statistic_2d(
    values[:, 0],
    values[:, 1],
    None,
    bins=(bins_x, bins_y),
    statistic='count',
    expand_binnumbers=True  # return separate x and y bin numbers, shape (2, N)
)
If you would like to combine all elements under their bin values you can use the following snippet:
from collections import defaultdict
bin_values = defaultdict(list)
for value_i, (bin_x, bin_y) in enumerate(binnumber.T):
    bin_values[(bin_x, bin_y)].append(values[value_i])
So, to find out which elements are located in the first bin along X and the third bin along Y, you check the corresponding entry of the bin_values dictionary:
> bin_values[(1, 3)]
[array([0.92643067, 0.98808226]), array([0.8453115 , 0.75003263])]
Please check the documentation of scipy.stats.binned_statistic_2d for more info.
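If SciPy is not an option, the same grouping can probably be done with NumPy alone and without any if statements. The following is only a sketch under a few assumptions: every point lies inside the bin edges (otherwise np.ravel_multi_index raises), the seeded generator just stands in for your data, and the names ix, iy, flat, order, counts and groups are mine. The idea is to turn each (x-bin, y-bin) pair into a single flat index, sort the points by it, and split the sorted array at the bin boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded stand-in for the real data
values = rng.random((10, 2))
bins_x = np.array([0, 0.3, 0.5, 0.7, 1.1])
bins_y = np.array([0, 0.3, 0.7, 1.1])

# 0-based bin index per point; the default right=False matches np.histogram2d
ix = np.digitize(values[:, 0], bins_x) - 1
iy = np.digitize(values[:, 1], bins_y) - 1

nx, ny = len(bins_x) - 1, len(bins_y) - 1

# Combine the two indices into one flat bin id: flat = ix * ny + iy
flat = np.ravel_multi_index((ix, iy), (nx, ny))

# Sort points by bin id, then cut the sorted array where the bin id changes
order = np.argsort(flat, kind="stable")
counts = np.bincount(flat, minlength=nx * ny)
groups = np.split(values[order], np.cumsum(counts)[:-1])

# groups[i * ny + j] holds the points in x-bin i and y-bin j (possibly empty)
H, _, _ = np.histogram2d(values[:, 0], values[:, 1], bins=(bins_x, bins_y))
print([len(g) for g in groups] == H.ravel().astype(int).tolist())
```

Here H is left untransposed, so groups follows the same (x, y) ordering as H.ravel(). If you want indices into values rather than the points themselves, split order instead of values[order].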
EDIT:
If you do print(bin_values[(2, 2)]) (given that there is no entry for (2, 2)), you will get []. This list is generated automatically and placed into bin_values as soon as you look up a non-existing key in the dictionary. If you really need to see empty lists in the print(bin_values) output immediately, you can set them up like this:
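import itertools

# Bin numbers returned by binned_statistic_2d start at 1
# (0 and len(edges) are reserved for out-of-range values)
for bin_index_2d in itertools.product(range(1, len(bins[0])), range(1, len(bins[1]))):
    if bin_index_2d not in bin_values:
        bin_values[bin_index_2d] = []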
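The difference between looking up a key and merely testing for it can be seen in this small standalone sketch. The stored string is a placeholder for illustration only.

```python
from collections import defaultdict

bin_values = defaultdict(list)
bin_values[(1, 3)].append("some point")   # placeholder value

print((2, 2) in bin_values)     # False: membership test does not create the key
print(bin_values.get((2, 2)))   # None: .get() does not create it either
print(bin_values[(2, 2)])       # []: indexing creates the key with an empty list
print((2, 2) in bin_values)     # True: the key now exists
```

So if you only want to check whether a bin is populated without adding keys as a side effect, use the in operator or .get() rather than indexing.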