I have a dictionary with integer keys and float values. I also have a 2D awkward array with integer entries (I'm using awkward1). I want to replace these integers with the corresponding float according to the dictionary, keeping the awkward array format.
Assuming the keys run from 0 to 999, my solution so far is something like this:
resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,1000):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)
Is there a faster way to do this?
Update
Minimal reproducible example of my working code:
import awkward as ak # Awkward 1
myArray = ak.from_iter([[0, 1], [2, 1, 0]]) # Creating example array
myDict = {0: 19.5, 1: 34.1, 2: 10.9}
resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,3):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)
myArray:
<Array [[0, 1], [2, 1, 0]] type='2 * var * int64'>
resultArray:
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
When I mentioned in a comment that np.searchsorted is where you should be looking, I hadn't noticed that myDict includes every consecutive integer as a key. Having a dense lookup table like this allows faster algorithms, which also happen to be simpler in Awkward Array.
So, assuming that there's a key in myDict for each integer from 0 up to some value, you can equally well represent the lookup table as
>>> lookup = ak.Array([myDict[i] for i in range(len(myDict))])
>>> lookup
<Array [19.5, 34.1, 10.9] type='3 * float64'>
The problem of picking values at 0, 1, and 2 becomes just an array-slice. (This array-slice is an O(n) algorithm for array length n, unlike np.searchsorted, which does a binary search per entry and would be O(n log k) for k keys. That's the cost of having sparse lookup keys.)
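To make the difference concrete, here is a minimal sketch on a flat NumPy array (nesting is ignored for the moment). It assumes the dictionary keys are exactly 0, 1, ..., len(myDict) - 1; the names flat, by_slice and by_loop are only for illustration. The loop from the question scans the whole array once per key, whereas the slice visits each entry once.

```python
import numpy as np

myDict = {0: 19.5, 1: 34.1, 2: 10.9}
flat = np.array([0, 1, 2, 1, 0])  # illustrative flat input

# Dense lookup table: position i holds the value for key i.
lookup = np.array([myDict[i] for i in range(len(myDict))])

# One pass over the data: integer-array indexing.
by_slice = lookup[flat]

# The approach from the question: one full pass over the data per key.
by_loop = np.zeros(len(flat))
for key, value in myDict.items():
    by_loop = by_loop + np.where(flat == key, value, 0)

print(by_slice.tolist())
print(by_loop.tolist())
```

Both approaches produce the same values; only the number of passes over the data differs.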
The problem, however, is that myArray is nested and lookup is not. We can give lookup the same depth as myArray by slicing it up:
>>> import numpy as np
>>> multilookup = lookup[np.newaxis][np.zeros(len(myArray), np.int64)]
>>> multilookup
<Array [[19.5, 34.1, 10.9, ... 34.1, 10.9]] type='2 * 3 * float64'>
>>> multilookup.tolist()
[[19.5, 34.1, 10.9], [19.5, 34.1, 10.9]]
And then multilookup[myArray] is exactly what you want:
>>> multilookup[myArray]
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
The lookup had to be duplicated because each list within myArray holds indexes into the whole lookup, so each list needs its own copy to slice. If the memory involved in creating multilookup is prohibitive, you could instead flatten myArray so that it matches the flat lookup:
>>> flattened, num = ak.flatten(myArray), ak.num(myArray)
>>> flattened
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> num
<Array [2, 3] type='2 * int64'>
>>> lookup[flattened]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(lookup[flattened], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
If your keys are not dense from 0 up to some integer, then you'll have to use np.searchsorted, which expects the keys in sorted order:
>>> keys = ak.Array(sorted(myDict.keys()))
>>> values = ak.Array([myDict[key] for key in keys])
>>> keys
<Array [0, 1, 2] type='3 * int64'>
>>> values
<Array [19.5, 34.1, 10.9] type='3 * float64'>
In this case, keys is trivial because the dictionary is dense. When using np.searchsorted, you have to explicitly cast the flat Awkward Arrays as NumPy (for now; we're looking to fix that).
>>> lookup_index = np.searchsorted(np.asarray(keys), np.asarray(flattened), side="left")
>>> lookup_index
array([0, 1, 2, 1, 0])
np.searchsorted returns positions within keys, so passing lookup_index through keys gives back the original entries (here the positions and the keys coincide because the keys are dense). Comparing that with flattened is a way to check that every entry was found. The same positions select from values:
>>> keys[lookup_index]
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> values[lookup_index]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(values[lookup_index], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
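Since the example dictionary is dense, it does not really exercise the sparse case. Below is a minimal NumPy-only sketch with genuinely sparse keys, working on an already flattened array. The function name sparse_lookup, the example dictionary sparseDict, and the choice to raise KeyError for entries that are not in the table are assumptions made for illustration; the table is assumed to be non-empty.

```python
import numpy as np

def sparse_lookup(flat_keys, table):
    """Map each integer in a flat array to table[integer] by binary search.

    Raises KeyError if any entry has no key in `table` (assumed non-empty).
    """
    keys = np.array(sorted(table), dtype=np.int64)
    values = np.array([table[k] for k in keys.tolist()], dtype=np.float64)
    flat_keys = np.asarray(flat_keys)

    # searchsorted returns insertion points, which are only matches
    # for entries that actually occur in `keys`.
    index = np.searchsorted(keys, flat_keys, side="left")
    # An entry larger than every key gets index len(keys); clip it so
    # the comparison below can be evaluated safely.
    index = np.minimum(index, len(keys) - 1)

    found = keys[index] == flat_keys
    if not found.all():
        raise KeyError(f"not in table: {flat_keys[~found].tolist()}")
    return values[index]

sparseDict = {3: 19.5, 10: 34.1, 42: 10.9}  # illustrative sparse keys
result = sparse_lookup([3, 10, 42, 10, 3], sparseDict)
print(result.tolist())
```

The result can be given back its nested structure with ak.unflatten and the counts from ak.num, exactly as in the dense case.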
But the thing I was waffling about in yesterday's comment was that you have to do this on the flattened form of myArray (flattened) and reintroduce the structure later with ak.unflatten, as above. But perhaps we should wrap np.searchsorted as ak.searchsorted to recognize a fully structured Awkward Array in the second argument, at least. (It has to be unstructured to be in the first argument.)