python numpy intersect compound-key

Numpy arrays with compound keys; find subset in both


I have two 2D numpy arrays shaped:

(19133L, 12L)
(248L, 6L)

In each case, the first 3 fields form an identifier.

I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?

I would then like to sort it so that the rows are ordered by the first value, then the second, then the third, so that (3 3 4) comes before (3 3 5) etc. Is there a multi-field sort function?

Edit:

I have tried pandas:

df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))

df1.set_index([0,1,2])
df2.set_index([0,1,2])

out = merge(df1,df2,how="inner") 
print(out.shape)

But this results in a shape of (0, 13).


Solution

  • Use pandas.

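    DataFrame.set_index() accepts a list of columns, so you can set the index to the first three columns. Note that set_index() returns a new DataFrame by default: either assign the result back or pass inplace=True, otherwise the call has no effect. Pass drop=False if you also want to keep the key columns as ordinary columns.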

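    Then use merge(..., how='inner') to intersect your dataframes, joining on those three keys (on=[0, 1, 2], or left_index=True, right_index=True if the keys are the index). If you do not specify the keys, merge joins on every column label the two frames share.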

    In general, numpy runs out of steam very quickly for arbitrary tabular manipulations like this, so your default should be to try pandas first. It is typically also more performant for this kind of join.
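
The approach above can be sketched as follows. This is a minimal illustration, not a definitive implementation: arr1 and arr2 are small made-up stand-ins for the two arrays in the question, and the names keys, wanted, filtered and result are hypothetical. It assumes the identifier columns hold comparable values of the same dtype in both arrays. Only the key columns of the smaller frame are merged in, so the result keeps exactly the columns of the larger array.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two arrays in the question (much smaller).
arr1 = np.array([
    [3, 3, 5, 10, 11],
    [1, 2, 3, 20, 21],
    [3, 3, 4, 30, 31],
    [9, 9, 9, 40, 41],  # identifier not present in arr2, should be dropped
])
arr2 = np.array([
    [3, 3, 4, 7],
    [1, 2, 3, 8],
    [3, 3, 5, 9],
])

keys = [0, 1, 2]  # the first three columns form the compound identifier

df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2)

# Take only the key columns of the smaller frame (deduplicated) so the
# merge acts as a filter and adds no extra columns to the result.
wanted = df2[keys].drop_duplicates()

# Inner join on the three key columns keeps only rows of df1 whose
# identifier also appears in df2.
filtered = df1.merge(wanted, on=keys, how="inner")

# Multi-field sort: by column 0, then 1, then 2.
filtered = filtered.sort_values(keys).reset_index(drop=True)

result = filtered.to_numpy()
print(result.shape)
```

If the identifiers in the smaller array are unique and all occur in the larger one, the result has one row per matching row of the larger array, which for the shapes in the question would give the expected (248, 12) when each identifier occurs once.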