I have two 2D numpy arrays shaped:
(19133L, 12L)
(248L, 6L)
In each case, the first 3 fields form an identifier.
I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?
I would then like to sort it so that the arrays are indexed by the first value, second value and third value so that (3 3 4) comes after (3 3 5) etc. Is there a multi field sort function?
Edit:
I have tried pandas:
df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))
df1.set_index([0,1,2])
df2.set_index([0,1,2])
out = merge(df1,df2,how="inner")
print(out.shape)
But this results in (0,13) shape
Use pandas.
pandas.set_index() allows multiple keys. So set the index to the first three columns (use drop=False, inplace=True
) to avoid needlessly mutating or copying your dataframe.
Then, merge(...how='inner') to intersect your dataframes.
In general, numpy runs out of steam very quickly for arbitrary dataframe manipulations; your default thing should be to try pandas. Also much more performant.