python pandas interval-tree

Interval intersection in pandas


Update 5:

This feature has been released as part of pandas 0.20.1 (on my birthday :] )

Update 4:

PR has been merged!

Update 3:

The PR has moved here

Update 2:

It seems like this question may have contributed to re-opening the PR for IntervalIndex in pandas.

Update:

I no longer have this problem, because I'm now querying for overlapping ranges between A and B rather than for points from B that fall within ranges in A. That is a full interval tree problem. I won't delete the question, though, because I think it's still valid and I don't have a good answer.

Problem statement

I have two dataframes.

In dataframe A, two of the integer columns taken together represent an interval.

In dataframe B, one integer column represents a position.

I'd like to do a sort of join, such that points are assigned to each interval they fall within.

Intervals overlap rarely, but it does happen. If a point falls within an overlap, it should be assigned to both intervals. About half of the points won't fall within any interval, but nearly every interval will have at least one point within its range.
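To make the desired join concrete, here is a minimal brute-force sketch using NumPy broadcasting. The column names (start, end, pos), the sample values, and the choice to treat both interval ends as inclusive are assumptions for illustration, not part of my real data. It compares every interval against every point, so memory grows with len(A) * len(B); it is meant to pin down the semantics, not to be fast.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are assumptions.
A = pd.DataFrame({"start": [0, 5, 20], "end": [6, 10, 30]})
B = pd.DataFrame({"pos": [3, 5, 15, 25]})

starts = A["start"].to_numpy()
ends = A["end"].to_numpy()
pos = B["pos"].to_numpy()

# hits[i, j] is True when point j lies inside interval i
# (both endpoints treated as inclusive, which is an assumption).
hits = (starts[:, None] <= pos[None, :]) & (pos[None, :] <= ends[:, None])

# Row/column positions of every (interval, point) match.
a_idx, b_idx = np.nonzero(hits)

# One output row per match: a point inside two overlapping
# intervals appears twice, and unmatched points are dropped.
joined = pd.concat(
    [
        A.iloc[a_idx].reset_index(drop=True),
        B.iloc[b_idx].reset_index(drop=True),
    ],
    axis=1,
)
print(joined)
```

In this toy data, the point 5 lies in both [0, 6] and [5, 10], so it shows up in two rows, while 15 falls in no interval and is absent from the result.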

What I've been thinking

I was initially going to dump my data out of pandas and use intervaltree, banyan, or maybe bx-python, but then I came across this gist. It turns out that the ideas shoyer has in there never made it into pandas, but it got me thinking: it might be possible to do this within pandas. Since I want this code to be as fast as Python can possibly go, I'd rather not dump my data out of pandas until the very end. I also get the feeling that this is possible with bins and the pandas cut function, but I'm a total newbie to pandas, so I could use some guidance! Thanks!

Notes

Potentially related? Pandas DataFrame groupby overlapping intervals of variable length


Solution

  • This feature was released as part of pandas 0.20.1
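
For illustration, here is a minimal sketch of how the released IntervalIndex might be used for the point-in-interval lookup. The column names and sample values are hypothetical, and this sketch assumes the intervals do not overlap: as far as I know, get_indexer raises an error when the index contains overlapping intervals, so that case needs a different approach.

```python
import pandas as pd

# Hypothetical example data with non-overlapping intervals.
A = pd.DataFrame({"start": [0, 10, 20], "end": [5, 15, 30]})
B = pd.DataFrame({"pos": [3, 12, 17, 25]})

# Build an interval index from the two columns of A
# (closed="both" makes both endpoints inclusive; an assumption).
intervals = pd.IntervalIndex.from_arrays(
    A["start"].to_numpy(), A["end"].to_numpy(), closed="both"
)

# For each point, the row position in A of the interval that
# contains it, or -1 when no interval contains the point.
B["interval_row"] = intervals.get_indexer(B["pos"].to_numpy())

# Keep the matched points and attach the interval columns from A.
matched = B[B["interval_row"] >= 0]
joined = matched.join(A, on="interval_row")
print(joined)
```

Points that fall in no interval (17 in this toy data) get -1 and are filtered out before the join.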