pandasmerge-asof

Nearest matching for groups, when merge_asof fails because of warning that left frame is not properly sorted


This should be simple using pandas merge_asof function, but unfortunately it's not working because the function complains: ValueError: left keys must be sorted.

I want to assign the correct values of seniority of individuals to their achievements over time. I have a data frame with the achievements of 3,500 people over time. In total, there are > 75,000 achievements over a time period of 1970 to this year. Now, the individuals progress in seniority over time.

I want to match the achievements to their seniority.

Below, under the heading ACHIEVEMENTDATA is an example of data for two people. The relevant identifiers are identifier and achievement_year. achieve_count is the number of achievements by achievement_year for a person.

Now, I want a column (seniority) added to dfa based on achievement data over time from df (see PERSONALNDATA below), such that the rows in dfa reflect the proper seniority for each row in dfa. I manually did that below.

Note that some rows in df have duplicates identifier per year (A in 2015 and 2019) in these cases, rely on the row with the highest value of seniority.

PERSONALNDATA (df)
identifier  seniority   year
A   2   2009
A   3   2015
A   3   2015
A   4   2019
A   4   2019
A   4   2023
B   2   2012
B   4   2024


ACHIEVEMENTDATA (dfa):
identifier  achievement_year    achieve_count   seniority
A   2003    2   
A   2004    3   
A   2005    1   
A   2006    3   
A   2007    1   
A   2008    1   
A   2010    2   2
A   2011    2   2
A   2012    2   2
A   2013    4   2
A   2014    8   2
A   2015    4   3
A   2016    4   3
A   2017    4   3
A   2018    7   3
A   2019    4   4
A   2020    12  4
A   2021    8   4
A   2022    5   4
A   2023    7   4
A   2024    5   4
B   2007    1   
B   2009    1   
B   2010    2   
B   2011    1   
B   2012    2   2
B   2013    1   2
B   2014    1   2
B   2017    3   2
B   2019    1   2
B   2020    2   2
B   2021    1   2
B   2023    2   2
B   2024    2   4

Solution

  • See rule #5 in this link: Sort both dataframes according to the column listed for the ‘on’ parameter.

    This is also a rather hidden rule given its necessity, especially if you data was sorted initially. Regardless, just like watching out for nonetypes before running a merge ensure you sort both data frames according to the on column and all should be well. This is actually a rule that merge_asof shares with merge_ordered() so ensure you sort before that kind of merge also. To find out more about merge_ordered() go here.

    I had sorted the frames on 'identifier', and 'year'