This should be simple using pandas merge_asof function, but unfortunately it's not working because the function complains: ValueError: left keys must be sorted
.
I want to assign the correct values of seniority of individuals to their achievements over time. I have a data frame with the achievements of 3,500 people over time. In total, there are > 75,000 achievements over a time period of 1970 to this year. Now, the individuals progress in seniority over time.
I want to match the achievements to their seniority.
Below, under the heading ACHIEVEMENTDATA is an example of data for two people. The relevant identifiers are identifier
and achievement_year
. achieve_count
is the number of achievements by achievement_year
for a person.
Now, I want a column (seniority
) added to dfa
based on achievement data over time from df
(see PERSONALNDATA below), such that the rows in dfa
reflect the proper seniority for each row in dfa
. I manually did that below.
Note that some rows in df
have duplicates identifier per year (A in 2015 and 2019) in these cases, rely on the row with the highest value of seniority.
PERSONALNDATA (df)
identifier seniority year
A 2 2009
A 3 2015
A 3 2015
A 4 2019
A 4 2019
A 4 2023
B 2 2012
B 4 2024
ACHIEVEMENTDATA (dfa):
identifier achievement_year achieve_count seniority
A 2003 2
A 2004 3
A 2005 1
A 2006 3
A 2007 1
A 2008 1
A 2010 2 2
A 2011 2 2
A 2012 2 2
A 2013 4 2
A 2014 8 2
A 2015 4 3
A 2016 4 3
A 2017 4 3
A 2018 7 3
A 2019 4 4
A 2020 12 4
A 2021 8 4
A 2022 5 4
A 2023 7 4
A 2024 5 4
B 2007 1
B 2009 1
B 2010 2
B 2011 1
B 2012 2 2
B 2013 1 2
B 2014 1 2
B 2017 3 2
B 2019 1 2
B 2020 2 2
B 2021 1 2
B 2023 2 2
B 2024 2 4
See rule #5 in this link: Sort both dataframes according to the column listed for the ‘on’ parameter.
This is also a rather hidden rule given its necessity, especially if you data was sorted initially. Regardless, just like watching out for nonetypes before running a merge ensure you sort both data frames according to the on column and all should be well. This is actually a rule that merge_asof shares with merge_ordered() so ensure you sort before that kind of merge also. To find out more about merge_ordered() go here.
I had sorted the frames on 'identifier'
, and 'year'