I can solve this problem with either pandas or SQL queries.
Say I have two tables or DataFrames. The first one describes observations. It is indexed by the id I am interested in and has a set of further columns:
         category  year
    id1
    1           A  2016
    1           B  2016
    1           C  2016
    2           A  2017
    2           B  2017
Furthermore, we have a table that serves as our base population, something like this:
         category  year
    id2
    0           A  2014
    1           B  2016
    2           C  2017
    3           A  2017
    4           B  2014
    5           C  2017
    6           A  2018
    7           B  2017
I want to be able to use the values in each row of the first table as a condition to select elements of the second table. For example: The first id "1" has 3 descriptions:
{A, 2016}, {B, 2016}, {C, 2016}
I want to create a condition out of those values that reads like this:
((category = A) or (category = B) or (category = C)) and (year >= 2016)
(The year is always the same for each id1.) I want to count all elements of the population that fulfill the condition derived from the rows of each id1 in the observations.
What I want at the end of the day:
         count
    id1
    1        6
    2        3
There are 6 elements of the population that fulfill the requirements of the observations with id1 "1" (all elements that are in category A, B or C and are from 2016 or later).
My idea for a solution is to create a conditional subselect, or to join the tables and then filter the rows, but I am stuck.
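This can be solved with a SQL join and GROUP BY. Joining on category and comparing the years inside the join condition applies each observation row as a filter on the population. Note that the expected counts (6 and 3) only come out with an inclusive comparison (>=); a strict > would give 5 and 1.
    SELECT o.id1, COUNT(*) AS count
    FROM observations o
    JOIN population p
      ON p.category = o.category AND p.year >= o.year
    GROUP BY o.id1;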
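To try the query without setting up a database server, the following is a minimal sketch that loads the sample data from the question into an in-memory SQLite database through Python's built-in `sqlite3` module. The table names `observations` and `population` follow the query above; the column types and the choice of SQLite are assumptions made only for illustration. It uses the inclusive comparison (`>=`), because that is what reproduces the expected counts of 6 and 3.

```python
import sqlite3

# In-memory database holding the sample data from the question.
# Column types are an assumption; the question does not specify a schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observations (id1 INTEGER, category TEXT, year INTEGER);
    CREATE TABLE population (id2 INTEGER PRIMARY KEY, category TEXT, year INTEGER);
""")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(1, "A", 2016), (1, "B", 2016), (1, "C", 2016),
     (2, "A", 2017), (2, "B", 2017)],
)
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?)",
    [(0, "A", 2014), (1, "B", 2016), (2, "C", 2017), (3, "A", 2017),
     (4, "B", 2014), (5, "C", 2017), (6, "A", 2018), (7, "B", 2017)],
)

# Each observation row matches the population rows of the same category
# whose year is at least the observation's year; the matches are counted per id1.
query = """
    SELECT o.id1, COUNT(*) AS count
    FROM observations o
    JOIN population p
      ON p.category = o.category AND p.year >= o.year
    GROUP BY o.id1
    ORDER BY o.id1
"""
sql_counts = conn.execute(query).fetchall()
print(sql_counts)  # [(1, 6), (2, 3)]
```

Because this is an inner join, an id1 without any matching population row would not appear in the result at all. If such ids should be reported with a count of 0, a `LEFT JOIN` combined with `COUNT(p.id2)` is one way to get that.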
If you are working with pandas DataFrames, the equivalent looks like this (assuming df1 holds the observations indexed by id1 and df2 holds the population indexed by id2):
    import pandas as pd
    # Merge on 'category'; reset_index() turns the id1/id2 indexes into columns so they survive the merge
    merged = pd.merge(df1.reset_index(), df2.reset_index(), on='category', suffixes=('_obs', '_pop'))
    # Keep rows where the population year is at least the observation year (>= matches the expected counts)
    filtered = merged[merged['year_pop'] >= merged['year_obs']]
    # Count the matching population rows per id1
    result = filtered.groupby('id1').size().reset_index(name='count')
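To check the pandas approach end to end, here is a minimal, self-contained sketch that rebuilds the sample data from the question. The names `df1` and `df2` follow the snippet above. The `reset_index()` calls and the final `reindex` step are additions for illustration: the first keeps `id1` available as a column after the merge, the second keeps ids that have no match at all (with a count of 0). The comparison is inclusive (`>=`), since that is what yields the expected 6 and 3.

```python
import pandas as pd

# Sample data from the question; df1 is indexed by id1, df2 by id2.
df1 = pd.DataFrame(
    {"category": ["A", "B", "C", "A", "B"],
     "year": [2016, 2016, 2016, 2017, 2017]},
    index=pd.Index([1, 1, 1, 2, 2], name="id1"),
)
df2 = pd.DataFrame(
    {"category": ["A", "B", "C", "A", "B", "C", "A", "B"],
     "year": [2014, 2016, 2017, 2017, 2014, 2017, 2018, 2017]},
    index=pd.Index(range(8), name="id2"),
)

# Move the indexes into columns so that id1 and id2 survive the merge.
merged = pd.merge(
    df1.reset_index(), df2.reset_index(),
    on="category", suffixes=("_obs", "_pop"),
)

# Inclusive comparison: population rows from the observation year or later.
matches = merged[merged["year_pop"] >= merged["year_obs"]]

# Count matches per id1; reindex so that ids without any match show 0
# instead of silently dropping out of the result.
counts = (
    matches.groupby("id1").size()
    .reindex(df1.index.unique(), fill_value=0)
    .rename("count")
)
print(counts)
```

This works without double counting because each (id1, category) pair occurs only once in the observations, so every population row can match a given id1 at most once.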