[SOLVED] How to query table, using a where clause consisting of another table's data?

I can work on this problem using either pandas or SQL queries.

Say I have two tables or dataframes. The first describes observations. It is indexed by the id I am interested in and has a set of further columns:

     category    year
id1
1    A           2016
1    B           2016
1    C           2016
2    A           2017
2    B           2017

Furthermore, we have a table which functions as our base population, something like this:

     category    year
id2
0    A           2014
1    B           2016
2    C           2017
3    A           2017
4    B           2014
5    C           2017
6    A           2018
7    B           2017

I want to be able to use the values in each row of the first table as a condition to select elements of the second table. For example: The first id "1" has 3 descriptions:

{A, 2016},  {B, 2016}, {C, 2016}

I want to create a condition out of those values that reads like this:

((category = A) or (category = B) or (category = C)) and (year > 2016)

(The year is always the same for each id1.) For each id1, I want to count all elements of the population that fulfill the condition derived from that id's rows in the observations.

What I want at the end of the day:

     count
id1
1    6
2    3

There are 6 elements of the population that fulfill the requirements of the observations with id1 "1" (all elements that are either category A, B or C and are newer than 2016).

My idea for a solution is to create a conditional subselect, or to join the tables and then filter the rows, but I am stuck.

Solution

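This problem can be solved with a SQL join and GROUP BY: join each observation to the population rows of the same category that satisfy the year condition, then count the joined rows per id1.

Note that with the sample data the strict > from the question's condition gives counts of 5 and 1. The expected counts of 6 and 3 correspond to >=, so use p.year >= o.year if the observation year itself should be included.

SELECT o.id1, COUNT(*) AS count
FROM observations o
JOIN population p
  ON p.category = o.category AND p.year > o.year
GROUP BY o.id1;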
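
To check the query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the table names observations and population, and the helper count_matches are assumptions made for illustration. The comparison operator is a parameter so that both the strict and the inclusive reading of the year condition can be tried.

```python
import sqlite3

# Sample data copied from the question.
observations = [(1, "A", 2016), (1, "B", 2016), (1, "C", 2016),
                (2, "A", 2017), (2, "B", 2017)]
population = [(0, "A", 2014), (1, "B", 2016), (2, "C", 2017), (3, "A", 2017),
              (4, "B", 2014), (5, "C", 2017), (6, "A", 2018), (7, "B", 2017)]


def count_matches(op):
    """Count population rows per id1; `op` is the year comparison ('>' or '>=')."""
    if op not in (">", ">="):
        raise ValueError("op must be '>' or '>='")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE observations (id1 INTEGER, category TEXT, year INTEGER)")
    conn.execute("CREATE TABLE population (id2 INTEGER, category TEXT, year INTEGER)")
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", observations)
    conn.executemany("INSERT INTO population VALUES (?, ?, ?)", population)
    # Same join as above; only the comparison operator is swapped in.
    query = f"""
        SELECT o.id1, COUNT(*) AS count
        FROM observations o
        JOIN population p
          ON p.category = o.category AND p.year {op} o.year
        GROUP BY o.id1
        ORDER BY o.id1
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return dict(rows)  # maps id1 to its count


strict = count_matches(">")      # year strictly greater
inclusive = count_matches(">=")  # year greater or equal
print("strict:", strict)
print("inclusive:", inclusive)
```

Because this is an inner join, an id1 with no matching population rows does not appear in the result at all.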

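or, if you're working with pandas dataframes (df1 holding the observations, df2 the population), the solution would look like this:

import pandas as pd

# id1 is the index of df1, so move it into a column first; otherwise the
# merge drops it. Then pair each observation with the population rows of
# the same category.
merged = pd.merge(df1.reset_index(), df2, on='category', suffixes=('_obs', '_pop'))

# Keep rows where 'year' in the population is greater than 'year' in the
# observations (use >= to also include the observation year itself)
filtered = merged[merged['year_pop'] > merged['year_obs']]

# Count the remaining rows per id1
result = filtered.groupby('id1').size().reset_index(name='count')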
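
As an end-to-end check, the following sketch rebuilds the two dataframes from the question and runs the merge-and-filter approach on them. It assumes id1 and id2 are the indexes, as in the tables shown in the question, and it uses the inclusive comparison (>=), which is the one that reproduces the expected counts. The reindex step is an addition for illustration: it keeps ids that have no match at all, with a count of 0.

```python
import pandas as pd

# Sample data from the question; id1 and id2 are the indexes, as shown there.
df1 = pd.DataFrame(
    {"category": ["A", "B", "C", "A", "B"],
     "year": [2016, 2016, 2016, 2017, 2017]},
    index=pd.Index([1, 1, 1, 2, 2], name="id1"),
)
df2 = pd.DataFrame(
    {"category": ["A", "B", "C", "A", "B", "C", "A", "B"],
     "year": [2014, 2016, 2017, 2017, 2014, 2017, 2018, 2017]},
    index=pd.Index(range(8), name="id2"),
)

# reset_index() turns id1 into a column so that it survives the merge.
merged = df1.reset_index().merge(df2, on="category", suffixes=("_obs", "_pop"))

# Inclusive comparison: population year is at least the observation year.
matches = merged[merged["year_pop"] >= merged["year_obs"]]

# Count per id1, then reindex so ids without any match show up with 0.
all_ids = df1.index.unique()
counts = (
    matches.groupby("id1").size()
    .reindex(all_ids, fill_value=0)
    .rename("count")
)
print(counts)
```

Grouping with size() keeps the result ordered by id1, whereas value_counts() orders it by count.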