python sql pandas multiple-tables

How to query a table using a WHERE clause built from another table's data?


I can work on this problem using either pandas or SQL queries.

Say I have two tables or DataFrames. The first describes observations, so to speak. It is indexed by the id I am interested in (id1) and has a set of further columns:

     category    year
id1
1    A           2016
1    B           2016
1    C           2016
2    A           2017
2    B           2017

Furthermore, we have a table that serves as our base population, something like this:

     category    year
id2
0    A           2014
1    B           2016
2    C           2017
3    A           2017
4    B           2014
5    C           2017
6    A           2018
7    B           2017

I want to use the values in each row of the first table as a condition to select elements of the second table. For example, id1 "1" has 3 descriptions:

{A, 2016},  {B, 2016}, {C, 2016}

I want to create a condition out of those values that reads like this:

((category = A) or (category = B) or (category = C)) and (year > 2016)

(The year is always the same for each id1.) For each id1, I want to count all elements of the population that fulfill the condition derived from that id1's rows in the observations.

What I want at the end of the day:

     count
id1
1    6
2    3

There are 6 elements of the population that fulfill the requirements of the observations with id1 "1" (all elements that are in category A, B, or C and are newer than 2016).

My idea is to use a conditional subselect, or to join the tables and then filter the rows, but I am stuck.


Solution

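  • This problem can be solved with a SQL join and GROUP BY. Note that the expected counts in the question (6 and 3) correspond to an inclusive comparison (>=); with a strict > the same data gives 5 and 1, so use whichever operator matches your intent.

    SELECT o.id1, COUNT(*)
    FROM observations o
    JOIN population p
      ON p.category = o.category AND p.year >= o.year
    GROUP BY o.id1;

    In the sample data each category appears only once per id1, so no population row is counted twice. If duplicates can occur, use COUNT(DISTINCT p.id2) instead.

    If you are working with pandas DataFrames, the equivalent looks like this:

    import pandas as pd

    # id1 is the index of df1 in the question; make it a column so it survives the merge
    obs = df1.reset_index()
    # Pair every observation with the population rows of the same category
    merged = pd.merge(obs, df2, on='category', suffixes=('_obs', '_pop'))
    # Keep pairs where the population year is at least the observation year
    filtered = merged[merged['year_pop'] >= merged['year_obs']]
    # Count the matching population rows per id1
    result = filtered.groupby('id1').size().reset_index(name='count')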
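
To check the join approach against the sample data from the question, here is a minimal self-contained sketch that runs both the pandas and the SQL variant. The table names `observations` and `population`, the in-memory SQLite database, and the variable names are assumptions made for illustration. The comparison is inclusive (>=) because that is what reproduces the counts the question expects.

```python
import sqlite3

import pandas as pd

# Sample rows copied from the question: (id, category, year).
obs_rows = [(1, "A", 2016), (1, "B", 2016), (1, "C", 2016),
            (2, "A", 2017), (2, "B", 2017)]
pop_rows = [(0, "A", 2014), (1, "B", 2016), (2, "C", 2017), (3, "A", 2017),
            (4, "B", 2014), (5, "C", 2017), (6, "A", 2018), (7, "B", 2017)]

observations = pd.DataFrame(obs_rows, columns=["id1", "category", "year"])
population = pd.DataFrame(pop_rows, columns=["id2", "category", "year"])

# pandas: join on category, keep pairs with population year >= observation year,
# then count the surviving rows per id1.
merged = observations.merge(population, on="category", suffixes=("_obs", "_pop"))
matches = merged[merged["year_pop"] >= merged["year_obs"]]
pandas_counts = matches.groupby("id1").size().to_dict()

# SQL: the same join, run against an in-memory SQLite database (an assumption;
# any SQL engine with standard JOIN / GROUP BY should behave the same way).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (id1 INTEGER, category TEXT, year INTEGER)")
conn.execute("CREATE TABLE population (id2 INTEGER, category TEXT, year INTEGER)")
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", obs_rows)
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", pop_rows)
query = """
    SELECT o.id1, COUNT(*)
    FROM observations o
    JOIN population p
      ON p.category = o.category AND p.year >= o.year
    GROUP BY o.id1
"""
sql_counts = dict(conn.execute(query).fetchall())
conn.close()

print(pandas_counts)
print(sql_counts)
```

On this data both variants give 6 for id1 1 and 3 for id1 2. Replacing >= with > in either variant gives 5 and 1 instead, because the population rows whose year equals the observation year drop out.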