pythonsqlpython-polars

Convert the values in a SQL array by SELECTing from another table?


I have a table "Data" containing arrays of FOOs, and a separate table "Lookup" where I can find the BAR for each FOO. I want to write a SQL query which returns the Data table, but with arrays of the BARs that correspond to the FOOs in the original.

I appreicate it may be an awkward table structure from a SQL perspective, and I won't go into why it is this way. I just want to know if what I want is technically possible.

As an example, the "Data" table might contain arrays of cities, and I want arrays of countries.

Data

Name Cities
'Mark' ['Manchester', 'Paris']
'Aisha' ['London', 'Munich']

Lookup

Country Capital_city Other_cities
'Britain' 'London' ['Manchester', 'Bristol']
'France' 'Paris' ['Bordeaux', 'Marseilles']
'Germany' 'Berlin' ['Munich', 'Cologne']

Desired output

Name Cities
'Mark' ['Britain', 'France']
'Aisha' ['Britain', 'Germany']

What I have tried so far

I have a working query which can find the country for a given city @City:

SELECT Country
FROM Lookup
WHERE (
    Capital_City = @City
    OR @City IN (SELECT Cities FROM Lookup)
)

What I can't figure out is: how to do this for every single item in every array in the Cities column of my Data table?

Note: I'm using the SQL API from the python Polars module. It seems to generally support core SQL features.


Solution

  • As far as I know, what you want to do is technically possible in SQL, but handling arrays in SQL can be tricky though. Since you’re using Polars’ SQL API, you can make use of Polars’ explode() function in combination with SQL joins to transform your array-based data structure.

    Polars SQL supports array handling so that you can use UNNEST or explode in Python to flatten arrays before joining. So, for a SQL query, you can try this:

    WITH Exploded AS (
        SELECT Name, UNNEST(Cities) AS City
        FROM Data
    ),
    Matched AS (
        SELECT e.Name, l.Country
        FROM Exploded e
        JOIN Lookup l 
        ON e.City = l.Capital_City OR e.City = ANY(l.Other_cities)
    )
    SELECT Name, ARRAY_AGG(DISTINCT Country) AS Countries
    FROM Matched
    GROUP BY Name;
    

    If you intend to use Polars in Python, you can use Polars’ DataFrame API instead, like this:

    import polars as pl
    
    # sample DF :)
    data = pl.DataFrame({
        "Name": ["Mark", "Aisha"],
        "Cities": [["Manchester", "Paris"], ["London", "Munich"]]
    })
    
    lookup = pl.DataFrame({
        "Country": ["Britain", "France", "Germany"],
        "Capital_City": ["London", "Paris", "Berlin"],
        "Other_cities": [["Manchester", "Bristol"], ["Bordeaux", "Marseilles"], ["Munich", "Cologne"]]
    })
    
    data_exploded = data.explode("Cities")
    
    lookup_exploded = lookup.explode("Other_cities")
    
    # join on Capital_City and Other_cities
    matched = (
        data_exploded.join(lookup, left_on="Cities", right_on="Capital_City", how="left")
        .select(["Name", "Country"])
        .fill_null("")
        .vstack(
            data_exploded.join(lookup_exploded, left_on="Cities", right_on="Other_cities", how="left")
            .select(["Name", "Country"])
            .fill_null("")
        )
        .filter(pl.col("Country") != "")
    )
    
    result = matched.group_by("Name").agg(pl.col("Country").unique())
    
    print(result)
    

    You should get something like:

    
    shape: (2, 2)
    ┌───────┬─────────────────────────┐
    │ Name  │ Country                 │
    ├───────┼─────────────────────────┤
    │ Mark  │ ['Britain', 'France']   │
    │ Aisha │ ['Britain', 'Germany']  │
    └───────┴─────────────────────────┘
    

    So use SQL if your engine supports UNNEST() for array operations and Use Polars’ .explode()method if SQL has limitations and of course ensure DISTINCT values when aggregating back into arrays.