I have a table "Data" containing arrays of FOOs, and a separate table "Lookup" where I can find the BAR for each FOO. I want to write a SQL query which returns the Data table, but with arrays of the BARs that correspond to the FOOs in the original.
I appreicate it may be an awkward table structure from a SQL perspective, and I won't go into why it is this way. I just want to know if what I want is technically possible.
As an example, the "Data" table might contain arrays of cities, and I want arrays of countries.
Name | Cities |
---|---|
'Mark' |
['Manchester', 'Paris'] |
'Aisha' |
['London', 'Munich'] |
Country | Capital_city | Other_cities |
---|---|---|
'Britain' |
'London' |
['Manchester', 'Bristol'] |
'France' |
'Paris' |
['Bordeaux', 'Marseilles'] |
'Germany' |
'Berlin' |
['Munich', 'Cologne'] |
Name | Cities |
---|---|
'Mark' |
['Britain', 'France'] |
'Aisha' |
['Britain', 'Germany'] |
I have a working query which can find the country for a given city @City
:
SELECT Country
FROM Lookup
WHERE (
Capital_City = @City
OR @City IN (SELECT Cities FROM Lookup)
)
What I can't figure out is: how to do this for every single item in every array in the Cities column of my Data table?
Note: I'm using the SQL API from the python Polars module. It seems to generally support core SQL features.
As far as I know, what you want to do is technically possible in SQL, but handling arrays in SQL can be tricky though. Since you’re using Polars’ SQL API
, you can make use of Polars’ explode()
function in combination with SQL joins
to transform your array-based data structure.
Polars SQL
supports array handling so that you can use UNNEST
or explode in Python to flatten arrays before joining. So, for a SQL query, you can try this:
WITH Exploded AS (
SELECT Name, UNNEST(Cities) AS City
FROM Data
),
Matched AS (
SELECT e.Name, l.Country
FROM Exploded e
JOIN Lookup l
ON e.City = l.Capital_City OR e.City = ANY(l.Other_cities)
)
SELECT Name, ARRAY_AGG(DISTINCT Country) AS Countries
FROM Matched
GROUP BY Name;
If you intend to use Polars
in Python
, you can use Polars’ DataFrame API
instead, like this:
import polars as pl
# sample DF :)
data = pl.DataFrame({
"Name": ["Mark", "Aisha"],
"Cities": [["Manchester", "Paris"], ["London", "Munich"]]
})
lookup = pl.DataFrame({
"Country": ["Britain", "France", "Germany"],
"Capital_City": ["London", "Paris", "Berlin"],
"Other_cities": [["Manchester", "Bristol"], ["Bordeaux", "Marseilles"], ["Munich", "Cologne"]]
})
data_exploded = data.explode("Cities")
lookup_exploded = lookup.explode("Other_cities")
# join on Capital_City and Other_cities
matched = (
data_exploded.join(lookup, left_on="Cities", right_on="Capital_City", how="left")
.select(["Name", "Country"])
.fill_null("")
.vstack(
data_exploded.join(lookup_exploded, left_on="Cities", right_on="Other_cities", how="left")
.select(["Name", "Country"])
.fill_null("")
)
.filter(pl.col("Country") != "")
)
result = matched.group_by("Name").agg(pl.col("Country").unique())
print(result)
You should get something like:
shape: (2, 2)
┌───────┬─────────────────────────┐
│ Name │ Country │
├───────┼─────────────────────────┤
│ Mark │ ['Britain', 'France'] │
│ Aisha │ ['Britain', 'Germany'] │
└───────┴─────────────────────────┘
So use SQL if your engine supports UNNEST()
for array operations and Use Polars’ .explode()
method if SQL has limitations and of course ensure DISTINCT
values when aggregating back into arrays.