I have the following tables and am trying to look up county codes for a list of several hundred thousand cities.
create table counties (
    zip_code_from char(5) not null,
    zip_code_thru char(5) not null,
    county_code char(3) not null
);

create table cities (
    city text not null,
    zip_code char(5) not null
);
My first approach was using a "between" in the join:
select
    ci.city, ci.zip_code, co.county_code
from
    cities ci
    join counties co on
        ci.zip_code between co.zip_code_from and co.zip_code_thru
I know that in the Oracle world this was frowned upon, and indeed the performance appears to be miserable: it takes over 8 minutes to process around 16,000 cities, and the counties (zip range) table has around 80,000 records. I'm guessing this syntax effectively degenerates into a cross join?
Both the from and thru codes are indexed, and I have control over the structures, so I can change the table if it helps.
My only other thought is to go ahead and expand the table out to all possible values -- something similar to this:
select
    generate_series(
        cast (zip_code_from as int),
        cast (zip_code_thru as int)
    ) as zip_code,
    *
from counties
This would expand the data to over 200,000 records, which isn't a big deal, but I wasn't sure if this is my only recourse to have queries that aren't horrible.
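If the expanded form does help, one option is to materialize it once instead of generating it per query. A sketch, assuming the expansion idea above; the table name counties_by_zip and the lpad padding are my additions, not part of the original schema:

```sql
-- Sketch: materialize the expanded ranges once, then index the lookup column.
-- lpad restores the leading zeros lost by the int cast, so the generated
-- value compares equal to the char(5) zip codes in cities.
create table counties_by_zip as
select
    lpad(generate_series(
        zip_code_from::int,
        zip_code_thru::int
    )::text, 5, '0') as zip_code,
    county_code
from counties;

create index on counties_by_zip (zip_code);
```

With that in place the lookup becomes a plain equality join on zip_code, which the index can serve.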
I'm guessing that even doing that on the fly, without indexes, would be preferable to the between in my join, but I was hoping there is an alternative, either in my SQL or in the structure of the table itself.
I've seen this question posted for other DBMS platforms, but I've been able to pull off mini-miracles with PostgreSQL that weren't possible (or practical) in other databases, so I was hopeful there is something I've missed.
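One PostgreSQL-specific option worth noting (a sketch; the zip_range column and index names are illustrative, not from the original schema) is to store each range as an int4range and index it with GiST, which lets the containment join use the index instead of scanning every pair:

```sql
-- Sketch: represent each zip range as an int4range with inclusive bounds,
-- and index it with GiST so the @> (contains) operator is index-assisted.
alter table counties
    add column zip_range int4range;

update counties
    set zip_range = int4range(zip_code_from::int, zip_code_thru::int, '[]');

create index counties_zip_range_idx on counties using gist (zip_range);

select ci.city, ci.zip_code, co.county_code
from cities ci
join counties co
    on co.zip_range @> ci.zip_code::int;
```

This keeps one row per range (no expansion) while still avoiding the pairwise between comparison.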
Months later, this has reared its head again, and I decided to test some of my theories.
The original query:
select
    ci.city, ci.zip_code, co.county_code
from
    cities ci
    join counties co on
        ci.zip_code between co.zip_code_from and co.zip_code_thru
Does in fact execute as a cartesian product: a nested loop that compares every city against every range. The query returns 34,000 rows and takes 597 seconds.
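The plan is easy to verify directly; running the query under EXPLAIN ANALYZE should show a Nested Loop node whose inner side is evaluated once per outer row (the expected output is my assumption, so I show only the command, not a captured plan):

```sql
-- Ask the planner what it actually did with the range join.
explain (analyze, buffers)
select ci.city, ci.zip_code, co.county_code
from cities ci
join counties co
    on ci.zip_code between co.zip_code_from and co.zip_code_thru;
```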
If I "pre-explode" the zip code ranges into discrete records:
with exploded_zip as (
    select
        -- lpad restores the leading zeros lost by the int cast, so the
        -- generated value matches the char(5) zip codes in cities
        lpad(generate_series(
            cast (zip_code_from as int),
            cast (zip_code_thru as int)
        )::text, 5, '0') as zip_code,
        *
    from counties
)
select
    ci.city, ci.zip_code, co.county_code
from
    cities ci
    join exploded_zip co on
        ci.zip_code = co.zip_code
The query returns the exact same rows but finishes in 2.8 seconds.
So it seems the bottom line is that using a between (or any inequality) in a join condition is a really bad idea.