Suppose we have a DB table with fields
"id", "category", "subcategory", "brand", "name", "description", etc.
What's a good way of creating separate tables for
category
, subcategory
and brand
and the corresponding columns and rows in the original table becoming foreign key references?
To outline the operations involved:
In this case, the PostgreSQL DB is accessed via Sequel in a Ruby app, so available interfaces are the command line, Sequel, PGAdmin, etc...
The question: how would you do this?
-- Some test data
CREATE TABLE animals
( id SERIAL NOT NULL PRIMARY KEY
, name varchar
, category varchar
, subcategory varchar
);
INSERT INTO animals(name, category, subcategory) VALUES
( 'Chimpanzee' , 'mammals', 'apes' )
,( 'Urang Utang' , 'mammals', 'apes' )
,( 'Homo Sapiens' , 'mammals', 'apes' )
,( 'Mouse' , 'mammals', 'rodents' )
,( 'Rat' , 'mammals', 'rodents' )
;
-- [empty] table to contain the "squeezed out" domain
CREATE TABLE categories
( id SERIAL NOT NULL PRIMARY KEY
, category varchar
, subcategory varchar
, UNIQUE (category,subcategory)
);
-- The original table needs a "link" to the new table
ALTER TABLE animals
ADD column category_id INTEGER -- NOT NULL
REFERENCES categories(id)
;
-- FK constraints are helped a lot by a supportive index.
CREATE INDEX animals_categories_fk ON animals (category_id);
-- Chained query to:
-- * populate the domain table
-- * initialize the FK column in the original table
WITH ins AS (
INSERT INTO categories(category, subcategory)
SELECT DISTINCT a.category, a.subcategory
FROM animals a
RETURNING *
)
UPDATE animals ani
SET category_id = ins.id
FROM ins
WHERE ins.category = ani.category
AND ins.subcategory = ani.subcategory
;
-- Now that we have the FK pointing to the new table,
-- we can drop the redundant columns.
ALTER TABLE animals DROP COLUMN category, DROP COLUMN subcategory;
-- show it to the world
SELECT a.*
, c.category, c.subcategory
FROM animals a
JOIN categories c ON c.id = a.category_id
;
Note: the fragment:
WHERE ins.category = ani.category AND ins.subcategory = ani.subcategory
will lead to problems if these columns contain NULLs. It would be better to compare them using
(ins.category,ins.subcategory) IS NOT DISTINCT FROM (ani.category,ani.subcategory)