Table1
has roughly 50k rows, like this:
Id | Name | Dob | Device |
---|---|---|---|
1 | Bill | 09/30/1987 | bira1 |
2 | Kayle | 06/12/1967 | kale2 |
3 | Gill | 05/23/1976 | bira1 |
4 | Kane | 01/30/1963 | bira3 |
5 | Hena | 02/12/1978 | kale1 |
Table2
has 200 rows:
Type | Device | color | price |
---|---|---|---|
Apple | bira1 | Grey | 100$ |
Apple | bira1 | Green | 120$ |
Samsung | kale2 | silver | 200$ |
Samsung | kale1 | blue | 10$ |
Samsung | kale2 | black | 190$ |
Table2
doesn't have any Id
or a primary key. When I join the tables using column Device
, it usually doesn't return the right information, since one person can only have one device in Table1
, but there are many different colors for the same device and it is difficult to find which colors the person has.
How can I normalize Table2
?
It usually takes a long time to run the query.
You could add an id int identity primary key
to your Device
table. Then add a device_id int foreign key references Device(id)
to your People
table.
To populate it:
Device
that exists only once in Device
: use its generated id
to fill in the device_id
of all corresponding entries in People
Device
to represent a generic version of the device (without price nor color), and use its id
as the device_id
(until the people tell you which model exactly they bought)Then add a not null
constraint to device_id
, and drop your People.Device
column (you now have normalized it).
Technically:
-- Devices having only one variant are attributed to people:
update p
set p.DeviceId = d.Id
from People p
join (select Device, min(Id) Id from Device group by Device having count(*) = 1) d
on d.Device = p.Device;
-- Devices having many variants, with no possibility to know which one each user has, generate an additional "generic" variant that gets attributed to users:
insert into Device (Type, Device)
select Type, Device
from Device
group by Type, Device
having count(*) > 1
;
update p
set p.DeviceId = d.Id
from People p
join (select Device, Id from Device where color is null and price is null) d
on d.Device = p.Device
where p.DeviceId is null;
-- People with an unknown device get an ad-hoc one generated:
insert into Device (Device)
select distinct Device from People where DeviceId is null;
update People
set DeviceId = (select Id from Device where Device.Device = People.Device)
where DeviceId is null;
-- Now get rid of the redundant column, and enforce the constraint:
alter table People drop column Device;
alter table People alter column DeviceId int not null;
Which ends up with:
Type | Device | color | price | Id |
---|---|---|---|---|
Apple | bira1 | Grey | 100$ | 1 |
Apple | bira1 | Green | 120$ | 2 |
Samsung | kale2 | silver | 200$ | 3 |
Samsung | kale1 | blue | 10$ | 4 |
Samsung | kale2 | black | 190$ | 5 |
Apple | bira1 | null | null | 6 |
Samsung | kale2 | null | null | 7 |
null | bira3 | null | null | 8 |
Id | Name | Dob | DeviceId |
---|---|---|---|
1 | Bill | 1987-09-30 | 6 |
2 | Kayle | 1967-06-12 | 7 |
3 | Gill | 1976-05-23 | 6 |
4 | Kane | 1963-01-30 | 8 |
5 | Hena | 1978-02-12 | 4 |
(see it running on your dataset)