sql-server t-sql data-warehouse sql-data-warehouse

How to find unused rows in a dimension table

I have a dimension table in my database that has grown too large. With that I mean that is has too many records - over a million - because it grew at the same pace as the linked facts. This is mostly due to a bad design, and I'm trying to clean it up.

One of the things I try to do is to remove dimension records which are no longer used. The fact tables are regularly maintained and old snapshots are removed. Because the dimensions were not maintained like that, there are many rows in the table whose primary key value no longer appears in any of the linked fact tables anymore. All the fact tables have foreign key constraints.

Is there a way to locate table rows whose primary key value no longer appears in any of the tables which are linked with a foreign key constraint?

I tried writing a script to track this. Basically this:

select key from dimension 
where not exists (select 1 from fact1 where fk = pk) 
and not exists (select 1 from fact2 where fk = pk) 
and not exists (select 1 from fact3 where fk = pk)

But with a lot of linked tables this query dies after some time - at least, my management studio crashed. So I'm not sure if there are any other options.

Solution

we had to do something similar to this at one of my clients. The query, like yours with "not exists.... and not exists.... and not exists...." was taking ~22 hours to run before we change our strategy to handle this in ~20 minutes.

As Nsousa suggest, you have to split the query so SQL Server doesn't have to handle all data in one shot, having to unnecessarily use tempdb and all other things.

First, create new table with all keys in it. The reason to create this table is to not have to read the full table scan for every query, having more keys on a 8k page and to deal with a smaller and smaller set of keys after each delete.

create table DimensionkeysToDelete (Dimkey char(32) primary key nonclustered);
insert into DimensionkeysToDelete 
select key from dimension order by key;

Then, instead of deleting unused key, delete the keys that exists in facts table, beginning with the fact table that has the least numbers of rows. Make sure facts table have proper indexing for performance.

delete from DimensionkeysToDelete 
from DimensionkeysToDelete d 
inner join fact1 on f.fk = d.Dimkey;

delete from DimensionkeysToDelete 
from DimensionkeysToDelete d 
inner join fact2 on f.fk = d.Dimkey;

delete from DimensionkeysToDelete 
from DimensionkeysToDelete d 
inner join fact3 on f.fk = d.Dimkey;

Once all facts tables done, only unused keys remains in DimensionkeysToDelete. To answers your question, just perform a select on this table to get all unused key for that particular dimension, or join it with the dimension to get data.

But, from what I understand of your needs for cleaning up you warehouse, use this table to delete from the orignal dimension table. At this step, you might also want take some action for auditing purposes (ie: insert in an audit table 'Key ' + key + ' deleted on + convert(datetime, getdate(),121) + ' by script X'.... )

I think this can be optimize, take a look at the execution plan, but my client was happy with it so we didn't have to put much effort in it.