asp.netsqlsql-serverdatabaset-sql

Fastest Way to Count Distinct Values in a Column, Including NULL Values


The Transact-Sql Count Distinct operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1.

This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.

This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).

What is the most efficient way to do this?


Update After Testing

I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest

  1. Count/GroupBy (Cheran)
  2. CountDistinct+SubQuery (Ellis)
  3. dense_rank (Eriksson)
  4. Count+Max (Andriy)

Testing Results (in seconds):

   Method          20_Columns   5_Columns   1_Column (Large)   1_Column (Small)
1) Count/GroupBy      10.8          4.8            2.8               0.14       
2) CountDistinct      12.4          4.8            3                 0.7         
3) dense_rank        226           30              6                 4.33 
4) Count+Max          98.5         44             16                12.5        

Notes:

Thanks for the help and suggestions!


Solution

  • SELECT COUNT(*)
    FROM (SELECT ColumnName
          FROM TableName
          GROUP BY ColumnName) AS s;
    

    GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.