I am using Snowflake database and ran this query to find total count, number of distinct records and difference:
select
(select count(*) from mytable) as total_count,
(select count(*) from (select distinct * from mytable)) as distinct_count,
(select count(*) from mytable) - (select count(*) from (select distinct * from mytable)) as duplicate_count
from mytable limit 1;
Result:
1,759,867
1,738,924
20,943 (duplicate_count)
But when try with the other approach (group ALL columns and find where count is > 1):
select count(*) from (
SELECT
a, b, c, d, e,
COUNT(*)
FROM
mytable
GROUP BY
a, b, c, d, e
HAVING
COUNT(*) > 1
)
I get 5,436
.
Why there is a difference in number of duplicates? (20,943
vs 5,436
)
Thanks.
Okay. Let's start from one simple example:
create table #test
(a int, b int, c int, d int, e int)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (5,4,3,2,1)
insert into #test values (5,4,3,2,1)
insert into #test values (1,1,1,1,1)
And try your subquery to understand what you will get:
SELECT
a, b, c, d, e,
COUNT(*)
FROM
#test
GROUP BY
a, b, c, d, e
HAVING
COUNT(*) > 1
Think about a while...
Dang Dang Dang Dang ~
a b c d e (No column name)
1 2 3 4 5 5
5 4 3 2 1 2
It will only return two rows because you used 'group by'. But it still counted the duplicate numbers for each a,b,c,d,e combinations.
If you want the total number of the duplicates, try this:
select sum(sub_count) from (
SELECT
a, b, c, d, e,
COUNT(*) - 1 as sub_count
FROM
#test
GROUP BY
a, b, c, d, e
HAVING
COUNT(*) > 1)a
You need to minus one in this case if I understand your original queries correctly. Correct me if I am wrong.