I need to spot an invalid combination of values within a large data set. The demo below is a very small data set; however, it illustrates the priniple. (I use Redshift)
I receive a data set like with 3 columns: client_id, t_number, t_name.
Client_id, t_number, t_name are repeated many times within the data set (when a clients records a new transaction each of these is recorded as a new line in a file). t_number (transaction number - (an id of transaction)) and t_name (transaction name) should be an unique combination. A specific t_number should only match to a single t_name (these are duplicated data essentially as number and name should reflect the same activity).
The data set has no null values. There are more t_numbers than t_names. This suggests that there must be a t_name (set of t_names) with more than one t_number assigned.
The simplified data set is below:
create table tmp.test_v1 (
id varchar(5)
, t_number smallint
, t_name varchar(5));
Insert into tmp.test_v1 values
('id_1', 1, 'aaa')
,('id_2', 1, 'aaa')
,('id_3', 2, 'bbb')
,('id_4', 3, 'ccc')
,('id_5', 2, 'bbb')
,('id_6', 4, 'ddd')
,('id_3', 5, 'aaa')
,('id_1', 2, 'bbb')
,('id_2', 3, 'ccc')
,('id_4', 4, 'ddd')
,('id_7', 6, 'eee')
,('id_8', 6, 'eee')
,('id_9', 6, 'eee');
The t_name 'aaa' is associated with t_number: 1 and 5. This is a situation that i want to catch.
I came up with a following query:
select t_name, t_number from tmp.test_v1
where t_name = (select t_name from (
select a.t_name, b.t_number
from tmp.test_v1 as a
left join tmp.test_v1 as b on a.t_name = b.t_name
group by a.t_name, b.t_number
order by a.t_name) c
group by t_name
having count(t_name) > 1)
group by t_name, t_number;
The above query returns:
'---------------------'
| t_name | t_number |
'---------------------'
| aaa | 1 |
'---------------------'
| aaa | 5 |
'---------------------'
I can see that t_name 'aaa' has numbers 1,5 assigned to it.
Problem: this does not scale well. On a data set with 130 million records it seems to run endlessly.
What would be a more efficient way to arrive at the same/ similar result?
I think the join is taking all the time. You can avoid the join and maybe make it faster with:
select distinct t_name, t_number from test_v1 where t_name in (
select t_name
from ( select distinct t_name, t_number from test_v1 ) t1
group by t_name
having count(t_number)>1
)
Try it here: https://www.db-fiddle.com/f/vhwn1WKB7uV2yUZ4FhNPH/0