sqlamazon-redshiftcomposite-key

Find an illegal combination of values (composite key) in a table


I need to spot an invalid combination of values within a large data set. The demo below is a very small data set; however, it illustrates the priniple. (I use Redshift)

I receive a data set like with 3 columns: client_id, t_number, t_name.

Client_id, t_number, t_name are repeated many times within the data set (when a clients records a new transaction each of these is recorded as a new line in a file). t_number (transaction number - (an id of transaction)) and t_name (transaction name) should be an unique combination. A specific t_number should only match to a single t_name (these are duplicated data essentially as number and name should reflect the same activity).

The data set has no null values. There are more t_numbers than t_names. This suggests that there must be a t_name (set of t_names) with more than one t_number assigned.

The simplified data set is below:

create table tmp.test_v1 (
id varchar(5)  
, t_number smallint  
, t_name varchar(5));

Insert into tmp.test_v1 values  
    ('id_1', 1, 'aaa')  
    ,('id_2', 1, 'aaa')  
    ,('id_3', 2, 'bbb')  
    ,('id_4', 3, 'ccc')  
    ,('id_5', 2, 'bbb')  
    ,('id_6', 4, 'ddd')  
    ,('id_3', 5, 'aaa')  
    ,('id_1', 2, 'bbb')  
    ,('id_2', 3, 'ccc')  
    ,('id_4', 4, 'ddd')  
    ,('id_7', 6, 'eee')  
    ,('id_8', 6, 'eee')  
    ,('id_9', 6, 'eee');

The t_name 'aaa' is associated with t_number: 1 and 5. This is a situation that i want to catch.

I came up with a following query:

 select t_name, t_number from tmp.test_v1  
    where t_name = (select t_name from (  
    select a.t_name, b.t_number  
    from tmp.test_v1 as a   
    left join tmp.test_v1 as b on a.t_name = b.t_name  
    group by a.t_name, b.t_number  
    order by a.t_name) c  
    group by t_name  
    having count(t_name) > 1)  
    group by t_name, t_number;

The above query returns:

'---------------------'  
| t_name  | t_number  |  
'---------------------'    
| aaa     |    1      |
'---------------------'   
| aaa     |    5      |        
'---------------------' 

I can see that t_name 'aaa' has numbers 1,5 assigned to it.

Problem: this does not scale well. On a data set with 130 million records it seems to run endlessly.

What would be a more efficient way to arrive at the same/ similar result?


Solution

  • I think the join is taking all the time. You can avoid the join and maybe make it faster with:

    select distinct t_name, t_number from test_v1 where t_name in (
      select t_name
      from ( select distinct t_name, t_number from test_v1 ) t1
      group by t_name
      having count(t_number)>1
    )
    

    Try it here: https://www.db-fiddle.com/f/vhwn1WKB7uV2yUZ4FhNPH/0