pythonvalidationamazon-s3hivequbole

Comparing one day worth of data from S3 buckets faster


Consider 2 data flows below

1. Front End Box ----> S3 Bucket-1

2. Front End Box ----> Kafka --> Storm ---> S3 Bucket-2

The logs from the boxes are being transferred to S3 buckets. The requirement is to replace flow 1 by flow 2.

Now the data needs to validated between Bucket-1 and Bucket-2 to guarantee that flow 2 can be used.

Following technological options were tried :

1. Python : boto3 Apis
2. Qubole

Both work on a limited data set, qubole is more scalable than python script. But still it takes very long time to do it(never finished, had to kill after running overnight). We are looking at half billion entries here.

Query

SELECT
    count(*)
FROM
    TableA LEFT OUTER JOIN TableB
        ON TableA.id = TableB.id
WHERE
    TableB.id IS NULL
    AND TableA.id IS NOT NULL

Question

Any suggestion for tools, ways to achieve this faster ?

Are there any ways to avoid the join ?


Solution

  • Finally was able to avoid the join. Following solution works fine

    select sum_cat, count(*)
    FROM
    (
       select id, sum(category) as sum_cat 
       from 
       (
           select distinct id, 1 as category
           from Table-1
    
           UNION ALL 
    
           select distinct id, 1 as category
           from Table-2 
    
           UNION ALL 
    
           select distinct id, 2 as category
           from Table-3
    
           UNION ALL 
    
           select distinct id, 2 as category
           from Table-4
    
      )all_ids
       group by log_id
    )a
     group by sum_cat;
    

    Explanation

    1. The data comparison it to be done with the data from Table-1 and Table-2 with that in Table-3 and Table-4
    2. So we assign a category to id from these set of tables
    3. All the ids from set A will have category = 1 and set B records have category = 2
    4. Now we sum of the category values and group by ids. So when id is present in both sets it will have value 3. The ids present only in set A will have value 1, which are missing records.