Consider 2 data flows below
1. Front End Box ----> S3 Bucket-1
2. Front End Box ----> Kafka --> Storm ---> S3 Bucket-2
The logs from the boxes are being transferred to S3 buckets. The requirement is to replace flow 1 by flow 2.
Now the data needs to validated between Bucket-1
and Bucket-2
to guarantee that flow 2 can be used.
Following technological options were tried :
1. Python : boto3 Apis
2. Qubole
Both work on a limited data set, qubole is more scalable than python script. But still it takes very long time to do it(never finished, had to kill after running overnight). We are looking at half billion entries here.
Query
SELECT
count(*)
FROM
TableA LEFT OUTER JOIN TableB
ON TableA.id = TableB.id
WHERE
TableB.id IS NULL
AND TableA.id IS NOT NULL
Question
Any suggestion for tools, ways to achieve this faster ?
Are there any ways to avoid the join ?
Finally was able to avoid the join. Following solution works fine
select sum_cat, count(*)
FROM
(
select id, sum(category) as sum_cat
from
(
select distinct id, 1 as category
from Table-1
UNION ALL
select distinct id, 1 as category
from Table-2
UNION ALL
select distinct id, 2 as category
from Table-3
UNION ALL
select distinct id, 2 as category
from Table-4
)all_ids
group by log_id
)a
group by sum_cat;
Explanation
Table-1
and Table-2
with that in Table-3
and Table-4
category
to id from these set of tablescategory = 1
and set B records have category = 2