sqlamazon-web-servicesprestoorcamazon-athena

Super-slow Athena join query on low amount of data


I have two databases, each contains a table, which is stored in a single S3 file like: part-00000-77654909-37c7-4c9e-8840-b2838792f98d-c000.snappy.orc of size ~83MB.

I'm trying to execute a primitive left join query:

select b.id
from "db-b".tbl b
    left join "db-a".tbl a on (
        b.id = a.id
        or (
            b.externalid__c is not null
            and b.externalid__c = a.externalid__c
        )
        or (
            b.externalid__c is null
            and b.internalid__c = a.internalid__c
        )
    )
where a.id is null

And it timeouts after 30 minutes.

30 minutes is not enough to join two 83MB files? Why?


Solution

  • This can be solved by using multiple joins instead of single joining condition:

    select b.id
    from "db-b".tbl b
        left join "db-a".tbl a1 on b.id = a.id
        left join "db-a".tbl a2 on b.externalid__c is not null
                and b.externalid__c = a.externalid__c
        left join "db-a".tbl a3 on b.externalid__c is null
                and b.internalid__c = a.internalid__c
    where coalesce(a1.id, a2.id, a3.id) is null
    

    And it runs for a few seconds, just as you would expect.

    Not sure why, but here's a mentioning of some optimization difficulties of OR clause: https://dba.stackexchange.com/questions/308080/presto-left-join-using-multiple-operators