azureazure-databricksdatabricks-sql

Databricks SQL Queries Taking Long Time to Execute without JOIN key/Field


The following query will take over one hour to execute:

SELECT DISTINCT

t.TradeAmount

FROM crm.NewTransaction t
LEFT OUTER JOIN drd.PriceHistoric PPH
  ON t.TradeDate = PPH.PriceDate

When I add the JOIN field t.TradeDate to the fields to be selected, the query runs in less than a second.

SELECT DISTINCT

t.TradeAmount
,t.TradeDate
FROM crm.NewTransaction t
LEFT OUTER JOIN drd.PMSPriceHistoric PPH
  ON t.TradeDate = PPH.PriceDate

There must me something fundamental about SQL queries that is causing this. Any thoughts on why just adding one of the fields (t.TradeDate) used to join the tables makes the query execute without any issues I am executing the code on Azure Databricks


Solution

  • I do agree with @HABO that it is due to the distinct query.

    The tables that you are using in the two queries are different and in the first query you are selecting only distinct of one column whereas in the second query you are doing it for 2 columns.

    This makes the difference between the two queries.

    The distinct will return the unique rows of selected columns. If you select multiple columns in the distinct, it will give unique rows across those columns.

    For example,

    You can see the time difference between the below results.

    enter image description here

    enter image description here

    This depends on the unique combination of rows in those columns. This is why your two queries are different to each other.