The following query will take over one hour to execute:
SELECT DISTINCT
t.TradeAmount
FROM crm.NewTransaction t
LEFT OUTER JOIN drd.PriceHistoric PPH
ON t.TradeDate = PPH.PriceDate
When I add the JOIN field t.TradeDate to the fields to be selected, the query runs in less than a second.
SELECT DISTINCT
t.TradeAmount
,t.TradeDate
FROM crm.NewTransaction t
LEFT OUTER JOIN drd.PMSPriceHistoric PPH
ON t.TradeDate = PPH.PriceDate
There must me something fundamental about SQL queries that is causing this. Any thoughts on why just adding one of the fields (t.TradeDate) used to join the tables makes the query execute without any issues I am executing the code on Azure Databricks
I do agree with @HABO that it is due to the distinct
query.
The tables that you are using in the two queries are different and in the first query you are selecting only distinct of one column whereas in the second query you are doing it for 2 columns.
This makes the difference between the two queries.
The distinct
will return the unique rows of selected columns. If you select multiple columns in the distinct, it will give unique rows across those columns.
For example,
You can see the time difference between the below results.
This depends on the unique combination of rows in those columns. This is why your two queries are different to each other.