I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL.
Let's say I have two tables, tableSrc and tableBuilder, and I'm creating tableDest.
I've been trying variants on:
SET myVar FLOAT = NULL
SELECT
    myVar = avg(myCol)
FROM tableSrc;
CREATE TABLE tableDest(
    refKey INT,
    derivedValue FLOAT
);
INSERT INTO tableDest
SELECT
    refKey,
    neededValue * myVar AS `derivedValue`
FROM tableBuilder;
Doing this in T-SQL is trivial, in a surprising win for Microsoft (DECLARE ... SELECT). Spark, however, throws:
Error in SQL statement: ParseException: mismatched input 'SELECT' expecting <EOF>(line 53, pos 0)
I can't seem to assign a derived value to a variable for reuse. I tried a few variants, but the closest I got was assigning the text of a SELECT statement to a variable, never its computed result.
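For reference, that dead end is Spark's config-style SET plus ${...} text substitution (the spark.sql.variable.substitute mechanism), which stores and splices strings but never evaluates them. A hypothetical sketch, where the name myVarSql is just for illustration:
SET myVarSql = SELECT avg(myCol) FROM tableSrc;
-- ${myVarSql} is replaced textually before parsing, so the statement runs,
-- but the avg(myCol) value itself is never captured anywhere
SELECT * FROM (${myVarSql}) AS t;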
Please note that this is being adapted from a fully functional T-SQL script, so I'd just as soon not split out the dozen or so SQL variables, compute each one with a separate PySpark query, and then insert {var1}, {var2}, etc. into a multi-hundred-line f-string. I know how to do this, but it would be messy, difficult, harder to read, slower to migrate, and worse to maintain, and I'd like to avoid it if at all possible.
September 2024 Update:
Databricks Runtime 14.1 and higher now properly supports variables.
-- DBR 14.1+
DECLARE VARIABLE dataSourceStr STRING = "foobar";
SELECT * FROM hive_metastore.mySchema.myTable WHERE dataSource = dataSourceStr;
-- Returns rows where the dataSource column is 'foobar'
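SET VAR (also DBR 14.1+) covers the derived-value case from the question, assigning the result of a scalar subquery to a declared variable. A minimal sketch, assuming the tableSrc/tableBuilder/tableDest tables from the question:
-- DBR 14.1+: capture a query result in a session variable, then reuse it
DECLARE VARIABLE myVar FLOAT DEFAULT NULL;
SET VAR myVar = (SELECT avg(myCol) FROM tableSrc);
INSERT INTO tableDest
SELECT
    refKey,
    neededValue * myVar AS `derivedValue`
FROM tableBuilder;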
Databricks has also released SQL user-defined functions, which can solve the same problem with no performance penalty. For your example it would look like:
CREATE TEMP FUNCTION myVar()
RETURNS FLOAT
LANGUAGE SQL
RETURN
    SELECT
        avg(myCol)
    FROM tableSrc;
And then, to use it:
SELECT
    refKey,
    neededValue * myVar() AS `derivedValue`
FROM tableBuilder;