pythonpysparkapache-spark-sql

How to pass a function argument of type Dataframe to a SparkSQL query


I have a spark.sql query that is wrapped inside a function. I want to pass the query a function argument that is a dataframe but getting some error. Can anyone see if I am doing something wrong?

Function:

1  def my_function(df_table: DataFrame) -> DataFrame:
2 
3    sql_query = f"""
4    SELECT DISTINCT dt.CountryId,
5    Cast(RIGHT(dt.RegionIdentifier, 2) as Integer) as RegionID
6    FROM {df_table} dt 
7    WHERE dt.CountryCode = 23
8    """
9
10   df = spark.sql(sql_query)
11   return df

and this is how I call it in a notebook:

df_table = spark.table('path_to_table/_location/')
my_function(df_table)

Error message I get:

[PARSE_SYNTAX_ERROR] Syntax error at or near '['. SQLSTATE: 42601

If i remove the {df_table} on LINE-6, and put the name of table hardcoded, it works. Is there a way to pass in the table name as Dataframe and have it passed as an arg?

and when I print the sql_query, it shows:

SELECT DISTINCT dt.CountryId,
Cast(RIGHT(dt.RegionIdentifier, 2) as Integer) as RegionID
FROM DataFrame[CountryId: bigint, RegionID: int, RegionIdentifier: string, TimeOf: timestamp] dt
WHERE dt.CountryCode = 23

Solution

  • To be able to access the DataFrame within a SQL query, register it as a temporary view using method DataFrame.createOrReplaceTempView:

    def my_function(df_table: DataFrame) -> DataFrame:
        temp_view_name = "some_temp_view"
        df_table.createOrReplaceTempView(temp_view_name)
    
        sql_query = f"""
        SELECT DISTINCT dt.CountryId,
        Cast(RIGHT(dt.RegionIdentifier, 2) as Integer) as RegionID
        FROM {temp_view_name} dt 
        WHERE dt.CountryCode = 23
        """
    
        df = spark.sql(sql_query)
        return df