I have a Python file (main.py) executing in ADF, and another file (udf.py) containing the pandas UDFs. I am importing the udf.py functions into main.py like below:
from udf import func1, func2
When this executes in ADF, we get the error: CONTEXT_ONLY_VALID_ON_DRIVER
So, how can I pass the Spark context, or how else can I solve this issue?
Everything works fine when the code is kept in the same file as a plain script rather than modularized. I suspect it has something to do with the worker and driver nodes, but I need some light on this.
You are referencing the Spark context inside your UDF function (via the broadcast variable), so that reference gets serialized and shipped to the worker nodes, where the context is not valid. Instead of referencing or passing the Spark object into your UDF module, create a Spark session in udf.py itself with SparkSession.builder.getOrCreate() and use that.
Here is the sample code I have tried in udf.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# getOrCreate() returns the existing driver session; no Spark object
# needs to be passed in from main.py
spark = SparkSession.builder.getOrCreate()

def addone(data, schema):
    udfAddOne = udf(lambda x: x + 1, IntegerType())
    df = spark.createDataFrame(data, schema=schema).withColumn("number_plus_one", udfAddOne(col("id")))
    return df