python, apache-spark, pyspark, user-defined-functions

Creating/Registering a PySpark UDF and applying it to one column


I am a little confused about how to create a Spark UDF. I currently have a function parse_xml and do the following:

parse_xml_udf = spark.udf.register("parse_xml_udf", parse_xml)
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

where xml_df is the original Spark DataFrame and raw_xml is the column I want to apply the function to.

In a few places I have seen a line like spark_udf = udf(parse_xml, StringType()). What is the difference between this and the spark.udf.register line? Additionally, if I apply the function to that one column, is it applied to each row? In other words, should my UDF return the output for one single row?


Solution

  • udf(parse_xml, StringType()) wraps your Python function into a UDF object for the DataFrame API: you keep the returned object in a variable and call it inside withColumn or select, passing it a column. The second argument is the type of the column it produces.
  • spark.udf.register("parse_xml_udf", parse_xml) registers the function under a name so you can also call it from SQL, e.g. inside spark.sql(...) or selectExpr. It returns the same kind of UDF object, which is why capturing its return value (as in your snippet) lets you use it with the DataFrame API as well. If you only ever use withColumn, plain udf() is enough; register it when you also need it in SQL.
  • Yes, a UDF is applied to every row of the column: parse_xml is called once per row with that row's raw_xml value and should return the parsed result for that single row. Spark assembles the per-row results into the new parsed_xml column.
  • That's all, but these things are not always clearly explained in the manuals.
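
For completeness, here is a minimal runnable sketch of both styles. The parse_xml body and the sample data are placeholders made up for illustration, not your actual parser:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder parser: receives ONE row's raw_xml value, returns one string.
def parse_xml(raw_xml):
    return raw_xml.strip() if raw_xml is not None else None

xml_df = spark.createDataFrame([("<a>1</a>",), ("<b>2</b>",)], ["raw_xml"])

# Style 1: udf() wraps the function for use with the DataFrame API.
parse_xml_udf = udf(parse_xml, StringType())
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

# Style 2: register() exposes it to SQL and also returns a DataFrame-API UDF.
spark.udf.register("parse_xml_udf", parse_xml, StringType())
xml_df.createOrReplaceTempView("xml_table")
parsed_sql_df = spark.sql(
    "SELECT raw_xml, parse_xml_udf(raw_xml) AS parsed_xml FROM xml_table"
)

parsed_df.show()
parsed_sql_df.show()

The SQL route needs a temp view so the query can refer to the DataFrame by name; otherwise the two approaches produce the same parsed_xml column.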