python, apache-spark, pyspark, user-defined-functions

Creating/Registering a PySpark UDF and applying it to one column


I am a little confused about how to create a Spark UDF. I currently have a function parse_xml and do the following:

parse_xml_udf = spark.udf.register("parse_xml_udf", parse_xml)
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

where xml_df is the original Spark DataFrame and raw_xml is the column I want to apply the function to.

In a few places I have seen a line like spark_udf = udf(parse_xml, StringType()). What is the difference between this and the spark.udf.register line? Additionally, if I apply the function to that one column, is it applied to each row? In other words, should my UDF return the output for one single row?


Solution

  • udf(parse_xml, StringType()) wraps your Python function into a UDF object for the DataFrame API: you keep the returned object in a variable and call it inside withColumn or select, passing it a column. The second argument is the type of the column it produces.
  • spark.udf.register("parse_xml_udf", parse_xml) registers the function under a name so you can also call it from SQL, e.g. inside spark.sql(...) or selectExpr. It returns the same kind of UDF object, which is why capturing its return value (as in your snippet) lets you use it with the DataFrame API as well. If you only ever use withColumn, plain udf() is enough; register it when you also need it in SQL.
  • Yes, a UDF is applied to every row of the column: parse_xml is called once per row with that row's raw_xml value and should return the parsed result for that single row. Spark assembles the per-row results into the new parsed_xml column.
  • That's all, but these things are not always clearly explained in the manuals.
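
For completeness, here is a minimal runnable sketch of both styles. The parse_xml body and the sample data are placeholders made up for illustration, not your actual parser:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder parser: receives ONE row's raw_xml value, returns one string.
def parse_xml(raw_xml):
    return raw_xml.strip() if raw_xml is not None else None

xml_df = spark.createDataFrame([("<a>1</a>",), ("<b>2</b>",)], ["raw_xml"])

# Style 1: udf() wraps the function for use with the DataFrame API.
parse_xml_udf = udf(parse_xml, StringType())
parsed_df = xml_df.withColumn("parsed_xml", parse_xml_udf(xml_df["raw_xml"]))

# Style 2: register() exposes it to SQL and also returns a DataFrame-API UDF.
spark.udf.register("parse_xml_udf", parse_xml, StringType())
xml_df.createOrReplaceTempView("xml_table")
parsed_sql_df = spark.sql(
    "SELECT raw_xml, parse_xml_udf(raw_xml) AS parsed_xml FROM xml_table"
)

parsed_df.show()
parsed_sql_df.show()

The SQL route needs a temp view so the query can refer to the DataFrame by name; otherwise the two approaches produce the same parsed_xml column.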