
How to decode a column in URL format?


Do you know how to decode the 'campaign' column below in PySpark? The values in this column are URL-encoded (percent-encoded) strings:

+--------------------+------------------------+
|user_id             |campaign                |
+--------------------+------------------------+
|alskd9239as23093    |MM+%7C+Cons%C3%B3rcios+%|
|lfifsf093039388     |Aquisi%C3%A7%C3%A3o+%7C |
|kasd877191kdsd999   |Aquisi%C3%A7%C3%A3o+%7C |
+--------------------+------------------------+

I know that it is possible to do this with the urllib library in Python. However, my dataset is large and converting it to a pandas DataFrame takes too long. How can I do this with a Spark DataFrame?


Solution

  • There is no need to convert to an intermediate pandas DataFrame. You can use a PySpark user-defined function (UDF) to unquote the percent-encoded strings:

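    from pyspark.sql import functions as F
    from urllib.parse import unquote
    
    # Wrap urllib's unquote in a UDF; 'string' is the return type given as a DDL string.
    # withColumn returns a new DataFrame, so keep the result.
    df = df.withColumn('campaign', F.udf(unquote, 'string')('campaign'))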
    

    If there are null values in the campaign column, add a null check before unquoting the strings:

    # Pass nulls (and empty strings) through unchanged; unquote(None) would raise a TypeError.
    f = lambda s: unquote(s) if s else s
    df = df.withColumn('campaign', F.udf(f, 'string')('campaign'))
    

    +-----------------+-----------------+
    |          user_id|         campaign|
    +-----------------+-----------------+
    | alskd9239as23093|MM+|+Consórcios+%|
    |  lfifsf093039388|      Aquisição+||
    |kasd877191kdsd999|      Aquisição+||
    +-----------------+-----------------+
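
Note that the output above still contains `+` characters, because `unquote` only decodes `%XX` escapes. If the `+` signs in your data stand for spaces (as in form-encoded query strings), `unquote_plus` from the same module may be the better fit. The following is a minimal sketch in plain Python of a null-safe decoder that supports both behaviours; `decode_campaign` and its `plus_as_space` parameter are hypothetical names introduced here for illustration, not part of the original answer:

```python
from urllib.parse import unquote, unquote_plus


def decode_campaign(value, plus_as_space=True):
    """Null-safe URL decoding (hypothetical helper, for illustration only)."""
    # Spark passes SQL NULL to a Python UDF as None; return it unchanged.
    if value is None:
        return None
    # unquote_plus also turns '+' into a space; unquote leaves '+' as-is.
    return unquote_plus(value) if plus_as_space else unquote(value)


print(decode_campaign("Aquisi%C3%A7%C3%A3o+%7C"))         # Aquisição |
print(decode_campaign("Aquisi%C3%A7%C3%A3o+%7C", False))  # Aquisição+|
print(decode_campaign(None))                              # None
```

Assuming this helper suits your data, it can be passed to `F.udf` in place of `f` above. Recent Spark releases may also offer a built-in `url_decode` SQL function that avoids the overhead of a Python UDF; check the documentation for your version before relying on it.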