How can I decode the 'campaign' column below in PySpark? The values in this column are URL-encoded (percent-encoded) strings:
+--------------------+------------------------+
|user_id             |campaign                |
+--------------------+------------------------+
|alskd9239as23093    |MM+%7C+Cons%C3%B3rcios+%|
|lfifsf093039388     |Aquisi%C3%A7%C3%A3o+%7C |
|kasd877191kdsd999   |Aquisi%C3%A7%C3%A3o+%7C |
+--------------------+------------------------+
I know that it is possible to do this with the urllib library in Python. However, my dataset is large and it takes too long to convert it to a pandas DataFrame. How can I do this directly on a Spark DataFrame?
There is no need to convert to an intermediate pandas DataFrame. You can use a PySpark user-defined function (UDF) to unquote the quoted strings:
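from urllib.parse import unquote
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Wrap urllib's unquote in a UDF that returns a string column
df.withColumn('campaign', F.udf(unquote, StringType())('campaign'))
If there are null values in the campaign column, you have to do a null check before unquoting the strings, because unquote raises an error when it is given None:
# Leave null (and empty) values unchanged instead of passing them to unquote
f = lambda s: unquote(s) if s else s
df.withColumn('campaign', F.udf(f, StringType())('campaign')).show()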
+-----------------+-----------------+
|          user_id|         campaign|
+-----------------+-----------------+
| alskd9239as23093|MM+|+Consórcios+%|
|  lfifsf093039388|      Aquisição+||
|kasd877191kdsd999|      Aquisição+||
+-----------------+-----------------+
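Note that the output above still contains `+` characters: `unquote` only decodes percent escapes and leaves `+` untouched. If the campaign strings were form-encoded, so that `+` stands for a space (an assumption about this data, not something the question confirms), `urllib.parse.unquote_plus` may be the better fit. The following is a minimal plain-Python sketch of a null-safe decoder; the name `decode_campaign` and its `plus_as_space` flag are hypothetical and used only for illustration:

```python
from urllib.parse import unquote, unquote_plus


def decode_campaign(value, plus_as_space=True):
    """Null-safe URL decoding (hypothetical helper for illustration).

    Returns None unchanged so that null rows do not raise inside a UDF.
    """
    if value is None:
        return None
    # unquote_plus turns '+' into a space before decoding percent escapes;
    # unquote decodes percent escapes only and keeps '+' as-is.
    return unquote_plus(value) if plus_as_space else unquote(value)


print(decode_campaign("Aquisi%C3%A7%C3%A3o+%7C"))         # Aquisição |
print(decode_campaign("Aquisi%C3%A7%C3%A3o+%7C", False))  # Aquisição+|
print(decode_campaign(None))                              # None
```

A function like this can be wrapped in a UDF in the same way as the functions above. Because a Python UDF processes rows in the Python worker, it is worth testing on a sample of the data first to check whether the decoded values look as expected.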