Tags: apache-spark, pyspark, split, user-defined-functions, pandas-udf

Apply wordninja.split() using pandas_udf


I have a DataFrame df with a string column sld that contains words concatenated with no space/delimiter between them. One of the libraries that can be used to split such strings is wordninja:

E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
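
For reference, here is that call outside Spark (a minimal check, assuming wordninja is installed, e.g. via pip install wordninja):

import wordninja

# Splits a concatenated string into its most probable word sequence
print(wordninja.split('culturetosuccess'))  # ['culture', 'to', 'success']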

Using pandas_udf, I have:

import wordninja
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted

However, it throws an error when I apply it to the column sld:

df1 = df.withColumn('test', split_word(col('sld')))

TypeError: expected string or bytes-like object

What I tried:

I noticed that there is a similar problem with the well-known split() function, but the workaround there is to go through the pandas .str accessor, as mentioned here. That doesn't work for wordninja.split, since .str only exposes pandas' built-in string methods; a sketch of that workaround follows.
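
For comparison, the referenced workaround looks roughly like this (a sketch; the name split_on_space is hypothetical):

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType
import pandas as pd

@pandas_udf(ArrayType(StringType()))
def split_on_space(x: pd.Series) -> pd.Series:
    # .str.split is vectorized over the whole Series; wordninja offers no
    # equivalent Series-level method, so this pattern doesn't carry over
    return x.str.split(' ')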

Is there any workaround for this issue?

Edit: I think, in a nutshell, the issue is this: the pandas_udf receives a pd.Series, while wordninja.split expects a single string.
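
The mismatch is easy to reproduce outside Spark (a minimal repro; the variable s is hypothetical and stands in for one batch of the column):

import pandas as pd
import wordninja

s = pd.Series(['hellofriend', 'restinpeace'])
wordninja.split(s)  # raises TypeError: expected string or bytes-like object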

My df looks like this:

+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+

I want something like this:

+-------------+---------------------+
|    sld      |         test        |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+

Solution

  • Just use .apply to perform the computation on each element of the pandas Series, something like this:

    @pandas_udf(ArrayType(StringType()))
    def split_word(x: pd.Series) -> pd.Series:
        # .apply calls wordninja.split once per string in the batch
        return x.apply(wordninja.split)
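
With that definition, the UDF can be applied exactly as attempted in the question (an end-to-end sketch, assuming an active SparkSession named spark and wordninja installed on the executors; the sample data is taken from the df shown above):

    from pyspark.sql.functions import col

    df = spark.createDataFrame(
        [('hellofriend',), ('restinpeace',), ('this',), ('that',)],
        ['sld'],
    )
    df1 = df.withColumn('test', split_word(col('sld')))
    df1.show(truncate=False)
    # Expected rows, matching the desired output above:
    # hellofriend -> [hello, friend]
    # restinpeace -> [rest, in, peace]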