I have a dataframe df with the column sld of type string, whose values are runs of consecutive characters with no space or delimiter. One library that can split such strings is wordninja:
E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
Using pandas_udf, I have:
@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted
However, it throws an error when I apply it to the column sld:
df1=df.withColumn('test', split_word(col('sld')))
TypeError: expected string or bytes-like object
What I tried:
I noticed that there is a similar problem with the well-known function split(), but the workaround there is to use string.str, as mentioned here. This doesn't work with wordninja.split.
Is there a workaround for this issue?
Edit: I think in a nutshell the issue is: the pandas_udf input is a pd.Series, while wordninja.split expects a string.
My df looks like this:
+-------------+
|sld          |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this"       |
|"that"       |
+-------------+
I want something like this:
+-------------+---------------------+
|sld          |test                 |
+-------------+---------------------+
|"hellofriend"|["hello","friend"]   |
|"restinpeace"|["rest","in","peace"]|
|"this"       |["this"]             |
|"that"       |["that"]             |
+-------------+---------------------+
Just use .apply to perform the computation on each element of the pandas Series, something like this:
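import pandas as pd
import wordninja
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType
@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
    splitted = x.apply(lambda s: wordninja.split(s))
    return splitted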
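
Since the body of the UDF is ordinary pandas, it can be checked on a plain Series before involving Spark at all. Below is a rough sketch of that idea, assuming only that pandas is installed. Note that toy_split is a hypothetical stand-in for wordninja.split (it is not part of wordninja) with a tiny hard-coded vocabulary so the result is deterministic, and the None handling is an extra assumption for rows where sld is null, which the question does not mention.

```python
import re

import pandas as pd


def toy_split(s):
    # Hypothetical stand-in for wordninja.split: like a regex-based splitter,
    # it only accepts a single str, not a whole Series.
    return re.findall(r"hello|friend|rest|in|peace|this|that", s)


def split_series(x: pd.Series, splitter=toy_split) -> pd.Series:
    # Call the scalar splitter once per element; pass nulls through as None
    # (assumption: null rows should stay null instead of raising).
    return x.apply(lambda s: splitter(s) if isinstance(s, str) else None)


sld = pd.Series(["hellofriend", "restinpeace", "this", None])

# Handing the whole Series to the scalar function reproduces the TypeError.
try:
    toy_split(sld)
    whole_series_error = None
except TypeError as exc:
    whole_series_error = str(exc)

# Element-wise application yields one list of words per row.
result = split_series(sld)

print(whole_series_error)
print(result.tolist())
```

In Spark, the same element-wise body would sit under @pandas_udf(ArrayType(StringType())) with wordninja.split passed as the splitter. Rows returned as None should then show up as null in the test column.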