I am trying to tokenize the first sentence of Wikipedia articles in order to find 'is a' patterns. For example, "Wellington is a town in the UK." becomes "town is a attr_root in the country." The next step would be to build n-grams from the tokens and the leftover text, and use them to find common patterns.
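The n-gram step could be sketched roughly as follows. This is only an illustration: the helper name `ngrams` is hypothetical, and splitting on whitespace (with the final period written as its own token) is an assumption, not the tokenizer actually used.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed tokenization: plain whitespace split of the tagged sentence.
tokens = "town is a attr_root in the country .".split()

bigrams = ngrams(tokens, 2)

# Counting n-grams across many sentences would surface the common patterns;
# here there is only one sentence, so every bigram occurs once.
counts = Counter(bigrams)
```

With more sentences, `counts.most_common()` would list the most frequent patterns first.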
To do this, I need to replace string values in a string column using other string columns in the dataframe. In pandas I can do this using
df['Test'] = df.apply(lambda x: x['Name'].replace(x['Rep'], x['Sub']), axis=1)
but I cannot find the equivalent vaex method. This issue led me to believe that this should be possible in vaex, based on Maarten Breddels' example code; however, when I try it I get the error below.
import pandas as pd
import vaex

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)

def func(x, y, z):
    return x.replace(y, z)

dfv['Test'] = dfv.apply(func, arguments=[df.Name.astype('str'), df.Rep.astype('str'), df.Sub.astype('str')])
This gives:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\dataframe.py", line 455, in apply
    arguments = _ensure_strings_from_expressions(arguments)
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in _ensure_strings_from_expressions
    return [_ensure_strings_from_expressions(k) for k in expressions]
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in <listcomp>
    return [_ensure_strings_from_expressions(k) for k in expressions]
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 782, in _ensure_strings_from_expressions
    return _ensure_string_from_expression(expressions)
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 775, in _ensure_string_from_expression
    raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))
ValueError: 0 Braund, Mr. Owen Harris
1 Allen, Mr. William Henry
2 Bonnell, Miss. Elizabeth
Name: Name, dtype: object is not of string or Expression type, but <class 'pandas.core.series.Series'>
How can I accomplish this in vaex?
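It turns out I had a bug: the arguments in the call to apply needed to come from dfv instead of df. Passing columns of the pandas dataframe df hands apply pandas Series rather than vaex expressions, which is what the ValueError above complains about. The corrected call is
dfv['Test'] = dfv.apply(func, arguments=[dfv.Name.astype('str'), dfv.Rep.astype('str'), dfv.Sub.astype('str')])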
I also got this faster method from the nice people at vaex:
import pyarrow as pa
import pandas as pd
import vaex

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)

@vaex.register_function()
def replacer(x, y, z):
    res = []
    for i, j, k in zip(x.tolist(), y.tolist(), z.tolist()):
        res.append(i.replace(j, k))
    return pa.array(res)

dfv['Test'] = dfv.func.replacer(dfv['Name'], dfv['Rep'], dfv['Sub'])
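As a sanity check on what the Test column should contain, the same row-wise replacement can be written in plain Python, without vaex or pyarrow. This is only a sketch of the element-wise logic; the helper name `replace_rowwise` is hypothetical and not part of any library.

```python
def replace_rowwise(names, reps, subs):
    """Replace reps[i] with subs[i] inside names[i], one row at a time."""
    return [name.replace(rep, sub) for name, rep, sub in zip(names, reps, subs)]


# Same sample data as in the dataframe above.
names = [
    "Braund, Mr. Owen Harris",
    "Allen, Mr. William Henry",
    "Bonnell, Miss. Elizabeth",
]
reps = ["Braund", "Henry", "Miss."]
subs = ["<surname>", "<name>", "<title>"]

expected = replace_rowwise(names, reps, subs)
print(expected)
# ['<surname>, Mr. Owen Harris', 'Allen, Mr. William <name>', 'Bonnell, <title> Elizabeth']
```

Comparing this list against the values of the new column is a quick way to confirm that the vaex version behaves as intended.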