python vaex

vaex apply does not work when using dataframe columns


I am trying to tokenize the first sentence of Wikipedia articles in order to find 'is a' patterns. For example, "Wellington is a town in the UK." becomes "town is a attr_root in the country." The next step would be to build n-grams from the tokens and the leftover text and use them to find common patterns.

For this I need to replace substrings in one string column using values from other string columns of the same dataframe. In pandas I can do this with:

df['Test'] = df.apply(lambda x: x['Name'].replace(x['Rep'], x['Sub']), axis=1)

but I cannot find the equivalent vaex method. This issue led me to believe, based on Maarten Breddels' example code, that this should be possible in vaex; however, when I try it I get the error below.

import pandas as pd
import vaex

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Rep": ["Braund", "Henry", "Miss."],
        "Sub": ["<surname>", "<name>", "<title>"],
    }
)
dfv = vaex.from_pandas(df)

def func(x, y, z):
    return x.replace(y, z)

dfv['Test'] = dfv.apply(func, arguments=[df.Name.astype('str'), df.Rep.astype('str'), df.Sub.astype('str')])

This gives:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\dataframe.py", line 455, in apply
    arguments = _ensure_strings_from_expressions(arguments)
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in _ensure_strings_from_expressions
    return [_ensure_strings_from_expressions(k) for k in expressions]
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 780, in <listcomp>
    return [_ensure_strings_from_expressions(k) for k in expressions]
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 782, in _ensure_strings_from_expressions
    return _ensure_string_from_expression(expressions)
  File "C:\Users\User\AppData\Roaming\Python\Python37\site-packages\vaex\utils.py", line 775, in _ensure_string_from_expression
    raise ValueError('%r is not of string or Expression type, but %r' % (expression, type(expression)))
ValueError: 0     Braund, Mr. Owen Harris
1    Allen, Mr. William Henry
2    Bonnell, Miss. Elizabeth
Name: Name, dtype: object is not of string or Expression type, but <class 'pandas.core.series.Series'>

How can I accomplish this in vaex?


Solution

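  • It turns out I had a bug: the call to apply passed columns of the pandas dataframe df instead of the vaex dataframe dfv, which is why vaex complained about receiving a pandas Series. Using dfv in place of df in the arguments list fixes it.

    I also got this faster method from the nice people at vaex.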

    import pyarrow as pa
    import pandas as pd
    import vaex
    
    df = pd.DataFrame(
        {
            "Name": [
                "Braund, Mr. Owen Harris",
                "Allen, Mr. William Henry",
                "Bonnell, Miss. Elizabeth",
            ],
            "Rep": ["Braund", "Henry", "Miss."],
            "Sub": ["<surname>", "<name>", "<title>"],
        }
    )
    dfv = vaex.from_pandas(df)
    
    
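    # Register the function so it is available as dfv.func.replacer
    @vaex.register_function()
    def replacer(x, y, z):
        # x, y, z arrive as whole chunks of column values, not single rows
        res = []
        for i, j, k in zip(x.tolist(), y.tolist(), z.tolist()):
            # Row-wise: replace the substring from Rep with the value from Sub
            res.append(i.replace(j, k))
        return pa.array(res)

    dfv['Test'] = dfv.func.replacer(dfv['Name'], dfv['Rep'], dfv['Sub'])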
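As a quick sanity check on either approach, the values the Test column should hold for the sample data can be worked out in plain Python, with no pandas or vaex involved. This is only a sketch of the row-wise logic inside replacer; the names names, reps, subs and expected are made up for illustration.

```python
# Same sample data as in the question, as plain Python lists
names = [
    "Braund, Mr. Owen Harris",
    "Allen, Mr. William Henry",
    "Bonnell, Miss. Elizabeth",
]
reps = ["Braund", "Henry", "Miss."]
subs = ["<surname>", "<name>", "<title>"]

# Row-wise replacement: for each row, swap the Rep substring for Sub
expected = [name.replace(rep, sub) for name, rep, sub in zip(names, reps, subs)]

for value in expected:
    print(value)
# <surname>, Mr. Owen Harris
# Allen, Mr. William <name>
# Bonnell, <title> Elizabeth
```

If dfv['Test'] does not contain these three strings, the column arguments are probably in the wrong order or coming from the wrong dataframe.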