python pandas

Split sentences in pandas into sentence number and words


I have a pandas dataframe like this:

Text            start    end    entity     value
I love apple      7       11    fruit      apple
I ate potato      6       11    vegetable  potato

I have tried using a for loop, but it runs slowly, and I don't think that is how pandas is meant to be used.

I want to create another pandas dataframe based on this one, like this:

Sentence#         Word        Tag
  1                I         Object 
  1               love       Object
  1               apple      fruit
  2                I         Object
  2               ate        Object
  2               potato     vegetable

The Text column should be split into words, each labeled with its sentence number. The entity word (the value column) keeps its entity as the tag; every other word is tagged as Object.


Solution

  • Use split, stack and map:

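    import pandas as pd

    # One word per row; index level 0 is the original row position
    u = df.Text.str.split(expand=True).stack()
    
    pd.DataFrame({
        'Sentence': u.index.get_level_values(0) + 1,  # 1-based sentence number
        'Word': u.values, 
        # Look up each word's entity; words without one become 'Object'
        'Entity': u.map(dict(zip(df.value, df.entity))).fillna('Object').values
    })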
    
       Sentence    Word     Entity
    0         1       I     Object
    1         1    love     Object
    2         1   apple      fruit
    3         2       I     Object
    4         2     ate     Object
    5         2  potato  vegetable
    

    Side note: If you are running pandas v0.24 or later, use .to_numpy() instead of .values.
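
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and swaps split(expand=True).stack() for str.split().explode(), which is a different technique from the one in the answer and assumes pandas 0.25 or later. The names word_to_tag and result are made up for illustration, and the output column is called Tag to match the layout requested in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": ["I love apple", "I ate potato"],
    "start": [7, 6],
    "end": [11, 11],
    "entity": ["fruit", "vegetable"],
    "value": ["apple", "potato"],
})

# Split each sentence into a list of words, then emit one row per word.
# The index still holds the position of the originating row.
words = df["Text"].str.split().explode()

# Hypothetical helper name: lookup from entity word to its entity label.
word_to_tag = dict(zip(df["value"], df["entity"]))

result = pd.DataFrame({
    "Sentence": words.index.to_numpy() + 1,  # 1-based sentence number
    "Word": words.to_numpy(),
    # Words that are not in the lookup fall back to 'Object'.
    "Tag": words.map(word_to_tag).fillna("Object").to_numpy(),
})

print(result)
```

As in the answer above, the lookup is global: a word listed as an entity in one row gets that tag wherever it appears in any sentence. If that matters for your data, you would need to compare each word against the value of its own row instead.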