python, pandas, dataframe, keyword-extraction

Rearrange rows based on column value


I have a DataFrame in which I would like to rearrange the data of a given column.

What I have:

    text                                                KEYWORD
0   Fetch.ai will transform economies, healthcare,...   supplies chain issues
1                                                       self
2                                                       secured key partnership
3                                                       real world challenge
4                                                       autonomous economic agent
5                                                       learning traffic signal
6                                                       autonomous machine learning
7                                                       disruptive ai tech
8                                                       parking issues
9                                                       traffic reduction
10      
11      
12  The two most popular cryptocurrencies on the p...   bitcoin
13                                                      limited supplies
14                                                      ethereum
    

What I would like:

    text                                                KEYWORD
0   Fetch.ai will transform economies, healthcare,...   supplies chain issues, self, secured key partnership, real world challenge, autonomous economic agent, learning traffic signal, autonomous machine learning, disruptive ai tech, parking issues, traffic reduction
1   The two most popular cryptocurrencies on the p...   bitcoin, limited supplies, ethereum

Each text is shown in the "text" column. The text has been analyzed, and the keywords extracted from it are shown in the "KEYWORD" column. The annoying part is that if 10 keywords are extracted from a text, 10 rows are created with one keyword per row. I would like to join all of these keywords into a single row (the one holding the corresponding text).

Unfortunately, I do not have access to the keyword extraction process, which was performed by a piece of software.


Solution

  • Try with groupby:

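    import numpy as np
    
    # replace blank (empty or whitespace-only) cells with NaN
    df = df.replace(r"^\s*$", np.nan, regex=True)
    
    # drop rows that are all NaN, then forward fill so every keyword row carries its text
    df = df.dropna(how="all").ffill()
    
    # group by text and join each group's keywords into one string
    output = df.groupby("text", as_index=False)["KEYWORD"].agg(", ".join)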
    
    >>> output
                                                    text                                            KEYWORD
    0  Fetch.ai will transform economies, healthcare,...  supplies chain issues, self, secured key partn...
    1  The two most popular cryptocurrencies on the p...                bitcoin, limited supplies, ethereum
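
To check the approach end to end, the same steps can be wrapped in a function and run on a small stand-in frame. This is only a sketch under a few assumptions: the sample texts and keywords are shortened placeholders, blank cells are assumed to be empty strings, and `collapse_keywords` is a hypothetical helper name. It also forward fills only the `text` column and passes `sort=False`, so a row with a missing keyword is skipped and the texts keep their original order instead of being sorted alphabetically.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame. Assumption: blank cells are empty strings,
# and the texts/keywords are shortened placeholders.
df = pd.DataFrame({
    "text": ["Fetch.ai will transform economies", "", "", "", "",
             "The two most popular cryptocurrencies", "", ""],
    "KEYWORD": ["supplies chain issues", "self", "parking issues", "", "",
                "bitcoin", "limited supplies", "ethereum"],
})


def collapse_keywords(frame: pd.DataFrame) -> pd.DataFrame:
    """Join the keywords that belong to each text into one comma-separated row.

    Hypothetical helper: the name and the column names are taken from the
    question, not from any library.
    """
    # Treat empty or whitespace-only cells as missing values.
    cleaned = frame.replace(r"^\s*$", np.nan, regex=True)
    # Drop the fully empty separator rows.
    cleaned = cleaned.dropna(how="all")
    # Copy each text down onto the keyword rows that follow it.
    cleaned = cleaned.assign(text=cleaned["text"].ffill())
    # Skip any row that has a text but no keyword.
    cleaned = cleaned.dropna(subset=["KEYWORD"])
    # sort=False keeps the texts in their order of first appearance.
    grouped = cleaned.groupby("text", as_index=False, sort=False)
    return grouped["KEYWORD"].agg(", ".join)


output = collapse_keywords(df)
print(output)
```

Note that grouping by `text` merges two blocks if the exact same text appears twice in the frame; if that can happen in your data, group on a block counter instead of the text itself.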