python pandas pdf textmining

Matching a list of sentences (tokens with nltk) with a column in pandas dataframe


I'm new to Python and still struggling with the basics, but I need to get this sorted out, and any help would be greatly appreciated. I have a long dataframe with hundreds of rows, each holding the text extracted from a specific PDF page of a medical exam; each row is a different person.

I successfully extracted the text (using pymupdf), iterating over each row, cleaned it as much as I could, and ended up with a dataframe similar to the one below, with multiple rows and a column of sentences obtained using nltk sent_tokenize.

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame({"text":["hello, this is a sentence. the sun shines. the night is beautiful",
              "the sun shines",
              "the night is beautiful. tomorrow i work"]})

df["token"] = df["text"].apply(sent_tokenize)

The last part of my task is to match specific sentences from a list of medical phrases (specific to the exam) against those in my dataframe and keep only the matches, for example in a new column. For that, I found the thread "Looping through list and row for keyword match in pandas dataframe"; @furas's solution there is clean and looked like it would do the job. So, in the end, I have a pandas column of sentences (nltk tokens) and a list of medical phrases (also nltk tokens), and I need to match them.

specific_sent = "the sun shines. hello, this is a sentence."
query = sent_tokenize(''.join(specific_sent))

df["query_match"] = df["token"].str.contains(query) 
df["word"] = df["token"].str.extract('({})'.format(query))

When I run this code, I get the error "TypeError: unhashable type: 'list'", which is not uncommon and which I have some understanding of, but I'm struggling to overcome it. Any help on how to overcome this error in this particular example, and on ways to prevent it in the future, is really appreciated. Thanks!

This is an example of desired output:

text | token | query_match | word
hello, this is a sentence. the sun shines. the night is beautiful | [hello, this is a sentence., the sun shines., the night is beautiful] | True | the sun shines., hello, this is a sentence.
the sun shines. | [the sun shines.] | True | the sun shines.
the night is beautiful. tomorrow i work | [the night is beautiful., tomorrow i work.] | False | NaN

Solution

  • After tokenizing each row of the DataFrame and the specific sentences, you have lists of sentences, so you can take the elements they have in common to build the word column. You can then fill the query_match column by checking whether each resulting list of common elements is empty.

    import pandas as pd
    from nltk.tokenize import sent_tokenize
    
    df = pd.DataFrame({"text": ["hello, this is a sentence. the sun shines. the night is beautiful",
                                "the sun shines.",
                                "the night is beautiful. tomorrow i work"]})
    
    specific_sent = "the sun shines. hello, this is a sentence."
    # split the query text into a list of sentences
    query = sent_tokenize(specific_sent)
    
    # split each row's text into a list of sentences
    df["token"] = df["text"].apply(sent_tokenize)
    
    # keep the sentences each row has in common with the query
    # (set intersection, so the order of the matches is not guaranteed)
    df["word"] = df["token"].apply(lambda x: list(set(query).intersection(x)))
    
    # True if the row shares at least one sentence with the query, otherwise False
    df["query_match"] = df["word"].apply(bool)
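
As a rough sketch of where the error probably comes from, and of an order-preserving variant of the matching: by default `str.contains` treats its argument as a regular-expression pattern, and a list is not a valid pattern, which is most likely the source of the "unhashable type: 'list'" message. The example below assumes the sentences are already tokenized (the lists are written out by hand so it runs without downloading NLTK data), and the helper name `matched_sentences` is hypothetical.

```python
import re

import pandas as pd

# Hand-written stand-ins for the output of sent_tokenize (an assumption,
# so that this sketch does not need NLTK or its tokenizer data).
df = pd.DataFrame({
    "token": [
        ["hello, this is a sentence.", "the sun shines.", "the night is beautiful"],
        ["the sun shines."],
        ["the night is beautiful.", "tomorrow i work"],
    ]
})
query = ["the sun shines.", "hello, this is a sentence."]

# Passing a list where a regex pattern is expected raises a TypeError,
# which is likely what happens inside str.contains(query).
error_message = None
try:
    re.compile(query)
except TypeError as exc:
    error_message = str(exc)

# Membership tests against a set are fast; the list comprehension keeps
# the matches in the order they appear in each row.
query_set = set(query)


def matched_sentences(tokens):
    """Return the sentences in `tokens` that also occur in the query."""
    return [sentence for sentence in tokens if sentence in query_set]


df["word"] = df["token"].apply(matched_sentences)

# An empty list is falsy, so bool() gives real booleans, not strings.
df["query_match"] = df["word"].apply(bool)
```

A general way to avoid this kind of error is to check what each cell actually holds: the `.str` pattern methods expect one string pattern, whereas both the `token` column and `query` here are lists, so a per-row `apply` is a better fit.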