python bert-language-model topic-modeling

Topic modeling from quotes


I scraped quotes from the following site: quotes

using the code below (the site relies on JavaScript, so I disabled it first):

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

browser = webdriver.Chrome()
browser.get("https://quotes.toscrape.com/")

# Collect the elements with class "author" and class "text" (the quotes)
author_elems = browser.find_elements(By.CLASS_NAME, "author")
quote_elems = browser.find_elements(By.CLASS_NAME, "text")

authors = [elem.text for elem in author_elems]
quotes = [elem.text for elem in quote_elems]
print(authors)
print(quotes)

# Save authors and quotes side by side in a CSV file
author_sayings = pd.DataFrame({"Author": authors, "Quotes": quotes})
author_sayings.to_csv("quotes.csv", index=False)
print(author_sayings.head())
browser.quit()

This saved the authors and their quotes to a CSV file, which I then read back as shown below:

import pandas as pd
from bertopic import BERTopic

model = BERTopic()

summarization = []
data = pd.read_csv("quotes.csv")
print(data.head())
for index, row in data.iterrows():
    topics, probs = model.fit_transform([row['Quotes']])
    print(topics)

Here is the output of data.head():

            Author                                             Quotes
0  Albert Einstein  “The world as we have created it is a process ...
1     J.K. Rowling  “It is our choices, Harry, that show what we t...
2  Albert Einstein  “There are only two ways to live your life. On...
3      Jane Austen  “The person, be it gentleman or lady, who has ...
4   Marilyn Monroe  “Imperfection is beauty, madness is genius and...

Additionally, I want to use the BERTopic model to detect topics in these quotes, following this site: topic modeling

However, my code raises the following error:

ValueError: Transform unavailable when model was fit with only a single data sample.

Could you please help me fix it? How can I detect the topics present in these sentences?


Solution

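  • You should train on all quotes at once, not one by one. fit_transform finds topics by grouping many documents together, so a call that receives a single quote has nothing to group, which is what the error reports. So instead of

    for index, row in data.iterrows():
        topics, probs = model.fit_transform([row['Quotes']])
        print(topics)

    try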

    topics, probs = model.fit_transform(data['Quotes'].tolist())
    data['Topic'] = topics
    data['Probability'] = probs
    print(data.head())
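
As a rough sketch of the same idea without pandas, the quotes can be collected into one list and checked before the model ever sees them. The helper names prepare_documents, load_documents and fit_topics, and the min_docs parameter, are made up for this illustration and are not part of BERTopic. The only BERTopic calls used are BERTopic() and fit_transform, as in the snippet above, and the import sits inside fit_topics so the loading code runs even where bertopic is not installed.

```python
import csv


def prepare_documents(texts, min_docs=2):
    """Strip whitespace, drop empty entries, and check the corpus size."""
    docs = [text.strip() for text in texts if text and text.strip()]
    if len(docs) < min_docs:
        raise ValueError(
            f"Need at least {min_docs} documents to fit a topic model, got {len(docs)}"
        )
    return docs


def load_documents(csv_path, column="Quotes"):
    """Read one text column of a CSV file into a cleaned list of documents."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        texts = [row.get(column) for row in csv.DictReader(handle)]
    return prepare_documents(texts)


def fit_topics(docs):
    """Fit BERTopic once on the whole corpus (assumes bertopic is installed)."""
    from bertopic import BERTopic  # lazy import: only needed when fitting

    model = BERTopic()
    topics, probs = model.fit_transform(docs)  # one call with every document
    return model, topics, probs
```

Typical use would be docs = load_documents("quotes.csv") followed by model, topics, probs = fit_topics(docs). Note that the min_docs=2 guard only catches the single-document case from the question. A single scraped page of quotes is still a very small corpus, so the fit may fail or assign most quotes to the outlier topic -1 until more pages are scraped.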