I want to extract the username, post title, post time, and message content from a Dell Community Forum thread for a particular date and store them in an Excel file.
For example, URL: https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017
I want to extract the post title: "I am getting time sync errror and the last synced time shown as a day in 2015"
And the details (username, post time, message) of comments posted on 10-25-2022 only, not any other comments.
I'm very new to this.
Till now I've just managed to extract information(no username) without the date filter.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")

###### time ######
time = doc.find_all('span', attrs={'class': 'local-time'})
print(time)
##################

##### date #######
date = doc.find_all('span', attrs={'class': 'local-date'})
print(date)
##################

#### message ####
article_text = ''
article = doc.find_all("div", {"class": "lia-message-body-content"})
for element in article:
    article_text += '\n' + ''.join(element.find_all(string=True))
print(article_text)
##################

all_data = []
for t, d, m in zip(time, date, article):
    all_data.append([t.text, d.get_text(strip=True), m.get_text(strip=True, separator='\n')])

with open('data.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)
It seems to me you have an issue with your selectors, and with the fact that you're searching for them in the global scope (the entire HTML body). My approach would be to narrow the search down to 'components' and look inside them: first the div that holds all the comments, then each individual comment within it. Here is how you can achieve this:
import requests
from bs4 import BeautifulSoup

url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")

date = '10-25-2022'
comments = []

# Narrow the scope to the component that wraps the whole comment list
comments_section = soup.find('div', {'class': 'lia-component-message-list-detail-with-inline-editors'})
comments_body = comments_section.find_all('div', {'class': 'lia-linear-display-message-view'})

for comment in comments_body:
    # Keep only comments posted on the requested date
    if date in comment.find('span', {'class': 'local-date'}).text:
        comments.append({
            'name': comment.find('a', {'class': 'lia-user-name-link'}).text,
            'date': comment.find('span', {'class': 'local-date'}).text,
            'comment': comment.find('div', {'class': 'lia-message-body-content'}).text,
        })

data = {
    "title": soup.find('div', {'class': 'lia-message-subject'}).text,
    "comments": comments
}

print(data)
This script prints a Python dictionary which, rendered as JSON, looks like this:
{
"title":"\n\n\n\n\n\t\t\t\t\t\t\tI am getting time sync errror and the last synced time shown as a day in 2015\n\t\t\t\t\t\t\n\n\n\n",
"comments":[
{
"name":"jraju",
"date":"10-25-2022",
"comment":"This pc is desktop inspiron 3910 model . The dell supplied only this week."
},
{
"name":"Mary G",
"date":"10-25-2022",
"comment":"Try rebooting the computer and connecting to the internet again to see if that clears it up.\\xa0\nDon't forget to run Windows Update to get all the necessary updates on a new computer.\\xa0\n\\xa0"
},
{
"name":"RoHe",
"date":"10-25-2022",
"comment":"You might want to read Fix: Time synchronization failed on Windows 11.\nTotally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC.\nNOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen.\n\nRon\\xa0\\xa0 Forum Member since 2004\\xa0\\xa0 I'm not a Dell employee"
}
]
}
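Since your original goal was an Excel file, you can write the filtered comments out with csv.DictWriter and open the result directly in Excel. A minimal sketch, assuming the comments list has the same name/date/comment keys as above (the sample rows here are hard-coded stand-ins for the scraped data):

import csv

# Stand-in rows shaped like the scraped `comments` list above
comments = [
    {'name': 'jraju', 'date': '10-25-2022', 'comment': 'This pc is desktop inspiron 3910 model .'},
    {'name': 'Mary G', 'date': '10-25-2022', 'comment': 'Try rebooting the computer.'},
]

# DictWriter maps each dict onto one CSV row, keyed by the header names
with open('comments.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'date', 'comment'])
    writer.writeheader()
    writer.writerows(comments)

If you need a native .xlsx file rather than a CSV, a library such as openpyxl or pandas (DataFrame.to_excel) can do that, but for a simple table a CSV that Excel opens directly is usually enough.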