python web-scraping beautifulsoup csvwriter

Extract data from Dell Community Forum for a specific date


I want to extract the username, post title, post time, and message content from a Dell Community Forum thread for a particular date and store them in an Excel file.

For example, URL: https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017

I want to extract the post title: "I am getting time sync errror and the last synced time shown as a day in 2015"

And the details (username, post time, message) of the comments dated 10-25-2022 only:

  1. jraju, 04:20 AM, "This pc is desktop inspiron 3910 model . The dell supplied only this week."
  2. Mary G, 09:10 AM, "Try rebooting the computer and connecting to the internet again to see if that clears it up. Don't forget to run Windows Update to get all the necessary updates on a new computer."
  3. RoHe, 01:00 PM, "You might want to read Fix: Time synchronization failed on Windows 11. Totally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC. NOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen."

No other comments.

I'm very new to this.

So far I've only managed to extract the information (without the username) and without the date filter.


import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"

result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")

###### time ######
time = doc.find_all('span', attrs={'class':'local-time'})
print(time)
##################

##### date #######
date = doc.find_all('span', attrs={'class':'local-date'})
print(date)
#################

#### message ######
article_text = ''
article = doc.find_all("div", {"class":"lia-message-body-content"})
for element in article:
    article_text += '\n' + ''.join(element.find_all(string=True))
    
print(article_text)
##################
all_data = []
for t, d, m in zip(time, date, article):
    all_data.append([t.text, d.get_text(strip=True),m.get_text(strip=True, separator='\n')])

with open('data.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)

Solution

  • It seems to me the issue is with your selectors: you search for them in the general scope (the entire HTML body), so the times, dates, and messages come back as separate lists with nothing tying them to a particular comment. My approach would be to narrow the search down to 'components' and look inside each one:

    1. Locate the div that holds all comments
    2. Search inside it for each comment container
    3. Get the username, date and comment info from each comment container

    Here is how you can achieve this:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
    
    result = requests.get(url)
    soup = BeautifulSoup(result.text, "html.parser")
    
    date = '10-25-2022'
    comments = []
    
    comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
    comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
    for comment in comments_body:
        if date in comment.find('span',{'class':'local-date'}).text:
            comments.append({
                'name': comment.find('a',{'class':'lia-user-name-link'}).text,
                'date': comment.find('span',{'class':'local-date'}).text,
                'comment': comment.find('div',{'class':'lia-message-body-content'}).text,
            })
    
    data = {
        "title": soup.find('div', {'class':'lia-message-subject'}).text,
        "comments": comments
    }
    
    print(data)
    

    This script builds a dictionary that, formatted as JSON, looks like this:

    {
       "title":"\n\n\n\n\n\t\t\t\t\t\t\tI am getting time sync errror and the last synced time shown as a day in 2015\n\t\t\t\t\t\t\n\n\n\n",
       "comments":[
          {
             "name":"jraju",
             "date":"10-25-2022",
             "comment":"This pc is desktop inspiron 3910 model . The dell supplied only this week."
          },
          {
             "name":"Mary G",
             "date":"10-25-2022",
             "comment":"Try rebooting the computer and connecting to the internet again to see if that clears it up.\\xa0\nDon't forget to run Windows Update to get all the necessary updates on a new computer.\\xa0\n\\xa0"
          },
          {
             "name":"RoHe",
             "date":"10-25-2022",
             "comment":"You might want to read Fix: Time synchronization failed on Windows 11.\nTotally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC.\nNOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen.\n\nRon\\xa0\\xa0 Forum Member since 2004\\xa0\\xa0 I'm not a Dell employee"
          }
       ]
    }
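
The question also asks for the result to be stored in a file, which the script above does not do. The following is a minimal sketch of one way to do that, assuming a dictionary shaped like the output above. The helper names `clean` and `write_comments_csv`, the file name `comments.csv`, and the column layout are all illustrative choices, not part of the original answer, and the sample data is an abridged stand-in for the real scrape.

```python
import csv


def clean(text):
    # Replace non-breaking spaces, then collapse every run of whitespace
    # (newlines and tabs included) into a single space.
    return " ".join(text.replace("\xa0", " ").split())


def write_comments_csv(data, path):
    """Write one CSV row per comment, repeating the thread title on each row."""
    title = clean(data["title"])
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(
            csvfile, fieldnames=["title", "name", "date", "comment"]
        )
        writer.writeheader()
        for comment in data["comments"]:
            writer.writerow({
                "title": title,
                "name": clean(comment["name"]),
                "date": clean(comment["date"]),
                "comment": clean(comment["comment"]),
            })


# Abridged stand-in for the `data` dictionary built by the scraping script.
sample = {
    "title": "\n\n\t\tI am getting time sync errror and the last synced "
             "time shown as a day in 2015\n\n",
    "comments": [
        {
            "name": "jraju",
            "date": "10-25-2022",
            "comment": "This pc is desktop inspiron 3910 model . "
                       "The dell supplied only this week.",
        },
        {
            "name": "Mary G",
            "date": "10-25-2022",
            "comment": "Try rebooting the computer.\xa0\n"
                       "Don't forget to run Windows Update.\xa0\n\xa0",
        },
    ],
}

write_comments_csv(sample, "comments.csv")
```

A CSV file written this way opens directly in Excel. To include the post time as well, each comment dictionary could carry an extra field taken from the `local-time` span used in the question's code, with a matching entry added to `fieldnames`.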