python html web-scraping beautifulsoup html5lib

How to scrape different content that shares the same HTML attributes and values?


I'm able to scrape a bunch of data from a webpage, but I'm struggling to extract specific content from subsections whose elements have exactly the same attributes and values. Here is the HTML:

    <li class="highlight">
        Relationship Issues
    </li>
    <li class="highlight">
        Depression
    </li>
    <li class="highlight">
        Spirituality
    </li>

    <li class="">
        ADHD
    </li>
    <li class="">
        Alcohol Use
    </li>
    <li class="">
        Anger Management
    </li>

Using that HTML as a reference, I have the following:

import requests
from bs4 import BeautifulSoup
import html5lib
import re

headers = {'User-Agent': 'Mozilla/5.0'}
URL = "website.com"


page = requests.get(URL, headers=headers)

soup = BeautifulSoup(page.content, 'html5lib')

specialties = soup.find_all('div', {'class': 'spec-list attributes-top'})

for x in specialties:
    Specialty_1 = x.find('li', {'class': 'highlight'}).text
    Specialty_2 = x.find('li', {'class': 'highlight'}).text
    Specialty_3 = x.find('li', {'class': 'highlight'}).text

So the ideal outcome is to have: Specialty_1 = Relationship Issues; Specialty_2 = Depression; Specialty_3 = Spirituality

AND

Issue_1 = ADHD; Issue_2 = Alcohol Use; Issue_3 = Anger Management

Would appreciate any and all help!


Solution

  • You could build on Andrej's dictionary idea: use an if/else on whether the li has the highlight class to choose the key prefix, and extend the select to include the additional section. The numbering has to restart at 1 when the new section begins, e.g. with a flag:

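    results = {}
    flag = False  # becomes True once the first issue item has been seen
    counter = 1

    # select() returns the <li> items of both sections in document order
    for j in soup.select(".specialties-list li, .attributes-issues li"):
        # .get() avoids a KeyError if an <li> has no class attribute at all
        if "highlight" in j.get("class", []):
            results[f"Specialty_{counter}"] = j.text.strip()
        else:
            if not flag:
                # first issue item: restart the numbering
                counter = 1
                flag = True
            results[f"Issue_{counter}"] = j.text.strip()
        counter += 1

    print(results)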
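
As an alternative to the flag, each section could be numbered on its own with enumerate. The following is a minimal sketch, not a tested solution for the real page: it runs on the sample HTML from the question, and the two ul wrappers (with class names borrowed from the selectors above) as well as the helper name number_items are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from the question. The <ul> wrappers and their class names
# are assumptions, chosen to match the selectors used in the answer above.
html = """
<ul class="specialties-list">
  <li class="highlight"> Relationship Issues </li>
  <li class="highlight"> Depression </li>
  <li class="highlight"> Spirituality </li>
</ul>
<ul class="attributes-issues">
  <li class=""> ADHD </li>
  <li class=""> Alcohol Use </li>
  <li class=""> Anger Management </li>
</ul>
"""

# html.parser keeps the sketch free of extra dependencies; html5lib works too.
soup = BeautifulSoup(html, "html.parser")


def number_items(soup, selector, prefix):
    """Map prefix_1, prefix_2, ... to the text of each element matching selector."""
    return {
        f"{prefix}_{i}": li.get_text(strip=True)
        for i, li in enumerate(soup.select(selector), start=1)
    }


# Each section gets its own numbering, so no flag or counter reset is needed.
results = {
    **number_items(soup, ".specialties-list li", "Specialty"),
    **number_items(soup, ".attributes-issues li", "Issue"),
}
print(results)
```

Because every section is selected separately, this does not depend on the class attribute of the li elements at all, only on the container selectors being correct for the actual page.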