python beautifulsoup element data-scrubbing

How to use BeautifulSoup to find specific class elements on a web page


Goal: Run a web search for a business and check the results for either "Permanently Closed" text or "Open" with hours, or basically anything BUT "Permanently closed."

Problem: I'm using BeautifulSoup to parse the search results, but it only seems to find the correct element by class 50% of the time.

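import time
import urllib as u
import urllib.request  # "import urllib" alone does not load the request submodule

import pandas
from bs4 import BeautifulSoup as bs

errors = []  # collects [row index, search string, exception] for failed lookups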

comp = pandas.DataFrame(data=[['ALL CITY FITNESS 2', '1005 E PESCADERO AVE SITE 211', 'TRACY', 'CA', '', '']], 
                        columns=['NAME','ADDRESS','CITY','STATE','VERIFIED','STATUS'])

for i in comp.index:
    if comp.loc[i, 'VERIFIED'] != 'YES':
        location, address, city, state = comp.loc[i, ['NAME', 'ADDRESS', 'CITY', 'STATE']]
        print(location, address, city, state)
        search_string = f'{location} {address} {city}, {state}'
        # search_html = Str(search_string).htmlconvert() # This is a custom function
        search_html = 'ALL%20CITY%20FITNESS%202%201005%20E%20PESCADERO%20AVE%20SITE%20211%20TRACY%2C%20CA'
        url = f'https://www.bing.com/search?q={search_html}'

        try:
            req = u.request.urlopen(url)
            soup = bs(req, "html.parser")  # Bing returns HTML, so use an HTML parser rather than "xml"
            
            # This checks if there is a Permanently Closed indicator on the page
            # This works pretty consistently
            alerts = soup.find_all(class_='b_alert')
            for item in alerts:
                print(item.text)
                # Mark Location as closed
                comp.loc[i, 'STATUS'] = 'INACTIVE'
            # A for/else "else" runs whenever the loop ends without break,
            # so test explicitly for the no-alert case instead
            if not alerts:
                # This however, and the one below it rarely work
                for check in soup.find_all(class_='e_green b_positive'):
                    print(check.text)

                for check in soup.find_all('span', class_='e_green b_positive'):
                    print(check.text)

            comp.loc[i, 'VERIFIED'] = 'YES'
            time.sleep(3)

        except Exception as e:
            errors.append([i, search_string, e])
print(comp)

I performed this search manually and inspected the element, which is where I got this class name. I've tried adding the '.' so that it was 'e_green.b_positive' and also removing it, as shown above. Neither seems to work, or at least neither works 100% of the time. What is wrong with my syntax that causes these elements to be missed?
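
For reference, here is a minimal sketch of how the two spellings of the class name behave in BeautifulSoup. The markup is made up for illustration and is not taken from a real Bing page. A space-separated string passed to class_ is compared against the whole class attribute, while the dotted form 'e_green.b_positive' is CSS selector syntax and only works with select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search result; the real page may differ.
html = """
<div>
  <span class="e_green b_positive">Open</span>
  <span class="b_positive e_green">Open until 9 PM</span>
  <span class="e_green b_positive extra">Open 24 hours</span>
  <span class="b_alert">Permanently closed</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A space-separated string must equal the full class attribute value,
# so a different order or an extra class prevents a match.
exact = [tag.text for tag in soup.find_all("span", class_="e_green b_positive")]

# A CSS selector requires both classes but ignores order and extra classes.
both = [tag.text for tag in soup.select("span.e_green.b_positive")]

print(exact)
print(both)
```

If the page sometimes lists the classes in a different order or adds another class, the select() form would still match where the class_ string would not. Whether that is what happens on the live page is an assumption, not something verified here.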


Solution

  • I'm not sure why this affects it, but it actually has to do with how you're encoding your URL, or rather the final format of the URL that you're using to run the search.

    Add '&qs=n&form=QBRE&=%25eManage%20Your%20Search%20History%25E&sp=-1&p' to the end of your url variable, and I bet your code will find those class items now.
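
A rough sketch of building that URL with the standard library instead of a hand-encoded string. The helper name build_search_url and the constant EXTRA_PARAMS are made up for this example. The extra parameters are copied verbatim from the suggestion above, and what each one does on Bing's side is not confirmed here.

```python
from urllib.parse import quote

# Copied verbatim from the answer; treated as an opaque suffix.
EXTRA_PARAMS = "&qs=n&form=QBRE&=%25eManage%20Your%20Search%20History%25E&sp=-1&p"


def build_search_url(name, address, city, state, suffix=""):
    """Return a Bing search URL for one business (hypothetical helper)."""
    search_string = f"{name} {address} {city}, {state}"
    # quote() percent-encodes spaces as %20 and the comma as %2C;
    # safe="" means "/" would be encoded as well.
    encoded = quote(search_string, safe="")
    return f"https://www.bing.com/search?q={encoded}{suffix}"


url = build_search_url(
    "ALL CITY FITNESS 2",
    "1005 E PESCADERO AVE SITE 211",
    "TRACY",
    "CA",
    suffix=EXTRA_PARAMS,
)
print(url)
```

The resulting string can be passed to urlopen in place of the hard-coded url variable.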