python web-scraping beautifulsoup element

How do I scrape an element whose location differs from page to page?


I am scraping an element that appears in different positions depending on the page. My current code mostly works, but it randomly fails to return the value. When I set seller = None, other instances of the value become None when they should be a seller name.

My goal is to scrape hundreds of pages for this single element, checking each known location (and adding new locations as I find them), and to set the element to None if it is not on the page.

I have tried for loops and if / else statements, and recently got somewhat working code (thanks, Stack Overflow) using try / except to first check whether the element is in a specific area and, if not, move on to another. Again, this does not work 100% of the time.

soup = BeautifulSoup(r.text, 'lxml')
if url == product_url:
  try:
    loc1 = soup.find('div', attrs={'id':'availability-brief', 'class':'a-section a-spacing-none'})

    seller = loc1.find('a', href=re.compile('dp_merchant'), attrs={'id':'sellerProfileTriggerId'}).text.strip()

  except:
    try:
      loc2 = soup.find('div', attrs={'id':'sns-availability', 'class':'a-section a-spacing-none'})

      seller = loc2.find('span', text = re.compile('text'), attrs={'class':'a-size-base'}).text.strip()

    except:
      seller = None

  print(seller)
  prod_dict = {'seller':seller}
  print(url)
  print(prod_dict)

With this code I get the seller name when it is present and None when it is not, but sometimes other values are set to None even though an actual seller name is on the page. If the code is run again, it may not return the same seller name as before. For example: run 1, page 1: seller name = foo; run 2, page 1: seller name = None. I expect the code to search the specified locations and return the text; if the element is in none of them, seller should be None and the code should continue through all pages. I would also like to be able to add new locations as they are discovered. Thanks!


Solution

  • I solved this by giving seller a default value before the try block and then using 'pass' in the final except clause.

     soup = BeautifulSoup(r.text, 'lxml')
     if url == product_url:
       seller = 'NA'  # default, kept if no known location matches
       try:
         # Location 1: seller link inside the availability block
         loc1 = soup.find('div', attrs={'id':'availability-brief', 'class':'a-section a-spacing-none'})

         seller = loc1.find('a', href=re.compile('dp_merchant'), attrs={'id':'sellerProfileTriggerId'}).text.strip()

       except AttributeError:  # find() returned None at location 1
         try:
           # Location 2: text span inside the subscribe-and-save block
           loc2 = soup.find('div', attrs={'id':'sns-availability', 'class':'a-section a-spacing-none'})

           seller = loc2.find('span', text = re.compile('text'), attrs={'class':'a-size-base'}).text.strip()

         except AttributeError:  # not at location 2 either
           pass  # keep the default

       print(seller)
       prod_dict = {'seller':seller}
       print(url)
       print(prod_dict)
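Since the question also asks for a way to add new locations as they are discovered, the nested try / except can be flattened into a list of small locator functions that are tried in order. This is only a sketch under some assumptions: the function names, the SELLER_LOCATORS list, and the sample HTML strings are made up for illustration, 'html.parser' is used instead of 'lxml' so that no extra parser is needed, and the class and text filters from the code above are left out for brevity.

```python
import re
from bs4 import BeautifulSoup


def seller_from_availability(soup):
    """Location 1: seller link inside the 'availability-brief' div."""
    box = soup.find('div', id='availability-brief')
    link = box.find('a', id='sellerProfileTriggerId',
                    href=re.compile('dp_merchant')) if box else None
    return link.get_text(strip=True) if link else None


def seller_from_sns(soup):
    """Location 2: text span inside the 'sns-availability' div."""
    box = soup.find('div', id='sns-availability')
    span = box.find('span', class_='a-size-base') if box else None
    return span.get_text(strip=True) if span else None


# Hypothetical registry: append a new function when a new location turns up.
SELLER_LOCATORS = [seller_from_availability, seller_from_sns]


def find_seller(soup, locators=SELLER_LOCATORS, default=None):
    """Return the first non-empty locator result, else `default`."""
    for locate in locators:
        seller = locate(soup)
        if seller:
            return seller
    return default


# Made-up HTML snippets standing in for real product pages.
page_a = ('<div id="availability-brief">'
          '<a id="sellerProfileTriggerId" href="/x?ref=dp_merchant_link">foo</a>'
          '</div>')
page_b = '<div id="sns-availability"><span class="a-size-base"> bar </span></div>'
page_c = '<div id="something-else">no seller here</div>'

for html in (page_a, page_b, page_c):
    print(find_seller(BeautifulSoup(html, 'html.parser')))
```

Each locator checks for None explicitly instead of relying on an exception, so a miss at one location cannot leak into the next page's result, and supporting a new location means writing one more function and appending it to the list.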