
How to iterate through a list of Beautiful Soup tag elements and get a particular text if found, else an empty string?


We have a list of Tag items: in some of them the text 'parole chiave:' is present, and in others it is not.

Case1:

<li style="padding:5px;border-bottom:1px solid #ccc">
 <div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <h5 itemprop="name">
   Derattizzazione Disinfestazione Punteruolo Rosso - Quark Srl
  </h5>
  <div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
   <span itemprop="streetAddress">
    Via S. Pellico, 198/L
   </span>
   <br/>
   <span itemprop="postalCode">
    63039,
    <span itemprop="addressLocality">
     San Benedetto del Tronto
    </span>
    (AP)
   </span>
   <br/>
  </div>
  <span itemprop="telephone">
   tel: 800 99 83 01
  </span>
  <br/>
  <span>
   sito:quarksrl.it
  </span>
  <br/>
  <span>
   parole chiave:
   <strong>
    derattizzazione,consulenza ambientale,disinfestazione ratti,allontanamento piccioni,punteruolo rosso
   </strong>
  </span>
 </div>
</li>

Case2:

<li style="padding:5px;border-bottom:1px solid #ccc">
 <div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <h5 itemprop="name">
   V&amp;b Home Comfort
  </h5>
  <div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
   <span itemprop="streetAddress">
    via delle Torri, 5
   </span>
   <br/>
   <span itemprop="postalCode">
    63100,
    <span itemprop="addressLocality">
     Ascoli Piceno
    </span>
    (AP)
   </span>
   <br/>
  </div>
  <span>
   sito:vebhomecomfort.it
  </span>
  <br/>
 </div>
</li>

In case 1 the text 'parole chiave:' is present, so I want to fetch the text that follows it. In case 2 that element is missing, so I want None or 'Empty Text' there instead. Is there also a way to do the same in Scrapy? I really appreciate you taking the time, thanks!


Solution

  • If txt is a string containing the HTML from case 1 and case 2, then you can use this script to extract the elements:

    from bs4 import BeautifulSoup
    
    
    soup = BeautifulSoup(txt, 'html.parser')
    
    for li in soup.select('li'):
        name = li.select_one('h5').get_text(strip=True, separator=' ')
        address = li.select_one('[itemprop="streetAddress"]').get_text(strip=True, separator=' ')
        postal_code = li.select_one('[itemprop="postalCode"]').get_text(strip=True, separator=' ')
        address_locality = li.select_one('[itemprop="addressLocality"]').get_text(strip=True, separator=' ')
    
        # Optional field: select_one() returns None when the element is missing
        telephone = li.select_one('[itemprop="telephone"]')
        telephone = telephone.get_text(strip=True, separator=' ') if telephone else '-'
    
        # These spans have no itemprop, so match them by the label their text starts with
        web = li.find(lambda t: t.name=='span' and t.get_text(strip=True).startswith('sito:'))
        web = web.get_text(strip=True, separator=' ').replace('sito:', '') if web else '-'
    
        # Note: the first keyword keeps a leading space from the separator; strip each item if that matters
        keywords = li.find(lambda t: t.name=='span' and t.get_text(strip=True).startswith('parole chiave:'))
        keywords = keywords.get_text(strip=True, separator=' ').replace('parole chiave:', '').split(',') if keywords else []
    
        print(name)
        print(address)
        print(postal_code)
        print(address_locality)
        print(telephone)
        print(web)
        print(keywords)
        print('-' * 80)
    

    Prints:

    Derattizzazione Disinfestazione Punteruolo Rosso - Quark Srl
    Via S. Pellico, 198/L
    63039, San Benedetto del Tronto (AP)
    San Benedetto del Tronto
    tel: 800 99 83 01
    quarksrl.it
    [' derattizzazione', 'consulenza ambientale', 'disinfestazione ratti', 'allontanamento piccioni', 'punteruolo rosso']
    --------------------------------------------------------------------------------
    V&b Home Comfort
    via delle Torri, 5
    63100, Ascoli Piceno (AP)
    Ascoli Piceno
    -
    vebhomecomfort.it
    []
    --------------------------------------------------------------------------------
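    If only the 'parole chiave:' text is needed, with an empty string (or None) when it is absent, the same label-matching idea can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the names labelled_text, extract_keywords and SAMPLE_HTML are hypothetical, and the sample markup is a trimmed-down stand-in for the two cases in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the two cases in the question (assumed sample data).
SAMPLE_HTML = """
<ul>
 <li>
  <h5 itemprop="name">Quark Srl</h5>
  <span>sito:quarksrl.it</span>
  <span>
   parole chiave:
   <strong>
    derattizzazione,consulenza ambientale,punteruolo rosso
   </strong>
  </span>
 </li>
 <li>
  <h5 itemprop="name">V&amp;b Home Comfort</h5>
  <span>sito:vebhomecomfort.it</span>
 </li>
</ul>
"""


def labelled_text(li, label, default=""):
    """Return the text that follows `label` in the first matching <span>.

    Falls back to `default` when no <span> in `li` starts with `label`.
    """
    span = li.find(
        lambda t: t.name == "span" and t.get_text(strip=True).startswith(label)
    )
    if span is None:
        return default
    # Drop the label itself, then trim the space introduced by the separator.
    return span.get_text(" ", strip=True)[len(label):].strip()


def extract_keywords(html, default=""):
    """Collect one keyword string (or `default`) per <li> in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return [labelled_text(li, "parole chiave:", default) for li in soup.select("li")]


results = extract_keywords(SAMPLE_HTML)
print(results)
```

    Passing default=None (or default='Empty Text') changes what is returned for items like case 2, and the same helper should work for other labelled spans such as 'sito:'.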