python-3.x web-scraping beautifulsoup scrapy scrapinghub

How to iterate through a list of Beautiful Soup tag elements and get a particular text if found, else an empty string?

Case1:

<li style="padding:5px;border-bottom:1px solid #ccc">
 <div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <h5 itemprop="name">
   Derattizzazione Disinfestazione Punteruolo Rosso - Quark Srl
  </h5>
  <div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
   <span itemprop="streetAddress">
    Via S. Pellico, 198/L
   </span>
   <br/>
   <span itemprop="postalCode">
    63039,
    <span itemprop="addressLocality">
     San Benedetto del Tronto
    </span>
    (AP)
   </span>
   <br/>
  </div>
  <span itemprop="telephone">
   tel: 800 99 83 01
  </span>
  <br/>
  <span>
   sito:quarksrl.it
  </span>
  <br/>
  <span>
   parole chiave:
   <strong>
    derattizzazione,consulenza ambientale,disinfestazione ratti,allontanamento piccioni,punteruolo rosso
   </strong>
  </span>
 </div>
</li>

Case2:

<li style="padding:5px;border-bottom:1px solid #ccc">
 <div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <h5 itemprop="name">
   V&amp;b Home Comfort
  </h5>
  <div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
   <span itemprop="streetAddress">
    via delle Torri, 5
   </span>
   <br/>
   <span itemprop="postalCode">
    63100,
    <span itemprop="addressLocality">
     Ascoli Piceno
    </span>
    (AP)
   </span>
   <br/>
  </div>
  <span>
   sito:vebhomecomfort.it
  </span>
  <br/>
 </div>
</li>

In case 1 the text 'parole chiave:' is present, so I want to fetch the data that follows it. In case 2 that element is not present, so I want None or 'Empty Text' there. Is there also a way to do the same in Scrapy? I really appreciate you taking the time, thanks!

Solution

If txt is a string containing the HTML from case 1 followed by case 2, you can use this script to extract the elements:

from bs4 import BeautifulSoup


soup = BeautifulSoup(txt, 'html.parser')

for li in soup.select('li'):
    name = li.select_one('h5').get_text(strip=True, separator=' ')
    address = li.select_one('[itemprop="streetAddress"]').get_text(strip=True, separator=' ')
    postal_code = li.select_one('[itemprop="postalCode"]').get_text(strip=True, separator=' ')
    address_locality = li.select_one('[itemprop="addressLocality"]').get_text(strip=True, separator=' ')

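    # Optional fields: select_one() and find() return None when nothing
    # matches, so fall back to a placeholder instead of calling get_text().
    telephone = li.select_one('[itemprop="telephone"]')
    telephone = telephone.get_text(strip=True, separator=' ') if telephone else '-'

    # These spans have no itemprop attribute, so match them by their label text.
    web = li.find(lambda t: t.name=='span' and t.get_text(strip=True).startswith('sito:'))
    web = web.get_text(strip=True, separator=' ').replace('sito:', '') if web else '-'

    # Keywords are comma-separated inside the span; an empty list means "not present".
    keywords = li.find(lambda t: t.name=='span' and t.get_text(strip=True).startswith('parole chiave:'))
    keywords = keywords.get_text(strip=True, separator=' ').replace('parole chiave:', '').split(',') if keywords else []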

    print(name)
    print(address)
    print(postal_code)
    print(address_locality)
    print(telephone)
    print(web)
    print(keywords)
    print('-' * 80)

Prints:

Derattizzazione Disinfestazione Punteruolo Rosso - Quark Srl
Via S. Pellico, 198/L
63039, San Benedetto del Tronto (AP)
San Benedetto del Tronto
tel: 800 99 83 01
quarksrl.it
[' derattizzazione', 'consulenza ambientale', 'disinfestazione ratti', 'allontanamento piccioni', 'punteruolo rosso']
--------------------------------------------------------------------------------
V&b Home Comfort
via delle Torri, 5
63100, Ascoli Piceno (AP)
Ascoli Piceno
-
vebhomecomfort.it
[]
--------------------------------------------------------------------------------
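
The question asked for None or 'Empty Text' when the 'parole chiave:' span is missing. A minimal sketch of one way to get that is shown here: the label lookup is wrapped in a small helper with a configurable default. The names labelled_text, parse_listings and SAMPLE_HTML are hypothetical and not part of the original answer, and the sample HTML is a trimmed-down version of the two cases. Slicing off the label and stripping the remainder also avoids the leading space seen in the first keyword above.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample with the same shape as case 1 and case 2 (assumption:
# the real page uses the same <li>/<span> layout).
SAMPLE_HTML = """
<li>
 <div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <h5 itemprop="name">Quark Srl</h5>
  <span>sito:quarksrl.it</span>
  <span>
   parole chiave:
   <strong>
    derattizzazione,consulenza ambientale,punteruolo rosso
   </strong>
  </span>
 </div>
</li>
<li>
 <div itemscope="" itemtype="http://schema.org/LocalBusiness">
  <h5 itemprop="name">V&amp;b Home Comfort</h5>
  <span>sito:vebhomecomfort.it</span>
 </div>
</li>
"""


def labelled_text(li, label, default=None):
    """Return the text after `label` in the first matching <span>, else `default`."""
    span = li.find(
        lambda t: t.name == "span" and t.get_text(strip=True).startswith(label)
    )
    if span is None:
        return default
    text = span.get_text(" ", strip=True)
    # Drop the label itself and any whitespace left around the value.
    return text[len(label):].strip()


def parse_listings(html, default=None):
    """Build one dict per <li>, using `default` for any missing field."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select("li"):
        name = li.select_one('[itemprop="name"]')
        rows.append({
            "name": name.get_text(" ", strip=True) if name else default,
            "web": labelled_text(li, "sito:", default),
            "keywords": labelled_text(li, "parole chiave:", default),
        })
    return rows


rows = parse_listings(SAMPLE_HTML)
for row in rows:
    print(row)
```

Calling parse_listings(SAMPLE_HTML, default='Empty Text') returns the placeholder string instead of None for the second listing. If a list of keywords is needed, the returned string can be split on ',' once it is known not to be the default.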