Case1:
<li style="padding:5px;border-bottom:1px solid #ccc">
<div itemscope="" itemtype="http://schema.org/LocalBusiness">
<h5 itemprop="name">
Derattizzazione Disinfestazione Punteruolo Rosso - Quark Srl
</h5>
<div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
Via S. Pellico, 198/L
</span>
<br/>
<span itemprop="postalCode">
63039,
<span itemprop="addressLocality">
San Benedetto del Tronto
</span>
(AP)
</span>
<br/>
</div>
<span itemprop="telephone">
tel: 800 99 83 01
</span>
<br/>
<span>
sito:quarksrl.it
</span>
<br/>
<span>
parole chiave:
<strong>
derattizzazione,consulenza ambientale,disinfestazione ratti,allontanamento piccioni,punteruolo rosso
</strong>
</span>
</div>
</li>
Case2:
<li style="padding:5px;border-bottom:1px solid #ccc">
<div itemscope="" itemtype="http://schema.org/LocalBusiness">
<h5 itemprop="name">
V&b Home Comfort
</h5>
<div itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
via delle Torri, 5
</span>
<br/>
<span itemprop="postalCode">
63100,
<span itemprop="addressLocality">
Ascoli Piceno
</span>
(AP)
</span>
<br/>
</div>
<span>
sito:vebhomecomfort.it
</span>
<br/>
</div>
</li>
In case 1 the text 'parole chiave:' is present, so I want to fetch the data that follows it. In case 2 that element is not present, so I want None or 'Empty Text' there instead. Alternatively, is there a way to do the same in Scrapy? I really appreciate you taking the time, thanks!
If txt is the string from case 1 + case 2, then you can use this script to extract the elements:
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

for li in soup.select('li'):
    # Fields that are present in every listing
    name = li.select_one('h5').get_text(strip=True, separator=' ')
    address = li.select_one('[itemprop="streetAddress"]').get_text(strip=True, separator=' ')
    postal_code = li.select_one('[itemprop="postalCode"]').get_text(strip=True, separator=' ')
    address_locality = li.select_one('[itemprop="addressLocality"]').get_text(strip=True, separator=' ')

    # Optional fields: select_one()/find() return None when the element is missing
    telephone = li.select_one('[itemprop="telephone"]')
    telephone = telephone.get_text(strip=True, separator=' ') if telephone else '-'

    # These spans have no itemprop, so match them by their leading label
    web = li.find(lambda t: t.name=='span' and t.get_text(strip=True).startswith('sito:'))
    web = web.get_text(strip=True, separator=' ').replace('sito:', '') if web else '-'

    keywords = li.find(lambda t: t.name=='span' and t.get_text(strip=True).startswith('parole chiave:'))
    keywords = keywords.get_text(strip=True, separator=' ').replace('parole chiave:', '').split(',') if keywords else []

    print(name)
    print(address)
    print(postal_code)
    print(address_locality)
    print(telephone)
    print(web)
    print(keywords)
    print('-' * 80)
Prints:
Derattizzazione Disinfestazione Punteruolo Rosso - Quark Srl
Via S. Pellico, 198/L
63039, San Benedetto del Tronto (AP)
San Benedetto del Tronto
tel: 800 99 83 01
quarksrl.it
[' derattizzazione', 'consulenza ambientale', 'disinfestazione ratti', 'allontanamento piccioni', 'punteruolo rosso']
--------------------------------------------------------------------------------
V&b Home Comfort
via delle Torri, 5
63100, Ascoli Piceno (AP)
Ascoli Piceno
-
vebhomecomfort.it
[]
--------------------------------------------------------------------------------
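As the output shows, the telephone still carries its 'tel:' label, the postal code field also contains the locality (that span is nested inside the postalCode span), and the first keyword has a leading space. If you would rather have clean values, and None instead of '-' for a missing element as asked in the question, a small post-processing step could look like the sketch below. The helper names strip_label, postal_only and split_keywords are made up for this example; they are not part of BeautifulSoup.

```python
from typing import List, Optional


def strip_label(value: Optional[str], label: str) -> Optional[str]:
    """Drop a leading label such as 'tel:'; return None for missing values."""
    # '-' is the placeholder the script above uses for a missing element
    if not value or value == '-':
        return None
    value = value.strip()
    if value.startswith(label):
        value = value[len(label):]
    return value.strip() or None


def postal_only(value: Optional[str]) -> Optional[str]:
    """Keep only the part before the first comma, e.g. '63039'."""
    if not value:
        return None
    return value.split(',', 1)[0].strip() or None


def split_keywords(raw: Optional[str]) -> List[str]:
    """Split a comma-separated string and trim whitespace around each keyword."""
    if not raw:
        return []
    return [kw.strip() for kw in raw.split(',') if kw.strip()]


print(strip_label('tel: 800 99 83 01', 'tel:'))
print(strip_label('-', 'tel:'))
print(postal_only('63039, San Benedetto del Tronto (AP)'))
print(split_keywords(' derattizzazione,consulenza ambientale'))
```

Inside the loop you would apply these to the raw text, for example telephone = strip_label(telephone, 'tel:'), and for the keywords pass the string to split_keywords instead of calling .split(',') directly.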