I have an HTML file like this (with more than 100 records):
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>
I need to extract the names only if they are Employee I, which makes it challenging. How can I select those tags that have Employee I in the next tag? Or should I use a different method? Is it even possible to use a condition in this case?
import re

with open("file.html", 'r') as input:
    html = input.read()
print(re.search(r'\bEmployee I\b', html).group(0))
For example, how can I tell it to go back and read the previous tag?
import re
from bs4 import BeautifulSoup

with open('inputfile.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

names = [span.parent.find('h3').string
         for span in
         soup.find_all('span',
                       class_='light-text',
                       string=re.compile('Employee I$'))
         ]
print(names)
gives
['John Smith', 'Jenna Smith']
I've formatted the list comprehension over several lines for clarity, so that it is easier to see where to adjust things for other use cases. Of course, a normal for-loop that appends to a list also works fine; I just like list comprehensions.
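For readers who prefer the plain loop, here is a minimal sketch of the same idea. It assumes the markup looks exactly like the sample in the question, which is inlined here so the snippet runs on its own; the `heading` variable and the `None` check are my additions, not part of the code above.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question, inlined so the sketch is self-contained.
html = """
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">John Smith</h3>
<span class="light-text">Center - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jenna Smith</h3>
<span class="light-text">West - VAR - Employee I</span>
</div>
<div class="cell-62 pl-1 pt-0_5">
<h3 class="very-big-text light-text">Jordan Smith</h3>
<span class="light-text">East - VAR - Employee II</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

names = []
# Find every <span> whose text ends in "Employee I" ...
for span in soup.find_all('span', string=re.compile('Employee I$')):
    # ... then step up to the enclosing <div> and look for its <h3>.
    heading = span.parent.find('h3')
    if heading is not None:  # skip records that have no <h3>
        names.append(heading.string)

print(names)
```

The loop form makes it a little easier to add extra conditions or logging per record, at the cost of a few more lines.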
The re.compile('Employee I$') is necessary to avoid matching on 'Employee II'. The class_ argument is an extra, and may not be needed.
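To see why the anchor matters, here is a small sketch using only the `re` module on the three job titles from the question. The variable names are just for illustration. It also shows that a trailing word boundary (`\b`, as in the pattern from the question) should exclude 'Employee II' in the same way.

```python
import re

# The three <span> texts from the sample HTML in the question.
titles = [
    'Center - VAR - Employee I',
    'West - VAR - Employee I',
    'East - VAR - Employee II',
]

unanchored = re.compile('Employee I')    # also matches the start of "Employee II"
anchored = re.compile('Employee I$')     # requires the text to end right after "I"
bounded = re.compile(r'Employee I\b')    # requires a word boundary after "I"

loose = [t for t in titles if unanchored.search(t)]
strict = [t for t in titles if anchored.search(t)]
by_boundary = [t for t in titles if bounded.search(t)]

print(loose)
print(strict)
print(by_boundary)
```

The `$` version only works when the job title is the last thing in the string; if trailing text may follow, the `\b` variant is probably the safer choice.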
The rest is near self-explanatory, especially with the BeautifulSoup documentation next to it.
Note that the string argument to find_all used to be called text, in case you're using an older version of BeautifulSoup (before 4.4.0).