I have extracted data from span tags on a webpage, selecting them by class. At times, however, the webpage splits a line into multiple fragments and stores them in consecutive tags. All the child span tags have the same class name.
Following is the HTML snippet:
<p class="Paragraph SCX">
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
This week
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
(12/
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
11
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
- 12/1
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
7
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
):
</span>
</span>
<span class="EOP SCX">
</span>
</p>
From the above HTML snippet, I need to extract only the innermost span data.
Python code to extract data using BS4:
for data in elem.find_all('span', class_="TextRun"):
    a = data.find('span').contents[0]
    a = a.string.replace(u'\xa0', '')
    print(a)
    events_parsed_thisweek.append(a)
This code prints each fragment as a separate entry instead of one combined line. Required output:
This week (12/11 - 12/17):
Any idea how to combine these span tag data together? Thanks!
Give this a go. Make sure to store the whole HTML in the content variable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,'lxml')
data = ''.join([' '.join(item.text.split()) for item in soup.select(".NormalTextRun")])
print(data)
Output:
This week(12/11- 12/17):
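If the page has several such paragraphs, or the spacing between fragments matters, a variation along these lines might help. It is only a sketch: the `content` string is a shortened, hypothetical sample modelled on the snippet in the question, and it assumes the fragments carry their own spaces (or `\xa0` characters) at their edges, which the snippet as posted does not show. `join_runs` is a made-up helper name. The idea is to join the fragments of each paragraph first and only then collapse whitespace, so spaces at fragment boundaries are not lost.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: assumes each fragment keeps its own edge spaces.
content = """
<p class="Paragraph SCX">
<span class="TextRun SCX"><span class="NormalTextRun SCX">This week </span></span>
<span class="TextRun SCX"><span class="NormalTextRun SCX">(12/</span></span>
<span class="TextRun SCX"><span class="NormalTextRun SCX">11</span></span>
<span class="TextRun SCX"><span class="NormalTextRun SCX"> - 12/1</span></span>
<span class="TextRun SCX"><span class="NormalTextRun SCX">7</span></span>
<span class="TextRun SCX"><span class="NormalTextRun SCX">):</span></span>
<span class="EOP SCX"></span>
</p>
"""


def join_runs(paragraph):
    """Concatenate the text of every NormalTextRun span in one paragraph."""
    pieces = [
        run.get_text().replace("\xa0", " ")  # treat non-breaking spaces as spaces
        for run in paragraph.find_all("span", class_="NormalTextRun")
    ]
    # Join first, then collapse runs of whitespace into single spaces.
    return " ".join("".join(pieces).split())


# "html.parser" ships with Python, so lxml is not required for this sketch.
soup = BeautifulSoup(content, "html.parser")
lines = [join_runs(p) for p in soup.find_all("p", class_="Paragraph")]
print(lines)
```

Each `<p class="Paragraph ...">` yields one combined string, so `lines` can be appended to a list such as `events_parsed_thisweek` one entry per paragraph rather than one entry per fragment.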