
How to extract from multiple <span> tags and group the data together using BS4?


I am extracting the text inside span tags from a webpage, selecting them by class. Sometimes, however, the page splits a single line into several fragments and stores them in consecutive span tags. All of the child span tags share the same class name.

Following is the HTML snippet:

<p class="Paragraph SCX">
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            This week
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            &nbsp;(12/
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            11
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            &nbsp;- 12/1
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            7
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            ):
        </span>
    </span>
    <span class="EOP SCX">
        &nbsp;
    </span>
</p>

From the above HTML snippet, I need to extract only the innermost span data.

Python code to extract data using BS4:

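# elem and events_parsed_thisweek are defined elsewhere in the script (not shown)
for data in elem.find_all('span', class_="TextRun"):
    # First child of the inner span: one text fragment of the line
    a = data.find('span').contents[0]
    a = a.string.replace('\xa0', '')
    print(a)
    events_parsed_thisweek.append(a)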

This code prints each fragment separately, as its own item. Required output:

This week (12/11 - 12/17):

Any idea how to combine these span tag data together? Thanks!


Solution

  • Give this a go. Make sure the whole HTML is stored in the content variable.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(content,'lxml')
    data = ''.join([' '.join(item.text.split()) for item in soup.select(".NormalTextRun")])
    print(data)
    

    Output:

    This week(12/11- 12/17):
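
The output above loses the spaces around the date range because split() with no arguments treats the non-breaking space (\xa0, from &nbsp;) as whitespace. If the spacing from the required output matters, the following is a minimal sketch of one possible variation, not part of the original answer. It strips only the layout whitespace from each fragment, turns \xa0 into a normal space, and joins the fragments per paragraph. The helper name join_runs, the use of the built-in html.parser, and the second sample paragraph are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup: the first paragraph mirrors the question's snippet,
# the second one is hypothetical and only shows the per-paragraph grouping.
content = """
<p class="Paragraph SCX">
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> This week </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> &nbsp;(12/ </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> 11 </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> &nbsp;- 12/1 </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> 7 </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> ): </span></span>
    <span class="EOP SCX"> &nbsp; </span>
</p>
<p class="Paragraph SCX">
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> Holiday </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX"> &nbsp;party </span></span>
</p>
"""


def join_runs(paragraph):
    """Join the innermost span fragments of one paragraph (hypothetical helper)."""
    parts = []
    for run in paragraph.select("span.NormalTextRun"):
        # Strip only layout whitespace; a bare strip() would also remove \xa0.
        text = run.get_text().strip(" \t\r\n")
        # Convert the non-breaking space into a regular space.
        parts.append(text.replace("\xa0", " "))
    return "".join(parts)


soup = BeautifulSoup(content, "html.parser")
# One joined string per paragraph, so fragments of different lines stay apart.
lines = [join_runs(p) for p in soup.select("p.Paragraph")]
for line in lines:
    print(line)
```

With this sample, the first paragraph comes out as "This week (12/11 - 12/17):". Grouping by the enclosing p tag keeps fragments from different lines separate, which may or may not match how the real page is structured.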