html python-3.x web-scraping beautifulsoup

How to identify HTML text with corresponding header-like objects?


Below is an HTML example, but my use case involves different types of unstructured text. What is a good generic approach to tie (label) each of the two text paragraphs below to its parent header (SUMMARY1)? The header here isn't a real header tag; it's just bold text. I am trying to extract text paragraphs and identify their corresponding header sections, regardless of whether the header is a standard header tag or something like the one below:

<!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">

            <title>Europe Test  - Some stats</title>
            <meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
    <b><location">SUMMARY1</b>
    </p>
    <p>
      This is a region in <location>Europe</location>
      where the climate is good.
    </p>
    <p>
      Total <location>Europe</location> population estimate was used back then.
    </p>

<div class="aspNetHidden"></div>
        </body>
    </html>

I am trying to produce JSON like this: {"SUMMARY1": ["This is a region in Europe where the climate is good", "Total Europe population estimate was used back then"]}

Please advise. Thank you.


Solution

  • I initially considered the newspaper module, but could not find a way to get SUMMARY1 on its own from the "summary", "description", or any other attribute of the resulting Article object. The module is still worth a look, as it may help you parse HTML articles.

    If you use BeautifulSoup instead, you can first locate the header and then collect the p elements that follow it with find_all_next():

    from bs4 import BeautifulSoup
    
    
    html = """
    <!doctype html>
        <html lang="en">
            <head>
                <meta charset="utf-8">
    
                <title>Europe Test  - Some stats</title>
                <meta name="description" content="Watch videos and find the latest information.">
    <body>
    <p>
        <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
        </p>
        <p>
          This is a region in <location>Europe</location>
          where the climate is good.
        </p>
        <p>
          Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
        </p>
    
    <div class="aspNetHidden"></div>
            </body>
        </html>"""
    
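    # The "lxml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(html, "lxml")
    # The bold text acts as the header.
    header = soup.find("b")
    # Collect the text of every <p> that appears after the header in the document.
    parts = [p.get_text(strip=True, separator=" ") for p in header.find_all_next("p")]
    print({header.get_text(strip=True): parts})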
    

    Prints:

    {'SUMMARY1': [
         'This is a region in Europe where the climate is good.', 
         'Total Europe population estimate was used back then.']}
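
Note that find_all_next("p") returns every later p element in the document, so with more than one header each header would also collect the paragraphs of the sections after it. A minimal sketch of one way to generalize this is shown below. It assumes that a paragraph counts as a header when all of its text sits inside a single b or strong element; that heuristic, the helper names is_header_like and group_by_header, and the SUMMARY2 section in the sample are illustrative assumptions, not part of the original question. It uses the built-in "html.parser" backend so that lxml is not required.

```python
import json

from bs4 import BeautifulSoup


def is_header_like(p):
    """Assumed heuristic: a <p> is a header if all of its text is inside one <b>/<strong>."""
    bold = p.find(["b", "strong"])
    if bold is None:
        return False
    bold_text = bold.get_text(strip=True)
    return bold_text != "" and bold_text == p.get_text(strip=True)


def group_by_header(html):
    """Map each header-like paragraph to the plain paragraphs that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None  # text of the most recent header-like paragraph
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if is_header_like(p):
            current = text
            sections.setdefault(current, [])
        elif current is not None and text:
            # Paragraphs before the first header are skipped.
            sections[current].append(text)
    return sections


# Hypothetical sample: SUMMARY2 is added only to show the grouping.
sample = """
<p><b>SUMMARY1</b></p>
<p>This is a region in <location>Europe</location> where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
<p><strong>SUMMARY2</strong></p>
<p>A second section.</p>
"""

print(json.dumps(group_by_header(sample), indent=2))
```

For other kinds of unstructured input, only is_header_like would need to change (for example, to also accept all-caps paragraphs or real h1 to h6 tags); the grouping loop stays the same.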