Below is an HTML example, but my use case involves different types of unstructured text. What is a good generic approach to tie (label) each of the two text paragraphs below to its parent header (SUMMARY1)? The header here isn't a real header tag; it's just bolded text. I am trying to extract text paragraphs together with their corresponding header sections, regardless of whether the header is a standard header tag or something like the example below:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Europe Test - Some stats</title>
<meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
<b><location>SUMMARY1</location></b>
</p>
<p>
This is a region in <location>Europe</location>
where the climate is good.
</p>
<p>
Total <location>Europe</location> population estimate was used back then.
</p>
<div class="aspNetHidden"></div>
</body>
</html>
I am trying to produce JSON like this: {"SUMMARY1": ["This is a region in Europe where the climate is good", "Total Europe population estimate was used back then"]}
Please advise. Thank you.
I was initially thinking about using the newspaper module, but could not find a way to get SUMMARY1 as part of a "summary" or "description", or anywhere else on the resulting Article object. In any case, check out this module; it may really help you parse HTML articles.
But if you use BeautifulSoup, you can first locate the header, then get the p elements that follow it with find_all_next():
from bs4 import BeautifulSoup
html = """
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Europe Test - Some stats</title>
<meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
<b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
</p>
<p>
This is a region in <location>Europe</location>
where the climate is good.
</p>
<p>
Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
</p>
<div class="aspNetHidden"></div>
</body>
</html>"""
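soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
# Treat the first bold element in the document as the header.
header = soup.find("b")
# Collect the text of every <p> that follows the header in document order.
parts = [p.get_text(strip=True, separator=" ") for p in header.find_all_next("p")]
print({header.get_text(strip=True): parts})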
Prints:
{'SUMMARY1': [
'This is a region in Europe where the climate is good.',
'Total Europe population estimate was used back then.']}
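Note that find_all_next("p") returns every following p, so with several bold "headers" in one document the first header would swallow all later paragraphs. One possible way to generalize is sketched below, using only the standard library's html.parser: it assumes a "header" is any p whose entire text is inside b or strong, and assigns each other paragraph to the most recent such header. The names SectionParser and group_paragraphs, the SUMMARY2 sample and the "whole paragraph is bold" rule are assumptions made for illustration, not something taken from your data.

```python
import json
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Group <p> text under the most recent paragraph that is entirely bold."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # header text -> list of paragraph texts
        self.current = None    # header that the next paragraphs belong to
        self.in_p = False
        self.bold_depth = 0
        self.chunks = []       # every text piece inside the current <p>
        self.bold_chunks = []  # only the pieces inside <b>/<strong>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.bold_depth = 0
            self.chunks, self.bold_chunks = [], []
        elif tag in ("b", "strong") and self.in_p:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth:
            self.bold_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            # Join the pieces, then collapse runs of whitespace and newlines.
            text = " ".join("".join(self.chunks).split())
            bold = " ".join("".join(self.bold_chunks).split())
            if not text:
                return
            if text == bold:
                # Assumption: a fully bold paragraph starts a new section.
                self.current = text
                self.sections.setdefault(text, [])
            elif self.current is not None:
                self.sections[self.current].append(text)

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)
            if self.bold_depth:
                self.bold_chunks.append(data)


def group_paragraphs(html):
    """Return {header: [paragraph, ...]} for the given HTML string."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


sample = """
<p><b><location>SUMMARY1</location></b></p>
<p>This is a region in <location>Europe</location>
where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
<p><b>SUMMARY2</b></p>
<p>Another paragraph.</p>
"""

sections = group_paragraphs(sample)
print(json.dumps(sections, indent=2))
```

Paragraphs that appear before the first header are dropped in this sketch. For non-HTML input the same idea should carry over: decide on a rule that recognizes a header line (all caps, short, bold, and so on) and attach every following block of text to the last header seen.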