Background:
I'm somewhat familiar with parsing XML with Java via the DOM.
What I'm trying to do:
I'm trying to parse an HL7/XML Structured Product Label (SPL) from the NLM DailyMed website. An example URL of what I am trying to parse is: Atenolol SPL
What I've tried so far:
I've tried DOM, ElementTree, lxml, and minidom. The closest I've been able to come has been using this code:
#!/usr/bin/python3
import xml.sax
from xml.dom.minidom import parse
import xml.dom.minidom

# ------Using SAX Parser---------------
class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.type = ""
        self.title = ""
        self.text = ""
        self.description = ""
        self.displayName = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "code":
            print ("*****Section*****")
            code = attributes["code"]
            #displayName = attributes["displayName"]
            print ("Code:", code)
            #print("Display Name:", displayName)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print ("Type:", self.type)
        elif self.CurrentData == "displayName":
            print("Display Name:", self.displayName)
        elif self.CurrentData == "title":
            print ("Title:", self.CurrentData.title())
        elif self.CurrentData == "text":
            print ("Text:", self.text)
        elif self.CurrentData == "description":
            print ("Description:", self.description)
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content

if (__name__ == "__main__"):
    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # override the default ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler(Handler)
    # saved_file_path is the path to the downloaded SPL XML file (defined elsewhere)
    parser.parse(saved_file_path)
The results in console are:
*****Section***** Code: 34391-3 Title: Title
*****Section***** Code: 57664-264
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-264-88
*****Section***** Code: 57664-264-13
*****Section***** Code: 57664-264-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-265
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-265-88
*****Section***** Code: 57664-265-13
*****Section***** Code: 57664-265-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 57664-266
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 50VV3VW0TI
*****Section***** Code: 368GB5141J
*****Section***** Code: 70097M6I30
*****Section***** Code: 57664-266-88
*****Section***** Code: 57664-266-13
*****Section***** Code: 57664-266-18
*****Section***** Code: SPLCOLOR
*****Section***** Code: SPLSHAPE
*****Section***** Code: SPLSCORE
*****Section***** Code: SPLSIZE
*****Section***** Code: SPLIMPRINT
*****Section***** Code: SPLCOATING
*****Section***** Code: SPLSYMBOL
*****Section***** Code: 34066-1 Title: Title Title: Title
*****Section***** Code: 34089-3 Title: Title
*****Section***** Code: 34090-1 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34067-9 Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34070-3 Title: Title
*****Section***** Code: 34071-1 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 42232-9 Title: Title
*****Section***** Code: 34072-9 Title: Title
*****Section***** Code: 34073-7 Title: Title
*****Section***** Code: 34083-6 Title: Title
*****Section***** Code: 34091-9 Title: Title
*****Section***** Code: 42228-7 Title: Title
*****Section***** Code: 34080-2 Title: Title
*****Section***** Code: 34081-0 Title: Title
*****Section***** Code: 34082-8 Title: Title Title: Title Title: Title
*****Section***** Code: 34084-4 Title: Title Title: Title Title: Title Text:
*****Section***** Code: 34088-5 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title Text:
*****Section***** Code: 34068-7 Title: Title Title: Title Title: Title Title: Title Title: Title Title: Title
*****Section***** Code: 34069-5 Title: Title
Process finished with exit code 0
The issues / what's not working:
I don't really need the sections prior to the sections containing "Code: XXXXX-X".
For each of those sections I want to get the values of the <title>, <text>, and <paragraph> tags for that section and all sub-sections of that section.
While I've been able to use the tutorials for DOM, ElementTree, lxml, and minidom, the target XML is non-standard and contains multiple attributes in a single tag, for example:
<code code="34090-1" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="Clinical Pharmacology section" />
And some nodes/elements will contain a shortcut end tag (as seen above) while others will have a full traditional end tag.
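As a quick sanity check, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. It suggests that both forms (several attributes on one tag, and a self-closing tag next to a fully closed one) are handled by an ordinary XML parser. The `sample` string is a shortened, hypothetical fragment built around the `<code>` element above; the `<section>` wrapper and the title text are assumptions, not copied from the real label.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened fragment; the real label is far larger and namespaced.
sample = (
    '<section>'
    '<code code="34090-1" codeSystem="2.16.840.1.113883.6.1" '
    'codeSystemName="LOINC" displayName="Clinical Pharmacology section" />'
    '<title>CLINICAL PHARMACOLOGY</title>'
    '</section>'
)

root = ET.fromstring(sample)
code = root.find("code")

# All attributes of the self-closing element end up in one mapping.
code_attrs = dict(code.attrib)

# The element with a full end tag exposes its content via .text.
title_text = root.find("title").text

print(code_attrs["code"], code_attrs["displayName"], title_text)
```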
No wonder healthcare is so complicated!
So how do I get the contents of the tag and iterate over the subsections to do the same?
I hope I understood your question correctly. This code loads the XML via the requests module and then extracts each <code> and the subsequent <title> and <paragraph> inside <text>:
import requests
from bs4 import BeautifulSoup

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/f36d4ed3-dcbb-4465-9fa6-1da811f555e6.xml'
soup = BeautifulSoup( requests.get(url).text, 'html.parser' )

for section in soup.select('section:has(> code[code]):has(> title)'):
    print('Code = ', section.select_one('code')['code'])
    for title in section.select('title'):
        print()
        print('Title = ', title.text)
        print('*' * 80)
        txt = title.find_next_sibling('text')
        if not txt:
            continue
        for paragraph in txt.select('paragraph'):
            for tag in paragraph.select('br'):
                tag.replace_with("\n")
            print()
            lines = '\n'.join(line.strip() for line in paragraph.get_text().splitlines() if line.strip())
            print(lines)
    print('-' * 120 + '\n')
Prints:
Code = 34066-1
Title = BOXED WARNING
********************************************************************************
Title = Cessation of Therapy with Atenolol
********************************************************************************
Patients with coronary artery disease, who are being treated with atenolol, should be advised against abrupt discontinuation of therapy. Severe exacerbation of angina and the occurrence of myocardial infarction and ventricular arrhythmias have been reported in angina patients following the abrupt discontinuation of therapy with beta-blockers. The last two complications may occur with or without preceding exacerbation of the angina pectoris. As with other beta-blockers, when discontinuation of atenolol tablet, USP, is planned, the patients should be carefully observed and advised to limit physical activity to a minimum. If the angina worsens or acute coronary insufficiency develops, it is recommended that atenolol tablet, USP be promptly reinstituted, at least temporarily. Because coronary artery disease is common and may be unrecognized, it may be prudent not to discontinue atenolol tablet, USP, therapy abruptly even in patients treated only for hypertension. (See DOSAGE AND ADMINISTRATION.)
------------------------------------------------------------------------------------------------------------------------
Code = 34089-3
Title = DESCRIPTION
********************************************************************************
Atenolol, USP, a synthetic, beta1-selective (cardioselective) adrenoreceptor blocking agent, may be chemically described as benzeneacetamide, 4 -[2'-hydroxy- 3'-[(1- methylethyl) amino] propoxy]-. The molecular and structural formulas are:
Atenolol (free base) has a molecular weight of 266.34. It is a relatively polar hydrophilic compound with a water solubility of 26.5 mg/mL at 37°C and a log partition coefficient (octanol/water) of 0.23. It is freely soluble in 1N HCl (300 mg/mL at 25°C) and less soluble in chloroform (3 mg/mL at 25°C).
Atenolol is available as 25, 50 and 100 mg tablets for oral administration.
Each tablet contains the labeled amount of atenolol, USP and the following inactive ingredients: povidone, microcrystalline cellulose, corn starch, sodium lauryl sulfate, croscarmellose sodium, colloidal silicon dioxide, sodium stearyl fumarate and magnesium stearate.
------------------------------------------------------------------------------------------------------------------------
...and so on.
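If you would rather keep the section/sub-section nesting explicit instead of flattening it with CSS selectors, a recursive walk with the standard library's `xml.etree.ElementTree` is one alternative. The following is only a sketch under a few assumptions: that the label uses the HL7 v3 default namespace `urn:hl7-org:v3`, that top-level sections sit under `component/structuredBody/component`, and that sub-sections are nested as `component/section`. `SAMPLE` is a hypothetical miniature document and `walk` is a helper name made up for this example; neither comes from the real label.

```python
import xml.etree.ElementTree as ET

# Assumed default namespace of SPL documents.
NS = {"hl7": "urn:hl7-org:v3"}

# Hypothetical miniature document standing in for the real label.
SAMPLE = """<document xmlns="urn:hl7-org:v3">
  <component><structuredBody><component>
    <section>
      <code code="34066-1" displayName="BOXED WARNING SECTION"/>
      <title>BOXED WARNING</title>
      <component>
        <section>
          <title>Cessation of Therapy</title>
          <text><paragraph>Do not stop
              abruptly.</paragraph></text>
        </section>
      </component>
    </section>
  </component></structuredBody></component>
</document>"""


def walk(section, depth=0):
    """Yield (depth, code, title, paragraphs) for a section and its sub-sections."""
    code = section.find("hl7:code", NS)
    title = section.find("hl7:title", NS)
    # Only paragraphs inside this section's own <text> child are collected;
    # runs of whitespace inside a paragraph are collapsed to single spaces.
    paragraphs = [
        " ".join("".join(p.itertext()).split())
        for p in section.findall("hl7:text//hl7:paragraph", NS)
    ]
    yield (
        depth,
        code.get("code") if code is not None else None,
        "".join(title.itertext()).strip() if title is not None else None,
        paragraphs,
    )
    # Sub-sections are assumed to be nested as component/section.
    for child in section.findall("hl7:component/hl7:section", NS):
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
top_sections = root.findall(
    "hl7:component/hl7:structuredBody/hl7:component/hl7:section", NS
)
rows = [row for section in top_sections for row in walk(section)]
for depth, code, title, paragraphs in rows:
    print("  " * depth + f"{code} | {title} | {paragraphs}")
```

To try it on the real label, `SAMPLE` could be swapped for the downloaded bytes, for example `ET.fromstring(requests.get(url).content)`. Note that this sketch does not turn `<br/>` into line breaks the way the BeautifulSoup version does.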