python xml beautifulsoup xml-parsing

How to extract the text content of an XML tag using BeautifulSoup


I have an XML file that looks like this:

<sec id="sec2.1">
<title>Study design</title>
<p id="p0055">
This is a secondary analysis of the Childhood Acute Illness and Nutrition (CHAIN) Network prospective cohort which, between November 2016 and January 2019, recruited 3101 children at nine hospitals in Africa and South Asia: Dhaka and Matlab Hospitals (Bangladesh), Banfora Referral Hospital (Burkina Faso), Kilifi County, Mbagathi County and Migori County Hospitals (Kenya), Queen Elizabeth Hospital (Malawi), Civil Hospital (Pakistan), and Mulago National Referral Hospital (Uganda). As described in the published study protocol,
<xref rid="bib11" ref-type="bibr">
<sup>11</sup>
</xref>
children were followed throughout hospital admission and after discharge with follow-up visits at 45, 90 and 180-days post-discharge. Catchment settings differed in urbanisation, access to health care and prevalence of background comorbidities such as HIV and malaria. Prior to study start, sites were audited to optimise care as per national and World Health Organisation (WHO) guidelines.
<xref rid="bib12" ref-type="bibr">
<sup>12</sup>
</xref>
Cross-network harmonisation of clinical definitions and methods was prioritised through staff training and the use of standard operation procedures and case report forms (available online,
<ext-link ext-link-type="uri" xlink:href="https://chainnetwork.org/resources/" id="intref0010">https://chainnetwork.org/resources/</ext-link>
).
</p>
</sec>

How can I extract the text in the <p id="p0055"> element using BeautifulSoup?

I tried the code below, but it does not seem to work.

from bs4 import BeautifulSoup

with open('test.xml', 'r') as file:
    soup = BeautifulSoup(file, 'xml')

# Print the text of every <sec> element
for tag in soup.find_all('sec'):
    print(tag.text)

Solution

  • You have to select your element more specifically. Based on your code, chain .p to your tag to always get the first <p> in the selected <sec>:

    for tag in soup.find_all('sec'):
        print(tag.p.get_text(strip=True))
    

    or use an extra find():

    for tag in soup.find_all('sec'):
        print(tag.find('p').get_text(strip=True))
    

    or be as specific as possible and select by id if it is known:

    soup.find('p', id='p0055').get_text(strip=True)
    

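    Note that .text is a property that calls .get_text() with no arguments, so call .get_text() directly whenever you need options such as strip=True or a separator.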
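One detail that may matter here: get_text(strip=True) joins the stripped text pieces with no separator, so the citation markers end up glued to the surrounding words. The sketch below shows the difference on a shortened copy of the <p> element from the question. The shortened XML string, the variable names and the "does-not-exist" id are only illustrative, and it uses the built-in "html.parser" so that it runs without lxml installed; swapping in 'xml' as in the question should give the same result for this snippet.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the element from the question.
xml = """<sec id="sec2.1">
<p id="p0055">
As described in the published study protocol,
<xref rid="bib11" ref-type="bibr">
<sup>11</sup>
</xref>
children were followed throughout hospital admission.
</p>
</sec>"""

# "html.parser" is an assumption made to avoid the lxml dependency;
# the question itself uses the 'xml' parser.
soup = BeautifulSoup(xml, "html.parser")
p = soup.find("p", id="p0055")

# strip=True alone: pieces are concatenated without any separator.
joined = p.get_text(strip=True)

# Passing a separator keeps the pieces apart.
spaced = p.get_text(" ", strip=True)

# find() returns None when nothing matches, so guard before calling get_text().
missing = soup.find("p", id="does-not-exist")
missing_text = missing.get_text(" ", strip=True) if missing is not None else ""

print(joined)
print(spaced)
```

The first print shows "protocol,11children" run together, while the second keeps a space on each side of the citation number.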