
Extracting PubMed data in xml format from txt batches in Python


I asked this question before and received a solution that worked perfectly.

The working code for multiple traditional XML files is below.

import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

l = list()

for f in glob('*.xml'):  # changed to '*.txt' for the new batch files
    pub = dict()

    with open(f, 'r') as xml_file:
        xml = xml_file.read()

    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = list()
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    try:
        pub['NCTID'] = soup.find('accessionnumber').text
    except AttributeError:  # soup.find returned None: no accession number
        pub['NCTID'] = None
    l.append(pub)

df = pd.DataFrame(l)
df = df.explode('Publication_type', ignore_index=True)

It gave me my desired output:

    PMID        Publication_type    NCTID
0   34963793    Journal Article     NCT02649218
1   34963793    Review              NCT02649218
2   34535952    Journal Article     None
3   34090787    Journal Article     NCT02424799
4   33615122    Journal Article     NCT01922037

The only thing that has changed since then is the input: I extracted the data with R and the easyPubMed package. The records were downloaded in batches of 100 articles each and saved in XML format inside txt files. I have 150 txt files in total. Instead of ~25,000 rows, the code now extracts only ~250.

How can I update the Python code above to get the same output now that the input files have changed? I have added several txt files here for reproducibility. I need to extract PMID, Publication_type and NCTID.


Solution

  • The previous code builds rows for an XML file that holds a single article, not one that holds hundreds of articles. Because soup.find returns only the first match, only the first article in each file is captured. You need to collect the selected nodes under every <PubmedArticle> instance in the XML instead.

    Consider etree's iterparse, which is less memory-intensive for reading large XML: you extract the needed nodes between the opening and closing of each <PubmedArticle> node:

    import pandas as pd
    import xml.etree.ElementTree as ET
    from glob import glob
    
    data = []                                # INITIALIZE DATA LIST
    for xml_file in glob('*.txt'):
        pub = None
        for event, elem in ET.iterparse(xml_file, events=('start', 'end')):
            if event == 'start':
                if elem.tag == "PubmedArticle":
                    # INITIALIZE ARTICLE DICT
                    pub = {"PMID": None, "PublicationType": [], "NCTID": None}
    
            # 'end' EVENT: ELEMENT TEXT IS ONLY GUARANTEED TO BE COMPLETE HERE
            elif pub is not None:
                if elem.tag == 'PMID' and pub["PMID"] is None:
                    # KEEP THE FIRST PMID; LATER ONES MAY BELONG TO CITED ARTICLES
                    pub["PMID"] = elem.text
    
                elif elem.tag == 'PublicationType':
                    pub["PublicationType"].append(elem.text)
    
                elif elem.tag == 'AccessionNumber':
                    pub["NCTID"] = elem.text
    
                elif elem.tag == "PubmedArticle":
                    pub["Source"] = xml_file
                    data.append(pub)         # APPEND MULTIPLE ARTICLES
                    pub = None
                    elem.clear()             # FREE MEMORY OF FINISHED ARTICLE
    
    # BUILD XML DATA FRAME
    final_df = (
        pd.DataFrame(data)
          .explode('PublicationType', ignore_index=True)
    )
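
For batches small enough to load whole (100 articles per file should qualify), the same per-article idea can also be sketched without iterparse. The snippet below is only an illustration under stated assumptions: extract_rows is a hypothetical helper name, and SAMPLE is a hand-trimmed two-article batch built from the PMIDs in the question's output, not a real easyPubMed file. It loops over every <PubmedArticle> node and reads the fields relative to that node, so the first article no longer hides the rest.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed batch: real PubMed records carry many more nodes.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>34963793</PMID>
      <Article>
        <DataBankList>
          <DataBank>
            <AccessionNumberList>
              <AccessionNumber>NCT02649218</AccessionNumber>
            </AccessionNumberList>
          </DataBank>
        </DataBankList>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
          <PublicationType>Review</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>34535952</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(xml_text):
    """Return one dict per (article, publication type) pair."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        # findtext returns the first match in document order, or None if absent
        pmid = article.findtext(".//PMID")
        nctid = article.findtext(".//AccessionNumber")
        for pub_type in article.iter("PublicationType"):
            rows.append(
                {"PMID": pmid, "Publication_type": pub_type.text, "NCTID": nctid}
            )
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

The resulting list can be passed straight to pd.DataFrame(rows) with no explode step, since each publication type already has its own row. One difference to keep in mind: in this sketch an article with no <PublicationType> nodes produces no row at all, whereas the explode approach keeps it with a missing value.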