I asked this question before and got a solution that worked perfectly.
The working code for multiple traditional XML files is below.
import pandas as pd
from glob import glob
from bs4 import BeautifulSoup

l = list()
for f in glob('*.xml'):  # changed to '*.txt' for the new batch files
    pub = dict()
    with open(f, 'r') as xml_file:
        xml = xml_file.read()
    soup = BeautifulSoup(xml, "lxml")
    pub['PMID'] = soup.find('pmid').text
    pub_list = soup.find('publicationtypelist')
    pub['Publication_type'] = list()
    for pub_type in pub_list.find_all('publicationtype'):
        pub['Publication_type'].append(pub_type.text)
    try:
        pub['NCTID'] = soup.find('accessionnumber').text
    except AttributeError:  # find() returned None: no accession number
        pub['NCTID'] = None
    l.append(pub)

df = pd.DataFrame(l)
df = df.explode('Publication_type', ignore_index=True)
It gave me my desired output:

       PMID Publication_type        NCTID
0  34963793  Journal Article  NCT02649218
1  34963793           Review  NCT02649218
2  34535952  Journal Article         None
3  34090787  Journal Article  NCT02424799
4  33615122  Journal Article  NCT01922037
The only thing that has changed since: I extracted the data using R and the easyPubMed package. The data was extracted in batches (100 articles each) and stored in XML format in txt documents. I have 150 txt documents in total. Instead of ~25000 rows, the code now extracts only ~250.

How can I update the Python code above to get the same output now that the input files have changed? I have added several txt files here for reproducibility. I need to extract PMID, Publication_type, and NCTID.
The previous code builds a data frame from an XML file that contains a single article, not from an XML file that contains hundreds of articles. Therefore, you need to capture the selected nodes under every <PubmedArticle> instance in the XML. Right now, only the first article in each XML file is being captured.
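To illustrate why only one article per file comes back, here is a minimal sketch. It assumes a made-up two-article batch shaped roughly like a PubMed export, and it uses the standard library's ElementTree rather than BeautifulSoup, so the calls differ from the question's code even though the behaviour is analogous: a document-wide find returns the first match only, while a search scoped to each <PubmedArticle> yields one record per article.

```python
import xml.etree.ElementTree as ET

# Made-up two-article batch (illustrative only, not real PubMed data)
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>111</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
          <PublicationType>Review</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>222</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE)

# Searching the whole document returns only the first match
first_only = root.find('.//PMID').text

# Searching inside each <PubmedArticle> gives one record per article
records = [
    {
        "PMID": article.findtext('.//PMID'),
        "PublicationType": [pt.text for pt in article.iter('PublicationType')],
    }
    for article in root.iter('PubmedArticle')
]

print(first_only)
print(records)
```

The same idea would apply to the BeautifulSoup version: loop over every article element and run the lookups on that element instead of on the whole soup.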
Consider etree's iterparse, a less memory-intensive way to read large XML files, and extract the needed nodes between the opening and closing of each <PubmedArticle> node:
import pandas as pd
import xml.etree.ElementTree as ET
from glob import glob

data = []  # INITIALIZE DATA LIST
for xml_file in glob('*.txt'):
    for event, elem in ET.iterparse(xml_file, events=('start', 'end')):
        if event == 'start':
            if elem.tag == "PubmedArticle":
                # INITIALIZE ARTICLE DICT WITH DEFAULTS
                pub = {"PMID": None, "PublicationType": [], "NCTID": None}

        if event == 'end':
            # READ TEXT ON 'end' EVENTS: elem.text IS NOT GUARANTEED ON 'start'
            if elem.tag == 'PMID':
                # KEEP THE FIRST PMID SEEN IN THE ARTICLE; LATER <PMID> NODES
                # (E.G. IN CROSS-REFERENCES) WOULD OTHERWISE OVERWRITE IT
                if pub["PMID"] is None:
                    pub["PMID"] = elem.text

            elif elem.tag == 'PublicationType':
                pub["PublicationType"].append(elem.text)

            elif elem.tag == 'AccessionNumber':
                pub["NCTID"] = elem.text

            elif elem.tag == "PubmedArticle":
                pub["Source"] = xml_file
                data.append(pub)  # APPEND MULTIPLE ARTICLES
                elem.clear()

# BUILD XML DATA FRAME
final_df = (
    pd.DataFrame(data)
      .explode('PublicationType', ignore_index=True)
)
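As a sanity check on the row counts mentioned in the question, the number of <PubmedArticle> elements in each batch file can be counted independently of the extraction loop. The helper below is a hypothetical name introduced for illustration, and it assumes each txt file is a single well-formed XML document:

```python
import xml.etree.ElementTree as ET


def count_articles(path):
    """Count the <PubmedArticle> elements in one file (hypothetical helper)."""
    n = 0
    # 'end' events fire once per element, after it has been fully parsed
    for _, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == 'PubmedArticle':
            n += 1
            elem.clear()  # release the finished article's subtree
    return n
```

Summing count_articles(f) over glob('*.txt') should then match len(data) from the code above. With batches of 100 articles, each file would be expected to report about 100.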