I want to build a dictionary with years as the keys and, as the values, the number of publications with a given keyword that were published in that year.
I've written this script:
from Bio import Entrez
from Bio import Medline
from metapub import PubMedFetcher
fetch = PubMedFetcher()
from collections import Counter
pmids = fetch.pmids_for_query('cancer', retmax=100000000)
year_dict = {}
print(len(pmids))
for pmid in pmids:
    pubmed_rec = Entrez.efetch(db='pubmed', id=pmid, retmode='text', rettype='medline')
    records = Medline.parse(pubmed_rec)
    for rec in records:
        if rec.get('DP'):
            pub_date = rec.get('DP')
            split_date = pub_date.split()[0]
            if split_date not in year_dict:
                year_dict[split_date] = 1
            else:
                year_dict[split_date] += 1
print(year_dict)
It works in a small test with retmax = 100; the output is:
{'2021': 98}
But there are so many papers in reality (>1 million) that it's prohibitively slow. Can anyone suggest an alternative method (where I enter a keyword and it returns a dictionary of years and the number of papers published in each year with that keyword)? I need the query word ('cancer') to actually be a keyword for the paper, not just a word that's mentioned anywhere in the paper.
I'm wondering if it would be easier to do it as a filter and counter, i.e. use Efetch to filter all papers with keyword X and year of publication Y, and repeat that, say, 100 times, going from 2021 back 100 years, rather than iterating through each record as I do now. But I haven't worked out a way to do it.
Instead of fetching every record and reading its date, you can put a date range in the query itself and count the PMIDs returned for each year.
Demo:
from time import sleep
from metapub import PubMedFetcher

fetch = PubMedFetcher()
year_dict = {}
for year in range(2000, 2022):
    # [MDAT] restricts by modification date; use [PDAT] to restrict by publication date
    pmids = fetch.pmids_for_query('cancer '+str(year)+'/01/01[MDAT] : '+str(year)+'/12/31[MDAT]', retmax=10000000)
    year_dict[year] = len(pmids)
    print(str(year)+":", len(pmids))
    sleep(3)  # pause between requests
Output:
2000: 2808
2001: 287
2002: 169
2003: 9722
2004: 149017
2005: 39909
2006: 166419
2007: 89953
2008: 61164
2009: 73170
2010: 40381
2011: 53915
2012: 46640
2013: 189352
2014: 72613
2015: 157995
2016: 247184
2017: 139309
2018: 818714
2019: 1101298
2020: 484091
2021: 420468
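If you only need the totals, it may be cheaper to ask PubMed for the hit count rather than downloading every PMID and calling len() on the list. Below is a minimal sketch of that idea using Biopython's Entrez.esearch with retmax=0 and datetype="pdat" (publication date). The helper names (build_query, entrez_count, count_by_year), the placeholder e-mail address, and the 0.5 second delay are my own assumptions, not part of the original scripts. Using the [Other Term] field (author-supplied keywords) to mean "keyword" is also an assumption; [MeSH Terms] may be closer to what you want.

```python
from time import sleep


def build_query(keyword, field="Other Term"):
    """Restrict the keyword to one PubMed search field, e.g. 'cancer[Other Term]'."""
    return f"{keyword}[{field}]"


def entrez_count(term, year):
    """Ask PubMed how many records match `term` with a publication date in `year`."""
    # Imported here so the other helpers can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = "you@example.com"  # placeholder: replace with your own address
    handle = Entrez.esearch(
        db="pubmed",
        term=term,
        datetype="pdat",   # filter on publication date
        mindate=str(year),
        maxdate=str(year),
        retmax=0,          # no PMIDs needed, only the total
    )
    result = Entrez.read(handle)
    handle.close()
    return int(result["Count"])


def count_by_year(keyword, years, count=entrez_count, delay=0.5):
    """Return {year: number of matching records}, one small request per year."""
    term = build_query(keyword)
    counts = {}
    for year in years:
        counts[year] = count(term, year)
        sleep(delay)  # assumed pause to stay polite to the NCBI servers
    return counts


# Example call (needs network access and Biopython):
# year_dict = count_by_year("cancer", range(2000, 2022))
```

The count argument is there only so the loop can be tried with a stand-in function before hitting the real service; in normal use you would leave it at its default.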