python biopython ncbi

Number of NCBI publications published with a keyword, grouped by year


I want to build a dictionary with years as the keys and, as the values, the number of publications with a given keyword that were published in that year.

I've written this script:

from Bio import Entrez
from Bio import Medline
from metapub import PubMedFetcher

fetch = PubMedFetcher()

# Get the PMIDs of every record matching the query.
pmids = fetch.pmids_for_query('cancer', retmax=100000000)
year_dict = {}
print(len(pmids))
for pmid in pmids:
    # One efetch request per PMID: this is the slow part.
    pubmed_rec = Entrez.efetch(db='pubmed', id=pmid, retmode='text', rettype='medline')
    records = Medline.parse(pubmed_rec)
    for rec in records:
        pub_date = rec.get('DP')
        if pub_date:
            # 'DP' starts with the year, e.g. '2021 Mar 5'.
            year = pub_date.split()[0]
            year_dict[year] = year_dict.get(year, 0) + 1
print(year_dict)

It works when I run a small test with retmax=100; the output is:

{'2021': 98}

But in reality there are so many papers (more than 1 million) that this is prohibitively slow. Can anyone suggest an alternative method, where I enter a keyword and get back a dictionary of years and the number of papers published in each year with that keyword? I need the query word ('cancer') to actually be a keyword of the paper, not just a word mentioned anywhere in it.

I'm wondering if it would be easier to do this as a filter and count: query for all papers with keyword X and publication year Y, then repeat that, say, 100 times, going from 2021 back 100 years, rather than iterating through every record as I do now. But I haven't worked out a way to do it.


Solution

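  • Instead of fetching every record and reading its publication date, you can put a date range in the query itself and count the PMIDs returned for each year. Note that the [MDAT] tag used in the demo filters on the record's modification date; PubMed's publication-date tag is [DP] (also written [PDAT]).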

    Demo:

    from time import sleep

    from metapub import PubMedFetcher

    fetch = PubMedFetcher()

    year_dict = {}
    for year in range(2000, 2022):
        # Restrict the search to one calendar year with a date-range filter.
        query = f'cancer {year}/01/01[MDAT] : {year}/12/31[MDAT]'
        pmids = fetch.pmids_for_query(query, retmax=10000000)
        year_dict[year] = len(pmids)
        print(f'{year}:', len(pmids))
        sleep(3)  # pause between requests to go easy on the NCBI servers

    Output:

    2000: 2808
    2001: 287
    2002: 169
    2003: 9722
    2004: 149017
    2005: 39909
    2006: 166419
    2007: 89953
    2008: 61164
    2009: 73170
    2010: 40381
    2011: 53915
    2012: 46640
    2013: 189352
    2014: 72613
    2015: 157995
    2016: 247184
    2017: 139309
    2018: 818714
    2019: 1101298
    2020: 484091
    2021: 420468
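
A possible variation on the demo above: since only the number of hits per year is needed, ESearch can be asked for the count alone instead of the full PMID list. The sketch below is a minimal illustration under stated assumptions, not part of the original answer and not the metapub API. `build_term`, `esearch_count` and `counts_by_year` are hypothetical helper names, the email address is a placeholder, and it assumes the [DP] (date of publication) tag together with Biopython's `Entrez.esearch`. The optional `field` argument (for example "MeSH Terms" or "Other Term") is one way to require the word to be an indexed term and not a match anywhere in the record; which field best corresponds to "keyword" for your purposes is an assumption you would need to check.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def build_term(keyword: str, year: int, field: Optional[str] = None) -> str:
    """Build a PubMed query for one keyword limited to one publication year."""
    # Optionally pin the keyword to a search field such as "MeSH Terms".
    kw = f'"{keyword}"[{field}]' if field else keyword
    # [DP] is the Date of Publication field; a range is written start:end.
    return f"{kw} AND {year}/01/01:{year}/12/31[DP]"


def esearch_count(term: str) -> int:
    """Return only the hit count for a query (retmax=0 requests no PMIDs)."""
    from Bio import Entrez  # imported lazily so the other helpers work offline

    Entrez.email = "you@example.org"  # placeholder: replace with your address
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    try:
        return int(Entrez.read(handle)["Count"])
    finally:
        handle.close()


def counts_by_year(
    keyword: str,
    years: Iterable[int],
    field: Optional[str] = None,
    count_fn: Callable[[str], int] = esearch_count,
    delay: float = 0.4,
) -> Dict[int, int]:
    """Map each year to the number of hits for the keyword in that year."""
    result: Dict[int, int] = {}
    for year in years:
        result[year] = count_fn(build_term(keyword, year, field))
        time.sleep(delay)  # short pause between requests
    return result
```

For example, `counts_by_year('cancer', range(2000, 2022), field='MeSH Terms')` would issue one small request per year. Passing `count_fn` explicitly makes the loop easy to try without network access.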