python bioinformatics biopython pubmed

Different results from browser search vs. Bio.Entrez search


I am getting different results when I search with Bio.Entrez. For example, when I search in the browser with the query "covid side effect" I get 344 results, whereas I get only 92 with Bio.Entrez. This is the code I was using:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.esearch(db="pubmed", retmax=40, term="covid side effect", idtype="acc")
record = Entrez.read(handle)
handle.close()
print(record['Count'])

I was hoping someone could help me understand this discrepancy.
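
One way to narrow down a discrepancy like this is to compare how each side interprets the query. The following is a minimal sketch, assuming the parsed esearch result for PubMed includes a `QueryTranslation` field. The helper names `summarize_search` and `run_search` are hypothetical and not part of Biopython. It returns the hit count together with the expanded query, which can then be compared with the search details shown on the PubMed website.

```python
def summarize_search(record):
    """Return the hit count and the expanded query from a parsed esearch record."""
    return {
        "count": int(record["Count"]),
        # Assumption: PubMed esearch results carry the translated query here.
        "translation": record.get("QueryTranslation", ""),
    }


def run_search(term, email):
    """Run the search from the question and summarize it (needs network access)."""
    from Bio import Entrez  # imported here so summarize_search works without Biopython

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=term)
    record = Entrez.read(handle)
    handle.close()
    return summarize_search(record)


# Example call (performs a real request, so it is left commented out):
# print(run_search("covid side effect", "Your.Name.Here@example.org"))
```

If the translated query differs from what the website reports, the two searches are not actually running the same query, which would be one possible explanation for the different counts.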


Solution

  • For some reason, everyone seems to have the same issue, whether they use the R API or the Python API. I found a workaround that returns the same results as the browser. It is slow, but it gets the job done. If your search returns fewer than 10k results, you could probably use Selenium to collect the PubMed IDs. Otherwise, you can scrape the IDs with the code below. I hope this helps someone in the future.

    import requests
    # Optional URL variants (i is the page number):

    # Custom date range
    # req = requests.get("https://pubmed.ncbi.nlm.nih.gov/?term=covid&filter=dates.2009/01/01-2020/03/01&format=pmid&sort=pubdate&size=200&page={}".format(i))

    # Custom year range
    # req = requests.get("https://pubmed.ncbi.nlm.nih.gov/?term=covid&filter=years.2010-2019&format=pmid&sort=pubdate&size=200&page={}".format(i))

    # Relative date
    # req = requests.get("https://pubmed.ncbi.nlm.nih.gov/?term=covid&filter=datesearch.y_1&format=pmid&sort=pubdate&size=200&page={}".format(i))

    # Filter by language
    # &filter=lang.english

    # Filter to humans
    # &filter=hum_ani.humans

    # Systematic reviews
    # &filter=pubt.systematicreview

    # Case reports
    # &filter=pubt.casereports

    # Age
    # &filter=age.newborn
    
    search = "covid lungs"

    def id_retriever(search_string):
        """Collect PMIDs by paging through the PubMed web results, 200 per page."""
        url = "https://pubmed.ncbi.nlm.nih.gov/"
        result = []
        for page in range(1, 10000000):
            # requests URL-encodes the query, so spaces need no manual "+" joining
            params = {"term": search_string, "format": "pmid", "sort": "pubdate", "size": 200, "page": page}
            req = requests.get(url, params=params)
            count_before = len(result)

            for line in req.iter_lines():
                decoded = line.decode("utf-8").strip(" ")
                # The PMIDs shown on the page sit in a quoted, comma-separated value on this line
                if "log_displayeduids" in decoded and len(decoded) > 46:
                    result += decoded.split('"')[-2].split(",")

            # Stop as soon as a page adds no new IDs
            if len(result) == count_before:
                break
        return result

    ids = id_retriever(search)
    print(len(ids))
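
The filter snippets in the comments above can be combined into a single request URL. Here is a small sketch of a URL builder. It assumes that several filters are passed as repeated `filter` parameters, and the names `build_search_url` and `PUBMED_URL` are made up for this example.

```python
from urllib.parse import urlencode

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_search_url(term, page=1, filters=(), size=200):
    """Build a PubMed web-search URL that asks for one page of PMIDs."""
    params = [
        ("term", term),
        ("format", "pmid"),
        ("sort", "pubdate"),
        ("size", size),
        ("page", page),
    ]
    # Assumption: each filter is sent as its own "filter=..." parameter.
    params.extend(("filter", f) for f in filters)
    # safe="/" keeps date filters such as dates.2009/01/01-2020/03/01 readable
    return PUBMED_URL + "?" + urlencode(params, safe="/")


url = build_search_url("covid lungs", page=2, filters=["lang.english", "hum_ani.humans"])
print(url)
```

The resulting URL could be passed to `requests.get` inside the paging loop in place of the hard-coded string.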