I am working on a project that requires to work with Genia corpus. According to the literature Genia Corpus is made from articles extracted by searching 3 Mesh terms : “transcription factor”, “blood cell” and “human” on Medline/Pubmed. I want to extract full text article(which are freely available) for the articles in Genia corpus from Pubmed. I have tried many approaches but I am not able to find a way to download full text in text or XML or Pdf format.
Using Entrez utils provided by NCBI :
I have tried using the approach mentioned here - http://www.hpa-bioinformatics.org.uk/bioruby-api/classes/Bio/NCBI/REST/EFetch/Methods.html#M002197
which uses the Ruby gem Bio like this to get the information for a given PubMed ID - Bio::NCBI::REST::EFetch.pubmed(15496913)
But, it doesn't return the full text for the PMID.
Internally, it makes a call like this - http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=1372388&retmode=text&rettype=medline
But, both the Ruby gem and the above call don't return the full text.
On further Internet search, I found that the allowed values for PubMed for rettype and retmode don't have an option to get the full text, as mentioned in the table here - http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly
All the examples and other scripts I have seen on the Internet are only about extracting abstracts. authors etc. and none of them discuss extracting the full text.
Here is another link that I found that uses Python package Bio, but only accesses the information about authors - https://www.biostars.org/p/172296/
How can I download full text of the article in text or XML or Pdf format using Entrez utils provided by NCBI? Or are there already available scripts or web crawlers that I can use?
You can use biopython
to get articles which are on PubMedCentral and then get PDF from it. For all articles which are hosted somewhere else, it is difficult to get a generic solution to get the PDF.
It seems that PubMedCentral does not want you to download articles in bulk. Requests via urllib
are blocked, but the same URL works from a browser.
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
#id is a string list with pubmed IDs
#two of have a public PMC article, one does not
handle = Entrez.efetch("pubmed", id="19304878,19088134", retmode="xml")
records = Entrez.parse(handle)
#checks for all records if they have a PMC identifier
#prints the URL for downloading the PDF
for record in records:
if record.get('MedlineCitation'):
if record['MedlineCitation'].get('OtherID'):
for other_id in record['MedlineCitation']['OtherID']:
if other_id.title().startswith('Pmc'):
print('http://www.ncbi.nlm.nih.gov/pmc/articles/%s/pdf/' % (other_id.title().upper()))