I am trying to search for papers with specific words in the title using Biopython. More precisely, I want papers published between 2010 and 2015 that have the word viral or virus in the title. Here is the code I have:
import re
from Bio import Entrez, Medline
handle = Entrez.esearch(db="pubmed", # database to search
                        term="2010[Date - Publication]:2015[Date - Publication]"
                        )
record = Entrez.read(handle)
handle.close()
pmid_list = record["IdList"] #list of records
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
titles = [] # start with empty list of titles
for record in records:
    ti_list = record['TI'] #titles
    for title in ti_list:
        if title == "virus" and title not in titles: #searching viral/virus
            titles.append(title)
print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)
If I simply print(record['TI']), I get a list of all titles in my search query. However, I'm not able to search for the specific word. I think my mistake may be in the if title == "virus" check (because obviously no paper will be titled with that word alone).
I am pretty stuck. Is there a better way to search for these words in the titles of the papers I've queried?
Thanks.
Edit: Updated code with re.search (and still no luck):
r = re.compile(r"\bvir(al|us)\b")
titles = set() # start with empty list of titles
for record in records:
    ti_list = record['TI'] # titles
    for title in ti_list:
        if r.search(title): #
            titles.add(title)
print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)
New code:
import re
from Bio import Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text",
                       term="2010[Date - Publication]:2015[Date - Publication]")
titles = []
for record in Medline.parse(handle):
    for title in record['TI']:
        titles.append(title)
handle.close()
for title in titles:
    print(title)
If you want to match substrings, use the in operator to see if any of the words are contained in the title:
words = ("viral","virus")
if any(w in title for w in words) and title not in titles: #
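As a minimal sketch of how that substring check behaves, here it is wrapped in a small helper and run over a few made-up titles. The helper name contains_any and the sample titles are hypothetical and only serve the illustration; they do not come from your PubMed query.

```python
words = ("viral", "virus")

def contains_any(title, words=words):
    """Return True if any of the words occurs as a substring of title."""
    return any(w in title for w in words)

# Hypothetical titles, invented for this example
sample_titles = [
    "Structure of a viral capsid protein",
    "Emerging viruses in bats",
    "Bacterial growth kinetics",
]

# Keep only the titles that contain one of the words
matches = [t for t in sample_titles if contains_any(t)]
for t in matches:
    print(t)
```

Note that "viruses" matches here because "virus" is a substring of it, and that the check is case-sensitive, so a title starting with "Viral" would be missed unless you lowercase the title first.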
But you seem to want to filter the records, keeping any record whose title contains viral or virus:
st = {"viral","virus"}
filtered_records = [ record for record in records if any(w in st for w in record['TI'] )]
If you want to match substrings using a pattern, you actually need to compile it into a regex; "vir(al|us)" is just a string in your code:
import re
r = re.compile("vir(al|us)")
filtered_records = [record for record in records if any(r.search(w) for w in record['TI'])]
The regex in your own loop would go where your if is:
import re
r = re.compile(r"vir(al|us)")
if r.search(title) and title not in titles:
.......
If you don't want viruses etc. to match, use word boundaries in your regex:
r = re.compile(r"\bvir(al|us)\b")
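To see what the word boundaries change, here is a small comparison of the two patterns on a few made-up strings. The names loose and strict and the sample strings are just for this illustration.

```python
import re

loose = re.compile(r"vir(al|us)")        # matches anywhere inside a word
strict = re.compile(r"\bvir(al|us)\b")   # matches only the whole words

# Hypothetical strings, invented for this example
samples = [
    "a virus outbreak",
    "antiviral therapy",
    "new viruses found",
    "viral load",
]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # all four strings
print(strict_hits)  # only "a virus outbreak" and "viral load"
```

Whether you want "antiviral" and "viruses" to count is up to you; the word-boundary version simply excludes them.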
You should also make titles a set, which cannot contain duplicates. A working example using your own code:
r = re.compile(r"\bvir(al|us)\b")
titles = set() # start with an empty set of titles
for record in records:
    ti_list = record['TI'] # titles
    for title in ti_list:
        if r.search(title): #
            titles.add(title)
Which can become a set comprehension:
r = re.compile(r"\bvir(al|us)\b")
titles = {title for record in records for title in record['TI'] if r.search(title)} # titles
Since record['TI'] returns a string and not a list:
r = re.compile(r"\bvir(al|us)\b")
titles = set()
for record in records:
    title = record['TI'] # title is a str not a list
    if r.search(title): #
        titles.add(title)
Apply the same change to the set comprehension and the other examples.
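For instance, the set comprehension with a string title might look like the sketch below. It uses plain dictionaries as stand-ins for the parsed records so it runs without a network call; sample_records is made up for the illustration. Two things in it are my own assumptions rather than part of your code: re.IGNORECASE, so that a capitalised "Viral" also matches, and the "TI" in record check, in case a record has no title field.

```python
import re

# Assumption: ignore case so that "Viral" and "Virus" also match
pattern = re.compile(r"\bvir(al|us)\b", re.IGNORECASE)

# Hypothetical stand-ins for the records yielded by Medline.parse
sample_records = [
    {"TI": "Viral evolution in wild birds."},
    {"TI": "A new virus isolated from shrimp."},
    {"TI": "Antiviral drug resistance."},
    {"TI": "A new virus isolated from shrimp."},  # duplicate title
    {"AB": "Record without a title field."},
]

# record["TI"] is a single string, so search it directly
titles = {
    record["TI"]
    for record in sample_records
    if "TI" in record and pattern.search(record["TI"])
}

print("Publications with viral or virus in the title:")
for title in sorted(titles):  # loop over titles, not over records again
    print(" ", title)
```

The final loop goes over titles rather than records: with real data, records is an iterator that is already used up after the first pass, so looping over it a second time would print nothing.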