I have a weird situation. I am mining PubMed data using rentrez. When I run entrez_search(), then entrez_summary(), and then entrez_fetch(), I get this error (full code at the bottom of the post):
Error: HTTP failure: 400
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eEfetchResult PUBLIC "-//NLM//DTD efetch 20131226//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20131226/efetch.dtd">
<eFetchResult>
<ERROR>Cannot retrieve history data. query_key: 1, WebEnv: NCID_1_51629226_130.14.18.34_9001_1531773486_1795859931_0MetA0_S_MegaStore, retstart: 0, retmax: 552</ERROR>
<ERROR>Can't fetch uids from history because of: NCBI C++ Exception:
Error: UNK_MODULE(CException::eInvalid) "UNK_FILE", line 18446744073709551615: UNK_FUNC ---
</ERROR>
</eFetchResult>
After searching around, I thought I had found the solution in this discussion of query size. When I decreased retmax_set from 500 to 10, the code worked. I then iteratively determined the maximum retmax_set value that would not throw an error and discovered what seems to me to be very weird behavior.
The search term_set = "transcription AND enhancer AND promoter AND 2017:2018[PDAT]" yields 552 records. When running my code with different values of retmax:
- retmax_set <= 183 works
- retmax_set >= 184 gives the error described above

A modified search term_set = "transcription AND enhancer AND promoter AND 2018[PDAT]" yields 186 records. When running this search with different values of retmax:
- retmax_set <= 61 works
- retmax_set >= 62 gives the error described above

The search term_set = "transcription AND enhancer AND promoter AND 2017[PDAT]" yields 395 records (for some reason PubMed labels 29 records as being published in both 2017 and 2018). When running my code on this search term with different values of retmax:
- retmax_set <= 131 works
- retmax_set >= 132 gives the error described above

Interestingly, all three searches start to fail once the retmax value reaches one third of the total number of records (552 / 3 = 184, 186 / 3 = 62, 395 / 3 = 131.67). I'm going to modify my code to calculate retmax_set based on the number of results returned by entrez_search, but I have no idea why rentrez or NCBI is doing this. Any ideas?
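To make the one-third pattern easier to check, here is a small sketch in base R that recomputes it from the record counts and thresholds reported above. The variable names (counts, first_failing, and so on) are made up for illustration, and the code only redoes the arithmetic; it does not query NCBI or explain the behavior.

```r
## Record counts reported for the three searches in this post
counts <- c("2017:2018[PDAT]" = 552, "2018[PDAT]" = 186, "2017[PDAT]" = 395)

## Smallest retmax_set observed to fail for each search
first_failing <- c(184, 62, 132)

## One third of each count, and the smallest whole number at or above it
one_third <- counts / 3
threshold <- ceiling(one_third)

## Side-by-side comparison of the computed and observed thresholds
comparison <- data.frame(
  count         = counts,
  one_third     = round(one_third, 2),
  threshold     = threshold,
  first_failing = first_failing
)
print(comparison)

## Records counted in both single-year searches
overlap <- 186 + 395 - 552
```

In all three cases the computed threshold equals the first failing retmax_set, which is consistent with the failure starting once retmax_set is at least one third of the record count.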
> ## set search term
> term_set = "transcription AND enhancer AND promoter AND 2017:2018[PDAT]"
> ## load package
> library(rentrez)
> ## set maximum records batch
> retmax_set = 182
> ## search pubmed using web history
> search <- entrez_search(
+ db = "pubmed",
+ term = term_set,
+ use_history = T
+ )
> ## get summaries of search hits
> summary <- list(); for (seq_start in seq(0, search$count - 1, retmax_set)) {
+ summary1 <- entrez_summary(
+ db = "pubmed",
+ web_history = search$web_history,
+ retmax = retmax_set,
+ retstart = seq_start
+ )
+ summary <- c(summary, summary1)
+ }
> ## download full XML refs for hits
> XML_refs <- entrez_fetch(
+ db = "pubmed",
+ web_history = search$web_history,
+ rettype = "xml",
+ parsed = TRUE
+ )
>
>
> ## set search term
> term_set = "transcription AND enhancer AND promoter AND 2017:2018[PDAT]"
> ## load package
> library(rentrez)
> ## set maximum records batch
> retmax_set = 183
> ## search pubmed using web history
> search <- entrez_search(
+ db = "pubmed",
+ term = term_set,
+ use_history = T
+ )
> ## get summaries of search hits
> summary <- list(); for (seq_start in seq(0, search$count - 1, retmax_set)) {
+ summary1 <- entrez_summary(
+ db = "pubmed",
+ web_history = search$web_history,
+ retmax = retmax_set,
+ retstart = seq_start
+ )
+ summary <- c(summary, summary1)
+ }
> ## download full XML refs for hits
> XML_refs <- entrez_fetch(
+ db = "pubmed",
+ web_history = search$web_history,
+ rettype = "xml",
+ parsed = TRUE
+ )
>
>
> ## set search term
> term_set = "transcription AND enhancer AND promoter AND 2017:2018[PDAT]"
> ## load package
> library(rentrez)
> ## set maximum records batch
> retmax_set = 184
> ## search pubmed using web history
> search <- entrez_search(
+ db = "pubmed",
+ term = term_set,
+ use_history = T
+ )
> ## get summaries of search hits
> summary <- list(); for (seq_start in seq(0, search$count - 1, retmax_set)) {
+ summary1 <- entrez_summary(
+ db = "pubmed",
+ web_history = search$web_history,
+ retmax = retmax_set,
+ retstart = seq_start
+ )
+ summary <- c(summary, summary1)
+ }
> ## download full XML refs for hits
> XML_refs <- entrez_fetch(
+ db = "pubmed",
+ web_history = search$web_history,
+ rettype = "xml",
+ parsed = TRUE
+ )
Error: HTTP failure: 400
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eEfetchResult PUBLIC "-//NLM//DTD efetch 20131226//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20131226/efetch.dtd">
<eFetchResult>
<ERROR>Cannot retrieve history data. query_key: 1, WebEnv: NCID_1_51629226_130.14.18.34_9001_1531773486_1795859931_0MetA0_S_MegaStore, retstart: 0, retmax: 552</ERROR>
<ERROR>Can't fetch uids from history because of: NCBI C++ Exception:
Error: UNK_MODULE(CException::eInvalid) "UNK_FILE", line 18446744073709551615: UNK_FUNC ---
</ERROR>
</eFetchResult>
>
>
> ## set search term
> term_set = "transcription AND enhancer AND promoter AND 2017:2018[PDAT]"
> ## load package
> library(rentrez)
> ## set maximum records batch
> retmax_set = 185
> ## search pubmed using web history
> search <- entrez_search(
+ db = "pubmed",
+ term = term_set,
+ use_history = T
+ )
> ## get summaries of search hits
> summary <- list(); for (seq_start in seq(0, search$count - 1, retmax_set)) {
+ summary1 <- entrez_summary(
+ db = "pubmed",
+ web_history = search$web_history,
+ retmax = retmax_set,
+ retstart = seq_start
+ )
+ summary <- c(summary, summary1)
+ }
> ## download full XML refs for hits
> XML_refs <- entrez_fetch(
+ db = "pubmed",
+ web_history = search$web_history,
+ rettype = "xml",
+ parsed = TRUE
+ )
Error: HTTP failure: 400
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eEfetchResult PUBLIC "-//NLM//DTD efetch 20131226//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20131226/efetch.dtd">
<eFetchResult>
<ERROR>Cannot retrieve history data. query_key: 1, WebEnv: NCID_1_52654089_130.14.22.215_9001_1531773493_484860305_0MetA0_S_MegaStore, retstart: 0, retmax: 552</ERROR>
<ERROR>Can't fetch uids from history because of: NCBI C++ Exception:
Error: UNK_MODULE(CException::eInvalid) "UNK_FILE", line 18446744073709551615: UNK_FUNC ---
</ERROR>
</eFetchResult>
It turns out rentrez uses 0-based counting, so the 552 records correspond to retstart values of 0 to 551. Since my code was looking for values 1 to 552, it missed the first record (#0) and then threw an error when it looked for the non-existent record #552.
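A minimal sketch of 0-based batching, in case it helps anyone else. batch_starts and fetch_in_batches are hypothetical helper names, not part of rentrez. The fetch helper assumes that entrez_fetch() forwards retmax and retstart to NCBI the same way entrez_summary() does in the code above. I have not verified that batching alone avoids the HTTP 400 in every case.

```r
## Hypothetical helper: 0-based start offsets covering records 0 .. count - 1
batch_starts <- function(count, batch_size) {
  stopifnot(count >= 0, batch_size >= 1)
  if (count == 0) return(numeric(0))
  seq(0, count - 1, by = batch_size)
}

## Hypothetical helper: fetch parsed XML in batches from a web-history search.
## Assumes retmax/retstart are passed through to the E-utilities request.
fetch_in_batches <- function(search, batch_size = 100) {
  starts <- batch_starts(search$count, batch_size)
  lapply(starts, function(seq_start) {
    rentrez::entrez_fetch(
      db = "pubmed",
      web_history = search$web_history,
      rettype = "xml",
      retmax = batch_size,
      retstart = seq_start,
      parsed = TRUE
    )
  })
}

## Offsets for the 552-record search with a batch size of 184
starts_552 <- batch_starts(552, 184)
```

Usage would look something like XML_batches <- fetch_in_batches(search, batch_size = 100), called after entrez_search(..., use_history = TRUE). It returns one parsed XML document per batch. The last offset never exceeds count - 1, so no request starts at the non-existent record #552.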