I was discovering rentrez
package in RStudio (Version 1.1.442) on a lab computer in Linux (Ubuntu 20.04.2) according to this manual.
However, later when I wanted to run the same code on my laptop in Windows 8 Pro (RStudio 2021.09.0 )
library (rentrez)
entrez_dbs()
entrez_db_searchable("gene")
#res <- entrez_search (db = "gene", term = "(Vibrio[Organism] OR vibrio[All Fields]) AND (16s[All Fields]) AND (rna[All Fields]) AND (owensii[All Fields] OR navarrensis[All Fields])", retmax = 500, use_history = TRUE)
I can not get rid of this error, even after closing the session or reinstalling rentrez
package
Error in curl::curl_fetch_memory(url, handle = handle) : schannel: next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE (0x80090326) - This error usually occurs when a fatal SSL/TLS alert is received (e.g. handshake failed).
This is the main problem that I faced.
Later I decided to address pages containing details about the genes and their sequences in FASTA format modifying a code that I have previously used. It uses rvest
and rselenium
packages and the results were perfect.
# Specifying a webpage
url <- "https://www.ncbi.nlm.nih.gov/gene/66940694" # the last 9 numbers is gene id
library(rvest)
library(RSelenium)
# Opening a browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(url)
# Clicked outside in an empty space next to the FASTA button and copied a full xPath (redirecting to a FASTA data containing webpage)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of a webpage: left it from the old code for the case of a long gene
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
# Let's get gene FASTA, for example
page <- read_html(remDr$getPageSource()[[1]])
fasta <- page %>%
html_nodes('pre') %>%
html_text()
print(fasta)
Output: ">NZ_QKKR01000022.1:c3037-151 Vibrio paracholerae strain 2016V-1111 2016V-1111_ori_contig_18, whole genome shotgun sequence\nGGT...
The code worked well to obtain other details about the gene like its accession number, position, organism and etc.
Later I tried to change the code to get simultaneously the same information for several gene IDs following the explanations I got here for the other project of mine.
# Specifying a list of gene IDs
res_id <- c('57838769','61919208','66940694')
dt <- res_id # <lapply> looping function refused to work if an argument had a different name rather than <dt>
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
## Writing a function of GET_FASTA dependent on GENE_ID (x)
get_fasta <- function(x){
link = paste0('https://www.ncbi.nlm.nih.gov/gene/',x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
... there is a continuation below but an error was appearing here, saying that the same xPath, which was successfully used before, can not be found.
Error: Summary: NoSuchElement Detail: An element could not be located on the page using the given search parameters. class: org.openqa.selenium.NoSuchElementException Further Details: run errorDetails method
I tried to delete /a[2]
to get /html/.../p
at the end of the xPath as it was working in the initial code, but an error was appearing later again.
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of FASTA on the website
fasta <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('pre') %>%
html_text()
fasta
return(fasta)
}
## Writing a function of GET_ACC_NUM dependent on GENE_ID (x)
get_acc_num <- function(x){
link = paste0( 'https://www.ncbi.nlm.nih.gov/gene/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p')$clickElement()
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of ACC_NUM on the website
acc_num <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.itemid') %>%
html_text() %>%
str_sub(start= -17)
acc_num
return(acc_num)
}
## Collecting all FUNCTION into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
fasta <- get_fasta(x)
acc_num <- get_acc_num(x)
# Combine into a tibble
combined_data <- tibble( Acc_Number = acc_num,
FASTA = fasta)
}
## Running FUNCTION for all x
df <- lapply(dt, get_data_table)
head(df)
I also tried to write the code
rvest
,for (i in res_id) {}
,/html/.../p/a[2]
or .../p
using if () {} else {}
but the results were even more confusing.
I am studying R coding while working on such tasks, so any suggestions and critics are welcome.
The node pre
is not a valid one. We have to look for value inside class
or 'id` etc.
webElem$sendKeysToElement(list(key = "end")
you don't need this command as there is no necessity yo scroll the page.
Below is code to get you the sequence of genes.
First we have to get the links to sequence of genes which we do it by rvest
library(rvest)
library(dplyr)
res_id <- c('57838769','61919208','66940694')
link = vector()
for(i in res_id){
url = paste0('https://www.ncbi.nlm.nih.gov/gene/', i)
df = url %>%
read_html() %>%
html_node('.note-link')
link1 = xml_attrs(xml_child(df, 3))[["href"]]
link1 = paste0('https://www.ncbi.nlm.nih.gov', link1)
link = rbind(link, link1)
}
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_VARQ01000103.1?report=fasta&from=64&to=2616&strand=true"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true"
After obtaining the links we shall get the sequence of genes which we do it by RSelenium
. I tried to do it with rvest
but couldn't get the sequence.
Launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
Function to get the sequence
get_seq = function(link){
remDr$navigate(link)
Sys.sleep(5)
df = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes(xpath = '//*[@id="viewercontent1"]') %>%
html_text()
return(df)
}
df = lapply(link, get_seq)
Now we have list df
with all the info.