Obtaining data from NCBI gene database with R

Rentrez package

I was exploring the rentrez package in RStudio (Version 1.1.442) on a lab computer running Linux (Ubuntu 20.04.2), following this manual. However, when I later ran the same code on my laptop running Windows 8 Pro (RStudio 2021.09.0), it failed with the error shown below the code:

library (rentrez)
entrez_dbs()
entrez_db_searchable("gene")
#res <- entrez_search (db = "gene", term = "(Vibrio[Organism] OR vibrio[All Fields]) AND (16s[All Fields]) AND (rna[All Fields]) AND (owensii[All Fields] OR navarrensis[All Fields])", retmax = 500, use_history = TRUE)

I cannot get rid of this error, even after closing the session or reinstalling the rentrez package:

Error in curl::curl_fetch_memory(url, handle = handle) : schannel: next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE (0x80090326) - This error usually occurs when a fatal SSL/TLS alert is received (e.g. handshake failed).

This is the main problem that I faced.

RSelenium package

Later I decided to scrape the pages containing the gene details and the sequences in FASTA format by modifying code that I had used previously. It uses the rvest and RSelenium packages, and the results were perfect.

# Specifying a webpage

url <- "https://www.ncbi.nlm.nih.gov/gene/66940694"    # the trailing number is the gene ID

library(rvest)
library(RSelenium)

# Opening a browser

driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(url)

# Inspected the empty space next to the FASTA button, copied the full xPath, and clicked that element (it leads to the page containing the FASTA data)

remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()

webElem <- remDr$findElement("css", "body")

# Scrolling to the end of the webpage: kept from the old code in case of a long gene

for (i in 1:5){
  Sys.sleep(2)
  webElem$sendKeysToElement(list(key = "end"))
}

# Let's get gene FASTA, for example

page <- read_html(remDr$getPageSource()[[1]])
fasta <- page %>%
html_nodes('pre') %>%
  html_text()
print(fasta)

Output: ">NZ_QKKR01000022.1:c3037-151 Vibrio paracholerae strain 2016V-1111 2016V-1111_ori_contig_18, whole genome shotgun sequence\nGGT...
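The printed string above holds the FASTA header and the sequence in one piece, separated by newline characters. A minimal base R sketch of splitting such a string is given here; parse_fasta_text is a hypothetical helper name, and the input is a made-up toy record, not NCBI data.

```r
# Hypothetical helper: split one FASTA record, as returned by html_text(),
# into its header line and its sequence.
parse_fasta_text <- function(txt) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                 # drop empty lines
  stopifnot(length(lines) >= 1, startsWith(lines[1], ">"))
  list(
    header   = sub("^>", "", lines[1]),         # header without the leading ">"
    sequence = paste(lines[-1], collapse = "")  # sequence lines joined together
  )
}

# Toy input (made up for illustration)
toy <- ">toy_record.1:1-12 made-up example\nACGTAC\nGTACGT\n"
rec <- parse_fasta_text(toy)
rec$header
rec$sequence
nchar(rec$sequence)
```

With the real output, fasta[1] would presumably be passed in place of toy.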

The code also worked well for obtaining other details about the gene, such as its accession number, position, and organism.

Looping of the process for several gene IDs

Later I tried to change the code to retrieve the same information for several gene IDs in one run, following the explanations I got here for another project of mine.

library(stringr)    # provides str_sub(), used in get_acc_num()
library(tibble)     # provides tibble(), used in get_data_table()

# Specifying a list of gene IDs

res_id <- c('57838769','61919208','66940694')
dt <- res_id    # lapply() refused to work when this argument had any name other than dt
  
driver <- rsDriver(browser = c("firefox"))  
remDr <- driver[["client"]]

## Writing a function of GET_FASTA dependent on GENE_ID (x)

get_fasta <- function(x){
  link = paste0('https://www.ncbi.nlm.nih.gov/gene/',x)
  remDr$navigate(link)
  remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()

... the function continues below, but an error appeared at this point, saying that the same xPath, which had been used successfully before, could not be found.

Error: Summary: NoSuchElement Detail: An element could not be located on the page using the given search parameters. class: org.openqa.selenium.NoSuchElementException Further Details: run errorDetails method

I tried deleting /a[2] so that the xPath ends with /html/.../p, as that had worked in the initial code, but then another error appeared later.

  webElem <- remDr$findElement("css", "body")

  for (i in 1:5){
    Sys.sleep(2)
    webElem$sendKeysToElement(list(key = "end"))
  } 

 # Addressing selectors of FASTA on the website
  
  fasta <- remDr$getPageSource()[[1]] %>% 
    read_html() %>%
    html_nodes('pre') %>%
    html_text()
  fasta
  return(fasta)
}

## Writing a function of GET_ACC_NUM dependent on GENE_ID (x)

get_acc_num <- function(x){
  link = paste0( 'https://www.ncbi.nlm.nih.gov/gene/', x)
  remDr$navigate(link)
  remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p')$clickElement()
  webElem <- remDr$findElement("css", "body")

  for (i in 1:5){
    Sys.sleep(2)
    webElem$sendKeysToElement(list(key = "end"))
  } 
  
  # Addressing selectors of ACC_NUM on the website
  
  acc_num <- remDr$getPageSource()[[1]] %>% 
    read_html() %>%
    html_nodes('.itemid') %>%
    html_text() %>%
    str_sub(start= -17)
  acc_num
  return(acc_num)
}

## Collecting all FUNCTION into tibble

get_data_table <- function(x){

      # Extract the Basic information from the HTML
      fasta <- get_fasta(x)
      acc_num <- get_acc_num(x)

      # Combine into a tibble
      combined_data <- tibble( Acc_Number = acc_num,
                               FASTA = fasta)
}

## Running FUNCTION for all x

df <- lapply(dt, get_data_table)

head(df)

I also tried to write the code

only with rvest,
to write the loop with for (i in res_id) {},
to introduce two different xPaths ending with /html/.../p/a[2] or .../p using if () {} else {}

but the results were even more confusing.

I am learning R while working on tasks like this, so any suggestions and criticism are welcome.

Solution

The pre node is not a valid selector here. We have to look for a value inside a class, an id, etc.

webElem$sendKeysToElement(list(key = "end")): you don't need this command, as there is no need to scroll the page.

Below is code to get the gene sequences.

First we have to get the links to the gene sequences, which we do with rvest.

library(rvest)
library(xml2)     # xml_attrs() and xml_child(); not attached automatically by rvest >= 1.0.0
library(dplyr)
res_id <- c('57838769','61919208','66940694')

link = vector()
for(i in res_id){
  url = paste0('https://www.ncbi.nlm.nih.gov/gene/', i)
  df = url %>%
    read_html() %>% 
    html_node('.note-link') 
  
  link1 = xml_attrs(xml_child(df, 3))[["href"]]
  link1 = paste0('https://www.ncbi.nlm.nih.gov', link1)
  link = rbind(link, link1)
}

link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444"       
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_VARQ01000103.1?report=fasta&from=64&to=2616&strand=true" 
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true"
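Each link carries the accession and the sequence range as query parameters. As a rough illustration, they can be pulled apart with base R. parse_fasta_link is a hypothetical helper, and reading strand=true as the complementary strand is an assumption, based on the header printed in the question for the third record (c3037-151).

```r
# Hypothetical helper: extract the accession and the range from one
# nuccore FASTA link of the form shown above.
parse_fasta_link <- function(link) {
  parts <- strsplit(link, "?", fixed = TRUE)[[1]]
  accession <- sub(".*/", "", parts[1])          # text after the last "/"
  pairs <- strsplit(strsplit(parts[2], "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  params <- setNames(vapply(pairs, `[`, character(1), 2),
                     vapply(pairs, `[`, character(1), 1))
  list(
    accession = accession,
    from      = as.integer(params[["from"]]),
    to        = as.integer(params[["to"]]),
    # Assumption: strand=true marks the complementary strand
    reverse   = identical(unname(params["strand"]), "true")
  )
}

a <- parse_fasta_link("https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444")
b <- parse_fasta_link("https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true")
a$accession
b$to - b$from + 1    # length of the requested range
```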

After obtaining the links, we get the gene sequences with RSelenium. I tried to do it with rvest alone but couldn't get the sequence, presumably because the page loads it with JavaScript after the initial HTML.

Launch browser

library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]

Function to get the sequence

get_seq = function(link){
  remDr$navigate(link)
  Sys.sleep(5)
  df = remDr$getPageSource()[[1]] %>% 
    read_html() %>% 
    html_nodes(xpath = '//*[@id="viewercontent1"]') %>% 
    html_text()
  return(df)
}

df = lapply(link, get_seq)

Now we have a list, df, with the sequence for each link.
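The fixed Sys.sleep(5) in get_seq may be longer than necessary on a fast connection and too short on a slow one. One possible alternative, sketched here, is to poll until the content is non-empty. wait_for is a hypothetical helper, and fake_fetch only stands in for the browser so that the example runs without Selenium.

```r
# Hypothetical helper: call `fetch` repeatedly until it returns a
# non-empty character result or the timeout (in seconds) expires.
wait_for <- function(fetch, timeout = 15, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- fetch()
    if (length(result) > 0 && any(nzchar(trimws(result)))) return(result)
    if (Sys.time() >= deadline) stop("timed out waiting for content")
    Sys.sleep(interval)
  }
}

# Stand-in for the browser: returns nothing on the first two calls
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else "ACGT"
}

out <- wait_for(fake_fetch, timeout = 5, interval = 0.1)
out
calls
```

Inside get_seq, fetch would presumably be a function wrapping the getPageSource() ... html_text() pipeline, called after remDr$navigate(link).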