Web scraping with R selenider on Linux fails with a --user-data-dir error


I'm attempting to web-scrape using the following R code (obtained from this thread: link to other question):

library(selenider)
library(rvest)

session <- selenider_session("selenium", browser = "chrome")
Sys.sleep(3)

open_url("https://egamersworld.com/callofduty/matches")

elements <- session |> get_page_source() |> html_elements(".item_teams__cKXQT")

res <- data.frame(
  home_team_name = elements |> 
    html_elements(".item_team__evhUQ:nth-child(1) .item_teamName__NSnfH") |> 
    html_text(trim = TRUE),
  home_team_odds = elements |> 
    html_elements(".item_team__evhUQ:nth-child(1) .item_odd__Lm2Wl") |> 
    html_text(trim = TRUE),
  away_team_name = elements |> 
    html_elements(".item_team__evhUQ:nth-child(3) .item_teamName__NSnfH") |> 
    html_text(trim = TRUE),
  away_team_odds = elements |> 
    html_elements(".item_team__evhUQ:nth-child(3) .item_odd__Lm2Wl") |> 
    html_text(trim = TRUE),
  match_date = elements |> 
    html_elements(".item_scores__Vi7YX .item_date__g4cq_") |> 
    html_text(trim = TRUE),
  match_time = elements |> 
    html_elements(".item_scores__Vi7YX .item_time__xBia_") |> 
    html_text(trim = TRUE),
  match_type = elements |> 
    html_elements(".item_scores__Vi7YX .item_bo__u2C9Q") |> 
    html_text(trim = TRUE)
)

This code works fine when I run it locally on Windows 10. However, I have a Linux server that I'd like this script to run on, and when I run it there I get the following error:

Error in `create_selenium_client_internal()`:
! A Selenium session could not be started
Caused by error in `httr2::req_perform()`:
! HTTP 500 Internal Server Error.
✖ Session not created.
✖ Could not start a new session. Error while creating session with the driver service. Stopping driver service: Could not start a new session. Response code 500. Message: probably user data directory is already in use, please specify a unique value for --user-data-dir argument, or don't use --user-data-dir 
  Host info: host: 'Unknown', ip: 'Unknown'
  Build info: version: '4.29.0', revision: '18ae989'
  System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.15.0-134-generic', java.version: '11.0.26'
  Driver info: driver.version: unknown
  Build info: version: '4.29.0', revision: '18ae989'
  System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.15.0-134-generic', java.version: '11.0.26'
  Driver info: driver.version: unknown

I've also attempted creating and setting a directory manually:

server_options = selenium_options(server_options = selenium_server_options(extra_args = c("--user-data-dir=/tmp/testing")))

session <- selenider_session(
  "selenium", 
  browser = "chrome",
  options = server_options
)

This ends with the same error. I've also tried killing all running Chrome processes, but that doesn't help. Is there a way to fix this issue?

Another important note: I have other Python Selenium scripts that work fine on this server, and none of them set --user-data-dir manually. I'm trying to move my code to R because I'm much more proficient in R than in Python.
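
A note on the attempt above: as far as the selenider documentation indicates, `extra_args` in `selenium_server_options()` is handed to the Selenium server process rather than to Chrome, which may be why setting the flag there changed nothing. A minimal sketch of passing Chrome flags through WebDriver capabilities instead follows. It assumes a selenider version that provides `selenium_client_options(capabilities = ...)`; the profile directory, the `--headless=new` and `--no-sandbox` flags, and the helper names `chrome_capabilities()` and `start_chrome_session()` are illustrative, not part of the original script. It has not been verified against this server.

```r
# Build Chrome capabilities with a fresh profile directory per call.
# The directory name and the extra flags are illustrative assumptions.
chrome_capabilities <- function(profile_dir = tempfile("chrome-profile-")) {
  list(
    `goog:chromeOptions` = list(
      args = list(
        paste0("--user-data-dir=", profile_dir),
        "--headless=new",  # servers usually have no display
        "--no-sandbox"     # often needed when Chrome runs as root
      )
    )
  )
}

# Hypothetical helper: start a selenider session with those capabilities.
# Assumes selenium_client_options() accepts a `capabilities` list.
start_chrome_session <- function(caps = chrome_capabilities()) {
  selenider::selenider_session(
    "selenium",
    browser = "chrome",
    options = selenider::selenium_options(
      client_options = selenider::selenium_client_options(capabilities = caps)
    )
  )
}
```

If this works, `session <- start_chrome_session()` would take the place of the `selenider_session()` call in the script.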


Solution

  • With help from TimG's method, I found a working solution: launch chromote manually, then parse the page with rvest. In the working code below, instead of reading the full set of elements, I simply grab the team names that are on the page.

    library(rvest)
    library(chromote)
    
    
    b <- chromote::ChromoteSession$new()
    
    # Without a desktop viewport and user agent, Cloudflare blocks the page
    # and its JavaScript does not load
    
    b$Emulation$setDeviceMetricsOverride(
      width = 1280,
      height = 800,
      deviceScaleFactor = 1,
      mobile = FALSE
    )
    
    user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    b$Emulation$setUserAgentOverride(userAgent = user_agent)
    
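    # Register the load-event wait before navigating so a fast load is not missed
    p <- b$Page$loadEventFired(wait_ = FALSE)
    b$Page$navigate("https://egamersworld.com/callofduty/matches", wait_ = FALSE)
    b$wait_for(p)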
    
    Sys.sleep(3)
    
    html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
    
    parsed_html <- read_html(html)
    
    teams <- parsed_html %>%
      rvest::html_elements(".item_teamName__NSnfH") %>%
      html_text()
    
    
    b$close()
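
To get back the full table from the question rather than only the team names, the parsed HTML can be run through the original selectors. The sketch below uses `html_element()` (singular) once per match row, so a missing value becomes `NA` instead of shortening one column and breaking `data.frame()`. The function `parse_matches()`, the `selectors` vector, and the sample HTML fragment are hypothetical; the fragment's structure is only an assumption about the real page, whose generated class names may change.

```r
library(rvest)

# CSS selectors taken from the question, one per output column
selectors <- c(
  home_team_name = ".item_team__evhUQ:nth-child(1) .item_teamName__NSnfH",
  home_team_odds = ".item_team__evhUQ:nth-child(1) .item_odd__Lm2Wl",
  away_team_name = ".item_team__evhUQ:nth-child(3) .item_teamName__NSnfH",
  away_team_odds = ".item_team__evhUQ:nth-child(3) .item_odd__Lm2Wl",
  match_date     = ".item_scores__Vi7YX .item_date__g4cq_",
  match_time     = ".item_scores__Vi7YX .item_time__xBia_",
  match_type     = ".item_scores__Vi7YX .item_bo__u2C9Q"
)

# Return one row per match; fields absent from a match come back as NA
parse_matches <- function(page) {
  rows <- html_elements(page, ".item_teams__cKXQT")
  cols <- lapply(selectors, function(css) {
    rows |> html_element(css) |> html_text(trim = TRUE)
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up fragment for illustration only (away odds deliberately missing)
sample_html <- '
<div class="item_teams__cKXQT">
  <div class="item_team__evhUQ">
    <span class="item_teamName__NSnfH">Alpha</span>
    <span class="item_odd__Lm2Wl">1.50</span>
  </div>
  <div class="item_scores__Vi7YX">
    <span class="item_date__g4cq_">01.01</span>
    <span class="item_time__xBia_">18:00</span>
    <span class="item_bo__u2C9Q">Bo5</span>
  </div>
  <div class="item_team__evhUQ">
    <span class="item_teamName__NSnfH">Bravo</span>
  </div>
</div>'

matches <- parse_matches(read_html(sample_html))
```

With the chromote code above, the equivalent call would be `parse_matches(parsed_html)`.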