I'm attempting to scrape a web page using the following R code (obtained from this thread: link to other question):
library(selenider)
library(rvest)
session <- selenider_session("selenium", browser = "chrome")
Sys.sleep(3)
open_url("https://egamersworld.com/callofduty/matches")
elements <- session |> get_page_source() |> html_elements(".item_teams__cKXQT")
res <- data.frame(
  home_team_name = elements |>
    html_elements(".item_team__evhUQ:nth-child(1) .item_teamName__NSnfH") |>
    html_text(trim = TRUE),
  home_team_odds = elements |>
    html_elements(".item_team__evhUQ:nth-child(1) .item_odd__Lm2Wl") |>
    html_text(trim = TRUE),
  away_team_name = elements |>
    html_elements(".item_team__evhUQ:nth-child(3) .item_teamName__NSnfH") |>
    html_text(trim = TRUE),
  away_team_odds = elements |>
    html_elements(".item_team__evhUQ:nth-child(3) .item_odd__Lm2Wl") |>
    html_text(trim = TRUE),
  match_date = elements |>
    html_elements(".item_scores__Vi7YX .item_date__g4cq_") |>
    html_text(trim = TRUE),
  match_time = elements |>
    html_elements(".item_scores__Vi7YX .item_time__xBia_") |>
    html_text(trim = TRUE),
  match_type = elements |>
    html_elements(".item_scores__Vi7YX .item_bo__u2C9Q") |>
    html_text(trim = TRUE)
)
This code works fine when I run it locally on Windows 10. However, I have a Linux server that I'd like this script to run on, and when I run it there I get the following error:
Error in `create_selenium_client_internal()`:
! A Selenium session could not be started
Caused by error in `httr2::req_perform()`:
! HTTP 500 Internal Server Error.
✖ Session not created.
✖ Could not start a new session. Error while creating session with the driver service. Stopping driver service: Could not start a new session. Response code 500. Message: probably user data directory is already in use, please specify a unique value for --user-data-dir argument, or don't use --user-data-dir
Host info: host: 'Unknown', ip: 'Unknown'
Build info: version: '4.29.0', revision: '18ae989'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.15.0-134-generic', java.version: '11.0.26'
Driver info: driver.version: unknown
Build info: version: '4.29.0', revision: '18ae989'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.15.0-134-generic', java.version: '11.0.26'
Driver info: driver.version: unknown
I've also tried creating a directory and setting it manually:
server_options <- selenium_options(
  server_options = selenium_server_options(
    extra_args = c("--user-data-dir=/tmp/testing")
  )
)
session <- selenider_session(
  "selenium",
  browser = "chrome",
  options = server_options
)
This ends with the same error. I've also tried killing all running Chrome processes, but that doesn't seem to help. Is there a way to fix this issue?
Another important note: I have some other Python Selenium scripts that work fine on the server, and none of them set --user-data-dir manually. I'm trying to move my code to R because I'm much more proficient in R than in Python.
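One thing that may be worth checking: as far as I can tell, `extra_args` in `selenium_server_options()` is forwarded to the Selenium server process, whereas `--user-data-dir` is a Chrome flag, and ChromeDriver reads Chrome flags from the `args` entry of the `goog:chromeOptions` capability. The sketch below builds such a capability list with a fresh profile directory for every run. The helper name `chrome_capabilities()` is made up for illustration, the extra flags are common choices for headless servers rather than anything taken from the error above, and passing the list through `selenium_client_options(capabilities = ...)` is an assumption about selenider's interface that should be checked against its documentation. It has not been tested on the server in question.

```r
# Hypothetical helper: build Chrome capabilities that point Chrome at a
# fresh, unique profile directory on every call.
chrome_capabilities <- function(profile_dir = tempfile("chrome-profile-")) {
  dir.create(profile_dir, recursive = TRUE, showWarnings = FALSE)
  list(
    `goog:chromeOptions` = list(
      args = list(
        "--headless=new",            # assumption: the server has no display
        "--no-sandbox",              # assumption: often needed when running as root
        "--disable-dev-shm-usage",   # assumption: avoids a small /dev/shm
        paste0("--user-data-dir=", profile_dir)
      )
    )
  )
}

caps <- chrome_capabilities()

# Assumed usage (not run here): hand the capabilities to the Selenium client.
# session <- selenider::selenider_session(
#   "selenium",
#   browser = "chrome",
#   options = selenider::selenium_options(
#     client_options = selenider::selenium_client_options(capabilities = caps)
#   )
# )
```

Because `tempfile()` returns a new path on each call, two sessions started this way should not share a profile directory.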
With help from TimG's method, I've found a working solution. It takes a more manual route: launch chromote directly, then parse the page with rvest. Here is my working code; instead of reading a bunch of elements, it simply grabs the team names on the page.
library(rvest)
library(chromote)
b <- chromote::ChromoteSession$new()
# Without a desktop viewport and a regular browser user agent, Cloudflare
# blocks the request and the page's JavaScript never loads
b$Emulation$setDeviceMetricsOverride(
  width = 1280,
  height = 800,
  deviceScaleFactor = 1,
  mobile = FALSE
)
user_agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
b$Emulation$setUserAgentOverride(userAgent = user_agent)
b$Page$navigate("https://egamersworld.com/callofduty/matches")
b$Page$loadEventFired()
Sys.sleep(3)
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
parsed_html <- read_html(html)
teams <- parsed_html %>%
  rvest::html_elements(".item_teamName__NSnfH") %>%
  html_text()
b$close()
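The fixed `Sys.sleep(3)` in the code above is a guess at how long the page's JavaScript takes to render. A minimal sketch of an alternative is to poll until the elements exist or a timeout expires. The helper `wait_until()` is hypothetical (not part of chromote or rvest), and the demonstration uses a stand-in condition so it runs without a browser.

```r
# Hypothetical helper: call `condition` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Stand-in condition for demonstration: becomes TRUE on the third check.
checks <- 0
ready <- wait_until(function() {
  checks <<- checks + 1
  checks >= 3
}, timeout = 5, interval = 0.1)

# With the chromote session `b` from above, the fixed sleep could become
# (untested sketch, assuming the team-name class is present once rendered):
# wait_until(function() {
#   b$Runtime$evaluate(
#     "document.querySelectorAll('.item_teamName__NSnfH').length"
#   )$result$value > 0
# })
```

Checking the return value before calling `read_html()` makes it possible to tell a slow page apart from a blocked one.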