I am trying to pull team names and odds from this webpage: https://www.winamax.fr/paris-sportifs/sports/1/48/162285
I have found that the data is dynamically rendered via JavaScript.
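library(dplyr)
library(rvest)
library(httr)      # GET(), add_headers(), accept(), content()
library(stringr)   # str_replace()
library(jsonlite)  # fromJSON()
url <- "https://www.winamax.fr/paris-sportifs/sports/1/48/162285"
age <- 'max-age=0'
# Request the page with browser-like headers
resp <- GET(url, add_headers('User-Agent' = 'Mozilla/5.0', 'cache-control' = age,
  'sec-ch-ua-platform' = "Windows"), accept("text/html"))
get2json <- content(resp, as = "text")
# Pull the text of the inline script that holds the preloaded state
ext <- read_html(get2json) %>%
  html_element("body") %>%
  html_nodes("#page-content > script:nth-child(3)") %>%
  html_text()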
The extracted text contains the team names (competitor1Name and competitor2Name). The odds are in the "odds" part at the end.
However, I'm not sure of its exact format, or how to extract it and shape it into a data frame. I have tried several things like the code below, but parsing always fails.
t <- ext %>%
str_replace("var PRELOADED_STATE = \\{", "[{") %>%
# str_replace("\\}\\]", '') %>%
str_replace("\\;", '') %>%
lapply(function(x) as.data.frame(fromJSON(x)))
Seems like you can clean your ext simply by removing the leading "var PRELOADED_STATE = " and the trailing ";" using sub():
cleaned_json <- sub("var PRELOADED_STATE = ", "", ext)
cleaned_json <- sub(";$", "", cleaned_json)
Then parse it (fromJSON() here is assumed to come from the jsonlite package):
parsed_data <- fromJSON(cleaned_json)
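To make the cleaning step concrete, here is a minimal sketch on a toy string. The string `ext_demo` and its contents are made up for illustration and are much smaller than the real payload; the only assumption carried over is that the script text has the shape `var PRELOADED_STATE = {...};`.

```r
library(jsonlite)

# Toy stand-in for the script text (hypothetical, not the real payload)
ext_demo <- 'var PRELOADED_STATE = {"outcomes":{"11":{"betId":1,"label":"Molde","code":"1"}},"odds":{"11":1.3}};'

# fixed = TRUE treats the prefix as a literal string rather than a regex
json_demo <- sub("var PRELOADED_STATE = ", "", ext_demo, fixed = TRUE)
# Drop the trailing semicolon, allowing for trailing whitespace
json_demo <- sub(";\\s*$", "", json_demo)

# Check that what is left is valid JSON before parsing
if (!validate(json_demo)) stop("cleaned text is not valid JSON")

# simplifyVector = FALSE keeps everything as nested lists
parsed_demo <- fromJSON(json_demo, simplifyVector = FALSE)
str(parsed_demo, max.level = 2)
```

Running validate() first is optional, but it gives a clearer failure than a parse error if the page layout changes and the prefix no longer matches.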
Then traverse the parsed JSON, for example:
oc_df <- lapply(parsed_data$outcomes, function(el) {
result <- list(
betid = el$betId,
label = el$label,
code = el$code
)
# Remove NULL elements
result <- result[!sapply(result, is.null)]
# Convert to data frame
as.data.frame(result, stringsAsFactors = FALSE)
})
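# Combine into a single data frame
oc_df <- do.call(rbind, oc_df)
row.names(oc_df) <- NULL
# Add odds (assumes parsed_data$odds has one entry per outcome, in the same order)
oc_df$odds <- unlist(parsed_data$odds)
# Keep codes containing 1, x or 2, excluding those containing "sr"
oc_df <- oc_df %>% filter(grepl("1|x|2", code) & !grepl("sr", code))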
giving
| betid | label | code | odds |
|---|---|---|---|
| 393595273 | Molde | 1 | 1.30 |
| 393595273 | Unentschieden | x | 5.60 |
| 393595273 | Shamrock Rovers | 2 | 10.00 |
| 393595281 | FC Kopenhagen | 1 | 1.86 |
| 393595281 | Unentschieden | x | 3.55 |
| 393595281 | Heidenheim | 2 | 4.50 |
. . .
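One caveat: attaching the odds with unlist(parsed_data$odds) relies on the outcomes and the odds having the same length and order. If both lists are keyed by the same outcome IDs (an assumption about the payload that I have not verified), a lookup by name is more robust. A rough sketch on toy data, where `state_demo`, `oc_demo` and the IDs are hypothetical:

```r
# Toy data shaped like the assumed structure: both lists keyed by outcome ID
state_demo <- list(
  outcomes = list(
    "11" = list(betId = 1, label = "Molde", code = "1"),
    "12" = list(betId = 1, label = "Draw", code = "x"),
    "13" = list(betId = 1, label = "Shamrock Rovers", code = "2")
  ),
  odds = list("13" = 10, "11" = 1.3, "12" = 5.6)  # deliberately out of order
)

ids <- names(state_demo$outcomes)

# One row per outcome, keeping the outcome ID as a column
oc_demo <- do.call(rbind, lapply(ids, function(id) {
  el <- state_demo$outcomes[[id]]
  data.frame(
    outcome_id = id,
    betid = el$betId,
    label = el$label,
    code = el$code,
    stringsAsFactors = FALSE
  )
}))

# Look up each odd by outcome ID; IDs missing from `odds` become NA
odds_by_id <- unlist(state_demo$odds)
oc_demo$odds <- unname(odds_by_id[oc_demo$outcome_id])

print(oc_demo)
```

With this approach a missing or reordered entry shows up as an NA or is matched correctly, instead of silently shifting every odd by one row.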