I am trying to get some data from this page, namely game names and odds and rounds: https://www.winamax.fr/paris-sportifs/sports/1/7/4
I first tried using a GET request from the httr package, first setting the required request headers and then passing them into my request:
library(httr)
#__Headers_____
ck<-{"_login_type=email; logindata=6a492b4639712b6d5938537965496c3263306b334c4d574d3159306e717261707251454b5044587a4d7432363157596b344f694b746d513150664d46695a3545; PHPSESSIONID=djgirno2ipl57l1pka4mthvsr3; _tt_enable_cookie=1; _ttp=b89oq2h1tREYg0mvhJoBNO8Vrmc.tt.1; PHPSESSID=aghdo7tvju46i07vnu5rj5gapj; _ga=GA1.2.396858676.1740467536; _gid=GA1.2.212202420.1740467536; io=V8NDFEhqgPiTNKBDA7wK; AWSALB=0WUs81OH1hpqaW/lYAdlshSXJL5XA50aSqpiJxdjD8D8uJ63dHdNWe4dDkTlHs64FTXIgV2FEItDF8nBE3J7KyovhLm3uyomU4MzbMKtyQKySI8yKe/Xb3VGqfB8; AWSALBCORS=0WUs81OH1hpqaW/lYAdlshSXJL5XA50aSqpiJxdjD8D8uJ63dHdNWe4dDkTlHs64FTXIgV2FEItDF8nBE3J7KyovhLm3uyomU4MzbMKtyQKySI8yKe/Xb3VGqfB8"}
og<- "https://www.winamax.fr"
ref<- "https://www.winamax.fr/"
cr<- "cors"
sfs<- "same-site"
###----_________SCRAPING__________---------
ulr<- "https://sports-eu-west-3.winamax.fr/uof-sports-server/socket.io/?language=FR&version=3.0.5&embed=false&EIO=3&transport=polling&t=PKxymKy&sid=V8NDFEhqgPiTNKBDA7wK"
t<-httr::GET(ulr,add_headers('User-Agent' = 'Chrome/133.0.0.0', 'Cookie' = ck,
"Origin" = og, "Referer" = ref, 'Sec-Ch-Ua-Platform' = "macOS",
'Sec-Fetch-Mode' = cr,
'Sec-Fetch-Site' = sfs ), accept_json())
get2json<- content(t, as = "text")
But I wasn't able to retrieve anything, so I started looking into websocket, which I am not familiar with.
I was able to locate the websocket URL using the Network tab of my browser, and the query string parameters in the Payload tab, so I did the following:
library(websocket)
websocket_url <- "wss://sports-eu-west-3.winamax.fr/uof-sports-server/socket.io/"
header_socket <- list(language = "FR",
version = "3.0.5",
embed = "false",
EIO = "3",
transport = "websocket",
sid = "V8NDFEhqgPiTNKBDA7wK")
ws <- WebSocket$new(websocket_url, headers = header_socket)
However I am not able to connect and I get the following message:
> ws <- WebSocket$new(websocket_url, headers = header_socket)
[2025-02-25 11:16:27] [error] Server handshake response error: websocketpp.processor:20 (Invalid HTTP status.)
I'm a neophyte when it comes to the websocket library. My understanding is that the connection is not accepted because some parameters are missing, but I'm not sure where to look and I am stuck.
Any hints appreciated.
As they use socket.io, a regular (vanilla) WebSocket client might not get along with that service.
From https://socket.io/docs/v2/ :
What Socket.IO is not
Socket.IO is NOT a WebSocket implementation. Although Socket.IO indeed uses WebSocket as a transport when possible, it adds additional metadata to each packet. That is why a WebSocket client will not be able to successfully connect to a Socket.IO server, and a Socket.IO client will not be able to connect to a plain WebSocket server either.
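To make the "additional metadata" concrete: in the Engine.IO 3 / Socket.IO 2 protocol (the EIO=3 in the question's URL), an event travels over the WebSocket as a text frame such as 42["m",{...}], where 4 is the Engine.IO "message" packet type and 2 is the Socket.IO "event" packet type. The sketch below is only an illustration of that framing. The function names encodeEvent and decodeFrame are made up here, the sample payload is invented, and namespaces, acknowledgement ids and binary attachments are ignored.

```javascript
// Engine.IO v3 packet types: the first character of every text frame.
const ENGINE_TYPES = ['open', 'close', 'ping', 'pong', 'message', 'upgrade', 'noop'];
// Socket.IO v2 packet types: the second character, only inside an Engine.IO 'message'.
const SOCKET_TYPES = ['CONNECT', 'DISCONNECT', 'EVENT', 'ACK', 'ERROR', 'BINARY_EVENT', 'BINARY_ACK'];

// Hypothetical helper: build the text frame for an event on the default namespace.
function encodeEvent(name, payload) {
  return '42' + JSON.stringify([name, payload]);
}

// Hypothetical helper: split a text frame into its protocol parts.
function decodeFrame(frame) {
  const engineType = ENGINE_TYPES[Number(frame[0])];
  if (engineType !== 'message') {
    // e.g. '0{"sid":"..."}' (open packet with a server-issued session id) or '2' (ping)
    const body = frame.slice(1);
    return { engineType, data: body.startsWith('{') ? JSON.parse(body) : body };
  }
  const socketType = SOCKET_TYPES[Number(frame[1])];
  const body = frame.slice(2);
  if (socketType !== 'EVENT') return { engineType, socketType, data: body };
  const [event, ...args] = JSON.parse(body);
  return { engineType, socketType, event, args };
}

const frame = encodeEvent('m', { route: 'tournament:4', requestId: 'abc' });
console.log(frame);
console.log(decodeFrame(frame));
```

A plain WebSocket client would have to produce and parse these frames itself, and also handle the open packet and ping/pong, which is what the socket.io client does for us.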
So here's a slightly different take: running a socket.io client in chrome-headless-shell through chromote. I'm currently on the dev version of chromote, which comes with a few new tools, e.g. local_chrome_version(binary = "chrome-headless-shell") to download and use the latest stable headless shell. Any Chromium-based browser (like regular Brave or Chrome) should work just fine, though. The headless shell can also be downloaded manually from https://googlechromelabs.github.io/chrome-for-testing/ , and the current release version of chromote can be pointed to a specific binary through the CHROMOTE_CHROME environment variable, if needed.
As the service also broadcasts its own messages, rather than only responding to our requests, there's also some basic message filtering.
library(chromote)
library(promises)
library(tidyr)
library(dplyr)
library(purrr)
library(uuid)
library(glue)
# start timer
tictoc::tic()
# download & use latest stable chrome-headless-shell, available in chromote dev version (>0.4)
local_chrome_version(binary = "chrome-headless-shell")
#> chromote will now use version 134.0.6998.90 of `chrome-headless-shell` for
#> win64.
# or set / change executable path, if needed
# Sys.setenv(CHROMOTE_CHROME="C:/Program Files/BraveSoftware/Brave-Browser/Application/brave.exe")
b <- ChromoteSession$new()
# for debugging:
# b$parent$debug_messages(TRUE)
# b$view()
b$Page$navigate("about:blank") |> invisible()
p <- b$Runtime$evaluate(r"(
// start with 'about:blank', load socket.io client
const script = document.createElement('script');
script.src = 'https://sports-eu-west-3.winamax.fr/uof-sports-server/socket.io/socket.io.js';
document.head.appendChild(script);
script.onload = () => {
// resolvers for promises, add when sending, resolve when server returned response
const req_promise_resolvers = new Map();
const socket = io(
'https://sports-eu-west-3.winamax.fr',
{
path: '/uof-sports-server/socket.io/',
query: {language: 'FR', version: '3.3.0', embed: 'false'},
transports: ['websocket']
}
);
socket.on('connect', () => {
console.log('socket.io connected!');
});
// handler for "m" messages, ignore responses without requestId
socket.on('m', (message) => {
if (!message.requestId) {
// console.log('< bcast msg:', message);
} else if (req_promise_resolvers.has(message.requestId)){
// console.log('< resp msg:', message);
// call resolve() to fulfill the promise that was created with this specific requestId
req_promise_resolvers.get(message.requestId)(message);
req_promise_resolvers.delete(message.requestId);
} else {
// console.log('< unexpected msg:', message);
}
});
// accessible sendReceive() method, returns promise
window.sendReceive = (message, requestId) => {
return new Promise((resolve, reject) => {
// store resolve method in Map
req_promise_resolvers.set(requestId, resolve);
message.requestId = requestId;
socket.emit('m', message);
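// on timeout, drop the stale resolver before rejecting (no effect if already resolved)
setTimeout(() => {req_promise_resolvers.delete(requestId); reject("timeout")}, 5000);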
});
};
};
)", wait_ = FALSE)$
then(\(value){
promise(function(resolve, reject) {
later::later(function() resolve(TRUE), 4)
})
})
b$wait_for(p)
#> [1] TRUE
# ^- just a crude delay,
# don't want to block chromote coms with Sys.sleep()
# wrapper to call sendReceive() through chromote session
send_recieve <- function(sess, msg){
glue('sendReceive({msg}, "{UUIDgenerate()}");') |>
sess$Runtime$evaluate(awaitPromise = TRUE, returnByValue = TRUE)
}
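The page script above pairs each reply with its request through requestId: a resolver is stored when a message is sent and called when a reply with the same id arrives, while broadcasts without a requestId are ignored. Here is a minimal sketch of that pattern, isolated from socket.io so it can be run in Node.js. createCorrelator, onMessage and the fake emit function are hypothetical names used only for illustration; they are not part of socket.io or chromote.

```javascript
// Hypothetical helper: correlate replies with requests by requestId.
function createCorrelator(emit, timeoutMs = 5000) {
  const pending = new Map(); // requestId -> { resolve, timer }

  function sendReceive(message, requestId) {
    return new Promise((resolve, reject) => {
      // Reject and forget the request if no reply arrives in time.
      const timer = setTimeout(() => {
        pending.delete(requestId);
        reject(new Error('timeout'));
      }, timeoutMs);
      pending.set(requestId, { resolve, timer });
      emit({ ...message, requestId });
    });
  }

  // Returns true if the message settled a pending request, false otherwise.
  function onMessage(message) {
    const entry = message.requestId && pending.get(message.requestId);
    if (!entry) return false; // broadcast or unexpected reply
    clearTimeout(entry.timer);
    pending.delete(message.requestId);
    entry.resolve(message);
    return true;
  }

  return { sendReceive, onMessage, pendingCount: () => pending.size };
}

// Usage with a fake transport that only records what was sent.
const sent = [];
const c = createCorrelator((msg) => sent.push(msg), 1000);
const reply = c.sendReceive({ route: 'tournament:4' }, 'req-1');
c.onMessage({ matches: {} });                     // broadcast: ignored
c.onMessage({ requestId: 'req-1', matches: {} }); // settles `reply`
reply.then((msg) => console.log('got reply for', msg.requestId));
```

Clearing the timer once a reply arrives is a small variation on the original, which leaves the timeout running; either way a late rejection of an already resolved promise has no effect.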
Ligue 1 McDonald's, tournament 4 ( https://www.winamax.fr/paris-sportifs/sports/1/7/4 ):
ligue_1 <-
send_recieve(b, '{route: "tournament:4"}')$result$value
lobstr::tree(ligue_1, max_length = 10)
#> <list>
#> ├─tournaments: <list>
#> │ └─4: <list>
#> │ └─matches: <list>
#> │ ├─50955819
#> │ ├─50955825
#> │ ├─50955823
#> │ ├─50955829
#> │ ├─50955821
#> │ ├─50955813
#> ...
# listviewer::jsonedit(ligue_1)
odds_ <-
ligue_1$odds |>
tibble::enframe(name = "outId", value = "odds") |>
mutate(odds = unlist(odds), outId = strtoi(outId))
out_ <-
ligue_1$outcomes |>
bind_rows(.id = "outId") |>
mutate(outId = strtoi(outId)) |>
select(outId, label, pctDist = percentDistribution)
bets_ <-
ligue_1$bets |>
keep(\(bet) bet$template == "3way") |>
tibble(bets = _) |>
hoist(bets, "betId", outId = "outcomes") |>
unnest_longer(outId)
ligue_1$matches |>
tibble(matches = _) |>
hoist(matches, "matchId", "title", "mainBetId", "sportId", "categoryId", "tournamentId", "matchStart") |>
arrange(matchStart) |>
left_join(bets_, by = join_by(mainBetId == betId)) |>
left_join(out_) |>
left_join(odds_) |>
select(title, label, pctDist, odds)
#> Joining with `by = join_by(outId)`
#> Joining with `by = join_by(outId)`
#> # A tibble: 55 × 4
#> title label pctDist odds
#> <chr> <chr> <int> <dbl>
#> 1 Strasbourg - Lyon Strasbourg 15 2.65
#> 2 Strasbourg - Lyon Match nul 44 3.65
#> 3 Strasbourg - Lyon Lyon 41 2.45
#> 4 Reims - Marseille Reims 2 5.3
#> 5 Reims - Marseille Match nul 8 4.1
#> 6 Reims - Marseille Marseille 90 1.6
#> 7 Saint-Étienne - Paris SG Saint-Étienne 1 12
#> 8 Saint-Étienne - Paris SG Match nul 1 8
#> 9 Saint-Étienne - Paris SG PSG 98 1.18
#> 10 Monaco - Nice Monaco 50 1.82
#> # ℹ 45 more rows
When menu:true is included in the request, the whole menu structure is delivered, including tournament IDs:
ligue_1_menu <-
send_recieve(b, '{route: "tournament:4", menu:true}')$result$value
sports <-
ligue_1_menu$sports |>
{\(sports) tibble(sports) |> mutate(sportId = names(sports) |> strtoi())}() |>
unnest_wider(sports) |>
select(sportName, sportId, categoryId = categories) |>
unnest_longer(categoryId)
tournaments <-
ligue_1_menu$tournaments |>
{\(trn) tibble(trn) |> mutate(tournamentId = names(trn) |> strtoi())}() |>
hoist(trn, "tournamentName", "mainMatchCount", "liveMatchCount", "tvMatchCount")
matches_ <-
ligue_1_menu$matches |>
tibble(matches = _) |>
hoist(matches, "matchId", "title", "mainBetId", "sportId", "categoryId", "tournamentId") |>
left_join(tournaments) |>
left_join(sports)
#> Joining with `by = join_by(tournamentId)`
#> Joining with `by = join_by(sportId, categoryId)`
ligue_1_menu$categories |>
{\(cat) tibble(cat) |> mutate(categoryId = names(cat) |> strtoi())}() |>
unnest_wider(cat) |>
left_join(sports) |>
select(sportName, categoryName, tournamentId = tournaments) |>
unnest_longer(tournamentId) |>
left_join(tournaments) |>
arrange(sportName, categoryName, tournamentName) |>
filter(sportName == "Football")
#> Joining with `by = join_by(categoryId)`
#> Joining with `by = join_by(tournamentId)`
#> # A tibble: 67 × 8
#> sportName categoryName tournamentId tournamentName mainMatchCount
#> <chr> <chr> <int> <chr> <int>
#> 1 Football Allemagne 41 2. Bundesliga 10
#> 2 Football Allemagne 42 Bundesliga 19
#> 3 Football Allemagne 43 Coupe d'Allemagne 3
#> 4 Football Angleterre 2 Championship 12
#> 5 Football Angleterre 16 FA Cup 5
#> 6 Football Angleterre 1 Premier League 21
#> 7 Football Arabie Saoudite 3708 Saudi Pro League 1
#> 8 Football Argentine 162285 Primera División 32
#> 9 Football Australie 144 A-League 13
#> 10 Football Autriche 29 Bundesliga 7
#> # ℹ 57 more rows
#> # ℹ 3 more variables: liveMatchCount <int>, tvMatchCount <int>,
#> # trn <named list>
tictoc::toc()
#> 6.1 sec elapsed
Created on 2025-03-21 with reprex v2.1.1