r web-scraping websocket

Connection to socket.io with R websocket package not working


I am trying to get some data from this page, namely game names, odds, and rounds: https://www.winamax.fr/paris-sportifs/sports/1/7/4

I first tried a GET request with the httr package, setting the required request headers and passing them into my request:

library(httr)
#__Headers_____

ck<-{"_login_type=email; logindata=6a492b4639712b6d5938537965496c3263306b334c4d574d3159306e717261707251454b5044587a4d7432363157596b344f694b746d513150664d46695a3545; PHPSESSIONID=djgirno2ipl57l1pka4mthvsr3; _tt_enable_cookie=1; _ttp=b89oq2h1tREYg0mvhJoBNO8Vrmc.tt.1; PHPSESSID=aghdo7tvju46i07vnu5rj5gapj; _ga=GA1.2.396858676.1740467536; _gid=GA1.2.212202420.1740467536; io=V8NDFEhqgPiTNKBDA7wK; AWSALB=0WUs81OH1hpqaW/lYAdlshSXJL5XA50aSqpiJxdjD8D8uJ63dHdNWe4dDkTlHs64FTXIgV2FEItDF8nBE3J7KyovhLm3uyomU4MzbMKtyQKySI8yKe/Xb3VGqfB8; AWSALBCORS=0WUs81OH1hpqaW/lYAdlshSXJL5XA50aSqpiJxdjD8D8uJ63dHdNWe4dDkTlHs64FTXIgV2FEItDF8nBE3J7KyovhLm3uyomU4MzbMKtyQKySI8yKe/Xb3VGqfB8"}
og<- "https://www.winamax.fr"
ref<- "https://www.winamax.fr/"
cr<- "cors"
sfs<- "same-site"

###----_________SCRAPING__________---------

ulr<- "https://sports-eu-west-3.winamax.fr/uof-sports-server/socket.io/?language=FR&version=3.0.5&embed=false&EIO=3&transport=polling&t=PKxymKy&sid=V8NDFEhqgPiTNKBDA7wK"

t<-httr::GET(ulr,add_headers('User-Agent' = 'Chrome/133.0.0.0', 'Cookie' = ck, 
                             "Origin" = og, "Referer" = ref, 'Sec-Ch-Ua-Platform' = "macOS",
'Sec-Fetch-Mode' = cr, 
'Sec-Fetch-Site' = sfs ), accept_json())

get2json<- content(t, as = "text")

However, I wasn't able to retrieve anything, so I started looking into WebSockets, which I am not familiar with.

I was able to locate the WebSocket URL using the Network tab of my browser, and the query string parameters in the Payload tab, so I did the following:

library(websocket)
websocket_url <- "wss://sports-eu-west-3.winamax.fr/uof-sports-server/socket.io/"

header_socket <- list(language = "FR",
                      version = "3.0.5",
                      embed = "false",
                      EIO = "3",
                      transport = "websocket",
                      sid = "V8NDFEhqgPiTNKBDA7wK")

ws <- WebSocket$new(websocket_url, headers = header_socket)

However I am not able to connect and I get the following message:

> ws <- WebSocket$new(websocket_url, headers = header_socket)
[2025-02-25 11:16:27] [error] Server handshake response error: websocketpp.processor:20 (Invalid HTTP status.)

I'm a neophyte when it comes to the websocket library. My understanding is that the connection is rejected because some parameters are missing, but I'm not sure where to look and I am stuck.

Any hints appreciated.


Solution

  • As they use Socket.IO, a regular (vanilla) WebSocket client might not get along with that service.
    From https://socket.io/docs/v2/ :

    What Socket.IO is not
    Socket.IO is NOT a WebSocket implementation. Although Socket.IO indeed uses WebSocket as a transport when possible, it adds additional metadata to each packet. That is why a WebSocket client will not be able to successfully connect to a Socket.IO server, and a Socket.IO client will not be able to connect to a plain WebSocket server either.
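
To make the "additional metadata" concrete: under the Engine.IO 3 protocol, which the `EIO=3` parameter in the question points to, every text frame starts with a one-digit Engine.IO packet type, and message packets (type 4) carry a Socket.IO packet with its own leading type digit. An event therefore looks roughly like `42["m",{...}]` on the wire. The sketch below splits such frames in base R. The helper `parse_sio_frame()` and both sample frames are made up for illustration and were not captured from this service.

```r
# Engine.IO v3 packet types (first character of every text frame)
eio_types <- c("0" = "open", "1" = "close", "2" = "ping", "3" = "pong",
               "4" = "message", "5" = "upgrade", "6" = "noop")
# Socket.IO v2 packet types (second character, only inside "message" frames)
sio_types <- c("0" = "connect", "1" = "disconnect", "2" = "event",
               "3" = "ack", "4" = "error", "5" = "binary_event",
               "6" = "binary_ack")

# Hypothetical helper: split a raw text frame into type labels and payload.
# Simplification: ignores namespaces and ack ids that may precede the payload.
parse_sio_frame <- function(frame) {
  eio <- unname(eio_types[substr(frame, 1, 1)])
  if (!identical(eio, "message")) {
    return(list(eio = eio, sio = NA_character_, payload = substring(frame, 2)))
  }
  list(
    eio     = eio,
    sio     = unname(sio_types[substr(frame, 2, 2)]),
    payload = substring(frame, 3)
  )
}

# Made-up sample frames
open_frame  <- parse_sio_frame('0{"sid":"abc","pingInterval":25000}')
event_frame <- parse_sio_frame('42["m",{"route":"tournament:4"}]')

# The remaining payload is plain JSON; jsonlite::fromJSON() could parse it further
cat(event_frame$eio, event_frame$sio, event_frame$payload, "\n")
```

A plain WebSocket client sees these prefixed strings rather than bare JSON, and it would also have to answer ping packets and handle the open handshake itself, which is the work a Socket.IO client library normally does.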

    So here's a slightly different take:

    library(chromote)
    library(promises)
    library(tidyr)
    library(dplyr)
    library(purrr)
    library(uuid)
    library(glue)
    
    # start timer
    tictoc::tic()
    
    # download & use latest stable chrome-headless-shell, available in chromote dev version (>0.4)
    local_chrome_version(binary = "chrome-headless-shell")
    #> chromote will now use version 134.0.6998.90 of `chrome-headless-shell` for
    #> win64.
    
    # or set / change executable path, if needed
    # Sys.setenv(CHROMOTE_CHROME="C:/Program Files/BraveSoftware/Brave-Browser/Application/brave.exe")
    
    b <- ChromoteSession$new()
    # for debugging:
    # b$parent$debug_messages(TRUE)
    # b$view()
    
    
    b$Page$navigate("about:blank") |> invisible()
    
    p <- b$Runtime$evaluate(r"(
    // start with 'about:blank', load socket.io client
    const script = document.createElement('script');
    script.src = 'https://sports-eu-west-3.winamax.fr/uof-sports-server/socket.io/socket.io.js';
    document.head.appendChild(script);
    
    script.onload = () => {
      // promise resolvers: added when sending, called when the server's response arrives
      const req_promise_resolvers = new Map();
      const socket = io(
        'https://sports-eu-west-3.winamax.fr', 
        {
          path: '/uof-sports-server/socket.io/', 
          query: {language: 'FR', version: '3.3.0', embed: 'false'},
          transports: ['websocket']
        }
      );
      
      socket.on('connect', () => {
        console.log('socket.io connected!');
      });
    
      // handler for "m" messages, ignore responses without requestId
      socket.on('m', (message) => {
        if (!message.requestId) {
          // console.log('< bcast msg:', message);
        } else if (req_promise_resolvers.has(message.requestId)){
          // console.log('< resp  msg:', message);
          // call resolve() to fulfill the promise that was created with this specific requestId
          req_promise_resolvers.get(message.requestId)(message);
          req_promise_resolvers.delete(message.requestId);
        } else {
          // console.log('< unexpected msg:', message);
        }
      });
    
      // globally accessible sendReceive() method, returns a promise
      window.sendReceive = (message, requestId) => {
        return new Promise((resolve, reject) => {
          // store resolve method in Map
          req_promise_resolvers.set(requestId, resolve);
          message.requestId = requestId;
          socket.emit('m', message);
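          // on timeout, drop the stored resolver so the Map does not grow, then reject
          setTimeout(() => {req_promise_resolvers.delete(requestId); reject("timeout");}, 5000);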
        });
      };
    };                    
    )", wait_ = FALSE)$
      then(\(value){
        promise(function(resolve, reject) {
          later::later(function() resolve(TRUE), 4)
        })
      })
    b$wait_for(p)
    #> [1] TRUE
    # ^- just a crude 4-second delay to let the script load and the socket connect;
    # Sys.sleep() is avoided because it would block chromote's communication
    
    # wrapper to call sendReceive() through chromote session
    send_recieve <- function(sess, msg){
      glue('sendReceive({msg}, "{UUIDgenerate()}");') |> 
        sess$Runtime$evaluate(awaitPromise = TRUE, returnByValue = TRUE)
    }
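
Because `send_recieve()` pastes `msg` straight into JavaScript source, `msg` has to be a valid JavaScript object literal. A minimal sketch of an alternative that builds the call from an R list, assuming every value is a scalar string or logical; `to_js_object()` and `build_call()` are hypothetical helpers, not part of chromote or glue, and the request ID is a fixed placeholder instead of a UUID:

```r
# Hypothetical helper: turn a flat named list of scalars into a JS object literal.
# Only character and logical scalars are handled in this sketch.
to_js_object <- function(x) {
  stopifnot(is.list(x), !is.null(names(x)), all(lengths(x) == 1))
  vals <- vapply(x, function(v) {
    if (is.logical(v)) {
      tolower(as.character(v))            # TRUE -> true
    } else if (is.character(v)) {
      encodeString(v, quote = '"')        # quote and escape the string
    } else {
      stop("unsupported value type in this sketch")
    }
  }, character(1))
  keys <- encodeString(names(x), quote = '"')
  paste0("{", paste0(keys, ": ", vals, collapse = ", "), "}")
}

# Hypothetical helper: the JS expression that would be evaluated in the page
build_call <- function(msg, request_id) {
  sprintf('sendReceive(%s, "%s");', to_js_object(msg), request_id)
}

js <- build_call(list(route = "tournament:4", menu = TRUE), request_id = "req-1")
cat(js, "\n")
#> sendReceive({"route": "tournament:4", "menu": true}, "req-1");
```

The resulting string could be piped into `sess$Runtime$evaluate(awaitPromise = TRUE, returnByValue = TRUE)` in place of the `glue()` output; in real use `request_id` would come from `UUIDgenerate()` as in the wrapper above.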
    

    Ligue 1 McDonald's, tournament 4 ( https://www.winamax.fr/paris-sportifs/sports/1/7/4 ):

    ligue_1 <- 
      send_recieve(b, '{route: "tournament:4"}')$result$value
    
    lobstr::tree(ligue_1, max_length = 10)
    #> <list>
    #> ├─tournaments: <list>
    #> │ └─4: <list>
    #> │   └─matches: <list>
    #> │     ├─50955819
    #> │     ├─50955825
    #> │     ├─50955823
    #> │     ├─50955829
    #> │     ├─50955821
    #> │     ├─50955813
    #> ...
    # listviewer::jsonedit(ligue_1)
    
    odds_ <- 
      ligue_1$odds |> 
      tibble::enframe(name = "outId", value = "odds") |> 
      mutate(odds = unlist(odds), outId = strtoi(outId))
    
    out_ <- 
      ligue_1$outcomes |> 
      bind_rows(.id = "outId") |> 
      mutate(outId = strtoi(outId)) |> 
      select(outId, label, pctDist = percentDistribution)
    
    bets_ <-   
      ligue_1$bets |> 
      keep(\(bet) bet$template == "3way") |> 
      tibble(bets = _) |> 
      hoist(bets, "betId", outId = "outcomes") |> 
      unnest_longer(outId)
      
    ligue_1$matches |> 
      tibble(matches = _) |> 
      hoist(matches, "matchId", "title", "mainBetId", "sportId", "categoryId", "tournamentId", "matchStart") |> 
      arrange(matchStart) |> 
      left_join(bets_, by = join_by(mainBetId == betId)) |> 
      left_join(out_) |> 
      left_join(odds_) |> 
      select(title, label, pctDist, odds)
    #> Joining with `by = join_by(outId)`
    #> Joining with `by = join_by(outId)`
    
    #> # A tibble: 55 × 4
    #>    title                    label         pctDist  odds
    #>    <chr>                    <chr>           <int> <dbl>
    #>  1 Strasbourg - Lyon        Strasbourg         15  2.65
    #>  2 Strasbourg - Lyon        Match nul          44  3.65
    #>  3 Strasbourg - Lyon        Lyon               41  2.45
    #>  4 Reims - Marseille        Reims               2  5.3 
    #>  5 Reims - Marseille        Match nul           8  4.1 
    #>  6 Reims - Marseille        Marseille          90  1.6 
    #>  7 Saint-Étienne - Paris SG Saint-Étienne       1 12   
    #>  8 Saint-Étienne - Paris SG Match nul           1  8   
    #>  9 Saint-Étienne - Paris SG PSG                98  1.18
    #> 10 Monaco - Nice            Monaco             50  1.82
    #> # ℹ 45 more rows
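
The pipeline above works because the response is a set of ID-keyed lookups: each match points to a `mainBetId`, each bet lists its outcome IDs, and `odds` maps outcome IDs to prices. A stripped-down base R sketch of the same join logic follows. The miniature `resp` list only mimics the fields used above; its IDs, titles, and odds are invented, not real response data.

```r
# Made-up miniature of the response structure (IDs and odds are invented)
resp <- list(
  matches  = list("100" = list(matchId = 100L, title = "A - B", mainBetId = 10L)),
  bets     = list(
    "10" = list(betId = 10L, template = "3way", outcomes = c(1L, 2L, 3L)),
    "11" = list(betId = 11L, template = "other", outcomes = c(4L))
  ),
  outcomes = list("1" = list(label = "A"), "2" = list(label = "Draw"),
                  "3" = list(label = "B")),
  odds     = list("1" = 2.1, "2" = 3.4, "3" = 3.0)
)

# odds: named list -> data frame, list names become integer outcome IDs
odds <- data.frame(outId = strtoi(names(resp$odds)),
                   odds  = unlist(resp$odds, use.names = FALSE))

# outcomes: one row per outcome ID
outs <- data.frame(
  outId = strtoi(names(resp$outcomes)),
  label = vapply(resp$outcomes, function(o) o$label, character(1),
                 USE.NAMES = FALSE)
)

# bets: keep "3way" bets, one row per (bet, outcome) pair
three_way <- Filter(function(b) b$template == "3way", resp$bets)
bets <- do.call(rbind, lapply(three_way, function(b) {
  data.frame(betId = b$betId, outId = b$outcomes)
}))

# matches: one row per match
matches <- do.call(rbind, lapply(resp$matches, function(m) {
  data.frame(title = m$title, mainBetId = m$mainBetId)
}))

# match -> main bet -> outcomes -> odds
res <- merge(matches, bets, by.x = "mainBetId", by.y = "betId")
res <- merge(merge(res, outs, by = "outId"), odds, by = "outId")
res <- res[order(res$outId), c("outId", "title", "label", "odds")]
print(res)
```

Each `merge()` here plays the role of one `left_join()` in the dplyr version, except that `merge()` defaults to an inner join, which is enough for this toy data.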
    

    When menu:true is included in the request, the whole menu structure is delivered, including tournament IDs:

    
    ligue_1_menu <- 
      send_recieve(b, '{route: "tournament:4", menu:true}')$result$value 
    
    sports <- 
      ligue_1_menu$sports |>
      {\(sports) tibble(sports) |> mutate(sportId = names(sports) |> strtoi())}() |> 
      unnest_wider(sports) |> 
      select(sportName, sportId, categoryId = categories) |> 
      unnest_longer(categoryId)
    
    tournaments <- 
      ligue_1_menu$tournaments |> 
      {\(trn) tibble(trn) |> mutate(tournamentId = names(trn) |> strtoi())}() |> 
      hoist(trn, "tournamentName", "mainMatchCount", "liveMatchCount", "tvMatchCount")
    
    matches_ <- 
      ligue_1_menu$matches |> 
      tibble(matches = _) |> 
      hoist(matches, "matchId", "title", "mainBetId", "sportId", "categoryId", "tournamentId") |> 
      left_join(tournaments) |> 
      left_join(sports)
    #> Joining with `by = join_by(tournamentId)`
    #> Joining with `by = join_by(sportId, categoryId)`
    
    ligue_1_menu$categories |> 
      {\(cat) tibble(cat) |> mutate(categoryId = names(cat) |> strtoi())}() |> 
      unnest_wider(cat) |> 
      left_join(sports) |> 
      select(sportName, categoryName, tournamentId = tournaments) |> 
      unnest_longer(tournamentId) |> 
      left_join(tournaments) |> 
      arrange(sportName, categoryName, tournamentName) |> 
      filter(sportName == "Football")
    #> Joining with `by = join_by(categoryId)`
    #> Joining with `by = join_by(tournamentId)`
    
    #> # A tibble: 67 × 8
    #>    sportName categoryName    tournamentId tournamentName    mainMatchCount
    #>    <chr>     <chr>                  <int> <chr>                      <int>
    #>  1 Football  Allemagne                 41 2. Bundesliga                 10
    #>  2 Football  Allemagne                 42 Bundesliga                    19
    #>  3 Football  Allemagne                 43 Coupe d'Allemagne              3
    #>  4 Football  Angleterre                 2 Championship                  12
    #>  5 Football  Angleterre                16 FA Cup                         5
    #>  6 Football  Angleterre                 1 Premier League                21
    #>  7 Football  Arabie Saoudite         3708 Saudi Pro League               1
    #>  8 Football  Argentine             162285 Primera División              32
    #>  9 Football  Australie                144 A-League                      13
    #> 10 Football  Autriche                  29 Bundesliga                     7
    #> # ℹ 57 more rows
    #> # ℹ 3 more variables: liveMatchCount <int>, tvMatchCount <int>,
    #> #   trn <named list>
    
    tictoc::toc()
    #> 6.1 sec elapsed
    

    Created on 2025-03-21 with reprex v2.1.1