r json httr

Convert a scraped JSON-like string to a data frame in R


I am trying to pull team names and odds from this webpage: https://www.winamax.fr/paris-sportifs/sports/1/48/162285

I have found that the data is dynamically rendered via JavaScript.

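library(httr)      # GET(), add_headers(), accept(), content()
library(rvest)     # read_html(), html_element(), html_nodes(), html_text()
library(dplyr)     # %>%
library(stringr)   # str_replace(), used further down
library(jsonlite)  # fromJSON(), used further down

url <- "https://www.winamax.fr/paris-sportifs/sports/1/48/162285"
age <- "max-age=0"

# Request the page with browser-like headers
resp <- GET(url,
            add_headers("User-Agent" = "Mozilla/5.0",
                        "cache-control" = age,
                        "sec-ch-ua-platform" = "Windows"),
            accept("text/html"))

# Extract the text of the <script> element that holds the preloaded state
get2json <- content(resp, as = "text")
ext <- read_html(get2json) %>%
  html_element("body") %>%
  html_nodes("#page-content > script:nth-child(3)") %>%
  html_text()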

The extracted script text contains the team names (competitor1Name and competitor2Name). The odds are in the "odds" part at the end.

However, I'm not sure of the exact format of this string, or how to extract the values and put them into a data frame. I have tried several things, such as the code below, but the parsing always fails.

t <- ext %>% 
  str_replace("var PRELOADED_STATE = \\{", "[{") %>%
  # str_replace("\\}\\]", '') %>% 
  str_replace("\\;", '') %>%
  lapply(function(x) as.data.frame(fromJSON(x)))

Solution

  • It seems you can clean ext simply by removing the leading "var PRELOADED_STATE = " and the trailing ";" with sub:

    cleaned_json <- sub("var PRELOADED_STATE = ", "", ext)
    cleaned_json <- sub(";$", "", cleaned_json)
    

    Then parse the cleaned string:

    parsed_data <- fromJSON(cleaned_json)
    

    Then traverse the parsed list to build one data frame row per outcome:

    oc_df <- lapply(parsed_data$outcomes, function(el) {
      result <- list(
        betid = el$betId,
        label = el$label,
        code = el$code
      )
      # Remove NULL elements
      result <- result[!sapply(result, is.null)]
      
      # Convert to data frame
      as.data.frame(result, stringsAsFactors = FALSE)
    })
    
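    # Combine the per-outcome data frames into a single data frame
    oc_df <- do.call(rbind, oc_df) 
    row.names(oc_df) <- NULL
    # Add odds (assumes parsed_data$odds is in the same order as parsed_data$outcomes)
    oc_df$odds <- unlist(parsed_data$odds)
    
    # Keep codes containing 1, x or 2 and drop codes containing "sr" (filter() is from dplyr)
    oc_df <- oc_df %>% filter(grepl("1|x|2", code) & !grepl("sr", code))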
    

    giving

    betid      label            code   odds
    393595273  Molde            1      1.30
    393595273  Unentschieden    x      5.60
    393595273  Shamrock Rovers  2     10.00
    393595281  FC Kopenhagen    1      1.86
    393595281  Unentschieden    x      3.55
    393595281  Heidenheim       2      4.50

    . . .
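
The steps above attach the odds by position, which only works if parsed_data$odds and parsed_data$outcomes come in the same order. Here is a minimal sketch of a variant that looks up each odd by its outcome id instead. It runs on a tiny made-up string rather than the live page. The assumption that "outcomes" and "odds" are both keyed by the same ids is mine and is not confirmed against the real payload. The jsonlite package and the outcome_id column are also choices made for this illustration.

```r
library(jsonlite)

# Hypothetical miniature of the script text; the real payload is much larger
# and its exact layout is an assumption here.
ext <- 'var PRELOADED_STATE = {"outcomes":{"11":{"betId":1,"label":"Home","code":"1"},"12":{"betId":1,"label":"Draw","code":"x"},"13":{"betId":1,"label":"Away","code":"2"}},"odds":{"13":10,"11":1.3,"12":5.6}};'

# Strip the JavaScript assignment and the trailing semicolon
cleaned_json <- sub("^\\s*var PRELOADED_STATE = ", "", ext)
cleaned_json <- sub(";\\s*$", "", cleaned_json)

# simplifyVector = FALSE keeps everything as plain nested lists
parsed_data <- fromJSON(cleaned_json, simplifyVector = FALSE)

# One row per outcome; the list names are taken to be the outcome ids
oc_df <- do.call(rbind, lapply(names(parsed_data$outcomes), function(id) {
  el <- parsed_data$outcomes[[id]]
  data.frame(outcome_id = id, betid = el$betId, label = el$label,
             code = el$code, stringsAsFactors = FALSE)
}))

# Look up each odd by outcome id instead of relying on element order
oc_df$odds <- vapply(oc_df$outcome_id, function(id) {
  as.numeric(parsed_data$odds[[id]])
}, numeric(1), USE.NAMES = FALSE)

print(oc_df)
```

In the toy string the odds are deliberately listed in a different order from the outcomes, so a positional unlist() would pair them incorrectly while the lookup by id does not. If an outcome had no matching entry in odds, the vapply() call would stop with an error rather than silently misalign rows.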