r json rvest jsonlite stringi

Webscrape script variable and convert string into JSON in R


I scrape information with rvest and store it in a data frame. All the information on the various institutions and their characteristics is stored in a single string. It looks similar to JSON, but as scraped it isn't valid JSON. I followed another Stack Overflow post without success. I think string manipulation should do the job. In the end, "title", "street", "number", etc. should be variables (columns), and each institution should be a row. Thank you very much.

library('tidyverse')
library('rvest')
library('stringr')
library('stringi')
library('jsonlite')

rubyhash <- "https://www.blutspenden.de/blutspendedienste/#" %>%
  read_html() %>% 
  html_nodes("body") %>% 
  html_nodes("script:first-of-type") %>%  
  html_text() %>% 
  as_tibble() %>% 
  slice(1)

substr(rubyhash$value,1,150)
"\n        var instituionsmap_data = '[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\""

rubyhash$json <- str_replace(rubyhash$value, "var instituionsmap_data =", "")
rubyhash$json <- trimws(rubyhash$json)

substr(rubyhash$json,1,150)
"'[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\"Heidelberg\",\"phone\":\"06221 89466960"

fromJSON(rubyhash$json)

Solution

  • The string you are trying to parse is a JavaScript string literal holding an array of JSON objects, each one the equivalent of a data frame row. As well as removing the JavaScript variable assignment at the start, you can split the array up into its component JSON strings and parse each one:

    rubyhash$value %>%
      str_replace("var instituionsmap_data = '\\[\\{", "") %>%
      str_replace("\\}\\]';\n", '') %>% # Removes the javascript chars at the end
      strsplit('\\},\\{') %>% # Split into component json strings
      getElement(1) %>%
      sapply(function(x) paste0('{', x, '}'), USE.NAMES = FALSE) %>%
      lapply(function(x) as.data.frame(fromJSON(x))) %>%
      bind_rows() %>%
      as_tibble()
    #> # A tibble: 195 x 14
    #>    title street number zip   city  phone fax   email~1 email url   rekon~2   uid
    #>    <chr> <chr>  <chr>  <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <int> <int>
    #>  1 Plas~ Hans-~ "2A"   69115 Heid~ 0622~ ""    "info(~ java~ http~      48   567
    #>  2 Plas~ Kamps~ "88 -~ 44137 Dort~ 0231~ ""    "info-~ java~ http~      16   568
    #>  3 Plas~ Roteb~ "25"   70178 Stut~ 0711~ ""    "stutt~ java~ http~      16   571
    #>  4 Plas~ K1 2   ""     68159 Mann~ 6211~ ""    ""      java~ http~     112   575
    #>  5 DRK-~ Fried~ ""     68167 Mann~ 0621~ ""    ""      java~ www.~      49   359
    #>  6 DRK-~ Gunze~ "35"   76530 Bade~ 0722~ ""    ""      java~ www.~      33   387
    #>  7 DRK ~ Helmh~ ""     89081 Ulm   0731~ ""    ""      java~ www.~      49   389
    #>  8 Blut~ Im Ne~ "305"  69120 Heid~ 0622~ ""    ""      java~ http~      49   400
    #>  9 Blut~ Otfri~ ""     72076 Tübi~ 0707~ ""    "bluts~ java~ www.~      49   402
    #> 10 Blut~ Diako~ ""     74523 Schw~ 0791~ ""    ""      java~ www.~      32   403
    #> # ... with 185 more rows, 2 more variables: lat <chr>, lon <chr>, and
    #> #   abbreviated variable names 1: email_display, 2: rekonvaleszentenplasma
    

    Created on 2022-09-01 with reprex v2.0.2
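
As a possible shortcut, `jsonlite::fromJSON()` can usually turn a JSON array of objects into a data frame directly, so it may be enough to cut out the `[...]` part and parse it in one call. The sketch below is only an illustration under assumptions: `script_text` is a hypothetical stand-in for `rubyhash$value`, its second record is made up, and it assumes the array contains no JavaScript-level escaping that differs from JSON.

```r
library(jsonlite)

# Hypothetical stand-in for rubyhash$value; the second record is invented
script_text <- paste0(
  "\n        var instituionsmap_data = '",
  "[{\"title\":\"Plasmazentrum Heidelberg\",\"number\":\"2A\",\"zip\":\"69115\",\"uid\":567},",
  "{\"title\":\"Example Centre\",\"number\":\"\",\"zip\":\"00000\",\"uid\":1}]",
  "';\n"
)

# Greedy match: keep everything from the first '[' to the last ']',
# which drops the variable assignment, the quotes and the trailing ';'
json_array <- regmatches(script_text, regexpr("\\[.*\\]", script_text))

# An array of objects with the same keys is simplified to a data frame
institutions <- fromJSON(json_array)
str(institutions)
```

If the real script contains other square brackets before or after the array, the pattern would need to be anchored more tightly, for example on the `var instituionsmap_data = '` prefix.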