r web-scraping leaflet openstreetmap rselenium

Scrape leaflet map coordinates from a dynamic website


I am trying to scrape the marker coordinates from a website containing a Leaflet map (OSM data). I have been trawling the web for answers, and it appears that a simple query of the parsed HTML will not be sufficient because the page is rendered dynamically. Hence I've been using RSelenium. After looking into the HTML and playing with ChatGPT, I've gotten this far:

library(RSelenium)
library(rvest)
library(xml2)

remote_driver <- rsDriver(browser = "firefox",
                          chromever = NULL,
                          verbose = FALSE,
                          port = 4445L)
remDr <- remote_driver$client

remDr$navigate("https://www.hejfish.com/d/1356-strobl-wasser-fliegenstrecke-traun")

# Scroll to the end of the page to trigger marker loading
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")

map_markers <- remDr$findElements(using = "css", value = ".MapIcons__marker_icon___vDTMo")

ChatGPT advises me to extract the coordinates like this:

coordinates <- lapply(map_markers, function(marker) {
  lat <- marker$getElementAttribute("data-lat")$value
  lon <- marker$getElementAttribute("data-lon")$value
  c(lat = as.numeric(lat), lon = as.numeric(lon))
})

Unfortunately this doesn't work: Error: attempt to apply non-function. I assume there is no attribute called "data-lat" or "data-lon" on the extracted elements. However, from inspecting the website's HTML I can't find anything that looks even vaguely like coordinates within the markers' code. Inspired by this post, I also checked the network tab and was able to find coordinates for the bounding box, but not for the two markers. Other posts talk about information hidden within script tags, etc., but this is beyond my abilities.

Any help on scraping the coordinates will be much appreciated!


Solution

  • Here's one way to approach this. Note that it involves an API key, which is truncated in this example, so you'd need to extract it yourself.

    I'd start by coming up with a search string to use in the Network tab of the browser's dev tools. The Leaflet marker popups include Google Maps links, which are handy for this:
    [Screenshot: dev tools, coordinates in the Google Maps link]
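
    If you'd rather pull the numbers out of such a link in R than read them off by eye, a minimal sketch follows. The URL below is a made-up example in the documented Google Maps `query=lat,lng` style; the actual href on the page may be shaped differently, so treat both the link and the pattern as assumptions.

    ```r
    # Hypothetical popup link; the real href on the page may look different.
    gmaps_link <- "https://www.google.com/maps/search/?api=1&query=48.235585,14.29808"

    # Take the first "decimal,decimal" pair found in the string.
    pattern <- "-?\\d+\\.\\d+,-?\\d+\\.\\d+"
    pair <- regmatches(gmaps_link, regexpr(pattern, gmaps_link, perl = TRUE))

    # Split the pair and convert to a named numeric vector.
    coords <- as.numeric(strsplit(pair, ",", fixed = TRUE)[[1]])
    names(coords) <- c("lat", "lon")
    coords
    ```

    Either number, typed exactly as it appears in the link, can then serve as the search string in the Network tab.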

    The link itself is most likely generated by JavaScript, but the coordinate values are probably transferred as-is. Being too specific here might not work, so let's just search for the latitude, 48.235585. Also make sure all requests are captured, i.e. refresh the page if needed. This leads us to the api.hejfish.com/fisher/areas/1356/map API endpoint:
    [Screenshot: dev tools, Network tab search]

    We could try to open that URL directly, but even in the same browser session we are greeted by error 403 - Zugriff verweigert (access denied). The request headers show several session cookies and extra headers, most notably X-Api-Key. To check whether we can replicate the request ourselves, let's copy it as a cURL command:
    [Screenshot: dev tools, Copy as cURL (bash)]

    We could run it from the command line or convert it to R or Python through https://curlconverter.com/r/ or similar tools, but let's see whether httr2::curl_translate() can handle it:

    library(httr2)
    
    # read copied cURL command from clipboard, ends up as a vector of lines;
    # paste it back into a single string and feed to curl_translate():
    clipr::read_clip() |>
      paste0(collapse = "\n") |> 
      curl_translate()
    #> request("https://api.hejfish.com/fisher/areas/1356/map") %>% 
    #>   req_headers(
    #>     authority = "api.hejfish.com",
    #>     accept = "application/json",
    #>     `accept-language` = "en-GB,en;q=0.9,et-EE;q=0.8,et;q=0.7,en-US;q=0.6",
    #>     origin = "https://www.hejfish.com",
    #>     referer = "https://www.hejfish.com/",
    #>     `sec-ch-ua` = "\"Not_A Brand\";v=8\", \"Chromium\";v=\"120\", \"Google Chrome\";v=\"120",
    #>     `sec-ch-ua-mobile` = "?0",
    #>     `sec-ch-ua-platform` = "\"Windows\"",
    #>     `sec-fetch-dest` = "empty",
    #>     `sec-fetch-mode` = "cors",
    #>     `sec-fetch-site` = "same-site",
    #>     `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    #>     `x-api-key` = "4O...",
    #>     `x-locale` = "de_DE",
    #>     `x-requested-with` = "XMLHttpRequest",
    #>   ) %>% 
    #>   req_perform()
    

    Looks solid! And with a complete API key, it actually completes with status 200 OK. We could leave it like this or trim it down a bit. For handling the JSON response we can use httr2::resp_body_json():

    library(httr2)
    map_ <- 
      request("https://api.hejfish.com/fisher/areas/1356/map") |>
      req_headers(`x-api-key` = "4O...") |>
      req_perform() |>
      resp_body_json(simplifyVector = TRUE)
    
    map_$data$locations
    #>        lat      lng
    #> 1 48.23559 14.29808
    #> 2 48.24744 14.32493
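
    To reuse this for other pages, the request can be wrapped in a small function. This is only a sketch under assumptions: `hejfish_map_request()` and the `HEJFISH_API_KEY` environment variable are made-up names, and only area 1356 is confirmed to use this URL pattern; other ids may behave differently.

    ```r
    library(httr2)

    # Hypothetical helper (not from the original answer): builds the request
    # for one area id without sending it.
    # Assumptions: the key is stored in an environment variable called
    # HEJFISH_API_KEY (made-up name) and other area ids follow the same
    # URL pattern as 1356.
    hejfish_map_request <- function(area_id,
                                    api_key = Sys.getenv("HEJFISH_API_KEY")) {
      if (!nzchar(api_key)) {
        stop("No API key supplied; copy the X-Api-Key value from dev tools.")
      }
      request(paste0("https://api.hejfish.com/fisher/areas/", area_id, "/map")) |>
        req_headers(`x-api-key` = api_key, accept = "application/json")
    }

    # Building the request needs no network access, so the URL can be
    # checked before anything is sent.
    req <- hejfish_map_request(1356, api_key = "4O...")
    req$url
    ```

    With a complete key, piping `req` into `req_perform()` and `resp_body_json(simplifyVector = TRUE)` should give the same structure as above. Keeping the key in an environment variable rather than in the script avoids pasting it into shared code.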