r, web-scraping, leaflet, maps, shapefile

How to extract data from leaflet-generated pages


Is it possible to scrape the polygon data from this interactive map in R?

🔗 https://fogocruzado.org.br/mapadosgruposarmados

The map shows territories controlled by armed groups across different years. I'd like to automate the extraction of this data for all years and save each as multiple shapefiles.

They mention an API, but this specific data doesn’t seem to be available through it.

I tried the following, but it gives me coordinates that make no sense.

# Load the required packages
library(rvest)
library(jsonlite)  # for toJSON below

# Load the HTML file
html_file <- read_html(html_path)

# Extract the "d" attribute from each interactive SVG path
path_data <- html_file %>%
  html_nodes("path.leaflet-interactive") %>%
  html_attr("d")

# Function to clean and extract coordinates from path data
clean_coordinates <- function(path) {
  # Remove "M", "L", "z" commands (move, line, close)
  path <- gsub("[MLz]", "", path)  # Remove M, L, and z commands
  path <- gsub("[^0-9,.-]", " ", path)  # Keep only digits, commas, periods and minus signs
  coords <- strsplit(path, " ")  # Split by spaces to separate coordinates
  coords <- as.numeric(unlist(coords))  # Flatten list and convert to numeric
  
  # Ensure coordinates are in pairs (lat, lon)
  coords_matrix <- matrix(coords, ncol = 2, byrow = TRUE)
  
  # Ensure the coordinates are in [lon, lat] order for GeoJSON
  coords_matrix <- coords_matrix[, c(2, 1)]  # Switch lat, lon order
  
  # Optionally, adjust the coordinates if they are too far off (e.g., by shifting or scaling)
  # coords_matrix <- coords_matrix - min(coords_matrix, na.rm = TRUE)  # Adjust if needed
  
  return(coords_matrix)
}

# Apply the function to clean the coordinates
coordinates_list <- lapply(path_data, clean_coordinates)

# Create GeoJSON features
geojson <- list(
  type = "FeatureCollection",
  features = lapply(coordinates_list, function(coord) {
    list(
      type = "Feature",
      geometry = list(
        type = "Polygon",  # or "LineString" based on the data
        coordinates = list(coord)  # Directly use the coordinates list
      )
    )
  })
)

# Convert the result to JSON
geojson_json <- toJSON(geojson, pretty = TRUE)
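A note on why the numbers make no sense: the `d` attribute of a Leaflet SVG path holds layer *pixel* coordinates, not latitude/longitude. Inverting them would require the map's zoom level and pixel origin, neither of which is recoverable from the static HTML. A minimal sketch of the inverse Web Mercator projection (`unproject` is a hypothetical helper, not part of any package):

```r
# Inverse Web Mercator: turn pixel coordinates back into lat/lng.
# zoom and the pixel origin are internal Leaflet state and NOT available
# in the SVG alone - this only illustrates why the raw numbers look wrong.
unproject <- function(x, y, zoom, origin_x = 0, origin_y = 0) {
  scale <- 256 * 2^zoom                       # world size in pixels at this zoom
  lng <- ((x + origin_x) / scale - 0.5) * 360
  n <- pi * (1 - 2 * (y + origin_y) / scale)
  lat <- 180 / pi * atan(sinh(n))
  data.frame(lat = lat, lng = lng)
}

# the centre of the zoom-0 world tile maps back to lat/lng (0, 0)
unproject(128, 128, zoom = 0)
```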



Solution

  • Ahhh I finally got it :)

    1. use selenider to open your URL
    2. use execute_js_expr to run JavaScript in the page:
    3. search for the Leaflet container: document.querySelectorAll('.leaflet-container')
    4. find the obscurely hidden getMap function to retrieve the map object: map = mapContainers[0].htmlwidget_data_init_result.getMap()
    5. filter for layer.options.group == "2023" - there are multiple layers, and not all of them matter here. Somehow I could only retrieve the 2023 group; you might need to instruct selenider to click on the other layer toggles to retrieve the data for the other years
    6. push the coordinates, label and color of each layer into a polygons list
    let polygon = {
                    coordinates: layer.getLatLngs(),
                    fillColor: layer.options.fillColor,
                    label: layer.options.label || 'Unknown'
                };
    polygons.push(polygon);
    
    7. we can actually return JavaScript objects to R here, but we have to be careful to JSON-stringify them first: return JSON.stringify(polygons);
    8. finally, use fromJSON to parse the polygons into a data frame
    9. optionally plot the results in our own leaflet map
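For the other years (step 5), one untested idea is to have selenider click the year toggles in the Leaflet layers control before re-running the extraction. The CSS selectors below, and the assumption that each year is an entry in the layers control, are guesses about this particular page:

```r
# Untested sketch - selectors are assumptions about this page
library(selenider)

year_toggles <- ss(".leaflet-control-layers-selector")  # one input per year (assumed)
elem_click(year_toggles[[2]])                           # switch to another year
Sys.sleep(2)                                            # give the layers time to load
# ...then re-run the execute_js_expr() call with the matching
# layer.options.group value
```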

    Code

    library(selenider)
    library(jsonlite)
    
    session <- selenider_session("selenium", browser = "chrome")
    open_url("https://fogocruzado.org.br/mapadosgruposarmados")
    
    
    map_analysis <- execute_js_expr("
      
      // retrieve map object
      let mapContainers = document.querySelectorAll('.leaflet-container');  
      let map = mapContainers[0].htmlwidget_data_init_result.getMap();
      
      console.log(map._layers); // look at the selenider-console
      
      let layers = map._layers; // just for debugging and inspecting layers
      
      // collect polygons
      let polygons = [];
      for (let layerId in layers) {
        let layer = layers[layerId];
        
        // Filter for group 2023 - the other years' layers don't seem to be loaded initially
        if (layer.options && layer.options.group == '2023') {
            
            // for polygon layers (L.Polygon or L.Rectangle)
            if (layer.getLatLngs) {
                let polygon = {
                    coordinates: layer.getLatLngs(),
                    fillColor: layer.options.fillColor,
                    label: layer.options.label || 'Unknown',
                    popup: layer.options.popup // get popup data too!
                };
                polygons.push(polygon);
            }
        }
      }
    
      //console.log(polygons);
      return JSON.stringify(polygons);
    ")
    
    polygons_data <- jsonlite::fromJSON(map_analysis)
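To get from here to the shapefiles asked for in the question, a sketch using sf. It assumes, as the plotting loop does, that each coordinates entry is a list whose first element is a data frame with lat and lng columns; polygons_to_sf is my own helper name:

```r
library(sf)

# Build an sf object from the scraped polygons, ready for st_write().
polygons_to_sf <- function(polygons_data) {
  polys <- lapply(polygons_data$coordinates, function(coord_entry) {
    coords_df <- coord_entry[[1]]
    # close the ring by repeating the first vertex; x = lng, y = lat
    ring <- as.matrix(coords_df[c(seq_len(nrow(coords_df)), 1), c("lng", "lat")])
    st_polygon(list(ring))
  })
  st_sf(label = polygons_data$label, geometry = st_sfc(polys, crs = 4326))
}

# polygons_sf <- polygons_to_sf(polygons_data)
# st_write(polygons_sf, "grupos_armados_2023.shp")
```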
    
    
    # now build our own leaflet map using the data
    
    library(leaflet)
    
    m <- leaflet() %>%
      addTiles()
    
    for (i in seq_len(nrow(polygons_data))) {
      
      coord_entry <- polygons_data$coordinates[[i]]
      
      # some coordinate entries come back NULL - skip those
      if (!is.null(coord_entry) && length(coord_entry) > 0 && is.data.frame(coord_entry[[1]])) {
        
        coords_df <- coord_entry[[1]]
        
        # Check if the data frame has both lat and lng and is not empty
        if (nrow(coords_df) > 2 && all(c("lat", "lng") %in% colnames(coords_df))) {
          
          m <- m %>% addPolygons(
            lng = coords_df$lng,
            lat = coords_df$lat,
            fillColor = polygons_data$fillColor[i],
            color = "#000000",
            weight = 1,
            opacity = 1,
            fillOpacity = 0.5,
            popup = polygons_data$popup[i]
          )
        }
      }
    }
    
    m
    

    Results

    [screenshot of the reconstructed leaflet map]

    Notes