A website 'https://www.environnement.gouv.qc.ca/eau/potable/distribution/resultats.asp', stores the data in 3 different tables : 1. region, 2. mrc and 3. reseau. Essentially, I'm trying to extract the data in this hierarchichal order : region/mrc/reseau
. So a region could have multiple mrc and eahc would have a different table. In the end, I want only one table with seprate columns for id="region", id="mrc" and id="reseau".
I'm thinking that it could be either a for loop (whenever you find a region, make a dataframe with the information until you reach another region). But can't find how to loop for all regions. Another idea would be to group_by region and mrc and bind the rows of each groups to make the giant table.
library(rvest)
url = 'https://www.environnement.gouv.qc.ca/eau/potable/distribution/resultats.asp'
page <- read_html(url)
page |>
html_table()
# Extract the tables by ID
region_table <- page |> html_elements("#region") |> html_table(fill = TRUE)
mrc_table <- page |> html_elements("#mrc") |> html_table(fill = TRUE)
reseau_table <- page |> html_elements("#reseau") |> html_table(fill = TRUE)
Here everything is extracted, but not respecting the hierarchy.
I've been trying this but not able to make the hierarchy...
# Extract info
regions <- page %>% html_nodes(xpath = "//table[@id='region']//tr/td/b")
mrc<- page %>% html_nodes(xpath = "//table[@id='mrc']//tr/td/b")
reseau<- page %>% html_nodes(xpath = "//table[@id='reseau']//tr/td")
Or this
region_name <- regions[1] %>% html_text(trim = TRUE)
# Get parent node ??
region_node <- regions[1] %>% html_node(xpath = "./ancestor::table/following-sibling::*")
The logic and code to scrape every webpage is going to be somewhat different and there's no guarantee the HTML structure won't change in the future. In order to scrape a page, you're going to have to spend some time view the HTML of the page itself. If you do that you can see that there are tables for the regions and mrc values as well and that each of those tables was given an ID of "region" or "mrc" (which is technically invalid HTML because IDs should be unique).
<div style="text-align:center; padding:0 0 0 15px;">
<table id="region" border="0" cellpadding="2" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="98%">
<tr>
<td><b><font size="3">Région administrative : Abitibi-Témiscamingue</font></b></td>
</table>
</div>
<div align="center">
<table id="mrc" border="0" cellpadding="2" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="98%">
<tr>
<td><b>MRC : Abitibi</b></td>
</table>
</div>
<div align="center">
<table id="reseau" border="1" cellpadding="2" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="98%">
<!-- Proper tabular data -->
</table>
</div>
We can loop through all the tables, see if the contents are for region, mrc, or reseau, keep track of the current region/mrc values and combine them with the reseau data and then finally merge everything together.
library(rvest)
url = 'https://www.environnement.gouv.qc.ca/eau/potable/distribution/resultats.asp'
page <- read_html(url)
tables <- page |> html_elements("table")
group <- list(region=NA, mrc=NA)
label <- function(x) html_text(table) |> trimws() |> gsub("^[^:]+: ", "", x=_)
annotated <- list()
for (table in tables) {
type <- html_attr(table, "id", "-")
if (type == "region") {
group$region = label(table)
} else if (type == "mrc") {
group$mrc = label(table)
} else if (type == "reseau") {
annotated[[length(annotated)+1]] <- cbind(group, html_table(table, header=TRUE))
}
}
dd <- do.call("rbind", annotated)
It also seems that there are encoding issues on the page so everything is not quite UTF-8 but I'm not sure exactly what the mixture is.