r web-scraping rvest

How to scrape a large table from a PHP website using R


I am trying to scrape the table from 'https://www.metabolomicsworkbench.org/data/mb_structure_ajax.php'.

The code I found online (using rvest) did not work:

library(rvest)
url <- "https://www.metabolomicsworkbench.org/data/mb_structure_ajax.php"
A <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="containerx"]/div[1]/table') %>%
  html_table()

A is 'list of 0'

How should I fix this code, or is there a better way to do it?

Thanks in advance.


Solution

  • The table is not in the page's static HTML; it is loaded by JavaScript after the page loads, which is why read_html() finds no table to parse. Here is what you can do:

    1. Open the browser's developer tools and go to the Network tab.
    2. Click on one of the page links and watch the requests (I clicked on page 4). The page sends a POST request to https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php and receives the table content in the response. The request parameters are listed with the request in the Network tab; the code below sends the page number (page).
    3. Mimic the POST request with rvest. Here is the code to scrape all pages:
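    library(rvest)
    
    # Note: written against rvest 0.3.x. html_session() is deprecated in favour of
    # session() as of rvest 1.0.0, and request_POST() is an unexported internal helper
    # (hence the :::), so it may not be available in other rvest versions.
    url <- "https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php"
    pg <- html_session(url)
    
    # 1:4288 covers all pages. Try a small range first, or scrape in several
    # batches and combine the data frames later, in case something fails midway.
    data <- 
      purrr::map_dfr(
        1:4288,
        function(i) {
          # POST the page number, then parse the table from the response
          pg <- rvest:::request_POST(pg,
                                     url,
                                     body = list(
                                       page = i
                                     ))
          read_html(pg) %>%
            html_node("table") %>%
            html_table() 
        }
      )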
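
    If rvest:::request_POST is not available in your version of rvest, the same POST request can probably be sent with httr instead. The following is a minimal sketch, not a tested scraper for this site: it assumes the endpoint accepts a form-encoded page field (as in the code above), and the helper names parse_table, fetch_page and fetch_pages, as well as the 0.5 second pause, are made up for illustration. It also scrapes in batches, so that one failed page does not abort the whole run.

```r
library(httr)
library(rvest)

url <- "https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php"

# Parse the first <table> found in an HTML string into a data frame.
parse_table <- function(html_text) {
  read_html(html_text) %>%
    html_node("table") %>%
    html_table()
}

# Fetch one page by POSTing the form field `page`
# (assumption: this is the only field the endpoint needs).
fetch_page <- function(i) {
  resp <- POST(url, body = list(page = as.character(i)), encode = "form")
  stop_for_status(resp)  # raise an error on HTTP 4xx/5xx
  parse_table(content(resp, as = "text", encoding = "UTF-8"))
}

# Fetch a range of pages. A page that errors yields NULL, which
# map_dfr() drops when it row-binds the results.
fetch_pages <- function(pages) {
  safe_fetch <- purrr::possibly(fetch_page, otherwise = NULL)
  purrr::map_dfr(pages, function(i) {
    Sys.sleep(0.5)  # arbitrary pause between requests
    safe_fetch(i)
  })
}

# Usage (not run here): scrape a small batch first, then continue
# with further batches and combine them.
# batch1 <- fetch_pages(1:5)
# batch2 <- fetch_pages(6:10)
# data   <- dplyr::bind_rows(batch1, batch2)
```

    Because failed pages are silently skipped, compare the number of rows you get with what you expect for each batch. If html_table() guesses different column types on different pages, the row-binding step may fail; converting every column to character before binding would avoid that.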