Tags: html, r, web-scraping, html-table, rvest

Scraping HTML table using R, want to retain URL


I am currently using rvest to scrape the two HTML tables from the website https://www.genome.jp/kegg/tables/br08606.html#5. Specifically, I want the second table (the one with the categories Bacteria and Archaea). Cells in the table link to other websites, particularly in the Category and Source columns and in some of the columns under Organism. I want to retain all of these links when scraping the table, but I am not sure how to accomplish that. Any guidance would be appreciated. This is what I have tried so far after scouring the internet:

library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)

item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")

tables <- item %>% html_table(fill = TRUE)

Bacteria_table <- tables[[2]]
Bacteria_table <- Bacteria_table %>% clean_names()

source <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes(xpath = ".//td/a") %>%  # leading "." keeps the search inside the second table
  html_attr("href")

print(source)

Bacteria_table_links <- data.frame(Bacteria_table)
Bacteria_table_links$source_links <- source

Solution

  • You can do this by collecting the links row by row and adding them to the table as new columns:

    library(rvest)
    library(dplyr)
    library(tidyverse)
    library(janitor)
    
    item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")
    # Extract the table content
    tables <- item %>% html_table(fill = TRUE)
    Bacteria_table <- tables[[2]] %>% clean_names()
    
    # Extract all rows from the table and their links
    table_rows <- item %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_nodes("tr")  # Each row in the table
    
    # Extract links for each row
    links_list <- table_rows %>%
      map(~ .x %>%
            html_nodes("td a") %>%      # Get <a> tags within the row
            html_attr("href") %>%       # Extract href attributes
            paste(collapse = "; ")      # Combine multiple links with ";"
      )
    
    # Add extracted links to the table
    # Remove the header row from the links_list to align with the data
    links_list <- links_list[-1]  # Assuming the first row is the header
    
    
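    # Split each row's link string and count from the end of the row:
    # last link = source, the two before it = organism links.
    # Rows with fewer than three links get NA in the missing positions.
    # mutate() requires links_list to have one entry per table row.
    Bacteria_table <- Bacteria_table %>%
      mutate(
        source = map_chr(links_list, ~ strsplit(.x, ";")[[1]] %>% rev() %>% .[1] %>% trimws()),
        org_link_2 = map_chr(links_list, ~ strsplit(.x, ";")[[1]] %>% rev() %>% .[2] %>% trimws()),
        org_link_1 = map_chr(links_list, ~ strsplit(.x, ";")[[1]] %>% rev() %>% .[3] %>% trimws())
      )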
    

    This results in the following (first six rows shown):

    # A tibble: 6 × 10
      category category_2     category_3  organisms organisms_2 organisms_3                   year source                                    org_link_2 org_link_1
      <chr>    <chr>          <chr>       <chr>     <chr>       <chr>                        <int> <chr>                                     <chr>      <chr>     
    1 Bacteria Enterobacteria Escherichia eco       KGB         Escherichia coli K-12 MG1655  1997 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
    2 Bacteria Enterobacteria Escherichia ecj       KGB         Escherichia coli K-12 W3110   2001 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
    3 Bacteria Enterobacteria Escherichia ecd       KGB         Escherichia coli K-12 DH10B   2008 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
    4 Bacteria Enterobacteria Escherichia ebw       KGB         Escherichia coli K-12 BW2952  2009 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
    5 Bacteria Enterobacteria Escherichia ecok      KGB         Escherichia coli K-12 MDS42   2013 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
    6 Bacteria Enterobacteria Escherichia ecoc      KGB         Escherichia coli K-12 C3026   2023 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
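
Some of the links in the output above are relative (they start with "/"), and counting links from the end of each row assumes every row has the same link layout. As a possible variation, the sketch below keeps one href per cell (NA where a cell has no link) and resolves relative hrefs against the page URL. It runs on a small made-up table rather than the real KEGG page, so the HTML, the hrefs, and the column names `category_link`, `organism_link` and `source_link` are all hypothetical. It also assumes a single header row and no rowspan/colspan cells, which may not hold for the real table.

```r
library(rvest)

# Hypothetical stand-in for the KEGG page so the sketch runs offline;
# the hrefs and cell values below are made up for illustration.
page <- read_html('
<table>
  <tr><th>Category</th><th>Organism</th><th>Source</th></tr>
  <tr>
    <td><a href="/cat/bacteria">Bacteria</a></td>
    <td>eco</td>
    <td><a href="https://example.org/a">RefSeq</a></td>
  </tr>
  <tr>
    <td>Archaea</td>
    <td><a href="/org/mja">mja</a></td>
    <td><a href="https://example.org/b">RefSeq</a></td>
  </tr>
</table>')

rows <- page %>%
  html_element("table") %>%
  html_elements("tr")
rows <- rows[-1]  # drop the header row (assumes exactly one header row)

# For each row, take the first <a> in every cell; html_element() returns
# one result per cell, so cells without a link yield NA.
cell_links <- lapply(rows, function(row) {
  row %>%
    html_elements("td") %>%
    html_element("a") %>%
    html_attr("href")
})
link_matrix <- do.call(rbind, cell_links)
colnames(link_matrix) <- c("category_link", "organism_link", "source_link")

# Resolve relative hrefs against the page URL (base taken from the question);
# hrefs that are already absolute are left unchanged.
base_url <- "https://www.genome.jp/kegg/tables/br08606.html"
has_link <- !is.na(link_matrix)
link_matrix[has_link] <- xml2::url_absolute(link_matrix[has_link], base_url)

print(link_matrix)
```

If the rows line up with the result of html_table(), the matrix could be attached with cbind() or bind_cols(). On the real page it is worth checking first that nrow(link_matrix) matches the number of rows in the scraped table.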