I am currently using rvest to scrape the two HTML tables from https://www.genome.jp/kegg/tables/br08606.html#5. Specifically, I want the second table (the one with the categories Bacteria and Archaea). Cells in this table link to other pages, particularly in the Category and Source columns and in some of the columns under Organism. I want to retain all of these links when scraping the table, but I am not sure how to accomplish that. Any guidance would be much appreciated. This is what I have tried so far after scouring the internet:
library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)
item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")
tables <- item %>% html_table(fill = TRUE)
Bacteria_table <- tables[[2]]
Bacteria_table <- Bacteria_table %>% clean_names()
source <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")
print(source)
Bacteria_table_links <- data.frame(Bacteria_table)
Bacteria_table_links$source_links <- source
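You can extract the links row by row and then attach them to the table as new columns. In your attempt, the XPath //td/a starts at the document root, so it collects every link on the page in one flat vector whose length will generally not match the number of table rows. You can do this instead: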
library(rvest)
library(dplyr)
library(tidyverse)
library(janitor)
item <- read_html("https://www.genome.jp/kegg/tables/br08606.html#5")
# Extract the table content
tables <- item %>% html_table(fill = TRUE)
Bacteria_table <- tables[[2]] %>% clean_names()
# Extract all rows from the table and their links
table_rows <- item %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_nodes("tr") # Each row in the table
# Extract links for each row
links_list <- table_rows %>%
  map(~ .x %>%
    html_nodes("td a") %>% # Get <a> tags within the row
    html_attr("href") %>% # Extract href attributes
    paste(collapse = "; ") # Combine multiple links with "; "
  )
# Add extracted links to the table
# Remove the header row from the links_list to align with the data
links_list <- links_list[-1] # Assuming the first row is the header
# Links are picked from the end of each row: last = source,
# second to last = org_link_2, third to last = org_link_1
Bacteria_table <- Bacteria_table %>%
  mutate(
    source = map_chr(links_list, ~ strsplit(.x, ";")[[1]] %>% rev() %>% .[1] %>% trimws()),
    org_link_2 = map_chr(links_list, ~ strsplit(.x, ";")[[1]] %>% rev() %>% .[2] %>% trimws()),
    org_link_1 = map_chr(links_list, ~ strsplit(.x, ";")[[1]] %>% rev() %>% .[3] %>% trimws())
  )
Resulting in:
# A tibble: 6 × 10
  category category_2     category_3  organisms organisms_2 organisms_3                   year source                                    org_link_2 org_link_1
  <chr>    <chr>          <chr>       <chr>     <chr>       <chr>                        <int> <chr>                                     <chr>      <chr>
1 Bacteria Enterobacteria Escherichia eco       KGB         Escherichia coli K-12 MG1655  1997 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
2 Bacteria Enterobacteria Escherichia ecj       KGB         Escherichia coli K-12 W3110   2001 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
3 Bacteria Enterobacteria Escherichia ecd       KGB         Escherichia coli K-12 DH10B   2008 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
4 Bacteria Enterobacteria Escherichia ebw       KGB         Escherichia coli K-12 BW2952  2009 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
5 Bacteria Enterobacteria Escherichia ecok      KGB         Escherichia coli K-12 MDS42   2013 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
6 Bacteria Enterobacteria Escherichia ecoc      KGB         Escherichia coli K-12 C3026   2023 https://ftp.ncbi.nlm.nih.gov/genomes/all… /genome/e… /kegg-bin…
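Note that the values in org_link_2 and org_link_1 are site-relative paths (they start with "/"), while the source links are already absolute. If you need full URLs, one option is xml2::url_absolute(), which resolves each href against the URL of the page it came from. A minimal sketch, assuming the page URL from the question as the base; the hrefs below are made-up placeholders, not the actual KEGG paths:

```r
library(xml2)

# Base URL of the scraped page; relative hrefs are resolved against it
base_url <- "https://www.genome.jp/kegg/tables/br08606.html"

# Hypothetical hrefs standing in for the scraped link columns
links <- c(
  "/example/path?org=abc",             # site-relative link
  "https://example.org/data/file.txt"  # already absolute, left unchanged
)

# Resolve every href against the base URL
absolute_links <- url_absolute(links, base_url)
print(absolute_links)
```

On the scraped table the same call could be applied per column, for example org_link_1 = url_absolute(org_link_1, base_url) inside mutate(), assuming every row actually has that link.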