I'm trying to automate scraping text from a website using rvest, but I'm getting the error below when I run a loop that reads web page URLs from the vector book.titles.urls. However, when I scrape the desired text from a single page (without the loop), it works just fine:
Working Code
library(rvest)
library(tidyverse)
#Paste URL to be read by read_html function
lex.url <- 'https://fab.lexile.com/search/results?keyword=The+True+Story+of+the+Three+Little+Pigs'
lex.webpage <- read_html(lex.url)
#Use CSS selectors to scrape lexile numbers and convert data to text
lex.num <- html_nodes(lex.webpage, '.results-lexile-code')
lex.num.txt <- html_text(lex.num[1])
lex.num.txt
> lex.num.txt
[1] "AD510L"
Reprex
library(rvest)
library(tidyverse)
book.titles <- c("The+True+Story+of+the+Three+Little+Pigs",
"The+Teacher+from+the+Black+Lagoon",
"A+Letter+to+Amy",
"The+Principal+from+the+Black+Lagoon",
"The+Art+Teacher+from+the+Black+Lagoon")
book.titles.urls <- paste0("https://fab.lexile.com/search/results?keyword=", book.titles)
out <- length(book.titles)
for (i in seq_along(book.titles.urls)) {
node1 <- html_session(i)
lex.url <- as.character(book.titles.urls[i])
lex.webpage <- read_html(lex.url[i])
lex.num <- html_nodes(node1, lex.webpage[i], '.results-lexile-code')
lex.num.txt <- html_text(lex.num[i][1])
out <- lex.num.txt[i]
}
Error code
Error in httr::handle(url) : is.character(url) is not TRUE
The error occurs because you are passing an integer (the loop index i) to the html_session function, which expects a character string (i.e. a URL). I do not believe it is necessary to create a session at all; generally this function is used when you need to log in to the website with a user ID and password.
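To make the type mismatch concrete, here is a minimal sketch that fetches nothing. The example.com URLs are placeholders I made up for illustration, not the real search URLs.

```r
# Hypothetical URLs, used only to show the types involved; nothing is fetched.
urls <- c("https://example.com/a", "https://example.com/b")

for (i in seq_along(urls)) {
  # i is the position (1, 2, ...): an integer, not a URL
  stopifnot(is.integer(i), !is.character(i))
  # urls[i] is the element at that position: a character string
  stopifnot(is.character(urls[i]))
}

# Iterating over the vector itself yields the strings directly
for (u in urls) {
  stopifnot(is.character(u))
}
```

So inside a seq_along() loop, any function that wants a URL needs book.titles.urls[i] rather than i.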
You can simplify your loop:
#output list
output<-list()
j<-1 #index
for (i in book.titles.urls) {
lex.num <- html_nodes(read_html(i), '.results-lexile-code')
# process the returned list of nodes, lex.num, here
output[[j]]<-html_text(lex.num)
j<-j+1
}
I have not tested this, but I will add this warning: when scraping a website, please make sure you agree to and abide by its terms of service.
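If you only want the first match per page, as in your single-page code, something along these lines might work. It is a sketch under assumptions: first_lexile is a helper name I made up (it is not part of rvest), and the pages are literal HTML stand-ins so the example runs offline. In practice you would loop over book.titles.urls instead.

```r
library(rvest)

# Hypothetical helper (not part of rvest): return the first Lexile code
# found in a page, or NA if the selector matches nothing.
first_lexile <- function(src) {
  codes <- html_text(html_nodes(read_html(src), '.results-lexile-code'))
  if (length(codes) == 0) NA_character_ else codes[1]
}

# Stand-in pages written as literal HTML (an assumption for illustration);
# read_html() accepts literal markup as well as URLs.
pages <- c(
  '<html><body><span class="results-lexile-code">AD510L</span><span class="results-lexile-code">600L</span></body></html>',
  '<html><body><p>No results</p></body></html>'
)

out <- character(length(pages))   # pre-allocate one slot per page
for (i in seq_along(pages)) {
  out[i] <- first_lexile(pages[i])
}
out
```

Pre-allocating out and assigning out[i] keeps every result, whereas reassigning out inside the loop would keep only the last one. Returning NA for pages with no match keeps the results aligned with the input vector.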
Edit:
Here is a further simplification using lapply, which returns a list of vectors holding the result of each call:
library(dplyr)
listofresults <- lapply(book.titles.urls, function(i) {
  read_html(i) %>%
    html_nodes('.results-lexile-code') %>%
    html_text()
})