r, web-scraping, rvest

Function for "Next Page" rvest scrape


I've added the final code I used at the bottom in case anyone has a similar question. I used the answer provided below but added a couple of nodes, Sys.sleep pauses (to avoid being kicked off the server), and an if check to prevent an error after the last valid page is scraped.

I'm trying to pull several pages from a website by following the "next page" link on each one. I created a data frame with a nexturl variable and filled in the first value with the starting URL.

#building dataframe with variables
bframe <- data.frame(matrix(ncol = 3, nrow = 10000))
x <- c("curpage", "nexturl", "posttext")
colnames(bframe) <- x

#assigning first value for nexturl
bframe$nexturl[[1]] <- "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"

I want to pull text as follows (I know the code is clunky -- I am brand new at this -- but it does get what I want):

library(rvest)
##create html object
blogfunc    <-  read_html("http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/")
##create object with post content scraped
posttext    <-  blogfunc    %>% 
    html_nodes(".article-content")%>%           
    html_text()                 
##strip control characters from the scraped text (not from the html object)
posttext    <-  gsub('[\a\t\n]', '', posttext)
##scrape next url
nexturl <-  blogfunc    %>% 
    html_nodes(".prev-post-link-wrap a") %>%    
    html_attr("href")           

Any suggestions on turning the above into a function and using it to fill in the dataframe? I am struggling to apply online examples.


Working answer, with sleep times and an if check that skips scraping once the last valid page has been reached.

```{r}
library(rvest)

# First page to scrape.
url <- "http://www.ashleyannphotography.com/blog/2008/05/31/the-making-of-a-wet-willy/"

# Pull the node for the post content.
# Sys.sleep() pauses between requests so the server is less likely
# to treat the scraper as a robot.
getPostContent <- function(url) {
    Sys.sleep(2)
    read_html(url) %>%
        html_nodes(".article-content") %>%
        html_text() %>%
        gsub(x = ., pattern = '[\a\t\n]', replacement = '')
}

# Pull the node for the date.
getDate <- function(url) {
    Sys.sleep(2.6)
    read_html(url) %>%
        html_node(".updated") %>%
        html_text()
}

# Pull the node for the title.
getTitle <- function(url) {
    Sys.sleep(.8)
    read_html(url) %>%
        html_node(".article-title") %>%
        html_text()
}

# Pull the URL of the previous post (NA when there is no such link).
getNextUrl <- function(url) {
    Sys.sleep(.2)
    read_html(url) %>%
        html_node(".prev-post-link-wrap a") %>%
        html_attr("href")
}

# Walk back through up to n posts and collect them in a data frame.
scrapeBackMap <- function(url, n) {
    Sys.sleep(3)
    purrr::map_df(1:n, ~{
        # Only scrape while url is not NA; after the last valid page
        # the remaining iterations return nothing.
        if (!is.na(url)) {
            oUrl <- url
            date <- getDate(url)
            post <- getPostContent(url)
            title <- getTitle(url)
            url <<- getNextUrl(url)

            # One row per post.
            data.frame(curpage = oUrl,
                       nexturl = url,
                       posttext = post,
                       pubdate = date,
                       ptitle = title)
        }
    })
}

# Create the data frame.
res <- scrapeBackMap(url, 3000)
class(res)
str(res)
```
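
One thing worth noting about the code above: each helper calls read_html() on its own, so every post is downloaded four times. A minimal sketch of an alternative that parses a page once and extracts all four fields from it is shown here. The function name parsePost and the stand-in HTML are hypothetical, the selectors are the ones used above, and collapsing several .article-content nodes into one string is an assumption about the desired output.

```r
library(rvest)

# Hypothetical helper: extract every field from an already parsed page,
# so each post needs only one read_html() call.
parsePost <- function(page) {
    post <- html_text(html_nodes(page, ".article-content"))
    data.frame(
        # Assumption: join multiple content nodes into a single string.
        posttext = paste(gsub("[\a\t\n]", "", post), collapse = " "),
        pubdate  = html_text(html_node(page, ".updated")),
        ptitle   = html_text(html_node(page, ".article-title")),
        nexturl  = html_attr(html_node(page, ".prev-post-link-wrap a"), "href"),
        stringsAsFactors = FALSE
    )
}

# Stand-in HTML so the sketch runs without network access.
demo <- read_html('<html><body>
  <h1 class="article-title">A title</h1>
  <span class="updated">April 2, 2017</span>
  <div class="article-content">Some\ttext\n</div>
  <div class="prev-post-link-wrap"><a href="http://example.com/prev">prev</a></div>
</body></html>')

demoRow <- parsePost(demo)
str(demoRow)
```

With a real page this would be called as parsePost(read_html(url)), with a single Sys.sleep() before the request.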

Solution

  • The idea I'm following is to scrape each post's content, find the 'previous post' url, navigate to that url, and repeat the process.

    library(rvest)    
    
    url <-  "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/"
    

    Scrape post's content

    getPostContent <- function(url){
        read_html(url) %>% 
            html_nodes(".article-content")%>%           
            html_text() %>% 
            gsub(x = ., pattern = '[\a\t\n]', replacement = '')
        }
    

    Scrape next url

    getNextUrl <- function(url) {
        read_html(url) %>% 
            html_node(".prev-post-link-wrap a") %>%
            html_attr("href")
    }
    

    Once we have these 'support' functions we can glue them together.

    Apply function n times

    A for or while loop could be set to continue until getNextUrl returns NA (which is what html_attr gives when the 'previous post' link is missing), but I preferred to define a number n of jumps back and apply the function at each 'jump'.

    Starting with the original url, we retrieve its content, then overwrite url with the newly extracted value, and continue until all n iterations are done.

    scrapeBackApply <- function(url, n) {
        sapply(1:n, function(x) {
            r <- getPostContent(url)
            # '<<-' overwrites 'url' in the enclosing scrapeBackApply() call,
            # not the global 'url'
            url <<- getNextUrl(url)
            r
        })
    }
    

    Alternatively, we can use the purrr::map family, and map_df in particular, to get a data.frame like your bframe directly.

    scrapeBackMap <- function(url, n) {
        purrr::map_df(1:n, ~{
            oUrl <- url
            post <- getPostContent(url)
            url <<- getNextUrl(url)
            data.frame(curpage = oUrl, 
                            nexturl = url,
                            posttext = post)
        })
    }
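
A note on the `<<-` used in both wrappers: it assigns to the nearest enclosing environment in which url already exists. Because url is an argument of the wrapping function, that is the wrapper's own frame, so a global variable of the same name is left alone. The following offline sketch illustrates just the scoping; walkBack and the fake "/prev" links are hypothetical and no scraping takes place.

```r
url <- "global"

# Hypothetical stand-in for scrapeBackApply(): same structure, no scraping.
walkBack <- function(url, n) {
    sapply(1:n, function(i) {
        current <- url
        # '<<-' finds 'url' in walkBack()'s frame and updates it there.
        url <<- paste0(current, "/prev")
        current
    })
}

visited <- walkBack("start", 3)
visited
#> [1] "start"           "start/prev"      "start/prev/prev"

# The global variable is untouched.
url
#> [1] "global"
```

A practical consequence is that calling the wrapper twice with the same starting url begins from the same post both times.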
    

    Results

    res <- scrapeBackApply(url, 2)
    class(res)
    #> [1] "character"
    str(res)
    #>  chr [1:2] "Six years ago this month, my eldest/oldest/elder/older daughter<U+0085>Okay sidenote <U+0096> the #1 grammar correction I receive on a regula"| __truncated__ ...
    
    res <- scrapeBackMap(url, 4)
    class(res)
    #> [1] "data.frame"
    str(res)
    #> 'data.frame':    4 obs. of  3 variables:
    #>  $ curpage : chr  "http://www.ashleyannphotography.com/blog/2017/04/02/canopy-anna-turner/" "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/"
    #>  $ nexturl : chr  "http://www.ashleyannphotography.com/blog/2017/03/31/a-guest-post-an-snapshop-interview/" "http://www.ashleyannphotography.com/blog/2017/03/29/explore-il-casey-small-town-big-things/" "http://www.ashleyannphotography.com/blog/2017/03/27/explore-ok-oklahoma-wondertorium/" "http://www.ashleyannphotography.com/blog/2017/03/24/the-youngest-cousin/"
    #>  $ posttext: chr  "Six years ago this month, my eldest/oldest/elder/older daughter<U+0085>Okay sidenote <U+0096> the #1 grammar correction I receive on a regula"| __truncated__ "Today I am guest posting over on the Bought Beautifully blog about something new my family tried as a way to usher in our Easte"| __truncated__ "A couple of weeks ago, we drove to Illinois to watch one my nieces in a track meet and another niece in her high school musical"| __truncated__ "Often the activities we do as a family tend to cater more towards our older kids than the girls. The girls are always in the mi"| __truncated__
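
For completeness, the open-ended loop mentioned earlier (keep going until there is no 'previous post' link) could be sketched roughly as follows. This is only an illustration under some assumptions: scrapeBackWhile, its getContent/getNext arguments, and the maxPages safety cap are all hypothetical names introduced here, and the fake three-post chain stands in for the real site so the example runs offline.

```r
# Hypothetical variant: follow links until getNext() returns NA,
# with a cap on the number of pages as a safety net.
scrapeBackWhile <- function(url, getContent, getNext, maxPages = 100) {
    rows <- list()
    while (!is.na(url) && length(rows) < maxPages) {
        nextUrl <- getNext(url)
        rows[[length(rows) + 1]] <- data.frame(
            curpage  = url,
            nexturl  = nextUrl,
            posttext = getContent(url),
            stringsAsFactors = FALSE
        )
        url <- nextUrl
    }
    do.call(rbind, rows)
}

# Offline stand-ins for the real scrapers: a chain a -> b -> c -> (none).
fakeNext <- c(a = "b", b = "c", c = NA)
chain <- scrapeBackWhile(
    "a",
    getContent = function(u) paste("post", u),
    getNext    = function(u) unname(fakeNext[u])
)
chain$curpage
#> [1] "a" "b" "c"
```

With the functions defined above, the call would look like scrapeBackWhile(url, getPostContent, getNextUrl); here url is a plain local variable, so no `<<-` is needed.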