Tags: r, web-scraping, rcurl

Scraping multiple webpages using getURIAsynchronous()


I am a novice in R. I am trying to scrape multiple https web pages using the getURIAsynchronous() function from the RCurl package. However, for every url, the function returns "" as the result.

I tried the url.exists() function from the same package to check whether the urls are reachable. To my surprise, it returned FALSE, even though the urls do exist.

Since the https urls I am using are company-specific, I cannot provide the real ones here for confidentiality reasons. However, readLines() successfully extracts all of the html content from these pages; it is just slow for thousands of urls. Any idea why getURIAsynchronous() returns "" instead of the html content? My focus is only on scraping the entire html content, and I can parse the data myself.
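
Roughly, the slow approach that does work looks like this (with placeholder urls standing in for mine):

url_list <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
              "https://cran.r-project.org/web/packages/rvest/index.html")

# one page at a time; works, but takes too long for thousands of urls
pages <- lapply(url_list, function(u) paste(readLines(u, warn = FALSE), collapse = "\n"))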

Is there any other package out there that can help me scrape multiple https websites faster, rather than one page at a time?

UPDATE: Below is a small example similar to what I have been trying to do. Here it is just a few urls, but in my project I have a few thousand of them. With code like the following, I get "" for all urls.

library(RCurl)

# two example urls standing in for my company-specific ones
source_url <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
                "https://cran.r-project.org/web/packages/rvest/index.html")

# fetch all pages asynchronously; every element comes back as ""
multi_urls <- getURIAsynchronous(source_url)
multi_urls <- as.list(multi_urls)
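
I also wondered whether TLS verification might be the issue. Something like this is the kind of variation I have in mind (the .opts values here are just a guess on my part):

multi_urls <- getURIAsynchronous(source_url,
                                 .opts = list(ssl.verifypeer = FALSE,
                                              followlocation = TRUE))

Would options like these make a difference?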

Solution

  • I don't know which specific URLs you are trying to scrape, but the code below demonstrates how to loop through several URLs and scrape data from each. Maybe you can leverage this code to achieve your specific goal(s).

    library(rvest)
    library(stringr)
    
    #create a master dataframe to store all of the results
    complete <- data.frame()
    
    yearsVector <- c("2010", "2011", "2012", "2013", "2014", "2015")
    #position is not needed since all of the info is stored on the page
    #positionVector <- c("qb", "rb", "wr", "te", "ol", "dl", "lb", "cb", "s")
    positionVector <- c("qb")
    for (i in seq_along(yearsVector)) {
        for (j in seq_along(positionVector)) {
            # create a url template 
            URL.base <- "http://www.nfl.com/draft/"
            URL.intermediate <- "/tracker?icampaign=draft-sub_nav_bar-drafteventpage-tracker#dt-tabs:dt-by-position/dt-by-position-input:"
            #create the dataframe with the dynamic values
            URL <- paste0(URL.base, yearsVector[i], URL.intermediate, positionVector[j])
            #print(URL)
    
            #read the page - store it to make debugging easier
            page <- read_html(URL)
            #work with the raw html as a single string
            page_text <- as.character(page)

            #find the JSON-like record embedded in the page for each player
            playersloc <- str_locate_all(page_text, "\\{\"personId.*?\\}")[[1]]
            #column 1 holds the start positions, column 2 the end positions
            players <- str_sub(page_text, playersloc[, 1] + 1, playersloc[, 2] - 1)
            #protect names like "Smith, Jr." before splitting on commas below
            players <- gsub(", ", "_", players)
    
            #split and reshape the data in a data frame
            play2 <- strsplit(gsub("\"", "", players), ',')
            data <- sapply(strsplit(unlist(play2), ":"), FUN = function(x) { x[2] })
            df <- data.frame(matrix(data, ncol = 16, byrow = TRUE))
            #use the keys from the first record as the column names
            names(df) <- sapply(strsplit(unlist(play2[1]), ":"), FUN = function(x) { x[1] })

            #append this page's rows to the master dataframe
            complete <- rbind(complete, df)
        }
    }
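
    To apply the same pattern directly to your own vector of urls, here is a minimal sketch (using the example urls from your question; the tryCatch guard is my own addition so one bad url does not abort the whole run):

    library(rvest)

    source_url <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
                    "https://cran.r-project.org/web/packages/rvest/index.html")

    #fetch one page, returning NA instead of failing on error
    safe_read <- function(u) {
        tryCatch(as.character(read_html(u)), error = function(e) NA_character_)
    }

    #full html of each page, named by url
    pages <- setNames(lapply(source_url, safe_read), source_url)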
    

    Also, here is a second example that pages through a multi-page results table:

    #only rvest is needed for this example
    library(rvest)
    
    site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 
    
    #read the first results page and pull out its first table
    webpage <- read_html(site)
    draft_table <- html_nodes(webpage, 'table')
    draft <- html_table(draft_table)[[1]]
    
    
    #offsets for the nine result pages (100 rows each)
    jump <- seq(0, 800, by = 100)
    #rebuild the query url once per offset
    site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
                  'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
                  '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
                  '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
                  '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
                  '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
                  '&order_by_asc=&offset=', jump, sep="")
    
    #scrape the first table from each offset page
    dfList <- lapply(site, function(i) {
        webpage <- read_html(i)
        draft_table <- html_nodes(webpage, 'table')
        html_table(draft_table)[[1]]
    })
    
    finaldf <- do.call(rbind, dfList)
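
    With a few thousand urls, you will probably also want to throttle the requests and skip failures. A hedged variant of the loop above (the one-second delay is an arbitrary choice on my part):

    dfList <- lapply(site, function(i) {
        Sys.sleep(1)    #arbitrary polite delay between requests
        tryCatch({
            webpage <- read_html(i)
            draft_table <- html_nodes(webpage, 'table')
            html_table(draft_table)[[1]]
        }, error = function(e) NULL)    #drop pages that fail
    })

    #rbind ignores the NULL entries left by failed pages
    finaldf <- do.call(rbind, dfList)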