r screen-scraping rvest

Scraping In R creates 18 tibbles


I am trying to learn how to scrape data in R. With some help from other resources and ChatGPT, I have code that scrapes a table of NAIA baseball stats, but it creates 18 tibbles. It does grab all the stats, which is good!

When I use bind_rows() to put everything in one df, I get 3582 observations (199 teams * 18 tibbles) and several NA values across the columns. It seems that bind_rows() creates 199 rows with the stats from the 1st tibble, then 199 rows with the stats from the 2nd tibble, and so on until all 18 tibbles are included.
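A small toy example reproduces this stacking behaviour. The team names, column names and values here are made up for illustration and are not the scraped data:

```r
library(dplyr)

# Two made-up tables that share only the Team column
hitting  <- tibble(Team = c("A", "B"), avg = c(0.301, 0.275))
pitching <- tibble(Team = c("A", "B"), era = c(3.10, 4.25))

# bind_rows() stacks the tables and fills the columns a table lacks with NA
stacked <- bind_rows(list(hitting, pitching))

dim(stacked)          # 4 rows (2 teams * 2 tables), 3 columns
sum(is.na(stacked))   # 4 missing values
```

Each team appears once per table instead of once overall, which is the same pattern as the 3582 rows above.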

I would like to create one df with 199 rows containing all the stats. I have attached a picture showing that the df has values for the first group of stats, then the list of teams starts over with the 2nd group of stats.

Pic

library(dplyr)
library(rvest)

url <- "https://naiastats.prestosports.com/sports/bsb/2022-23/teams"

page <- read_html(url)

data <- page %>% html_nodes("table") %>% html_table()

combined_data <- bind_rows(data)

Solution

  • Instead of bind_rows() you want to left_join() (from "dplyr") tables 2 to N onto the first table. The problem you have (as MrFlick mentioned) is that the page holds 2 different sets of stats, and some of the tables share the same column names.

    In this code I am only using the first 9 tables - these are the overall record stats. For the conference records use tables 10 to 18.

    library(dplyr)
    library(rvest)
    
    url <- "https://naiastats.prestosports.com/sports/bsb/2022-23/teams"
    
    page <- read_html(url)
    
    data <- page %>% html_nodes("table") %>% html_table()
    
    #get the first table and remove the ranking column
    output <- data[[1]][, -1]
    #For tables 2 to 9 - season records only
    for(i in 2:9) {
       #join the next table to the new master table
       #if there are duplicate columns they are renamed with a .x and .y
       output <- left_join(output, data[[i]][, -1], by=join_by(Team == Team))
       #remove the duplicate columns - the .y
       output <- select(output, !ends_with(".y"))
       #reset the original column names - remove the trailing .x
       #(the dot is escaped and anchored so only the suffix is stripped)
       colnames(output) <- sub("\\.x$", "", colnames(output))
    }
    

    The above code will produce a data frame 199 rows long and 46 variables wide.
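The same idea can be wrapped in a function so it can be reused for the conference tables. This is only a sketch: `join_stat_tables` is a hypothetical helper name, and it drops columns that already exist before joining (instead of cleaning up the .x/.y suffixes afterwards), which assumes that duplicated columns hold the same values in every table. The toy tables below are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: join a list of stat tables on a key column,
# adding only the columns that are not already in the result.
join_stat_tables <- function(tables, key = "Team") {
  Reduce(function(acc, tbl) {
    new_cols <- setdiff(names(tbl), names(acc))
    left_join(acc, tbl[, c(key, new_cols), drop = FALSE], by = key)
  }, tables)
}

# Made-up tables: gp is duplicated, avg and era are unique
t1 <- tibble(Team = c("A", "B"), gp = c(50, 48), avg = c(0.301, 0.275))
t2 <- tibble(Team = c("A", "B"), gp = c(50, 48), era = c(3.10, 4.25))

toy <- join_stat_tables(list(t1, t2))
names(toy)   # "Team" "gp" "avg" "era"
```

With the scraped list, something like `join_stat_tables(lapply(data[10:18], function(x) x[, -1]))` should give the conference stats, assuming every table has a ranking column first and a Team column.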