Tags: r, web-scraping, yahoo-finance

Scraping fund metadata from Yahoo Finance (not prices)


Yahoo Finance changed its website structure. The following R code previously scraped fund metadata, but it no longer returns the target variables (PE, NAV, Beta, Yield, etc.). The problem is the node_txt variable: it references a 'table' class that is no longer on the page. I'm not an expert at inspecting page source to find where the list data is stored. Any help updating the code is appreciated.

library(rvest)
library(purrr)
library(dplyr)

ticker <- "IVV"
url <- paste0("https://finance.yahoo.com/quote/",ticker)

ivv_html <- read_html(url)

node_txt <- ".svelte-tx3nkj" # This contains "table" info of interest

df <- ivv_html %>% 
  html_nodes(paste0(".container", node_txt)) %>%
  map_dfr(~{
    tibble(
      label = html_nodes(.x, paste0(".label", node_txt)) %>% 
        html_text(trim = TRUE),
      value = html_nodes(.x, paste0(".value", node_txt)) %>% 
        html_text(trim = TRUE)
    )
  })

df %>% 
  filter(label %in% c("NAV", "PE Ratio (TTM)", "Yield", "Beta (5Y Monthly)", "Expense Ratio (net)"))

Solution

  • Search for li elements whose class attribute contains yf-tx3nkj, extract the text inside each span, and reshape the result into a matrix.

    library(xml2)
    
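    # ivv_html is the document returned by read_html(url) in the question
    ivv_html |>
      xml_find_all("//li[contains(@class, 'yf-tx3nkj')]/span/text()") |>
      as.character() |>
      tail(12) |>                   # keep the last 12 text nodes (6 label/value pairs)
      matrix(6, 2, byrow = TRUE)    # one row per pair: label, value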
    
    ##      [,1]                     [,2]    
    ## [1,] "NAV"                    "561.61"
    ## [2,] "PE Ratio (TTM)"         "28.29" 
    ## [3,] "Yield"                  "1.30%" 
    ## [4,] "YTD Daily Total Return" "18.34%"
    ## [5,] "Beta (5Y Monthly)"      "1.00"  
    ## [6,] "Expense Ratio (net)"    "0.03%"
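If a data frame is more convenient than a matrix, the same idea can be sketched by reading the label and value from each matching li directly, which avoids depending on tail(12). This is only a sketch under assumptions: sample_html is a hand-written stand-in for the downloaded page (its markup is guessed from the XPath above, with values copied from the output shown), extract_pairs is a hypothetical helper, and each li is assumed to hold exactly two spans. The class token is a parameter because it already changed once, from svelte-tx3nkj to yf-tx3nkj.

```r
library(xml2)

# Stand-in for the real page. The markup is an ASSUMPTION modeled on the
# XPath in the answer: <li class="... yf-tx3nkj"> with a label span and a value span.
sample_html <- read_html('
<ul>
  <li class="yf-tx3nkj"><span class="label">NAV</span><span class="value">561.61</span></li>
  <li class="yf-tx3nkj"><span class="label">PE Ratio (TTM)</span><span class="value">28.29</span></li>
  <li class="yf-tx3nkj"><span class="label">Yield</span><span class="value">1.30%</span></li>
  <li class="other"><span>Ignored</span><span>0</span></li>
</ul>')

# Hypothetical helper: one row per matching <li>,
# first span taken as the label, second span as the value.
extract_pairs <- function(doc, css_token = "yf-tx3nkj") {
  items <- xml_find_all(doc, sprintf("//li[contains(@class, '%s')]", css_token))
  data.frame(
    label = xml_text(xml_find_first(items, "./span[1]"), trim = TRUE),
    value = xml_text(xml_find_first(items, "./span[2]"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

fund_meta <- extract_pairs(sample_html)

# Keep only the rows of interest, as the filter() call in the question did
wanted <- c("NAV", "Yield")
fund_meta[fund_meta$label %in% wanted, ]
```

Against the live page, extract_pairs(ivv_html) would be the equivalent call, assuming the page still uses that class token and the two-span layout.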