Tags: r, web-scraping, html-table, rvest, google-finance

How to web scrape google finance in which page url is not changed for multiple pages with R?


I want to web scrape stocks' financial tables for different years with R. I can obtain the financial table for the most recent period, which appears by default, but I also want the data from previous years. How can I achieve this? Here is the code I use:

# Load libraries 

library(tidyverse)
library(rvest)
library(readxl)
library(magrittr)

google_finance <- read_html("https://www.google.com/finance/quote/AAPL:NASDAQ?") |> 
  html_node(".UulDgc") |> 
  html_table()

And the result is:

> google_finance |> 
+   head(5)
# A tibble: 5 × 3
  `(USD)`          Mar 2024infoFiscal Q…¹ `Y/Y change`
  <chr>            <chr>                  <chr>       
1 "RevenueThe tot… 90.75B                 -4.31%      
2 "Operating expe… 14.37B                 5.22%       
3 "Net incomeComp… 23.64B                 -2.17%      
4 "Net profit mar… 26.04                  2.20%       
5 "Earnings per s… 1.53                   0.66% 

As you can see, we can only see the financial tables of the last period (March 2024). In this case, what should we do to scrape the financial tables for all years?


Solution

  • I think you will need to use RSelenium for this, which launches a browser and clicks the buttons for you. Here I am using Firefox as the browser; you may need to adjust some default settings to match your own setup. You will also need a Java JDK installed, since RSelenium runs a Selenium server on the JVM.

    library(RSelenium)
    library(rvest)
    library(glue)
    
    # Start a remote driver using Firefox; on first run this step may also 
    # download some required binaries.
    rd <- rsDriver(browser = "firefox", chromever = NULL)
    
    # Assign client
    remDr <- rd$client
    
    url <- "https://www.google.com/finance/quote/AAPL:NASDAQ"
    
    # Extract names of buttons
    aapl_html <- read_html(url)
    
    btn_names <- aapl_html %>% 
      html_node(".zsnTKc") %>% 
      html_attr("aria-owns") %>% 
      strsplit(., split = " ") %>% 
      unlist()
    
    # Using the Remote Driver, navigate to url of interest  
    remDr$navigate(url)
    
    # In a loop, find button of interest by its xpath, click and extract table
    
    df_ls <- lapply(
      X = btn_names
      ,FUN = function(x) {
        
        # Find button using xPath
        btn <- remDr$findElement(using = "xpath", glue("//*[@id='{x}']"))
        
        # Nifty trick to visually see which button is being clicked
        btn$highlightElement()  
        
        # Click the button
        btn$clickElement()
        
        # Wait for elements to complete loading
        Sys.sleep(1)
        
        # Read HTML after each button is clicked
        rem_aapl_html <- remDr$getPageSource()[[1]]
        
        # Extract the table from the freshly rendered page
        aapl_tbl <- rem_aapl_html %>% 
          read_html() %>% 
          html_node(".slpEwd") %>% 
          html_table()
        
        # Return the table as this iteration's result
        aapl_tbl
      }
    )
    
    # Close Remote Driver and server
    remDr$close()
    rd$server$stop()
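
    After the loop, `df_ls` is an unnamed list of tibbles, one per tab. A minimal post-processing sketch (assuming the scrape above succeeded and `btn_names` is still in scope; `aapl_all` is just an illustrative name) labels each table with the button that produced it and stacks them into one long data frame:

    library(dplyr)
    
    # Name each table after the tab (button id) it came from
    names(df_ls) <- btn_names
    
    # Stack all periods into one long tibble; the .id column records
    # which tab each row originated from
    aapl_all <- bind_rows(df_ls, .id = "tab")
    
    head(aapl_all)

    From there you can filter by `tab` (or parse the period out of each table's header column) to compare figures across years.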