html css r rvest

rvest to scrape images


I've worked on this for a couple of weeks without success. My long-term goal is to scrape each image from the following website: https://bioguide.congress.gov/search. For starters, I'm trying to get the location of just one image, which is stored in the alt attribute of the img tag in the HTML.

The HTML for one card looks like this:

<div class="l-grid__item l-grid__item--3/12 l-grid__item--12/12@mobile--sm l-grid__item--4/12@desktop l-grid__item--6/12@tablet"><div tabindex="0" class="c-card u-flex u-flex--column u-height--100% u-cursor--pointer u-bxs--dark-lg:hover c-card--@print"><div class="u-height--100% u-width--100% u-p u-flex u-flex--centered u-mb--auto"><div aria-hidden="true" class="u-max-width--80% u-max-height--250px"><img alt="/photo/66c88d1d7401a93215e0b225.jpg" class="u-max-height--250px u-height--auto u-width--auto u-block" src="/photo/66c88d1d7401a93215e0b225.jpg"></div></div><div class="u-flex u-flex--column u-flex--no-shrink u-p u-bg--off-white u-fw--bold u-color--primary u-text--center u-bt--light-gray"><div class="u-cursor--pointer u-mb--xs">AANDAHL, Fred George</div><div class="u-fz--sm u-fw--semibold">1897 – 1966</div></div></div></div>

I used the following R code, but it returns character(0):

library(httr)
library(rvest)

# Fetch the HTML content with a custom User-Agent
response <- GET("https://bioguide.congress.gov/search", 
                user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"))

# Parse the content
page <- read_html(content(response, as = "text", encoding = "UTF-8"))

# Navigate to the div with class starting with 'l-grid__item' and extract img alt attributes
img_alt_values <- page %>%
  html_nodes(xpath = "//div[starts-with(@class, 'l-grid__item')]") %>%
  html_nodes(xpath = ".//img") %>%
  html_attr("alt")

Can anyone suggest how I get past this?


Solution

  • Looking at the network traffic shows that the data comes from an API: the page's search function sends a POST request with a JSON payload. That would explain the character(0), since the cards are presumably built in the browser and are not part of the HTML that GET() returns. We can use httr2 to make these requests ourselves and return up to 100 records at a time, although to keep the example small I limit each request to 3 records in the code below.
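
    A quick offline check suggests the XPath from the question is not at fault: run against a trimmed copy of the markup shown in the question, it does find the alt value. This is only a sketch; card_html is an abbreviated stand-in for the rendered card (class lists shortened), not what GET() actually receives from the site.

    ```r
    library(rvest)

    # Trimmed copy of the card markup from the question (class lists shortened)
    card_html <- '<div class="l-grid__item l-grid__item--3/12">
      <div class="c-card">
        <img alt="/photo/66c88d1d7401a93215e0b225.jpg"
             src="/photo/66c88d1d7401a93215e0b225.jpg">
      </div>
    </div>'

    # Same XPath steps as in the question, applied to the static snippet
    img_alt_values <- minimal_html(card_html) |>
      html_elements(xpath = "//div[starts-with(@class, 'l-grid__item')]") |>
      html_elements(xpath = ".//img") |>
      html_attr("alt")

    img_alt_values
    #> [1] "/photo/66c88d1d7401a93215e0b225.jpg"
    ```

    If the same steps return character(0) on the live page, the img elements are most likely missing from the HTML the server sends, which is what makes the API route attractive.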

    The URL and payload are:

    library(httr2)
    library(jsonlite)
    library(tidyverse)
    
    # API address
    url <- "https://app-elastic-prod-eus2-001.azurewebsites.net/search"
    
    # JSON payload  
    payload_string <- r"({
        "index": "bioguideprofiles",
        "aggregations": [
            {
                "field": "jobPositions.congressAffiliation.congress.name",
                "subFields": [
                    "jobPositions.congressAffiliation.congress.startDate",
                    "jobPositions.congressAffiliation.congress.endDate"
                ]
            },
            {
                "field": "jobPositions.congressAffiliation.partyAffiliation.party.name"
            },
            {
                "field": "jobPositions.job.name"
            },
            {
                "field": "jobPositions.congressAffiliation.represents.regionCode"
            }
        ],
        "size": 12,
        "from": 0,
        "sort": [
            {
                "_score": true
            },
            {
                "field": "unaccentedFamilyName",
                "order": "asc"
            },
            {
                "field": "unaccentedGivenName",
                "order": "asc"
            },
            {
                "field": "unaccentedMiddleName",
                "order": "asc"
            }
        ],
        "keyword": "",
        "filters": {
    
        },
        "matches": [
    
        ],
        "searchType": "OR",
        "applicationName": "bioguide.house.gov"
    }
    )"
    

    We need to convert the payload to an R list so that the from and size fields can be changed for each request with req_body_json_modify():

    # Convert to R list
    payload_list <- fromJSON(payload_string)
    
    # Fetch the first total_records records, request_size records per request
    request_size <- 3L         # 100 max per request
    total_records <- 15L       # 12953 records in database
    from <- seq(1L, total_records, request_size) - 1L  # Sequence of starting positions
    
    # Generate base request
    req <- request(url) |>
        req_method("POST") |>
        req_body_json(payload_list) 
    
    # Generate list of requests (5 requests of 3 records each)
    requests <- from |> 
       lapply(\(n) req |> req_body_json_modify(from = n, size = request_size))
    
    # Execute requests
    responses <- req_perform_sequential(requests, on_error = "return")
    
    # Parse responses and extract image URL
    results <- resps_data(
      responses,
      \(r) r |>
        resp_body_json(simplifyDataFrame = TRUE) |>
        pluck("filteredHits")  |>
        select(starts_with("unaccented"), any_of("image"))
      ) |>
      bind_rows() |>
      hoist("image", "contentUrl") |> 
      select(-image) |> 
      mutate(image_url = ifelse(is.na(contentUrl), NA, paste0("https://bioguide.congress.gov/photo/", basename(contentUrl))), .keep = "unused") |> 
      as_tibble()
    

    After this, results contains the derived image URLs:

    # A tibble: 15 × 4
       unaccentedFamilyName unaccentedGivenName unaccentedMiddleName image_url                                       
       <chr>                <chr>               <chr>                <chr>                                           
     1 Aandahl              Fred                George               https://bioguide.congress.gov/photo/66c88d1d740…
     2 Abbitt               Watkins             Moorman              https://bioguide.congress.gov/photo/ad79716f164…
     3 Abbot                Joel                NA                   NA                                              
     4 Abbott               Amos                NA                   NA                                              
     5 Abbott               Joseph              Carter               https://bioguide.congress.gov/photo/39253c461f2…
     6 Abbott               Joseph              NA                   https://bioguide.congress.gov/photo/43ba0fd5299…
     7 Abbott               Josiah              Gardner              https://bioguide.congress.gov/photo/470dc5df4ba…
     8 Abbott               Nehemiah            NA                   NA                                              
     9 Abdnor               James               NA                   https://bioguide.congress.gov/photo/a32ba2ea44f…
    10 Abel                 Hazel               Hempel               https://bioguide.congress.gov/photo/07f3a896ce1…
    11 Abele                Homer               E.                   https://bioguide.congress.gov/photo/a58aa67c32f…
    12 Abercrombie          James               NA                   NA                                              
    13 Abercrombie          John                William              https://bioguide.congress.gov/photo/76a90e5795f…
    14 Abercrombie          Neil                NA                   https://bioguide.congress.gov/photo/66cbb14989f…
    15 Abernethy            Charles             Laban                https://bioguide.congress.gov/photo/00ff9ca93d0…
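
    Since the long-term goal in the question is to save the images themselves, here is a minimal sketch of one way to do that from the image_url column. The helpers image_targets() and download_images() and the photos directory are hypothetical names introduced for illustration, and I am assuming the photo URLs can be fetched with a plain GET and no extra headers, which I have not verified.

    ```r
    library(httr2)

    # Hypothetical helper: pair each non-missing image URL with a local file path
    image_targets <- function(image_url, dest_dir = "photos") {
      urls <- image_url[!is.na(image_url)]
      data.frame(url = urls, path = file.path(dest_dir, basename(urls)))
    }

    # Hypothetical helper: download each image, pausing between requests
    download_images <- function(image_url, dest_dir = "photos") {
      targets <- image_targets(image_url, dest_dir)
      dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
      for (i in seq_len(nrow(targets))) {
        # req_perform(path = ...) writes the response body straight to disk
        request(targets$url[i]) |> req_perform(path = targets$path[i])
        Sys.sleep(1)  # be polite to the server
      }
      invisible(targets$path)
    }

    # Offline check of the URL-to-path mapping (no request is made here)
    targets <- image_targets(
      c("https://bioguide.congress.gov/photo/66c88d1d7401a93215e0b225.jpg", NA)
    )
    ```

    Calling download_images(results$image_url) would then attempt the actual downloads; records without a photo are skipped because their image_url is NA.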
    

    Each query returns a heap of other data as well, but I'll leave wrangling it all to you.
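
    For anyone adapting the paging, the offsets and the per-request body changes can be checked offline before sending anything. The sketch below uses a cut-down stand-in for the payload (only size, from and keyword) and utils::modifyList() to illustrate the idea of overwriting fields in the list; it is not a claim about how req_body_json_modify() is implemented.

    ```r
    library(jsonlite)

    request_size  <- 3L
    total_records <- 15L

    # Zero-based starting offsets, one per request
    from <- seq(0L, total_records - 1L, by = request_size)
    from
    #> [1]  0  3  6  9 12

    # Cut-down stand-in for the real payload (assumption: paging fields only)
    payload_list <- fromJSON('{"size": 12, "from": 0, "keyword": ""}')

    # Overwrite the paging fields once per offset
    bodies <- lapply(
      from,
      \(n) modifyList(payload_list, list(from = n, size = request_size))
    )

    # Offsets and sizes that would be sent
    sent_from <- vapply(bodies, \(b) b$from, integer(1))
    sent_size <- vapply(bodies, \(b) b$size, integer(1))
    ```

    Here sent_from matches from and every element of sent_size is 3, so the five requests cover records 0 to 14 without overlap. Raising request_size towards the 100-record limit mentioned above reduces the number of requests needed.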