I've worked on this for a couple of weeks without success. My long-term goal is to scrape each image from the following website (https://bioguide.congress.gov/search). For starters, I'm trying to get just the location of a single image, which is stored in the img alt attribute of the HTML.
The HTML code shows this:
<div class="l-grid__item l-grid__item--3/12 l-grid__item--12/12@mobile--sm l-grid__item--4/12@desktop l-grid__item--6/12@tablet"><div tabindex="0" class="c-card u-flex u-flex--column u-height--100% u-cursor--pointer u-bxs--dark-lg:hover c-card--@print"><div class="u-height--100% u-width--100% u-p u-flex u-flex--centered u-mb--auto"><div aria-hidden="true" class="u-max-width--80% u-max-height--250px"><img alt="/photo/66c88d1d7401a93215e0b225.jpg" class="u-max-height--250px u-height--auto u-width--auto u-block" src="/photo/66c88d1d7401a93215e0b225.jpg"></div></div><div class="u-flex u-flex--column u-flex--no-shrink u-p u-bg--off-white u-fw--bold u-color--primary u-text--center u-bt--light-gray"><div class="u-cursor--pointer u-mb--xs">AANDAHL, Fred George</div><div class="u-fz--sm u-fw--semibold">1897 – 1966</div></div></div></div>
I used the following R code, but I get character(0):
library(httr)
library(rvest)
# Fetch the HTML content with a custom User-Agent
response <- GET("https://bioguide.congress.gov/search",
user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"))
# Parse the content
page <- read_html(content(response, as = "text", encoding = "UTF-8"))
# Navigate to the div with class starting with 'l-grid__item' and extract img alt attributes
img_alt_values <- page %>%
html_nodes(xpath = "//div[starts-with(@class, 'l-grid__item')]") %>%
html_nodes(xpath = ".//img") %>%
html_attr("alt")
Can anyone suggest how I get past this?
Having a look at the network traffic, it can be seen that the data is returned from an API: the page's search function generates a POST request with a JSON payload. We can use httr2 to make these requests and return up to 100 records at a time, although to keep things minimal I limit each request to 3 records in the code below.
The url and payload are:
library(httr2)
library(jsonlite)
library(tidyverse)
# API address
url <- "https://app-elastic-prod-eus2-001.azurewebsites.net/search"
# JSON payload
payload_string <- r"({
"index": "bioguideprofiles",
"aggregations": [
{
"field": "jobPositions.congressAffiliation.congress.name",
"subFields": [
"jobPositions.congressAffiliation.congress.startDate",
"jobPositions.congressAffiliation.congress.endDate"
]
},
{
"field": "jobPositions.congressAffiliation.partyAffiliation.party.name"
},
{
"field": "jobPositions.job.name"
},
{
"field": "jobPositions.congressAffiliation.represents.regionCode"
}
],
"size": 12,
"from": 0,
"sort": [
{
"_score": true
},
{
"field": "unaccentedFamilyName",
"order": "asc"
},
{
"field": "unaccentedGivenName",
"order": "asc"
},
{
"field": "unaccentedMiddleName",
"order": "asc"
}
],
"keyword": "",
"filters": {
},
"matches": [
],
"searchType": "OR",
"applicationName": "bioguide.house.gov"
}
)"
We need to convert the payload to an R list so we can easily modify the from argument in the request with req_body_json_modify():
# Convert to R list
payload_list <- fromJSON(payload_string)
# Get n records of first x records
request_size <- 3L # 100 max per request
total_records <- 15L # 12953 records in database
from <- seq(1L, total_records, request_size) - 1L # Sequence of starting positions
# Generate base request
req <- request(url) |>
req_method("POST") |>
req_body_json(payload_list)
# Generate list of requests (5 requests of 3 records each)
requests <- from |>
lapply(\(n) req |> req_body_json_modify(from = n, size = request_size))
# Execute requests
responses <- req_perform_sequential(requests, on_error = "return")
# Parse responses and extract image URL
results <- resps_data(
responses,
\(r) r |>
resp_body_json(simplifyDataFrame = TRUE) |>
pluck("filteredHits") |>
select(starts_with("unaccented"), any_of("image"))
) |>
bind_rows() |>
hoist("image", "contentUrl") |>
select(-image) |>
mutate(image_url = ifelse(is.na(contentUrl), NA, paste0("https://bioguide.congress.gov/photo/", basename(contentUrl))), .keep = "unused") |>
as_tibble()
Where results contains the derived image URLs:
# A tibble: 15 × 4
unaccentedFamilyName unaccentedGivenName unaccentedMiddleName image_url
<chr> <chr> <chr> <chr>
1 Aandahl Fred George https://bioguide.congress.gov/photo/66c88d1d740…
2 Abbitt Watkins Moorman https://bioguide.congress.gov/photo/ad79716f164…
3 Abbot Joel NA NA
4 Abbott Amos NA NA
5 Abbott Joseph Carter https://bioguide.congress.gov/photo/39253c461f2…
6 Abbott Joseph NA https://bioguide.congress.gov/photo/43ba0fd5299…
7 Abbott Josiah Gardner https://bioguide.congress.gov/photo/470dc5df4ba…
8 Abbott Nehemiah NA NA
9 Abdnor James NA https://bioguide.congress.gov/photo/a32ba2ea44f…
10 Abel Hazel Hempel https://bioguide.congress.gov/photo/07f3a896ce1…
11 Abele Homer E. https://bioguide.congress.gov/photo/a58aa67c32f…
12 Abercrombie James NA NA
13 Abercrombie John William https://bioguide.congress.gov/photo/76a90e5795f…
14 Abercrombie Neil NA https://bioguide.congress.gov/photo/66cbb14989f…
15 Abernethy Charles Laban https://bioguide.congress.gov/photo/00ff9ca93d0…
There is a heap of other data returned with each query, but I'll leave wrangling all of that to you.
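Since your long-term goal is to scrape the images themselves, here is a minimal sketch for downloading them, assuming the results tibble built above (httr2 and tidyverse already loaded); the local "photos" directory name is just an example.
# Minimal sketch: download each photo, assuming `results` from above
library(httr2)
library(dplyr)
# Destination folder (arbitrary name)
dir.create("photos", showWarnings = FALSE)
# Keep only rows that actually have an image URL
urls <- results |>
  filter(!is.na(image_url)) |>
  pull(image_url)
# One GET request per image, each response written straight to disk
image_requests <- lapply(urls, request)
image_paths <- file.path("photos", basename(urls))
req_perform_sequential(image_requests, paths = image_paths, on_error = "continue")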