Edit My question has changed considerably as I learned more about the problem. After adding several edits this post became unwieldy, so I'm starting over with a clean slate and the final question.
I'm trying to download all public datasets for the state of Utah through the state Socrata portal using R. There are more than 10,000 records listed, so deep scrolling methods using the scroll_id
parameter are required. Using this parameter in the query returns results alphabetically by resource.id
. However, using a NULL id defaults to starting with the first letter alphabetically and will not return all results because several thousand records begin with a number.
My question is: how can I query with the scroll_id
parameter such that it starts with the record with the lowest number, rather than first letter?
The workaround I have is first querying the portal without a scroll_id
parameter. This will return 10,000 results in arbitrary order. I then take the lowest digit record from that set and re-query with deep scrolling from there. This produces what I imagine is most records, but probably not all. I'm sure this is an absurd workaround for a simple problem, but here is the code I'm working with:
pacman::p_load(tidyverse,
httr,
jsonlite)
# Function to query without scroll_id
get_socrata_data <- function(domain,
limit = 15000) {
"http://api.us.socrata.com/api/catalog/v1?domains=${domain}&limit=${limit}" %>%
str_interp() %>%
httr::GET() %>%
content("text") %>%
fromJSON(flatten = TRUE)
}
# Scroll function that includes scroll_id
scroll_socrata <- function(domain,
limit = 15000,
scroll_id = NULL) {
"http://api.us.socrata.com/api/catalog/v1?domains=${domain}&scroll_id=${scroll_id}&limit=${limit}" %>%
str_interp() %>%
httr::GET() %>%
content("text") %>%
fromJSON(flatten = TRUE)
}
ut_domain <- 'opendata.utah.gov'
# Initial query without scroll_id
ut_results <- get_socrata_data(ut_domain)
nrow(ut_results$results)
# If results are capped, restart with first digit id using scroll function
if (nrow(ut_results$results) == 10000) {
# Pull out first record starting with digit
digit_start <- ut_results$results |>
filter(str_starts(resource.id, '\\d'))
first_digit_id <- first(sort(digit_start$resource.id))
# Data frame for results
output <- data.frame()
# Restart query with scroll_socrata function that includes scroll_id
page <- scroll_socrata(ut_domain,
scroll_id = first_digit_id)
output <- bind_rows(output, page$results)
# more pages if necessary
while (nrow(page$results) == 10000) {
page <- scroll_socrata(ut_domain,
scroll_id = last(page$results$resource.id))
output <- bind_rows(output, page$results)
}
}
# check total
nrow(output)
At long last, the very simple answer to my problem. Entering a scroll_id of '0' will query records starting with the resource.id with the lowest digit.
scroll_socrata <- function(domain,
limit = 15000,
scroll_id = NULL) {
"http://api.us.socrata.com/api/catalog/v1?domains=${domain}&scroll_id=${scroll_id}&limit=${limit}" %>%
str_interp() %>%
httr::GET() %>%
content("text") %>%
fromJSON(flatten = TRUE)
}
ut_domain <- 'opendata.utah.gov'
# DF for results
output <- data.frame()
# First query starting with lowest digit resource.id
page <- scroll_socrata(ut_domain,
limit = 15000,
scroll_id = '0')
output <- bind_rows(output, page$results)
# More pages if necessary
while (nrow(page$results) == 10000) {
page <- scroll_socrata(ut_domain,
limit = 15000,
scroll_id = last(page$results$resource.id))
output <- bind_rows(output, page$results)
}
output