I am trying to check if a large list of URLs "exist" in R. Let me know if you can help!
My objective: I am trying to check whether URLs from the Psychology Today online therapist directory exist. I have a data frame of many possible URLs from this directory. Some of them exist, but some of them do not. When a URL does not exist, it redirects to a generic Psychology Today page.
For example, this URL exists: "https://www.psychologytoday.com/us/therapists/new-york/a?page=10". This is the tenth page of New York therapists whose last names start with "A." There are at least 10 pages of New York therapists whose names start with "A," so the page exists.
However, this URL does not exist: "https://www.psychologytoday.com/us/therapists/new-york/a?page=119". There are not 119 pages of therapists in New York whose last name starts with "A". Accordingly, the Psychology Today website redirects you to a generic site: "https://www.psychologytoday.com/us/therapists/new-york/a".
My ultimate goal is to get a complete listing of all pages that do exist for New York therapists whose last names start with "A" (and then I will repeat this for other letters, etc.).
Previous post on this topic: There is a prior StackOverflow post on this topic (Check if URL exists in R), and I have implemented its solutions. However, each of those solutions falsely reports that my specific URLs of interest do not exist, even when they do exist!
My code: I have tried the below code to check if these URLs exist. Both code solutions are drawn from the prior post on this topic (linked above). However, both code solutions tell me that URLs that do exist on Psychology Today do not exist. I am not sure why this is!
Loading packages:
### Load packages and set user agent
pacman::p_load(dplyr, tidyr, stringr, tidyverse, RCurl, pingr)
# Set alternative user agent globally for whole session
options(HTTPUserAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
# Check user agent string again
options("HTTPUserAgent")
Keep only URLs that are "real": RCurl solution
url.exists("https://www.psychologytoday.com/us/therapists/new-york/a?page=3")
Result: This solution returns "FALSE", even though this page does exist!
Keep only directory page URLs that are "real": solution #1 from the StackOverflow post above
### Function for checking if URLs are "real"
# From StackOverflow: https://stackoverflow.com/questions/52911812/check-if-url-exists-in-r
#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#' HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
#' arises a warning message will be displayed. Default is `FALSE`.
#' @param ... other params (`timeout()` would be a good one) passed directly
#' to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE, ...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # you don't need these two functions if you're already using `purrr`,
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet)
          message("Error: ", e$message)
        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET  <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  # NB: `status_code %/% 200 == 1` treats any status in the 200-399 range as success
  if (is.null(res$result) ||
      ((httr::status_code(res$result) %/% 200) != 1)) {

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if (((httr::status_code(res$result) %/% 200) != 1)) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}
### Create URL list
some_urls <- c("https://www.psychologytoday.com/us/therapists/new-york/a?page=10", # Exists
"https://www.psychologytoday.com/us/therapists/new-york/a?page=4", # Exists
"https://www.psychologytoday.com/us/therapists/new-york/a?page=140", # Does not exist
"https://www.psychologytoday.com/us/therapists/new-york/a?page=3" # Exists
)
### Check if URLs exist
data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% tibble::as_tibble() %>% print()  # dplyr::tbl_df() is deprecated; as_tibble() is the current equivalent
Result: This solution returns "FALSE" for every URL, even though 3 out of 4 of them do exist!
Please let me know if you have any advice! I greatly appreciate any advice or suggestions you may have. Thank you!
Both solutions are based on libcurl. The default user agent of httr includes the versions of libcurl, r-curl, and httr.
You can check it with verbose mode:
> httr::HEAD(some_urls[1], httr::verbose())
-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: libcurl/7.68.0 r-curl/4.3.2 httr/1.4.3 <<<<<<<<< Here is the problem. I think the site disallows webscraping. You need to check the related robots.txt file(s).
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
->
<- HTTP/2 403
<- date: Wed, 27 Jul 2022 20:56:28 GMT
<- content-type: text/html; charset=iso-8859-1
<- server: Apache/2.4.53 (Amazon)
<-
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
Date: 2022-07-27 20:56
Status: 403
Content-Type: text/html; charset=iso-8859-1
<EMPTY BODY>
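(On the robots.txt remark above: you can pull the file down directly and read the rules yourself. Whether the host serves it to a plain R client is an assumption here.)
# Inspect the site's crawling rules; adjust the user agent if this request is also blocked
writeLines(readLines("https://www.psychologytoday.com/robots.txt"))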
You can set the user-agent header per function call. I do not know of a global-option way to do this in this case:
> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> httr::HEAD(some_urls[1], user_agent, httr::verbose())
-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
->
<- HTTP/2 200
<- date: Wed, 27 Jul 2022 21:01:07 GMT
<- content-type: text/html; charset=utf-8
<- server: Apache/2.4.54 (Amazon)
<- x-powered-by: PHP/7.0.33
<- content-language: en-US
<- x-frame-options: SAMEORIGIN
<- expires: Wed, 27 Jul 2022 22:01:07 GMT
<- cache-control: private, max-age=3600
<- last-modified: Wed, 27 Jul 2022 21:01:07 GMT
<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly
NOTE: bunch of set-cookie deleted here
<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly
<- via: 1.1 ZZ
<-
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
Date: 2022-07-27 21:01
Status: 200
Content-Type: text/html; charset=utf-8
<EMPTY BODY>
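If you want something session-wide for httr specifically, set_config() applies a config object (such as user_agent()) to every subsequent httr request, and with_config() does the same for a single block of code. A short sketch; I have not re-run it against this site:
# Apply the browser-like user agent to every httr call in this session
httr::set_config(httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"))
httr::status_code(httr::HEAD(some_urls[1]))  # should now return 200, as in the verbose run above
# httr::reset_config()  # revert when done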
NOTE: I did not investigate RCurl's url.exists. You need to somehow ensure that it uses the right user-agent string.
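My guess (untested) is that url.exists accepts libcurl options through .opts, and libcurl's CURLOPT_USERAGENT is exposed as useragent, so something like this may be enough:
# Untested sketch: pass a browser-like user agent as a libcurl option
RCurl::url.exists(
  "https://www.psychologytoday.com/us/therapists/new-york/a?page=3",
  .opts = list(useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
)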
In a nutshell, with no verbose:
> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> (httr::status_code(httr::HEAD(some_urls[1], user_agent)) %/% 200) == 1
[1] TRUE
>
I think you can write your own solution from here.
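For example, something along these lines could work (untested; page_really_exists is just a name I made up). It leans on the question's own observation that nonexistent pages redirect to the generic listing, which itself answers 200, so the final URL after redirects (the url element of the httr response) is compared against the requested one. If the site handles HEAD requests differently from GET, swap in httr::GET:
ua <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")

# TRUE if the page answers with a 2xx status AND was not redirected elsewhere,
# FALSE if it answers but fails either test, NA on hard errors (timeout, DNS, ...)
page_really_exists <- function(url, ua) {
  res <- tryCatch(httr::HEAD(url, ua, httr::timeout(10)), error = function(e) NULL)
  if (is.null(res)) return(NA)
  if (httr::status_code(res) %/% 100 != 2) return(FALSE)
  identical(res$url, url)   # the redirected "generic" page would have a different final URL
}

data.frame(
  some_urls,
  exists = vapply(some_urls, page_really_exists, logical(1), ua = ua, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)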