
readLines() cannot open the connection


I am working in RStudio (RStudio 2023.03.0+386 "Cherry Blossom" Release) and trying to read a page with readLines() from an http address that I know is correct.

The code is as follows:

con <- url("http://biostat.jhsph.edu/~jleek/contact.html")
htmlCode <- readLines(con)
close(con)

And the error I get is:

Error in readLines(con) : 
    cannot open the connection to 'https://biostat.jhsph.edu/~jleek/contact.html'
In addition: Warning message:
  In readLines(con) :
    URL 'https://biostat.jhsph.edu/~jleek/contact.html': status was 'SSL connect error'

Following is the sessionInfo() output:

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RMySQL_0.10.25 DBI_1.1.3      sqldf_0.4-11   RSQLite_2.3.1  gsubfn_0.7     proto_1.0.0    httpuv_1.6.9
[8] httr_1.4.5     readr_2.1.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10      rstudioapi_0.14  magrittr_2.0.3   hms_1.1.3        bit_4.0.5        R6_2.5.1
 [7] rlang_1.1.0      fastmap_1.1.1    fansi_1.0.4      blob_1.2.4       tcltk_4.2.3      tools_4.2.3
[13] utf8_1.2.3       cli_3.6.0        bit64_4.0.5      tibble_3.2.0     lifecycle_1.0.3  tzdb_0.3.0
[19] later_1.3.0      vctrs_0.6.0      promises_1.2.0.1 cachem_1.0.7     memoise_2.0.1    glue_1.6.2
[25] compiler_4.2.3   pillar_1.9.0     chron_2.3-60     pkgconfig_2.0.3

Solution

  • Your code works fine for me, but I'm running Linux, so I can't reproduce the Windows-specific SSL error. Perhaps OpenSSL is missing on your system and needs to be installed.

    You could try a different method argument in url():

    con <- url("https://biostat.jhsph.edu/~jleek/contact.html", method='libcurl')
    htmlCode <- readLines(con)
    close(con)
    head(htmlCode, 5)
    # [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
    # [2] ""                                                                                                                 
    # [3] "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"                                        
    # [4] ""                                                                                                                 
    # [5] "<head>"    
    

    or call readLines() directly on the URL, without url():

    htmlCode <- readLines('https://biostat.jhsph.edu/~jleek/contact.html')
    head(htmlCode, 1)
    # [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
    

    or, as a workaround, download the file first and then read it (note that download.file also has a method argument):

    tmp <- tempfile()
    download.file('https://biostat.jhsph.edu/~jleek/contact.html', tmp)  
    htmlCode <- readLines(tmp)
    unlink(tmp)
    head(htmlCode, 1)
    # [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
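
    The base-R options above can also be chained, so that if one fails with an error the next one is tried. This is only a minimal sketch: try_readers and read_url_lines are hypothetical helper names, not base R functions, and I have not tested it on Windows.

    ```r
    # Hypothetical helper: call each reader in order and return the
    # first result that does not raise an error.
    try_readers <- function(readers) {
      for (name in names(readers)) {
        result <- tryCatch(
          readers[[name]](),
          error = function(e) {
            message(name, " failed: ", conditionMessage(e))
            NULL
          }
        )
        if (!is.null(result)) return(result)
      }
      stop("all readers failed")
    }

    # Hypothetical wrapper around the two base-R approaches shown above.
    read_url_lines <- function(u) {
      try_readers(list(
        libcurl = function() {
          con <- url(u, method = "libcurl")
          on.exit(close(con))
          readLines(con)
        },
        download = function() {
          tmp <- tempfile()
          on.exit(unlink(tmp))
          download.file(u, tmp, quiet = TRUE)
          readLines(tmp)
        }
      ))
    }

    # Example call (needs network access):
    # htmlCode <- read_url_lines("https://biostat.jhsph.edu/~jleek/contact.html")
    ```

    The message printed for each failed reader shows which method hit the SSL error, which may help narrow down the cause.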
    

    Or use a package, e.g. XML together with RCurl:

    XML::htmlTreeParse(RCurl::getURL('https://biostat.jhsph.edu/~jleek/contact.html'))$children$html
    # <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    #   <head>
    #   <meta name="Description" content="Welcome to Jeff Leek&apos;s Research Group"/>
    # ...
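
    Since httr is already attached in the sessionInfo() output in the question, it might be worth a try as well. A rough sketch, assuming httr is installed; split_lines and read_lines_httr are hypothetical names I made up here. httr goes through the curl package rather than R's built-in connections, so it may or may not hit the same SSL error.

    ```r
    # Hypothetical helper: split a text body into lines, accepting both
    # Unix (\n) and Windows (\r\n) line endings.
    split_lines <- function(txt) {
      strsplit(txt, "\r?\n")[[1]]
    }

    # Hypothetical wrapper: fetch a page with httr and return its lines.
    read_lines_httr <- function(u) {
      resp <- httr::GET(u)
      httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
      split_lines(httr::content(resp, as = "text", encoding = "UTF-8"))
    }

    # Example call (needs network access):
    # htmlCode <- read_lines_httr("https://biostat.jhsph.edu/~jleek/contact.html")
    ```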
    

    Hope this helps.