Tags: r, pdf, httr, httr2

PDF URL opens in a browser, but I can't get it with httr


When I open this URL in the browser:

https://processo.stj.jus.br/processo/dj/documento/?=&sequencial=300060606&num_registro=202500087810&data=20250313&data_pesquisa=20250313&componente=MON

It opens as a PDF. But when I request it with httr/httr2, I get an HTML page instead:

url1 <- "https://processo.stj.jus.br/processo/dj/documento/?=&sequencial=300060606&num_registro=202500087810&data=20250313&data_pesquisa=20250313&componente=MON"

response <- httr::GET(url1)

print(response):

Response [https://processo.stj.jus.br/processo/dj/documento/?=&sequencial=300060606&num_registro=202500087810&data=20250313&data_pesquisa=20250313&componente=MON]
  Date: 2025-04-01 07:51
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 203 kB
<!doctype html>
<html>
<head>
  <title></title>
  <style>
    html, body {
      margin: 0;
      padding: 0;
      background-color: white;
    }

Can someone help me figure out how to get the PDF?


Solution

  • If you open DevTools in your browser, the Network tab shows that the first response contains a JavaScript challenge that triggers the same request again, this time with additional headers and cookies. The PDF content is in that second response.

    There's a good chance that the challenge triggers something on the server side, so this may only be reproducible within a short time window. For now, though, it seems we can ignore the JavaScript, the cookies and most of the extra headers entirely; we only need to make sure the istl-infinite-loop header is set:

    library(httr2)
    
    url_ <- "https://processo.stj.jus.br/processo/dj/documento/?=&sequencial=300060606&num_registro=202500087810&data=20250313&data_pesquisa=20250313&componente=MON"
    
    resp <- 
      request(url_) |> 
      req_headers(`istl-infinite-loop` = "1") |> 
      req_perform()
    resp
    #> <httr2_response>
    #> GET
    #> https://processo.stj.jus.br/processo/dj/documento/?=&sequencial=300060606&num_registro=202500087810&data=20250313&data_pesquisa=20250313&componente=MON
    #> Status: 200 OK
    #> Content-Type: application/pdf
    #> Body: In memory (208932 bytes)
    
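    # save under the filename from the Content-Disposition header
    # (print() only shows the raw header value; `_[[1]][2]` needs R >= 4.3)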
    filename <- 
      resp_header(resp, "content-disposition") |> 
      print() |> 
      strsplit("=") |> 
      _[[1]][2]
    #> [1] "inline; filename=stj_dje_20250313_0_46045183.pdf"
    
    resp_body_raw(resp) |> writeBin(filename)
    
    # check that the saved file is a readable PDF
    pdftools::pdf_info(filename) |> str()
    #> List of 11
    #>  $ version    : chr "1.7"
    #>  $ pages      : int 8
    #>  $ encrypted  : logi FALSE
    #>  $ linearized : logi FALSE
    #>  $ keys       :List of 1
    #>   ..$ Producer: chr "iText® 7.1.2 ©2000-2018 iText Group NV (AGPL-version)"
    #>  $ created    : POSIXct[1:1], format: "2025-03-11 00:39:40"
    #>  $ modified   : POSIXct[1:1], format: "2025-04-01 12:04:04"
    #>  $ metadata   : chr ""
    #>  $ locked     : logi FALSE
    #>  $ attachments: logi FALSE
    #>  $ layout     : chr "no_layout"
    pdftools::pdf_text(filename)[1] |> 
      substr(1,350) |> 
      cat()
    #>                                          HABEAS CORPUS Nº 974679 - SP (2025/0008781-0)
    #> 
    #>                 RELATOR                        : MINISTRO REYNALDO SOARES DA FONSECA
    #>                 IMPETRANTE                     : GLAUCIO DALPONTE MATTIOLI
    #>                 ADVOGADO                       : GLAUCIO DALPONTE MATTIOLI - SP253642
    #> 
    

    Created on 2025-04-01 with reprex v2.1.1
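
Since the question used httr, here is a minimal sketch of the same idea with httr::GET(), assuming the server keeps accepting the istl-infinite-loop header as it did above. The helper names fetch_pdf_httr(), is_pdf_raw() and filename_from_disposition() are hypothetical and not part of httr. The magic-byte check is there so that the HTML challenge page is not silently saved as a .pdf file.

```r
# Hypothetical helpers (not part of httr or httr2).

# TRUE if a raw vector starts with the PDF magic bytes "%PDF-"
is_pdf_raw <- function(x) {
  is.raw(x) && length(x) >= 5 && identical(x[1:5], charToRaw("%PDF-"))
}

# Pull the filename out of a Content-Disposition header value such as
# "inline; filename=stj_dje_20250313_0_46045183.pdf";
# falls back to `default` when the header is missing or has no filename=
filename_from_disposition <- function(header, default = "download.pdf") {
  if (length(header) != 1 || is.na(header)) return(default)
  m <- regmatches(header, regexec("filename=\"?([^\";]+)\"?", header))[[1]]
  if (length(m) < 2) return(default)
  basename(trimws(m[2]))
}

# Request the URL with the extra header, verify the body is a PDF,
# write it to `dir` and return the path of the saved file
fetch_pdf_httr <- function(url, dir = tempdir()) {
  resp <- httr::GET(url, httr::add_headers(`istl-infinite-loop` = "1"))
  httr::stop_for_status(resp)
  body <- httr::content(resp, as = "raw")
  if (!is_pdf_raw(body)) {
    stop("Expected a PDF but got Content-Type: ", httr::http_type(resp))
  }
  disposition <- httr::headers(resp)[["content-disposition"]]
  path <- file.path(dir, filename_from_disposition(disposition))
  writeBin(body, path)
  path
}
```

Usage would be something like path <- fetch_pdf_httr(url1), followed by pdftools::pdf_info(path) to inspect the result. If the server changes its challenge, the function stops with an error instead of writing an HTML page to disk.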