
POST request with httr2 package


It's related to another post

Rather than rvest, I am trying to use the httr2 package to request data from this endpoint: https://gnomad.broadinstitute.org/api/. Unfortunately, I am not very familiar with httr2. Can someone help?

This is what I have tried so far without luck.

library(tidyverse)
library(httr2)

request("https://gnomad.broadinstitute.org/api/") %>%
  req_body_form(
    'chrom' = '1',
    'datasetId' = 'gnomad_r3', 
    'referenceGenome' = 'GRCh38',
    'start' = 55516868, 
    'stop' = 55516908
  ) %>%  
  req_perform()

Error in `resp_abort()`:
! HTTP 400 Bad Request.
Run `rlang::last_error()` to see where the error occurred.

Solution

  • The API requires a lot more information than you're providing. It looks like you've copied the parameters sent to the API from Chrome's inspect tools, but only the "variables" portion. When you use the web interface, the JavaScript on the page builds a much larger GraphQL query from those variables, and that full query is what the API actually receives. You can see this by going to the URL here, which shows the full request sent to the API (credit to GraphiQL for making such a helpful API interface!).

    Basically, we need to build this entire query string in R and send the whole thing to the API, rather than just the list of variables. The snippet below does exactly that: it defines a "querymaker" function that takes our variables and returns the full query string, which we then pass to the httr2 functions.

    querymaker <- function(start, stop, chrom, ref_genome, dataset_id){
      paste0('{\n  region(start: ', start, ', stop: ', stop, ', chrom: "', chrom, '", reference_genome: ', ref_genome, ') {\n    clinvar_variants {\n      clinical_significance\n      clinvar_variation_id\n      gnomad {\n        exome {\n          ac\n          an\n          filters\n        }\n        genome {\n          ac\n          an\n          filters\n        }\n      }\n      gold_stars\n      hgvsc\n      hgvsp\n      in_gnomad\n      major_consequence\n      pos\n      review_status\n      transcript_id\n      variant_id\n    }\n    variants(dataset: ', dataset_id, ') {\n      consequence\n      flags\n      gene_id\n      gene_symbol\n      hgvs\n      hgvsc\n      hgvsp\n      lof\n      lof_filter\n      lof_flags\n      pos\n      rsids\n      transcript_id\n      transcript_version\n      variant_id\n      exome {\n        ac\n        ac_hemi\n        ac_hom\n        an\n        af\n        filters\n        populations {\n          id\n          ac\n          an\n          ac_hemi\n          ac_hom\n        }\n      }\n      genome {\n        ac\n        ac_hemi\n        ac_hom\n        an\n        af\n        filters\n        populations {\n          id\n          ac\n          an\n          ac_hemi\n          ac_hom\n        }\n      }\n      lof_curation {\n        verdict\n        flags\n      }\n    }\n  }\n}')
    }
    given_query <- querymaker(start = "55516868", stop = "55516908", chrom = "1", 
                    ref_genome = "GRCh38", dataset_id = "gnomad_r3")
    

    Note the very long string with lots of newlines: that is the GraphQL query, with our variables pasted into it. Now we can send it to the API as the query field of a JSON body and get a response in JSON format:

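    library(httr2)
    # Adding a body switches the request method to POST.
    # "variables" is not used here because the values are already pasted into the query.
    jsondata <- request("https://gnomad.broadinstitute.org/api/") %>%
      req_body_json(list(query = given_query, variables = "null")) %>%
      req_perform() %>%
      resp_body_json()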
    

    Finally, we can extract just the variant IDs from the parsed response with a quick sapply call:

    sapply(jsondata$data$region$variants, function(x)x$variant_id)
    
    "1-55516880-T-C"  "1-55516902-T-G"  "1-55516903-G-GC" "1-55516905-C-CT"
    

    (assuming that's still what you're hoping to extract).
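
    If you would rather have a table than a character vector, the same list can be flattened into a data frame. This is a minimal sketch in base R, assuming the variant IDs follow the chrom-pos-ref-alt pattern seen in the output above. Here mock_response is a hypothetical stand-in for jsondata so the snippet runs without calling the API, and variants_to_df is an illustrative helper name, not part of httr2.

    ```r
    # Hypothetical stand-in for `jsondata`, holding the four IDs returned above.
    mock_response <- list(data = list(region = list(variants = list(
      list(variant_id = "1-55516880-T-C"),
      list(variant_id = "1-55516902-T-G"),
      list(variant_id = "1-55516903-G-GC"),
      list(variant_id = "1-55516905-C-CT")
    ))))

    # Illustrative helper: pull out the IDs and split them into columns.
    # Assumes every ID has exactly four "-"-separated fields.
    variants_to_df <- function(resp) {
      ids <- vapply(resp$data$region$variants,
                    function(x) x$variant_id, character(1))
      parts <- strsplit(ids, "-", fixed = TRUE)
      data.frame(
        variant_id = ids,
        chrom = vapply(parts, `[`, character(1), 1),
        pos   = as.integer(vapply(parts, `[`, character(1), 2)),
        ref   = vapply(parts, `[`, character(1), 3),
        alt   = vapply(parts, `[`, character(1), 4),
        stringsAsFactors = FALSE
      )
    }

    variants_df <- variants_to_df(mock_response)
    print(variants_df)
    ```

    Unlike sapply, vapply fails loudly if an entry is missing its variant_id, which is usually what you want when parsing an API response.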

    EDIT:

    Note that if you're only interested in the variant IDs you can shorten the query significantly (and reduce load on the server!) with the following function instead:

    querymaker <- function(start, stop, chrom, ref_genome, dataset_id){
      paste0('{\n  region(start: ', start, ', stop: ', stop, ', chrom: "', chrom, '", reference_genome: ', ref_genome, '){variants(dataset: ', dataset_id, ') {\n      variant_id\n    }\n  }\n}')
    }
    

    Everything else runs as before.
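
    As an alternative to pasting values into the query text, GraphQL lets you declare variables in the query and send their values separately in the variables field of the JSON body, which avoids quoting mistakes. The sketch below is untested against the live API. The argument types (Int, String, ReferenceGenomeId, DatasetId) are assumptions about the gnomAD schema and should be checked in the GraphiQL documentation pane, and region_query_body is a hypothetical helper name.

    ```r
    # Hypothetical helper: returns the list that becomes the JSON request body.
    # The GraphQL type names below are assumptions about the gnomAD schema.
    region_query_body <- function(start, stop, chrom, ref_genome, dataset_id) {
      query <- paste(
        "query Region($start: Int!, $stop: Int!, $chrom: String!,",
        "             $ref_genome: ReferenceGenomeId!, $dataset_id: DatasetId!) {",
        "  region(start: $start, stop: $stop, chrom: $chrom,",
        "         reference_genome: $ref_genome) {",
        "    variants(dataset: $dataset_id) {",
        "      variant_id",
        "    }",
        "  }",
        "}",
        sep = "\n"
      )
      list(
        query = query,
        variables = list(
          start = as.integer(start),      # sent as JSON numbers
          stop = as.integer(stop),
          chrom = as.character(chrom),    # sent as a JSON string
          ref_genome = ref_genome,        # enum values travel as JSON strings
          dataset_id = dataset_id
        )
      )
    }

    body <- region_query_body(55516868, 55516908, "1", "GRCh38", "gnomad_r3")
    str(body$variables)

    # To send it (needs httr2 and network access):
    # library(httr2)
    # jsondata <- request("https://gnomad.broadinstitute.org/api/") %>%
    #   req_body_json(body) %>%
    #   req_perform() %>%
    #   resp_body_json()
    ```

    With this layout the query text stays fixed and only the variables list changes between calls, so looping over several regions does not require rebuilding the string each time.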