html r extract rvest serpapi

How to extract all "query" values from a SerpApi auto-generated HTML file


How can I extract all the "query" values (the coffee-related keywords) from this link: https://serpapi.com/search.html?engine=google_trends&q=coffee&data_type=RELATED_QUERIES&cat=0&date=now+7-d&api_key=317da75462cab4790705a5cf8b6a9c74c9ba9f279150afb87d4b191f95d8d5de

into a one-column data frame in R? My attempt with rvest returned only NA.

Regards

library(rvest)

allcom <- read_html("https://serpapi.com/search.html?engine=google_trends&q=coffee&data_type=RELATED_QUERIES&cat=0&date=now+7-d&api_key=317da75462cab4790705a5cf8b6a9c74c9ba9f279150afb87d4b191f95d8d5de")

allcom %>% html_attr("query")

[1] NA


Solution

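  • html_attr("query") looks up an HTML attribute named "query" on the document node, and no such attribute exists, so it returns NA. Request JSON instead of HTML, then use the httr2 and jsonlite packages to convert the response to a data frame: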

    Replace your URL with "https://serpapi.com/search.json?engine=google_trends&q=coffee&data_type=RELATED_QUERIES&cat=0&date=now+7-d&api_key=317da75462cab4790705a5cf8b6a9c74c9ba9f279150afb87d4b191f95d8d5de" to get a JSON response.

    library(httr2)
    library(jsonlite)
    
    req <- request("https://serpapi.com/search.json?engine=google_trends&q=coffee&data_type=RELATED_QUERIES&cat=0&date=now+7-d&api_key=317da75462cab4790705a5cf8b6a9c74c9ba9f279150afb87d4b191f95d8d5de")
    resp <- req_perform(req)
    resp2 <- resp |> resp_body_string() |> fromJSON()
    
    # if you need all values, bind the "rising" and "top" values
    df <- rbind(resp2$related_queries$rising, resp2$related_queries$top)
    
    # show only the "query" column
    df[1]
    
    > head(df[1])
    #                           query
    #1     the grounds coffee factory
    #2 why can't mormons drink coffee
    #3           ninja coffee machine
    #4            national coffee day
    #5                     the coffee
    #6                    coffee shop
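
To try the reshaping step without calling the API, here is a minimal offline sketch. The sample JSON is hypothetical: it only mimics the related_queries$rising and related_queries$top paths used above, and its "query" and "value" entries are placeholders, not real SerpApi output.

```r
library(jsonlite)

# Hypothetical sample that mimics only the fields used above;
# the values are placeholders, not real API output.
sample_json <- '{
  "related_queries": {
    "rising": [
      {"query": "rising example 1", "value": "+100%"},
      {"query": "rising example 2", "value": "+50%"}
    ],
    "top": [
      {"query": "top example 1", "value": "100"}
    ]
  }
}'

# By default fromJSON() simplifies each array of records into a data frame
parsed <- fromJSON(sample_json)

# Stack the "rising" and "top" data frames (they share the same columns)
all_queries <- rbind(parsed$related_queries$rising, parsed$related_queries$top)

# drop = FALSE keeps a one-column data frame instead of a character vector
query_df <- all_queries[, "query", drop = FALSE]
print(query_df)
```

Selecting with `[, "query", drop = FALSE]` picks the column by name rather than by position, so the result should stay correct even if the column order in the response differs.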