
Web scraping on tipti page that requires login


I'm trying to extract the product names and prices for the AKI supermarket in Ecuador. A site called Tipti aggregates products from several supermarkets. However, it requires a login, and the page appears to be rendered dynamically. The first step is the login screen.

After logging in, the available supermarkets are listed, and I choose AKI.

Finally, I try to extract the names and prices of the products:

library(rvest)
library(hayalbaz)

url <- "https://app.tipti.market/Gran%20Aki/Productos%20Ak%C3%AD"
webpage <- puppet$new(url = url)
webpage

<puppet>
  Public:
    attach_file: function (selector, file) 
    click: function (selector, set_focus = TRUE, scroll = TRUE, wait_for_selector = TRUE) 
    clone: function (deep = FALSE) 
    close: function () 
    content: function () 
    download_enable: function (path, report = TRUE, progress = report) 
    focus: function (selector) 
    get_cookies: function () 
    get_element: function (selector, as_xml2 = TRUE) 
    get_elements: function (selector, as_xml2 = TRUE) 
    get_js_object: function (name) 
    get_source: function () 
    goto: function (url) 
    initialize: function (url = NULL, cookies = NULL) 
    screenshot: function (filename = "screenshot.png", selector = "html", cliprect = NULL, 
    set_cookies: function (cookies) 
    set_debug_msgs: function (flag) 
    set_user_agent: function (user_agent) 
    set_value: function (selector, value) 
    type: function (selector = NULL, text) 
    view: function () 
    wait_for_selector: function (selector, timeout = 30, polling = 0.1) 
    wait_on_load: function () 
  Private:
    download_path: NULL
    download_pb: NULL
    get_all_nodes: function (selector) 
    get_document: function () 
    get_node: function (selector, all = FALSE) 
    get_node_box: function (node_id) 
    get_node_center: function (node_id) 
    get_node_html: function (node_id) 
    key_down: function (key) 
    key_press: function (key) 
    key_up: function (key) 
    mouse_down: function (x, y, button = "left", click_count = 1) 
    mouse_up: function (x, y, button = "left", click_count = 1) 
    press: function (key) 
    session: ChromoteSession, R6
    watch_download: function (start = TRUE, report = TRUE, progress = report) 

webpage$get_elements(".card-product__name") |> html_text(trim = TRUE)
character(0)

Any idea how to extract the information?


Solution

  • You can inspect the fetch requests the page makes with the browser's developer tools (F12, Network tab). Reusing the headers from one of those requests, including the Authorization token, we can fetch the underlying JSON directly with GET for each category. For example, category_id "1655" corresponds to the 99-cent products. Click the categories in the left-hand menu of the page and watch the fetch requests to map each category_id to its name ("Wonder Woman", "Abarrotes", etc.).
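Before reusing an Authorization value copied from the developer tools, it can help to check which account it belongs to and when it expires. The following is a minimal sketch, assuming the value is a standard three-part JWT; it decodes the payload of the token used in the code in this answer without verifying the signature. The variable names `token`, `parts`, and `payload` are illustrative.

```r
library(jsonlite)

# JWT copied from the Authorization header (without the leading "JWT " prefix)
token <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjo0OCwiZW1haWwiOiJpbnZpdGFkb0B0aXB0aS5tYXJrZXQiLCJ0eXBlIjoxLCJ1c2VybmFtZSI6Imludml0YWRvQHRpcHRpLm1hcmtldCIsImV4cCI6ODgwODkwMDQzMjR9.GZJXL3HTvI6GNDsPTyvpimmAkWn2ZELeSrJGnBKbP-o"

# A JWT consists of three dot-separated parts: header, payload, signature
parts <- strsplit(token, ".", fixed = TRUE)[[1]]

# The payload is base64url-encoded JSON; this only decodes it,
# it does not check that the signature is valid
payload <- fromJSON(rawToChar(base64url_dec(parts[2])))

# Account the token was issued for
print(payload$email)

# "exp" is a Unix timestamp (seconds since 1970-01-01 UTC)
print(as.POSIXct(payload$exp, origin = "1970-01-01", tz = "UTC"))
```

If the printed e-mail is not your own account, the site is likely issuing this token to visitors who have not logged in, which would mean no personal credentials are needed for the product endpoints. That is an inference from the payload, not something the API documents.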

    Code

    library(httr)

    # Request URL for category_id=1655, asking for up to 250 products per page
    url <- "https://api.tipti.market/misuper/v3/product/recommendations/category_v3/?page=1&retailer_id=276&category_id=1655&limit=250&page_size=250"

    # Headers copied from the fetch request shown in the browser's developer tools
    headers <- add_headers("Accept" = "*/*",
                           "Accept-Encoding" = "gzip, deflate, br, zstd",
                           "Accept-Language" = "en-US;q=0.8,en;q=0.7",
                           "Referer" = "https://www.tipti.market/",
                           "Origin" = "https://www.tipti.market",
                           "Authorization" = "JWT eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjo0OCwiZW1haWwiOiJpbnZpdGFkb0B0aXB0aS5tYXJrZXQiLCJ0eXBlIjoxLCJ1c2VybmFtZSI6Imludml0YWRvQHRpcHRpLm1hcmtldCIsImV4cCI6ODgwODkwMDQzMjR9.GZJXL3HTvI6GNDsPTyvpimmAkWn2ZELeSrJGnBKbP-o",
                           "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36")

    response <- GET(url, headers)
    # Check the status code (should be 200)
    print(response$status_code)
    # Parse the JSON body; flatten = TRUE turns nested fields into columns
    # such as item.price and item.product.name
    res <- jsonlite::fromJSON(rawToChar(response$content), flatten = TRUE)
    ninetyNine_cent_products <- res[["results"]]
    

    giving

    > kableExtra::kable(head(ninetyNine_cent_products[,c("item.price", "item.product.name")])) 
    
    item.price  item.product.name
    0.99        Tomate Cherry Funda La original 0,99 Ctvs.
    0.99        Limón Meyer Funda Frutos De Mi Tierra 0,99 Ctvs.
    0.99        Naranja Malla Divino Niño 0,99 Ctvs.
    0.99        Limón Malla La Original 0,99 Ctvs.
    0.99        Ajo Pelado Akí 0,99 Ctvs.
    0.99        Tomate Cherry La original 0,99 Ctvs.
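
To repeat the request for several categories, the URL can be built from its query parameters instead of being edited by hand. The following is a rough sketch rather than a tested client: `category_url()` and `fetch_category()` are hypothetical helper names, the query parameters are copied from the URL above, and the `headers` argument is expected to be the `add_headers(...)` object from the code above. Only category_id 1655 is confirmed here; other IDs have to be read from the fetch requests in the developer tools.

```r
library(httr)
library(jsonlite)

base_url <- "https://api.tipti.market/misuper/v3/product/recommendations/category_v3/"

# Build the request URL for one category (hypothetical helper).
# Parameter names and defaults mirror the URL used in the answer.
category_url <- function(category_id, retailer_id = 276, page = 1, page_size = 250) {
  modify_url(base_url, query = list(
    page = page,
    retailer_id = retailer_id,
    category_id = category_id,
    limit = page_size,
    page_size = page_size
  ))
}

# Fetch one category and keep only the price and name columns (hypothetical helper).
# Assumes the response has a "results" element shaped like the one in the answer.
fetch_category <- function(category_id, headers) {
  response <- GET(category_url(category_id), headers)
  stop_for_status(response)  # raise an R error on any HTTP error status
  res <- fromJSON(content(response, as = "text", encoding = "UTF-8"), flatten = TRUE)
  products <- res[["results"]]
  if (length(products) == 0) return(NULL)  # empty category
  products$category_id <- category_id
  products[, c("category_id", "item.price", "item.product.name")]
}

# Usage (not run here, needs network access and the headers object from above):
# category_ids <- c(1655)  # add further IDs observed in the developer tools
# all_products <- do.call(rbind, lapply(category_ids, fetch_category, headers = headers))
```

Passing the parameters as a list keeps `category_id` in one place and lets httr handle URL encoding. If a category holds more products than `page_size`, the `page` parameter would presumably need to be incremented as well; that behavior has not been checked against the API.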