r web-scraping xpath dplyr rvest

Using XPath in R to scrape data from a website with multiple similar paths


I'm trying to use R to scrape a list of apartments for sale, with the basic info for each (address, m2, price, rooms, etc.), from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000 (see also the screenshot of the page with the inspector open below)

Using SelectorGadget, I haven't been able to create a path that uniquely extracts the square meters of all 50 apartments on page 1, or another path that uniquely extracts the number of rooms, and so on.

I did manage to find a path that uniquely extracts the addresses (see the code block below), but the addresses sit in a separate block/class from the rest of the text.

Here is my current code:

library(rvest)
library(dplyr)

link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = #MISSING - CAN'T FIGURE OUT
sqm = #MISSING - CAN'T FIGURE OUT
rooms = #MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)

Any ideas on how to approach it? I tried using xpath as well to extract the sqm, but only managed to get one specific text field extracted, not all 50 on the page.

Alternative approaches are welcome too. Thanks in advance!


Solution

  • The site loads its listings from an API (visible in the Network tab of the browser's developer tools). You can call it directly and retrieve the information like this:

    library(tidyverse)
    library(httr2)
    
    "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
      request() %>%
      req_perform() %>%
      resp_body_json(simplifyVector = TRUE) %>%
      pluck("cases") %>%
      unnest(address, names_sep = "_") %>%
      mutate(
        address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
        .before = 1
      ) %>%
      select(address,
             price = priceCash,
             sqm = housingArea,
             rooms = numberOfRooms)
    
    # A tibble: 100 × 4
       address                       price   sqm rooms
       <chr>                         <int> <int> <int>
     1 Holsteinsgade 66 2100       3135000    56     2
     2 Tuborgvej 60 2900           4875000   114     4
     3 Poppellunden 8 4000         3350000    92     3
     4 Hyldegårds Tværvej 5 2920   6498000   115     3
     5 Grollowstræde 3 3000        3495000    92     3
     6 Rasmus Rasks Vej 8 2500     3995000    80     3
     7 Ryesgade 7 8000             4598000   110     4
     8 Carl Th. Zahles Gade 8 2300 5795000   113     3
     9 Strandlodsvej 23E 2300      5495000   101     3
    10 Nordre Fasanvej 162 2000    4695000    90     4
    # … with 90 more rows
    # ℹ Use `print(n = ...)` to see more rows
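
To make the flattening step concrete, here is a minimal offline sketch. The `cases` tibble is a hand-built stand-in for the parsed API response: it holds only the fields used above, with values taken from the first two rows of the printed output, and the column types (for example `zipCode` as character) are assumptions. The real response has many more columns.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hand-built stand-in for the parsed "cases" data frame (assumption: the real
# one has the same nested `address` column, plus many more fields).
cases <- tibble(
  address = tibble(
    roadName    = c("Holsteinsgade", "Tuborgvej"),
    houseNumber = c("66", "60"),
    zipCode     = c("2100", "2900")
  ),
  priceCash     = c(3135000L, 4875000L),
  housingArea   = c(56L, 114L),
  numberOfRooms = c(2L, 4L)
)

# unnest() replaces the nested `address` column with one column per field,
# each prefixed with "address_" because of names_sep = "_".
flat <- cases %>%
  unnest(address, names_sep = "_")

names(flat)

# Paste the address parts into one string and keep the columns of interest.
forsale <- flat %>%
  mutate(
    address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
    .before = 1
  ) %>%
  select(address, price = priceCash, sqm = housingArea, rooms = numberOfRooms)

forsale
```

Note that `str_c()` returns `NA` whenever any of its parts is `NA`, so a listing with a missing house number or zip code would end up with an `NA` address.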
    

    To see which variables are available for extraction:

    "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
      request() %>%
      req_perform() %>%
      resp_body_json(simplifyVector = TRUE) %>%
      pluck("cases") %>%
      glimpse()
    
    Rows: 100
    Columns: 37
    $ `_links`               <df[,1]> <data.frame[30 x 1]>
    $ address                <df[,28]> <data.frame[30 x 28]>
    $ addressType            <chr> "condo", "condo", "condo", "condo", "condo", "condo", "c…
    $ caseID                 <chr> "89194273-5948-4734-8085-fec9d42ac3c2", "ff6a9ff5-eacf-…
    $ caseUrl                <chr> "https://www.lokalbolig.dk/?sag=26-X0001820", "https://www.…
    $ coordinates            <df[,3]> <data.frame[30 x 3]>
    $ daysOnMarket           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
    $ defaultImage           <df[,1]> <data.frame[30 x 1]>
    $ descriptionBody        <chr> "Lys stuelejlighed med to terrasser i HørsholmNær centrum o…
    $ descriptionTitle       <chr> "Lys stuelejlighed med to terrasser i Hørsholm", "Fantas…
    $ distinction            <chr> "real_estate", "real_estate", "real_estate", "real_estate",…
    $ energyLabel            <chr> "c", "c", "d", "c", "d", "c", "c", "c", "c", "c", "c", "…
    $ highlighted            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
    $ housingArea            <int> 98, 82, 64, 91, 81, 97, 78, 113, 81, 91, 133, 69, 80, 64, 1…
    $ images                 <list> [<data.frame[5 x 1]>], [<data.frame[3 x 1]>], [<data.frame[…
    $ monthlyExpense         <int> 4183, 3888, 2798, 3205, 3557, 3405, 3233, 2688, 3921, 3907,…
    $ nextOpenHouse          <df[,4]> <data.frame[30 x 4]>
    $ numberOfFloors         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
    $ numberOfRooms          <int> 3, 3, 2, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 2, 4, 2, 3, 4, 2, 4,…
    $ pageViews              <int> 126, 341, 191, 160, 358, 356, 242, 516, 133, 180, 134, 106…
    $ perAreaPrice           <int> 40765, 54817, 62422, 71374, 43148, 58711, 60897, 41150, 480…
    $ priceCash              <int> 3995000, 4495000, 3995000, 6495000, 3495000, 5695000, 47…
    $ priceChangePercentage  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
    $ providerCaseID         <chr> "26-X000182018025lok", "114-2102", "43000000643cam", "13433…
    $ realEstate             <df[,3]> <data.frame[30 x 3]>
    $ realtor                <df[,21]> <data.frame[30 x 21]>
    $ slug                   <chr> "oerbaekgaards-alle-901-0-tv-2970-hoersholm-02239600_901_st…
    $ status                 <chr> "open", "open", "open", "open", "open", "open", "open", "op…
    $ timeOnMarket           <df[,2]> <data.frame[30 x 2]>
    $ totalClickCount        <int> 103, 274, 109, 121, 227, 273, 205, 415, 82, 128, 122, 92, 1…
    $ totalFavourites        <int> 1, 3, 0, 0, 4, 1, 1, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2,…
    $ utilitiesConnectionFee <df[,1]> <data.frame[30 x 1]>
    $ yearBuilt              <int> 2002, 1886, 1907, 2008, 1932, 1914, 1900, 1926, 1934, 1932,…
    $ basementArea           <int> NA, NA, NA, NA, NA, NA, NA, NA, 88, NA, NA, NA, NA, NA, …
    $ lotArea                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5327, NA, N…
    $ weightedArea           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
    $ secondaryAddressType   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
    

    To save the data into your environment:

    df <- "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
      request() %>%
      req_perform() %>%
      resp_body_json(simplifyVector = TRUE) %>%
      pluck("cases")
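
The query above returns a single page of at most 100 listings. Below is a minimal sketch of collecting several pages, assuming the `page` parameter in the URL steps through the results as its name suggests. `build_request()` and `fetch_pages()` are hypothetical helper names introduced here for illustration; they are not part of httr2 or of the API.

```r
library(httr2)
library(purrr)
library(dplyr)

# Endpoint and query parameters are copied from the URL used above.
base_url <- "https://api.prod.bs-aws-stage.com/search/cases"

# Hypothetical helper: build (but do not send) the request for one page.
# Values are passed as strings so they appear in the URL exactly as written.
build_request <- function(page, per_page = 100) {
  request(base_url) %>%
    req_url_query(
      addressTypes  = "condo",
      priceMax      = "7000000",
      priceMin      = "3000000",
      per_page      = as.character(as.integer(per_page)),
      page          = as.character(as.integer(page)),
      sortAscending = "true",
      sortBy        = "timeOnMarket"
    )
}

# Hypothetical helper: fetch the given pages and stack their "cases" tables.
fetch_pages <- function(pages) {
  pages %>%
    map(function(p) {
      build_request(p) %>%
        req_perform() %>%
        resp_body_json(simplifyVector = TRUE) %>%
        pluck("cases")
    }) %>%
    bind_rows()
}

# Inspect the URL that would be requested for page 2 (no network call).
build_request(2)$url
```

Calling `df_all <- fetch_pages(1:3)` would then perform three requests and combine the results into one data frame. Whether every page has the same set of columns is not known from the output above; `bind_rows()` fills columns missing on a page with `NA`.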