I'm trying to scrape, in R, a list of apartments for sale along with their basic info (address, m2, price, rooms, etc.) from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000 (see also the screenshot of the page and the inspector below).
Using SelectorGadget, I haven't been able to create a path that uniquely extracts the square meters of all 50 apartments on page 1, or another path that uniquely extracts the number of rooms, etc.
I did manage to find a path that uniquely extracts the addresses (see the code block below), but they sit in a separate block/class from the rest of the text.
Here is my current code:
library(rvest)
library(dplyr)
link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = #MISSING - CAN'T FIGURE OUT
sqm = #MISSING - CAN'T FIGURE OUT
rooms = #MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)
Any ideas on how to approach this? I also tried using XPath to extract the sqm, but I only managed to extract one specific text field, not all 50 on the page.
Alternative approaches are welcome too. Thanks in advance!
The site loads its listings from a JSON API (visible in the Network tab of the browser's developer tools). You can call it directly and retrieve the information like this:
library(tidyverse)
library(httr2)
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  unnest(address, names_sep = "_") %>%
  mutate(
    address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
    .before = 1
  ) %>%
  select(
    address,
    price = priceCash,
    sqm = housingArea,
    rooms = numberOfRooms
  )
# A tibble: 100 × 4
   address                       price   sqm rooms
   <chr>                         <int> <int> <int>
 1 Holsteinsgade 66 2100       3135000    56     2
 2 Tuborgvej 60 2900           4875000   114     4
 3 Poppellunden 8 4000         3350000    92     3
 4 Hyldegårds Tværvej 5 2920   6498000   115     3
 5 Grollowstræde 3 3000        3495000    92     3
 6 Rasmus Rasks Vej 8 2500     3995000    80     3
 7 Ryesgade 7 8000             4598000   110     4
 8 Carl Th. Zahles Gade 8 2300 5795000   113     3
 9 Strandlodsvej 23E 2300      5495000   101     3
10 Nordre Fasanvej 162 2000    4695000    90     4
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
To see which variables are available for extraction:
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  glimpse()
Rows: 100
Columns: 37
$ `_links` <df[,1]> <data.frame[30 x 1]>
$ address <df[,28]> <data.frame[30 x 28]>
$ addressType <chr> "condo", "condo", "condo", "condo", "condo", "condo", "c…
$ caseID <chr> "89194273-5948-4734-8085-fec9d42ac3c2", "ff6a9ff5-eacf-…
$ caseUrl <chr> "https://www.lokalbolig.dk/?sag=26-X0001820", "https://www.…
$ coordinates <df[,3]> <data.frame[30 x 3]>
$ daysOnMarket <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ defaultImage <df[,1]> <data.frame[30 x 1]>
$ descriptionBody <chr> "Lys stuelejlighed med to terrasser i HørsholmNær centrum o…
$ descriptionTitle <chr> "Lys stuelejlighed med to terrasser i Hørsholm", "Fantas…
$ distinction <chr> "real_estate", "real_estate", "real_estate", "real_estate",…
$ energyLabel <chr> "c", "c", "d", "c", "d", "c", "c", "c", "c", "c", "c", "…
$ highlighted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ housingArea <int> 98, 82, 64, 91, 81, 97, 78, 113, 81, 91, 133, 69, 80, 64, 1…
$ images <list> [<data.frame[5 x 1]>], [<data.frame[3 x 1]>], [<data.frame[…
$ monthlyExpense <int> 4183, 3888, 2798, 3205, 3557, 3405, 3233, 2688, 3921, 3907,…
$ nextOpenHouse <df[,4]> <data.frame[30 x 4]>
$ numberOfFloors <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1,…
$ numberOfRooms <int> 3, 3, 2, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 2, 4, 2, 3, 4, 2, 4,…
$ pageViews <int> 126, 341, 191, 160, 358, 356, 242, 516, 133, 180, 134, 106…
$ perAreaPrice <int> 40765, 54817, 62422, 71374, 43148, 58711, 60897, 41150, 480…
$ priceCash <int> 3995000, 4495000, 3995000, 6495000, 3495000, 5695000, 47…
$ priceChangePercentage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ providerCaseID <chr> "26-X000182018025lok", "114-2102", "43000000643cam", "13433…
$ realEstate <df[,3]> <data.frame[30 x 3]>
$ realtor <df[,21]> <data.frame[30 x 21]>
$ slug <chr> "oerbaekgaards-alle-901-0-tv-2970-hoersholm-02239600_901_st…
$ status <chr> "open", "open", "open", "open", "open", "open", "open", "op…
$ timeOnMarket <df[,2]> <data.frame[30 x 2]>
$ totalClickCount <int> 103, 274, 109, 121, 227, 273, 205, 415, 82, 128, 122, 92, 1…
$ totalFavourites <int> 1, 3, 0, 0, 4, 1, 1, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2,…
$ utilitiesConnectionFee <df[,1]> <data.frame[30 x 1]>
$ yearBuilt <int> 2002, 1886, 1907, 2008, 1932, 1914, 1900, 1926, 1934, 1932,…
$ basementArea <int> NA, NA, NA, NA, NA, NA, NA, NA, 88, NA, NA, NA, NA, NA, …
$ lotArea <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5327, NA, N…
$ weightedArea <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ secondaryAddressType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
To save the data into your environment:
df <- "https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases")
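If you need more than one page of results, a rough sketch along the following lines might work. It assumes the `page` and `per_page` query parameters behave the way the URL above suggests, and that the API tolerates a few consecutive requests; neither is documented here. The helpers `build_request()` and `fetch_page()` are hypothetical names introduced for illustration, and the choice of three pages is arbitrary.

```r
library(httr2)
library(purrr)
library(dplyr)

# Endpoint taken from the URL used above.
base_url <- "https://api.prod.bs-aws-stage.com/search/cases"

# Hypothetical helper: build (but do not send) the request for one page.
# Values are passed as strings/integers so they are not written in
# scientific notation in the query string.
build_request <- function(page, per_page = 100L) {
  request(base_url) %>%
    req_url_query(
      addressTypes  = "condo",
      priceMin      = "3000000",
      priceMax      = "7000000",
      per_page      = as.integer(per_page),
      page          = as.integer(page),
      sortAscending = "true",
      sortBy        = "timeOnMarket"
    )
}

# Hypothetical helper: fetch one page and keep a few flat columns,
# which keeps the per-page data frames easy to stack.
fetch_page <- function(page) {
  build_request(page) %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE) %>%
    pluck("cases") %>%
    select(caseID, price = priceCash, sqm = housingArea, rooms = numberOfRooms)
}

# Not run: fetch the first three pages (assumed to exist) and combine them.
# forsale <- map(1:3, fetch_page) %>% bind_rows()
```

Building the request separately from performing it lets you inspect the generated URL (for example with `build_request(2)$url`) before sending anything to the server.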