I want to scrape links to ads on this page: https://reality.idnes.cz/s/?page=1 using R with the rvest and httr packages. It returns results that I do not understand.
The code is:
link <- "https://reality.idnes.cz/s/?page=1"
response <- httr::GET(link)
page <- rvest::read_html(response)
In the code above I get the correct status code 200 for the "response" object, but the "page" object returned by read_html() is almost empty; it does not contain the web page content.
When I do:
object.size(response)
the result is something like this:
132464 bytes
So this object contains data and looks correct. But when I do:
object.size(page)
the result is:
784 bytes
The same applies if I call read_html(link) directly; the resulting object size is the same 784 bytes. Why is the "page" object almost empty, and what happens when calling page <- rvest::read_html(response)?
Many thanks in advance for any help.
That's because page
is a wrapped pointer to memory. The variable in R doesn't contain the data itself; it points to the memory where the parsed document is stored.
str(page)
# List of 2
# $ node:<externalptr>
# $ doc :<externalptr>
# - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
If you convert it to a character vector, you can see all the data: object.size(as.character(page))
. It's all there. It's just that object.size
is not a reliable way to measure how much data is associated with a variable when external pointers are involved.
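To see this without any network access, here is a minimal sketch using a made-up in-memory HTML snippet (the content is purely illustrative): the parsed document lives behind external pointers, so the R object stays tiny, while as.character() serializes the full document back out.

```r
library(rvest)

# Parse a small in-memory HTML document; read_html() returns an
# xml_document, which is just two external pointers into libxml2 memory.
html <- "<html><body><div><a href='/ad/1'>Ad 1</a></div></body></html>"
page <- rvest::read_html(html)

# The R object itself is tiny -- it only holds the pointers...
object.size(page)

# ...but serializing it back to text shows the full content is there.
as.character(page)
```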
You should be able to extract all the data without an issue. For example, you can find all the <div>
tags with page |> rvest::html_nodes("div")
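Building on that, here is a hedged sketch of the link extraction itself. The HTML below is a made-up stand-in for the downloaded page; the actual CSS selector for ad links on reality.idnes.cz would need to be found by inspecting the live page, so the snippet just selects all anchors.

```r
library(rvest)

# Made-up stand-in for the real page; on the live site you would use
# page <- rvest::read_html(response) exactly as in the question.
page <- rvest::read_html(
  "<div class='listing'>
     <a href='/detail/prodej/1'>Flat 1</a>
     <a href='/detail/prodej/2'>Flat 2</a>
   </div>"
)

# Select anchor elements and pull out their href attributes.
# On the real page you would narrow the selector to the ad links
# (hypothetical -- confirm it in the browser's inspector).
links <- page |>
  rvest::html_elements("a") |>
  rvest::html_attr("href")

links
```

The same pipeline works unchanged on the object built from the httr response; the "empty-looking" page object is fully queryable.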