r, web-scraping, rvest, httr

R rvest read_html() returns almost empty page


I want to scrape links to ads on this page: https://reality.idnes.cz/s/?page=1 using R with the rvest and httr packages. It returns results which I do not understand.

The code is:

link <- "https://reality.idnes.cz/s/?page=1"
response <- httr::GET(link)
page <- rvest::read_html(response)

In the code above I get the correct return code 200 for the "response" object, but the "page" object returned by read_html() is almost empty; it does not contain the web page content.

When I do:

object.size(response)

the result is something like this:

132464 bytes

So this object contains data and looks correct. But when I do:

object.size(page)

the result is:

784 bytes

The same applies if I call read_html(link) directly; the resulting object size is the same 784 bytes. Why is the "page" object almost empty? What happens when calling page <- rvest::read_html(response)?

Many thanks in advance for any help.


Solution

  • That's because page is a wrapped pointer to memory: the variable in R doesn't contain the data itself, it points to the external memory where the parsed document is stored.

    str(page)
    # List of 2
    #  $ node:<externalptr> 
    #  $ doc :<externalptr> 
    #  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
    

    If you convert it to a character vector, you can see all the data: object.size(as.character(page)). It's all there. object.size() is simply not a reliable way to measure how much data is associated with a variable when external pointers are involved.
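
    To see this without any network access, here is a small sketch using rvest's minimal_html() helper to build a document locally (the HTML snippet is made up for illustration). The parsed document lives in C-level memory managed by libxml2; R only holds the external pointers, so object.size() reports roughly the same tiny size no matter how big the page is:

    ```r
    library(rvest)

    # A small self-contained document (no network needed for this illustration)
    page <- minimal_html("<div><p>Hello</p></div>")

    # The parsed document is held in C-level memory; R only keeps two
    # external pointers to it, so object.size() reports a tiny constant size
    str(page)
    object.size(page)

    # Serialising the document back to text shows the content is really there
    as.character(page)
    ```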

    You should be able to extract all the data there without an issue. For example, you can find all the <div> tags with page |> rvest::html_nodes("div").
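
    As a concrete sketch of pulling the ad links out, the pattern is html_elements() (the current name for html_nodes()) followed by html_attr("href"). The class names and URLs below are invented stand-ins, since the real idnes.cz markup would need to be inspected in the browser; only the extraction pattern carries over:

    ```r
    library(rvest)

    # Offline stand-in for the real listing page; the actual idnes.cz
    # markup differs, so the selector here is illustrative only
    page <- minimal_html('
      <div class="ad-item"><a href="/detail/1">Ad 1</a></div>
      <div class="ad-item"><a href="/detail/2">Ad 2</a></div>')

    # Select the anchor inside each ad container, then read its href
    links <- page |>
      rvest::html_elements("div.ad-item a") |>
      rvest::html_attr("href")

    links
    # [1] "/detail/1" "/detail/2"
    ```

    On the live page you would replace "div.ad-item a" with whatever selector matches the ad links there, and prepend the site's base URL to the relative hrefs.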