Tags: r, web, web-scraping, rvest, web-scripting

rvest web content scraping issue / car trading website


Question

I wanted to use rvest to scrape specific parts of a website (a car sales platform).

The CSS is frankly too confusing for me to figure out what's wrong on my own.

library(rvest)
library(stringr)

#### scraping the website www.otomoto.pl with used cars #####

baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="

i <- 1

for ( i in 1:7000 )
{
  link = paste0(baseURL_otomoto,i)
  out = read_html(link)
  print(i)
  print(link)

  ### building year 
  build_year  = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()

  mileage  = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()

  volume  = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()

  fuel_type  = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()


  price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()

  link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()

  offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()
}

Any guesses what might be the reason for this behaviour?

PS#1.

How can I scrape all the build_year, mileage and fuel_type data from the offers available on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...') didn't work in my case.

PS#2.

I wanted to scrape details of the individual offers using, for instance, the following arguments:

gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
    html_text() %>%
    str_replace_all("\n","") %>%
    str_replace_all("\r","") %>%
    str_trim()

Unfortunately, this concept fails: the resulting data frame is empty. Any guesses why?


Solution

  • First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all a mere two weeks later). For example, instead of:

    html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
        html_text()
    

    you can write:

    html_nodes(out, css="[data-code=year]") %>% html_text()
    

    Second, read the documentation of the libraries that you use. The str_replace_all pattern may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n","") %>% str_replace_all("\r","")). html_text can also trim the text for you, which means that str_trim() is not needed at all.
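
    Put together, each of the pipelines from the question condenses to something like the sketch below (assuming the [data-code=year] selector shown above; trim = TRUE stands in for the str_trim() call):

    build_year <- html_nodes(out, css = "[data-code=year]") %>%
      html_text(trim = TRUE) %>%        # trimming done by html_text
      str_replace_all("[\n\r]", "")     # one regex call instead of two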

    Third, if you find yourself copy-pasting code, step back and consider whether a function wouldn't be a better solution; usually it would (a sketch of such a helper follows this paragraph). In your case, personally, I would probably skip the str_replace_all calls until the data cleaning step, and call them on the data.frame holding the entire scraped data.
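
    As a rough sketch of that idea (the helper name scrape_param is hypothetical; the data-code attribute values are the ones used throughout this answer), the repeated pipelines collapse into a single function:

    # return the trimmed text of every element carrying the given data-code attribute
    scrape_param <- function(page, code) {
      html_nodes(page, css = paste0("[data-code=", code, "]")) %>%
        html_text(trim = TRUE)
    }

    build_year <- scrape_param(out, "year")
    mileage    <- scrape_param(out, "mileage")
    fuel_type  <- scrape_param(out, "fuel_type")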


    To create a data.frame from your data, call the data.frame() function with column names and content, like this:

    data.frame(build_year = build_year,
        mileage = mileage,
        volume = volume,
        fuel_type = fuel_type,
        price = price,
        link = link,
        offer_details = offer_details)
    

    Or you could initialize the data.frame with one column only and then add further vectors as columns:

    output_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
    output_df$volume <- html_nodes(out, css="[data-code=engine_capacity]") %>%
      html_text(TRUE)
    

    Finally, you should note that data.frame columns must all be the same length, while some of the data you scrape is optional. At the time of writing this answer, a few offers had no engine capacity and no offer description. A single CSS selector will not match what doesn't exist, and html_nodes silently drops the missing entries, so the resulting vector ends up shorter than the other columns. This can be worked around by selecting the offer containers with html_nodes and piping its output into an html_node call, which keeps a placeholder for every container:

    current_df$volume  = out %>% html_nodes("ul.offer-item__params") %>% 
        html_node("[data-code=engine_capacity]") %>% 
        html_text(TRUE)
    

    The final version of my approach to the loop internals is below. Just make sure that you initialize an empty data.frame before the loop and that you merge the output of the current iteration with the final data frame (using, for example, rbind), or each iteration will overwrite the results of the previous one. Or you could use do.call(rbind, lapply()), which is idiomatic R for such a task; a sketch of that pattern follows the code below.

    As a side note, when scraping a large amount of quickly changing data, consider decoupling the data downloading and data processing steps. Imagine that there is some corner case you haven't accounted for that causes R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer you spend on one page, the more duplicates you introduce (as new offers appear and existing ones are pushed down onto further pages) and the more offers you miss (as sales are concluded and offers disappear forever).
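
    A minimal sketch of that decoupling, assuming a local pages/ directory (the directory and file names are illustrative): download and save the raw HTML first, then parse the saved files in a separate pass, so a crash in the processing code never forces a re-download.

    dir.create("pages", showWarnings = FALSE)
    for (i in 1:7000) {
      out <- read_html(paste0(baseURL_otomoto, i))
      # store the raw page; xml2::write_html serialises the parsed document back to HTML
      xml2::write_html(out, file.path("pages", paste0("page_", i, ".html")))
    }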

    current_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE))
    
    current_df$mileage  = html_nodes(out, css="[data-code=mileage]") %>%
      html_text(TRUE)
    
    current_df$volume  = out %>% html_nodes("ul.offer-item__params") %>% 
        html_node("[data-code=engine_capacity]") %>% 
        html_text(TRUE)
    
    current_df$fuel_type  = html_nodes(out, css="[data-code=fuel_type]") %>%
      html_text(TRUE)
    
    current_df$price = out %>% html_nodes(xpath="//div[@class='offer-price']//span[contains(@class, 'number')]") %>% 
      html_text(TRUE)
    
    current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>% 
      html_text(TRUE) %>% 
      str_replace_all("[\n\r]", "")
    
    current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>% 
        html_node("h3") %>% 
        html_text(TRUE)