ruby web-scraping nokogiri

Trying to scrape an image using Nokogiri but it returns a link that I was not expecting


I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.

This is the link that I want to get: https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f

But instead I got this: https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png

Why?

This is what I tried:

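require "open-uri"
require "nokogiri"

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
# Download the page as the server sends it (no JavaScript is executed)
serialized_html = URI.open(url).read

html = Nokogiri::HTML.parse(serialized_html)

title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
# Read the src attribute of the first matching <img>
poster = html.search('.film-poster img').attribute('src').value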

{
  title: title,
  overview: overview,
  poster_url: poster,
}

Solution

  • It has nothing to do with your Ruby code.

    If you run something like this in your terminal:

    curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/ 
    

    You will see that the returned HTML does not contain the image you are looking for, only the placeholder that your scraper picked up. You can see the real poster in your browser because, after the initial load, some JavaScript runs and loads more resources.

    The AJAX call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c

    Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.