javascript ruby web-scraping nokogiri

How do I scrape data generated by a JavaScript function using Ruby?


I am trying to scrape the data URL for the latest date, which is the first row of the table on this page. The content of the table appears to be generated by a JavaScript function.

I tried using Nokogiri to get it, but Nokogiri cannot execute JavaScript. I then tried extracting only the script elements with Nokogiri:

require 'open-uri'
require 'nokogiri'

url = "http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data"
doc = Nokogiri::HTML(URI.open(url))
js = doc.css("script").text
puts js

In the output I found the table I wanted, with the class name sgxTableGrid. The problem is that the JavaScript function gives no clue about the data URL; everything is generated dynamically.

Does someone know a better way of approaching this problem?


Solution

  • Looking through the HTML for that page shows that the table is generated from JSON received in response to a JavaScript request.

    You can figure out what's going on by tracing backwards through the source code of the page. Here's some of what you'll need if you want to retrieve the JSON outside of their JavaScript; there will still be work to do before you can actually use it:

    1. Starting with this code:

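      require 'open-uri'
      require 'nokogiri'

      # Kernel#open no longer opens URLs in Ruby 3.0+; use URI.open instead.
      doc = Nokogiri::HTML(URI.open('http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data'))
      scripts = doc.css('script').map(&:text)

      # Print only the scripts that mention the table's class name.
      puts scripts.select { |s| s.include?('sgxTableGrid') }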
      

      Look at the text output in an editor. Search for sgxTableGrid. You'll see a line like:

      var tableHeader =  "<table width='100%' class='sgxTableGrid'>"
      

      Look a little further down and you'll see:

      var totalRows = data.items.length - 1;
      

      data is a parameter of the function that contains this line, so that's where we start.

    2. Take a unique part of the containing function's name, such as loadGridns_, and search for it. Each time you find it, look for the parameter data, then look to see where data is defined. If it's passed into that function, search for whatever calls it. Repeat the process until you find a function where the variable isn't passed in; at that point you'll know you've reached the function that creates it.

    3. I ended up in a function whose name starts with loadGridDatans, inside a block that makes an xhrPost call to retrieve a URL. That URL is the target you're after, so grab the name of the containing function and trace back through the calls where the URL is passed in, as you did in the previous step.

    4. That search ended up on a line that looks like:

      var url = viewByDailyns_7_2AA4H0C090FIE0I1OH2JFH20K1_...
      
    5. At that point you can start reconstructing the URL you need. Open a JavaScript debugger, like Firebug, and put a breakpoint on that line. Reload the page, and JavaScript should stop executing at that line. Single-step, or set further breakpoints, and watch the url variable being built until it reaches its final form. You then have something you can pass to OpenURI, which should retrieve the JSON you want.
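
Once the debugger has shown you the final URL, fetching and unpacking the JSON could look roughly like the sketch below. It assumes only what the page's own JavaScript suggests, namely that the response has a top-level items array (from data.items.length). The helper names and the fields in the sample payload are hypothetical. Also note that OpenURI only issues GET requests; since the page itself uses xhrPost, the server may insist on a POST, in which case something like Net::HTTP would be needed instead.

```ruby
require 'json'
require 'open-uri'

# Fetch the response body as a string. The url argument is whatever final
# value you observed in the debugger; it is deliberately not hard-coded here.
def fetch_json_text(url)
  URI.open(url, &:read)
end

# Parse the JSON text and return the first entry of its "items" array.
# The "items" key is assumed from the page's use of data.items.
def first_item(json_text)
  data = JSON.parse(json_text)
  data.fetch('items').first
end

# Hypothetical payload, for illustration only; the real field names
# have to be read off the actual response.
sample = '{"items": [{"date": "latest", "link": "/first.zip"}, {"date": "older", "link": "/second.zip"}]}'
puts first_item(sample)
```

With a real URL you would call first_item(fetch_json_text(url)) and then pick out whichever field holds the data link.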

    Note that their function names might be generated dynamically. I didn't check, so relying on the full name of a function might fail.

    They might also be serializing a datetime stamp or a session key into the function names to make them unique or more opaque; there are a number of reasons to do that.
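
Because the full names may change between requests, the search-and-trace steps above are safer if they work from a stable prefix rather than a complete function name. Here is a minimal sketch of that idea. The helper names are made up for this example, and the sample script is a hypothetical stand-in for the real page source, with an invented ABC123 suffix.

```ruby
# Return every distinct identifier in the script text that starts with prefix.
def identifiers_with_prefix(script_text, prefix)
  script_text.scan(/\b#{Regexp.escape(prefix)}\w*/).uniq
end

# Return [line_number, stripped_line] pairs for each line mentioning prefix,
# which makes it easier to hop between a definition and its call sites.
def lines_mentioning(script_text, prefix)
  script_text.each_line.with_index(1)
             .select { |line, _number| line.include?(prefix) }
             .map { |line, number| [number, line.strip] }
end

# Hypothetical stand-in for the script text extracted with Nokogiri.
SAMPLE_SCRIPT = <<~JS
  function loadGridns_ABC123_(data) {
    var totalRows = data.items.length - 1;
  }
  function loadGridDatans_ABC123_(url) {
    xhrPost(url, loadGridns_ABC123_);
  }
JS

puts identifiers_with_prefix(SAMPLE_SCRIPT, 'loadGridns_')
lines_mentioning(SAMPLE_SCRIPT, 'loadGridns_').each do |number, line|
  puts "#{number}: #{line}"
end
```

In practice you would pass in the joined text of the script elements instead of the sample, then repeat the search with each new prefix you uncover.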

    Even though it's a pain to take this stuff apart, it's also a good lesson in how dynamic pages work.