javascript web-scraping watir

JavaScript inside an iframe: scraping with Watir


I'm trying to figure out Watir. Here is the situation: I want to monitor ads on a few websites, but scraping them is not easy because the ads sit inside an iframe, which in turn loads further iframes that are generated with JavaScript. Only then comes the page I would like to get.

Here is the code in the main page:

<iframe width="300" height="250" scrolling="no" frameborder="0"
id="adbottomleft" src="/ad/left1" name="adbottomleft"></iframe>

Here is what that iframe contains:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
<style type="text/css">

 body {
     background-color: black;
     margin:0;
     padding:0;
 }</style>
</head>
<body>
<!--  Rubicon Project Tag -->
<!--  Site: MangaReader   Zone: ROS_BTF_LEFT   Size: Medium Rectangle  -->
<div id="adfooter" style="width:300px;height:250px;"></div>
<script language="JavaScript" type="text/javascript">
function tl(){
    var loaded = 0;
    try {
        loaded = parent.document['adver'];
    } catch(e) { loaded = 0; }
    if (loaded != 1) {
        setTimeout(tl, 25);
    } else {
            var dest = document.getElementById('adfooter');
            var lframe = document.createElement('iframe');
            lframe.setAttribute('id','adbleft');
            lframe.setAttribute('width','300');
            lframe.setAttribute('height','250');
            lframe.setAttribute('scrolling','no');
            lframe.setAttribute('frameborder', '0');
            lframe.setAttribute('src', 'http://ad.mangareader.net/btleft1');
            dest.appendChild(lframe);
    }
}
(function (){
tl();
}());
</script>
</body>
</html>

It generates another iframe, which looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
<style type="text/css">
                * {
            margin:0;
            padding:0;
        }
        body {
            margin-left: 0px;
            margin-top: 0px;
        }
        </style>
</head>
<body>
<!--  Rubicon Project Tag -->
<!--  Site: MangaReader   Zone: ROS_BTF_LEFT   Size: Medium Rectangle  -->
<script language="JavaScript" type="text/javascript">
var cb = Math.random();
var d = document;
var iframe = "&fr=" + (window != top);
var ref = "";
try {
    if (window != top) {
      ref = "&rf="+escape(d.referrer);
   }
} catch (ignore) { }
d.write("<iframe id='25504.15' name='25504.15' src='' framespacing='0' frameborder='no' scrolling='no' align='middle' width='300' height='250' marginheight='0' marginwidth='0'></iframe>");
d.getElementById('25504.15').src='http://optimized-by.rubiconproject.com/a/8240/13310/25504-15.html?cb='+cb+ref;
</script>
</body>
</html>

Only then comes the final page, which is the one I'm interested in scraping:

<html>
  <head>
    <meta http-equiv="Pragma" content="no-cache">
    <meta http-equiv="expires" content="0">
    <style type="text/css"> body {margin:0px; padding:0px;} </style>
    <script type="text/javascript">
      rubicon_cb = Math.random(); rubicon_rurl = document.referrer; if(top.location==document.location){rubicon_rurl = document.location;} rubicon_rurl = escape(rubicon_rurl);
      window.rubicon_ad = "3260765" + "." + "js";
      window.rubicon_creative = "3299047" + "." + "js";
    </script>
  </head>
  <body>

<a href="http://optimized-by.rubiconproject.com/t/8240/13310/25504-15.3260765.3299047?url=http%3A%2F%2Fwww.animepremium.net" target="_blank"><img src="http://assets.rubiconproject.com/campaigns/100/91/16/5/1325630095ap_300.jpg" border="0" alt="AnimePremium.net" /></a><script defer="defer" type="text/javascript">
{
    if (Math.floor(Math.random()*100) < 1)
    {
        var url;
        var iframe = (window != top);
        url = "http://tap.rubiconproject.com/stats/iframes?pc=8240/13310&ptc=25504&upn="+iframe;
        setTimeout(function(){ new Image().src = url }, 1000);
    }
}
</script>
<script>var _comscore = _comscore || []; _comscore.push({ c1: "8", c2: "6135404", c3: "28", c4: "13310", c10: "3299047" }); (function() { var s = document.createElement("script"), el = document.getElementsByTagName("script")[0]; s.async = true; s.src = (document.location.protocol == "https:" ? "https://sb" : "http://b") + ".scorecardresearch.com/beacon.js"; el.parentNode.insertBefore(s, el); })();</script><DIV STYLE="height:0px; width:0px; overflow:hidden"><IFRAME SRC="http://tap2-cdn.rubiconproject.com/partner/scripts/rubicon/emily.html?rtb_ext=1&pc=8240/13310&geo=eu" FRAMEBORDER="0" MARGINWIDTH="0" MARGINHEIGHT="0" SCROLLING="NO" WIDTH="0" HEIGHT="0" style="height:0px; width:0px"></IFRAME></DIV>
  </body>
</html>

Is this an impossible task?

Here is what I'm doing:

irb
require "watir-webdriver"
browser = Watir::Browser.new :ff
browser.goto "mangareader.net"
browser.frame(:id, "adbottomleft").html   # works

If I try to go one layer deeper, I get an error:

irb
require "watir-webdriver"
browser = Watir::Browser.new :ff
browser.goto "mangareader.net"
browser.frame(:id, "adbottomleft").frame(:id, "adleft").html   # does not work

The error is:

Element belongs to a different frame than the current one - switch to it's containing frame to use it.

What should I change in the second snippet to make it read the next iframe?

I have been searching for days. I started with Selenium, then HtmlUnit with C#, then tried mechanize with Python, but couldn't get the results I wanted.

I keep jumping between tools. I finally thought I would be able to achieve what I wanted with Watir. I need some help to get this done. Any tips?


Solution

  • The ID of the frame created by the script is "adbleft", not "adleft". That might be your problem:

    browser.frame(:id => "adbottomleft").frame(:id => "adbleft").html 
    

    If the id of the final frame is not static, you might have to select it by index:

    browser.frame(:id => "adbottomleft").frame(:id => "adbleft").frame(:index => 0)
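
The chained lookup above can also be written as a small helper that walks a list of frame locators, which may make it easier to add or remove a layer when the ad markup changes. This is only a sketch: `nested_frame` and its `via:` parameter are hypothetical names, not part of Watir. The default `:frame` matches the calls above; newer Watir releases may expect `:iframe` for `<iframe>` elements, so treat that as something to check against your installed version.

```ruby
# Hypothetical helper (not part of Watir): follow a chain of frame locators.
# `root` is any object that responds to the lookup method, e.g. a Watir browser.
# `via` names the lookup method to call at each level. It defaults to :frame,
# as used in the answer above; passing :iframe is an assumption for newer Watir.
def nested_frame(root, locators, via: :frame)
  locators.reduce(root) do |context, locator|
    context.public_send(via, locator)
  end
end

# Locators taken from the question: the outer frame, the frame the script
# creates, and the innermost frame selected by index.
AD_FRAME_PATH = [
  { id: "adbottomleft" },
  { id: "adbleft" },
  { index: 0 }
].freeze
```

With a browser already opened as in the question, `nested_frame(browser, AD_FRAME_PATH).html` would be the equivalent of the last chained call followed by `.html`. Because the inner frames are created by script after the page loads, the lookup may still fail if it runs before they exist.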