php web-scraping simple-html-dom

Extract/scrape a JavaScript window.open URL from a static HTML file using PHP


I'm trying to scrape a bunch of local HTML files. Each one has a piece of JavaScript embedded in it, with a different window.open path, like so:

<script>

function goTo() {

if (document.getElementById('somedomain').checked) {
window.open("http://www.somedomain.com");
}

if (document.getElementById('visit').checked) {
window.open("http://extract-this-url.com/?somevar=12345&anothervar=59305&etc=etc");
}

}
</script>

I'm trying to extract that second URL. It will be a different URL for each file (as will the first 'somedomain' URL).

I've been looking at SimpleHTMLDOM, but it doesn't look like it can handle JavaScript that's embedded within an HTML file.

Is there any decent way of doing this?


Solution

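  • Just use a regular expression. The pattern skips ahead to the first "visit", then captures the first double-quoted argument to window.open that follows it:

    // $text holds the contents of one HTML file, e.g. from file_get_contents().
    // Flags: i = case-insensitive, s = let "." match across newlines.
    preg_match('#visit.*?window\.open\("(.*?)"#is', $text, $matches);
    print_r($matches); // $matches[1] is the captured URL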
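
If the bare word "visit" could appear earlier in a file (in a heading or link text, say), the one-liner above may latch onto the wrong spot. Here is a minimal sketch of a slightly stricter variant that anchors on the getElementById('visit') call instead and accepts either quote style. The function name extractVisitUrl and the $sample string are made up for illustration; the sample simply mirrors the snippet from the question.

```php
<?php
// Hypothetical helper (not part of any library): returns the URL passed to
// window.open() in the branch guarded by getElementById('visit'), or null.
function extractVisitUrl(string $html): ?string
{
    // ([\'"]) ... \1 and \2 require the closing quote to match the opening one.
    // Flags: i = case-insensitive, s = "." also matches newlines.
    $pattern = '#getElementById\(\s*([\'"])visit\1\s*\).*?window\.open\(\s*([\'"])(.*?)\2#is';

    if (preg_match($pattern, $html, $m) === 1) {
        return $m[3]; // third capture group is the URL itself
    }
    return null; // no matching branch found in this file
}

// Sample input copied from the question, for demonstration only.
$sample = <<<'HTML'
<script>
function goTo() {
if (document.getElementById('somedomain').checked) {
window.open("http://www.somedomain.com");
}
if (document.getElementById('visit').checked) {
window.open("http://extract-this-url.com/?somevar=12345&anothervar=59305&etc=etc");
}
}
</script>
HTML;

echo extractVisitUrl($sample), PHP_EOL;
// prints: http://extract-this-url.com/?somevar=12345&anothervar=59305&etc=etc
```

To run this over many local files, one option is to loop over the paths returned by glob() and pass file_get_contents($path) to the helper. This is still plain pattern matching, not JavaScript parsing, so it assumes every file follows the same layout as the sample.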