
Node Jsdom Scrape Google's Reverse Image Search


I want to programmatically find a list of URLs of similar images, given an image URL. I can't find any free image search APIs, so I'm trying to do this by scraping Google's Search by Image.

If I have an image URL, say http://i.imgur.com/oLmwq.png, then navigating to https://www.google.com/searchbyimage?&image_url=http://i.imgur.com/oLmwq.png gives related images and info.
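The search URL above embeds the image URL unescaped, which works for this example but could break for image URLs that contain their own `?` or `&`. A minimal sketch in plain JavaScript of building the URL with percent-encoding follows. `buildSearchUrl` is a hypothetical helper name, and the assumption that the endpoint accepts a percent-encoded `image_url` value is not confirmed by anything in this question.

```javascript
// Hypothetical helper: builds the Search by Image URL for a given image URL.
// The endpoint and the image_url parameter name come from the question;
// percent-encoding the value is an added assumption.
function buildSearchUrl(imageUrl) {
  const base = 'https://www.google.com/searchbyimage';
  return base + '?image_url=' + encodeURIComponent(imageUrl);
}

console.log(buildSearchUrl('http://i.imgur.com/oLmwq.png'));
// prints https://www.google.com/searchbyimage?image_url=http%3A%2F%2Fi.imgur.com%2FoLmwq.png
```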

How do I get jsdom.env to produce the same HTML a browser gets from the above URL?

Here's what I've tried (CoffeeScript):

jsdom = require 'jsdom'
url = 'https://www.google.com/searchbyimage?&image_url=http://i.imgur.com/oLmwq.png'
jsdom.env
    html: url
    scripts: [ "http://code.jquery.com/jquery.js" ]
    features:
        FetchExternalResources: ['script']
        ProcessExternalResources: ['script']
    done: (errors, window) ->
        console.log window.$('body').html()

The HTML this prints doesn't match what the browser shows. Is this an issue with Jsdom's HTTP headers?


Solution

  • The issue is Jsdom's User-Agent HTTP header. Once that is set, (almost) everything works:

    jsdom = require 'jsdom'
    url = 'https://www.google.com/searchbyimage?&image_url=http://i.imgur.com/oLmwq.png'
    jsdom.env
        html: url
        # Send a browser-like User-Agent instead of Jsdom's default
        headers:
            'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
        scripts: [ "http://code.jquery.com/jquery.js" ]
        features:
            FetchExternalResources: ['script']
            ProcessExternalResources: ['script']
        done: (errors, window) ->
            $ = window.$
            # The parent of each image under #iur is a link to a similar image
            $('#iur img').parent().each (index, elem) ->
                href = $(elem).attr 'href'
                # Take the value of the first query parameter of the link.
                # Use a new name so the outer `url` is not overwritten.
                imageUrl = href.split('?')[1].split('&')[0].split('=')[1]
                console.log imageUrl

    This gives us a nice list of visually similar images. The only remaining problem is that Jsdom throws an error after returning the result:

    timers.js:103
                if (!process.listeners('uncaughtException').length) throw e;
                                                                          ^
    TypeError: Cannot call method 'call' of undefined
        at new <anonymous> (/project-root/node_modules/jsdom/lib/jsdom/browser/index.js:54:13)
        at _.Zl (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1238:93)
        at _.jm (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1239:399)
        at _.km (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1241:146)
        at Object._onTimeout (https://www.google.com/xjs/_/js/s/c,sb,cr,cdos,jsa,ssb,sf,tbpr,tbui,rsn,qi,ob,mb,lc,hv,cfm,klc,kat,aut,esp,bihu,amcl,kp,lu,m,rtis,shb,sfa,hsm,pcc,csi/rt=j/ver=3w99aWPP0po.en_US./d=1/sv=1/rs=AItRSTPrAylXrfkOPyRRY-YioThBMqxW2A:1248:727)
        at Timer.list.ontimeout (timers.js:101:19)
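
Separately from the error above, the `href.split('?')[1].split('&')[0].split('=')[1]` chain in the `done` callback throws if a link has no query string and leaves the value percent-encoded. A minimal sketch in plain JavaScript of the same "first query parameter" extraction, using Node's built-in `URL` class, follows. `firstQueryValue` is a hypothetical helper name, and the hrefs used to exercise it are made up for illustration, not taken from Google's markup.

```javascript
// Hypothetical helper: returns the decoded value of the first query
// parameter in href, or null when href has no query parameters.
function firstQueryValue(href) {
  // The base is only needed to resolve relative hrefs.
  const parsed = new URL(href, 'https://www.google.com');
  const first = parsed.searchParams.entries().next();
  return first.done ? null : first.value[1];
}

// Made-up hrefs for illustration only.
console.log(firstQueryValue('/result?target=http%3A%2F%2Fexample.com%2Fa.png&w=100'));
console.log(firstQueryValue('/result'));
```

Unlike the split chain, this returns the decoded value, so compare outputs before swapping it in if downstream code expects the raw, still-encoded string.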