
How to get the content of a search results page from Krugle or Open Hub


I want to build a tool that analyses the results of code search engines such as Krugle or Open Hub. I've tried both Java and Python to fetch the HTML of a search results page. Here is the Python version:

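import urllib.request


def write_url(url, file_name, if_show):
    """Download url, save the raw response to file_name, and optionally print it."""
    if (url is None) or (file_name is None):
        return

    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as resp:
        ret = resp.read()  # raw bytes of the HTML returned by the server

    # Write in binary mode because the response body is bytes.
    with open(file_name, "wb") as fp:
        fp.write(ret)
    if if_show:
        print(ret.decode("utf-8", errors="replace"))


if __name__ == "__main__":
    url_ = "http://www.krugle.org/document/search/#query=socket"
    file_n = "D:/tmp/test.txt"
    write_url(url_, file_n, True)
    print("Done")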

But the page I get back does not contain the search results. Part of it looks like this:

            <div class="content_result_body">
                <div id="hit_list"></div>
                <div class="paging" style="display: none;"></div>
            </div>

When I inspect the results page of the same search in Chrome, that part looks like this:

            <div class="content_result_body">
                <div id="hit_list">
                    <div class="hit">...</div>
                    <div class="hit">...</div>
                    <div class="hit">...</div>
                </div>
                <div class="paging">...</div>
            </div>

Inside each div.hit, the "..." stands for the content of one search result returned by Krugle. I'm not sure why div#hit_list is empty in the page returned to my Python code. Maybe the results are generated by JavaScript, but I don't know how to get them from code.


Solution

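  • The hit list is filled in by JavaScript after the page has loaded, and the "#query=socket" part of the URL is a fragment that is never sent to the server, so a plain HTTP request only gets the empty template. To deal with pages that load content dynamically, you can try Selenium: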

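    from selenium import webdriver

    url = "http://your-url.com"  # placeholder; the scheme (http://) is required
    br = webdriver.Firefox()     # starts a real Firefox instance
    br.get(url)

    html = br.page_source        # HTML of the page as the browser currently sees it
    br.quit()                    # close the browser when done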
    

    Of course, this opens a browser window as well. If that is inconvenient, I can tell you how to do it with xvfb or PhantomJS.
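
If the browser window is the inconvenient part, another option besides xvfb or PhantomJS is Firefox's own headless mode. The sketch below is only an illustration under a few assumptions: Selenium 4 with geckodriver installed, and the `#hit_list` / `div.hit` structure taken from the markup in the question. The names `fetch_rendered_html`, `HitCollector`, and `extract_hits` are hypothetical helpers, not part of any library. It waits until at least one hit is present before reading `page_source`, then pulls the text of each hit out with the standard-library HTML parser.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, timeout=15):
    """Load url in headless Firefox and return the HTML after JavaScript has run."""
    # Imported here so the parsing helpers below also work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    br = webdriver.Firefox(options=options)
    try:
        br.get(url)
        # Assumption: results show up as div.hit inside #hit_list, as in the question.
        WebDriverWait(br, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#hit_list div.hit"))
        )
        return br.page_source
    finally:
        br.quit()  # always close the browser, even if the wait times out


class HitCollector(HTMLParser):
    """Collect the visible text of every <div class="hit"> element."""

    def __init__(self):
        super().__init__()
        self.hits = []     # text of each completed hit
        self._depth = 0    # <div> nesting depth inside the current hit (0 = outside)
        self._parts = []   # text fragments of the hit being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1      # a nested div inside the current hit
        elif "hit" in classes:
            self._depth = 1       # entering a new hit
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:  # the hit's own closing tag
                self.hits.append(" ".join(self._parts))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._parts.append(data.strip())


def extract_hits(html):
    """Return a list with the text of each div.hit found in html."""
    parser = HitCollector()
    parser.feed(html)
    parser.close()
    return parser.hits
```

With these in place, `extract_hits(fetch_rendered_html("http://www.krugle.org/document/search/#query=socket"))` should give one string per result, provided the page still uses the markup shown in the question. If the wait times out, the selector most likely no longer matches and needs adjusting.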