
Retrieving Disqus comments embedded in another web page with Python


I'm scraping a website using Python 3.5 (BeautifulSoup). I can read everything in the page source, but I've had no success retrieving the embedded Disqus comments, which are loaded by a referenced script.

The relevant piece of the HTML source looks like this:

var disqus_identifier = "node/XXXXX";
<script type='text/javascript' src='https://disqus.com/forums/siteweb/embed.js'></script>

The src attribute points to a script.

I've read the suggestions on Stack Overflow about using Selenium, but I had a really hard time with it and could not make it work. I understand that Selenium emulates a browser (which I believe is too heavy for what I want). I also have a problem with the webdrivers, which are not working correctly, so I dropped this option.

I would like to be able to execute the script and retrieve the .js with the comments. I found that a possible solution is PyV8, but I can't import it in Python. I've read posts online and googled it, but it's still not working.

I installed Sublime Text 3 and manually downloaded pyv8-win64-p3 to:

C:\Users\myusername\AppData\Roaming\Sublime Text 3\Installed Packages\PyV8\pyv8-win64-p3

But I keep getting:

ImportError: No module named 'PyV8'.

If somebody can help me, I'll be very thankful.


Solution

  • You can reconstruct the Disqus API request by studying the page's network traffic; all the required data is present in the page source, and Disqus sends it as a query string. I recently extracted comments from the Disqus API this way; here is the sample code.

    Example: here soup is the parsed page source, and params_dict = json.loads(str(soup).split("embedVars = ")[1].split(";")[0])

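    import json
    from urllib.parse import urlencode

    import requests  # assumption: the original answer does not say which HTTP library it used
    from bs4 import BeautifulSoup

    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'
    }

    # getLink and getJson are not shown in the original answer; these are assumed minimal versions.
    def getLink(url):
        return BeautifulSoup(requests.get(url, headers=HEADERS).text, 'html.parser')

    def getJson(url):
        return requests.get(url, headers=HEADERS).json()

    def disqus(params_dict, soup):
        comments_list = []
        # Query string copied from the embed request seen in the network traffic; urlencode escapes the values.
        query = urlencode({
            'base': 'default',
            'version': '25916d2dd3d996eaedf6cdb92f03e7dd',
            'f': params_dict['disqusShortname'],
            't_i': params_dict['disqusIdentifier'],
            't_u': params_dict['disqusUrl'],
            't_e': params_dict['disqusTitle'],
            't_d': soup.head.title.text,
            't_t': params_dict['disqusTitle'],
            's_o': 'default',
            'l': '',
        })
        comment_soup = getLink('http://disqus.com/embed/comments/?' + query)
        temp_dict = json.loads(str(comment_soup).split("threadData\" type=\"text/json\">")[1].split("</script")[0])
        thread_id = temp_dict['response']['thread']['id']
        forumname = temp_dict['response']['thread']['forum']
        i = 1
        while True:
            disqus_url = 'http://disqus.com/api/3.0/threads/listPostsThreaded?limit=100&thread='+str(thread_id)+'&forum='+forumname+'&order=popular&cursor='+str(i)+':0:0'+'&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F'
            data = getJson(disqus_url)
            comments_list.extend(data.get('response', []))
            # The original answer is cut off at this point. Assumed stop condition:
            # the response carries a 'cursor' object whose 'hasNext' flag marks more pages.
            if not data.get('cursor', {}).get('hasNext'):
                break
            i += 1
        return comments_list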
    

    The API returns JSON that contains the comments, which you can then extract. Hope this helps.
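The two ends of this approach, building params_dict before calling disqus() and reading the comments afterwards, could be sketched roughly as below. This is only an illustration under assumptions: extract_embed_vars and iter_comments are hypothetical helper names, the page is assumed to contain an `embedVars = {...}` JSON assignment, and the post fields `author`, `name`, `raw_message` and `message` are assumed from typical Disqus responses, so check them against the JSON you actually receive.

```python
import json
import re


def extract_embed_vars(html):
    """Return the object assigned to `embedVars` in the page source, or None."""
    match = re.search(r"embedVars\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever follows it, so a ';' inside a URL does not cut the value short
    # the way the split(";") one-liner would.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


def iter_comments(posts):
    """Yield (author_name, text) pairs from a list of post dicts."""
    for post in posts:
        # Field names are assumptions; missing fields fall back to "".
        author = (post.get("author") or {}).get("name", "")
        text = post.get("raw_message") or post.get("message", "")
        yield author, text
```

With these, params_dict = extract_embed_vars(str(soup)) can replace the split-based expression, and iter_comments can be run over the list of posts taken from the JSON response.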