python · web-scraping · quora

Automatically scrape multiple questions from Quora with a specific tag?


I want to scrape questions from Quora related to some specific topic that have more than 4 answers or so.

For each question, I want to find:

a) the number of answers

b) the tags associated with the question

This is my program:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.quora.com/How-does-Quora-automatically-know-what-tags-to-put-for-a-question")
soup = BeautifulSoup(res.text, 'lxml')

# All the answers are inside 'pagedlist_item' divs
ans = soup.find_all('div', {'class': 'pagedlist_item'})

# Question name is inside the 'question_text_edit' div
qname = soup.find('div', {'class': 'question_text_edit'})

# Tags of the question
tags = soup.find('div', {'class': 'QuestionTopicHorizontalList TopicList'})

# Check whether "TV" is one of the question's tags and whether the
# question has at least 4 answers; if so, print the question text
if "TV" in tags.text:
    no_ans = len(ans)
    if no_ans >= 4:
        print(qname.text)

I want to search over many such pages that have the tag TV and then run the above check on each of them.

The logic for checking the conditions is at the end of the code, but it only works for the single question whose address is passed to the requests.get("") function.

How can I make the code automatically iterate over many web pages (multiple questions) with the tag 'TV', rather than passing a single webpage address to the requests.get("") function?

Also, I want to scrape multiple questions (as many as 40 or so).


Solution

  • I will answer these step by step:

    I want to search over many such pages that have the tag TV and then run the above check on each of them.

    Well, if you want to scrape multiple pages like this, you have to start from the topic's root page, which lists many questions related to that specific topic, and scrape the links to those questions from there, as sketched below.
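    As a rough illustration of that idea, here is a minimal sketch using requests and BeautifulSoup. It is a sketch only: the topic URL is a hypothetical example, the question_link class is an assumption about Quora's markup, and a plain HTTP fetch only sees the questions rendered in the initial HTML, which is exactly why the full solution below uses Selenium instead.

    import requests
    from bs4 import BeautifulSoup

    # Fetch a topic's root page (hypothetical example URL)
    res = requests.get("https://www.quora.com/topic/Television")
    soup = BeautifulSoup(res.text, 'lxml')

    # Collect links of the questions listed on the page; hrefs may be
    # relative, so prefix the domain where needed
    qlinks = []
    for a in soup.find_all('a', {'class': 'question_link'}):
        href = a.get('href', '')
        qlinks.append(href if href.startswith('http') else 'https://www.quora.com' + href)
    print(qlinks)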

    Also, I want to scrape multiple questions (as many as 40 or so).

    For this, you need to mimic scrolling so that more and more questions load as you go down the page.

    You can't directly use Requests and BeautifulSoup to execute events like scrolling. Below is a Python script that uses the Selenium library to fulfill your requirements.

    Note:

    1. Install the ChromeDriver that matches your Chrome version.

    2. Install Selenium with pip install -U selenium.

    3. If you are using Windows, point to the .exe: executable_path='/path/to/chromedriver.exe'.

    This code asks for a topic link (a second input is left commented out in case you want to add more), then scrapes each question's "Question, No. of answers, Tags, 4 answers" details and saves them in CSV format.

    Keys.PAGE_DOWN is used to mimic the scroll action. The various details are appended to the row list, and at the end each row is written to the CSV file.
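    If PAGE_DOWN doesn't trigger loading reliably, a common alternative (a sketch, not part of the original script) is to scroll via JavaScript; this is a drop-in replacement for the PAGE_DOWN loop in the script below, reusing its browser and time names:

    # Alternative to Keys.PAGE_DOWN: jump to the bottom of the page via
    # JavaScript and give newly loaded questions time to render
    for _ in range(no_of_pagedowns):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(0.5)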

    Also, you can change the value of the no_of_pagedowns variable to set the number of scrolls you want.
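    One caveat: Selenium 4.x removed the find_element_by_* helpers and the executable_path argument used below. If you are on Selenium 4 or newer, the equivalents look like this (a minimal sketch; the driver path is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service

    # Selenium 4 style: the driver path goes through a Service object
    browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
    elem = browser.find_element(By.TAG_NAME, "body")
    post_elems = browser.find_elements(By.XPATH, "//a[@class='question_link']")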

    import time
    import csv
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys


    # Write the CSV header (with a newline so the first data row
    # starts on its own line)
    with open('submission.csv', 'w', newline='') as file:
        file.write("Question,No. of answers,Tags,4 answers\n")

    link1 = input("Enter first link")
    #link2 = input("Enter second link")
    manylinks = list()
    manylinks.append(link1)
    #manylinks.append(link2)

    for olink in manylinks:
        qlinks = list()
        browser = webdriver.Chrome(executable_path='/Users/ajay/Downloads/chromedriver')
        browser.get(olink)
        time.sleep(1)
        elem = browser.find_element_by_tag_name("body")

        # Scroll down the topic page so that more questions get loaded
        no_of_pagedowns = 50
        while no_of_pagedowns:
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
            no_of_pagedowns -= 1

        # Collect the links of all questions rendered so far
        post_elems = browser.find_elements_by_xpath("//a[@class='question_link']")
        for post in post_elems:
            qlink = post.get_attribute("href")
            print(qlink)
            qlinks.append(qlink)

        for qlink in qlinks:
            append_status = 0
            row = list()

            browser.get(qlink)
            time.sleep(1)
            elem = browser.find_element_by_tag_name("body")

            no_of_pagedowns = 1
            while no_of_pagedowns:
                elem.send_keys(Keys.PAGE_DOWN)
                time.sleep(0.2)
                no_of_pagedowns -= 1

            # Question name
            qname = browser.find_elements_by_xpath("//div[@class='question_text_edit']")
            for q in qname:
                print(q.text)
                row.append(q.text)

            # Answer count (e.g. "4 Answers" -- take the leading digits)
            no_ans = browser.find_elements_by_xpath("//div[@class='answer_count']")
            for count in no_ans:
                append_status = int(count.text[:2])
                row.append(count.text)

            # Tags
            tags = browser.find_elements_by_xpath("//div[@class='header']")
            tag_field = list()
            for t in tags:
                tag_field.append(t.text)
            row.append(tag_field)

            # First four answers
            all_ans = browser.find_elements_by_xpath("//div[@class='ui_qtext_expanded']")
            i = 1
            answer_field = list()
            for post in all_ans:
                if i <= 4:
                    i = i + 1
                    answer_field.append(post.text)
                else:
                    break
            row.append(answer_field)

            print('append_status', append_status)

            # Only keep questions with at least 4 answers
            if append_status >= 4:
                with open('submission.csv', 'a', newline='') as file:
                    writer = csv.writer(file)
                    writer.writerow(row)

        browser.quit()
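
    Once submission.csv has been written, you can load it back and inspect the saved rows, for example with a minimal sketch using the same csv module:

    import csv

    # Read back the saved rows: question text, answer count, tags, answers
    with open('submission.csv', newline='') as file:
        reader = csv.reader(file)
        header = next(reader)  # "Question,No. of answers,Tags,4 answers"
        for row in reader:
            if row:  # skip any blank lines
                print(row[0], '->', row[1])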