I want to scrape questions from Quora related to a specific topic that have more than 4 answers or so.
I want to find:
a) the number of answers
b) the tags associated with each question
This is my program:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.quora.com/How-does-Quora-automatically-know-what-tags-to-put-for-a-question")
soup = BeautifulSoup(res.text, 'lxml')

# All the answers are inside divs with class 'pagedlist_item'
ans = soup.find_all('div', {'class': 'pagedlist_item'})

# Question name is inside the div with class 'question_text_edit'
qname = soup.find('div', {'class': 'question_text_edit'})

# Tags of the question
tags = soup.find('div', {'class': 'QuestionTopicHorizontalList TopicList'})

# Check whether "TV" is among the tags of the question on the current page,
# and whether the question has >= 4 answers; if so, print the question
no_ans = 0
if "TV" in tags.text:
    for a in ans:
        no_ans = no_ans + 1
    if no_ans >= 4:
        print(qname.text)
I want to search over many such pages that have the tag TV, and then run the check above on each of those pages.
The logic for checking the conditions is at the end of the code, but it only works for the single question whose address is passed to the requests.get("") call.
How can I make the code automatically iterate over many web pages (multiple questions) with the tag 'TV', rather than passing a single webpage address into requests.get("")?
Also, I want to scrape multiple questions (as many as 40 or so).
I will answer these step by step:
I want to search over many such pages which have the tag TV and then later perform the check over those pages to satisfy the above condition.
If you want to scrape multiple pages like these, you have to start from the root page of the topic, which lists many questions related to that specific topic, and begin by scraping the links of the questions listed on that root page.
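As a rough sketch of that idea (assuming the topic root page lives at a URL like https://www.quora.com/topic/TV, and that question links carry the class question_link as in the full script further down), collecting the links with Requests and BeautifulSoup would look like this; note that it only sees the questions rendered before any scrolling, which is exactly the limitation addressed next:

import requests
from bs4 import BeautifulSoup

# Hypothetical topic root page; substitute the topic you are interested in
root = "https://www.quora.com/topic/TV"
res = requests.get(root)
soup = BeautifulSoup(res.text, 'lxml')

# Question links on the root page carry the class 'question_link'
# (hrefs may be relative, so prepend the domain if needed)
qlinks = [a.get('href') for a in soup.find_all('a', {'class': 'question_link'})]
print(qlinks)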
Also, I want to scrape multiple questions (as many as 40 or so)
For this, you need to mimic scrolling, so that more and more questions load as you move down the page. You can't use Requests and BeautifulSoup directly to execute events like scrolling; for that you need a browser-automation library such as Selenium.
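The scrolling part on its own boils down to sending PAGE_DOWN keystrokes to the page body in a loop, roughly like this (a minimal sketch using a hypothetical topic URL; the full script below puts it in context):

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.quora.com/topic/TV")   # hypothetical topic root page

elem = browser.find_element_by_tag_name("body")
for _ in range(50):                  # each PAGE_DOWN triggers more questions to load
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)                  # give the page time to fetch new content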
Note: install Selenium using pip install -U selenium.
If you are using Windows, the driver path looks like executable_path='/path/to/chromedriver.exe'.
Here is a full script in Python that uses Selenium to fulfill your requirements. It asks for a link (you can uncomment the lines for a second one), then scrapes the question, the number of answers, the tags, and the first 4 answers, and saves them in CSV format.
Keys.PAGE_DOWN is used to mimic the scroll button.
The different details are appended to the row list and, at the end, written to the CSV file.
You can change the value of the no_of_pagedowns variable to increase the number of scrolls.
import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Write the CSV header (the trailing newline keeps the first row off the header line)
with open('submission.csv', 'w') as file:
    file.write("Question,No. of answers,Tags,4 answers\n")

link1 = input("Enter first link")
#link2 = input("Enter second link")
manylinks = list()
manylinks.append(link1)
#manylinks.append(link2)

for olink in manylinks:
    qlinks = list()
    browser = webdriver.Chrome(executable_path='/Users/ajay/Downloads/chromedriver')
    browser.get(olink)
    time.sleep(1)
    elem = browser.find_element_by_tag_name("body")

    # Scroll down to load more questions on the topic page
    no_of_pagedowns = 50
    while no_of_pagedowns:
        elem.send_keys(Keys.PAGE_DOWN)
        time.sleep(0.2)
        no_of_pagedowns -= 1

    # Collect the links of all questions loaded so far
    post_elems = browser.find_elements_by_xpath("//a[@class='question_link']")
    for post in post_elems:
        qlink = post.get_attribute("href")
        print(qlink)
        qlinks.append(qlink)

    for qlink in qlinks:
        append_status = 0
        row = list()
        browser.get(qlink)
        time.sleep(1)
        elem = browser.find_element_by_tag_name("body")

        no_of_pagedowns = 1
        while no_of_pagedowns:
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
            no_of_pagedowns -= 1

        # Question name
        qname = browser.find_elements_by_xpath("//div[@class='question_text_edit']")
        for q in qname:
            print(q.text)
            row.append(q.text)

        # Answer count (e.g. "12 Answers" -> 12)
        no_ans = browser.find_elements_by_xpath("//div[@class='answer_count']")
        for count in no_ans:
            append_status = int(count.text[:2])
            row.append(count.text)

        # Tags
        tags = browser.find_elements_by_xpath("//div[@class='header']")
        tag_field = list()
        for t in tags:
            tag_field.append(t.text)
        row.append(tag_field)

        # First 4 answers
        all_ans = browser.find_elements_by_xpath("//div[@class='ui_qtext_expanded']")
        i = 1
        answer_field = list()
        for post in all_ans:
            if i <= 4:
                i = i + 1
                answer_field.append(post.text)
            else:
                break
        row.append(answer_field)

        print('append_status', append_status)
        # Only write the row if the question has at least 4 answers
        if append_status >= 4:
            with open('submission.csv', 'a') as file:
                writer = csv.writer(file)
                writer.writerow(row)
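One caveat if you run this today: the find_element_by_* helpers used above were removed in Selenium 4, and executable_path has been replaced by a Service object (Selenium 4.6+ can even locate chromedriver on its own). A sketch of the equivalent calls on Selenium 4:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()    # Selenium 4.6+ locates chromedriver by itself
browser.get("https://www.quora.com/topic/TV")   # hypothetical topic page

elem = browser.find_element(By.TAG_NAME, "body")
elem.send_keys(Keys.PAGE_DOWN)
post_elems = browser.find_elements(By.XPATH, "//a[@class='question_link']")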