python selenium csv webdriver google-scholar

How to save data from multiple pages using webdriver into a single CSV


So I'm trying to save data from Google Scholar using Selenium (webdriver), and so far I can print the data that I want, but when I save it into a CSV it only saves the first page.

from selenium import webdriver
from selenium.webdriver.common.by import By
# Import statements for explicit wait
from selenium.webdriver.support.ui import WebDriverWait as W
from selenium.webdriver.support import expected_conditions as EC
import time
import csv
from csv import writer

exec_path = r"C:\Users\gvste\Desktop\proyecto\chromedriver.exe"
URL = r"https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=8337597745079551909"

button_locators = ['//*[@id="gsc_authors_bottom_pag"]/div/button[2]', '//*[@id="gsc_authors_bottom_pag"]/div/button[2]','//*[@id="gsc_authors_bottom_pag"]/div/button[2]']
wait_time = 3
driver = webdriver.Chrome(executable_path=exec_path)
driver.get(URL)
wait = W(driver, wait_time)
#driver.maximize_window()
for j in range(len(button_locators)):
    button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators[j])))

address = driver.find_elements_by_class_name("gsc_1usr")

#for post in address:
#    print(post.text)
time.sleep(4)

with open('post.csv','a') as s:
    for i in range(len(address)):

        addresst = address[i].text.replace('\n',',')
        s.write(addresst+ '\n')

button_link.click()
time.sleep(4)

    #driver.quit()

Solution

  • You only get the first page's data because your program stops after it clicks the next-page button once. You have to put all of that in a for loop.

    Notice I wrote `range(7)` because I know there are 7 pages to open; in reality we should never do that. Imagine if we had thousands of pages. We should add some logic to check whether the "next page" button still exists (or is still enabled) and loop until it doesn't.

    exec_path = r"C:\Users\gvste\Desktop\proyecto\chromedriver.exe"
    URL = r"https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=8337597745079551909"
    
    button_locators = "/html/body/div/div[8]/div[2]/div/div[12]/div/button[2]"
    wait_time = 3
    driver = webdriver.Chrome(executable_path=exec_path)
    driver.get(URL)
    wait = W(driver, wait_time)
    
    time.sleep(4)
    
    # 7 pages. In reality, we should get this number programmatically 
    for page in range(7):
    
        # read data from new page
        address = driver.find_elements_by_class_name("gsc_1usr")
    
        # write to file
        with open('post.csv','a') as s:
            for i in range(len(address)):
                addresst = address[i].text.replace('\n',',')
                s.write(addresst+ '\n')
    
        # find and click next page button
        button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))
        button_link.click()
        time.sleep(4)
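If you don't want to hard-code the page count, one option is to loop until the next-page button is disabled. This is a sketch, not a tested Google Scholar implementation: it assumes the site marks the exhausted button with a `disabled` attribute (check in your browser's dev tools), and `grab_rows` is a hypothetical stand-in for whatever per-page scraping you do.

```python
def scrape_all_pages(driver, button_xpath, grab_rows):
    """Collect rows from every page, clicking 'next' until it is disabled."""
    rows = []
    while True:
        rows.extend(grab_rows(driver))        # read the current page
        button = driver.find_element_by_xpath(button_xpath)
        if button.get_attribute("disabled"):  # no more pages
            break
        button.click()
        # in real code, wait here for the new page to load before reading it
    return rows
```

With the script above you would call it along the lines of `scrape_all_pages(driver, button_locators, lambda d: [e.text for e in d.find_elements_by_class_name("gsc_1usr")])`.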
    

    Also, in the future you should look to change all these `time.sleep` calls to `wait.until`, because sometimes your page loads quicker and the program could do its job faster. Or even worse, your network might lag, and that would break your script.
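    For illustration, the idea behind `wait.until` is just polling a condition until it holds or a timeout expires, instead of sleeping a fixed number of seconds; Selenium's `WebDriverWait` does roughly this (plus ignoring certain exceptions while polling). A minimal stdlib-only sketch of the pattern:

```python
import time

def wait_until(condition, timeout=3.0, poll=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:                # condition met: hand back its result
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %ss" % timeout)
        time.sleep(poll)          # back off briefly before retrying
```

    In the actual script you would not need this helper at all: replacing the `time.sleep(4)` before reading each page with something like `W(driver, wait_time).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "gsc_1usr")))` uses Selenium's own implementation of the same idea.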