python, web-scraping, urllib2, google-scholar

Why am I getting repetitive output while trying to scrape data from Google Scholar?


I am trying to scrape the PDF links from the search results on Google Scholar. I set a page counter that changes the start parameter in the URL for each results page, but after the first eight output links, the same links are printed repeatedly.

#!/usr/bin/env python
from bs4 import BeautifulSoup
import urllib2


# modifying the url as per page
urlCounter = 0
while urlCounter <= 30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url, None, {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10

    recordCount = 0
    while recordCount <= 9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount + 1

        # printing the links
        for link in soup.find_all('div', id=finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
        for link in soup1.find_all('a'):
            print(link.get('href'))

Solution

  • Change the following line in your code:

    finRecord = recordPart1 + str(recordCount)
    

    to:

    finRecord = recordPart1 + str(recordCount + urlCounter - 10)
    

    The real problem is that the div ids on the first page are gs_ggsW[0-9], but the ids on the second page are gs_ggsW[10-19], so BeautifulSoup finds no matching divs on the second page.
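
    For example, the corrected inner loop builds each id from the absolute record index (a minimal sketch; the -10 compensates for urlCounter already having been advanced to the next page):

    recordCount = 0
    while recordCount <= 9:
        # absolute index: 0-9 on page 1, 10-19 on page 2, ...
        finRecord = "gs_ggsW" + str(recordCount + urlCounter - 10)
        recordCount = recordCount + 1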

    There is a second pitfall: Python's variable scoping may confuse people coming from other languages, like Java. Names bound inside a for loop still exist after the loop finishes, so when find_all matches nothing on the second page, link (and soup1, which is built from it) still refer to the last record of the first page, and the loop below simply re-prints the first page's links:

    for link in soup1.find_all('a'):
        print(link.get('href'))
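
    A quick standalone illustration of that scoping behaviour (not part of the scraper):

    for i in range(3):
        last = i
    print(i, last)  # prints "2 2": both names survive the loop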
    

    Update:

    Google does not provide PDF download links for every paper, so you can't rely on the record id to match each paper's link. You can use a CSS selector to match all of the links at once.

    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10
    for link in soup.select('div.gs_ttss a'):
        print(link.get('href'))
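
    Putting it together, the whole page loop then becomes (a minimal sketch; the div.gs_ttss class comes from Scholar's markup at the time of writing and may change if Google updates the page):

    #!/usr/bin/env python
    from bs4 import BeautifulSoup
    import urllib2

    urlCounter = 0
    while urlCounter <= 30:
        url = ("http://scholar.google.com/scholar?start=" + str(urlCounter) +
               "&q=%22entity+resolution%22&hl=en&as_sdt=0,4")
        page = urllib2.Request(url, None, {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
        resp = urllib2.urlopen(page)
        html = resp.read()
        soup = BeautifulSoup(html)
        urlCounter = urlCounter + 10
        # one selector matches every PDF link on the page; no per-record ids needed
        for link in soup.select('div.gs_ttss a'):
            print(link.get('href'))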