Tags: web-scraping, beautifulsoup, google-scholar

Problems in retrieving Google Scholar results with BeautifulSoup


I am continuing the analysis started in this previous question of mine. I have obtained information about specific working paper publications in a data frame composed of four columns: year of publication, order of publication (the publication's order within each year, not very useful here), title, and author. I now want to use this data frame to scrape Google Scholar and retrieve the number of citations for each paper. Because some paper titles are rather generic, in some cases the first Google Scholar result is not actually the one I am interested in. Therefore, to perform a more tailored search, I include both the title and the author(s) of each paper when building the search URL. I followed this thread in writing the code.

Note: because real names are needed to perform this scraping, I preferred not to create an example data frame. Instead, I have uploaded the .csv file to my GitHub.

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

url = 'https://raw.githubusercontent.com/nicolacaravaggio/working_paper_roma3/master/rm3_working_paper_list.csv'
df = pd.read_csv(url, error_bad_lines = False)

papers = [] 

# Build one search string per paper: "<title> <author>"
for index, row in df.iterrows():
    papers.append(row.title + ' ' + row.author)

title_list_gs = []
citations_list_gs = []

with requests.Session() as s:

    for paper in papers:

        sleep(randint(1, 3))  # brief random delay between requests

        url = 'https://scholar.google.com/scholar?q=' + paper + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'html.parser')

        # Title of the first result, if any
        title_gs = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        title_list_gs.append(title_gs)

        # "Cited by N" link of the first result, if any
        citations_gs = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        citations_list_gs.append(citations_gs)

        print('Title:', title_gs, '; Citations:', citations_gs)

However, the result I get from this script is just a list of:

Title: No title ; Citations: No citation count

I am not sure whether the problem is in my clumsy script (probably) or in the fact that Google prevents me from scraping too much from Scholar. In fact, even the script that I used as a starting point in this thread does not always return the expected outcome. I hope someone can give me some suggestions. Thank you in advance.
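
One quick way to tell which case applies is to fetch a single result page and look for Google's block banner in the raw HTML; the banner text is the same string the solution below tests for. A minimal check, reusing the papers list built above:

import requests

test_url = 'https://scholar.google.com/scholar?q=' + papers[0] + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
r = requests.get(test_url)
print(r.status_code)  # Scholar may still answer 200 while serving the block page

# Google serves this banner (or a CAPTCHA page) once it flags a scraper
if 'Our systems have detected unusual traffic from your computer network' in r.text:
    print('Blocked: Google is returning its bot-detection page, not results.')
else:
    print('Not blocked: the selectors or the query are probably at fault.')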


Solution

  • It sounds like you are triggering Scholar's bot detection. From personal experience scraping Google Scholar, a 45-second delay between requests is enough to avoid CAPTCHA and bot detection. I have had a scraper running for more than 3 days without detection. If you do get flagged, waiting about 2 hours is enough to start again. Here is an extract from my code:

    import logging
    import re
    import time
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    logger = logging.getLogger(__name__)


    class ScholarScrape():
        def __init__(self):
            self.page = None
            self.last_url = None
            self.last_time = time.time()
            # ConfigFile is the author's own configuration helper
            self.min_time_between_scrape = int(ConfigFile.instance().config.get('scholar', 'bot_avoidance_time'))
            self.header = {'User-Agent': ConfigFile.instance().config.get('scholar', 'user_agent')}
            self.session = requests.Session()
    
        def search(self, query=None, year_lo=None, year_hi=None, title_only=False, publication_string=None, author_string=None, include_citations=True, include_patents=True):
            url = self.get_url(query, year_lo, year_hi, title_only, publication_string, author_string, include_citations, include_patents)
            while True:
                wait_time = self.min_time_between_scrape - (time.time() - self.last_time)
                if wait_time > 0:
                    logger.info("Delaying search by {} seconds to avoid bot detection.".format(wait_time))
                    time.sleep(wait_time)
                self.last_time = time.time()
                logger.info("SCHOLARSCRAPE: " + url)
                self.page = BeautifulSoup(self.session.get(url, headers=self.header).text, 'html.parser')
                self.last_url = url
    
                if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                    raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")
    
                return
    
        def get_url(self, query=None, year_lo=None, year_hi=None, title_only=False, publication_string=None, author_string=None, include_citations=True, include_patents=True):
            base_url = "https://scholar.google.com.au/scholar?"
            url = base_url + "as_q=" + urllib.parse.quote(query)
    
            if year_lo is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_lo))):
                url += "&as_ylo=" + str(year_lo)
    
            if year_hi is not None and bool(re.match(r'.*([1-3][0-9]{3})', str(year_hi))):
                url += "&as_yhi=" + str(year_hi)
    
            # Search in the title only vs anywhere in the article
            if title_only:
                url += "&as_occt=title"
            else:
                url += "&as_occt=any"
    
            if publication_string is not None:
                url += "&as_publication=" + urllib.parse.quote('"' + str(publication_string) + '"')
    
            if author_string is not None:
                url += "&as_sauthors=" + urllib.parse.quote('"' + str(author_string) + '"')
    
            if include_citations:
                url += "&as_vis=0"
            else:
                url += "&as_vis=1"
    
            if include_patents:
                url += "&as_sdt=0"
            else:
                url += "&as_sdt=1"
    
            return url
    
        def get_results_count(self):
            # The results count appears in one of the "gs_ab_mdw" header divs
            e = self.page.findAll("div", {"class": "gs_ab_mdw"})
            try:
                item = e[1].text.strip()
            except IndexError as ex:
                if "Our systems have detected unusual traffic from your computer network" in str(self.page):
                    raise BotDetectionException("Google has blocked this computer for a short time because it has detected this scraping script.")
                else:
                    raise ex
    
            if self.has_numbers(item):
                return self.get_results_count_from_soup_string(item)
            for item in e:
                item = item.text.strip()
                if self.has_numbers(item):
                    return self.get_results_count_from_soup_string(item)
            return 0
    
        @staticmethod
        def get_results_count_from_soup_string(element):
            # Parse "About 1,234 results" or "1,234 results" into "1234"
            if "About" in element:
                num = element.split(" ")[1].strip().replace(",","")
            else:
                num = element.split(" ")[0].strip().replace(",","")
            return num
    
        @staticmethod
        def has_numbers(input_string):
            return any(char.isdigit() for char in input_string)
    
    
    class BotDetectionException(Exception):
        pass
    
    if __name__ == "__main__":
        s = ScholarScrape()
        s.search(**{
            "query": "\"policy shaping\"",
            # "publication_string": "JMLR",
            "author_string": "gilboa",
            "year_lo": "1995",
            "year_hi": "2005",
        })
        x = s.get_results_count()
        print(x)
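
    To tie this back to the question: with a long enough delay, a percent-encoded query, and a realistic User-Agent header, the loop from the question can stay under the detection threshold. A minimal sketch along those lines (the 45-second delay and the block-page check come from the code above; the selectors come from the question; the User-Agent string is only an example):

    import time
    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}

    for paper in papers:
        url = 'https://scholar.google.com/scholar?q=' + urllib.parse.quote(paper) + '&hl=en'
        soup = BeautifulSoup(session.get(url, headers=headers).text, 'html.parser')

        # Stop as soon as Scholar serves its bot-detection page
        if 'unusual traffic from your computer network' in str(soup):
            print('Bot detection triggered; wait a couple of hours before retrying.')
            break

        title = soup.select_one('h3.gs_rt a')
        cited = soup.select_one('a:contains("Cited by")')
        print(title.text if title else 'No title', ';',
              cited.text if cited else 'No citation count')

        time.sleep(45)  # 45 seconds between requests, per the advice above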
    

    This question might have more information to help you.