web-scraping beautifulsoup captcha google-scholar

Scraping a large number of Google Scholar pages by URL


I'm trying to get the full author list of every publication by an author on Google Scholar using BeautifulSoup. Since the author's profile page shows only a truncated list of authors for each paper, I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few requests.

Is there a way to avoid the CAPTCHA (e.g. pausing for 3 seconds after every request)? Or a way to make the original Google Scholar profile page show the full author list?


Solution

  • I recently faced a similar issue. I at least eased my collection process with a simple workaround: a random and fairly long sleep between requests, like this:

    import time
    import numpy as np
    
    time.sleep(np.random.uniform(5, 30))  # pause for a random 5 to 30 seconds
    

    If you have enough time (say, you launch your parser overnight), you can make the pause even longer (3 or more times longer) to make a CAPTCHA much less likely.

    Furthermore, you can randomly rotate the User-Agent header in your requests to the site, which will mask your scraper even more.
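
To put the two ideas together, here is a minimal sketch using only the standard library. The names `USER_AGENTS`, `random_delay`, `build_request` and `fetch_all` are my own, and the User-Agent strings are just illustrative placeholders, so swap in values that suit your setup. It only spaces out requests and varies the header; it is not guaranteed to avoid a CAPTCHA.

```python
import random
import time
import urllib.request

# Hypothetical pool of User-Agent strings (illustrative values only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_delay(min_s=5.0, max_s=30.0):
    """Return a delay in seconds drawn uniformly from [min_s, max_s]."""
    return random.uniform(min_s, max_s)


def build_request(url):
    """Build a request carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def fetch_all(urls, min_s=5.0, max_s=30.0,
              sleep=time.sleep, opener=urllib.request.urlopen):
    """Fetch each URL in turn, sleeping a random interval between requests.

    `sleep` and `opener` are injectable so the loop can be tried offline.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one.
            sleep(random_delay(min_s, max_s))
        with opener(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

For an overnight run you could call it as `fetch_all(paper_urls, min_s=15, max_s=90)`, where `paper_urls` is whatever list of paper links you collected from the profile page, and then pass each returned page to BeautifulSoup as before.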