python, python-3.x, web-scraping, beautifulsoup, google-scholar

Get year of first publication from Google Scholar


I am scraping data from Google Scholar using bs4 and urllib. I am trying to get the first year in which an article was published. For example, from this page I am trying to get the year 1996. The year can be read from the bar chart, but only after the bar chart is clicked. I have written the following code, but it prints the year that is visible before the bar chart is clicked.

from bs4 import BeautifulSoup
import urllib.request

url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
year = soup.find('span', {"class": "gsc_g_t"})
print(year)

Solution

  • The chart information comes from a different request, this one. From that response you can get the information you want with the following XPath:

    '//span[@class="gsc_g_t"][1]/text()'
    

    or, with BeautifulSoup:

    soup.find('span', {"class": "gsc_g_t"}).text
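
To illustrate the selector without hitting Google Scholar, here is a minimal sketch that runs against an inline HTML fragment. The fragment is a simplified, hypothetical stand-in for the chart response (the real markup may differ), and `first_chart_year` is a helper name made up for this example. It uses the built-in `html.parser` so that lxml is not required, and it takes the smallest year label instead of relying on the labels being in order.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the chart response.
# The real response from Google Scholar may be structured differently.
SAMPLE_CHART_HTML = """
<div class="gsc_md_hist_b">
  <span class="gsc_g_t">1996</span>
  <span class="gsc_g_t">1997</span>
  <span class="gsc_g_t">1998</span>
</div>
"""


def first_chart_year(html):
    """Return the earliest year label found in the chart markup, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect every year label, not only the first match.
    labels = soup.find_all("span", class_="gsc_g_t")
    years = []
    for label in labels:
        text = label.get_text(strip=True)
        # Skip anything that is not a plain number.
        if text.isdigit():
            years.append(int(text))
    return min(years) if years else None


print(first_chart_year(SAMPLE_CHART_HTML))
```

With real data you would pass the body of the chart request to `first_chart_year` instead of the sample string. Returning `None` when no label is found avoids the `AttributeError` that `soup.find(...).text` raises when `find` matches nothing.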