Tags: python, beautifulsoup, python-requests, google-scholar

Extract the original paper links from Google Scholar author profiles


I put together the following Python code to get the links of the papers published by an arbitrary author (from Google Scholar):

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

def fetch_scholar_links_from_url(url: str) -> pd.DataFrame:
 
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
    }
    s = requests.Session()
    s.headers.update(headers)
    r = s.post(url, data={'json': '1'})
    soup = bs(r.json()['B'], 'html.parser')
    works = [
        ('https://scholar.google.com' + x.get('href'))
        for x in soup.select('a')
        if 'javascript:void(0)' not in x.get('href') and len(x.get_text()) > 7
    ]
    df = pd.DataFrame(works, columns=['Link'])
    return df

url = 'https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en'
df_links = fetch_scholar_links_from_url(url)
print(df_links)

So far so good (even though this code only returns the first 20 papers). However, the links extracted from the Google Scholar author page are not the original journal links but Google Scholar links themselves (i.e. "nested links"). I would therefore like to visit each extracted Google Scholar link and get the original journal link by re-applying the same link-extraction function.

However, if I perform the same function for extracting the links of the papers (this is an example with the first link in the extracted list of links),

url2 = df_links.iloc[0]['Link']
df_links_2 = fetch_scholar_links_from_url(url2)
print(df_links_2)

I get the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/my_username/my_folder/collect_urls_papers_per_author.py", line 73, in <module>
    df_links_2 = fetch_scholar_links_from_url(url2)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/my_username/my_folder/collect_urls_papers_per_author.py", line 55, in fetch_scholar_links_from_url
    soup = bs(r.json()['B'], 'html.parser')
              ^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/requests/models.py", line 978, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I have tried several ways to fix this error and get the original links of the papers, but without success. Do you have suggestions on how to fix it?


Just for clarity, in my example,

url2 = df_links.iloc[0]['Link']

corresponds to

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC

And, with the commands

df_links_2 = fetch_scholar_links_from_url(url2)
print(df_links_2)

I would expect to get the following link...

https://proceedings.neurips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html

...which is what you would get manually by simply clicking on the paper's title in

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC



Solution

  • TL;DR: I know you expect Python code, but Google blocked most of my attempts, so to get the job done I used a combination of other tools and Python. Scroll down to the bottom to view the results.

    Also, I'm on macOS and you might not have the following tools:

    - curl
    - jq
    - ripgrep
    - awk
    - sed
    

    The final Python script uses the output of the first, shell-tools-only script (referred to later as the #1 script) to process the citation links and get the paper links.


    I'm pretty sure your code is getting flagged by Google's anti-bot systems.

    My guess is that Google uses some sophisticated techniques to detect bots, such as the order of headers, the presence of certain headers, the behavior of the HTTP client or even SSL behavior.
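
    You can see part of that fingerprint yourself: requests adds its own default headers on top of whatever you set, so what the server sees never quite looks like curl. A quick way to inspect what actually goes out (a sketch, nothing Scholar-specific):

    import requests

    s = requests.Session()
    prepared = s.prepare_request(requests.Request("GET", "https://scholar.google.ca/"))
    # requests sends defaults like Accept-Encoding and Connection, plus a
    # python-requests User-Agent unless you override it
    print(dict(prepared.headers))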

    I've tried your code and a lot of modifications of it. I tried with and without headers. I even used a VPN. None of that worked.

    I always got picked up by Google's systems and flagged.

    But...

    When I used a barebones curl request (no custom headers) I got a valid response every single time. Even when making it right after running the Python script, which got blocked. Same IP, no VPN. curl worked, Python got flagged.
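
    A quick way to reproduce that comparison yourself (a sketch; expect requests to come back with a block page while curl returns the profile HTML):

    import subprocess

    import requests

    url = "https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en"

    r = requests.get(url)
    print("requests:", r.status_code, len(r.text))

    out = subprocess.run(["curl", "-s", url], capture_output=True)
    print("curl:", out.returncode, len(out.stdout))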

    So, I thought I could mimic curl.

    import http.client
    import ssl
    from urllib.parse import urlencode
    
    from bs4 import BeautifulSoup
    
    
    def parse_page(content):
        soup = BeautifulSoup(content, "lxml")
        return [a['href'] for a in soup.select('.gsc_a_at')]
    
    payload = {
        "user": "iYN86KEAAAAJ",
        "hl": "en",
    }
    
    params = urlencode(payload)
    
    # Create an SSL context with certificate verification disabled
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection("scholar.google.ca", context=context)
    
    # Define the headers
    headers = {
        'User-Agent': 'curl/8.7.1',
        'Accept': '*/*',
    }
    
    conn.request("GET", f"/citations?{params}", headers=headers)
    response = conn.getresponse()
    data = response.read()
    
    try:
        data = data.decode("utf-8")
    except UnicodeDecodeError:
        data = data.decode("ISO-8859-1")
    
    print("\n".join(parse_page(data)))
    
    conn.close()
    

    And it worked:

    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ZeXyd9-uunAC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:AXPGKjj_ei8C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:KxtntwgDAa4C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:MXK_kJrjxJIC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:_Qo2XoVZTnwC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:RHpTSmoSYBkC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:4JMBOYKVnBMC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:NhqRSupF_l8C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:IWHjjKOFINEC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:bEWYMUwI8FkC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Mojj43d5GZwC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:7PzlFSSx8tAC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:fPk4N6BV_jEC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ufrVoPGSRksC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:maZDTaKrznsC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:vRqMK49ujn8C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:LkGwnXOMwfcC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:R3hNpaxXUhUC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Y0pCki6q_DkC
    

    So far so good but I still needed to get all the other citation links, right?

    And that's where all the problems really began.

    It turns out that the first request is a GET (first 20 links) but then it becomes a POST, as you have it in your code.

    But to get it working you need to figure out the pagination.

    There's some funky JS code in the HTML source that does just that.

    function Re() {
        var a = window.location.href.split("#")[0];
        Pe = a.replace(/([?&])(cstart|pagesize)=[^&]*/g, "$1");
        Qe = Math.max(+a.replace(/.*[?&]cstart=([^&]*).*/, "$1") || 0, 0);
        Oe = +a.replace(/.*[?&]pagesize=([^&]*).*/, "$1") || 0;
        Oe = Math.max(Math.min(Oe, 100), 20);
        Q("#gsc_bpf_more", "click", Ne)
    }
    
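    In plain terms: the page reads cstart and pagesize from the URL and clamps pagesize to the 20–100 range, which is consistent with the 0/20, 20/80, then 100-at-a-time schedule the bash scripts below use. Here's a small Python sketch of that schedule (the function name is mine):

    def pagination_schedule(max_results: int):
        """Yield (cstart, pagesize) pairs: 0/20, 20/80, then 100-row pages."""
        cstart = 0
        while cstart < max_results:
            pagesize = 20 if cstart == 0 else (80 if cstart == 20 else 100)
            yield cstart, pagesize
            cstart += pagesize

    print(list(pagination_schedule(300)))
    # [(0, 20), (20, 80), (100, 100), (200, 100)]
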

    And there's a snippet that handles the POST request.

    function cc(a, b, c) {
        var d = new XMLHttpRequest;
        d.onreadystatechange = function() {
            if (d.readyState == 4) {
                var e = d.status
                  , f = d.responseText
                  , h = d.getResponseHeader("Content-Type")
                  , k = d.responseURL
                  , m = window.location
                  , l = m.protocol;
                m = "//" + m.host + "/";
                k && k.indexOf(l + m) && k.indexOf("https:" + m) && (e = 0,
                h = f = "");
                c(e, f, h || "")
            }
        };
        d.open(b ? "POST" : "GET", a, !0);
        d.setRequestHeader("X-Requested-With", "XHR");
        b && d.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
        b ? d.send(b) : d.send();
    }

    But (again!)...

    Getting this working in Python turned out to be pretty cumbersome and error-prone. I managed to get the JSON response but got stuck on decoding issues.
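
    For reference, this is roughly the request shape that JS implies, reusing the curl disguise from before. It's a sketch only; decoding the response body is exactly where my attempts fell apart:

    import http.client

    conn = http.client.HTTPSConnection("scholar.google.ca")
    conn.request(
        "POST",
        "/citations?user=iYN86KEAAAAJ&hl=en&cstart=20&pagesize=80",
        body="json=1",  # same form body your requests code sends
        headers={
            "User-Agent": "curl/8.7.1",
            "Accept": "*/*",
            "X-Requested-With": "XHR",  # set by the JS snippet above
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
    response = conn.getresponse()
    print(response.status)
    conn.close()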

    So... I thought I'd give curl another go.

    I cooked up a simple bash script.

    Here's the breakdown:

    #!/bin/bash
    
    base_url="https://scholar.google.ca/"
    user="iYN86KEAAAAJ"
    
    # Each line fed to the loop is a "cstart pagesize" pair, matching the
    # pagination schedule above. For every pair: curl POSTs with json=1 and
    # gets JSON back; jq pulls the HTML payload out of the "B" key; rg
    # extracts the href targets; sed unescapes &amp;; awk prepends the base URL.
    echo -e "0 20\n20 80\n100 100\n200 100" | while read cstart pagesize; do
        curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
        | jq -r ".B" \
        | rg -o -r '$1' 'href="\/(.+?)"' \
        | sed 's/&amp;/\&/g' \
        | awk '{print "https://scholar.google.ca/" $0}'
    done
    
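    If you'd rather stay in Python, a rough equivalent of one loop iteration is to shell out to curl and parse the "B" payload with BeautifulSoup. This is a sketch that assumes curl is on your PATH; the href filter is a slightly stricter version of the rg pattern:

    import json
    import subprocess

    from bs4 import BeautifulSoup


    def fetch_citation_links(user: str, cstart: int, pagesize: int) -> list:
        """One iteration of the bash loop: curl POST, then parse the 'B' key."""
        url = (
            f"https://scholar.google.ca/citations"
            f"?user={user}&hl=en&cstart={cstart}&pagesize={pagesize}"
        )
        result = subprocess.run(
            ["curl", "-s", "-X", "POST", url, "-d", "json=1"],
            capture_output=True,
        )
        html = json.loads(result.stdout)["B"]  # the row HTML sits under "B"
        soup = BeautifulSoup(html, "html.parser")
        return [
            "https://scholar.google.ca" + a["href"]
            for a in soup.find_all("a", href=True)
            if a["href"].startswith("/citations")
        ]

    print("\n".join(fetch_citation_links("iYN86KEAAAAJ", 0, 20)))
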

    Running this should give you all the citation links for that particular scholar.

    Output (first 20 links plus 1 from the next page; I truncated the rest on purpose):

    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ZeXyd9-uunAC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:AXPGKjj_ei8C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:KxtntwgDAa4C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:MXK_kJrjxJIC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:_Qo2XoVZTnwC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:RHpTSmoSYBkC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:4JMBOYKVnBMC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:NhqRSupF_l8C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:IWHjjKOFINEC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:bEWYMUwI8FkC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Mojj43d5GZwC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:7PzlFSSx8tAC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:fPk4N6BV_jEC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ufrVoPGSRksC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:maZDTaKrznsC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:vRqMK49ujn8C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:LkGwnXOMwfcC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:R3hNpaxXUhUC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Y0pCki6q_DkC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&cstart=20&pagesize=80&citation_for_view=iYN86KEAAAAJ:5nxA0vEk-isC
    
    and much more...
    

    Here's a slightly modified script that saves the results to a file and keeps looping for more results.

    Let's get all the citation links for the Godfather of AI and Nobel Prize winner Geoffrey Hinton. :)

    Working script #1

    Note: make sure to make this bash file executable with chmod. For example:

    chmod +x no1_script.sh
    

    And use it by running:

    $ ./no1_script.sh

    #!/bin/bash
    
    max_results=1000 # Modify this to get more results for different users (scholars)
    base_url="https://scholar.google.ca/"
    user="JicYPdAAAAAJ" # Geoffrey Hinton
    
    
    cstart=0
    while [ $cstart -lt $max_results ]; do
        # Determine pagesize based on cstart
        if [ $cstart -eq 0 ]; then
            pagesize=20
        elif [ $cstart -eq 20 ]; then
            pagesize=80
        else
            pagesize=100
        fi
    
        curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
        | jq -r ".B" \
        | rg -o -r '$1' 'href="\/(.+?)"' \
        | sed 's/&amp;/\&/g' \
        | awk '{print "https://scholar.google.ca/" $0}' >> "${user}.txt"
    
        # Update cstart for the next iteration
        if [ $cstart -eq 0 ]; then
            cstart=20
        elif [ $cstart -eq 20 ]; then
            cstart=100
        else
            cstart=$((cstart + 100))
        fi
    done
    

    If you cat the output file you'll get 709 links.

    cat JicYPdAAAAAJ.txt | wc -l
         709
    

    Putting it all together:

    Feed the output of the #1 script (the bash script above that collects citation links) into the following Python script. Here I'm using Geoffrey Hinton's citation links, which the #1 script stored in the JicYPdAAAAAJ.txt file.

    import random
    import subprocess
    import time
    
    from bs4 import BeautifulSoup
    
    
    def wait_for(max_wait: int = 10) -> None:
        # randint is inclusive on both ends, so this yields 1..max_wait seconds
        wait = random.randint(1, max_wait)
        print(f"Waiting for {wait} seconds...")
        time.sleep(wait)
    
    # Use the output of the #1 script to get the citation links
    with open("JicYPdAAAAAJ.txt") as f:
        citation_links = [line for line in f.read().splitlines() if line]
    
    # Remove [:3] to visit all citation links
    for link in citation_links[:3]:
        print(f"Visiting {link}...")
        curl = subprocess.run(
            ["curl", link],
            capture_output=True,
        )
        try:
            # The citation page's title links to the original paper (if any)
            anchor = BeautifulSoup(
                curl.stdout.decode("ISO-8859-1"), "html.parser"
            ).select_one("#gsc_oci_title_gg > div > a")
            print(f"Paper link: {anchor.get('href')}")
            wait_for()
        except AttributeError:
            # select_one() returned None: no paper link on this citation page
            print("No paper found.")
            continue
    

    You should get the paper links (not all citations have paper links, though).

    Sample output:

    Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:VN7nJs4JPk0C...
    Paper link: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
    Waiting for 7 seconds...
    Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:C-SEpTPhZ1wC...
    Paper link: https://hal.science/hal-04206682/document
    Waiting for 5 seconds...
    Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:GFxP56DSvIMC...
    No paper found.
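
    If you want the paper links in a DataFrame like your original function returned, collect them in a list inside the loop and build the frame at the end (a sketch; paper_links is a name I made up):

    import pandas as pd

    paper_links = []  # inside the loop: paper_links.append(anchor.get("href"))
    df_papers = pd.DataFrame(paper_links, columns=["Link"])
    df_papers.to_csv("paper_links.csv", index=False)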