I put together the following Python code to get the links of the papers published by a random author (from Google Scholar):
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
def fetch_scholar_links_from_url(url: str) -> pd.DataFrame:
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
    }
    s = requests.Session()
    s.headers.update(headers)
    r = s.post(url, data={'json': '1'})
    soup = bs(r.json()['B'], 'html.parser')
    works = [
        ('https://scholar.google.com' + x.get('href'))
        for x in soup.select('a')
        if 'javascript:void(0)' not in x.get('href') and len(x.get_text()) > 7
    ]
    df = pd.DataFrame(works, columns=['Link'])
    return df
url = 'https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en'
df_links = fetch_scholar_links_from_url(url)
print(df_links)
So far so good (even though this code returns only the first 20 papers). However, since the links extracted from the Google Scholar author page are not the original links to the journals (they are still Google Scholar links, i.e. "nested links"), I would like to visit each of the extracted Google Scholar links and get the original journal links by re-applying the same link-extraction function.
However, if I apply the same link-extraction function (here is an example with the first link in the extracted list of links),
url2 = df_links.iloc[0]['Link']
df_links_2 = fetch_scholar_links_from_url(url2)
print(df_links_2)
I get the following error:
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.12/site-packages/requests/models.py", line 974, in json
return complexjson.loads(self.text, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/lib/python3.12/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/lib/python3.12/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/lib/python3.12/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/my_username/my_folder/collect_urls_papers_per_author.py", line 73, in <module>
df_links_2 = fetch_scholar_links_from_url(url2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/my_username/my_folder/collect_urls_papers_per_author.py", line 55, in fetch_scholar_links_from_url
soup = bs(r.json()['B'], 'html.parser')
^^^^^^^^
File "/opt/anaconda3/lib/python3.12/site-packages/requests/models.py", line 978, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I have tried several times to fix this error, in order to get the original links of the papers, but I have not been able to do so. Do you have suggestions on how to fix it?
Just for clarity, in my example,
url2 = df_links.iloc[0]['Link']
corresponds to
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
And, with the commands
df_links_2 = fetch_scholar_links_from_url(url2)
print(df_links_2)
I would expect to get the following link...
https://proceedings.neurips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
...which you would get, manually, simply by clicking on the paper's title in
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
.
TL;DR: I know you expect Python code, but Google blocked most of my attempts, so to get the job done I used a combination of other tools and Python. Scroll down to the bottom to view the results.
Also, I'm on macOS and you might not have the following tools:
- curl
- jq
- ripgrep
- awk
- sed
The final Python script uses the output of the first, other-tools-only script (later referred to as the #1 script) to process the citation links and get the paper links.
I'm pretty sure your code is getting flagged by Google's anti-bot systems.
My guess is that Google uses some sophisticated techniques to detect bots, such as the order of the headers, the presence of certain headers, the behavior of the HTTP client, or even SSL/TLS behavior.
I've tried your code and a lot of modifications of it. I tried with and without headers. I even used a VPN. None of that worked.
I always got picked up by Google's systems and flagged.
But...
When I used a barebones curl request (no custom headers) I got a valid response every single time. Even when making it right after running the Python script, which got blocked. Same IP, no VPN. curl worked, Python got flagged.
So, I thought I could mimic curl.
import http.client
import ssl
from urllib.parse import urlencode
from bs4 import BeautifulSoup
def parse_page(content):
    soup = BeautifulSoup(content, "lxml")
    return [a['href'] for a in soup.select('.gsc_a_at')]

payload = {
    "user": "iYN86KEAAAAJ",
    "hl": "en",
}
params = urlencode(payload)

# Create a connection with SSL certificate verification disabled
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
conn = http.client.HTTPSConnection("scholar.google.ca", context=context)

# Define the headers
headers = {
    'User-Agent': 'curl/8.7.1',
    'Accept': '*/*',
}

conn.request("GET", f"/citations?{params}", headers=headers)
response = conn.getresponse()
data = response.read()

try:
    data = data.decode("utf-8", errors="ignore")
except UnicodeDecodeError:
    data = data.decode("ISO-8859-1")

print("\n".join(parse_page(data)))
conn.close()
And it worked:
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ZeXyd9-uunAC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:AXPGKjj_ei8C
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:KxtntwgDAa4C
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:MXK_kJrjxJIC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:_Qo2XoVZTnwC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:RHpTSmoSYBkC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:4JMBOYKVnBMC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:NhqRSupF_l8C
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:IWHjjKOFINEC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:bEWYMUwI8FkC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Mojj43d5GZwC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:7PzlFSSx8tAC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:fPk4N6BV_jEC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ufrVoPGSRksC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:maZDTaKrznsC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:vRqMK49ujn8C
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:LkGwnXOMwfcC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:R3hNpaxXUhUC
/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Y0pCki6q_DkC
So far so good, but I still needed to get all the other citation links, right?
And that's where all the problems really began.
It turns out that the first request is a GET (first 20 links) but then it becomes a POST, as you have it in your code.
But to get it working you need to figure out the pagination.
There's some funky JS code in the HTML source that does just that.
function Re() {
    var a = window.location.href.split("#")[0];
    Pe = a.replace(/([?&])(cstart|pagesize)=[^&]*/g, "$1");
    Qe = Math.max(+a.replace(/.*[?&]cstart=([^&]*).*/, "$1") || 0, 0);
    Oe = +a.replace(/.*[?&]pagesize=([^&]*).*/, "$1") || 0;
    Oe = Math.max(Math.min(Oe, 100), 20);
    Q("#gsc_bpf_more", "click", Ne)
}
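In short, the page asks for 20 results first, then 80, then pages of 100, and pagesize is clamped to the range [20, 100]. Here's a small Python helper of my own (not taken from the page) that yields the same (cstart, pagesize) pairs the scripts below hardcode:

# Hypothetical helper: generate the (cstart, pagesize) pairs matching the
# clamping logic above (pagesize stays within [20, 100]).
def pagination_params(max_results: int = 1000):
    cstart, pagesize = 0, 20
    while cstart < max_results:
        yield cstart, pagesize
        cstart += pagesize
        pagesize = 80 if cstart == 20 else 100

# list(pagination_params(400))
# -> [(0, 20), (20, 80), (100, 100), (200, 100), (300, 100)]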
And there's a snippet that handles the POST request.
function cc(a, b, c) {
    var d = new XMLHttpRequest;
    d.onreadystatechange = function() {
        if (d.readyState == 4) {
            var e = d.status
              , f = d.responseText
              , h = d.getResponseHeader("Content-Type")
              , k = d.responseURL
              , m = window.location
              , l = m.protocol;
            m = "//" + m.host + "/";
            k && k.indexOf(l + m) && k.indexOf("https:" + m) && (e = 0,
            h = f = "");
            c(e, f, h || "")
        }
    };
    d.open(b ? "POST" : "GET", a, !0);
    d.setRequestHeader("X-Requested-With", "XHR");
    b && d.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
    b ? d.send(b) : d.send();
}
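Put together, each extra page is just a POST to /citations with cstart and pagesize in the query string, json=1 in the body, and the two headers that snippet sets. Rebuilt with http.client (same curl-style User-Agent as before), it looks roughly like this sketch; I'm showing it for completeness, not as a tested solution:

# Sketch only: mirrors the XHR above (POST + X-Requested-With + urlencoded body).
import http.client
import json
from urllib.parse import urlencode

def fetch_citation_page(user: str, cstart: int, pagesize: int) -> str:
    params = urlencode({"user": user, "hl": "en", "cstart": cstart, "pagesize": pagesize})
    conn = http.client.HTTPSConnection("scholar.google.ca")
    conn.request(
        "POST",
        f"/citations?{params}",
        body="json=1",
        headers={
            "User-Agent": "curl/8.7.1",
            "Accept": "*/*",
            "X-Requested-With": "XHR",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
    raw = conn.getresponse().read().decode("utf-8", errors="ignore")
    conn.close()
    return json.loads(raw)["B"]  # "B" holds the HTML fragment with the links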
But (again!)...
Getting this working in Python turned out to be pretty cumbersome and error-prone. I managed to get the JSON response but kept failing on decoding issues.
So... I thought I'd give curl another go.
I cooked up a simple bash script.
Here's the breakdown:
- a while loop over the (cstart, pagesize) pairs
- curl with default headers, sending POST requests only
- a {"json": 1} payload, although I'm not sure if it's really needed
- jq to grab the value of B, which is the HTML that contains the links
- rg (ripgrep) to grab the citation links
- sed to replace &amp; with & to get working URL paths
- awk to prepend the base_url to the extracted paths

#!/bin/bash
base_url="https://scholar.google.ca/"
user="iYN86KEAAAAJ"
echo -e "0 20\n20 80\n100 100\n200 100" | while read cstart pagesize; do
    curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
        | jq -r ".B" \
        | rg -o -r '$1' 'href="\/(.+?)"' \
        | sed 's/&amp;/\&/g' \
        | awk '{print "https://scholar.google.ca/" $0}'
done
base_url="https://scholar.google.ca/"
user="iYN86KEAAAAJ"
echo -e "0 20\n20 80\n100 100\n200 100" | while read cstart pagesize; do
curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
| jq -r ".B" \
| rg -o -r '$1' 'href="\/(.+?)"' \
| sed 's/&/\&/g' \
| awk '{print "https://scholar.google.ca/" $0}'
done
Running this should give you all the citation links for that particular scholar.
Output (first 20 links plus 1 from the next page; I redacted it on purpose):
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ZeXyd9-uunAC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:AXPGKjj_ei8C
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:KxtntwgDAa4C
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:MXK_kJrjxJIC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:_Qo2XoVZTnwC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:RHpTSmoSYBkC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:4JMBOYKVnBMC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:NhqRSupF_l8C
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:IWHjjKOFINEC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:bEWYMUwI8FkC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Mojj43d5GZwC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:7PzlFSSx8tAC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:fPk4N6BV_jEC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ufrVoPGSRksC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:maZDTaKrznsC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:vRqMK49ujn8C
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:LkGwnXOMwfcC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:R3hNpaxXUhUC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Y0pCki6q_DkC
https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&cstart=20&pagesize=80&citation_for_view=iYN86KEAAAAJ:5nxA0vEk-isC
and much more...
Here's a slightly modified script where you can save the results to a file and keep looping for even more results.
Let's get all citation links for the Godfather of AI and Nobel Prize winner Geoffrey Hinton. :)
Note: make sure you chmod this bash file. For example:
chmod +x no1_script.sh
And use it by running:
$ ./no1_script.sh
#!/bin/bash
max_results=1000  # Modify this to get more results for different users (scholars)
base_url="https://scholar.google.ca/"
user="JicYPdAAAAAJ"  # Geoffrey Hinton
cstart=0
while [ $cstart -lt $max_results ]; do
    # Determine pagesize based on cstart
    if [ $cstart -eq 0 ]; then
        pagesize=20
    elif [ $cstart -eq 20 ]; then
        pagesize=80
    else
        pagesize=100
    fi
    curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
        | jq -r ".B" \
        | rg -o -r '$1' 'href="\/(.+?)"' \
        | sed 's/&amp;/\&/g' \
        | awk '{print "https://scholar.google.ca/" $0}' >> "${user}.txt"
    # Update cstart for the next iteration
    if [ $cstart -eq 0 ]; then
        cstart=20
    elif [ $cstart -eq 20 ]; then
        cstart=100
    else
        cstart=$((cstart + 100))
    fi
done
If you cat the output file you'll get 709 links.
cat JicYPdAAAAAJ.txt | wc -l
709
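If you'd rather keep this collection step in Python too, here's a rough port that shells out to the same curl calls and lets json plus BeautifulSoup do the work of jq/ripgrep/sed/awk (BeautifulSoup decodes the &amp; entities for you). It's a sketch covering the first 300 results, not something I hammered on like the bash version:

# Hypothetical Python port of the bash approach: same curl invocations, driven
# from Python, with json + BeautifulSoup replacing the jq/ripgrep/sed/awk pipeline.
import json
import subprocess
from bs4 import BeautifulSoup

base_url = "https://scholar.google.ca/"
user = "JicYPdAAAAAJ"

links = []
for cstart, pagesize in [(0, 20), (20, 80), (100, 100), (200, 100)]:
    out = subprocess.run(
        ["curl", "-s", "-X", "POST",
         f"{base_url}citations?user={user}&hl=en&cstart={cstart}&pagesize={pagesize}",
         "-d", "json=1"],
        capture_output=True,
    )
    fragment = json.loads(out.stdout.decode("utf-8", errors="ignore"))["B"]
    soup = BeautifulSoup(fragment, "html.parser")
    links += [base_url.rstrip("/") + a["href"] for a in soup.select(".gsc_a_at")]

print("\n".join(links))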
Feed the output of the first bash script that collects citation links to this script.
Here I'm using Geoffrey Hinton's citation links, which I stored in the JicYPdAAAAAJ.txt file with the #1 script.
import random
import subprocess
import time

from bs4 import BeautifulSoup


def wait_for(max_wait: int = 10) -> None:
    wait = random.randint(1, max_wait + 1)
    print(f"Waiting for {wait} seconds...")
    time.sleep(wait)


# Use the output of the first script to get the citation links
citation_links_file = [
    line for line in open("JicYPdAAAAAJ.txt").read().split("\n") if line
]

# Remove [:3] to visit all citation links
for link in citation_links_file[:3]:
    print(f"Visiting {link}...")
    curl = subprocess.run(
        ["curl", link],
        capture_output=True,
    )
    try:
        soup = (
            BeautifulSoup(curl.stdout.decode("ISO-8859-1"), "html.parser")
            .select_one("#gsc_oci_title_gg > div > a")
        )
        print(f"Paper link: {soup.get('href')}")
        wait_for()
    except AttributeError:
        print("No paper found.")
        continue
You should get the paper links (not all citations have paper links, though).
Sample output:
Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:VN7nJs4JPk0C...
Paper link: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
Waiting for 7 seconds...
Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:C-SEpTPhZ1wC...
Paper link: https://hal.science/hal-04206682/document
Waiting for 5 seconds...
Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:GFxP56DSvIMC...
No paper found.
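And if you want the output in a DataFrame like your original fetch_scholar_links_from_url returns, a small adaptation of the last script (same file, same selector; keep the wait_for throttling for real runs) could look like this:

# Hypothetical wrap-up: collect (citation link, paper link) pairs into a DataFrame.
import subprocess
import pandas as pd
from bs4 import BeautifulSoup

rows = []
# Remove [:3] to visit all citation links (and re-add the wait_for throttling)
for link in [l for l in open("JicYPdAAAAAJ.txt").read().split("\n") if l][:3]:
    html = subprocess.run(["curl", link], capture_output=True).stdout.decode("ISO-8859-1")
    anchor = BeautifulSoup(html, "html.parser").select_one("#gsc_oci_title_gg > div > a")
    rows.append({"Citation": link, "Paper": anchor.get("href") if anchor else None})

df_papers = pd.DataFrame(rows)
print(df_papers)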