Tags: python, beautifulsoup, python-requests, google-scholar

Extract the original paper links from Google Scholar author profiles


I put together the following Python code to get the links of the papers published by an arbitrary author (from Google Scholar):

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

def fetch_scholar_links_from_url(url: str) -> pd.DataFrame:
 
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
    }
    s = requests.Session()
    s.headers.update(headers)
    r = s.post(url, data={'json': '1'})
    soup = bs(r.json()['B'], 'html.parser')
    works = [
        ('https://scholar.google.com' + x.get('href'))
        for x in soup.select('a')
        if 'javascript:void(0)' not in x.get('href') and len(x.get_text()) > 7
    ]
    df = pd.DataFrame(works, columns=['Link'])
    return df

url = 'https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en'
df_links = fetch_scholar_links_from_url(url)
print(df_links)

So far so good (even though this code only returns the first 20 papers). However, the links extracted from the Google Scholar author page are not the original journal links but Google Scholar links themselves (i.e. "nested links"). I would therefore like to visit each extracted Google Scholar link and get the original journal link by re-applying the same link-extraction function.

However, if I perform the same function for extracting the links of the papers (this is an example with the first link in the extracted list of links),

url2 = df_links.iloc[0]['Link']
df_links_2 = fetch_scholar_links_from_url(url2)
print(df_links_2)

I get the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.12/site-packages/requests/models.py", line 974, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/my_username/my_folder/collect_urls_papers_per_author.py", line 73, in <module>
    df_links_2 = fetch_scholar_links_from_url(url2)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/my_username/my_folder/collect_urls_papers_per_author.py", line 55, in fetch_scholar_links_from_url
    soup = bs(r.json()['B'], 'html.parser')
              ^^^^^^^^
  File "/opt/anaconda3/lib/python3.12/site-packages/requests/models.py", line 978, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I have tried several ways to fix this error and get the original links of the papers, but without success. Do you have suggestions on how to fix it?


Just for clarity, in my example,

url2 = df_links.iloc[0]['Link']

corresponds to

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC

And, with the commands

df_links_2 = fetch_scholar_links_from_url(url2)
print(df_links_2)

I would expect to get the following link...

https://proceedings.neurips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html

...which is what you would get manually by simply clicking on the paper's title in

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC



Solution

  • TL;DR: I know you expect Python code, but Google blocked most of my attempts, so to get the job done I used a combination of other tools and Python. Scroll down to the bottom to view the results.

    Also, I'm on macOS and you might not have the following tools:

    - curl
    - jq
    - ripgrep
    - awk
    - sed
    

    The final Python script uses the output of the first, shell-tools-only script (referred to later as the #1 script) to process the citation links and get the paper links.


    I'm pretty sure your code is getting flagged by Google's anti-bot systems.

    My guess is that Google uses some sophisticated techniques to detect bots, such as the order of headers, the presence of certain headers, the behavior of the HTTP client or even SSL behavior.
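
    You can see part of that fingerprint yourself: requests adds its own default headers on top of whatever you set, so what the server sees never quite looks like curl. A quick way to inspect what actually goes out (a sketch, nothing Scholar-specific):

    import requests

    s = requests.Session()
    prepared = s.prepare_request(requests.Request("GET", "https://scholar.google.ca/"))
    # requests sends defaults like Accept-Encoding and Connection, plus a
    # python-requests User-Agent unless you override it
    print(dict(prepared.headers))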

    I've tried your code and a lot of modifications of it. I tried with and without headers. I even used a VPN. None of that worked.

    I always got picked up by Google's systems and flagged.

    But...

    When I used a barebones curl request (no custom headers) I got a valid response every single time. Even when making it right after running the Python script, which got blocked. Same IP, no VPN. curl worked, Python got flagged.
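
    A quick way to reproduce that comparison yourself (a sketch; expect requests to come back with a block page while curl returns the profile HTML):

    import subprocess

    import requests

    url = "https://scholar.google.ca/citations?user=iYN86KEAAAAJ&hl=en"

    r = requests.get(url)
    print("requests:", r.status_code, len(r.text))

    out = subprocess.run(["curl", "-s", url], capture_output=True)
    print("curl:", out.returncode, len(out.stdout))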

    So, I thought I could mimic curl.

    import http.client
    import ssl
    from urllib.parse import urlencode
    
    from bs4 import BeautifulSoup
    
    
    def parse_page(content):
        soup = BeautifulSoup(content, "lxml")
        return [a['href'] for a in soup.select('.gsc_a_at')]
    
    payload = {
        "user": "iYN86KEAAAAJ",
        "hl": "en",
    }
    
    params = urlencode(payload)
    
    # Create an SSL context with certificate verification disabled
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection("scholar.google.ca", context=context)
    
    # Define the headers
    headers = {
        'User-Agent': 'curl/8.7.1',
        'Accept': '*/*',
    }
    
    conn.request("GET", f"/citations?{params}", headers=headers)
    response = conn.getresponse()
    data = response.read()
    
    try:
        data = data.decode("utf-8")
    except UnicodeDecodeError:
        data = data.decode("ISO-8859-1")
    
    print("\n".join(parse_page(data)))
    
    conn.close()
    

    And it worked:

    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ZeXyd9-uunAC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:AXPGKjj_ei8C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:KxtntwgDAa4C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:MXK_kJrjxJIC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:_Qo2XoVZTnwC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:RHpTSmoSYBkC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:4JMBOYKVnBMC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:NhqRSupF_l8C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:IWHjjKOFINEC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:bEWYMUwI8FkC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Mojj43d5GZwC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:7PzlFSSx8tAC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:fPk4N6BV_jEC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ufrVoPGSRksC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:maZDTaKrznsC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:vRqMK49ujn8C
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:LkGwnXOMwfcC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:R3hNpaxXUhUC
    /citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Y0pCki6q_DkC
    

    So far so good but I still needed to get all the other citation links, right?

    And that's where all the problems really began.

    It turns out that the first request is a GET (first 20 links) but then it becomes a POST, as you have it in your code.

    But to get it working you need to figure out the pagination.

    There's some funky JS code in the HTML source that does just that.

    function Re() {
        var a = window.location.href.split("#")[0];
        Pe = a.replace(/([?&])(cstart|pagesize)=[^&]*/g, "$1");
        Qe = Math.max(+a.replace(/.*[?&]cstart=([^&]*).*/, "$1") || 0, 0);
        Oe = +a.replace(/.*[?&]pagesize=([^&]*).*/, "$1") || 0;
        Oe = Math.max(Math.min(Oe, 100), 20);
        Q("#gsc_bpf_more", "click", Ne)
    }
    
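    In plain terms: the page reads cstart and pagesize from the URL and clamps pagesize to the 20–100 range, which is consistent with the 0/20, 20/80, then 100-at-a-time schedule the bash scripts below use. Here's a small Python sketch of that schedule (the function name is mine):

    def pagination_schedule(max_results: int):
        """Yield (cstart, pagesize) pairs: 0/20, 20/80, then 100-row pages."""
        cstart = 0
        while cstart < max_results:
            pagesize = 20 if cstart == 0 else (80 if cstart == 20 else 100)
            yield cstart, pagesize
            cstart += pagesize

    print(list(pagination_schedule(300)))
    # [(0, 20), (20, 80), (100, 100), (200, 100)]
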

    And there's a snippet that handles the POST request.

    function cc(a, b, c) {
        var d = new XMLHttpRequest;
        d.onreadystatechange = function() {
            if (d.readyState == 4) {
                var e = d.status
                  , f = d.responseText
                  , h = d.getResponseHeader("Content-Type")
                  , k = d.responseURL
                  , m = window.location
                  , l = m.protocol;
                m = "//" + m.host + "/";
                k && k.indexOf(l + m) && k.indexOf("https:" + m) && (e = 0,
                h = f = "");
                c(e, f, h || "")
            }
        };
        d.open(b ? "POST" : "GET", a, !0);
        d.setRequestHeader("X-Requested-With", "XHR");
        b && d.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
        b ? d.send(b) : d.send();
    }

    But (again!)...

    Getting this working in Python turned out to be pretty cumbersome and error-prone. I managed to get the JSON response but got stuck on decoding issues.
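
    For reference, this is roughly the request shape that JS implies, reusing the curl disguise from before. It's a sketch only; decoding the response body is exactly where my attempts fell apart:

    import http.client

    conn = http.client.HTTPSConnection("scholar.google.ca")
    conn.request(
        "POST",
        "/citations?user=iYN86KEAAAAJ&hl=en&cstart=20&pagesize=80",
        body="json=1",  # same form body your requests code sends
        headers={
            "User-Agent": "curl/8.7.1",
            "Accept": "*/*",
            "X-Requested-With": "XHR",  # set by the JS snippet above
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
    response = conn.getresponse()
    print(response.status)
    conn.close()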

    So... I thought I'd give curl another go.

    I cooked up a simple bash script.

    Here's the breakdown:

    #!/bin/bash
    
    base_url="https://scholar.google.ca/"
    user="iYN86KEAAAAJ"
    
    # Each line fed to the loop is a "cstart pagesize" pair, matching the
    # pagination schedule above. For every pair: curl POSTs with json=1 and
    # gets JSON back; jq pulls the HTML payload out of the "B" key; rg
    # extracts the href targets; sed unescapes &amp;; awk prepends the base URL.
    echo -e "0 20\n20 80\n100 100\n200 100" | while read cstart pagesize; do
        curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
        | jq -r ".B" \
        | rg -o -r '$1' 'href="\/(.+?)"' \
        | sed 's/&amp;/\&/g' \
        | awk '{print "https://scholar.google.ca/" $0}'
    done
    
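    If you'd rather stay in Python, a rough equivalent of one loop iteration is to shell out to curl and parse the "B" payload with BeautifulSoup. This is a sketch that assumes curl is on your PATH; the href filter is a slightly stricter version of the rg pattern:

    import json
    import subprocess

    from bs4 import BeautifulSoup


    def fetch_citation_links(user: str, cstart: int, pagesize: int) -> list:
        """One iteration of the bash loop: curl POST, then parse the 'B' key."""
        url = (
            f"https://scholar.google.ca/citations"
            f"?user={user}&hl=en&cstart={cstart}&pagesize={pagesize}"
        )
        result = subprocess.run(
            ["curl", "-s", "-X", "POST", url, "-d", "json=1"],
            capture_output=True,
        )
        html = json.loads(result.stdout)["B"]  # the row HTML sits under "B"
        soup = BeautifulSoup(html, "html.parser")
        return [
            "https://scholar.google.ca" + a["href"]
            for a in soup.find_all("a", href=True)
            if a["href"].startswith("/citations")
        ]

    print("\n".join(fetch_citation_links("iYN86KEAAAAJ", 0, 20)))
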

    Running this should give you all the citation links for that particular scholar.

    Output (first 20 links plus 1 from the next page; I truncated the rest on purpose):

    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:kNdYIx-mwKoC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ZeXyd9-uunAC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:AXPGKjj_ei8C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:KxtntwgDAa4C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:MXK_kJrjxJIC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:_Qo2XoVZTnwC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:RHpTSmoSYBkC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:4JMBOYKVnBMC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:NhqRSupF_l8C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:IWHjjKOFINEC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:bEWYMUwI8FkC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Mojj43d5GZwC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:7PzlFSSx8tAC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:fPk4N6BV_jEC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:ufrVoPGSRksC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:maZDTaKrznsC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:vRqMK49ujn8C
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:LkGwnXOMwfcC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:R3hNpaxXUhUC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&citation_for_view=iYN86KEAAAAJ:Y0pCki6q_DkC
    https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=iYN86KEAAAAJ&cstart=20&pagesize=80&citation_for_view=iYN86KEAAAAJ:5nxA0vEk-isC
    
    and much more...
    

    Here's a slightly modified script that saves the results to a file and keeps looping for more results.

    Let's get all the citation links for the Godfather of AI and Nobel Prize winner Geoffrey Hinton. :)

    Working script #1

    Note: make sure to make this bash file executable with chmod. For example:

    chmod +x no1_script.sh
    

    And use it by running:

    $ ./no1_script.sh

    #!/bin/bash
    
    max_results=1000 # Modify this to get more results for different users (scholars)
    base_url="https://scholar.google.ca/"
    user="JicYPdAAAAAJ" # Geoffrey Hinton
    
    
    cstart=0
    while [ $cstart -lt $max_results ]; do
        # Determine pagesize based on cstart
        if [ $cstart -eq 0 ]; then
            pagesize=20
        elif [ $cstart -eq 20 ]; then
            pagesize=80
        else
            pagesize=100
        fi
    
        curl -s -X POST "${base_url}citations?user=${user}&hl=en&cstart=${cstart}&pagesize=${pagesize}" -d "json=1" \
        | jq -r ".B" \
        | rg -o -r '$1' 'href="\/(.+?)"' \
        | sed 's/&amp;/\&/g' \
        | awk '{print "https://scholar.google.ca/" $0}' >> "${user}.txt"
    
        # Update cstart for the next iteration
        if [ $cstart -eq 0 ]; then
            cstart=20
        elif [ $cstart -eq 20 ]; then
            cstart=100
        else
            cstart=$((cstart + 100))
        fi
    done
    

    If you cat the output file you'll get 709 links.

    cat JicYPdAAAAAJ.txt | wc -l
         709
    

    Putting it all together:

    Feed the output of the #1 script (the bash script above that collects citation links) into the following Python script. Here I'm using Geoffrey Hinton's citation links, which the #1 script stored in the JicYPdAAAAAJ.txt file.

    import random
    import subprocess
    import time
    
    from bs4 import BeautifulSoup
    
    
    def wait_for(max_wait: int = 10) -> None:
        # randint is inclusive on both ends, so this yields 1..max_wait seconds
        wait = random.randint(1, max_wait)
        print(f"Waiting for {wait} seconds...")
        time.sleep(wait)
    
    # Use the output of the #1 script to get the citation links
    with open("JicYPdAAAAAJ.txt") as f:
        citation_links = [line for line in f.read().splitlines() if line]
    
    # Remove [:3] to visit all citation links
    for link in citation_links[:3]:
        print(f"Visiting {link}...")
        curl = subprocess.run(
            ["curl", link],
            capture_output=True,
        )
        try:
            # The citation page's title links to the original paper (if any)
            anchor = BeautifulSoup(
                curl.stdout.decode("ISO-8859-1"), "html.parser"
            ).select_one("#gsc_oci_title_gg > div > a")
            print(f"Paper link: {anchor.get('href')}")
            wait_for()
        except AttributeError:
            # select_one() returned None: no paper link on this citation page
            print("No paper found.")
            continue
    

    You should get the paper links (not all citations have paper links, though).

    Sample output:

    Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:VN7nJs4JPk0C...
    Paper link: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
    Waiting for 7 seconds...
    Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:C-SEpTPhZ1wC...
    Paper link: https://hal.science/hal-04206682/document
    Waiting for 5 seconds...
    Visiting https://scholar.google.ca/citations?view_op=view_citation&hl=en&oe=ASCII&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:GFxP56DSvIMC...
    No paper found.
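
    If you want the paper links in a DataFrame like your original function returned, collect them in a list inside the loop and build the frame at the end (a sketch; paper_links is a name I made up):

    import pandas as pd

    paper_links = []  # inside the loop: paper_links.append(anchor.get("href"))
    df_papers = pd.DataFrame(paper_links, columns=["Link"])
    df_papers.to_csv("paper_links.csv", index=False)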