python scopus pybliometrics

How do I skip titles that return too many search results (or take too long to retrieve from Scopus)?


I would like to access the ScopusSearch API and obtain the EIDs for a list of 1400 article titles that are saved in an Excel spreadsheet. I tried to retrieve the EIDs with the following code:

import numpy as np
import pandas as pd
from pybliometrics.scopus import ScopusSearch
nan = pd.read_excel(r'C:\Users\Apples\Desktop\test\titles_nan.xlsx', sheet_name='nan')
error_index = {}

for i in range(0,len(nan)):
   scopus_title = nan.loc[i ,'Title']
   s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
   print('TITLE("{0}")'.format(scopus_title))
   try:
      s = ScopusSearch(scopus_title)
      nan.at[i,'EID'] = s.results[0].eid
      print(str(i) + ' ' + s.results[0].eid)
   except:
      nan.loc[i,'EID'] = np.nan
      error_index[i] = scopus_title
      print(str(i) + 'error' )

However, I was never able to retrieve the EIDs beyond approximately 100 titles, because certain titles yield far too many search results and that stalls the entire process.

As such, I want to skip titles that return too many results and move on to the next title, all while keeping a record of the titles that were skipped.

I am just starting out with Python so I am not sure how to go about doing this. I have the following sequence in mind:

• If the title yields exactly 1 result, retrieve the EID and record it under the ‘EID’ column of the DataFrame ‘nan’.

• If the title yields more than 1 result, record the title in the error index, print ‘Too many searches’ and move on to the next title.

• If the title does not yield any results, record the title in the error index, print ‘Error’ and move on to the next title.

Attempt 1
for i in range(0,len(nan)):
   scopus_title = nan.at[i ,'Title']
   print('TITLE("{0}")'.format(scopus_title))
s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
print(type(s))

if(s.count()== 1):
    nan.at[i,"EID"] = s.results[0].eid
    print(str(i) + "   " + s.results[0].eid)
elif(s.count()>1):
    continue
    print(str(i) + "  " + "Too many searches")
else:
    error_index[i] = scopus_title
    print(str(i) + "error")

Attempt 2
for i in range(0,len(nan)):
    scopus_title = nan.at[i ,'Title']
    print('TITLE("{0}")'.format(scopus_title))
    s = ScopusSearch('TITLE("{0}")'.format(scopus_title))
    if len(s.results)== 1:
        nan.at[i,"EID"] = s.results[0].eid
        print(str(i) + "   " + s.results[0].eid)
    elif len(s.results)>1:  
        continue
        print(str(i) + "  " + "Too many searches")
    else:
        continue
        print(str(i) + "  " + "Error")

I got errors stating that an object of type 'ScopusSearch' has no len() or count(), or that the search results are not a list. I am unable to proceed from here. In addition, I am not sure whether skipping titles based on too many results is the right way to go about it. Are there more effective methods (e.g. timeouts, where a title is skipped after a certain amount of time is spent on the search)?

Any help on this matter would be very much appreciated. Thank you!


Solution

  • Combine .get_results_size() with download=False:

    from pybliometrics.scopus import ScopusSearch
    
    scopus_title = "Editorial"
    q = f'TITLE("{scopus_title}")'  # this is f-string notation, btw
    s = ScopusSearch(q, download=False)
    s.get_results_size()
    # 243142
    

    If this number is below a certain threshold, simply do s = ScopusSearch(q) and proceed as in "Attempt 2":

    for i, row in nan.iterrows():
        q = f'TITLE("{row["Title"]}")'  # inner quotes differ from the outer ones (needed before Python 3.12)
        print(q)
        s = ScopusSearch(q, download=False)
        n = s.get_results_size()
        if n == 1:
            s = ScopusSearch(q)
            nan.at[i,"EID"] = s.results[0].eid
            print(f"{i} {s.results[0].eid}")
        elif n > 1:
            print(f"{i} Too many results")
            continue  # must come last
        else:
            print(f"{i} Error")
            continue  # must come last
    

    (I used .iterrows() here to get rid of the explicit indexing. Note that i is the index label, so it will not be a running counter if the index is not a range sequence; in that case, wrap the iterator in enumerate().)
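
To also keep a record of the skipped titles, as the question asks, the loop above could be wrapped in a small helper. This is only a sketch: collect_eids, count_fn, fetch_fn and max_results are hypothetical names, not part of pybliometrics. The two callables are passed in so that the logic can be tried without calling Scopus.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple


def collect_eids(
    titles: Iterable[Tuple[Hashable, str]],
    count_fn: Callable[[str], int],
    fetch_fn: Callable[[str], str],
    max_results: int = 1,
) -> Tuple[Dict[Hashable, str], Dict[Hashable, Tuple[str, str]]]:
    """Return (eids, skipped), both keyed by the row label of each title.

    count_fn(query) should return the number of hits without downloading
    them; fetch_fn(query) should return the EID of the first hit.
    Both are hypothetical hooks, see the wiring example below.
    """
    eids = {}
    skipped = {}
    for key, title in titles:
        query = f'TITLE("{title}")'
        n = count_fn(query)
        if n == 0:
            # Nothing found: remember the title and the reason.
            skipped[key] = (title, "no results")
        elif n > max_results:
            # Ambiguous title: skip it before any download happens.
            skipped[key] = (title, "too many results")
        else:
            eids[key] = fetch_fn(query)
    return eids, skipped


# Possible wiring, using only the pybliometrics calls shown above:
#
# from pybliometrics.scopus import ScopusSearch
# eids, skipped = collect_eids(
#     nan["Title"].items(),
#     count_fn=lambda q: int(ScopusSearch(q, download=False).get_results_size()),
#     fetch_fn=lambda q: ScopusSearch(q).results[0].eid,
# )
# nan["EID"] = pd.Series(eids)  # rows without an EID become NaN
```

Because the results are keyed by the DataFrame's own index labels, this does not depend on the index being a range sequence.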