python, dataframe, web-crawler, article, google-scholar

Python: How do I save scholarly.search_pubs() result as a dataframe?


I used the following code to find an article using the scholarly.search_pubs() function:

from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
scholarly.pprint(next(search_query))

Output:

{'author_id': ['', ''],
 'bib': {'abstract': 'A style goods item has a finite selling period during '
                     'which the sales rate varies in a seasonal and, to some '
                     'extent, predictable fashion. There are only a limited '
                     'number of opportunities to purchase or manufacture the '
                     'style goods item, and the cost, in general, will depend '
                     'on the time at which the item is obtained. The unit '
                     'revenue achieved from sales of the item also varies '
                     'during the selling season, and, in particular, reaches '
                     'an appreciably lower terminal salvage value. Previous '
                     'work on this class of problem has assumed one of the '
                     'following:(a)',
         'author': ['GR Murray Jr', 'EA Silver'],
         'pub_year': '1966',
         'title': 'A Bayesian analysis of the style goods inventory problem',
         'venue': 'Management Science'},
 'citedby_url': '/scholar?cites=9014559854426428787&as_sdt=5,33&sciodt=0,33&hl=en',
 'filled': False,
 'gsrank': 1,
 'num_citations': 208,
 'pub_url': 'https://pubsonline.informs.org/doi/abs/10.1287/mnsc.12.11.785',
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3DA%2BBayesian%2BAnalysis%2Bof%2Bthe%2BStyle%2BGoods%2BInventory%2BProblem%26hl%3Den%26as_sdt%3D0,33&citilm=1&update_op=library_add&info=c5WVKW0mGn0J&ei=4DdoYri8IoySyASZk6HgCA&json=',
 'url_related_articles': '/scholar?q=related:c5WVKW0mGn0J:scholar.google.com/&scioq=A+Bayesian+Analysis+of+the+Style+Goods+Inventory+Problem&hl=en&as_sdt=0,33',
 'url_scholarbib': '/scholar?q=info:c5WVKW0mGn0J:scholar.google.com/&output=cite&scirp=0&hl=en'}

I want to save this output as a pandas dataframe. Can someone please help me with it?

Edit(1): Thank you for answering my question.

When I run this code:

data = next(search_query)
df = pd.json_normalize(data)

... it gives the following error message:

StopIteration                             Traceback (most recent call last)
<ipython-input-78-ef73437b55a5> in <module>
----> 1 data = next(search_query)
      2 df = pd.json_normalize(data)

~\Anaconda3\lib\site-packages\scholarly\publication_parser.py in __next__(self)
     91             return self.__next__()
     92         else:
---> 93             raise StopIteration
     94 
     95     # Pickle protocol
StopIteration:


FOLLOW-UP QUESTION

I have an Excel file that contains the titles of multiple articles. Instead of searching for each article separately, I imported the Excel file as a dataframe and used the following code to look up the articles:

for i in df['Title']:
    search_query_1 = scholarly.search_pubs(i)

Now the search_query_1 iterator contains multiple articles. How can I save them all as one dataframe?


Solution

  • Try using pd.json_normalize

    # python 3.8.9
    # scholarly==1.6.0
    import pandas as pd
    from scholarly import scholarly
    
    search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
    data = next(search_query)
    # use data = list(search_query) to get the entire result set back instead
    df = pd.json_normalize(data)
    
    # output
    >>> df.T                                                                      
                                                                         0
    container_type                                              Publication
    source                     PublicationSource.PUBLICATION_SEARCH_SNIPPET
    filled                                                            False
    gsrank                                                                1
    pub_url               https://pubsonline.informs.org/doi/abs/10.1287...
    author_id                                                          [, ]
    url_scholarbib        /scholar?q=info:c5WVKW0mGn0J:scholar.google.co...
    url_add_sclib         /citations?hl=en&xsrf=&continue=/scholar%3Fq%3...
    num_citations                                                       209
    citedby_url           /scholar?cites=9014559854426428787&as_sdt=5,33...
    url_related_articles  /scholar?q=related:c5WVKW0mGn0J:scholar.google...
    bib.title             A Bayesian analysis of the style goods invento...
    bib.author                                    [GR Murray Jr, EA Silver]
    bib.pub_year                                                       1966
    bib.venue                                            Management Science
    bib.abstract          A style goods item has a finite selling period...
    >>> df.columns
    Index(['container_type', 'source', 'filled', 'gsrank', 'pub_url', 'author_id',
           'url_scholarbib', 'url_add_sclib', 'num_citations', 'citedby_url',
           'url_related_articles', 'bib.title', 'bib.author', 'bib.pub_year',
           'bib.venue', 'bib.abstract'],
          dtype='object')
    
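    A note on the StopIteration from Edit(1): search_pubs returns a one-shot iterator, so once next() (or list()) has consumed it, any further next() call raises StopIteration, exactly as in your traceback. A minimal sketch of the fix, re-creating the query and using next's optional default so an exhausted query yields None instead of raising:

    import pandas as pd
    from scholarly import scholarly
    
    # a fresh iterator; the earlier one was already consumed by next()
    search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
    data = next(search_query, None)   # None instead of StopIteration when exhausted
    if data is not None:
        df = pd.json_normalize(data)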

    To handle the follow-up question, collect the full set of results for each title, json-normalize each batch, and concatenate the batches:

    titles_to_search = list(df['Title'].unique())
    
    dfs = []
    for title_to_search in titles_to_search:
        search_query = scholarly.search_pubs(title_to_search)
        search_results = list(search_query)   # exhaust this title's iterator
        
        temp_df = pd.json_normalize(data=search_results)
        if not temp_df.empty:
            dfs.append(temp_df)
    
    total_search_df = pd.concat(dfs, ignore_index=True)
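
    If each title only needs its best match, here is a variant sketch that keeps just the first hit per title, assuming each result behaves like a plain dict (as pd.json_normalize above suggests); the searched_title column is a made-up name used purely to trace rows back to the Excel titles:

    rows = []
    for title_to_search in titles_to_search:
        search_query = scholarly.search_pubs(title_to_search)
        top_hit = next(search_query, None)   # None when Scholar returns nothing
        if top_hit is not None:
            top_hit['searched_title'] = title_to_search  # hypothetical tracing column
            rows.append(top_hit)
    
    top_hits_df = pd.json_normalize(rows)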