I used the following code to find an article using the scholarly.search_pubs() function:
from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
scholarly.pprint(next(search_query))
Output:
{'author_id': ['', ''],
 'bib': {'abstract': 'A style goods item has a finite selling period during '
                     'which the sales rate varies in a seasonal and, to some '
                     'extent, predictable fashion. There are only a limited '
                     'number of opportunities to purchase or manufacture the '
                     'style goods item, and the cost, in general, will depend '
                     'on the time at which the item is obtained. The unit '
                     'revenue achieved from sales of the item also varies '
                     'during the selling season, and, in particular, reaches '
                     'an appreciably lower terminal salvage value. Previous '
                     'work on this class of problem has assumed one of the '
                     'following:(a)',
         'author': ['GR Murray Jr', 'EA Silver'],
         'pub_year': '1966',
         'title': 'A Bayesian analysis of the style goods inventory problem',
         'venue': 'Management Science'},
 'citedby_url': '/scholar?cites=9014559854426428787&as_sdt=5,33&sciodt=0,33&hl=en',
 'filled': False,
 'gsrank': 1,
 'num_citations': 208,
 'pub_url': 'https://pubsonline.informs.org/doi/abs/10.1287/mnsc.12.11.785',
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3DA%2BBayesian%2BAnalysis%2Bof%2Bthe%2BStyle%2BGoods%2BInventory%2BProblem%26hl%3Den%26as_sdt%3D0,33&citilm=1&update_op=library_add&info=c5WVKW0mGn0J&ei=4DdoYri8IoySyASZk6HgCA&json=',
 'url_related_articles': '/scholar?q=related:c5WVKW0mGn0J:scholar.google.com/&scioq=A+Bayesian+Analysis+of+the+Style+Goods+Inventory+Problem&hl=en&as_sdt=0,33',
 'url_scholarbib': '/scholar?q=info:c5WVKW0mGn0J:scholar.google.com/&output=cite&scirp=0&hl=en'}
I want to save this output as a pandas dataframe. Can someone please help me with it?
Edit(1): Thank you for answering my question.
When I run this code:
data = next(search_query)
df = pd.json_normalize(data)
... it gives the following error message:
StopIteration                             Traceback (most recent call last)
<ipython-input-78-ef73437b55a5> in <module>
----> 1 data = next(search_query)
      2 df = pd.json_normalize(data)

~\Anaconda3\lib\site-packages\scholarly\publication_parser.py in __next__(self)
     91                 return self.__next__()
     92             else:
---> 93                 raise StopIteration
     94
     95     # Pickle protocol

StopIteration:
FOLLOW-UP QUESTION
I have an Excel file that contains the titles of multiple articles. Instead of searching for each article separately, I imported my Excel file as a dataframe and used the following code to find the info about the articles:
for i in df['Title']:
    search_query_1 = scholarly.search_pubs(i)
Now the search_query_1 iterator contains results for multiple articles. How can I save them as a dataframe?
Try using pd.json_normalize:
# python 3.8.9
# scholarly==1.6.0
import pandas as pd
from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
data = next(search_query)
# you can use data = list(search_query) to get the entire result set back
df = pd.json_normalize(data)
# output
>>> df.T
                                                                       0
container_type                                               Publication
source                     PublicationSource.PUBLICATION_SEARCH_SNIPPET
filled                                                             False
gsrank                                                                 1
pub_url                http://pubsonline.informs.org/doi/abs/10.1287...
author_id                                                           [, ]
url_scholarbib         /scholar?q=info:c5WVKW0mGn0J:scholar.google.co...
url_add_sclib          /citations?hl=en&xsrf=&continue=/scholar%3Fq%3...
num_citations                                                        209
citedby_url            /scholar?cites=9014559854426428787&as_sdt=5,33...
url_related_articles   /scholar?q=related:c5WVKW0mGn0J:scholar.google...
bib.title              A Bayesian analysis of the style goods invento...
bib.author                                     [GR Murray Jr, EA Silver]
bib.pub_year                                                        1966
bib.venue                                             Management Science
bib.abstract           A style goods item has a finite selling period...
>>> df.columns
Index(['container_type', 'source', 'filled', 'gsrank', 'pub_url', 'author_id',
       'url_scholarbib', 'url_add_sclib', 'num_citations', 'citedby_url',
       'url_related_articles', 'bib.title', 'bib.author', 'bib.pub_year',
       'bib.venue', 'bib.abstract'],
      dtype='object')
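About the StopIteration in your edit: an iterator can only be consumed once. Your earlier scholarly.pprint(next(search_query)) call most likely took the only result, so a second next(search_query) has nothing left to yield and raises StopIteration (a fresh search that returns no hits fails the same way). A minimal sketch of a safer pattern, using Python's standard two-argument form of next():

# create a fresh iterator instead of reusing the exhausted one
search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')

# next(iterator, default) returns the default instead of raising StopIteration
data = next(search_query, None)
df = pd.json_normalize(data) if data is not None else pd.DataFrame()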
Collect the iterated search results into a list and run pd.json_normalize on it. To handle iterating over multiple titles:
titles_to_search = list(df['Title'].unique())

dfs = []
for title_to_search in titles_to_search:
    search_query = scholarly.search_pubs(title_to_search)
    # materialize the iterator so every result for this title is captured
    search_results = list(search_query)
    temp_df = pd.json_normalize(data=search_results)
    if not temp_df.empty:
        dfs += [temp_df]
total_search_df = pd.concat(dfs)
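If you then want to write the combined results back out next to your Excel file, something like the sketch below should work (the file name is just an example, and DataFrame.to_excel needs an engine such as openpyxl installed):

# renumber the rows 0..n-1 instead of keeping each chunk's own index
total_search_df = total_search_df.reset_index(drop=True)

# save the combined results (example file name)
total_search_df.to_excel('search_results.xlsx', index=False)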