python pandas dataframe pubmed

PubMed: fetch article details into a DataFrame


Here is the code.

import pandas as pd
from pymed import PubMed
import numpy as np
pubmed = PubMed(tool="PubMedSearcher", email="myemail@ccc.com")


## PUT YOUR SEARCH TERM HERE ##
search_term = 'Charlie Brown'
results = pubmed.query(search_term, max_results=100000)
articleList = []
articleInfo = []

for article in results:
    # Each result is either a PubMedBookArticle or a PubMedArticle.
    # Convert it to a dictionary with the available toDict() method.
    articleDict = article.toDict()
    articleList.append(articleDict)

# Build a list of dict records holding the article details fetched from the PubMed API
for article in articleList:
    # Sometimes article['pubmed_id'] contains several IDs separated by newlines;
    # take the first one, which is the article's own PubMed ID
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append the article info to the list of records
    articleInfo.append({u'pubmed_id':pubmedId,
                       u'publication_date':article['publication_date'], 
                       u'authors':article['authors']})

df=pd.json_normalize(articleInfo)

Running this code produces a DataFrame with three columns: pubmed_id, publication_date and authors.

Is there a way to unnest the authors column and keep the other two columns? Thanks so much in advance.


Solution

  • To unnest the column, you first have to decide on a strategy. For example, you can format each author as lastname, firstname and join the authors with ;:

    # New column to easily identify how many authors there are in the paper
    df['n_authors'] = df['authors'].map(len)
    
    # Unnest authors into a single string using the above-mentioned strategy
    df['authors'] = df['authors'].map(lambda authors: ';'.join([f"{author['lastname']}, {author['firstname']}" for author in authors]))
    

    Output:

       pubmed_id publication_date                                            authors  n_authors  
    0   35435469       2022-04-19  Easwaran, Raju;Khan, Moin;Sancheti, Parag;Shya...         41  
    1   34480858       2021-09-05  Flaxman, Amy;Marchevsky, Natalie G;Jenkin, Dan...         38  
    2   30857579       2019-03-13                                     Brown, Charlie          1  
    3   28640023       2017-06-24  Thornton, Kevin C;Schwarz, Jennifer J;Gross, A...         12  
    4   24195874       2013-11-08  Bicket, Mark C;Gupta, Anita;Brown, Charlie H;C...          4  
    5   21741796       2011-07-12  Bird, Jonathan H;Carmont, Michael R;Dhillon, M...          7  
    6   21324873       2011-02-18  Cohen, Steven P;Brown, Charlie;Kurihara, Conni...          6  
    7   20228712       2010-03-17  Cohen, Steven P;Kapoor, Shruti G;Nguyen, Cuong...          8  
    8   20109957       2010-01-30  Cohen, Steven P;Brown, Charlie;Kurihara, Conni...          6  
    9   18248779       2008-02-06  Whitaker, Iain S;Duggan, Eileen M;Alloway, Rit...         10  
    10  16917639       2006-08-19  Drayton, William;Brown, Charlie;Hillhouse, Karin          3  
    11  16282488       2005-11-12  Mao, Hanwen;Lafont, Bernard A P;Igarashi, Tats...          9  
    12  14581571       2003-10-29  Moniuszko, Marcin;Brown, Charlie;Pal, Ranajit;...          7  
    13  12163382       2002-08-07  Williams, Kenneth;Schwartz, Annette;Corey, Sar...         10
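
If you would rather have one row per author than a joined string, another possible strategy is to let `pd.json_normalize` do the unnesting through its `record_path` and `meta` parameters. The sketch below is not from the original answer. It uses a small hand-built `article_info` list (a hypothetical stand-in for `articleInfo`, with two records copied from the output above) and assumes every author dict has at least the `lastname` and `firstname` keys. Real pymed records may carry further keys, which would simply become extra columns.

```python
import pandas as pd

# Hypothetical stand-in for articleInfo, shaped like the records built in the question
article_info = [
    {"pubmed_id": "30857579", "publication_date": "2019-03-13",
     "authors": [{"lastname": "Brown", "firstname": "Charlie"}]},
    {"pubmed_id": "16917639", "publication_date": "2006-08-19",
     "authors": [{"lastname": "Drayton", "firstname": "William"},
                 {"lastname": "Brown", "firstname": "Charlie"},
                 {"lastname": "Hillhouse", "firstname": "Karin"}]},
]

# record_path expands each author dict into its own row;
# meta repeats the article-level fields on every author row
long_df = pd.json_normalize(
    article_info,
    record_path="authors",
    meta=["pubmed_id", "publication_date"],
)

print(long_df)
```

Note that an article with an empty authors list contributes no rows with this approach, so keep the joined-string version if every article must stay in the table.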