pythonpandascsvjson-normalize

pandas retaining column after flattening json


I have a pandas dataframe where each row corresponds to a news article, and each cell in the row contains one dictionary that contains stock ticker information as well as some metrics. The dataframe also contains a time_published column, which contains the publishing date and time corresponding to the row for the article.

I flattened the table into a more workable format, but due to the way I did it, I'm now missing the time_published data in my final table. I'm not quite sure how to make it so that I can keep it.

This is the original table:

index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 time_published
0 {'ticker': 'RYCEF', 'relevance_score': '0.079921', 'ticker_sentiment_score': '0.091132', 'ticker_sentiment_label': 'Neutral'} null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null 20240211T042031
1 null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null 20240211T034349
2 {'ticker': 'FOREX:IDR', 'relevance_score': '0.047371', 'ticker_sentiment_score': '0.0', 'ticker_sentiment_label': 'Neutral'} {'ticker': 'FOREX:JPY', 'relevance_score': '0.094576', 'ticker_sentiment_score': '0.079538', 'ticker_sentiment_label': 'Neutral'} null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null 20240211T033015
3 null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null 20240211T031049
4 {'ticker': 'BLK', 'relevance_score': '0.080188', 'ticker_sentiment_score': '-0.036152', 'ticker_sentiment_label': 'Neutral'} {'ticker': 'GOOG', 'relevance_score': '0.080188', 'ticker_sentiment_score': '0.0', 'ticker_sentiment_label': 'Neutral'} {'ticker': 'CRYPTO:BTC', 'relevance_score': '0.237358', 'ticker_sentiment_score': '0.229927', 'ticker_sentiment_label': 'Somewhat-Bullish'} null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null null 20240211T023653

I used the following code chunk to flatten the table into a workable format:

df_final = pd.DataFrame() 
for i in range(0, 41):
    df_flat_ticker_temp = pd.json_normalize(df_flat_ticker[i])
    df_final = pd.concat([df_final, df_flat_ticker_temp])
    df_final = df_final.dropna() 

Resulting in this table:

index ticker relevance_score ticker_sentiment_score ticker_sentiment_label
0 RYCEF 0.079921 0.091132 Neutral
2 FOREX:IDR 0.047371 0.0 Neutral
4 BLK 0.080188 -0.036152 Neutral
5 NSC 0.329551 -0.319278 Somewhat-Bearish
6 FOREX:EUR 0.043499 -0.222413 Somewhat-Bearish
7 FOREX:EUR 0.259727 0.11301 Neutral
8 VZ 0.390613 0.331239 Somewhat-Bullish
9 BABA 0.064449 0.0 Neutral
10 ADM 0.376704 0.166407 Somewhat-Bullish
11 EVGGF 0.098255 -0.010632 Neutral

I am happy with this table, but the only problem is that I cannot find a way to preserve the time_published data due to the way the table was originally formatted. My difficulty with this task arises from the fact that in the original table, an entry in the time_published column corresponds with all of the dictionaries in that row, but in my final table each of those dictionaries is now a separate row. I'd appreciate any help or advice from someone more skilled at pandas than I am on how to do this.


Solution

  • You could add the time_published column back in to each set of columns you append to df_final:

    for i in range(42):
        df_flat_ticker_temp = pd.json_normalize(df_flat_ticker[i])
        df_final = pd.concat([df_final, pd.concat([df_flat_ticker_temp, df['time_published']], axis=1)])
    
    df_final = df_final.dropna()
    

    Output for your sample data:

           ticker relevance_score ticker_sentiment_score ticker_sentiment_label   time_published
    0       RYCEF        0.079921               0.091132                Neutral  20240211T042031
    2   FOREX:IDR        0.047371                    0.0                Neutral  20240211T033015
    4         BLK        0.080188              -0.036152                Neutral  20240211T023653
    2   FOREX:JPY        0.094576               0.079538                Neutral  20240211T033015
    4        GOOG        0.080188                    0.0                Neutral  20240211T023653
    4  CRYPTO:BTC        0.237358               0.229927       Somewhat-Bullish  20240211T023653
    

    Note that you need range(42) to cover all your columns (up to 41).

    Also you could use assign instead of a nested concat:

    df_final = pd.concat([df_final, df_flat_ticker_temp.assign(time_published=df['time_published'])])