pythonpandaspostgresqlhstore

Access PostgreSQL hstore keys and values in Python and create new dataframe column for each key


I manage a PostgreSQL database and am working on a tool for users to access a subset of the database. The database has a number of columns, and in addition we use a huge number of hstore keys to store additional information specific to certain rows in the database. Basic example below

A          B        C        hstore   
"foo"      1        4        "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" 
"bar"      4        6        "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar"   2        8
"baz"      3        1        "Fruit"=>"apple", "Name"=>"David"

The data is routinely exported to a CSV file like this:

COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;

I read this into a Pandas dataframe in Python like this:

import pandas as pd
df = pd.read_csv('/filepath/file.csv')

I then access a subset of the data. This subset should have a common set of hstore keys in most, but not necessarily all rows.

I would like to create a separate column for each of the hstore keys. Where a key does not exist for a row, the cell should be left empty, or filled with a NULL or NAN value, whatever is easiest. What is the most effective way to do this?


Solution

  • You can use .str.extractall() to extract the keys and values from column hstore, then use .pivot() to transform the keys to column labels. Aggregate the entries for each row in original dataframe by .groupby() and .agg(). Set NaN for empty entries with .replace(). Finally, join back the result dataframe to original dataframe with .join():

    df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
                 .reset_index()
                 .pivot(index=['level_0', 'match'], columns=0, values=1)
                 .groupby(level=0)
                 .agg(lambda x: ''.join(x.dropna()))
                 .replace('', np.nan)
           )
    

    Result:

              A  B  C                                               hstore   Country  Fruit   Name  Pet
    0     "foo"  1  4  "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"    Norway  apple    NaN  dog
    1     "bar"  4  6                  "Pet"=>"cat", "Country"=>"Suriname"  Suriname    NaN    NaN  cat
    2  "foobar"  2  8                                                 None       NaN    NaN    NaN  NaN
    3     "baz"  3  1                    "Fruit"=>"apple", "Name"=>"David"       NaN  apple  David  NaN
    
    

    If you want to get a new dataframe for the extraction instead of joining back to the original dataframe, you can remove the .join() step and do a .reindex(), as follows:

    df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
                 .reset_index()
                 .pivot(index=['level_0', 'match'], columns=0, values=1)
                 .groupby(level=0)
                 .agg(lambda x: ''.join(x.dropna()))
                 .replace('', np.nan)
             )
    df_out = df_out.reindex(df.index)
    

    Result:

    print(df_out)
    
    
        Country  Fruit   Name  Pet
    0    Norway  apple    NaN  dog
    1  Suriname    NaN    NaN  cat
    2       NaN    NaN    NaN  NaN
    3       NaN  apple  David  NaN