I manage a PostgreSQL database and am working on a tool for users to access a subset of the database. The database has a number of columns, and in addition we use a huge number of hstore keys to store additional information specific to certain rows in the database. Basic example below
A B C hstore
"foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"
"bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar" 2 8
"baz" 3 1 "Fruit"=>"apple", "Name"=>"David"
The data is routinely exported to a CSV file like this:
COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;
I read this into a Pandas dataframe in Python like this:
import pandas as pd
df = pd.read_csv('/filepath/file.csv')
I then access a subset of the data. This subset should have a common set of hstore keys in most, but not necessarily all rows.
I would like to create a separate column for each of the hstore keys. Where a key does not exist for a row, the cell should be left empty, or filled with a NULL or NAN value, whatever is easiest. What is the most effective way to do this?
You can use .str.extractall()
to extract the keys and values from column hstore
, then use .pivot()
to transform the keys to column labels. Aggregate the entries for each row in original dataframe by .groupby()
and .agg()
. Set NaN
for empty entries with .replace()
. Finally, join back the result dataframe to original dataframe with .join()
:
df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
Result:
A B C hstore Country Fruit Name Pet
0 "foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" Norway apple NaN dog
1 "bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname" Suriname NaN NaN cat
2 "foobar" 2 8 None NaN NaN NaN NaN
3 "baz" 3 1 "Fruit"=>"apple", "Name"=>"David" NaN apple David NaN
If you want to get a new dataframe for the extraction instead of joining back to the original dataframe, you can remove the .join()
step and do a .reindex()
, as follows:
df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
df_out = df_out.reindex(df.index)
Result:
print(df_out)
Country Fruit Name Pet
0 Norway apple NaN dog
1 Suriname NaN NaN cat
2 NaN NaN NaN NaN
3 NaN apple David NaN