pythonsqlpandasassociative

Python sql query from associative entity result to 2d pandas dataframe


I would like to populate a pandas dataframe from an sql table that is an associative entity so that the data frame has an index of one of the entities and column headers of the second entity.

So as an example, I have the following SQL tables:

Entity 1

code name
A Type A
B Type B

Entity 2

code name
W Type W
X Type X
Y Type Y
Z Type Z

Associative entity

Entity 1_code Entity 2_code value
A W 1
A Y 7
A Z 3
B X 88
B Y 5

And I would like my dataframe to have the following structure

W X Y Z
A 1 NaN 7 3
B NaN 88 5 NaN

Semantically I can do this by loading an empty frame using following pseudo code:

connection = psycopg2.connect( ... )

# create empty df with index set to Entity 1 codes
df = psql.read_sql('SELECT code FROM entity_1', connection, index_col='code')

cur = connection.cursor()
cur.execute('SELECT code FROM entity_2')

# create list of column names
entity_2_codes = [r[0] for r in cur.fetchall()]
# add columns from entity 2 codes
df=df.reindex(columns=entity_2_codes) 

# now loop through each associative entity entry and insert value into dataframe

Is there a smart way that I can populate the table more efficiently? Add a column or a row at a time maybe? Note the data is sparse and so not every Entity 1 x Entity 2 combination will have a value.


Solution

  • You could use pandas pivot() or pivot_table() methods. pivot is used when you don't need aggregation (there'll be only one value per Entity 1_code and Entity 2_code combination). pivot_table can be used to do aggregations (sum, count, max) if you have more than one value, you can specify how to fill NA values etc.

    If you can load the Associative entity table into a DataFrame df, this would simply be:

    df.pivot(index='Entity 1_code', columns='Entity 2_code', values='value')
    

    or using pivot_table:

    df.pivot_table(index='Entity 1_code', columns='Entity 2_code', values='value', aggfunc='mean')
    

    If there is only one value per combination, pivot_table can mimic pivot by setting the aggfunc to 'mean' as the average will just be that one value.