I would like to populate a pandas dataframe from an sql table that is an associative entity so that the data frame has an index of one of the entities and column headers of the second entity.
So as an example, I have the following SQL tables:
Entity 1
code | name |
---|---|
A | Type A |
B | Type B |
Entity 2
code | name |
---|---|
W | Type W |
X | Type X |
Y | Type Y |
Z | Type Z |
Associative entity
Entity 1_code | Entity 2_code | value |
---|---|---|
A | W | 1 |
A | Y | 7 |
A | Z | 3 |
B | X | 88 |
B | Y | 5 |
And I would like my dataframe to have the following structure
W | X | Y | Z | |
---|---|---|---|---|
A | 1 | NaN | 7 | 3 |
B | NaN | 88 | 5 | NaN |
Semantically I can do this by loading an empty frame using following pseudo code:
connection = psycopg2.connect( ... )
# create empty df with index set to Entity 1 codes
df = psql.read_sql('SELECT code FROM entity_1', connection, index_col='code')
cur = connection.cursor()
cur.execute('SELECT code FROM entity_2')
# create list of column names
entity_2_codes = [r[0] for r in cur.fetchall()]
# add columns from entity 2 codes
df=df.reindex(columns=entity_2_codes)
# now loop through each associative entity entry and insert value into dataframe
Is there a smart way that I can populate the table more efficiently? Add a column or a row at a time maybe? Note the data is sparse and so not every Entity 1
x Entity 2
combination will have a value.
You could use pandas pivot() or pivot_table() methods. pivot
is used when you don't need aggregation (there'll be only one value per Entity 1_code
and Entity 2_code
combination). pivot_table
can be used to do aggregations (sum, count, max) if you have more than one value, you can specify how to fill NA values etc.
If you can load the Associative entity
table into a DataFrame df
, this would simply be:
df.pivot(index='Entity 1_code', columns='Entity 2_code', values='value')
or using pivot_table
:
df.pivot_table(index='Entity 1_code', columns='Entity 2_code', values='value', aggfunc='mean')
If there is only one value per combination, pivot_table
can mimic pivot
by setting the aggfunc to 'mean' as the average will just be that one value.