I am developing a PostgreSQL procedure using the plpython3u extension, which lets you use the Python procedural language inside PostgreSQL.
With the following code using plpy, I am able to retrieve data from a table and put it into a pandas DataFrame.
CREATE OR REPLACE PROCEDURE public.plpy_proc_clas_full(
)
LANGUAGE 'plpython3u'
AS $BODY$
import pandas as pd

# plpy.execute returns a PLyResult, which behaves like a list of dicts (one dict per row)
data_lt = plpy.execute('SELECT "key", "value" FROM public."<your-table>" ORDER BY "key"')
data_df_x = pd.DataFrame.from_records(data_lt)['key']
data_df_y = pd.DataFrame.from_records(data_lt)['value']
df = pd.concat([data_df_x, data_df_y], axis=1)
# a PL/Python procedure must return None, so df is only available for further processing here
$BODY$;
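If the retrieved rows are needed outside the procedure, note that a procedure cannot hand back a result set, but a set-returning function can. Here is a minimal sketch, assuming a made-up function name and that both columns are numeric:

CREATE OR REPLACE FUNCTION public.plpy_func_clas_full()
    RETURNS TABLE("key" numeric, "value" numeric)
    LANGUAGE 'plpython3u'
AS $BODY$
import pandas as pd

data_lt = plpy.execute('SELECT "key", "value" FROM public."<your-table>" ORDER BY "key"')
df = pd.DataFrame.from_records(data_lt)[['key', 'value']]

# a set-returning PL/Python function can return a list of dicts, one per output row
return df.to_dict('records')
$BODY$;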
But how can I write the pandas DataFrame back to a table (for example, after a few data manipulations in Python)?
After some time, I read a few articles that use libraries like psycopg2 or plpy.connect to open a connection to the destination database.
Personally, I do not see any reason to re-open a connection while manipulating data (in and out) within the same runtime and database.
I came up with a solution that does the job, although I am not sure about its performance on huge tables: it iterates over the DataFrame and issues one INSERT statement per row, which is not a bulk insert.
CREATE OR REPLACE PROCEDURE public.plpy_proc_test(
)
LANGUAGE 'plpython3u'
AS $BODY$
import pandas as pd

# PLyResult behaves like a list of dicts (one dict per row)
data_lt = plpy.execute('SELECT "key", "value" FROM public."ml_train" ORDER BY "key"')
data_df_x = pd.DataFrame.from_records(data_lt)['key']
data_df_y = pd.DataFrame.from_records(data_lt)['value']
df = pd.concat([data_df_x, data_df_y], axis=1)

# prepare the INSERT once, then execute it once per DataFrame row
plan = plpy.prepare('INSERT INTO test (y_hat, y_act) VALUES ($1, $2)', ['numeric', 'numeric'])
for index, row in df.iterrows():
    plpy.execute(plan, [row['key'], row['value']])
$BODY$;
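One way to avoid the row-by-row INSERTs while still staying inside the same session is to send each DataFrame column as an array parameter and unnest the arrays back into rows on the server, so the whole DataFrame is written in a single statement. This is only a sketch, assuming PL/Python converts the Python lists to PostgreSQL arrays for the prepared-plan parameters; the procedure name plpy_proc_test_bulk is made up, while the table and column names are the ones from the example above:

CREATE OR REPLACE PROCEDURE public.plpy_proc_test_bulk(
)
LANGUAGE 'plpython3u'
AS $BODY$
import pandas as pd

data_lt = plpy.execute('SELECT "key", "value" FROM public."ml_train" ORDER BY "key"')
df = pd.DataFrame.from_records(data_lt)[['key', 'value']]

# ... data manipulations in Python ...

# one INSERT for the whole DataFrame: each column is passed as an array
# and unnest() turns the two arrays back into rows server-side
plan = plpy.prepare(
    'INSERT INTO test (y_hat, y_act) SELECT * FROM unnest($1, $2)',
    ['numeric[]', 'numeric[]']
)
plpy.execute(plan, [df['key'].tolist(), df['value'].tolist()])
$BODY$;

I have not benchmarked this against the loop above, but it issues a single statement instead of one per row and still does not open a second connection.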