I have a few different functions that scrape different tables with pandas, saves each to a dataframe, and saves them to a PostgreSQL database. I am able to successfully scrape and save each table as a dataframe, but I am having a bit of an issue when saving it to SQL. I am trying to do this with something like the below:
from sqlalchemy import create_engine
# Opening sql connection
engine = create_engine('postgresql://postgres:pw@localhost/name')
con = engine.connect()
def df1():
df = scraped_data
df.to_sql(table_name, con, if_exists='replace')
df1()
def df2():
df = scraped_data
df.to_sql(table_name, con, if_exists='replace')
df2()
# Closing connection
con.close()
I am able to successfully save df1
to SQL, but I get an error when running df2
.
The only real difference between the two functions is that they are scraping data from different sources. Everything else is essentially the same.
I have a couple other functions for other dataframes, but only the first one ever works no matter which order I call the functions.
I keep getting the same error for all of the other functions I call:
psycopg2.ProgrammingError: incomplete placeholder: '%(' without ')'
They also linked a page for background on the error: http://sqlalche.me/e/f405), although I still don't quite know what to make of it.
I just find it strange how it works for one function but not the others when the only thing that changes is the url that I am scraping from.
EDIT
I am scraping data from the NFL's website.
df1
iterates through years in a table from http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=TM&offensiveStatisticCategory=GAME_STATS&defensiveStatisticCategory=null&season=2019&seasonType=REG&tabSeq=2&qualified=false&Submit=Go.
df2
does a very similar thing but pulls data from http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=TM&offensiveStatisticCategory=TEAM_PASSING&defensiveStatisticCategory=null&season=2019&seasonType=REG&tabSeq=2&qualified=false&Submit=Go.
It looks like that the main difference is that df1
uses Pct
to represent percentage in a column header whereas df2
uses %
TL;DR: You have a potential SQL injection hole.
The problem is that one of your column names contains %
. Here is a minimal reproducible example:
In [8]: df = pd.DataFrame({"%A": ['x', 'y', 'z']})
In [9]: df.to_sql('foo', engine, if_exists='replace')
which produces the following log and traceback:
...
INFO:sqlalchemy.engine.base.Engine:
DROP TABLE foo
INFO:sqlalchemy.engine.base.Engine:{}
INFO:sqlalchemy.engine.base.Engine:COMMIT
INFO:sqlalchemy.engine.base.Engine:
CREATE TABLE foo (
index BIGINT,
"%%A" TEXT
)
INFO:sqlalchemy.engine.base.Engine:{}
INFO:sqlalchemy.engine.base.Engine:COMMIT
INFO:sqlalchemy.engine.base.Engine:BEGIN (implicit)
INFO:sqlalchemy.engine.base.Engine:INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)
INFO:sqlalchemy.engine.base.Engine:({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})
INFO:sqlalchemy.engine.base.Engine:ROLLBACK
---------------------------------------------------------------------------
ProgrammingError Traceback (most recent call last)
~/Work/sqlalchemy/lib/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1239 self.dialect.do_executemany(
-> 1240 cursor, statement, parameters, context
1241 )
~/Work/sqlalchemy/lib/sqlalchemy/dialects/postgresql/psycopg2.py in do_executemany(self, cursor, statement, parameters, context)
854 if self.executemany_mode is EXECUTEMANY_DEFAULT:
--> 855 cursor.executemany(statement, parameters)
856 return
ProgrammingError: incomplete placeholder: '%(' without ')'
The above exception was the direct cause of the following exception:
ProgrammingError Traceback (most recent call last)
<ipython-input-9-88cf8a93ad8c> in <module>()
----> 1 df.to_sql('foo', engine, if_exists='replace')
...
ProgrammingError: (psycopg2.ProgrammingError) incomplete placeholder: '%(' without ')'
[SQL: INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)]
[parameters: ({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})]
(Background on this error at: http://sqlalche.me/e/f405)
As can be seen SQLAlchemy/Pandas uses the column name as the placeholder key: %(%A)s
. This means that you may be open to SQL injection, especially since you are handling scraped data:
In [3]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
...: DO $$
...: BEGIN
...: RAISE 'HELLO, BOBBY!';
...: END;$$ --""": ['x', 'y', 'z']})
In [4]: df.to_sql('foo', engine, if_exists='replace')
The result:
...
INFO sqlalchemy.engine.base.Engine INSERT INTO foo (index, "A", "A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --)s)
INFO sqlalchemy.engine.base.Engine ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})
INFO sqlalchemy.engine.base.Engine ROLLBACK
---------------------------------------------------------------------------
RaiseException Traceback (most recent call last)
...
InternalError: (psycopg2.errors.RaiseException) HELLO, BOBBY!
CONTEXT: PL/pgSQL function inline_code_block line 3 at RAISE
[SQL: INSERT INTO foo (index, "A", "A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --)s)]
[parameters: ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})]
(Background on this error at: http://sqlalche.me/e/2j85)
If you are using a database user with sufficient privileges, this allows for example executing arbitrary commands on your machine:
In [11]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
...: CREATE TEMPORARY TABLE IF NOT EXISTS evil (state text);
...: DO $$
...: BEGIN
...: IF NOT EXISTS (SELECT * FROM evil) THEN
...: COPY evil (state) FROM PROGRAM 'send_ssh_keys | echo done';
...: END IF;
...: END;$$ --""": ['x', 'y', 'z']})
This might seem like an oversight on SQLAlchemy's (and/or Pandas') part, but usually you are not meant to allow users, or outside data, to define your schema, and so table and column names are "trusted". In this light the only proper solution is to whitelist columns, i.e. check against a known set that your dataframe has only allowable columns.