pythonpandaspostgresqlsqlalchemypandas-to-sql

psycopg2.ProgrammingError: incomplete placeholder: '%(' without ')'


I have a few different functions that scrape different tables with pandas, saves each to a dataframe, and saves them to a PostgreSQL database. I am able to successfully scrape and save each table as a dataframe, but I am having a bit of an issue when saving it to SQL. I am trying to do this with something like the below:

from sqlalchemy import create_engine

# Opening sql connection
engine = create_engine('postgresql://postgres:pw@localhost/name')
con = engine.connect()

def df1():
    df = scraped_data
    df.to_sql(table_name, con, if_exists='replace')
df1()

def df2():
    df = scraped_data
    df.to_sql(table_name, con, if_exists='replace')
df2()

# Closing connection
con.close()

I am able to successfully save df1 to SQL, but I get an error when running df2. The only real difference between the two functions is that they are scraping data from different sources. Everything else is essentially the same.

I have a couple other functions for other dataframes, but only the first one ever works no matter which order I call the functions.

I keep getting the same error for all of the other functions I call:

psycopg2.ProgrammingError: incomplete placeholder: '%(' without ')'

They also linked a page for background on the error: http://sqlalche.me/e/f405), although I still don't quite know what to make of it.

I just find it strange how it works for one function but not the others when the only thing that changes is the url that I am scraping from.

EDIT

I am scraping data from the NFL's website.

df1 iterates through years in a table from http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=TM&offensiveStatisticCategory=GAME_STATS&defensiveStatisticCategory=null&season=2019&seasonType=REG&tabSeq=2&qualified=false&Submit=Go.

df2 does a very similar thing but pulls data from http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=TM&offensiveStatisticCategory=TEAM_PASSING&defensiveStatisticCategory=null&season=2019&seasonType=REG&tabSeq=2&qualified=false&Submit=Go.

It looks like that the main difference is that df1 uses Pct to represent percentage in a column header whereas df2 uses %


Solution

  • TL;DR: You have a potential SQL injection hole.

    The problem is that one of your column names contains %. Here is a minimal reproducible example:

    In [8]: df = pd.DataFrame({"%A": ['x', 'y', 'z']})
    
    In [9]: df.to_sql('foo', engine, if_exists='replace')
    

    which produces the following log and traceback:

    ...
    INFO:sqlalchemy.engine.base.Engine:
    DROP TABLE foo
    INFO:sqlalchemy.engine.base.Engine:{}
    INFO:sqlalchemy.engine.base.Engine:COMMIT
    INFO:sqlalchemy.engine.base.Engine:
    CREATE TABLE foo (
            index BIGINT, 
            "%%A" TEXT
    )
    
    
    INFO:sqlalchemy.engine.base.Engine:{}
    INFO:sqlalchemy.engine.base.Engine:COMMIT
    INFO:sqlalchemy.engine.base.Engine:BEGIN (implicit)
    INFO:sqlalchemy.engine.base.Engine:INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)
    INFO:sqlalchemy.engine.base.Engine:({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})
    INFO:sqlalchemy.engine.base.Engine:ROLLBACK
    ---------------------------------------------------------------------------
    ProgrammingError                          Traceback (most recent call last)
    ~/Work/sqlalchemy/lib/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
       1239                     self.dialect.do_executemany(
    -> 1240                         cursor, statement, parameters, context
       1241                     )
    
    ~/Work/sqlalchemy/lib/sqlalchemy/dialects/postgresql/psycopg2.py in do_executemany(self, cursor, statement, parameters, context)
        854         if self.executemany_mode is EXECUTEMANY_DEFAULT:
    --> 855             cursor.executemany(statement, parameters)
        856             return
    
    ProgrammingError: incomplete placeholder: '%(' without ')'
    
    The above exception was the direct cause of the following exception:
    
    ProgrammingError                          Traceback (most recent call last)
    <ipython-input-9-88cf8a93ad8c> in <module>()
    ----> 1 df.to_sql('foo', engine, if_exists='replace')
    
    ...
    
    ProgrammingError: (psycopg2.ProgrammingError) incomplete placeholder: '%(' without ')'
    [SQL: INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)]
    [parameters: ({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})]
    (Background on this error at: http://sqlalche.me/e/f405)
    

    As can be seen SQLAlchemy/Pandas uses the column name as the placeholder key: %(%A)s. This means that you may be open to SQL injection, especially since you are handling scraped data:

    In [3]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
       ...: DO $$
       ...: BEGIN
       ...: RAISE 'HELLO, BOBBY!';
       ...: END;$$ --""": ['x', 'y', 'z']})
    
    In [4]: df.to_sql('foo', engine, if_exists='replace')
    

    The result:

    ...
    INFO sqlalchemy.engine.base.Engine INSERT INTO foo (index, "A", "A)s);
    DO $$
    BEGIN
    RAISE 'HELLO, BOBBY!';
    END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
    DO $$
    BEGIN
    RAISE 'HELLO, BOBBY!';
    END;$$ --)s)
    INFO sqlalchemy.engine.base.Engine ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})
    INFO sqlalchemy.engine.base.Engine ROLLBACK
    ---------------------------------------------------------------------------
    RaiseException                            Traceback (most recent call last)
    ...
    
    InternalError: (psycopg2.errors.RaiseException) HELLO, BOBBY!
    CONTEXT:  PL/pgSQL function inline_code_block line 3 at RAISE
    
    [SQL: INSERT INTO foo (index, "A", "A)s);
    DO $$
    BEGIN
    RAISE 'HELLO, BOBBY!';
    END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
    DO $$
    BEGIN
    RAISE 'HELLO, BOBBY!';
    END;$$ --)s)]
    [parameters: ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})]
    (Background on this error at: http://sqlalche.me/e/2j85)
    

    If you are using a database user with sufficient privileges, this allows for example executing arbitrary commands on your machine:

    In [11]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
        ...: CREATE TEMPORARY TABLE IF NOT EXISTS evil (state text);
        ...: DO $$
        ...: BEGIN
        ...: IF NOT EXISTS (SELECT * FROM evil) THEN
        ...: COPY evil (state) FROM PROGRAM 'send_ssh_keys | echo done';
        ...: END IF;
        ...: END;$$ --""": ['x', 'y', 'z']})
    

    This might seem like an oversight on SQLAlchemy's (and/or Pandas') part, but usually you are not meant to allow users, or outside data, to define your schema, and so table and column names are "trusted". In this light the only proper solution is to whitelist columns, i.e. check against a known set that your dataframe has only allowable columns.