I have a web crawler that should save/insert the HTML page content into a PostgreSQL database along with some other metadata fields.
When inserting the HTML content field using mogrify, I get the error message "list index out of range".
If I use a static dummy text for the HTML content, e.g. "Hello World ö ü ä ß" (I am dealing with a German character set), the insert works fine.
This is my function:
def batch_insert(df, table, no_cols, conn=None, upsert_qry=""):
    """
    Using cursor.mogrify() to build the bulk insert query,
    then cursor.execute() to execute the query
    """
    if conn is None:
        conn = get_connection()
    col_str = "(" + (no_cols - 1) * "%s," + "%s)"
    # Create a list of tuples from the dataframe values
    tuples = [tuple(x) for x in df.to_numpy()]
    # Comma-separated dataframe columns
    cols = ','.join(list(df.columns))
    # SQL query to execute
    cursor = conn.cursor()
    values = [cursor.mogrify(col_str, tup).decode('utf8') for tup in tuples]
    query = "INSERT INTO %s(%s) VALUES " % (table, cols) + ",".join(values) + upsert_qry
    # " ON CONFLICT (hk_portal_pid) DO UPDATE SET crawled_last=" + str(datetime.now())
    try:
        cursor.execute(query, tuples)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    print("execute_mogrify() done")
    cursor.close()
    conn.close()
The Postgres column is declared as text, as suggested in this post. The HTML content in my case has an average length of 40,000 characters, so there shouldn't be any storage issues.
I also checked whether the generated tuples in the function get split for the HTML content field because of escape characters etc., but that doesn't seem to be the case.
>>> [len(a) for a in tuples]
[9, 9, 9, 9]  # 4 rows / 9 columns
The data type of the 8th column (page_content) is string:
>>> type(tuples[0][7])
<class 'str'>
Target table for batch insert:
CREATE TABLE public.as_portal_parent_content (
    hk_page varchar(100) NULL,
    hk_offer varchar(100) NULL,
    portal_id varchar(3) NULL,
    pid varchar(10) NULL,
    page_category varchar(30) NULL,
    page_link bpchar(300) NULL,
    status varchar(3) NULL,
    page_content text NULL,
    last_crawled timestamp NULL
);
Any help much appreciated!
Edit:
Traceback:
Traceback (most recent call last):
File "C:/Users/PycharmProjects/pythonProject/project/playground.py", line 5, in <module>
test.save_to_db(table='as_portal_parent_content')
File "C:/Users/PycharmProjects/pythonProject/project/util/crawl.py", line 163, in save_to_db
batch_insert(self.df, table)
File "C:/Users/PycharmProjects/pythonProject/project/util/db.py", line 60, in batch_insert
cursor.execute(query, tuples)
IndexError: list index out of range
Process finished with exit code 1
You shouldn't pass tuples to cursor.execute(query, tuples). When you use mogrify, you are already generating the after-VALUES part of the SQL query yourself, so there is no need to pass query parameters (tuples in your case) to cursor.execute again. When a second argument is supplied, psycopg2 scans the query string for placeholders a second time; any % sequences sitting literally inside the mogrified HTML content then look like placeholders with no matching parameter, which is what surfaces here as the IndexError. Your dummy text works only because it happens to contain no % character. The fix is simply cursor.execute(query).
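To see the mechanism without a live database, here is a minimal sketch. quote() and build_insert() are simplified, hypothetical stand-ins for what mogrify-based query building produces (they are not psycopg2's real escaping and should not be used for production SQL):

```python
def quote(value):
    # Naive single-quote escaping, standing in for cursor.mogrify().
    return "'" + str(value).replace("'", "''") + "'"

def build_insert(table, cols, rows):
    # Build a multi-row INSERT the same way the question's function does:
    # the values are interpolated into the string before execution.
    values = ",".join(
        "(" + ",".join(quote(v) for v in row) + ")" for row in rows
    )
    return "INSERT INTO %s(%s) VALUES %s" % (table, ",".join(cols), values)

# Crawled HTML routinely contains % characters (URL encodings, "100%", ...).
rows = [("p1", "<div>100%s Rabatt</div>"), ("p2", "<p>ok</p>")]
query = build_insert("as_portal_parent_content", ["pid", "page_content"], rows)

# The HTML now sits in `query` as a literal, including its "%s".
# Executing the finished string alone is safe; handing the rows in again
# makes psycopg2 re-parse those literals as placeholders:
#   cursor.execute(query)        # correct
#   cursor.execute(query, rows)  # breaks on the literal "%s"
```

With real psycopg2, everything up to and including building `query` stays the same; only the final call changes to cursor.execute(query).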