I am a newbie to web development. My task is to download a large amount of data from a web API and upload it into a Postgres database on Google Cloud.
Because the amount of data is very large, I have a for-loop that scrapes the data part by part, inserts each part into the cloud table, and finally commits everything. I use SQLAlchemy and pg8000 to do the job.
Here is the basic structure of my code:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine("postgresql+pg8000://connection/info")
session = scoped_session(sessionmaker(autocommit=False, autoflush=False, bind=engine))
Base = declarative_base()

class MyTableClass(Base):
    # some columns
    ...

Base.metadata.create_all(engine)

for part in scraping_data():
    engine.execute(MyTableClass.__table__.insert(), part)

session.commit()
session.close()
engine.dispose()
The for-loop ran successfully for 12 hours, and then I received a network error. Here is the first part of the traceback; the remainder of the error message is long and consists of follow-on errors caused by this first part.
ERROR:sqlalchemy.pool.impl.QueuePool:Exception closing connection <pg8000.legacy.Connection object at 0x7fa2ba514160>
Traceback (most recent call last):
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/core.py", line 760, in handle_messages
code, data_len = ci_unpack(self._read(5))
struct.error: unpack_from requires a buffer of at least 5 bytes for unpacking 5 bytes at offset 0 (actual buffer size is 0)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1782, in _execute_context
self.dialect.do_executemany(
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 716, in do_executemany
cursor.executemany(statement, parameters)
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/legacy.py", line 304, in executemany
self.execute(operation, parameters)
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/legacy.py", line 252, in execute
self._context = self._c.execute_unnamed(
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/core.py", line 649, in execute_unnamed
self.handle_messages(context)
File "/home/user/anaconda3/envs/myenv/lib/python3.8/site-packages/pg8000/core.py", line 762, in handle_messages
raise InterfaceError("network error on read") from e
pg8000.exceptions.InterfaceError: network error on read
Could anyone help me shed light on the error?
It seems possible that my connection simply timed out. Could anyone also help me structure the code better, so that I avoid losing a large amount of downloaded data when a time-out happens?
Thanks!
As confirmed by @Eddie, his issue was resolved by breaking up his export jobs into smaller ones. I am posting the comments (troubleshooting steps / potential problems) as an answer for anyone who runs into the same issue.
As per this link, pg8000.errors.InterfaceError is a generic exception raised for errors related to the database interface rather than the database itself. For example, if the interface attempts to use an SSL connection but the server refuses, an InterfaceError will be raised.
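In other words, the error points at the connection itself rather than at your SQL. One way to defend against it is to test pooled connections before use and retry a chunk when the connection drops. A minimal sketch, assuming SQLAlchemy 1.4; the insert_with_retry helper and the retry/back-off numbers are illustrative assumptions, not part of the original code:

import time

import sqlalchemy.exc
from sqlalchemy import create_engine

# pool_pre_ping issues a lightweight test query before handing out a
# pooled connection, replacing the connection if it has gone stale
engine = create_engine("postgresql+pg8000://connection/info", pool_pre_ping=True)

def insert_with_retry(table, part, retries=3):
    # run one chunk in its own short transaction, retrying on network errors
    for attempt in range(retries):
        try:
            with engine.begin() as conn:
                conn.execute(table.insert(), part)
            return
        except (sqlalchemy.exc.InterfaceError, sqlalchemy.exc.OperationalError):
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off, then try a fresh connection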
Check if you have enabled the Service Networking API in the project. If you're trying to assign a private IP address to a Cloud SQL instance, and you're using a Shared VPC, you also need to enable the Service Networking API for the host project.
CSV and SQL formats export differently: the SQL format includes the entire database and is likely to take longer to complete. Use the CSV format and run multiple, smaller export jobs, so that each operation is small and short enough to avoid a timeout.
The temp_file_limit flag may also be set too low for your database usage; increase temp_file_limit as described in Configuring database flags.
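The same principle helps on the application side: if each scraped part is committed in its own short transaction, a dropped connection only loses the part in flight instead of the whole 12-hour run. A minimal sketch, reusing engine and MyTableClass from the question; the batches() helper and the chunk_size value are illustrative assumptions:

def batches(rows, chunk_size=1000):
    # yield fixed-size slices so each INSERT and commit stays small
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

for part in scraping_data():
    for batch in batches(part):
        with engine.begin() as conn:  # commits (or rolls back) this batch only
            conn.execute(MyTableClass.__table__.insert(), batch)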
Check whether you are properly scoping the SQLAlchemy session, as described here: https://flask.palletsprojects.com/en/1.1.x/patterns/sqlalchemy/
Specifically, when using scoped_session, you are not removing the session in a function decorated with @app.teardown_appcontext.
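For reference, the pattern from the linked Flask docs removes the scoped session when the application context is torn down. A minimal sketch, assuming a Flask app (the question's script may not use Flask at all, in which case calling session.remove() or engine.dispose() at the end of the job serves the same purpose):

from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("postgresql+pg8000://connection/info")
db_session = scoped_session(sessionmaker(autocommit=False, autoflush=False, bind=engine))

app = Flask(__name__)

@app.teardown_appcontext
def shutdown_session(exception=None):
    # return the session's connection to the pool after each request/context
    db_session.remove()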
Follow this documentation to see whether all the checklist items for connectivity issues have been met in your case.