I have gunicorn running in an AWS ECS container, with a Flask app behind it serving requests. The Flask app uses pymongo to talk to AWS DocumentDB. This works most of the time, but I am getting random failures from pymongo. Here is the traceback I get when I run a find_one() query. Note that the same query succeeds roughly 9 times out of 10. I am the only user of the API right now, so heavy traffic is not the cause.
File "/opt/lib/python3.8/site-packages/pymongo/collection.py", line 1452, in find_one
for result in cursor.limit(-1):
File "/opt/lib/python3.8/site-packages/pymongo/cursor.py", line 1248, in next
if len(self.__data) or self._refresh():
File "/opt//lib/python3.8/site-packages/pymongo/cursor.py", line 1139, in _refresh
self.__session = self.__collection.database.client._ensure_session()
File "/opt/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1712, in _ensure_session
return self.__start_session(True, causal_consistency=False)
File "/opt/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1657, in __start_session
self._topology._check_implicit_session_support()
File "/opt/lib/python3.8/site-packages/pymongo/topology.py", line 538, in _check_implicit_session_support
self._check_session_support()
File "/opt/lib/python3.8/site-packages/pymongo/topology.py", line 554, in _check_session_support
self._select_servers_loop(
File "/opt/lib/python3.8/site-packages/pymongo/topology.py", line 238, in _select_servers_loop
raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: <document_db_cluster_host>:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 66340edd96390975a24866e3, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('<document_db_cluster_host>', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('<document_db_cluster_host>:27017: timed out')>]>',)"
Has anyone faced a similar issue before?
As suggested, I am passing connect=False when creating the MongoClient. I also make sure the MongoClient is instantiated only after the Flask app is fully initialized and the first request has been received. I tried various combinations of options when creating the MongoClient, such as readPreference="primaryPreferred", directConnection=True with the primary instance as the host, and maxPoolSize=1. I also simplified the gunicorn configuration to use only 1 worker and 1 thread. Nothing helped.
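For reference, the lazy client creation described above might look roughly like this. It is only a sketch: get_client and its factory parameter are hypothetical names introduced here, not part of pymongo, and the real setup may differ. The factory argument exists only so the pattern can be tried with a stand-in instead of a live cluster.

```python
import threading

_client = None
_lock = threading.Lock()


def get_client(uri, factory=None, **options):
    """Return one shared client, creating it on the first call only.

    Hypothetical helper: call it from a request handler so the client is
    built inside the gunicorn worker, after the app has started.
    """
    global _client
    with _lock:
        if _client is None:
            if factory is None:
                # Default to the real pymongo client when no stand-in is given.
                from pymongo import MongoClient
                factory = MongoClient
            # connect=False defers the actual connection to the first operation.
            _client = factory(uri, connect=False, **options)
    return _client
```

Extra options such as maxPoolSize=1 or directConnection=True would be passed through as keyword arguments, for example get_client(uri, maxPoolSize=1).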
This turned out to be a bug in my configuration. Besides my service container, I have one more container. It is an exact replica of the service container but serves only 5% of total traffic. Let's call it the test container; it exists mainly for testing. Because of the configuration bug, this test container wasn't allowed to connect to the DB cluster, so every request routed to it failed. That is why it looked like a random failure.
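A quick way to spot this kind of problem is to check plain TCP reachability of the cluster from inside each container. The following is a minimal sketch using only the standard library. can_reach is a hypothetical helper name, and 27017 is the port shown in the traceback above.

```python
import socket


def can_reach(host, port=27017, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Running can_reach with the cluster host in every container that serves traffic should show which one is blocked. A container that cannot reach the DB would presumably return False here, while the healthy one returns True. Note that this only tests network access, not TLS or authentication.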