[SOLVED] gunicorn, Flask and PyMongo intermittent timeout failure

I have gunicorn running in an AWS ECS container, with a Flask app behind it serving requests. The Flask app uses PyMongo to talk to AWS DocumentDB. This works most of the time, but I am getting random failures from PyMongo. Here is the traceback I get when I run a find_one() query. Note that the same query works almost 9 times out of 10. I am the only user of the API right now, so heavy traffic is not the cause.

  File "/opt/lib/python3.8/site-packages/pymongo/collection.py", line 1452, in find_one    
    for result in cursor.limit(-1):  
  File "/opt/lib/python3.8/site-packages/pymongo/cursor.py", line 1248, in next    
    if len(self.__data) or self._refresh():  
  File "/opt//lib/python3.8/site-packages/pymongo/cursor.py", line 1139, in _refresh    
    self.__session = self.__collection.database.client._ensure_session()  
  File "/opt/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1712, in _ensure_session    
    return self.__start_session(True, causal_consistency=False)  
  File "/opt/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1657, in __start_session    
    self._topology._check_implicit_session_support()  
  File "/opt/lib/python3.8/site-packages/pymongo/topology.py", line 538, in _check_implicit_session_support    
    self._check_session_support()  
  File "/opt/lib/python3.8/site-packages/pymongo/topology.py", line 554, in _check_session_support    
    self._select_servers_loop(  
  File "/opt/lib/python3.8/site-packages/pymongo/topology.py", line 238, in _select_servers_loop    
    raise ServerSelectionTimeoutError(
      pymongo.errors.ServerSelectionTimeoutError: <document_db_cluster_host>:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 66340edd96390975a24866e3, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('<document_db_cluster_host>', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('<document_db_cluster_host>:27017: timed out')>]>',)"

Has anyone faced a similar issue before?

As suggested, I am passing connect=False when creating the MongoClient. I am also making sure that the MongoClient is instantiated only after the Flask app is completely initialized and the first request has been received. I tried various combinations of options when creating the MongoClient, such as readPreference="primaryPreferred", directConnection=True with the primary instance as the host, and maxPoolSize=1. I have also simplified the gunicorn configuration so that it uses only 1 worker and 1 thread. None of this helped.
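For anyone trying the same setup, here is a minimal sketch of what "create the MongoClient lazily, with connect=False" can look like. It is not my exact code. The names get_client and client_factory are made up for illustration; client_factory exists only so the function can be exercised without a real database, and it defaults to pymongo.MongoClient.

```python
import os
import threading

_lock = threading.Lock()
_client = None
_client_pid = None


def get_client(uri, client_factory=None):
    """Return one client per process, created on first use.

    client_factory is a hypothetical hook for testing; by default
    pymongo.MongoClient is imported and used.
    """
    global _client, _client_pid
    if client_factory is None:
        from pymongo import MongoClient  # imported lazily, on first call
        client_factory = MongoClient
    with _lock:
        # Recreate the client if this process was forked after creation,
        # so a gunicorn worker never reuses its parent's client.
        if _client is None or _client_pid != os.getpid():
            # connect=False defers connecting until the first operation.
            _client = client_factory(uri, connect=False)
            _client_pid = os.getpid()
        return _client
```

A request handler would then call get_client(uri) instead of using a module-level MongoClient. As the solution shows, this pattern was not the cause of my failures, but it is the setup the errors occurred with.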

Solution

This turned out to be a bug in my configuration. I have one more container in addition to my service container. This additional container is an exact replica of the service container but serves only 5% of the total traffic. Let's call it the test container; it is mainly for testing. Because of a bug in my configuration, this test container wasn't allowed to connect to the DB cluster, so every request routed to it failed. That is why the failures looked random.
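A quick way to catch this kind of problem is to run a plain TCP reachability check from inside each container. The sketch below is only an illustration, assuming the block is at the network level; can_reach is a made-up helper name. It uses only the standard library and does not test TLS or authentication, only whether a TCP connection to the host and port can be opened.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and connects, raising OSError
        # (including timeouts and refused connections) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling can_reach("<document_db_cluster_host>", 27017) in every container, including the one that only gets a small share of traffic, would show right away which container cannot open a connection to the cluster.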