I'm running a Windows server on AWS that serves data to IoT devices. After a while the server stops responding to requests because it hangs on the s.accept() call. I've determined that this happens because the server has too many TCP connections open, so the OS won't allocate any more, which makes sense. What doesn't make sense to me is why the connections are still open, because they should all have been closed. Here is an example from my code, with parts omitted for safety:
import select
import socket
import ssl
from datetime import datetime
from threading import Thread

# get_info, add_connection_info, cert and key are defined in the omitted parts.

def connection(conn, addr):
    conn.settimeout(10)
    data = None
    connection_time = datetime.now()
    n_items = 0
    try:
        print(connection_time.strftime("[%d/%m/%Y, %H:%M:%S] "), "new connection started:", addr)
        data = get_info(conn)
        print(addr, data)
        # serve client here, protocol omitted
    except Exception as e:
        print(f"{addr} connection error:" + str(e))
    if data is not None:
        add_connection_info(addr, data, connection_time)
    try:
        conn.close()
        print(connection_time.strftime("[%d/%m/%Y, %H:%M:%S] "), "connection ended:", addr)
    except Exception as e:
        print(f"close failed: {addr} ; {e}")

if __name__ == '__main__':
    ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ssl_context.load_cert_chain(cert, key, password=*omitted*)
    s = socket.socket()
    s = ssl_context.wrap_socket(s, server_side=True)
    host = "0.0.0.0"
    port = 12345  # not the actual port
    print('Server started:', host, port)
    s.bind((host, port))  # Bind to the port
    s.listen()  # Now wait for client connection.
    s.setblocking(False)
    threads = []
    while True:
        # Join completed threads and check connection status
        for thread in threads:
            thread.join(0)
        threads = [t for t in threads if t.is_alive()]
        print(f"{len(threads)} active connections")
        c = None  # reset so the error handler never closes an earlier connection
        try:
            # Use select to wait for a connection or timeout
            rlist, _, _ = select.select([s], [], [], 100)  # 100 seconds timeout
            if s in rlist:
                s.settimeout(10)  # applies to the listening socket only
                # TODO timeout here
                c, addr = s.accept()
                print(f"Accepted connection from {addr}")
                thread = Thread(target=connection, args=(c, addr))
                #thread.daemon = True
                thread.start()
                threads.append(thread)
                print("thread started")
            else:
                print("No connection within 100 second period")
        except BlockingIOError:
            print("No connection ready")
        except Exception as e:
            print("error", str(e))
            if c is not None:  # accept() succeeded but starting the thread failed
                try:
                    c.close()
                    print(f"Connection from {addr} closed due to error.")
                except Exception as e_close:
                    print(f"Failed to close connection after error: {str(e_close)}")
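One possible restructuring of the loop above, offered as a sketch and not as the fix the author eventually used. Because the listening socket is TLS-wrapped, s.accept() also runs the TLS handshake in the main thread, so a single client that connects and then stalls could hold up every later accept. The sketch accepts on a plain socket and leaves any wrapping to the worker thread, under a per-connection timeout. The names serve, accept_loop, wrap, handler and stop are hypothetical; wrap is assumed to be something like lambda sock: ssl_context.wrap_socket(sock, server_side=True).

```python
import socket
import threading


def serve(raw, addr, wrap, handler, timeout):
    """Worker thread: do the (possibly slow) wrap step, then run the handler."""
    raw.settimeout(timeout)  # bounds the handshake and every later recv/send
    try:
        conn = wrap(raw)  # e.g. a TLS handshake; may raise on timeout
    except OSError:  # ssl.SSLError and socket timeouts are OSError subclasses
        raw.close()
        return
    try:
        handler(conn, addr)
    except OSError:
        pass  # a real server would log this
    finally:
        conn.close()  # always release the socket


def accept_loop(listener, wrap, handler, stop, timeout=10):
    """Accept plain TCP connections until the `stop` event is set."""
    listener.settimeout(0.2)  # wake up regularly to check `stop`
    while not stop.is_set():
        try:
            raw, addr = listener.accept()  # no handshake here, returns quickly
        except socket.timeout:
            continue
        threading.Thread(
            target=serve, args=(raw, addr, wrap, handler, timeout), daemon=True
        ).start()
```

With this shape, a stalled client ties up only its own worker thread until the timeout expires, and the accept loop keeps draining the listen backlog.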
I'm logging the output of the server, and when I last checked after seeing the server freeze, every "new connection started:" line printed by connection() had a matching "connection ended:" line. As far as I can tell there should be no open connections, because print(f"{len(threads)} active connections") reports 0 active threads. But when I open Windows Resource Monitor there are 50+ TCP connections held open by python, even for (ip, port) pairs that should have been closed hours ago and were logged as "ended" by the server, so I don't understand why they are still there.

Update: I was a little mistaken. After adding some logging to my code, it appears that the open connections in Resource Monitor were never accepted by my server. They are definitely coming from my devices, but I'm unsure how to close or free them if I never know they are there in the first place. If I run netstat -an I can see they are all stuck in the CLOSE_WAIT state. Is there a way to force Windows to clean up connections that have been stuck in this state for more than 5 minutes?
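CLOSE_WAIT means the remote peer has already closed its side and the OS is waiting for the local process to close the socket, so it is normally the application that has to release these connections rather than Windows. To watch whether a change helps, a small sketch like the following could tally states from netstat output. It assumes the Windows netstat -an column layout (Proto, Local Address, Foreign Address, State); count_tcp_states and current_tcp_states are hypothetical helpers, not part of the original server.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output, port=None):
    """Tally TCP connection states from `netstat -an` text (Windows layout assumed)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Expected row: TCP  <local addr:port>  <foreign addr:port>  <STATE>
        if len(parts) < 4 or parts[0].upper() != "TCP":
            continue  # skips headers, blank lines and stateless UDP rows
        local, state = parts[1], parts[3]
        if port is not None and not local.endswith(f":{port}"):
            continue  # keep only rows for the requested local port
        counts[state] += 1
    return counts


def current_tcp_states(port=None):
    """Run netstat and count states; requires netstat on PATH."""
    output = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    ).stdout
    return count_tcp_states(output, port)
```

For example, current_tcp_states(12345)["CLOSE_WAIT"] could be logged periodically next to the active thread count.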
I've managed to fix the issue. The fix seems to have been to use socket.setdefaulttimeout(10). I'm not sure why this works when s.settimeout(10) did not, but the server has now been running for 6 days without issues (it used to run for about 8-12 hours before halting), and there are now 0 connections stuck in the CLOSE_WAIT state.
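A likely explanation, offered as an assumption and not something confirmed in the original post: s.settimeout(10) only affects the listening socket, and a socket returned by accept() takes its timeout from the module-wide default, not from the listener. With no default set, the accepted socket has no timeout, so the TLS handshake that the wrapped s.accept() performs could wait on a silent client indefinitely while later connections pile up unaccepted. The sketch below uses plain TCP on the loopback interface to show the timeout difference; accepted_timeout is a hypothetical helper.

```python
import socket


def accepted_timeout(default_timeout):
    """Return the timeout of a socket produced by accept() under a given default."""
    socket.setdefaulttimeout(default_timeout)
    try:
        listener = socket.socket()
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen()
        listener.settimeout(10)  # as in the question: timeout on the listener only
        client = socket.create_connection(listener.getsockname())
        conn, _ = listener.accept()
        try:
            return conn.gettimeout()
        finally:
            conn.close()
            client.close()
            listener.close()
    finally:
        socket.setdefaulttimeout(None)  # restore the interpreter-wide default
```

Calling accepted_timeout(None) returns None (the accepted socket blocks indefinitely), while accepted_timeout(10) returns 10.0, which would match the behavior described above.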