
The entire Node.js service crashes with 503 errors when the smallest error occurs

This is the first Node.js application I have designed and developed. When I run the app in staging or production, it crashes whenever a fatal error occurs. For instance, trying to access error[0] when error is empty brings it down. This takes the entire service down, and the client receives 503 errors. I am used to PHP and C#, where this doesn't happen: the end user gets an error, but the PHP or C# server doesn't go down. With Node.js, the entire web service becomes unavailable. I am working through the code to catch everything that could be a fatal error, but I still don't have confidence in this app, knowing that one mistake means my clients are unable to work. To restart the services, I created a health check system that expects a 200 code.

Here is my environment:

React.js with Ant Design & Node.js (14.21.1)
These are the commands I use to start the web service and the client, respectively, in production and staging:
- nohup node app.js > nohup.out &
- nohup node node_modules/@craco/craco/scripts/start.js > nohup.out &
The web service uses Apollo GraphQL & Sequelize with a pool { max: 5, min: 0, acquire: 30000, idle: 10000 }
Running Apache2 on AWS EC2 instances with Aurora/MySQL, but without using an RDS proxy.
PHP 8 to run health checks every 10 seconds and restart the services whenever a status code >= 300 is returned.

Here is what I want to know:

Why does the entire web service go down when an error happens? When PHP or C# have fatal errors, they only affect that one request, not the entire site. What I am experiencing is the equivalent of a PHP fatal error crashing the entire web service and taking it offline.
Is this normal for Node.js services?
If this is not normal, what am I doing wrong?
How can I stop errors from taking down the entire service?

Solution

When you use Node.js, your application is the server. When you use PHP, on the other hand, the server is Apache, and your code is just a script executed by Apache's mod_php module (at least, those are the typical configurations for Node.js and PHP, though not the only ones).

So when your Node.js application has an uncaught error, it's equivalent to your HTTP server having an uncaught error: the process crashes. With PHP, by contrast, Apache's mod_php catches the error and confines the damage to the request that caused it.

But that doesn't mean it's acceptable for a run-of-the-mill error to cause a Node.js HTTP server to crash. If that happens, it just means your error handling needs improvement. A well-coded Node.js server will catch an error, log it, respond with an error response, and keep chugging. You have to write that code yourself though, so it's not as forgiving as PHP in that regard.
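A minimal sketch of that catch, log, respond and keep going pattern, assuming a plain Node.js `http` server with no framework (Express and Apollo Server have their own error-handling hooks, so the details differ there). `withErrorHandling`, `buggyHandler` and `failingQueryHandler` are hypothetical names made up for illustration; the buggy handler reproduces the kind of `error[0]` crash described in the question.

```javascript
const http = require("http");

// Wraps a request handler so that any error it throws (synchronously or via a
// rejected promise) is logged and turned into a 500 response for that request
// only, instead of crashing the whole process.
function withErrorHandling(handler) {
  return (req, res) => {
    const fail = (err) => {
      console.error(`${req.method} ${req.url} failed:`, err);
      if (res.writableEnded) return; // response already finished; nothing to send
      if (res.headersSent) {
        res.end(); // too late to change the status code; just close the response
        return;
      }
      res.writeHead(500, { "Content-Type": "text/plain" });
      res.end("Internal Server Error\n");
    };
    try {
      // Promise.resolve() lets the same wrapper cover sync and async handlers.
      return Promise.resolve(handler(req, res)).catch(fail);
    } catch (err) {
      fail(err); // the handler threw synchronously
      return Promise.resolve();
    }
  };
}

// Hypothetical handler reproducing the bug from the question:
// `error` is undefined, so `error[0]` throws a TypeError.
function buggyHandler(req, res) {
  let error;
  const first = error[0];
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end(`first error: ${first}\n`);
}

// Hypothetical async handler that rejects, e.g. after a failed DB query.
async function failingQueryHandler(req, res) {
  throw new Error("simulated query failure");
}

// Created but not started here; see the usage note below.
const server = http.createServer(withErrorHandling(buggyHandler));
```

To actually serve requests you would call something like `server.listen(3000)` (the port is arbitrary). A request that reaches `buggyHandler` then gets a 500 response and the error is logged, while the server keeps accepting new requests.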

As for what to do about it, it depends on what kind of errors you're facing exactly, but the general idea is that there should be a top-level error handler that catches any error during a request, logs the error, does whatever else should be done, and returns an HTTP 500 response. There are a few additional gotchas:

- That kind of error handler cannot catch an "unhandled promise rejection". This sort of error happens when you neither await nor catch() your promises. You should always do one or the other, but if you want to stop your server from crashing while you track down where those are, you can subscribe to the process's unhandledRejection event, which prevents them from crashing the process. (Starting with Node.js 15, an unhandled rejection crashes the process by default; Node.js 14 only prints a warning.) Make sure to log them and fix them properly.
- If you are using something that inherits from EventEmitter and it emits an error event you have not subscribed to, the error is thrown as an uncaught exception and crashes the application. If you're using anything that emits events (not very common in an HTTP server), make sure you subscribe to its error event.
- If you are using callback APIs (there's rarely a good reason to anymore), be careful about throwing errors from within a callback: an error thrown there can't be caught by a surrounding try/catch and can crash the application.
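For the unhandled-rejection case, a sketch of the safety net, assuming it is registered once at startup (for example near the top of app.js). `saveAuditLog`, `handlerWithBug` and `handlerFixed` are hypothetical; the listener only logs, and the real fix is the `try`/`await` shown in `handlerFixed`.

```javascript
// Last-resort safety net: log promise rejections that nobody awaited or
// catch()ed, so they can be found and fixed. Registering this listener also
// replaces Node's default behaviour (a warning on 14, a crash on 15+).
const unhandledRejections = [];

process.on("unhandledRejection", (reason) => {
  unhandledRejections.push(reason);
  console.error("Unhandled promise rejection (missing await/catch?):", reason);
});

// Hypothetical async operation that fails.
function saveAuditLog() {
  return Promise.reject(new Error("audit insert failed"));
}

// Hypothetical bug: the promise is neither awaited nor catch()ed,
// so its rejection is "unhandled".
function handlerWithBug() {
  saveAuditLog(); // fire-and-forget: nothing handles the rejection
  return "ok";
}

// The real fix: await the promise inside try/catch (or attach .catch()).
async function handlerFixed() {
  try {
    await saveAuditLog();
  } catch (err) {
    console.error("Audit log failed, continuing:", err.message);
  }
  return "ok";
}
```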
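For the EventEmitter case, a small sketch showing why the `error` listener matters. `createClient` is a hypothetical stand-in for any event-based object (socket, stream, queue consumer); only `EventEmitter` itself comes from Node's built-in `events` module.

```javascript
const { EventEmitter } = require("events");

// Hypothetical event-based client. EventEmitter treats the "error" event
// specially: if it is emitted with no listener attached, the error is thrown.
function createClient({ withErrorListener }) {
  const client = new EventEmitter();
  if (withErrorListener) {
    client.on("error", (err) => {
      console.error("Client error (handled, process keeps running):", err.message);
    });
  }
  return client;
}

const safeClient = createClient({ withErrorListener: true });
safeClient.emit("error", new Error("connection reset")); // logged, not thrown

const unsafeClient = createClient({ withErrorListener: false });
// unsafeClient.emit("error", new Error("connection reset")) would throw here;
// if that happens inside an async callback it becomes an uncaught exception
// and the process exits.
```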
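For the callback case, a sketch of the usual convention: pass errors to the callback instead of throwing them. `readConfig` and `readConfigAsync` are hypothetical helpers; `fs.readFile` and `fs.promises.readFile` are Node's built-in APIs.

```javascript
const fs = require("fs");

// Hypothetical config loader. Rule of thumb for callback APIs: never throw
// inside the callback; hand every error to the caller's callback instead.
function readConfig(path, done) {
  fs.readFile(path, "utf8", (err, text) => {
    // `if (err) throw err;` here would escape any try/catch around
    // readConfig(), because this callback runs later, and would crash
    // the process.
    if (err) {
      done(err);
      return;
    }
    let config;
    try {
      config = JSON.parse(text); // JSON.parse throws on invalid input
    } catch (parseErr) {
      done(parseErr);
      return;
    }
    done(null, config);
  });
}

// Usually simpler: the promise-based API works with async/await and try/catch.
async function readConfigAsync(path) {
  const text = await fs.promises.readFile(path, "utf8");
  return JSON.parse(text);
}
```

The promise version lets the caller wrap `await readConfigAsync(...)` in a single try/catch, which is one reason callbacks are rarely worth it anymore.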