amazon-web-services nginx amazon-cloudfront http-status-code-502

How best to manage CloudFront/Nginx 502 Bad Gateway errors in AWS


We have a website which is served through CloudFront. Sometime this week the origin EC2 (ECS) server crashed, and for a short time it returned 502 errors:

502 Bad Gateway | Nginx

This issue was resolved quickly, but a couple of users are still seeing the errors in their browsers. They are both using Google Chrome and the problem seems to be constant (as if the browser or CloudFront has cached the error). One user fixed the issue by switching to Incognito mode; the other sees the error every time they click a link from our newsletter. Some other users have only been able to fix it by using a different browser.

I am unsure how to start debugging this. I'd imagine that if the browser receives a 502 error it wouldn't cache the page content. I'm also unable to replicate the problem from my end.

To add extra information to the question:

I'm not looking for advice on how to stop or manage 502 Bad Gateway errors. We know why these happen(ed). This question is purely about fixing cached 502 errors after they have been delivered to the user.

From the feedback so far, it looks like we can configure CloudFront to cache 502 errors for only 10 seconds. We enabled this, but the issue still persists.

My feeling is that the user's browser has cached the 502 error page and isn't requesting an update from the server. Without getting them to clear their cache, is there a way to make CloudFront or their browser cache a 502 error only for a short period before requesting an updated page from the server?

Also, thinking about this again: the error is '502 Bad Gateway | Nginx'. Is this even coming from CloudFront? Could my server be sending long Cache-Control headers with 502 errors?
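
One way to approach those two questions is to look at the response headers of a failing request. Below is a minimal Python sketch, not a definitive diagnostic. The function names `summarize_response` and `fetch_summary` are hypothetical helpers made up for this example. It assumes that responses passing through CloudFront carry its usual `Via` and `X-Cache` headers, and it only reports what the headers say: whether the response went through CloudFront, which `Server` produced it, and whether `Cache-Control` would let a browser keep it.

```python
import urllib.error
import urllib.request


def summarize_response(status, headers):
    """Summarize caching-related facts about one HTTP response.

    `headers` is a plain dict of header name -> value (hypothetical helper).
    """
    h = {name.lower(): value for name, value in headers.items()}
    directives = [
        d.strip().lower() for d in h.get("cache-control", "").split(",") if d.strip()
    ]

    # Extract max-age, if the origin or CDN sent one.
    max_age = None
    for directive in directives:
        if directive.startswith("max-age="):
            try:
                max_age = int(directive.split("=", 1)[1])
            except ValueError:
                pass  # malformed value: treat as absent

    # Assumption: CloudFront identifies itself in Via and/or X-Cache.
    via_cloudfront = (
        "cloudfront" in h.get("via", "").lower()
        or "cloudfront" in h.get("x-cache", "").lower()
    )

    return {
        "status": status,
        "server": h.get("server", ""),
        "via_cloudfront": via_cloudfront,
        "x_cache": h.get("x-cache", ""),
        "no_store": "no-store" in directives,
        "max_age": max_age,
        "age": h.get("age"),
    }


def fetch_summary(url):
    """Fetch `url` and summarize the response, including error responses."""
    try:
        with urllib.request.urlopen(url) as resp:
            return summarize_response(resp.status, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the error object still carries the headers.
        return summarize_response(err.code, dict(err.headers.items()))
```

Calling `fetch_summary` with the URL of an affected page (ideally from an affected user's network, and with the same cookies) would show whether a 502 carries a long `max-age`. A 502 with `Server: nginx` and no `Age` header suggests the error is being generated fresh at the origin on each request rather than replayed from a cache.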


Solution

  • After going down a lot of dead ends, I finally found a solution to this issue. Apologies, the initial question was incorrect in its assumptions, but thanks for everyone's input anyway. My previous experience of 502 errors was limited to cases where the origin server went down. So when a small number of our users started receiving constant 502 errors while the server was functioning correctly, I immediately thought it was a CloudFront caching issue: the origin server had crashed, and the 502 error was being cached for these unfortunate users.

    After more debugging, the actual cause turned out to be a large (and growing) cookie that was set when a user came to the website from our emails. If the user wasn't logged in, the cookie would accumulate more data over time and grow in size. Its size was capped at the maximum size of a cookie, but that cap didn't take Nginx's header limits into account. This produced an 'upstream sent too big header' error, hence the 502. Removing the cookie and increasing the header limits fixed the issue. We will lower the limits again once the cookie has been deleted or has expired for our users. The original setting was:

    fastcgi_buffers 8 16k;
    

    updated to:

    fastcgi_buffers 16 16k;
    

    The error in the Nginx error log was:

    upstream sent too big header while reading response header from upstream
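
To catch this kind of problem before it reaches users, the size of the headers the application sends can be compared against a limit. The following Python sketch is only an illustration under stated assumptions: `parse_nginx_size`, `header_block_bytes` and `exceeds_limit` are hypothetical helpers, the size is an approximation of the on-the-wire header block, and the limit is passed in as a parameter because which Nginx buffer directive actually bounds the upstream header depends on the configuration. The default of `16k` simply mirrors the buffer size in the settings above.

```python
def parse_nginx_size(text):
    """Convert an Nginx-style size such as '16k' or '1m' to bytes."""
    text = text.strip().lower()
    units = {"k": 1024, "m": 1024 * 1024}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # no suffix: plain bytes


def header_block_bytes(headers):
    """Approximate the size in bytes of a block of response headers.

    `headers` is a list of (name, value) pairs, so repeated headers such as
    Set-Cookie are all counted.
    """
    total = 0
    for name, value in headers:
        # "Name: value" plus the trailing CRLF
        total += len(f"{name}: {value}".encode("latin-1")) + 2
    return total + 2  # blank line that terminates the header block


def exceeds_limit(headers, limit="16k"):
    """Return True if the header block is larger than `limit`."""
    return header_block_bytes(headers) > parse_nginx_size(limit)


# Example: a cookie that has grown to roughly 20 KB (made-up data)
big_cookie = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "tracking=" + "x" * 20000),
]
print(header_block_bytes(big_cookie), exceeds_limit(big_cookie, "16k"))
```

A check like this could run in application tests or in a logging hook, so that a cookie creeping towards the limit is noticed before Nginx starts rejecting the response.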