nginx – remove “no live upstreams” for your backup connections

In this article, we discuss the use of Nginx upstream module for HTTP and CGI (FastCGI) requests.

Using Nginx upstream module is essential for scaling application backend, but there might be a few catch-ups. One of them is related to what will happen when a server fails?
There is a proxy_next_upstream directive (or for FastCGI – fastcgi_next_upstream), which instructs the upstream module what is a fail – is it a connection error or timeout or HTTP 500 returned by the upstream server or an ordinary HTTP 404 returned by the upstream server is also a fail. So when a failure is identified by the Nginx upstream module the upstream module will look for the next upstream server to handle the request. These directives instruct Nginx upstream module what is a failure then to handle the next upstream server if available.
The default values are too conservative (and probably it is better to be like that):

proxy_next_upstream error timeout;

And available options are:

proxy_next_upstream error | timeout | invalid_header | http_500 | http_502 | http_503 | http_504 | http_403 | http_404 | http_429 | non_idempotent | off ...;

They are the same for fastcgi_next_upstream, too.

Imagine you what to protect yourself from HTTP 500? Or even HTTP 403 and 404? It’s normal to include them in your configuration. But here is the catch-up:
What if an image is really missing (HTTP 404) on all your upstream backend servers? Or there is a syntax error in an application file (like a simple PHP file)? All your upstreams servers will return 404 or 500 (in case of application error) and all of them will be blacklisted for at least 20 seconds. Remember 404 or 500 is a failure and we need next upstream server and if all return failure for this particular request Nginx will return to the client there is a problem and will mark the server as down (unavailable for a period of time).

SO because of a single file (a problem in a single file), all the following requests will be denied with “502 Bad Gateway” or “500 Internal Server Error”, even your servers are healthy!

Just a tiny miss like a missing image could be misinterpreted as your upstream servers have problems, so they must be blocked! Even if you put a “backup” directive in the upstream server line!

The solution is to include one or more of your upstream servers with disabled failure count (fail_timeout=0s) as a backup server. So this server will be always available when all normal servers got blacklisted! And you are not going to receive any more “no live upstreams” and returning an error to your clients.
Here is a working configuration (it is the same for HTTP and FastCGI setups):

upstream backend {
     server   127.0.0.1:8000;
     server   10.10.10.10:9000 fail_timeout=20s;
     server   127.0.0.1:8000 backup fail_timeout=0s max_fails=1000;
}

And be careful what you add in proxy_next_upstream (or fastcgi_next_upstream). In general, HTTP 403 and 403 are not for this directive!

A real-world example (FastCGI)

One of our project using an Nginx with the upstream module to scale the PHP application backend began to serve only HTTP 502 to all clients! In the PHP logs, there was a rear syntax error on a single file (not in the main part of the site), but Nginx was answering to all requests, no matter of the URI with 502. What had happened? The two backend application servers had returned with 500 (because of this error) and were blacklisted for 20 seconds! And all the following requests were not served by the upstream backends because there were “no live upstreams”:

upstream backend-php {
        server   127.0.0.1:8000;
        server   178.63.22.46:9000 fail_timeout=20s;
        server   127.0.0.1:8000 backup;
}

with

fastcgi_next_upstream error timeout invalid_header http_500 http_503 http_429 non_idempotent;

Even we have a backup, it was also blacklisted. In fact, a scanner was scanning all of our PHP files and one URL returned syntax error, then all of the upstream servers were blacklisted and we experienced an effective DoS because of misconfiguration. After we changed our configuration to:

upstream backend-php {
        server   127.0.0.1:8000;
        server   10.10.10.10:9000 fail_timeout=20s;
        server   127.0.0.1:8000 backup fail_timeout=0s max_fails=1000;
}

Everything returned to normal. There was still this syntax error but did not stop all other valid URLs to be served. Of course, we fixed the broken file and stopped the scammer from scanning our site.

A real-world example 2 (HTTP)

In our proxy static cache servers in remote locations, we experienced periodically “no live upstreams” and our clients received “502 Bad Gateway” on-peak hours! The problem was we have too aggressive proxy connect, read and send timeout, but because we were serving a live TV we needed them. And on-peak if a single connection just huck-up for 5-10 seconds our upstream servers were blacklisted for 20 seconds! Using proxy_cache_lock could worsen the situation! Then we changed our configuration to have a backup upstream server, which effectively would not be blacklisted and lowered the proxy_cache_lock to be sure if a single connection failed for some reason all other might succeed in bringing the file to the cache!

proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_403 http_404;
proxy_connect_timeout 2s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
proxy_cache_lock on;
proxy_cache_lock_timeout 20s;
proxy_cache_lock_age 10s;

with upstream configuration:

upstream backend_http {
        server 10.10.10.10 fail_timeout=20s;
        server 10.10.10.11 backup fail_timeout=0s max_fails=1000;
        keepalive 16;
}

Leave a Reply

Your email address will not be published.