nginx – remove “no live upstreams” for your backup connections

In this article, we discuss the use of Nginx upstream module for HTTP and CGI (FastCGI) requests.

Using Nginx upstream module is essential for scaling application backend, but there might be a few catch-ups. One of them is related to what will happen when a server fails?
There is a proxy_next_upstream directive (or for FastCGI – fastcgi_next_upstream), which instructs the upstream module what is a fail – is it a connection error or timeout or HTTP 500 returned by the upstream server or an ordinary HTTP 404 returned by the upstream server is also a fail. So when a failure is identified by the Nginx upstream module the upstream module will look for the next upstream server to handle the request. These directives instruct Nginx upstream module what is a failure then to handle the next upstream server if available.
The default values are too conservative (and probably it is better to be like that):

proxy_next_upstream error timeout;

And available options are:

proxy_next_upstream error | timeout | invalid_header | http_500 | http_502 | http_503 | http_504 | http_403 | http_404 | http_429 | non_idempotent | off ...;

They are the same for fastcgi_next_upstream, too.

Imagine you what to protect yourself from HTTP 500? Or even HTTP 403 and 404? It’s normal to include them in your configuration. But here is the catch-up:
What if an image is really missing (HTTP 404) on all your upstream backend servers? Or there is a syntax error in an application file (like a simple PHP file)? All your upstreams servers will return 404 or 500 (in case of application error) and all of them will be blacklisted for at least 20 seconds. Remember 404 or 500 is a failure and we need next upstream server and if all return failure for this particular request Nginx will return to the client there is a problem and will mark the server as down (unavailable for a period of time).

SO because of a single file (a problem in a single file), all the following requests will be denied with “502 Bad Gateway” or “500 Internal Server Error”, even your servers are healthy!

Just a tiny miss like a missing image could be misinterpreted as your upstream servers have problems, so they must be blocked! Even if you put a “backup” directive in the upstream server line!

The solution is to include one or more of your upstream servers with disabled failure count (fail_timeout=0s) as a backup server. So this server will be always available when all normal servers got blacklisted! And you are not going to receive any more “no live upstreams” and returning an error to your clients.
Here is a working configuration (it is the same for HTTP and FastCGI setups):

upstream backend {
     server   127.0.0.1:8000;
     server   10.10.10.10:9000 fail_timeout=20s;
     server   127.0.0.1:8000 backup fail_timeout=0s max_fails=1000;
}

And be careful what you add in proxy_next_upstream (or fastcgi_next_upstream). In general, HTTP 403 and 403 are not for this directive!

A real-world example (FastCGI)

One of our project using an Nginx with the upstream module to scale the PHP application backend began to serve only HTTP 502 to all clients! In the PHP logs, there was a rear syntax error on a single file (not in the main part of the site), but Nginx was answering to all requests, no matter of the URI with 502. What had happened? The two backend application servers had returned with 500 (because of this error) and were blacklisted for 20 seconds! And all the following requests were not served by the upstream backends because there were “no live upstreams”:

upstream backend-php {
        server   127.0.0.1:8000;
        server   178.63.22.46:9000 fail_timeout=20s;
        server   127.0.0.1:8000 backup;
}

with

fastcgi_next_upstream error timeout invalid_header http_500 http_503 http_429 non_idempotent;

Even we have a backup, it was also blacklisted. In fact, a scanner was scanning all of our PHP files and one URL returned syntax error, then all of the upstream servers were blacklisted and we experienced an effective DoS because of misconfiguration. After we changed our configuration to:

upstream backend-php {
        server   127.0.0.1:8000;
        server   10.10.10.10:9000 fail_timeout=20s;
        server   127.0.0.1:8000 backup fail_timeout=0s max_fails=1000;
}

Everything returned to normal. There was still this syntax error but did not stop all other valid URLs to be served. Of course, we fixed the broken file and stopped the scammer from scanning our site.

A real-world example 2 (HTTP)

In our proxy static cache servers in remote locations, we experienced periodically “no live upstreams” and our clients received “502 Bad Gateway” on-peak hours! The problem was we have too aggressive proxy connect, read and send timeout, but because we were serving a live TV we needed them. And on-peak if a single connection just huck-up for 5-10 seconds our upstream servers were blacklisted for 20 seconds! Using proxy_cache_lock could worsen the situation! Then we changed our configuration to have a backup upstream server, which effectively would not be blacklisted and lowered the proxy_cache_lock to be sure if a single connection failed for some reason all other might succeed in bringing the file to the cache!

proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504 http_403 http_404;
proxy_connect_timeout 2s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
proxy_cache_lock on;
proxy_cache_lock_timeout 20s;
proxy_cache_lock_age 10s;

with upstream configuration:

upstream backend_http {
        server 10.10.10.10 fail_timeout=20s;
        server 10.10.10.11 backup fail_timeout=0s max_fails=1000;
        keepalive 16;
}

ansible – restart a (nginx) service only if it is running and the configuration is ok

Another ansible quick tip showing how to restart a program properly. We want to restart the program or the service only if it is running (because some system on executing restart may start the service even it is in the stopped state).
Here is what the ansible playbook do:

  1. Check if the program is running.
  2. Check the configuration of the program. Do not restart a program or service if it cannot start after a stop command because of bad configuration file(s).
  3. Restart the service (the program) only if the above two are true.

If you are a newbie in ansible you can check this article – First ansible use – install and execute a single command or multiple tasks in a playbook There you can see how to create your inventory file (and configure sudo if you remotely log in with unprivileged user) used herein the example.

Ansible YAML file

For our example we use the nginx webserver in the ansible playbook. Put the following code in a file and then execute ansible-playbook:

---
- hosts: all
  tasks:
            
    - name: Test for running nginx
      shell: ps axuf|grep 'nginx'|grep -v "grep" | tr -d "\n" | cat
      register: test_running_nginx
      changed_when: False
      tags: restart-nginx
      
    - name: First check the configuration
      shell: /usr/sbin/nginx -t
      register: test_nginx_config
      when: test_running_nginx.stdout != ""
      changed_when: False
      ignore_errors: True
      tags: restart-nginx
          
    - name: Restart nginx
      service: name=nginx state=restarted
      when: test_running_nginx.stdout != "" and test_nginx_config.rc == 0
      tags: restart-nginx

Here is how to run the above ansible playbook

myuser@srv ~ $ ansible-playbook -l srv2 -i ./inventory.ini ./playbook-example.yml -b

PLAY [all] *****************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************
ok: [srv2]

TASK [Test for running nginx] **********************************************************************************************************************************************
ok: [srv2]

TASK [First check the configuration] ***************************************************************************************************************************************
ok: [srv2]

TASK [Restart nginx] *******************************************************************************************************************************************************
changed: [srv2]

PLAY RECAP *****************************************************************************************************************************************************************
srv2                       : ok=4    changed=1    unreachable=0    failed=0   

Here we add to the command line “-b”, which will escalate to root if it is needed (using sudo) because the remote connection is done with unprivileged user “myuser”. You can skip this option if you described the remote connection with the root user in the inventory file (or a system user, which has permissions to restart services).
Keep on reading!

nginx with php fpm (fastcgi) and the warning – an upstream response is buffered to a temporary file /var/cache/nginx/fastcgi_temp

As the web grows and the technology advances the page size of the web sites also grows or just some times you might want to output a big chunk of data from your application server – PHP-FPM (but it could be any of another ruby, python, C, Django and more), for example.
Here is a fast configuration tip (note this is not the proxy-related warning!):

The default nginx buffers per CGI connection are too small

Here is what to do in your nginx configuration file:
First, look for a line “include /etc/nginx/fastcgi_params;” or similar and add or edit if they exist after this line:

        fastcgi_buffer_size 16k;
        fastcgi_buffers 32 16k;

Check out more for the buffers here http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html#fastcgi_buffers
The warning should stop if it does not stop you can try raising them. It could consume more memory but could lower the IO usage of your disks and improve the performance of your site or whatever backend works!

Here is the warning in our nginx error logs. We got this warning when using php-fpm and the php output size was 325965 bytes (~320K).

2019/04/04 09:56:05 [warn] 24451#24451: *44269838 an upstream response is buffered to a temporary file /var/cache/nginx/fastcgi_temp/0/12/0019966120 while reading upstream, client: 10.10.10.10, server: srv17.srv.en, request: "GET /api/20140102/product HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "srv17.srv.en"
2019/04/04 09:56:07 [warn] 24451#24451: *44269849 an upstream response is buffered to a temporary file /var/cache/nginx/fastcgi_temp/2/12/0019966122 while reading upstream, client: 10.10.10.11, server: srv17.srv.en, request: "GET /api/20140102/product HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "srv17.srv.en"
2019/04/04 09:56:09 [warn] 24450#24450: *44269856 an upstream response is buffered to a temporary file /var/cache/nginx/fastcgi_temp/7/12/0019966127 while reading upstream, client: 10.10.10.12, server: srv17.srv.en, request: "GET /api/20140102/product HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "srv17.srv.en"

Tune nginx proxy cache – control the cache manager how to delete cached files

In most cases you’ll never want to modify the default settings for deleting cache items with proxy_cache_path directives. The problem is in a peak the file deleting could impact your server performance and even it could kill your server leaving it unresponsive for a period of time. You cannot instruct nginx with a schedule job for deletion cached items or ban the deletion when the server is busy or loaded. The manager just traces each zone for used cache capacity versus the maximum allowed size and if the used capacity is near or bigger than the maximum allowed size (max_size) the manager process triggers deletion with the default values – the nginx manager will try to delete at least 100 files (up to 200 milliseconds) and then it will sleep for 50 milliseconds then again it will try deleting 100 files. So your file system could receive at least 1000 files per second to delete!

This could lead your server to almost unresponsive state in the peaks.

And it could be perfectly OK in off-peaks, but there is no way how to tell nginx cache manager there is a plenty free space despite you reach the cache limit so at the moment it is not the best time to delete the cache!

You can tune three parameters per cache directory (manual here: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path):

  • manager_files – not more than this number of files to delete in one iteration. The default value is 100.
  • manager_threshold – limit the delete iteration time. The default value is 200 milliseconds and you must use nginx time syntax concatenated to the number you want, for example if you want 500 milliseconds you must use “500 ms”.
  • manager_sleep – how much time to sleep the manager before executing another delete iteration. The default value is 50 milliseconds and here you must use nginx time syntax concatenated to the number you want, for example if you want 500 milliseconds you must use “500 ms”.
        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g manager_files=2 manager_sleep=200ms manager_threshold=500ms;

The cache manager will delete not more than 2 files for up to 500 milliseconds and it will sleep 200 milliseconds before another delete iteration.

The best option for loaded servers

The best option for loaded servers with full cache is to balance the free space – delete small amount of files at once to be sure your server will not get loaded even the free space decreases at the peaks (so more files are cached than the nginx manager could delete – you are aware of this and the free space should be enough), but during the off peak (which normally is several times longer than the peak) the nginx manager could catch up with the deleting and it should free up some space (cached files are lesser than the deleted ones). Of course, you should tune this according to your situation.
The main idea is to delete in small amounts of files to not saturate your disks it could take longer to recover the free space, but it will not load your server in peaks. You should consider two things:

  1. Free space – enough free space and to be sure the free space is enough for the peaks, when the cache could grow above the threshold.
  2. Number of deletions per iteration – you should experiment with this. Fist you should be away how many files are added for a period of time, which includes one peak and one off-peak and then to balance the number in such a way that after the period the cache is not above the maximum size. Probably the best is to start with a 24 hours period, which includes at least one peak.

As you can see the example above only 2 files are good enough for an iteration for our case. Taking into account the 200ms sleep between the files’ deletions 10 files at most should be deleted per second. In our case it is not enough for the peak, but for the off-peak, which is 20 hours every 24 hours, is good enough to get into the maximum size limit of the cache.

Here you can learn how to verify your nginx is deleting cache files and the impact of the default settings on a busy server in a peak: how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process) Our loaded server just stopped serving files and the bandwidth decreased with 99% because nginx cache manager suddenly started deleting cached files.

how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process)

In peaks deleting files could kill your server and easily the traffic could degraded multiple times than normal if the nginx cache manager start deleting files!

The server is perfectly normal but suddenly it just get loaded and all nginx processes are in D (“Disk sleep”) state.

What could it be? What is going on with your proxy server?

Probably the cache is full!

Unfortunately there is no way to check how much is filled the cache live – just an upgrade or restart of the nginx process will trigger nginx cache loader to check all the cache files and will write the cache size on exit in the error log – but be careful the cache loading is also IO intensive operation – stats all the cache files and they could be millions images).

If you are sure the cache manager is to blame for the IO of your server (probably using this method – Check whether nginx cache manager is deleting files at the moment), you can stop it almost immediately!

Just increase the nginx cache drastically – add zero to the maximum cache size

Of course, you should have enough free space till you resolve the problem – for example more servers or manual deletion on peak-off or tune your cache deletion or any other solution….
Search for something like

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=400g

And add zero to the max_size number like:

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g

The max size will increase from 400G to 4000G (4T)!
This will effectively stop the files deleting and the nginx cache manager will have slept for long time before invoking again to delete files. This could be life saving operation for your server at peak!

Here is a real graph from one of our servers – the cache manager started deleting files from the cache and the traffic dropped 99%!!!

SCREENSHOT 1) The nginx cache manager just started to delete files from the cache and this operation just killed our server completely.

You can see almost zero bandwidth! The problem was resolved when we reloaded nginx with a bigger cache max_size value. The nginx manager immediately went to sleep and no IO for deleting files. The load of the server returned to normal!

main menu
nginx cache manager start deleting files

SCREENSHOT 2) Hard drives were saturated and the disk maxed the IO time to 10 ms.

Despite the bigger READ and WRITE IOPS there was 95-99% less traffic.

main menu
Disk IO Time when cache manager is working

Then you can tune the values for deleting files from the cache – Tune nginx proxy cache – control the cache manager how to delete cached files.

Check whether nginx cache manager is deleting files at the moment

Here is a tip for the webmasters (or system admins) to discover whether the nginx using proxy_cache to cache files is deleting files at the moment! There situation where you may need to know if the loaded of a static media server is caused by the deletion of the cache manager or by the read or seek operations when serving the static files. The deletion is really slow and IO intensive operation, which could greatly impact the performance and traffic of the server.
Find the process nginx’s “cache manager process” and strace it:

[root@srv ~]# ps axuf|grep nginx
root     31582  0.0  0.0 2906768 25108 ?       Ss   Feb15   0:01 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx    16008  1.9  1.3 2941188 440224 ?      S    16:39   1:33  \_ nginx: worker process
nginx    16009  1.5  1.2 2941188 398836 ?      S    16:39   1:12  \_ nginx: worker process
nginx    16010  0.5  0.7 2941984 239064 ?      S    16:39   0:26  \_ nginx: worker process
nginx    16011  0.7  0.9 2941984 299356 ?      D    16:39   0:35  \_ nginx: worker process
nginx    16012  1.2  1.1 2941188 389540 ?      D    16:39   1:01  \_ nginx: worker process
nginx    16013  2.3  1.5 2941188 487324 ?      D    16:39   1:55  \_ nginx: worker process
nginx    16014  0.0  0.6 2906772 224004 ?      S    16:39   0:01  \_ nginx: cache manager process
[root@srv ~]# strace -f -p 16014
strace: Process 16014 attached
gettid()                                = 16014
write(31, "2019/02/25 18:00:31 [info] 16014"..., 89) = 89
epoll_wait(36, [], 512, 5406)           = 0
unlink("/mnt/cache/0/39/c8ccbbc06d16debb1c8d58ceb6f99390") = 0
unlink("/mnt/cache/0/78/118924d7bf70e20fa8f790c6f9e7c780") = 0
unlink("/mnt/cache/3/ce/fab074cc670e6a80114dcbc398a63ce3") = 0
unlink("/mnt/cache/5/48/0b4e162dd7be8244815721fb7d68e485") = 0
unlink("/mnt/cache/5/56/e5eb4b38c7c8d209d0aabaf79ac02565") = 0
unlink("/mnt/cache/e/c6/207b432fa77375e4eefcaf52db250c6e") = 0
unlink("/mnt/cache/4/6d/ac0db27a03dabc79d869068db1b516d4") = 0
unlink("/mnt/cache/9/e8/91625c6e60de8e5425c4135c7dfb2e89") = 0
unlink("/mnt/cache/b/3c/f3c53000cf0cb20d55d8c09df8a733cb") = 0
unlink("/mnt/cache/f/f7/6f06423cd411b45816969fe020903f7f") = 0
unlink("/mnt/cache/f/50/c9b8ab72821a6e9bcb9c8d4b790dc50f") = 0
unlink("/mnt/cache/6/1f/74b0f1fdf1ac30db6af7793dc15671f6") = 0
unlink("/mnt/cache/0/83/caf199c1b99d438f96caec71bf2ea830") = 0
unlink("/mnt/cache/4/3d/c90f8fbbba4aaf407e386641dc2203d4") = 0
unlink("/mnt/cache/4/ad/d23cf8598020141b2bcec46d2b5cbad4") = 0
unlink("/mnt/cache/d/47/05973bc310503f36c67b7c1c24c8247d") = 0
unlink("/mnt/cache/f/11/e4fcbde8533d89105ab41f22c55e211f") = 0
unlink("/mnt/cache/2/06/29066a58e4116d24266026b4ed1e3062") = 0
epoll_wait(32, [], 512, 50)             = 0
unlink("/mnt/cache/4/6b/9a104ebdf70d00137a88d4584b2bb6b4") = 0
unlink("/mnt/cache/e/95/6d176447f57f21769d86a8f0b2a8b95e") = 0
unlink("/mnt/cache/b/b2/2f6f51163c65ae1fc06a913d6de1ab2b") = 0
unlink("/mnt/cache/a/24/2b058045a23b69de7a4442c9e6fce24a") = 0
unlink("/mnt/cache/7/60/00833e0b236ca8472f5be8227d645607") = 0
unlink("/mnt/cache/a/08/bf00eea300eff97dc4fffa61daaca08a") = 0
unlink("/mnt/cache/2/48/a291d8aca2b6f4f9471686eabe9b2482") = 0
unlink("/mnt/cache/0/e3/2d631adbc3bfdf8e44a51fa5453eee30") = 0
unlink("/mnt/cache/1/3b/08eef7c86c5ece9b5279b304dd86e3b1") = 0
unlink("/mnt/cache/b/a4/03213e4a8a1e8fb17ae698e54e70fa4b") = 0
unlink("/mnt/cache/b/a3/77f1b11811a9cda0ae93c498769f7a3b") = 0
unlink("/mnt/cache/4/01/1d50fac60681ae3263c8875775d20014") = 0
unlink("/mnt/cache/c/94/e71b96cbc65b248bd8e4540cbd69294c") = 0
unlink("/mnt/cache/1/59/99ec58e865b97e217835dd84f5f48591") = 0
unlink("/mnt/cache/4/b8/6a64825ce555b8f2440f051a7f7bcb84") = 0
unlink("/mnt/cache/7/51/fe2acbb895427ed8e406ce7e79d61517") = 0
.....
.....

You can tune the file removing from the cache with manager_files, manager_threshold and manager_sleep arguments of the proxy_cache_path.
If you came here searching information on the topic probably you should check out these articles, too: how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process) and Tune nginx proxy cache – control the cache manager how to delete cached files