nginx and proxy_cache not growing cache despite max_size is bigger – shared memory zone to blame

One of our big Nginx cache servers has recently been upgraded to have 70T storage, which is pretty good storage for a proxy. And in a hurry to configure the big storage we changed only the “max_size” option of proxy_cache_path directive! After a week in production, the proxy reached 23T and it just stopped growing with no apparent reason! Space and Nginx max_size were OK 75T total space and 70T for the proxy cache, but no Nginx had not added more space after reaching 23T occupied space for two days, which was impossible because all files were kept for 5 years and 200G per day were generated. No errors in the logs and we even use “virtual host traffic status module” – Live status information like used space and more for nginx proxy cache, but still no clue why it did not grow above this threshold of 23T! And it began to remove cached objects!

proxy_cache_path /mnt/cache levels=1:2 keys_zone=STATIC:900m inactive=42600h max_size=70000g;

It appeared we exhausted the shared memory zone limit for the zone! And Nginx cache just stopped growing.

According to the Nginx manual “One megabyte (of shared memory zone), a zone can store about 8 thousand keys“. Apparently, after 23T of files, we have passed 7 200 000 keys and exhausted the limit we configured in the proxy_cache_path line!

The solution is really simple just increase the limit for the shared memory zone for the zone.

proxy_cache_path /mnt/cache levels=1:2 keys_zone=STATIC:4000m inactive=42600h max_size=70000g;

In the past, with small cache (15T) it was enough to have 900Mbytes for the cache’s shared memory. Now we set it to 4000 Mbytes to be able to store approximately 32 000 000 keys. We have 23T and 900M of shared memory for keys (for our setup, your setup may differ a lot!!!) and setting it to 4000M, which is more than 4 times bigger than before it will probably be enough for the rest free storage to be used at full extent.
Be careful this operation will trigger the “Nginx cache loader” to load the cache index and may produce IO during this operation!

Nginx shared memory zone size

Nginx workers use shared mappings – mmap, which is different from the SYSV and POSIX shared memory (so you cannot use ipcs tool to check for shared memory). You should check how many memory currently the process is using. Here is how you can get the size of the shared memory zone occupied by the Nginx processes and as you can see each Nginx worker is around 900M of column “RSS” (Resident Set Size):

[root@srv ~]# ps -o rss,pid,comm,user,cmd -C nginx
  RSS   PID COMMAND         USER     CMD
904888 3979 nginx           nginx    nginx: worker process
905116 3980 nginx           nginx    nginx: worker process
904828 3981 nginx           nginx    nginx: worker process
905176 3982 nginx           nginx    nginx: worker process
905196 3983 nginx           nginx    nginx: worker process
905008 3984 nginx           nginx    nginx: worker process
904908 3985 nginx           nginx    nginx: worker process
905372 3986 nginx           nginx    nginx: worker process
905088 3987 nginx           nginx    nginx: worker process
902688 3988 nginx           nginx    nginx: worker process
904932 3989 nginx           nginx    nginx: worker process
905032 3990 nginx           nginx    nginx: worker process
26452  3991 nginx           nginx    nginx: cache manager process
33928  8148 nginx           root     nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf

For a single Nginx process:

[root@srv ~]# cat /proc/3981/status |grep RssShmem
RssShmem:         894240 kB

You can check the occupied inodes of your file system with df to get approximately how many files you have:

[root@srv ~]# df -i
Filesystem         Inodes   IUsed      IFree IUse% Mounted on
devtmpfs         16452656     715   16451941   1% /dev
tmpfs            16455999       1   16455998   1% /dev/shm
tmpfs            16455999    1153   16454846   1% /run
tmpfs            16455999      17   16455982   1% /sys/fs/cgroup
/dev/md1          2076704   39687    2037017   2% /
/dev/md3       1214685184 6897020 1207788164   1% /mnt/cache
tmpfs            16455999       5   16455994   1% /run/user/0

inodes around 6 897 020 and not growing for days. This number is very close to the maximum keys, which 900M key shared memory zone may store!

Two days after changing the key shared memory zone limit to 4000Mbytes:

proxy_cache_path /mnt/cache levels=1:2 keys_zone=STATIC:4000m inactive=42600h max_size=70000g;

The Nginx workers passed 900Mbytes RSS (Resident Set Size) and it reached 1Gbyte. The occupied cached sized grew with 1T and continued to grow.

[root@srv ~]# ps -o rss,pid,comm,user,cmd -C nginx
  RSS   PID COMMAND         USER     CMD
52256  8148 nginx           root     nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
1005624 16899 nginx         nginx    nginx: worker process
1005948 16900 nginx         nginx    nginx: worker process
1005936 16901 nginx         nginx    nginx: worker process
1005912 16902 nginx         nginx    nginx: worker process
1005832 16903 nginx         nginx    nginx: worker process
1005836 16904 nginx         nginx    nginx: worker process
1005868 16905 nginx         nginx    nginx: worker process
1005932 16906 nginx         nginx    nginx: worker process
1005796 16907 nginx         nginx    nginx: worker process
1005980 16908 nginx         nginx    nginx: worker process
1005848 16909 nginx         nginx    nginx: worker process
1005888 16910 nginx         nginx    nginx: worker process
26328 16911 nginx           nginx    nginx: cache manager process

The occupied inodes also increased to 7 484 291, which means the cache added around 700 000 new files.

[root@srv ~]# df -i
Filesystem         Inodes   IUsed      IFree IUse% Mounted on
devtmpfs         16452656     715   16451941    1% /dev
tmpfs            16455999       1   16455998    1% /dev/shm
tmpfs            16455999    1153   16454846    1% /run
tmpfs            16455999      17   16455982    1% /sys/fs/cgroup
/dev/md1          2076704   39690    2037014    2% /
/dev/md3       1214685184 7484582 1207200602    1% /mnt/cache
tmpfs            16455999       5   16455994    1% /run/user/0

Tune nginx proxy cache – control the cache manager how to delete cached files

In most cases you’ll never want to modify the default settings for deleting cache items with proxy_cache_path directives. The problem is in a peak the file deleting could impact your server performance and even it could kill your server leaving it unresponsive for a period of time. You cannot instruct nginx with a schedule job for deletion cached items or ban the deletion when the server is busy or loaded. The manager just traces each zone for used cache capacity versus the maximum allowed size and if the used capacity is near or bigger than the maximum allowed size (max_size) the manager process triggers deletion with the default values – the nginx manager will try to delete at least 100 files (up to 200 milliseconds) and then it will sleep for 50 milliseconds then again it will try deleting 100 files. So your file system could receive at least 1000 files per second to delete!

This could lead your server to almost unresponsive state in the peaks.

And it could be perfectly OK in off-peaks, but there is no way how to tell nginx cache manager there is a plenty free space despite you reach the cache limit so at the moment it is not the best time to delete the cache!

You can tune three parameters per cache directory (manual here: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path):

  • manager_files – not more than this number of files to delete in one iteration. The default value is 100.
  • manager_threshold – limit the delete iteration time. The default value is 200 milliseconds and you must use nginx time syntax concatenated to the number you want, for example if you want 500 milliseconds you must use “500 ms”.
  • manager_sleep – how much time to sleep the manager before executing another delete iteration. The default value is 50 milliseconds and here you must use nginx time syntax concatenated to the number you want, for example if you want 500 milliseconds you must use “500 ms”.
        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g manager_files=2 manager_sleep=200ms manager_threshold=500ms;

The cache manager will delete not more than 2 files for up to 500 milliseconds and it will sleep 200 milliseconds before another delete iteration.

The best option for loaded servers

The best option for loaded servers with full cache is to balance the free space – delete small amount of files at once to be sure your server will not get loaded even the free space decreases at the peaks (so more files are cached than the nginx manager could delete – you are aware of this and the free space should be enough), but during the off peak (which normally is several times longer than the peak) the nginx manager could catch up with the deleting and it should free up some space (cached files are lesser than the deleted ones). Of course, you should tune this according to your situation.
The main idea is to delete in small amounts of files to not saturate your disks it could take longer to recover the free space, but it will not load your server in peaks. You should consider two things:

  1. Free space – enough free space and to be sure the free space is enough for the peaks, when the cache could grow above the threshold.
  2. Number of deletions per iteration – you should experiment with this. Fist you should be away how many files are added for a period of time, which includes one peak and one off-peak and then to balance the number in such a way that after the period the cache is not above the maximum size. Probably the best is to start with a 24 hours period, which includes at least one peak.

As you can see the example above only 2 files are good enough for an iteration for our case. Taking into account the 200ms sleep between the files’ deletions 10 files at most should be deleted per second. In our case it is not enough for the peak, but for the off-peak, which is 20 hours every 24 hours, is good enough to get into the maximum size limit of the cache.

Here you can learn how to verify your nginx is deleting cache files and the impact of the default settings on a busy server in a peak: how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process) Our loaded server just stopped serving files and the bandwidth decreased with 99% because nginx cache manager suddenly started deleting cached files.

how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process)

In peaks deleting files could kill your server and easily the traffic could degraded multiple times than normal if the nginx cache manager start deleting files!

The server is perfectly normal but suddenly it just get loaded and all nginx processes are in D (“Disk sleep”) state.

What could it be? What is going on with your proxy server?

Probably the cache is full!

Unfortunately there is no way to check how much is filled the cache live – just an upgrade or restart of the nginx process will trigger nginx cache loader to check all the cache files and will write the cache size on exit in the error log – but be careful the cache loading is also IO intensive operation – stats all the cache files and they could be millions images).

If you are sure the cache manager is to blame for the IO of your server (probably using this method – Check whether nginx cache manager is deleting files at the moment), you can stop it almost immediately!

Just increase the nginx cache drastically – add zero to the maximum cache size

Of course, you should have enough free space till you resolve the problem – for example more servers or manual deletion on peak-off or tune your cache deletion or any other solution….
Search for something like

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=400g

And add zero to the max_size number like:

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g

The max size will increase from 400G to 4000G (4T)!
This will effectively stop the files deleting and the nginx cache manager will have slept for long time before invoking again to delete files. This could be life saving operation for your server at peak!

Here is a real graph from one of our servers – the cache manager started deleting files from the cache and the traffic dropped 99%!!!

SCREENSHOT 1) The nginx cache manager just started to delete files from the cache and this operation just killed our server completely.

You can see almost zero bandwidth! The problem was resolved when we reloaded nginx with a bigger cache max_size value. The nginx manager immediately went to sleep and no IO for deleting files. The load of the server returned to normal!

main menu
nginx cache manager start deleting files

SCREENSHOT 2) Hard drives were saturated and the disk maxed the IO time to 10 ms.

Despite the bigger READ and WRITE IOPS there was 95-99% less traffic.

main menu
Disk IO Time when cache manager is working

Then you can tune the values for deleting files from the cache – Tune nginx proxy cache – control the cache manager how to delete cached files.