how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process)

In peaks deleting files could kill your server and easily the traffic could degraded multiple times than normal if the nginx cache manager start deleting files!

The server is perfectly normal but suddenly it just get loaded and all nginx processes are in D (“Disk sleep”) state.

What could it be? What is going on with your proxy server?

Probably the cache is full!

Unfortunately there is no way to check how much is filled the cache live – just an upgrade or restart of the nginx process will trigger nginx cache loader to check all the cache files and will write the cache size on exit in the error log – but be careful the cache loading is also IO intensive operation – stats all the cache files and they could be millions images).

If you are sure the cache manager is to blame for the IO of your server (probably using this method – Check whether nginx cache manager is deleting files at the moment), you can stop it almost immediately!

Just increase the nginx cache drastically – add zero to the maximum cache size

Of course, you should have enough free space till you resolve the problem – for example more servers or manual deletion on peak-off or tune your cache deletion or any other solution….
Search for something like

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=400g

And add zero to the max_size number like:

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g

The max size will increase from 400G to 4000G (4T)!
This will effectively stop the files deleting and the nginx cache manager will have slept for long time before invoking again to delete files. This could be life saving operation for your server at peak!

Here is a real graph from one of our servers – the cache manager started deleting files from the cache and the traffic dropped 99%!!!

SCREENSHOT 1) The nginx cache manager just started to delete files from the cache and this operation just killed our server completely.

You can see almost zero bandwidth! The problem was resolved when we reloaded nginx with a bigger cache max_size value. The nginx manager immediately went to sleep and no IO for deleting files. The load of the server returned to normal!

main menu
nginx cache manager start deleting files

SCREENSHOT 2) Hard drives were saturated and the disk maxed the IO time to 10 ms.

Despite the bigger READ and WRITE IOPS there was 95-99% less traffic.

main menu
Disk IO Time when cache manager is working

Then you can tune the values for deleting files from the cache – Tune nginx proxy cache – control the cache manager how to delete cached files.

Check whether nginx cache manager is deleting files at the moment

Here is a tip for the webmasters (or system admins) to discover whether the nginx using proxy_cache to cache files is deleting files at the moment! There situation where you may need to know if the loaded of a static media server is caused by the deletion of the cache manager or by the read or seek operations when serving the static files. The deletion is really slow and IO intensive operation, which could greatly impact the performance and traffic of the server.
Find the process nginx’s “cache manager process” and strace it:

[root@srv ~]# ps axuf|grep nginx
root     31582  0.0  0.0 2906768 25108 ?       Ss   Feb15   0:01 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx    16008  1.9  1.3 2941188 440224 ?      S    16:39   1:33  \_ nginx: worker process
nginx    16009  1.5  1.2 2941188 398836 ?      S    16:39   1:12  \_ nginx: worker process
nginx    16010  0.5  0.7 2941984 239064 ?      S    16:39   0:26  \_ nginx: worker process
nginx    16011  0.7  0.9 2941984 299356 ?      D    16:39   0:35  \_ nginx: worker process
nginx    16012  1.2  1.1 2941188 389540 ?      D    16:39   1:01  \_ nginx: worker process
nginx    16013  2.3  1.5 2941188 487324 ?      D    16:39   1:55  \_ nginx: worker process
nginx    16014  0.0  0.6 2906772 224004 ?      S    16:39   0:01  \_ nginx: cache manager process
[root@srv ~]# strace -f -p 16014
strace: Process 16014 attached
gettid()                                = 16014
write(31, "2019/02/25 18:00:31 [info] 16014"..., 89) = 89
epoll_wait(36, [], 512, 5406)           = 0
unlink("/mnt/cache/0/39/c8ccbbc06d16debb1c8d58ceb6f99390") = 0
unlink("/mnt/cache/0/78/118924d7bf70e20fa8f790c6f9e7c780") = 0
unlink("/mnt/cache/3/ce/fab074cc670e6a80114dcbc398a63ce3") = 0
unlink("/mnt/cache/5/48/0b4e162dd7be8244815721fb7d68e485") = 0
unlink("/mnt/cache/5/56/e5eb4b38c7c8d209d0aabaf79ac02565") = 0
unlink("/mnt/cache/e/c6/207b432fa77375e4eefcaf52db250c6e") = 0
unlink("/mnt/cache/4/6d/ac0db27a03dabc79d869068db1b516d4") = 0
unlink("/mnt/cache/9/e8/91625c6e60de8e5425c4135c7dfb2e89") = 0
unlink("/mnt/cache/b/3c/f3c53000cf0cb20d55d8c09df8a733cb") = 0
unlink("/mnt/cache/f/f7/6f06423cd411b45816969fe020903f7f") = 0
unlink("/mnt/cache/f/50/c9b8ab72821a6e9bcb9c8d4b790dc50f") = 0
unlink("/mnt/cache/6/1f/74b0f1fdf1ac30db6af7793dc15671f6") = 0
unlink("/mnt/cache/0/83/caf199c1b99d438f96caec71bf2ea830") = 0
unlink("/mnt/cache/4/3d/c90f8fbbba4aaf407e386641dc2203d4") = 0
unlink("/mnt/cache/4/ad/d23cf8598020141b2bcec46d2b5cbad4") = 0
unlink("/mnt/cache/d/47/05973bc310503f36c67b7c1c24c8247d") = 0
unlink("/mnt/cache/f/11/e4fcbde8533d89105ab41f22c55e211f") = 0
unlink("/mnt/cache/2/06/29066a58e4116d24266026b4ed1e3062") = 0
epoll_wait(32, [], 512, 50)             = 0
unlink("/mnt/cache/4/6b/9a104ebdf70d00137a88d4584b2bb6b4") = 0
unlink("/mnt/cache/e/95/6d176447f57f21769d86a8f0b2a8b95e") = 0
unlink("/mnt/cache/b/b2/2f6f51163c65ae1fc06a913d6de1ab2b") = 0
unlink("/mnt/cache/a/24/2b058045a23b69de7a4442c9e6fce24a") = 0
unlink("/mnt/cache/7/60/00833e0b236ca8472f5be8227d645607") = 0
unlink("/mnt/cache/a/08/bf00eea300eff97dc4fffa61daaca08a") = 0
unlink("/mnt/cache/2/48/a291d8aca2b6f4f9471686eabe9b2482") = 0
unlink("/mnt/cache/0/e3/2d631adbc3bfdf8e44a51fa5453eee30") = 0
unlink("/mnt/cache/1/3b/08eef7c86c5ece9b5279b304dd86e3b1") = 0
unlink("/mnt/cache/b/a4/03213e4a8a1e8fb17ae698e54e70fa4b") = 0
unlink("/mnt/cache/b/a3/77f1b11811a9cda0ae93c498769f7a3b") = 0
unlink("/mnt/cache/4/01/1d50fac60681ae3263c8875775d20014") = 0
unlink("/mnt/cache/c/94/e71b96cbc65b248bd8e4540cbd69294c") = 0
unlink("/mnt/cache/1/59/99ec58e865b97e217835dd84f5f48591") = 0
unlink("/mnt/cache/4/b8/6a64825ce555b8f2440f051a7f7bcb84") = 0
unlink("/mnt/cache/7/51/fe2acbb895427ed8e406ce7e79d61517") = 0
.....
.....

You can tune the file removing from the cache with manager_files, manager_threshold and manager_sleep arguments of the proxy_cache_path.
If you came here searching information on the topic probably you should check out these articles, too: how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process) and Tune nginx proxy cache – control the cache manager how to delete cached files

Upgrade self-hosted atlassian bitbucket 4.x to 5.x in CentOS 7

Here we are going to show you a real example of how we upgraded out Atlassian Bitbucket server from 4.14.4 (around April 2017 installation) with the latest version of Atlassian Bitbucket 5.14.0. We use

  1. using self-hosted instance of Bitbucket 4.14.4
  2. Linux distro – CentOS 7.
  3. MySQL server for back-end. So there is a jdbc mysql driver (which should be installed after the upgrade).
  4. NGINX is used as proxy for our main HTTPS url. So we have changed our default configuration in server (in sever.xml).
  5. Bitbucket is loaded from a URL/bitbucket – “https://dev.example.com/bitbucket/”. So we have changed our default configuration (in sever.xml).

and you’ll see there are some pitfalls you can avoid if you follow our article. The latest git program in CentOS 7 is 1.8 and is not compatible with the new Atlassian Bitbucket 5.x, so we need to solve this problem before updating the server. Check out the official upgrade page here
Keep on reading!

Review of netdata graphs – nginx, php-fpm, mysql, memcached, redis, mail (postfix)

Here we show what to expect from the netdata graphics when using it in a web server. So we included here only the specific graphs for a web server:

  1. nginx – the web server
  2. php-fpm – the application, fastcgi php
  3. mysql – the database server
  4. memcached – memory cache
  5. redis – more sophisticated memory/disk cache
  6. mail – postfix mail server to send and receive mails

You can also visit our review of the generic graphs like system overview, cpu, memory and disks here: Review of netdata graphs – system overview, cpu, memory, disks and nfs

So here are the graphs netdata 1.10 offers to us:

CHART 1) Nginx Graphs

1) all active connections; 2) requests per second to nginx

main menu
Nginx Graphs

CHART 2) Nginx Graphs 2

1) nginx active connections by their status – reading (from client), writing (from client), idle (doing nothing, but opened to the client); 2) connections rate – accepted and handled

main menu
Nginx Graphs 2

CHART 3) PHP-FPM – FastCGI PHP performance metrics

1) active connections – active (executing PHP code on the CPU right now – “php running”), max active, idle; 2) requests; 3) performance – max children reached or slow requests (it depends on your version of netdata).

main menu
PHP-FPM – FastCGI PHP performance metrics

CHART 4) PHP-FPM – request information

1) reuqest duration – minimum, maximum, avarage – how much time do a request take time – very useful to see how fast is your backend application. 2) request CPU in procentages; 3) request memory – reuested memory by your php fpm processes.

main menu
PHP-FPM – request information

CHART 5) MySQL – performance metrics

1) bandwidth – The amount of data sent to mysql clients (out) and received from mysql clients (in); 2) queries – The number of statements executed by the server. To see a slow queries the slow query log should be enabled.

main menu
MySQL – performance metrics

CHART 6) MySQL – handlers and locks

1) handlers – netdata Quotation: “Usage of the internal handlers of mysql. This chart provides very good insights of what the mysql server is actually doing. – commit, the number of internal COMMIT statements; delete, the number of times that rows have been deleted from tables; prepare, a counter for the prepare phase of two-phase commit operations; read first, the number of times the first entry in an index was read. A high value suggests that the server is doing a lot of full index scans; e.g. SELECT col1 FROM foo, with col1 indexed; read key, the number of requests to read a row based on a key. If this value is high, it is a good indication that your tables are properly indexed for your queries; read next, the number of requests to read the next row in key order. This value is incremented if you are querying an index column with a range constraint or if you are doing an index scan; read prev, the number of requests to read the previous row in key order. This read method is mainly used to optimize ORDER BY … DESC; read rnd, the number of requests to read a row based on a fixed position. A high value indicates you are doing a lot of queries that require sorting of the result. You probably have a lot of queries that require MySQL to scan entire tables or you have joins that do not use keys properly; read rnd next, the number of requests to read the next row in the data file. This value is high if you are doing a lot of table scans. Generally this suggests that your tables are not properly indexed or that your queries are not written to take advantage of the indexes you have; rollback, the number of requests for a storage engine to perform a rollback operation; savepoint, the number of requests for a storage engine to place a savepoint; savepoint rollback, the number of requests for a storage engine to roll back to a savepoint; update, the number of requests to update a row in a table; write, the number of requests to insert a row in a table.” 2) MySQL table locks counters, netdata Quotation: ” immediate, the number of times that a request for a table lock could be granted immediately – waited, the number of times that a request for a table lock could not be granted immediately and a wait was needed. If this is high and you have performance problems, you should first optimize your queries, and then either split your table or tables or use replication.”

main menu
MySQL – handlers and locks

CHART 7) MySQL – sorts, selects and temporaries

1) mysql SELECT JOIN – full range, range, scan; 2) mysql sorts – range and scan; 3) temporaries – disk tables (writing to the disk is slow and should be avoided!!!) and tables.

main menu
MySQL – sorts, selects and temporaries

CHART 8) MySQL – connections and binlog

1) connections in seconds – all and aborted – if you are using persistent connections to MySQL you can see a busy MySQL server could have 2-3 new connections in a minute, because all the application backend uses the pool of already opened connections to the server. 2) connection errors – accepted, internal, max, peer_addr, select, tcpwrap; 3) binlog transactions per second

main menu
MySQL – connections and binlog

CHART 9) MySQL – binlog and threads

1) binlog statement cache; 2) MySQL threads – connected, cached, running; 3) threads cache misses

main menu
MySQL – binlog and threads

CHART 10) MySQL – Innodb engine infromation

1) Innodb I/O bandwidth – reads and writes; 2) Innodb I/O Operations – reads, writes and fsyncs; 3) Innodb Pending I/O Operations – reads and fsyncs; 4) Innodb Log Operations – write requests and writes.

main menu
MySQL – Innodb engine infromation

CHART 11) MySQL – Innodb engine infromation 2

1) Innodb OS Log Operations – fsyncs; 2) Innodb OS Log bandwidth – write (megabytes/s); 3) Innodb current row locks – current_waits; 4) Innodb row operations – inserted, read, updated and deleted.

main menu
MySQL – Innodb engine infromation 2

CHART 12) MySQL – Innodb engine infromation 3

1) Innodb buffer pool pages – data, dirty, free, flushed, misc, total; 2) Innodb buffer pool bytes – data and dirty; 3) Innodb buffer pool read ahaed – all, evicted, random; 4) Innodb buffer pool requests – reads and writes per second.

main menu
MySQL – Innodb engine infromation 3

CHART 13) MySQL – Innodb engine infromation 4

1) Innodb buffer pool operations – disk reads – operations per second.

main menu
MySQL – Innodb engine infromation 4

CHART 14) MySQL – query cache (qcache)

1) query cache operations – hits, low memory prunes, inserts, not cached; 2) queries in the cache; 3) query cache free memory; 4) query cache memory blocks – free and total.

main menu
MySQL – query cache (qcache)

CHART 15) MySQL – myisam engine information

This server does not uses MyISAM engine, so you can see almost everything is zero – 1) MyISAM key cache blocks – unused and used; 2) MyISAM key cache requests – reads and writes; 3) MyISAM key cache disk operation – reads and writes.

main menu
MySQL – myisam engine information

CHART 16) MySQL – files

1) open files – how many files are opened at the moment; 2) opened file rate – files per second.

main menu
MySQL – files

CHART 17) Memcached – distributed memory caching system. A key-value memory storage.

1) cache size – available and used; 2) network – in and out megabytes per second.

main menu
Memcached – distributed memory caching system. A key-value memory storage.

CHART 18) Memcached – connections and items

1) connections – current and total. Persistent connections are used, so no new connections often; 2) items cached – current and total. 3) items – evicted (forced removed – be careful here, this means your cached items are forcedly removed by the server because of lack of memory?) and reclaims (expired items).

main menu
Memcached – connections and items

CHART 19) Memcached – get and set operations

1) get operation requests – hits and misses; 2) get operations rate – requests per second; 3) set operation requests – requests per second.

main menu
Memcached – get and set operations

CHART 20) Memcached – check and set ops, delete ops, increment ops

1) check and set operation requests – hits, misses, bad value; 2) delete operation requests – hits and misses; increment operation requests – hits and misses

main menu
Memcached – check and set ops, delete ops, increment ops

CHART 21) Memcached – decrement ops, touch ops

1) decrement operation request – hits and misses; 2) touch operation requests – hits and misses; 3) touch operation requests rate – requests per second.

main menu
Memcached – decrement ops, touch ops

CHART 22) Postfix – mail service

1) Postfix Queue Emails – the emails in the queue of the mail transfer agent, these mails are in transfer state; 2) Postfix Queue Emails size – size.

main menu
Postfix – mail service

CHART 23) Redis – performance metrics for in-memory data structure store, used as a database, cache and message broker.

1) operations – commands and operations per second; 2) hit rate – persentage, the effectiveness of the cache.

main menu
Redis – performance metrics for in-memory data structure store, used as a database, cache and message broker.

CHART 24) Redis – memory, keys, network

1) Redis memory utilization – total and lua; 2) keys – how many keys does each database have – keys per database name; 3) network – Redis network bandwidth – in and out in megabytes per second.

main menu
Redis – memory, keys, network

CHART 25) Redis – connections and replication

1) Redis connections – received per second – it’s like new connections and if you use persistent connections no new connections are opened often; 2) Redis clients – connected processes to the redis server; 3) replication – connected slave servers.

main menu
Redis – connections and replication

CHART 26) Redis – persistence (save the databases to the disks)

1) Persistence changes since last save – changes – how many changed items have been there since last save of the databases to the disks. 2) Duration of the RDB Save operation – rdb save in time; 3) Status of the last RDB Save Operation – rdb status.

main menu
Redis – persistence (save the databases to the disks)

CHART 27) Web server access logs information

Live parsing of the access logs – be careful here, because this could take a good deal of CPU and I/O of your busy server. Here we included only the default nginx log, which does not save many records. netdata Quotation: “Information extracted from a server log file. web_log plugin incrementally parses the server log file to provide, in real-time, a break down of key server performance metrics. For web servers, an extended log file format may optionally be used (for nginx and apache) offering timing information and bandwidth for both requests and responses. web_log plugin may also be configured to provide a break down of requests per URL pattern (check /etc/netdata/python.d/web_log.conf).” – 1) responses – success and bad requests per second; 2) Response codes – 1xx and 4xx and more if any in the logs.

main menu
Web server access logs information

CHART 28) Web server access logs information – detailed response code, bandwidth, http methods

1) detailed response code – requests per second; 2) bandwidth of the requests and reponses; 3) Requests per HTTP Method – GET, POST, PUT, DELETE and so on if they present in the logs.

main menu
Web server access logs information – detailed response code, bandwidth, http methods

CHART 29) Web server access logs information – http versions, ip protocols, clients

1) Requests per HTTP Version – 1.0, 1.1 and 2.0 if any in the logs; 2) Requests per IP protocol – IPv4 and IPv6 (if used); 3) clients – unique client IPs per data collection.

main menu
Web server access logs information – http versions, ip protocols, clients

CHART 30) Web server access logs information – unique client IPs

Unique client IPs since last restart of netdata

main menu
Web server access logs information – unique client IPs