As the web grows and the technology advances the page size of the web sites also grows or just some times you might want to output a big chunk of data from your application server – PHP-FPM (but it could be any of another ruby, python, C, Django and more), for example.
Here is a fast configuration tip (note this is not the proxy-related warning!):
The default nginx buffers per CGI connection are too small
Here is what to do in your nginx configuration file: First, look for a line “include /etc/nginx/fastcgi_params;” or similar and add or edit if they exist after this line:
In most cases you’ll never want to modify the default settings for deleting cache items with proxy_cache_path directives. The problem is in a peak the file deleting could impact your server performance and even it could kill your server leaving it unresponsive for a period of time. You cannot instruct nginx with a schedule job for deletion cached items or ban the deletion when the server is busy or loaded. The manager just traces each zone for used cache capacity versus the maximum allowed size and if the used capacity is near or bigger than the maximum allowed size (max_size) the manager process triggers deletion with the default values – the nginx manager will try to delete at least 100 files (up to 200 milliseconds) and then it will sleep for 50 milliseconds then again it will try deleting 100 files. So your file system could receive at least 1000 files per second to delete!
This could lead your server to almost unresponsive state in the peaks.
And it could be perfectly OK in off-peaks, but there is no way how to tell nginx cache manager there is a plenty free space despite you reach the cache limit so at the moment it is not the best time to delete the cache!
manager_files – not more than this number of files to delete in one iteration. The default value is 100.
manager_threshold – limit the delete iteration time. The default value is 200 milliseconds and you must use nginx time syntax concatenated to the number you want, for example if you want 500 milliseconds you must use “500 ms”.
manager_sleep – how much time to sleep the manager before executing another delete iteration. The default value is 50 milliseconds and here you must use nginx time syntax concatenated to the number you want, for example if you want 500 milliseconds you must use “500 ms”.
The cache manager will delete not more than 2 files for up to 500 milliseconds and it will sleep 200 milliseconds before another delete iteration.
The best option for loaded servers
The best option for loaded servers with full cache is to balance the free space – delete small amount of files at once to be sure your server will not get loaded even the free space decreases at the peaks (so more files are cached than the nginx manager could delete – you are aware of this and the free space should be enough), but during the off peak (which normally is several times longer than the peak) the nginx manager could catch up with the deleting and it should free up some space (cached files are lesser than the deleted ones). Of course, you should tune this according to your situation.
The main idea is to delete in small amounts of files to not saturate your disks it could take longer to recover the free space, but it will not load your server in peaks. You should consider two things:
Free space – enough free space and to be sure the free space is enough for the peaks, when the cache could grow above the threshold.
Number of deletions per iteration – you should experiment with this. Fist you should be away how many files are added for a period of time, which includes one peak and one off-peak and then to balance the number in such a way that after the period the cache is not above the maximum size. Probably the best is to start with a 24 hours period, which includes at least one peak.
As you can see the example above only 2 files are good enough for an iteration for our case. Taking into account the 200ms sleep between the files’ deletions 10 files at most should be deleted per second. In our case it is not enough for the peak, but for the off-peak, which is 20 hours every 24 hours, is good enough to get into the maximum size limit of the cache.
In peaks deleting files could kill your server and easily the traffic could degraded multiple times than normal if the nginx cache manager start deleting files!
The server is perfectly normal but suddenly it just get loaded and all nginx processes are in D (“Disk sleep”) state.
What could it be? What is going on with your proxy server?
Probably the cache is full!
Unfortunately there is no way to check how much is filled the cache live – just an upgrade or restart of the nginx process will trigger nginx cache loader to check all the cache files and will write the cache size on exit in the error log – but be careful the cache loading is also IO intensive operation – stats all the cache files and they could be millions images).
Just increase the nginx cache drastically – add zero to the maximum cache size
Of course, you should have enough free space till you resolve the problem – for example more servers or manual deletion on peak-off or tune your cache deletion or any other solution….
Search for something like
The max size will increase from 400G to 4000G (4T)!
This will effectively stop the files deleting and the nginx cache manager will have slept for long time before invoking again to delete files. This could be life saving operation for your server at peak!
Here is a real graph from one of our servers – the cache manager started deleting files from the cache and the traffic dropped 99%!!!
SCREENSHOT 1) The nginx cache manager just started to delete files from the cache and this operation just killed our server completely.
You can see almost zero bandwidth! The problem was resolved when we reloaded nginx with a bigger cache max_size value. The nginx manager immediately went to sleep and no IO for deleting files. The load of the server returned to normal!
SCREENSHOT 2) Hard drives were saturated and the disk maxed the IO time to 10 ms.
Despite the bigger READ and WRITE IOPS there was 95-99% less traffic.
Here is a tip for the webmasters (or system admins) to discover whether the nginx using proxy_cache to cache files is deleting files at the moment! There situation where you may need to know if the loaded of a static media server is caused by the deletion of the cache manager or by the read or seek operations when serving the static files. The deletion is really slow and IO intensive operation, which could greatly impact the performance and traffic of the server.
Find the process nginx’s “cache manager process” and strace it:
Here we are going to show you a real example of how we upgraded out Atlassian Bitbucket server from 4.14.4 (around April 2017 installation) with the latest version of Atlassian Bitbucket 5.14.0. We use
using self-hosted instance of Bitbucket 4.14.4
Linux distro – CentOS 7.
MySQL server for back-end. So there is a jdbc mysql driver (which should be installed after the upgrade).
NGINX is used as proxy for our main HTTPS url. So we have changed our default configuration in server (in sever.xml).
Bitbucket is loaded from a URL/bitbucket – “https://dev.example.com/bitbucket/”. So we have changed our default configuration (in sever.xml).
and you’ll see there are some pitfalls you can avoid if you follow our article. The latest git program in CentOS 7 is 1.8 and is not compatible with the new Atlassian Bitbucket 5.x, so we need to solve this problem before updating the server. Check out the official upgrade page here Keep on reading!
1) active connections – active (executing PHP code on the CPU right now – “php running”), max active, idle; 2) requests; 3) performance – max children reached or slow requests (it depends on your version of netdata).
CHART 4) PHP-FPM – request information
1) reuqest duration – minimum, maximum, avarage – how much time do a request take time – very useful to see how fast is your backend application. 2) request CPU in procentages; 3) request memory – reuested memory by your php fpm processes.
CHART 5) MySQL – performance metrics
1) bandwidth – The amount of data sent to mysql clients (out) and received from mysql clients (in); 2) queries – The number of statements executed by the server. To see a slow queries the slow query log should be enabled.
CHART 6) MySQL – handlers and locks
1) handlers – netdata Quotation: “Usage of the internal handlers of mysql. This chart provides very good insights of what the mysql server is actually doing. – commit, the number of internal COMMIT statements; delete, the number of times that rows have been deleted from tables; prepare, a counter for the prepare phase of two-phase commit operations; read first, the number of times the first entry in an index was read. A high value suggests that the server is doing a lot of full index scans; e.g. SELECT col1 FROM foo, with col1 indexed; read key, the number of requests to read a row based on a key. If this value is high, it is a good indication that your tables are properly indexed for your queries; read next, the number of requests to read the next row in key order. This value is incremented if you are querying an index column with a range constraint or if you are doing an index scan; read prev, the number of requests to read the previous row in key order. This read method is mainly used to optimize ORDER BY … DESC; read rnd, the number of requests to read a row based on a fixed position. A high value indicates you are doing a lot of queries that require sorting of the result. You probably have a lot of queries that require MySQL to scan entire tables or you have joins that do not use keys properly; read rnd next, the number of requests to read the next row in the data file. This value is high if you are doing a lot of table scans. Generally this suggests that your tables are not properly indexed or that your queries are not written to take advantage of the indexes you have; rollback, the number of requests for a storage engine to perform a rollback operation; savepoint, the number of requests for a storage engine to place a savepoint; savepoint rollback, the number of requests for a storage engine to roll back to a savepoint; update, the number of requests to update a row in a table; write, the number of requests to insert a row in a table.” 2) MySQL table locks counters, netdata Quotation: ” immediate, the number of times that a request for a table lock could be granted immediately – waited, the number of times that a request for a table lock could not be granted immediately and a wait was needed. If this is high and you have performance problems, you should first optimize your queries, and then either split your table or tables or use replication.”
CHART 7) MySQL – sorts, selects and temporaries
1) mysql SELECT JOIN – full range, range, scan; 2) mysql sorts – range and scan; 3) temporaries – disk tables (writing to the disk is slow and should be avoided!!!) and tables.
CHART 8) MySQL – connections and binlog
1) connections in seconds – all and aborted – if you are using persistent connections to MySQL you can see a busy MySQL server could have 2-3 new connections in a minute, because all the application backend uses the pool of already opened connections to the server. 2) connection errors – accepted, internal, max, peer_addr, select, tcpwrap; 3) binlog transactions per second
1) Innodb I/O bandwidth – reads and writes; 2) Innodb I/O Operations – reads, writes and fsyncs; 3) Innodb Pending I/O Operations – reads and fsyncs; 4) Innodb Log Operations – write requests and writes.
CHART 11) MySQL – Innodb engine infromation 2
1) Innodb OS Log Operations – fsyncs; 2) Innodb OS Log bandwidth – write (megabytes/s); 3) Innodb current row locks – current_waits; 4) Innodb row operations – inserted, read, updated and deleted.
CHART 12) MySQL – Innodb engine infromation 3
1) Innodb buffer pool pages – data, dirty, free, flushed, misc, total; 2) Innodb buffer pool bytes – data and dirty; 3) Innodb buffer pool read ahaed – all, evicted, random; 4) Innodb buffer pool requests – reads and writes per second.
CHART 13) MySQL – Innodb engine infromation 4
1) Innodb buffer pool operations – disk reads – operations per second.
CHART 14) MySQL – query cache (qcache)
1) query cache operations – hits, low memory prunes, inserts, not cached; 2) queries in the cache; 3) query cache free memory; 4) query cache memory blocks – free and total.
CHART 15) MySQL – myisam engine information
This server does not uses MyISAM engine, so you can see almost everything is zero – 1) MyISAM key cache blocks – unused and used; 2) MyISAM key cache requests – reads and writes; 3) MyISAM key cache disk operation – reads and writes.
CHART 16) MySQL – files
1) open files – how many files are opened at the moment; 2) opened file rate – files per second.
1) cache size – available and used; 2) network – in and out megabytes per second.
CHART 18) Memcached – connections and items
1) connections – current and total. Persistent connections are used, so no new connections often; 2) items cached – current and total. 3) items – evicted (forced removed – be careful here, this means your cached items are forcedly removed by the server because of lack of memory?) and reclaims (expired items).
CHART 19) Memcached – get and set operations
1) get operation requests – hits and misses; 2) get operations rate – requests per second; 3) set operation requests – requests per second.
CHART 20) Memcached – check and set ops, delete ops, increment ops
1) check and set operation requests – hits, misses, bad value; 2) delete operation requests – hits and misses; increment operation requests – hits and misses
CHART 21) Memcached – decrement ops, touch ops
1) decrement operation request – hits and misses; 2) touch operation requests – hits and misses; 3) touch operation requests rate – requests per second.
CHART 22) Postfix – mail service
1) Postfix Queue Emails – the emails in the queue of the mail transfer agent, these mails are in transfer state; 2) Postfix Queue Emails size – size.
CHART 23) Redis – performance metrics for in-memory data structure store, used as a database, cache and message broker.
1) operations – commands and operations per second; 2) hit rate – persentage, the effectiveness of the cache.
CHART 24) Redis – memory, keys, network
1) Redis memory utilization – total and lua; 2) keys – how many keys does each database have – keys per database name; 3) network – Redis network bandwidth – in and out in megabytes per second.
CHART 25) Redis – connections and replication
1) Redis connections – received per second – it’s like new connections and if you use persistent connections no new connections are opened often; 2) Redis clients – connected processes to the redis server; 3) replication – connected slave servers.
CHART 26) Redis – persistence (save the databases to the disks)
1) Persistence changes since last save – changes – how many changed items have been there since last save of the databases to the disks. 2) Duration of the RDB Save operation – rdb save in time; 3) Status of the last RDB Save Operation – rdb status.
CHART 27) Web server access logs information
Live parsing of the access logs – be careful here, because this could take a good deal of CPU and I/O of your busy server. Here we included only the default nginx log, which does not save many records. netdata Quotation: “Information extracted from a server log file. web_log plugin incrementally parses the server log file to provide, in real-time, a break down of key server performance metrics. For web servers, an extended log file format may optionally be used (for nginx and apache) offering timing information and bandwidth for both requests and responses. web_log plugin may also be configured to provide a break down of requests per URL pattern (check /etc/netdata/python.d/web_log.conf).” – 1) responses – success and bad requests per second; 2) Response codes – 1xx and 4xx and more if any in the logs.
CHART 28) Web server access logs information – detailed response code, bandwidth, http methods
1) detailed response code – requests per second; 2) bandwidth of the requests and reponses; 3) Requests per HTTP Method – GET, POST, PUT, DELETE and so on if they present in the logs.
CHART 29) Web server access logs information – http versions, ip protocols, clients
1) Requests per HTTP Version – 1.0, 1.1 and 2.0 if any in the logs; 2) Requests per IP protocol – IPv4 and IPv6 (if used); 3) clients – unique client IPs per data collection.
CHART 30) Web server access logs information – unique client IPs