This article shows how simple it is to use an SSD as a cache device for a hard disk drive. We also included statistics and graphs from several days of usage on one of our streaming servers.
Our setup:
1 Samsung 480G SSD. It will be used as the writeback cache device!
1 hard disk drive, 1T
We included several graphs of this setup from one of our static media servers serving HLS video streaming.
The cache makes the setup at least 2-4 times more effective!
Here is how you can install a BOINC client and attach it to a project (SETI). We use only command-line tools; no GUI is involved here. With boinccmd you can attach to a project and do various administrative work, such as getting the progress of the currently running project tasks, viewing project info, managing tasks and more.
Steps to install and run a (SETI) project:
STEP 1) Install BOINC client
Installing the BOINC client requires the EPEL repository. Become root, then install the EPEL repository and the BOINC client.
Here we attach the server to our SETI project. There are two ways to attach to a project:
Using the project URL and an account key (strong or weak, it works with both). You can get your account keys from the project site.
Using the project URL and account info: username and password
To attach to a project, your BOINC client must be up and running, and you use boinccmd, the command-line interface to the BOINC client:
cd /var/lib/boinc
boinccmd --project_attach "http://setiathome.berkeley.edu/" "111111_22222233333344444444444555555555"
The first argument of “--project_attach” is the URL of the project site, in this case SETI with “http://setiathome.berkeley.edu/”, and the second is the account key of our account, 111111_22222233333344444444444555555555 (this is not the real one!). After a successful attach, your client will start to download the project files and begin to work on units:
[myuser@compute1 ~]# boinccmd --get_tasks
======== Tasks ========
1) -----------
name: blc36_2bit_guppi_58406_31023_HIP20352_0115.19579.818.22.45.146.vlar_0
WU name: blc36_2bit_guppi_58406_31023_HIP20352_0115.19579.818.22.45.146.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:38:07 2019
report deadline: Sat May 25 18:37:49 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4843.738587
CPU time at last checkpoint: 1326.038000
current CPU time: 1366.303000
fraction done: 0.170349
swap size: 54 MB
working set size: 52 MB
2) -----------
name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.1.vlar_1
WU name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.1.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: uninitialized
active_task_state: UNINITIALIZED
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 5838.283108
3) -----------
....
You can see all the tasks: the ones running at the moment and those in the queue.
When “fraction done” reaches 1.0, the task is ready to report.
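To follow the progress without reading the whole listing, the output of “boinccmd --get_tasks” can be post-processed. The following is only a minimal Python sketch, assuming the “key: value” layout shown above; the function names parse_tasks and progress_report are our own and are not part of BOINC, and the sample text is a shortened copy of the listing:

```python
def parse_tasks(text):
    """Split `boinccmd --get_tasks` output into one dict per task."""
    tasks = []
    for line in text.splitlines():
        line = line.strip()
        # A line such as "1) -----------" starts a new task record.
        if line.endswith("-----------") and ")" in line:
            tasks.append({})
        elif ": " in line and tasks:
            key, value = line.split(": ", 1)
            tasks[-1][key] = value
    return tasks


def progress_report(tasks):
    """Return (name, percent done) for every task that reports progress."""
    return [
        (t["name"], round(float(t["fraction done"]) * 100, 1))
        for t in tasks
        if "fraction done" in t
    ]


# Shortened sample of the listing above (assumed layout).
sample = """\
======== Tasks ========
1) -----------
name: blc36_2bit_guppi_58406_31023_HIP20352_0115.19579.818.22.45.146.vlar_0
state: downloaded
active_task_state: EXECUTING
fraction done: 0.170349
2) -----------
name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.1.vlar_1
state: downloaded
active_task_state: UNINITIALIZED
"""

for name, percent in progress_report(parse_tasks(sample)):
    print(f"{name}: {percent}% done")
```

On a real server you would feed the function the text captured from the command instead of the sample. Tasks that have not started yet have no “fraction done” line and are skipped.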
To view all running work units and some useful links on the project site, such as forums, account, preferences, recent tasks, the list of computers on which you are running SETI@Home, team information and more, you can use “--get_simple_gui_info” (some of the data here has been changed):
[myuser@compute1 ~]# boinccmd --get_simple_gui_info
======== Projects ========
1) -----------
name: SETI@home
master URL: http://setiathome.berkeley.edu/
user_name: neoX
team_name: neoX Group
resource share: 150.000000
user_total_credit: 14376490.970263
user_expavg_credit: 25085.116699
host_total_credit: 0.000000
host_expavg_credit: 0.000000
nrpc_failures: 2
master_fetch_failures: 0
master fetch pending: no
scheduler RPC pending: no
trickle upload pending: no
attached via Account Manager: no
ended: no
suspended via GUI: no
don't request more work: no
disk usage: 0.000000
last RPC: Tue Apr 2 17:19:56 2019
project files downloaded: 1554212310.422403
GUI URL:
name: Message boards
description: Correspond with other users on the SETI@home message boards
URL: http://setiathome.berkeley.edu/forum_index.php
GUI URL:
name: Help
description: Ask questions and report problems
URL: http://setiathome.berkeley.edu/forum_help_desk.php
GUI URL:
name: Account
description: View your account information
URL: http://setiathome.berkeley.edu/home.php
GUI URL:
name: Preferences
description: View and modify your computing preferences
URL: http://setiathome.berkeley.edu/prefs.php?subset=global
GUI URL:
name: Tasks
description: View your recent tasks
URL: http://setiathome.berkeley.edu/results.php?userid=111111
GUI URL:
name: Computers
description: View a list of the computers on which you are running SETI@Home
URL: http://setiathome.berkeley.edu/hosts_user.php?userid=111111
GUI URL:
name: Team
description: View information about your team: neoX Group
URL: http://setiathome.berkeley.edu/team_display.php?teamid=22222
GUI URL:
name: Donate
description: Donate to SETI@home
URL: http://setiathome.berkeley.edu/sah_donate.php
jobs succeeded: 16
jobs failed: 0
elapsed time: 107328.542101
cross-project ID: 33333333333333333333333333333333
======== Tasks ========
1) -----------
name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.184.vlar_1
WU name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.184.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4658.869936
CPU time at last checkpoint: 1902.641000
current CPU time: 1949.207000
fraction done: 0.202014
swap size: 58 MB
working set size: 56 MB
2) -----------
name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.250.vlar_1
WU name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.250.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4586.517845
CPU time at last checkpoint: 1903.475000
current CPU time: 1909.324000
fraction done: 0.214406
swap size: 58 MB
working set size: 56 MB
3) -----------
name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.149.vlar_0
WU name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.149.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4643.434099
CPU time at last checkpoint: 1902.425000
current CPU time: 1904.189000
fraction done: 0.204658
swap size: 58 MB
working set size: 56 MB
4) -----------
name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.21.vlar_1
WU name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.21.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4636.707813
CPU time at last checkpoint: 1845.896000
current CPU time: 1862.998000
fraction done: 0.205810
swap size: 54 MB
working set size: 52 MB
5) -----------
name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.59.vlar_1
WU name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.59.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4665.477705
CPU time at last checkpoint: 1781.304000
current CPU time: 1805.969000
fraction done: 0.200882
swap size: 58 MB
working set size: 56 MB
6) -----------
name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.210.vlar_1
WU name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.210.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4675.696451
CPU time at last checkpoint: 1780.113000
current CPU time: 1784.886000
fraction done: 0.199132
swap size: 58 MB
working set size: 56 MB
7) -----------
name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.101.vlar_1
WU name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.101.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4705.804477
CPU time at last checkpoint: 1711.990000
current CPU time: 1768.862000
fraction done: 0.193975
swap size: 54 MB
working set size: 52 MB
8) -----------
name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.254.vlar_1
WU name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.254.vlar
project URL: http://setiathome.berkeley.edu/
received: Tue Apr 2 13:43:18 2019
report deadline: Sat May 25 18:42:59 2019
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 800
resources: 1 CPU
estimated CPU time remaining: 4928.870679
CPU time at last checkpoint: 1422.330000
current CPU time: 1466.627000
fraction done: 0.155767
swap size: 54 MB
working set size: 52 MB
boinccmd – the management tool for the command line
Here are the options you can use in version 7.14.2:
[myuser@compute1 ~]# boinccmd --help
usage: boinccmd [--host hostname] [--passwd passwd] [--unix_domain] command
default hostname: localhost
default password: contents of gui_rpc_auth.cfg
Commands:
--acct_mgr attach URL name passwd attach to account manager
--acct_mgr info show current account manager info
--acct_mgr sync synchronize with acct mgr
--acct_mgr detach detach from acct mgr
--client_version show client version
--create_account URL email passwd name
--file_transfer URL filename op file transfer operation
op = retry | abort
--get_app_config URL show app config for given project
--get_cc_status
--get_daily_xfer_history show network traffic history
--get_disk_usage show disk usage
--get_file_transfers show file transfers
--get_host_info
--get_message_count show largest message seqno
--get_messages [ seqno ] show messages > seqno
--get_notices [ seqno ] show notices > seqno
--get_project_config URL
--get_project_status show status of all attached projects
--get_proxy_settings
--get_simple_gui_info show status of projects and active tasks
--get_state show entire state
--get_tasks show tasks
--get_old_tasks show reported tasks from last 1 hour
--join_acct_mgr URL name passwd same as --acct_mgr attach
--lookup_account URL email passwd
--network_available retry deferred network communication
--project URL op project operation
op = reset | detach | update | suspend | resume | nomorework | allowmorework | detach_when_done | dont_detach_when_done
--project_attach URL auth attach to project
--quit tell client to exit
--quit_acct_mgr same as --acct_mgr detach
--read_cc_config
--read_global_prefs_override
--run_benchmarks
--set_gpu_mode mode duration set GPU run mode for given duration
mode = always | auto | never
--set_host_info product_name
--set_network_mode mode duration set network mode for given duration
mode = always | auto | never
--set_proxy_settings
--set_run_mode mode duration set run mode for given duration
mode = always | auto | never
--task url task_name op task operation
op = suspend | resume | abort
More clients in EPEL repository
[myuser@compute1 ~]# yum search boinc
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirror.checkdomain.de
* epel: mirror.wiuwiu.de
* extras: mirror.checkdomain.de
* updates: mirror.checkdomain.de
============================================================================ N/S matched: boinc ============================================================================
boinc-client.x86_64 : The BOINC client
boinc-client-devel.x86_64 : Development files for boinc-client
boinc-client-doc.noarch : Documentation files for boinc-client
boinc-client-static.x86_64 : Static libraries for boinc-client
boinc-manager.x86_64 : GUI to control and monitor boinc-client
Name and summary matches only, use "search all" for everything.
You must change the directory to “/var/lib/boinc” and run the command again. You may also chown the directory to the boinc user. A directory for the project is created under “projects/”, and a configuration file named “account_boinc.bakerlab.org_rosetta.xml” is created.
root@srv ~ # cd /var/lib/boinc
root@srv /var/lib/boinc # boinccmd --project_attach "https://boinc.bakerlab.org/rosetta/" "1111_111111111111111111111111111"
root@srv /var/lib/boinc # chown -R boinc:boinc /var/lib/boinc
root@srv /var/lib/boinc # ls -altr projects/boinc.bakerlab.org_rosetta/
total 8
drwxrwx--x 4 boinc boinc 4096 Apr 4 03:01 ..
drwxrwx--x 2 boinc boinc 4096 Apr 4 03:01 .
root@srv /var/lib/boinc # cat account_boinc.bakerlab.org_rosetta.xml
<account>
<master_url>https://boinc.bakerlab.org/rosetta/</master_url>
<authenticator>1111_111111111111111111111111111</authenticator>
<project_name>Rosetta@home</project_name>
<project_preferences>
<resource_share>15</resource_share>
<project_specific>
<color_scheme>Tahiti Sunset</color_scheme>
</project_specific>
</project_preferences>
<gui_urls>
<gui_url>
<name>Message boards</name>
<description>Correspond with other users on the Rosetta@home message boards</description>
<url>http://boinc.bakerlab.org/rosetta/forum_index.php</url>
</gui_url>
<gui_url>
<name>Your account</name>
<description>View your account information</description>
<url>http://boinc.bakerlab.org/rosetta/home.php</url>
</gui_url>
<gui_url>
<name>Your tasks</name>
<description>View the last week or so of computational work</description>
<url>http://boinc.bakerlab.org/rosetta/results.php?userid=xxxx</url>
</gui_url>
</gui_urls>
</account>
If you want to use a software RAID device in your Gentoo Linux system (init boot, not systemd), with the boot partition residing on the RAID for better redundancy, but with metadata version 1.2, you won’t have the autodetection feature on boot, and there are some additional steps to consider!
The Linux kernel’s autodetection of software RAID works only with superblock (metadata) version 0.90!
If you create a software RAID with superblock version 1, 1.1 or 1.2 and you want to boot from it, you must use:
Grub 2; lilo won’t load the kernel from such partitions.
mdadm included in the initramfs
an initramfs booted with the additional parameter “domdadm” (in fact, it is not a “native” kernel option, read below)
The aim of this article is to show you the simplest way to make your Nagios monitoring system place a phone call on a critical event, in our example CRITICAL host DOWN (the host is unreachable)! You will not need to set up any server software (including a VoiceOverIP server); the only thing you will need is an account at https://www.twilio.com/, which could even be a free/trial account. The idea here is to be simple, cheap and easy to accomplish.
To summarize:
to make a phone call to a real phone number on a CRITICAL Nagios notification
root access to the Nagios server, because you will put a simple bash script there, which uses “curl” to open a URL
The idea here is to make a phone call on a really CRITICAL issue, to wake up the person on duty so that he checks the monitoring system and the SMS messages he received! So the phone call does not need to be answered at all: the ringing of the phone (or the continuous vibration of the smart band on your wrist!) should be enough as a second, different type of notification after the first (and in most cases the only) SMS message, which has the information about the problem!
SELinux can sometimes interfere with your setup. Let’s say you configured your rsync daemon, but you still get a permission error when running rsync to copy files!
rsync: opendir "/." (in backup2) failed: Permission denied (13)
Apparently, the rsync client connects to the server and finds a section named “backup2”, but it still has no permission, even though you explicitly set the uid and gid to root in the section (uid=0 and gid=0)!
The most common reason is that
SELinux denies the rsync process permission to open the directory exported by the path in your rsync configuration file.
By default, SELinux will deny the rsync daemon access to any of the files and directories in your system! In most cases, here is what can help:
setsebool -P rsync_export_all_ro=1
rsync_export_all_ro allows rsync to export any files and directories read-only, so requests like the one above will not be denied.
The capital-letter option “-P” makes the setting persistent across reboots.
Here you will see our log of upgrading the Supermicro IPMI firmware from the Linux console, using the CLI tool included in the firmware package for your IPMI unit.
If your server has a built-in IPMI unit on the motherboard, there will be firmware for it next to the BIOS firmware on the Supermicro site. Go to the page of your Supermicro motherboard, and on the left you will find both the BIOS and the IPMI firmware links. The IPMI firmware package has Windows, DOS and Linux executables to flash the firmware from the console.
Here we flash new firmware to our motherboard, an X10SLM+-F.
On the left you can see “Links & Resources”; click on “BMC/IPMI Firmware” to download the latest IPMI firmware for your motherboard.
Upload the downloaded file to your server.
STEP 1) Unpack the firmware file downloaded from the Supermicro site.
Here we include the verbose output of “tar” so you can see what files are included. The files we use here are highlighted.
There are 5 versions of the flash utility: Linux 32-bit and 64-bit, Windows 32-bit and 64-bit, and a DOS version.
STEP 2) Flash the BMC/IPMI firmware.
We chose not to preserve the configuration, because some old settings might be incompatible with the new firmware. This is not mandatory; in fact, we also tested preserving the old configuration and had no problems afterwards.
We change almost nothing in the IPMI configuration except the admin password and the network settings, and when you flash from within the OS you can reconfigure it after the flashing process: your server is up and running, and you can use “ipmitool” to configure the IPMI module.
The whole process took about 15 minutes.
[root@srv ~/REDFISH_X10_372]# 2.07/linux/x64/AlUpdate -f REDFISH_X10_372.bin -r n
sh: cls: command not found
*****************************************************************************
* ATEN Technology, Inc. *
*****************************************************************************
* FUNCTION : IPMI FIRMWARE UPDATE UTILITY *
* VERSION : 2.07 *
* BUILD DATE : Jul 13 2016 *
* USAGE : *
* (1)Update FIRMWARE : AlUpdate -f filename.bin [OPTION] *
* (2)Dump FIRMWARE : AlUpdate -d filename *
* (3)Restore CONFIG : AlUpdate -c -f filename.bin *
* (4)Backup CONFIG : AlUpdate -c -d filename.bin *
*****************************************************************************
* OPTION *
* -i the IPMI channel, currently, kcs and lan are supported *
* LAN channel specific arguments *
* -h remote BMC address and RMCP+ port, (default port is 623) *
* -u IPMI user name *
* -p IPMI password correlated to IPMI user name *
* -r Preserve Configuration (default is Preserve) *
* n:No Preserve, reset to factory default settings *
* y:Preserve, keep all of the settings *
* -c IPMI configuration backup/restore *
* -f [restore.bin] Restore configurations *
* -d [backup.bin] Backup configurations *
*****************************************************************************
* EXAMPLE *
* we like to upgrade firmware through KCS channel *
* AlUpdate -f fwuperade.bin -i kcs -r y *
* AlUpdate -d fwdump.bin -i kcs -r y *
* *
* we like to restore/backup IPMI config through KCS channel *
* AlUpdate -c -f restore.bin -i kcs -r y *
* AlUpdate -c -d backup.bin -i kcs -r y *
* *
* we like to upgrade firmware through LAN channel with *
* - BMC IP address 10.11.12.13 port 623 *
* - IPMI username is usr *
* - Password for alice is pwd *
* - Preserve Configuration *
* AlUpdate -f fw.bin -i lan -h 10.11.12.13 623 -u usr -p pwd -r y *
* AlUpdate -d fwdump.bin -i lan -h 10.11.12.13 623 -u usr -p pwd -r y *
* *
* we like to restore/backup IPMI config through LAN channel with *
* - BMC IP address 10.11.12.13 port 623 *
* - IPMI username is usr *
* - Password for alice is pwd *
* - Preserve Configuration *
* AlUpdate -c -f fw.bin -i lan -h 10.11.12.13 623 -u usr -p pwd *
* AlUpdate -c -d fwdump.bin -i lan -h 10.11.12.13 623 -u usr -p pwd *
*****************************************************************************
2.07/linux/x64/AlUpdate -f REDFISH_X10_372.bin -r n
Try open dev ipmi0....
Check if this file is valid................
If the FW update fails,PLEASE TRY AGAIN
Load part 0 126008 bytes, [Ok]
Load part 1 14635008 bytes, [Ok]
Load part 2 1537585 bytes, [Ok]
Load part 3 8081440 bytes, [Ok]
Load part 4 262144 bytes, [Ok]
If the FW update fails. PLEASE WAIT 5 MINS AND REMOVE THE AC...
new firmware is updating...100%
Update Complete,Please wait for BMC reboot, about 1 min
[root@srv ~/REDFISH_X10_372]#
All the lines starting with “Load part” will show progress percentages like:
Load part 1 14635008 bytes, 4137K bytes 29%
And the line starting with “new firmware is updating…” also shows a percentage, like:
new firmware is updating...28%
In dmesg you can see your IPMI module resets:
[root@conv1 ~]# dmesg
[1954154.242383] usb 3-7: USB disconnect, device number 2
[1954154.242385] usb 3-7.1: USB disconnect, device number 3
[1954185.337154] usb 3-7: new high-speed USB device number 4 using xhci_hcd
[1954185.501356] usb 3-7: New USB device found, idVendor=0557, idProduct=7000
[1954185.501358] usb 3-7: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1954185.501879] hub 3-7:1.0: USB hub found
[1954185.501923] hub 3-7:1.0: 4 ports detected
[1954185.899168] usb 3-7.1: new low-speed USB device number 5 using xhci_hcd
[1954185.999375] usb 3-7.1: New USB device found, idVendor=0557, idProduct=2419
[1954185.999376] usb 3-7.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1954186.000708] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.0/input/input10
[1954186.051346] hid-generic 0003:0557:2419.0003: input,hidraw0: USB HID v1.00 Keyboard [HID 0557:2419] on usb-0000:00:14.0-7.1/input0
[1954186.052050] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.1/input/input11
[1954186.052423] hid-generic 0003:0557:2419.0004: input,hidraw1: USB HID v1.00 Mouse [HID 0557:2419] on usb-0000:00:14.0-7.1/input1
[1954199.668503] usb 3-7.1: USB disconnect, device number 5
[1954201.450533] usb 3-7.1: new low-speed USB device number 6 using xhci_hcd
[1954201.550755] usb 3-7.1: New USB device found, idVendor=0557, idProduct=2419
[1954201.550756] usb 3-7.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1954201.552044] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.0/input/input12
[1954201.602658] hid-generic 0003:0557:2419.0005: input,hidraw0: USB HID v1.00 Keyboard [HID 0557:2419] on usb-0000:00:14.0-7.1/input0
[1954201.603372] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.1/input/input13
[1954201.603729] hid-generic 0003:0557:2419.0006: input,hidraw1: USB HID v1.00 Mouse [HID 0557:2419] on usb-0000:00:14.0-7.1/input1
In most cases you’ll never want to modify the default settings for deleting cache items in the proxy_cache_path directive. The problem is that during a peak the file deletion could impact your server performance, and it could even kill your server, leaving it unresponsive for a period of time. You cannot give nginx a schedule for deleting cached items, or ban deletion when the server is busy or loaded. The manager just tracks the used cache capacity of each zone against the maximum allowed size, and if the used capacity is near or above the maximum allowed size (max_size), the manager process triggers deletion with the default values: the nginx manager will try to delete up to 100 files (for up to 200 milliseconds), then it will sleep for 50 milliseconds, then it will try deleting 100 files again. So your file system could be asked to delete 1000 or more files per second (at most 100 files per 50-millisecond sleep, which is up to 2000 per second when the deletions themselves are fast)!
This could bring your server to an almost unresponsive state in the peaks.
The same deletion could be perfectly OK in off-peaks, but there is no way to tell the nginx cache manager that there is plenty of free space even though you reached the cache limit, so right now is not the best time to delete from the cache!
manager_files – not more than this number of files to delete in one iteration. The default value is 100.
manager_threshold – limits the duration of one delete iteration. The default value is 200 milliseconds, and you must append the nginx time unit to the number, for example if you want 500 milliseconds you must use “500ms”.
manager_sleep – how long the manager sleeps before executing another delete iteration. The default value is 50 milliseconds, and here too you must append the nginx time unit to the number, for example if you want 500 milliseconds you must use “500ms”.
The cache manager will delete no more than 2 files per iteration, taking up to 500 milliseconds, and it will sleep 200 milliseconds before the next delete iteration.
The best option for loaded servers
The best option for loaded servers with a full cache is to balance the free space: delete a small number of files at a time, to be sure your server will not get loaded, even if the free space decreases during the peaks (more files are cached than the nginx manager can delete; you are aware of this, and the free space should be enough). During the off-peak (which is normally several times longer than the peak) the nginx manager can catch up with the deleting and free up some space (fewer files are cached than deleted). Of course, you should tune this according to your situation.
The main idea is to delete files in small batches so as not to saturate your disks. It could take longer to recover the free space, but it will not load your server in peaks. You should consider two things:
Free space – have enough of it, and be sure it is enough for the peaks, when the cache could grow above the threshold.
Number of deletions per iteration – you should experiment with this. First you should be aware of how many files are added over a period of time that includes one peak and one off-peak, and then balance the number in such a way that after the period the cache is not above the maximum size. Probably the best is to start with a 24-hour period, which includes at least one peak.
As you can see in the example above, only 2 files per iteration are good enough in our case. Taking into account the 200ms sleep between iterations, at most 10 files are deleted per second. In our case this is not enough for the peak, but for the off-peak, which is 20 hours out of every 24, it is good enough to get back under the maximum size limit of the cache.
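The arithmetic behind these numbers can be checked with a short Python sketch. It is only an upper-bound estimate under a stated assumption: every iteration deletes a full batch, and the deletions themselves take no time, so only the sleep limits the rate. The function names are our own and are not part of nginx:

```python
def max_deletions_per_second(manager_files, manager_sleep_ms):
    """Upper bound on deletions per second: one full batch per sleep interval.

    Assumes the deletions themselves take no time, so only the sleep
    between iterations limits the rate.
    """
    return manager_files * 1000 / manager_sleep_ms


def max_deletions_in(hours, manager_files, manager_sleep_ms):
    """Upper bound on the number of files deleted over a period of hours."""
    return max_deletions_per_second(manager_files, manager_sleep_ms) * hours * 3600


default_rate = max_deletions_per_second(100, 50)  # nginx default values
tuned_rate = max_deletions_per_second(2, 200)     # 2 files, 200 ms sleep
off_peak_budget = max_deletions_in(20, 2, 200)    # 20 hours of off-peak

print(f"defaults: up to {default_rate:.0f} files/s")
print(f"tuned:    up to {tuned_rate:.0f} files/s")
print(f"tuned, 20 h off-peak: up to {off_peak_budget:.0f} files")
```

If the number of files your server adds to the cache in 24 hours is clearly below the off-peak budget, the tuned values should be able to bring the cache back under max_size. Real rates will be lower than these bounds, because deleting files takes time on a busy disk.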
In peaks, deleting files could kill your server, and the traffic could easily drop to a fraction of the normal level if the nginx cache manager starts deleting files!
The server is perfectly normal, but suddenly it just gets loaded and all nginx processes are in D (“Disk sleep”) state.
What could it be? What is going on with your proxy server?
Probably the cache is full!
Unfortunately there is no way to check live how full the cache is. Only an upgrade or restart of the nginx process will trigger the nginx cache loader to check all the cache files, and it will write the cache size to the error log on exit. But be careful: cache loading is also an IO-intensive operation (it stats all the cache files, and there could be millions of images).
Just increase the nginx cache drastically: add a zero to the maximum cache size
Of course, you should have enough free space until you resolve the problem, for example with more servers, manual deletion in the off-peak, tuning of your cache deletion or any other solution.
Search for something like
The max size will increase from 400G to 4000G (4T)!
This will effectively stop the file deletion, and the nginx cache manager will sleep for a long time before it is invoked again to delete files. This could be a life-saving operation for your server at peak!
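The “add a zero” edit can be illustrated with a small Python sketch. The proxy_cache_path line below is hypothetical (the path, the zone name and the other parameters are placeholders, not our real configuration); only the max_size value is changed, and the helper name multiply_max_size is our own:

```python
import re


def multiply_max_size(config_line, factor=10):
    """Multiply the numeric part of max_size=<number><unit> by factor."""
    def bump(match):
        return f"max_size={int(match.group(1)) * factor}{match.group(2)}"

    return re.sub(r"max_size=(\d+)([kKmMgG]?)", bump, config_line)


# Hypothetical directive; everything except max_size is a placeholder.
line = "proxy_cache_path /mnt/cache levels=1:2 keys_zone=media:512m max_size=400g inactive=30d;"
print(multiply_max_size(line))
```

After changing the value in the real configuration file, nginx has to be reloaded for the new limit to take effect.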
Here is a real graph from one of our servers – the cache manager started deleting files from the cache and the traffic dropped 99%!!!
SCREENSHOT 1) The nginx cache manager just started to delete files from the cache and this operation just killed our server completely.
You can see almost zero bandwidth! The problem was resolved when we reloaded nginx with a bigger cache max_size value. The nginx manager immediately went to sleep and there was no more IO for deleting files. The load of the server returned to normal!
SCREENSHOT 2) Hard drives were saturated and the disk maxed the IO time to 10 ms.
Despite the bigger READ and WRITE IOPS there was 95-99% less traffic.
Here is a tip for webmasters (or system admins) on how to discover whether an nginx that uses proxy_cache to cache files is deleting files at the moment! There are situations where you may need to know whether the load of a static media server is caused by the deletions of the cache manager or by the read and seek operations of serving the static files. The deletion is a really slow and IO-intensive operation, which could greatly impact the performance and traffic of the server.
Find the nginx “cache manager process” and strace it:
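One way to locate the process is to look for its process title. The following is a minimal Python sketch for Linux, assuming the title is exactly “nginx: cache manager process” as it appears in the process list; the function names are our own. It only prints a suggested strace command, and the choice of tracing the unlink calls is our assumption about how to spot deletions:

```python
import os

TITLE = "nginx: cache manager process"


def find_cache_manager(processes):
    """Return the PIDs whose command line starts with the cache manager title."""
    return [pid for pid, cmdline in processes if cmdline.startswith(TITLE)]


def read_proc(proc_dir="/proc"):
    """Yield (pid, cmdline) for every process visible under /proc (Linux only)."""
    if not os.path.isdir(proc_dir):
        return
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        yield int(entry), raw.replace(b"\0", b" ").decode(errors="replace").strip()


for pid in find_cache_manager(read_proc()):
    # Many unlink calls in the trace would mean the manager is deleting files.
    print(f"strace -f -e trace=unlink,unlinkat -p {pid}")
```

If nothing is printed, no cache manager process was found, for example because no proxy_cache_path is configured on that server.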
We have a CentOS 7 server with a simple setup of two hard drives in RAID1, with 4 md devices in total for boot, root, swap and storage. The storage device (/dev/md5) was removed and recreated as RAID0 for better performance, because the server was repurposed as a cache-only server. Then the server was restarted and it never came up.
On the IPMI KVM it just started loading the kernel and hung after several seconds without any additional information:
The kernel loads the mdadm devices and does not continue, because the device md5 is missing.
To boot successfully you must remove the missing device
On the Grub 2 menu press “e” and you’ll get this screen. Here you can edit all lines if you need to. You must remove the rd.md.uuid of the device you deleted, which in our case is the last one. Remove it and press Ctrl+x to load the kernel.
There are two options:
OPTION 1) Remove rd.md.uuid option of your old mdadm device
OPTION 2) Replace the ID in rd.md.uuid= with the new ID of the mdadm device.
Either of these two options can be used to solve the booting problem. Edit /etc/default/grub, replace or remove the rd.md.uuid, and generate the grub.cfg.
You can find the old mdadm ID in /etc/mdadm.conf (if you have not replaced it there).
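Both options are plain edits of the kernel command line. The following is a minimal Python sketch of them, using made-up UUIDs as placeholders; the helper names are our own, and the sketch only transforms a string, it does not touch /etc/default/grub:

```python
def drop_md_uuid(cmdline, old_uuid):
    """OPTION 1: remove the rd.md.uuid=<old_uuid> token from a kernel command line."""
    target = f"rd.md.uuid={old_uuid}"
    return " ".join(tok for tok in cmdline.split() if tok != target)


def replace_md_uuid(cmdline, old_uuid, new_uuid):
    """OPTION 2: point the rd.md.uuid=<old_uuid> token at the new array."""
    target = f"rd.md.uuid={old_uuid}"
    return " ".join(
        f"rd.md.uuid={new_uuid}" if tok == target else tok
        for tok in cmdline.split()
    )


# Placeholder UUIDs, not taken from a real server.
OLD = "11111111:22222222:33333333:44444444"
NEW = "55555555:66666666:77777777:88888888"
cmdline = (
    "rd.md.uuid=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd "
    f"rd.md.uuid={OLD} console=tty0 crashkernel=auto"
)

print(drop_md_uuid(cmdline, OLD))
print(replace_md_uuid(cmdline, OLD, NEW))
```

The resulting string is what should end up in GRUB_CMDLINE_LINUX before the grub.cfg is regenerated.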
Here we edit our /etc/default/grub, replace the old UUID and generate grub.cfg (legacy BIOS):
[root@srv ~]# cat /etc/default/grub
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200"
GRUB_CMDLINE_LINUX="rd.md.uuid=9c08f218:cd5c0f8f:d96bc0d1:57b77e99 rd.md.uuid=1f74a2e0:757bfb9f:9c860e50:325f37cb rd.md.uuid=29bf4aa8:b7dae21a:45f4c188:baea4c13 rd.md.uuid=901074eb:16ba7c5b:0af69934:e9444102 console=tty0 crashkernel=auto console=ttyS0,115200 net.ifnames=1"
[root@srv ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-957.5.1.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-957.5.1.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-05cb8c7b39fe0f70e3ce97e5beab809d
Found initrd image: /boot/initramfs-0-rescue-05cb8c7b39fe0f70e3ce97e5beab809d.img
done
[root@srv ~]# reboot
Use this for UEFI BIOS boot:
First check if /boot and /boot/efi are mounted and if not you must mount them with:
mount /boot
mount /boot/efi
Generate the grub.cfg
grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Bonus
In fact, when the original device was removed and a new one was added, we formatted it as usual. But it was not possible to mount it. You just execute
mount /dev/md5 /mnt/stor1
and get no error, but no mount can be found: the device was not mounted, and when you execute
umount /mnt/stor1
the OS reports that “/mnt/stor1” is not mounted. Several more unsuccessful attempts were made to mount “/dev/md5”, then the server was restarted and it never came up.
We suppose systemd just did not allow the device to be mounted because of the rd.md.uuid boot parameters!