SSD cache device to a hard disk drive using LVM

This article shows how simple it is to add an SSD cache device to a hard disk drive. We also include statistics and graphs from several days of usage on one of our streaming servers.
Our setup:

  • 1 Samsung 480G SSD. It will be used as a writeback cache device!
  • 1 Hard disk drive 1T

We included several graphs of this setup from one of our static media servers serving HLS video streaming.

The effectiveness of the cache is around 2-4 times at least!

Keep on reading!

Installing and running BOINC client (with SETI project) under CentOS 7

Here is how you can install a BOINC client and attach it to a project (SETI). We use only command line tools; no GUI is involved. With boinccmd you can attach to a project and do various administrative work, such as getting the progress of the currently running project tasks, showing project info, managing tasks and more.
Steps to install and run a (SETI) project:

STEP 1) Install BOINC client

Installing the BOINC client requires the EPEL repository. Become root, then install the EPEL repository and the BOINC client.

sudo su
yum update -y
yum install -y epel-release
yum install -y boinc-client
systemctl start boinc-client
systemctl enable boinc-client

STEP 2) Attach to a project

Here we attach the server to our SETI project. There are two ways to attach to a project:

  • Using the project URL and an account key (strong or weak; it works with both). You can get your account keys from the project site.
  • Using the project URL and account info (username and password).

To attach to a project, your BOINC client must be up and running, and you use boinccmd, the command line interface to the BOINC client:

cd /var/lib/boinc
boinccmd --project_attach "http://setiathome.berkeley.edu/" "111111_22222233333344444444444555555555"

The first argument of “--project_attach” is the URL of the project site, in this case SETI with “http://setiathome.berkeley.edu/”, and the second is the account key of our account, 111111_22222233333344444444444555555555 (this is not the real one!). After a successful attach, your client starts downloading the project files and begins working on units:

[myuser@compute1 ~]# boinccmd --get_tasks

======== Tasks ========
1) -----------
   name: blc36_2bit_guppi_58406_31023_HIP20352_0115.19579.818.22.45.146.vlar_0
   WU name: blc36_2bit_guppi_58406_31023_HIP20352_0115.19579.818.22.45.146.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:38:07 2019
   report deadline: Sat May 25 18:37:49 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4843.738587
   CPU time at last checkpoint: 1326.038000
   current CPU time: 1366.303000
   fraction done: 0.170349
   swap size: 54 MB
   working set size: 52 MB
2) -----------
   name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.1.vlar_1
   WU name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.1.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: uninitialized
   active_task_state: UNINITIALIZED
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 5838.283108
3) -----------
 ....

You can see all the tasks: the ones running at the moment and those in the queue.
When “fraction done” reaches 1.0, the task is ready to report.
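The “fraction done” field can be turned into a short progress report. The following is only a sketch: the function name task_progress is our own, and it assumes the output layout shown above, where a task that has not started prints no “fraction done” line.

```shell
#!/usr/bin/env bash
# Summarize "boinccmd --get_tasks" output read from stdin:
# one line per task with its name and "fraction done" as a percentage.
# Tasks without a "fraction done" line are reported as "not started".
task_progress() {
    awk '
        function emit() {
            if (name != "") {
                if (frac == "") print name ": not started"
                else printf "%s: %.1f%%\n", name, frac * 100
            }
        }
        # "   name: ..." starts a new task ("   WU name: ..." does not match)
        /^ +name: /          { emit(); name = $2; frac = "" }
        /^ +fraction done: / { frac = $3 }
        END                  { emit() }
    '
}

# Usage (needs a running BOINC client):
#   cd /var/lib/boinc && boinccmd --get_tasks | task_progress
```

The function only reads text, so it can be tried on saved output before it is used on a live client.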

To view all running work units together with some useful links on the project site (forums, account, preferences, recent tasks, the list of computers on which you are running SETI@Home, team information and more), you can use “--get_simple_gui_info” (some of the data here has been changed):

[myuser@compute1 ~]# boinccmd --get_simple_gui_info
======== Projects ========
1) -----------
   name: SETI@home
   master URL: http://setiathome.berkeley.edu/
   user_name: neoX
   team_name: neoX Group
   resource share: 150.000000
   user_total_credit: 14376490.970263
   user_expavg_credit: 25085.116699
   host_total_credit: 0.000000
   host_expavg_credit: 0.000000
   nrpc_failures: 2
   master_fetch_failures: 0
   master fetch pending: no
   scheduler RPC pending: no
   trickle upload pending: no
   attached via Account Manager: no
   ended: no
   suspended via GUI: no
   don't request more work: no
   disk usage: 0.000000
   last RPC: Tue Apr  2 17:19:56 2019

   project files downloaded: 1554212310.422403
GUI URL:
   name: Message boards
   description: Correspond with other users on the SETI@home message boards
   URL: http://setiathome.berkeley.edu/forum_index.php
GUI URL:
   name: Help
   description: Ask questions and report problems
   URL: http://setiathome.berkeley.edu/forum_help_desk.php
GUI URL:
   name: Account
   description: View your account information
   URL: http://setiathome.berkeley.edu/home.php
GUI URL:
   name: Preferences
   description: View and modify your computing preferences
   URL: http://setiathome.berkeley.edu/prefs.php?subset=global
GUI URL:
   name: Tasks
   description: View your recent tasks
   URL: http://setiathome.berkeley.edu/results.php?userid=111111
GUI URL:
   name: Computers
   description: View a list of the computers on which you are running SETI@Home
   URL: http://setiathome.berkeley.edu/hosts_user.php?userid=111111
GUI URL:
   name: Team
   description: View information about your team: neoX Group
   URL: http://setiathome.berkeley.edu/team_display.php?teamid=22222
GUI URL:
   name: Donate
   description: Donate to SETI@home
   URL: http://setiathome.berkeley.edu/sah_donate.php
   jobs succeeded: 16
   jobs failed: 0
   elapsed time: 107328.542101
   cross-project ID: 33333333333333333333333333333333

======== Tasks ========
1) -----------
   name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.184.vlar_1
   WU name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.184.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4658.869936
   CPU time at last checkpoint: 1902.641000
   current CPU time: 1949.207000
   fraction done: 0.202014
   swap size: 58 MB
   working set size: 56 MB
2) -----------
   name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.250.vlar_1
   WU name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.250.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4586.517845
   CPU time at last checkpoint: 1903.475000
   current CPU time: 1909.324000
   fraction done: 0.214406
   swap size: 58 MB
   working set size: 56 MB
3) -----------
   name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.149.vlar_0
   WU name: blc34_2bit_guppi_58406_27281_HIP20917_0104.20977.818.21.44.149.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4643.434099
   CPU time at last checkpoint: 1902.425000
   current CPU time: 1904.189000
   fraction done: 0.204658
   swap size: 58 MB
   working set size: 56 MB
4) -----------
   name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.21.vlar_1
   WU name: blc34_2bit_guppi_58406_26949_HIP20491_0103.21476.0.21.44.21.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4636.707813
   CPU time at last checkpoint: 1845.896000
   current CPU time: 1862.998000
   fraction done: 0.205810
   swap size: 54 MB
   working set size: 52 MB
5) -----------
   name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.59.vlar_1
   WU name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.59.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4665.477705
   CPU time at last checkpoint: 1781.304000
   current CPU time: 1805.969000
   fraction done: 0.200882
   swap size: 58 MB
   working set size: 56 MB
6) -----------
   name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.210.vlar_1
   WU name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.210.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4675.696451
   CPU time at last checkpoint: 1780.113000
   current CPU time: 1784.886000
   fraction done: 0.199132
   swap size: 58 MB
   working set size: 56 MB
7) -----------
   name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.101.vlar_1
   WU name: blc34_2bit_guppi_58406_28965_HIP20350_0109.21465.0.22.45.101.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4705.804477
   CPU time at last checkpoint: 1711.990000
   current CPU time: 1768.862000
   fraction done: 0.193975
   swap size: 54 MB
   working set size: 52 MB
8) -----------
   name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.254.vlar_1
   WU name: blc34_2bit_guppi_58406_28625_HIP21029_0108.20913.818.21.44.254.vlar
   project URL: http://setiathome.berkeley.edu/
   received: Tue Apr  2 13:43:18 2019
   report deadline: Sat May 25 18:42:59 2019
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 800
   resources: 1 CPU
   estimated CPU time remaining: 4928.870679
   CPU time at last checkpoint: 1422.330000
   current CPU time: 1466.627000
   fraction done: 0.155767
   swap size: 54 MB
   working set size: 52 MB

boinccmd – the management tool for the command line

Here are the options you can use in version 7.14.2:

[myuser@compute1 ~]# boinccmd --help

usage: boinccmd [--host hostname] [--passwd passwd] [--unix_domain] command

default hostname: localhost
default password: contents of gui_rpc_auth.cfg
Commands:
 --acct_mgr attach URL name passwd  attach to account manager
 --acct_mgr info                    show current account manager info
 --acct_mgr sync                    synchronize with acct mgr
 --acct_mgr detach                  detach from acct mgr
 --client_version                   show client version
 --create_account URL email passwd name
 --file_transfer URL filename op    file transfer operation
   op = retry | abort
 --get_app_config URL               show app config for given project
 --get_cc_status
 --get_daily_xfer_history           show network traffic history
 --get_disk_usage                   show disk usage
 --get_file_transfers               show file transfers
 --get_host_info
 --get_message_count                show largest message seqno
 --get_messages [ seqno ]           show messages > seqno
 --get_notices [ seqno ]            show notices > seqno
 --get_project_config URL
 --get_project_status               show status of all attached projects
 --get_proxy_settings
 --get_simple_gui_info              show status of projects and active tasks
 --get_state                        show entire state
 --get_tasks                        show tasks
 --get_old_tasks                    show reported tasks from last 1 hour
 --join_acct_mgr URL name passwd    same as --acct_mgr attach
 --lookup_account URL email passwd
 --network_available                retry deferred network communication
 --project URL op                   project operation
   op = reset | detach | update | suspend | resume | nomorework | allowmorework | detach_when_done | dont_detach_when_done
 --project_attach URL auth          attach to project
 --quit                             tell client to exit
 --quit_acct_mgr                    same as --acct_mgr detach
 --read_cc_config
 --read_global_prefs_override
 --run_benchmarks
 --set_gpu_mode mode duration       set GPU run mode for given duration
   mode = always | auto | never
 --set_host_info product_name
 --set_network_mode mode duration   set network mode for given duration
   mode = always | auto | never
 --set_proxy_settings
 --set_run_mode mode duration       set run mode for given duration
   mode = always | auto | never
 --task url task_name op            task operation
   op = suspend | resume | abort

More BOINC packages in the EPEL repository

[myuser@compute1 ~]# yum search boinc
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.checkdomain.de
 * epel: mirror.wiuwiu.de
 * extras: mirror.checkdomain.de
 * updates: mirror.checkdomain.de
============================================================================ N/S matched: boinc ============================================================================
boinc-client.x86_64 : The BOINC client
boinc-client-devel.x86_64 : Development files for boinc-client
boinc-client-doc.noarch : Documentation files for boinc-client
boinc-client-static.x86_64 : Static libraries for boinc-client
boinc-manager.x86_64 : GUI to control and monitor boinc-client

  Name and summary matches only, use "search all" for everything.

Troubleshooting

If you get the following error:

root@srv ~ # boinccmd --project_attach "http://boinc.bakerlab.org/rosetta/" "1111_111111111111111111111111111"
Operation failed: authentication error

This happens because boinccmd reads the GUI RPC password from the file gui_rpc_auth.cfg in the current directory. Change the directory to “/var/lib/boinc” and run the command again. You may also chown the directory to the boinc user. After a successful attach, a directory for the project is created under “projects/” and a configuration file named “account_boinc.bakerlab.org_rosetta.xml” is created.

root@srv ~ # cd /var/lib/boinc
root@srv /var/lib/boinc # boinccmd --project_attach "https://boinc.bakerlab.org/rosetta/" "1111_111111111111111111111111111"
root@srv /var/lib/boinc # chown -R boinc:boinc /var/lib/boinc
root@srv /var/lib/boinc # ls -altr projects/boinc.bakerlab.org_rosetta/
total 8
drwxrwx--x 4 boinc boinc 4096 Apr  4 03:01 ..
drwxrwx--x 2 boinc boinc 4096 Apr  4 03:01 .
root@srv /var/lib/boinc # cat account_boinc.bakerlab.org_rosetta.xml
<account>
    <master_url>https://boinc.bakerlab.org/rosetta/</master_url>
    <authenticator>1111_111111111111111111111111111</authenticator>
    <project_name>Rosetta@home</project_name>
<project_preferences>

<resource_share>15</resource_share>
<project_specific>
<color_scheme>Tahiti Sunset</color_scheme>
</project_specific>
</project_preferences>
<gui_urls>

    <gui_url>
        <name>Message boards</name>
        <description>Correspond with other users on the Rosetta@home message boards</description>
        <url>http://boinc.bakerlab.org/rosetta/forum_index.php</url>
    </gui_url>
    <gui_url>
        <name>Your account</name>
        <description>View your account information</description>
        <url>http://boinc.bakerlab.org/rosetta/home.php</url>
    </gui_url>
    <gui_url>
        <name>Your tasks</name>
        <description>View the last week or so of computational work</description>
        <url>http://boinc.bakerlab.org/rosetta/results.php?userid=xxxx</url>
    </gui_url>
    
</gui_urls>
</account>

The simplest nagios setup to make a phone call on a critical notification – using twilio service

The aim of this article is to show the simplest way to make your Nagios monitoring system place a phone call on a critical notification, in our example CRITICAL host DOWN (the host is unreachable)! You do not need to set up any server software (not even a VoIP server); the only thing you need is an account at https://www.twilio.com/, which can even be a free/trial account. The idea is to keep it simple, cheap and easy to accomplish.
To sum up:

make a phone call to a real phone number on a CRITICAL Nagios notification

Here is what you need:

  1. A https://www.twilio.com/ trial/free account.
  2. root access to the Nagios server, because you will add a simple bash script that uses “curl” to open a URL

The idea is to make a phone call on a really CRITICAL issue so that it wakes up the person on duty, who then checks the monitoring system and the SMS messages received. The call does not need to be answered at all: the ringing phone (or the continuous vibration of the smart band on your wrist!) should be enough as a second, different type of notification after the first one, which in most cases is only an SMS (and the SMS carries the information about the problem).
Keep on reading!
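As a rough illustration of the “bash script that uses curl” idea, here is a minimal sketch based on the public Twilio REST API for calls. It is not the script from the full article. The function names and the environment variables TWILIO_SID, TWILIO_TOKEN, FROM_NUMBER, TO_NUMBER and TWIML_URL are our own assumptions, and how the script is wired into a Nagios notification command is left out.

```shell
#!/usr/bin/env bash
# Sketch of a notification helper that places a call through the
# Twilio REST API. All variable names here are illustrative assumptions.

# Build the Calls endpoint for a given Twilio account SID.
twilio_calls_url() {
    printf 'https://api.twilio.com/2010-04-01/Accounts/%s/Calls.json\n' "$1"
}

# Place the call. Expects TWILIO_SID, TWILIO_TOKEN, FROM_NUMBER and
# TO_NUMBER in the environment. TWIML_URL must point to a document with
# TwiML instructions for what happens if the call is answered.
make_call() {
    curl -sS -X POST "$(twilio_calls_url "$TWILIO_SID")" \
        -u "${TWILIO_SID}:${TWILIO_TOKEN}" \
        --data-urlencode "To=${TO_NUMBER}" \
        --data-urlencode "From=${FROM_NUMBER}" \
        --data-urlencode "Url=${TWIML_URL}"
}
```

Since the goal is only to make the phone ring, the content served at TWIML_URL matters little.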

rsync and selinux – opendir failed: Permission denied

SELinux can sometimes interfere with your setup. Let’s say you configured your rsync daemon, but you still get a permission error when running rsync to copy files!

rsync: opendir "/." (in backup2) failed: Permission denied (13)

Apparently, the rsync client connects to the server and finds a section named “backup2”, but permission is still denied even though you explicitly set uid and gid to root in the section (uid=0 and gid=0)!

The most common reason is

SELinux denies the rsync process permission to open the directory exported by the path in your rsync configuration file.

By default, SELinux denies the rsync daemon access to most of the files and directories on your system! In most cases, here is what can help:

setsebool -P rsync_export_all_ro=1

rsync_export_all_ro allows rsync to export any files and directories read-only, so requests like the one above will not be denied.
The capital “-P” makes the setting persistent across reboots.
Keep on reading!
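Before changing the boolean it can be useful to check its current value with getsebool. The sketch below assumes the usual one-line output format “rsync_export_all_ro --> off”; the helper names are our own.

```shell
#!/usr/bin/env bash
# Decide from one line of "getsebool" output (read from stdin) whether
# the boolean is on. Expected format: "rsync_export_all_ro --> off"
sebool_is_on() {
    awk '{ state = $3 } END { exit ((state == "on") ? 0 : 1) }'
}

# Enable the boolean only when it is not already on.
# Must be run as root on a system with the SELinux tools installed.
ensure_rsync_export_ro() {
    if getsebool rsync_export_all_ro | sebool_is_on; then
        echo "rsync_export_all_ro is already on"
    else
        setsebool -P rsync_export_all_ro=1
    fi
}
```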

Update Supermicro BMC/IPMI Firmware – under Linux console

Here is our log of upgrading the Supermicro IPMI firmware from the Linux console, using the CLI tool included in the firmware package for your IPMI unit.
If your server has a built-in IPMI unit on the motherboard, its firmware is listed next to the BIOS firmware on the Supermicro site. Go to the page of your motherboard; the BIOS and IPMI firmware links are on the left. The IPMI firmware package contains Windows, DOS and Linux executables for flashing the firmware from the console.
Here we flash new firmware to our motherboard, an X10SLM+-F.

On the left, under “Links & Resources”, click “BMC/IPMI Firmware” to download the latest IPMI firmware for your motherboard.

[Image: Motherboard X10SLM+-F page in Supermicro site, main menu]

Upload the downloaded file to your server.

STEP 1) Unpack the firmware file downloaded from Supermicro site.

Here we include the full output of “unzip” so you can see which files are included. The files we use here are 2.07/linux/x64/AlUpdate and REDFISH_X10_372.bin.

[root@srv ~]# ls -altr
total 25904
drwxr-xr-x. 94 root root    81920  3 Feb 17,42 ..
drwxr-xr-x.  2 root root     4096  3 Feb 17,43 .
-rw-r--r--.  1 root root 26432121  3 Feb 17,43 REDFISH_X10_372.zip
[root@srv ~]# mkdir REDFISH_X10_372
[root@srv ~]# cd REDFISH_X10_372/
[root@srv ~/REDFISH_X10_372]# unzip ../REDFISH_X10_372.zip 
Archive:  ../REDFISH_X10_372.zip
  inflating: Redfish_Ref_Guide_2.0.pdf  
   creating: 2.07/
   creating: 2.07/dos/
  inflating: 2.07/dos/AdUpdate.exe   
   creating: 2.07/linux/
   creating: 2.07/linux/x32/
  inflating: 2.07/linux/x32/AlUpdate  
   creating: 2.07/linux/x64/
  inflating: 2.07/linux/x64/AlUpdate  
  inflating: 2.07/ReleaseNote.txt    
   creating: 2.07/windows/
   creating: 2.07/windows/x32/
  inflating: 2.07/windows/x32/AwUpdate.exe  
  inflating: 2.07/windows/x32/phymem32.sys  
  inflating: 2.07/windows/x32/pmdll32.dll  
  inflating: 2.07/windows/x32/superbmc32.sys  
  inflating: 2.07/windows/x32/superdll_ssm32.dll  
   creating: 2.07/windows/x64/
  inflating: 2.07/windows/x64/AwUpdate.exe  
  inflating: 2.07/windows/x64/phymem64.sys  
  inflating: 2.07/windows/x64/pmdll64.dll  
  inflating: 2.07/windows/x64/superbmc.sys  
  inflating: 2.07/windows/x64/superdll_ssm64.dll  
  inflating: IPMI Firmware Update_NEW.doc  
  inflating: REDFISH_X10_372.bin

There are 5 versions of the flash utility: Linux 32-bit and 64-bit, Windows 32-bit and 64-bit, and a DOS version.

STEP 2) Flash the BMC/IPMI firmware.

We chose not to preserve the configuration, because some old settings might be incompatible with the new firmware. This is not mandatory: we also tested preserving the old configuration and had no problems afterwards.
We change almost nothing in the IPMI configuration except the admin password and the network settings, and when flashing from the OS you can reconfigure them after the flashing process. Your server stays up and running, and you can use “ipmitool” to configure the IPMI module.
The whole process took about 15 minutes.

[root@srv ~/REDFISH_X10_372]# 2.07/linux/x64/AlUpdate -f REDFISH_X10_372.bin -r n
sh: cls: command not found
*****************************************************************************
* ATEN Technology, Inc.                                                     *
*****************************************************************************
* FUNCTION   :  IPMI FIRMWARE UPDATE UTILITY                                *
* VERSION    :  2.07                                                        *
* BUILD DATE :  Jul 13 2016                                                 *
* USAGE      :                                                              *
*             (1)Update FIRMWARE : AlUpdate -f filename.bin [OPTION]        *
*             (2)Dump FIRMWARE   : AlUpdate -d filename                     *
*             (3)Restore CONFIG  : AlUpdate -c -f filename.bin              *
*             (4)Backup CONFIG   : AlUpdate -c -d filename.bin              *
*****************************************************************************
* OPTION                                                                    *
*   -i the IPMI channel, currently, kcs and lan are supported               *
* LAN channel specific arguments                                            *
*   -h remote BMC address and RMCP+ port, (default port is 623)             *
*   -u IPMI user name                                                       *
*   -p IPMI password correlated to IPMI user name                           *
*   -r Preserve Configuration (default is Preserve)                         *
*      n:No Preserve, reset to factory default settings                     *
*      y:Preserve, keep all of the settings                                 *
*   -c IPMI configuration backup/restore                                    *
*      -f [restore.bin] Restore configurations                              *
*      -d [backup.bin] Backup configurations                                *
*****************************************************************************
* EXAMPLE                                                                   *
*   we like to upgrade firmware through KCS channel                         *
*   AlUpdate -f fwuperade.bin -i kcs -r y                                   *
*   AlUpdate -d fwdump.bin -i kcs -r y                                      *
*                                                                           *
*   we like to restore/backup IPMI config through KCS channel               *
*   AlUpdate -c -f restore.bin -i kcs -r y                                  *
*   AlUpdate -c -d backup.bin -i kcs -r y                                   *
*                                                                           *
*   we like to upgrade firmware through LAN channel with                    *
*   - BMC IP address 10.11.12.13 port 623                                   *
*   - IPMI username is usr                                                  *
*   - Password for alice is pwd                                             *
*   - Preserve Configuration                                                *
*   AlUpdate -f fw.bin -i lan -h 10.11.12.13 623 -u usr -p pwd -r y         *
*   AlUpdate -d fwdump.bin -i lan -h 10.11.12.13 623 -u usr -p pwd -r y     *
*                                                                           *
*   we like to restore/backup IPMI config through LAN channel with          *
*   - BMC IP address 10.11.12.13 port 623                                   *
*   - IPMI username is usr                                                  *
*   - Password for alice is pwd                                             *
*   - Preserve Configuration                                                *
*   AlUpdate -c -f fw.bin -i lan -h 10.11.12.13 623 -u usr -p pwd           *
*   AlUpdate -c -d fwdump.bin -i lan -h 10.11.12.13 623 -u usr -p pwd       *
*****************************************************************************

2.07/linux/x64/AlUpdate -f REDFISH_X10_372.bin -r n 
Try open dev ipmi0....
Check if this file is valid................
If the FW update fails,PLEASE TRY AGAIN
Load part 0   126008 bytes, [Ok]                       
Load part 1 14635008 bytes, [Ok]                       
Load part 2  1537585 bytes, [Ok]                       
Load part 3  8081440 bytes, [Ok]                       
Load part 4   262144 bytes, [Ok]                       



                 If the FW update fails. PLEASE WAIT 5 MINS AND REMOVE THE AC...
new firmware is updating...100%
Update Complete,Please wait for BMC reboot, about 1 min                       
[root@srv ~/REDFISH_X10_372]# 

All the lines starting with “Load part” show a progress percentage while loading, like:

Load part 1 14635008 bytes,  4137K bytes   29%

The line starting with “new firmware is updating…” also shows a percentage, like:

new firmware is updating...28%

In dmesg you can see the IPMI module reset (its virtual USB keyboard and mouse disconnect and reconnect):

[root@conv1 ~]# dmesg
[1954154.242383] usb 3-7: USB disconnect, device number 2
[1954154.242385] usb 3-7.1: USB disconnect, device number 3
[1954185.337154] usb 3-7: new high-speed USB device number 4 using xhci_hcd
[1954185.501356] usb 3-7: New USB device found, idVendor=0557, idProduct=7000
[1954185.501358] usb 3-7: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1954185.501879] hub 3-7:1.0: USB hub found
[1954185.501923] hub 3-7:1.0: 4 ports detected
[1954185.899168] usb 3-7.1: new low-speed USB device number 5 using xhci_hcd
[1954185.999375] usb 3-7.1: New USB device found, idVendor=0557, idProduct=2419
[1954185.999376] usb 3-7.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1954186.000708] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.0/input/input10
[1954186.051346] hid-generic 0003:0557:2419.0003: input,hidraw0: USB HID v1.00 Keyboard [HID 0557:2419] on usb-0000:00:14.0-7.1/input0
[1954186.052050] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.1/input/input11
[1954186.052423] hid-generic 0003:0557:2419.0004: input,hidraw1: USB HID v1.00 Mouse [HID 0557:2419] on usb-0000:00:14.0-7.1/input1
[1954199.668503] usb 3-7.1: USB disconnect, device number 5
[1954201.450533] usb 3-7.1: new low-speed USB device number 6 using xhci_hcd
[1954201.550755] usb 3-7.1: New USB device found, idVendor=0557, idProduct=2419
[1954201.550756] usb 3-7.1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[1954201.552044] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.0/input/input12
[1954201.602658] hid-generic 0003:0557:2419.0005: input,hidraw0: USB HID v1.00 Keyboard [HID 0557:2419] on usb-0000:00:14.0-7.1/input0
[1954201.603372] input: HID 0557:2419 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7.1/3-7.1:1.1/input/input13
[1954201.603729] hid-generic 0003:0557:2419.0006: input,hidraw1: USB HID v1.00 Mouse [HID 0557:2419] on usb-0000:00:14.0-7.1/input1
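After flashing with “-r n” the BMC comes back with factory defaults, so the network settings have to be set again with ipmitool. The following sketch is not part of our log. The helper names, LAN channel number 1 and the addresses are assumptions, so check them against your board. By default the helper only prints the commands, and it runs them only when APPLY=1 is set.

```shell
#!/usr/bin/env bash
# Print (or, with APPLY=1, run) the ipmitool commands that restore the
# BMC network settings after a flash with "-r n" (factory defaults).
# The default LAN channel 1 is an assumption; pass another as the 4th argument.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# $1 = BMC IP address, $2 = netmask, $3 = gateway, $4 = LAN channel (optional)
restore_bmc_lan() {
    local ip="$1" mask="$2" gw="$3" ch="${4:-1}"
    run ipmitool mc info                       # shows the firmware revision
    run ipmitool lan set "$ch" ipsrc static    # static address instead of DHCP
    run ipmitool lan set "$ch" ipaddr "$ip"
    run ipmitool lan set "$ch" netmask "$mask"
    run ipmitool lan set "$ch" defgw ipaddr "$gw"
    run ipmitool lan print "$ch"               # verify the result
}

# Usage (dry run, then apply as root):
#   restore_bmc_lan 10.11.12.13 255.255.255.0 10.11.12.1
#   APPLY=1 restore_bmc_lan 10.11.12.13 255.255.255.0 10.11.12.1
```

Printing first makes it easy to review the commands before they touch the BMC.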

Tune nginx proxy cache – control the cache manager how to delete cached files

In most cases you will never need to modify the default settings for deleting cache items in the proxy_cache_path directive. The problem is that during a peak the file deletion can hurt the performance of your server and can even leave the server unresponsive for a period of time. You cannot give nginx a schedule for deleting cached items or forbid deletion while the server is busy. The cache manager just tracks the used capacity of each zone against the maximum allowed size (max_size), and when the used capacity is near or above it, the manager starts deleting with the default values: it deletes up to 100 files per iteration (an iteration lasts up to 200 milliseconds), sleeps for 50 milliseconds, and then deletes up to 100 files again. So your file system could receive well over 1000 file deletions per second (up to about 2000 if each iteration finishes almost instantly)!

This could lead your server to an almost unresponsive state in the peaks.

The same deletion rate could be perfectly OK in off-peak hours, but there is no way to tell the nginx cache manager that there is plenty of free disk space even though the cache limit is reached, so right now is not the best time to delete cached files!

You can tune three parameters per cache directory (manual here: https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path):

  • manager_files – the maximum number of files to delete in one iteration. The default value is 100.
  • manager_threshold – limits the duration of one delete iteration. The default value is 200 milliseconds. Use nginx time syntax, with the unit attached directly to the number: for 500 milliseconds write “500ms”.
  • manager_sleep – how long the manager sleeps before starting another delete iteration. The default value is 50 milliseconds. Here too, use nginx time syntax: for 500 milliseconds write “500ms”.
        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g manager_files=2 manager_sleep=200ms manager_threshold=500ms;

With this configuration the cache manager deletes no more than 2 files per iteration, an iteration lasts up to 500 milliseconds, and the manager sleeps 200 milliseconds before the next delete iteration.

The best option for loaded servers

The best option for loaded servers with full cache is to balance the free space – delete small amount of files at once to be sure your server will not get loaded even the free space decreases at the peaks (so more files are cached than the nginx manager could delete – you are aware of this and the free space should be enough), but during the off peak (which normally is several times longer than the peak) the nginx manager could catch up with the deleting and it should free up some space (cached files are lesser than the deleted ones). Of course, you should tune this according to your situation.
The main idea is to delete in small amounts of files to not saturate your disks it could take longer to recover the free space, but it will not load your server in peaks. You should consider two things:

  1. Free space – there must be enough free space for the peaks, when the cache could grow above the threshold.
  2. Number of deletions per iteration – you should experiment with this. First you should be aware of how many files are added over a period of time that includes one peak and one off-peak, and then balance the number so that at the end of the period the cache is not above the maximum size. Probably the best is to start with a 24-hour period, which includes at least one peak.

As you can see in the example above, only 2 files per iteration are good enough in our case. Taking into account the 200ms sleep between the iterations, at most 10 files are deleted per second. In our case this is not enough for the peak, but the off-peak, which is 20 hours out of every 24, is long enough to get the cache back under its maximum size limit.
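The arithmetic behind these numbers can be sketched in a few lines of shell. This is only a rough upper bound: it assumes the deletions themselves take no time, so the real rate will be lower. The function names are our own and are not part of nginx.

```shell
#!/bin/sh
# Upper bound on the files the cache manager can delete per second.
# Assumption: the time spent in the deletions themselves is ignored.
# $1 = manager_files, $2 = manager_sleep in milliseconds
max_deletes_per_second() {
    echo $(( $1 * 1000 / $2 ))
}

# The same upper bound for a window of $3 hours (for example the off-peak).
max_deletes_per_window() {
    echo $(( $(max_deletes_per_second "$1" "$2") * 3600 * $3 ))
}

max_deletes_per_second 2 200       # prints 10
max_deletes_per_window 2 200 20    # prints 720000
```

If more files than the second number are added to the cache in 24 hours, manager_files is probably too low for your server.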

Here you can learn how to verify that your nginx is deleting cache files, and what the impact of the default settings is on a busy server at peak: how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process). Our loaded server just stopped serving files and the bandwidth decreased by 99% because the nginx cache manager suddenly started deleting cached files.

how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process)

At peak, deleting files could kill your server: if the nginx cache manager starts deleting files, the traffic can easily drop to a fraction of the normal!

The server is perfectly normal, but suddenly it gets loaded and all nginx processes are in D (“Disk sleep”) state.

What could it be? What is going on with your proxy server?

Probably the cache is full!

Unfortunately there is no way to check live how full the cache is. Only an upgrade or restart of the nginx process will trigger the nginx cache loader to check all the cache files, and it will write the cache size to the error log on exit. But be careful: the cache loading is also an IO intensive operation, because it stats all the cache files, and there could be millions of images.

If you are sure the cache manager is to blame for the IO load of your server (probably using this method: Check whether nginx cache manager is deleting files at the moment), you can stop it almost immediately!

Just increase the nginx cache drastically – add zero to the maximum cache size

Of course, you should have enough free space until you resolve the problem, for example with more servers, manual deletion during the off-peak, tuning of your cache deletion, or any other solution.
Search for something like

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=400g

And append a zero to the max_size number, like this:

        proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=4000g

The max size will increase from 400G to 4000G (4T)!
After an nginx reload this will effectively stop the file deletion, and the nginx cache manager will sleep for a long time before it starts deleting files again. This could be a life-saving operation for your server at peak!
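The change can be sketched in shell as follows. The helper bump_max_size is a hypothetical name of ours and only transforms text; the reload commands are left as comments because the configuration path depends on your setup.

```shell
#!/bin/sh
# Append a zero to the first max_size=<number> on each line read from stdin.
# Hypothetical helper (our own name), it does not touch any file.
bump_max_size() {
    sed -E 's/max_size=[0-9]+/&0/'
}

# Preview the change on the line from this article:
echo 'proxy_cache_path /mnt/cache levels=1:2 keys_zone=CACHESTATICS:900m inactive=710h max_size=400g' | bump_max_size
# prints the same line ending with max_size=4000g

# On a real server, after editing the configuration file by hand:
#   nginx -t && nginx -s reload
```

Testing the configuration with nginx -t before the reload is a cheap safety net, because a typo at peak time would only make things worse.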

Here is a real graph from one of our servers – the cache manager started deleting files from the cache and the traffic dropped 99%!!!

SCREENSHOT 1) The nginx cache manager just started to delete files from the cache and this operation just killed our server completely.

You can see almost zero bandwidth! The problem was resolved when we reloaded nginx with a bigger cache max_size value. The nginx cache manager immediately went to sleep and there was no more IO for deleting files. The load of the server returned to normal!

nginx cache manager start deleting files

SCREENSHOT 2) Hard drives were saturated and the disk IO time maxed out at 10 ms.

Despite the bigger READ and WRITE IOPS there was 95-99% less traffic.

Disk IO Time when cache manager is working

Then you can tune how the files are deleted from the cache, see: Tune nginx proxy cache – control the cache manager how to delete cached files.

Check whether nginx cache manager is deleting files at the moment

Here is a tip for webmasters (or system admins) on how to discover whether an nginx using proxy_cache is deleting cached files at the moment! There are situations where you need to know whether the load of a static media server is caused by the deletions of the cache manager or by the read and seek operations of serving the static files. The deletion is a really slow and IO intensive operation, which could greatly impact the performance and traffic of the server.
Find the nginx “cache manager process” and strace it:

[root@srv ~]# ps axuf|grep nginx
root     31582  0.0  0.0 2906768 25108 ?       Ss   Feb15   0:01 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx    16008  1.9  1.3 2941188 440224 ?      S    16:39   1:33  \_ nginx: worker process
nginx    16009  1.5  1.2 2941188 398836 ?      S    16:39   1:12  \_ nginx: worker process
nginx    16010  0.5  0.7 2941984 239064 ?      S    16:39   0:26  \_ nginx: worker process
nginx    16011  0.7  0.9 2941984 299356 ?      D    16:39   0:35  \_ nginx: worker process
nginx    16012  1.2  1.1 2941188 389540 ?      D    16:39   1:01  \_ nginx: worker process
nginx    16013  2.3  1.5 2941188 487324 ?      D    16:39   1:55  \_ nginx: worker process
nginx    16014  0.0  0.6 2906772 224004 ?      S    16:39   0:01  \_ nginx: cache manager process
[root@srv ~]# strace -f -p 16014
strace: Process 16014 attached
gettid()                                = 16014
write(31, "2019/02/25 18:00:31 [info] 16014"..., 89) = 89
epoll_wait(36, [], 512, 5406)           = 0
unlink("/mnt/cache/0/39/c8ccbbc06d16debb1c8d58ceb6f99390") = 0
unlink("/mnt/cache/0/78/118924d7bf70e20fa8f790c6f9e7c780") = 0
unlink("/mnt/cache/3/ce/fab074cc670e6a80114dcbc398a63ce3") = 0
unlink("/mnt/cache/5/48/0b4e162dd7be8244815721fb7d68e485") = 0
unlink("/mnt/cache/5/56/e5eb4b38c7c8d209d0aabaf79ac02565") = 0
unlink("/mnt/cache/e/c6/207b432fa77375e4eefcaf52db250c6e") = 0
unlink("/mnt/cache/4/6d/ac0db27a03dabc79d869068db1b516d4") = 0
unlink("/mnt/cache/9/e8/91625c6e60de8e5425c4135c7dfb2e89") = 0
unlink("/mnt/cache/b/3c/f3c53000cf0cb20d55d8c09df8a733cb") = 0
unlink("/mnt/cache/f/f7/6f06423cd411b45816969fe020903f7f") = 0
unlink("/mnt/cache/f/50/c9b8ab72821a6e9bcb9c8d4b790dc50f") = 0
unlink("/mnt/cache/6/1f/74b0f1fdf1ac30db6af7793dc15671f6") = 0
unlink("/mnt/cache/0/83/caf199c1b99d438f96caec71bf2ea830") = 0
unlink("/mnt/cache/4/3d/c90f8fbbba4aaf407e386641dc2203d4") = 0
unlink("/mnt/cache/4/ad/d23cf8598020141b2bcec46d2b5cbad4") = 0
unlink("/mnt/cache/d/47/05973bc310503f36c67b7c1c24c8247d") = 0
unlink("/mnt/cache/f/11/e4fcbde8533d89105ab41f22c55e211f") = 0
unlink("/mnt/cache/2/06/29066a58e4116d24266026b4ed1e3062") = 0
epoll_wait(32, [], 512, 50)             = 0
unlink("/mnt/cache/4/6b/9a104ebdf70d00137a88d4584b2bb6b4") = 0
unlink("/mnt/cache/e/95/6d176447f57f21769d86a8f0b2a8b95e") = 0
unlink("/mnt/cache/b/b2/2f6f51163c65ae1fc06a913d6de1ab2b") = 0
unlink("/mnt/cache/a/24/2b058045a23b69de7a4442c9e6fce24a") = 0
unlink("/mnt/cache/7/60/00833e0b236ca8472f5be8227d645607") = 0
unlink("/mnt/cache/a/08/bf00eea300eff97dc4fffa61daaca08a") = 0
unlink("/mnt/cache/2/48/a291d8aca2b6f4f9471686eabe9b2482") = 0
unlink("/mnt/cache/0/e3/2d631adbc3bfdf8e44a51fa5453eee30") = 0
unlink("/mnt/cache/1/3b/08eef7c86c5ece9b5279b304dd86e3b1") = 0
unlink("/mnt/cache/b/a4/03213e4a8a1e8fb17ae698e54e70fa4b") = 0
unlink("/mnt/cache/b/a3/77f1b11811a9cda0ae93c498769f7a3b") = 0
unlink("/mnt/cache/4/01/1d50fac60681ae3263c8875775d20014") = 0
unlink("/mnt/cache/c/94/e71b96cbc65b248bd8e4540cbd69294c") = 0
unlink("/mnt/cache/1/59/99ec58e865b97e217835dd84f5f48591") = 0
unlink("/mnt/cache/4/b8/6a64825ce555b8f2440f051a7f7bcb84") = 0
unlink("/mnt/cache/7/51/fe2acbb895427ed8e406ce7e79d61517") = 0
.....
.....
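If you prefer a number instead of a scrolling screen, the output above can be reduced to a count of deletions. This is a minimal sketch: count_unlinks is our own helper name, and the live commands in the comments assume that pgrep can find the cache manager by its process title.

```shell
#!/bin/sh
# Count the unlink()/unlinkat() calls in strace output read from stdin.
# The optional "[pid N] " prefix appears when strace follows several processes.
count_unlinks() {
    grep -c -E '^(\[pid +[0-9]+\] )?unlink(at)?\('
}

# Example with three lines copied from the output above:
printf '%s\n' \
  'epoll_wait(32, [], 512, 50)             = 0' \
  'unlink("/mnt/cache/4/6b/9a104ebdf70d00137a88d4584b2bb6b4") = 0' \
  'unlink("/mnt/cache/e/95/6d176447f57f21769d86a8f0b2a8b95e") = 0' | count_unlinks
# prints 2

# On a live server (as root), sample the cache manager for 10 seconds:
#   pid=$(pgrep -f 'nginx: cache manager process')
#   timeout 10 strace -f -e trace=unlink,unlinkat -p "$pid" 2>&1 | count_unlinks
```

A count of zero over the sample means the cache manager was not deleting files during that window.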

You can tune the removal of files from the cache with the manager_files, manager_threshold and manager_sleep arguments of proxy_cache_path.
If you came here searching for information on the topic, you should probably check out these articles, too: how to disable effectively the deleting (purging) files from nginx proxy_cache (nginx cache manager process) and Tune nginx proxy cache – control the cache manager how to delete cached files

CentOS 7 server hangs on boot after deleting a software RAID (mdadm device)

We have a CentOS 7 server with a simple setup of two hard drives in RAID1, with a total of 4 mdadm devices for boot, root, swap and storage. The storage device (/dev/md5) was removed and recreated as RAID0 for better performance, because the server was repurposed as a cache-only server. Then the server was restarted and it never came back up.
On the IPMI KVM we saw that it started loading the kernel and hung after several seconds without any additional information:

The kernel loads the mdadm devices and does not continue, and the device md5 is missing.

CentOS 7 kernel loading the mdadm RAID devices

To boot successfully you must remove the missing device

On the Grub 2 menu press “e” and you’ll get this screen. Here you can edit all lines if you need. You must remove the last rd.md.uuid in our case or the one you deleted. Remove it and press Ctrl+x to load the kernel.

Grub 2 edit

There are two options:

  • OPTION 1) Remove rd.md.uuid option of your old mdadm device
  • OPTION 2) Replace the ID in rd.md.uuid= with the new ID of the mdadm device.

Either of these two options solves the booting problem. Edit /etc/default/grub, replace or remove the rd.md.uuid and generate the grub.cfg.
You can find the old mdadm ID in /etc/mdadm.conf (if you have not replaced it there yet).

[root@srv ~]# cat /etc/mdadm.conf 
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=0.90 UUID=9c08f218:cd5c0f8f:d96bc0d1:57b77e99
ARRAY /dev/md3 level=raid1 num-devices=2 metadata=1.2 name=2035110:swap UUID=1f74a2e0:757bfb9f:9c860e50:325f37cb
ARRAY /dev/md4 level=raid1 num-devices=2 metadata=1.2 name=2035110:root UUID=29bf4aa8:b7dae21a:45f4c188:baea4c13
ARRAY /dev/md5 level=raid1 num-devices=2 metadata=1.2 name=2035110:storage1 UUID=e6eb2590:b767be36:c76bb869:45ff0c3c
[root@srv ~]# mdadm --detail --scan
ARRAY /dev/md2 metadata=0.90 UUID=9c08f218:cd5c0f8f:d96bc0d1:57b77e99
ARRAY /dev/md3 metadata=1.2 name=2035110:swap UUID=1f74a2e0:757bfb9f:9c860e50:325f37cb
ARRAY /dev/md4 metadata=1.2 name=2035110:root UUID=29bf4aa8:b7dae21a:45f4c188:baea4c13
ARRAY /dev/md/5 metadata=1.2 name=s2035110:5 UUID=901074eb:16ba7c5b:0af69934:e9444102
[root@srv ~]# mdadm --detail --scan > /etc/mdadm.conf 
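The comparison between the boot parameters and the existing arrays can also be scripted. The sketch below uses a hypothetical helper of ours, stale_md_uuids, which only compares two pieces of text, and it relies on grep -o (available in GNU and BSD grep).

```shell
#!/bin/sh
# Print every rd.md.uuid value found in the text $1 (a kernel command line)
# that does not appear as UUID=... in the text $2 ("mdadm --detail --scan" output).
stale_md_uuids() {
    printf '%s\n' "$1" | grep -o 'rd\.md\.uuid=[0-9a-f:]*' | while IFS= read -r opt; do
        uuid=${opt#rd.md.uuid=}
        printf '%s\n' "$2" | grep -q -F "UUID=$uuid" || printf '%s\n' "$uuid"
    done
}

# Shortened values from this article:
cmdline='rd.md.uuid=29bf4aa8:b7dae21a:45f4c188:baea4c13 rd.md.uuid=e6eb2590:b767be36:c76bb869:45ff0c3c console=tty0'
scan='ARRAY /dev/md4 metadata=1.2 name=2035110:root UUID=29bf4aa8:b7dae21a:45f4c188:baea4c13
ARRAY /dev/md/5 metadata=1.2 name=s2035110:5 UUID=901074eb:16ba7c5b:0af69934:e9444102'
stale_md_uuids "$cmdline" "$scan"   # prints e6eb2590:b767be36:c76bb869:45ff0c3c

# On a real server (as root), before you reboot:
#   stale_md_uuids "$(cat /etc/default/grub)" "$(mdadm --detail --scan)"
```

Any UUID it prints is still requested at boot but no longer exists, which is exactly the situation that hung our server.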

Here is our old /etc/default/grub:

[root@srv ~]# cat /etc/default/grub 
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200"
GRUB_CMDLINE_LINUX="rd.md.uuid=9c08f218:cd5c0f8f:d96bc0d1:57b77e99 rd.md.uuid=1f74a2e0:757bfb9f:9c860e50:325f37cb rd.md.uuid=29bf4aa8:b7dae21a:45f4c188:baea4c13 rd.md.uuid=e6eb2590:b767be36:c76bb869:45ff0c3c console=tty0 crashkernel=auto console=ttyS0,115200 net.ifnames=1"
GRUB_DISABLE_RECOVERY="true"

Here we edit our /etc/default/grub, replace the old UUID and generate /boot/grub2/grub.cfg (legacy BIOS):

[root@srv ~]# cat /etc/default/grub 
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200"
GRUB_CMDLINE_LINUX="rd.md.uuid=9c08f218:cd5c0f8f:d96bc0d1:57b77e99 rd.md.uuid=1f74a2e0:757bfb9f:9c860e50:325f37cb rd.md.uuid=29bf4aa8:b7dae21a:45f4c188:baea4c13 rd.md.uuid=901074eb:16ba7c5b:0af69934:e9444102 console=tty0 crashkernel=auto console=ttyS0,115200 net.ifnames=1"
[root@srv ~]# grub2-mkconfig -o /boot/grub2/grub.cfg 
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-957.5.1.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-957.5.1.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-05cb8c7b39fe0f70e3ce97e5beab809d
Found initrd image: /boot/initramfs-0-rescue-05cb8c7b39fe0f70e3ce97e5beab809d.img
done
[root@srv ~]# reboot
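If you do not want to edit the long GRUB_CMDLINE_LINUX line by hand, the replacement can be done with sed. This is a sketch under the assumption that both UUIDs contain only hex digits and colons; replace_md_uuid is our own helper name, and OLD and NEW in the comments are placeholders.

```shell
#!/bin/sh
# Replace one rd.md.uuid value with another in text read from stdin.
# $1 = old UUID, $2 = new UUID
replace_md_uuid() {
    sed "s/rd\.md\.uuid=$1/rd.md.uuid=$2/"
}

echo 'rd.md.uuid=e6eb2590:b767be36:c76bb869:45ff0c3c console=tty0' |
    replace_md_uuid e6eb2590:b767be36:c76bb869:45ff0c3c 901074eb:16ba7c5b:0af69934:e9444102
# prints rd.md.uuid=901074eb:16ba7c5b:0af69934:e9444102 console=tty0

# On a real server, keep a backup and edit the file in place with the same expression:
#   cp /etc/default/grub /etc/default/grub.bak
#   sed -i "s/rd\.md\.uuid=OLD/rd.md.uuid=NEW/" /etc/default/grub
```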

Use this for UEFI BIOS boot:
First check if /boot and /boot/efi are mounted and if not you must mount them with:

mount /boot
mount /boot/efi

Generate the grub.cfg

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
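If you are not sure which of the two cases applies to your server, the directory /sys/firmware/efi exists only when the system was booted through UEFI. A small sketch that picks the path from this article accordingly (grub_cfg_path is our own helper name, and the optional argument exists only to make it testable):

```shell
#!/bin/sh
# Print the grub.cfg path used in this article for the current firmware type.
# $1 is an optional sysfs root (defaults to /sys).
grub_cfg_path() {
    if [ -d "${1:-/sys}/firmware/efi" ]; then
        echo /boot/efi/EFI/centos/grub.cfg
    else
        echo /boot/grub2/grub.cfg
    fi
}

grub_cfg_path
# On a real server (as root):
#   grub2-mkconfig -o "$(grub_cfg_path)"
```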

Bonus

In fact, when the original device was removed and the new one was added, we formatted it as usual. But it was not possible to mount it. You just execute

mount /dev/md5 /mnt/stor1

and there is no error, but no mount could be found: the device was not mounted, and when you execute

umount /mnt/stor1

The OS reports that “/mnt/stor1” is not mounted. Several more unsuccessful attempts were made to mount “/dev/md5”, then the restart was performed and the server never came back up.
We suppose systemd just did not allow the device to be mounted because of the rd.md.uuid boot parameters!
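Because mount returned no error here, it is worth checking the result explicitly instead of trusting the exit code. A minimal sketch using mountpoint from util-linux (check_mount is our own helper name):

```shell
#!/bin/sh
# Report whether a directory is really a mount point at the moment.
check_mount() {
    if mountpoint -q "$1"; then
        echo "$1 is mounted"
    else
        echo "$1 is NOT mounted"
    fi
}

check_mount /mnt/stor1
```

If it reports NOT mounted right after a mount command that printed no error, take it as a warning sign and check the boot parameters before you reboot.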