Copy files with read errors successfully – skipping only errors (i.e. bad sectors)

Sometimes disks have errors or an SSD disk has a bad NAND cell. Saving the whole hard disk data may not be needed and when only a specific file or two are important and which cannot be copied by cp or rsync because of “Unrecovered read error”.
Furthermore, the SSD reallocates the bad cells, when there are writes to the cell(s), which may not occur years, but reading may be each day. Reading from a sector with bad NAND cells will result in slow IO (multiple read commands are executed before giving up). Copying the file to a new place without only 512 bytes may not harm the data, but it is difficult to be done with the generic tool for copying.
This article is to save single files from a mounted ext4 file system with bad sectors using the ddrescue tool – In fact, the ddrescue could save files or whole devices.

STEP 1) Install ddrescue.

Installing ddrescue is pretty easy. The tool is included in almost all Linux distributions and it doesn’t have many dependencies. Apparently, there is another dd_rescue tool, which is different than this one, just follow the link above for the tool used here.
CentOS 7/8 or Fedora:

yum install -y ddrescue

Ubuntu last 10 years versions:

apt install -y gddrescue


emerge -v ddrescue

STEP 2) Rescuing a single file with read errors because of bad sectors in a mounted file system.

[root@srv Snapshots]# ddrescue -v \{9f02ae0a-6dae-4729-b6a6-ec3f0550f294\}.vdi test2.vdi
GNU ddrescue 1.25
About to copy 15724 MBytes from '{9f02ae0a-6dae-4729-b6a6-ec3f0550f294}.vdi' to 'test2.vdi'
    Starting positions: infile = 0 B,  outfile = 0 B
    Copy block size: 128 sectors       Initial skip size: 384 sectors
Sector size: 512 Bytes

Press Ctrl-C to interrupt
     ipos:   13495 MB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:   13495 MB, non-scraped:        0 B,  average rate:    162 MB/s
non-tried:        0 B,  bad-sector:     8192 B,    error rate:    4608 B/s
  rescued:   15724 MB,   bad areas:        2,        run time:      1m 36s
pct rescued:   99.99%, read errors:       18,  remaining time:          0s
                              time since last successful read:          0s
[root@srv Snapshots]# ls -al
total 52602944
drwx------. 2 root root        4096 Jun  2 02:22 .
drwxr-xr-x. 4 root root        4096 Jun  1 14:16 ..
-rw-------. 1 root root   459981735 Nov  8  2018 2018-11-08T15-19-17-776317000Z.sav
-rw-------. 1 root root   566704069 Jun  1 14:16 2020-06-01T11-16-05-735318000Z.sav
-rw-------. 1 root root  8329887744 Jun  1 12:53 {3d30ebea-2e2f-4e33-8088-d3d66f315e2c}.vdi
-rw-------. 1 root root 15724445696 Nov  8  2018 {9f02ae0a-6dae-4729-b6a6-ec3f0550f294}.vdi
-rw-------. 1 root root  4012900352 Jun  1 14:16 {f7e72510-7dce-48fd-b62c-630664ad984f}.vdi
-rw-r--r--. 1 root root 15724445696 Jun  2 02:24 test2.vdi
-rw-------. 1 root root  9051041792 Jun  2 02:19 test.vdi

Here is an animated gif of the ddrescue procedure:

main menu
ddrescue – copy files with bad sectors

Keep on reading!

Data too large, data for [] would be [] which is larger than the limit of

Rsyslog writing to Elasticsearch could lead to an error for some of the records and missing to save them in the backend:

{ ... { "error": { "root_cause": [ { "type": "circuit_breaking_exception", 
"reason": "[parent] Data too large, data for [<http_request>] would be [1008813778\/962mb], which is larger than the limit of [986061209\/940.3mb], 
real usage: [1008812248\/962mb], new bytes reserved: [1530\/1.4kb], usages [request=0\/0b, fielddata=317\/317b, in_flight_requests=1530\/1.4kb, accounting=178301893\/170mb]",
"bytes_wanted": 1008813778, "bytes_limit": 986061209, "durability": "PERMANENT" }], 
"type": "circuit_breaking_exception", "reason": "[parent] Data too large, data for [<http_request>] would be [1008813778\/962mb], which is larger than the limit of [986061209\/940.3mb], 
real usage: [1008812248\/962mb], new bytes reserved: [1530\/1.4kb], usages [request=0\/0b, fielddata=317\/317b, in_flight_requests=1530\/1.4kb, accounting=178301893\/170mb]",
"bytes_wanted": 1008813778, "bytes_limit": 986061209, "durability": "PERMANENT" }, "status": 429 } }

Unfortunately, such writes are not saved in the Elasticsearch and the data has been lost.

The problem here is that the Java VM has reached the maximum allowed memory and more memory should be allowed to be used by the Java Virtual Machine.

Find the Java VM option for the Elasticsearchjvm.options. In CentOS 7 the file is located in /etc/elasticsearch/jvm.options and set more memory with the variables “-Xms[SIZE]g -Xmx[SIZE]g”, such as:


|grep -v grep
This will allow 4G “maximum size of total heap space” to be used by the Java Virtual Machine. By default, it is 1G (-Xms1g -Xmx1g). It is a good idea to set it half of the server’s memory. Save and restart the Elasticsearch service as usual:

systemctl restart elasticsearch

You should see the variable in the command line with ps command:

[root@loganalyzer ~]# ps axuf|grep elasticsearch
elastic+   592 10.8 34.4 168638848 5493156 ?   Ssl  00:56   4:23 /usr/share/elasticsearch/jdk/bin/java -Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 
-Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false 
-Dlog4j2.disable.jmx=true -Djava.locale.providers=COMPAT 
-Xms4g -Xmx4g 
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch 
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log elasticsearch
-XX:MaxDirectMemorySize=2147483648 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch 
-Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true 
-cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/ --quiet
elastic+   690  0.0  0.0  70448  4516 ?        Sl   00:56   0:00  \_ /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

The environment variable ES_JAVA_OPTS could be used, too.

ES_JAVA_OPTS="-Xms4g -Xmx4g" ./bin/elasticsearch 

aptly remove a package from a repository using the cli

Here is a fast tip – how to remove a package from our local aptly repository:

  1. Remove the package from the local repository.
  2. Create a new snapshot form the local repository.
  3. Publish the snapshot by switching to the newly created snapshot from the above step.

The commands executing over repository with name xenial-apps to remove package with name example-app and version The snapshot name xenial-apps1588149526 is just a temporary name used for the snapshot (the ID is unix timestamp of the current time).

aptly repo remove  xenial-apps 'example-app (='
aptly snapshot create xenial-apps1588149526 from repo xenial-apps
aptly publish switch xenial-apps ubuntu xenial-apps1588149526

Real world example.

This is the log from our system with just changed names:
Keep on reading!

Dracut boot failed with missing device – exit and continue normal booting!

This issue deserves a much more article, in fact, a straightforward tip:

You may be able to continue a normal boot only by typing “exit” and hitting enter in the “Dracut” console.

Most of the time this Dracut console entering is caused because the system administrator of the server/machine added, replaced or deleted a RAID or similar device and forgot to update the configuration (grub2 probably). And in most of these cases, the raid is not critical for machine normal boot from the root partition, but it may be critical for the services lately. Booting in normal mode, even without some devices, is the main goal because under the normal mode it easier to repair the system.
Check out the two articles on the topic (especially the first one):

SCREENSHOT 1) Just type “exit” and hit enter.

It’s worth noting that if you executed some commands in the console and/or mounted devices to test they are with healthy file system or for whatever reason you did it, the boot process may not continue after typeing exit and probablly a reboot is required. The server will go once more in this mode and then just typing will work.

main menu
type exit

Keep on reading!

Dual 10Gbit network using PCI 2.0 (5GT/s) x4 – what is the maximum bandwidth?

Ever wondered what is the maximum bandwidth of a Dual 10Gbit LAN card, which can be reached using a dual 10Gbit ports card in a PCI Express 2.0 (Speed 5GT/s) and Width x4?
Here is the graph:
h3>SCREENSHOT 1) The bandwidth never exceeds 13.90Gbps (performed with only synthetic tests and mixed synthetic plus real http traffic).

main menu
Max graph bandwidth – below 14Gbps

As you can see the total of the two network ports is a little bit under 14Gbps. We are using intel dual-port controller:

Intel Corporation Ethernet Server Adapter X520-2

Even the dmesg reports the card is not in the right place:

[ 2.541813] ixgbe 0000:82:00.0: (Speed:5.0GT/s, Width: x4, Encoding Loss:20%)
[ 2.541832] ixgbe 0000:82:00.0: This is not sufficient for optimal performance of this card.
[ 2.541854] ixgbe 0000:82:00.0: For optimal performance, at least 20GT/s of bandwidth is required.
[ 2.541876] ixgbe 0000:82:00.0: A slot with more lanes and/or higher speed is suggested.
[ 2.541978] ixgbe 0000:82:00.0: MAC: 2, PHY: 19, SFP+: 5, PBA No: FFFFFF-0FF
[ 2.541996] ixgbe 0000:82:00.0: 00:16:31:fd:03:b8
[ 2.543027] ixgbe 0000:82:00.0: Intel(R) 10 Gigabit Network Connection
[ 2.694839] ixgbe 0000:82:00.1: Multiqueue Enabled: Rx Queue count = 48, Tx Queue count = 48 XDP Queue count = 0
[ 2.695531] ixgbe 0000:82:00.1: PCI Express bandwidth of 16GT/s available
[ 2.696087] ixgbe 0000:82:00.1: (Speed:5.0GT/s, Width: x4, Encoding Loss:20%)
[ 2.696631] ixgbe 0000:82:00.1: This is not sufficient for optimal performance of this card.
[ 2.697181] ixgbe 0000:82:00.1: For optimal performance, at least 20GT/s of bandwidth is required.
[ 2.697723] ixgbe 0000:82:00.1: A slot with more lanes and/or higher speed is suggested.
[ 2.698352] ixgbe 0000:82:00.1: MAC: 2, PHY: 19, SFP+: 6, PBA No: FFFFFF-0FF
[ 2.698890] ixgbe 0000:82:00.1: 00:16:31:fd:03:b9
[ 2.700436] ixgbe 0000:82:00.1: Intel(R) 10 Gigabit Network Connection

The controller is in the PCI Express slot – PCI 2.0 (Speed 5.0GT/s) Width x4, but the capability of the card is Speed 5GT/s, Width x8. This can be seen with “lspci -vvv” and the meanings with simple words:

  • LnkCap – it is the device capability. In fact, this is the hightest possible speed of the device put in the slot.
  • LnkSta – the actual speed of the PCI Express link.

If the device capacity (LnkCap) is higher than the actual speed (LnkSta) you could put the device in another slot with a higher capacity to take full advantage of the device.

In our case, the maximum bandwidth of the two ports of the Dual 10G port Intel card was just below 14Gbps (13.85Gbps ~ 13.95Gbps). After we move the very same card in another slot with the capability of Speed 5GT/s Width x8, the card’s maximum bandwidth increased to 19.20Gbps ~ 19.40Gbps.

SCREENSHOT 2) After changing the slot of the network card, which supports PCI 2.0 (5GT/s) Width x8, the bandwidth tops arround 19.40Gbps in synthetic tests (performed with iperf3).

main menu
Max graph bandwidth – almost 20Gbps

Keep on reading!

syslog – UDP local to rsyslog and send remote with TCP and compression

This article is to show how to log Nginx’s access logs locally using UDP to the local rsyslog daemon, which will send the logs to a remote rsyslog server using TCP and compression. In general, logs could generate a lot of traffic and using UDP over distant locations could result in packet loss respectively logs’ lines loss. The idea here is to log messages locally using UDP (non-blocking way) to a local Syslog server, which will send the stream to a remote central Syslog server using TCP connections to be sure no packets are lost. In addition, we are going to enable local caching (if the remote server is temporary unreachable) and compression between the two Syslog servers.
Our goal is to use

  • UDP for our client program (Nginx in the case) for non-blocking log writes.
  • TCP between our local machine and the remote syslog server – to be sure not to lose messages on bad connectivity.
  • local caching for our client machine – not to lose messages if the remote syslog is temporary unreachable.
  • compression between the local machine and the remote syslog server.

The configuration and the commands are tested on CentOS 7, CentOS 8 and Ubuntu 18 LTS. Check out UDP remote logging here – nginx remote logging to UDP rsyslog server (CentOS 7).

STEP 1) Configure client’s local rsyslog to accept UDP log messages only if the messages’ tags are “nginx”

Couple of things should be enabled in the local client-size rsyslog daemon:

  • rsyslog to accept UDP messages. Uncomment or add the following under section “Modules” (probably the first section in the file?) in /etc/rsyslog.conf
    $ModLoad imudp
    $UDPServerRun 514


    input(type="imudp" port="514")

    The first is the old syntax, which is still supported and the second is the new syntax. F simplicity, all of the following configuration will be using the new syntax, because the old one is depricated.

  • Add a rule to catch the tag containing “nginx” and execute action to forward the messages to the remote server
    if ($syslogtag == 'nginx:') then {
    action(type="omfwd" target="" port="10514" protocol="tcp" compression.Mode="single" ZipLevel="9"
    queue.filename="forwarding" queue.spoolDirectory="/var/log" queue.size="1000000" queue.type="LinkedList" queue.maxFileSize="1g" queue.SaveOnShutdown="on"
    & stop
  • The options are almost self-explanatory, the important ones are there is no retry limit count of reconnecting to the server, there is in-disk cache of maximum 1G if the remote server is unavailable and the compression per message is turned on. More on actions, the forward module and the queue

And restart the rsyslog:

systemctl restart rsyslog

Keep on reading!

Remove disk (all partitions) from software RAID1 with mdadm and change layout of the disk

The following article is to show how to remove healthy partitions from software RAID1 devices to change the layout of the disk and then add them back to the array.
The mdadm is the tool to manipulate the software RAID devices under Linux and it is part of all Linux distributions (some don’t install it by default so it may need to be installed).

Software RAID layout

[root@srv ~]# cat /proc/mdstat 
Personalities : [raid1] 
md125 : active raid1 sda4[1] sdb3[0]
      1047552 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdb2[0] sda3[1]
      32867328 blocks super 1.2 [2/2] [UU]
md127 : active raid1 sda2[1] sdb1[0]
      52427776 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

STEP 1) Make the partitions faulty.

The partitions cannot be removed if they are not faulty.

[root@srv ~]# mdadm --fail /dev/md125 /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md125
[root@srv ~]# mdadm --fail /dev/md126 /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md126
[root@srv ~]# mdadm --fail /dev/md127 /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md127

Keep on reading!

bonding – write error – device or resource busy – operation not permitted

Recently, there was a little bit of confusion when following the article about activating network bonding without ifenslave – How to enable Linux bonding without ifenslave. At first, there were couple of errors:

livecd ~ # echo balance-alb > /sys/class/net/bond0/bonding/mode
-bash: echo: write error: Device or resource busy
livecd ~ # echo "+enp129s0f0" > /sys/class/net/bond0/bonding/slaves
-bash: echo: write error: Operation not permitted

Or similar error when changing the bonding mode:

livecd ~ # echo 4 > /sys/class/net/bond0/bonding/mode
-bash: echo: write error: Directory not empty
livecd ~ # echo 802.3ad > /sys/class/net/bond0/bonding/mode
-bash: echo: write error: Directory not empty

The server just booted in rescue live cd and there is no active network configuration:

SCREENSHOT 1) Apparently, the /sys/class/net/bond0/bonding/mode and /sys/class/net/bond0/bonding/slaves are in read only state.

No writes means no new configuration could be installed and the bonding cannot be configured (reconfigured).

main menu
device or resource busy – operation not permitted

Bonding mode could be changed only when the bonding device is in DOWN state.

Network interfaces could be added to the boding device only if they were in DOWN state, too.

In addition, changing bonding mode could only happen if there were no network interfaces added to the bonding interface.

Keep on reading!

Install aptly under Ubuntu 18 LTS with nginx serving the packages and the first steps

This article is how to install aptly software, which offers easy Debian repository management.
First, few words for aptly and what tasks are really simple to do:

  • Mirror an existing (remote) repository. Make a local copy of Debian or Ubuntu repostories for all your internal infrastructure.
  • Create your own repositories
  • Create snapshots of repositories and mirrors.
  • Merge repositories
  • Make diff between repositories (in fact snapshots of repositories, but you may make a mirror of an repository and then make a snapshot and then make a diff with some other snapshot to see the changes between the different repositories or the time the snapshots are made).
  • Remove or add individual packages from official mirrored repositories.
  • Use api calls to manage the repositories. HTTP REST API is still in development, but a big part of it works.

For more information you may visit the official documentation page –

We are going to install the aptly and despite it could be used to serve the repository files we will use the Nginx web server for this work. Nginx is a more fast and reliable web server with easy installation of SSL certificates for our repositories.
The aptly is included in official Ubuntu repositories in the component universe, but at present, it is 2 to 3 versions behind the stable one from the aptly site, so we are going to use their repository to install aptly. Still, if you do not want to use
Keep on reading!