Inspired by our article – SSD cache device to a hard disk drive using LVM, which uses an SSD drive as a cache device for a single hard drive, we decided to write a new article, this time using two hard drives in a RAID setup (RAID1 for redundancy in our case) and a single NVMe SSD drive.
The goal:
Cache a RAID1 array consisting of two 8T hard drives with a single 1T NVMe SSD drive. Both reads and writes are cached, i.e. write-back mode is enabled.
Our setup:
- 1 x Samsung 1T NVMe SSD. It will be used as the write-back cache device (you may use writethrough, too, to maintain the redundancy of the whole storage)!
- 2 x 8T hard disk drives grouped in RAID1 for redundancy.
STEP 1) Install lvm2 and enable the lvm2 service
Only this step differs between Linux distributions. We include three of them:
Ubuntu 16+:
sudo apt update && sudo apt upgrade -y
sudo apt install lvm2 -y
sudo systemctl enable lvm2-lvmetad
sudo systemctl start lvm2-lvmetad
CentOS 7:
yum update
yum install -y lvm2
systemctl enable lvm2-lvmetad
systemctl start lvm2-lvmetad
Gentoo:
emerge --sync
emerge -v sys-fs/lvm2
/etc/init.d/lvm start
rc-update add lvm default
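Whichever distribution you use, a quick sanity check that the LVM tools are in place (just a convenience, not part of the original steps) is:

# confirm the LVM userspace tools are installed and responding
lvm version
# confirm the metadata daemon is running (systemd-based distributions)
systemctl status lvm2-lvmetad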
STEP 2) Add the three partitions to LVM2.
Two partitions on the hard drives and one on the NVMe SSD (the cache device). We have set up the partition on the NVMe SSD to occupy 90% of the space (for better SSD endurance and, in many cases, better performance).
The devices are “/dev/sda5”, “/dev/sdb5” and “/dev/nvme0n1p1”:
[root@static ~]# parted /dev/sda
GNU Parted 3.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ATA HGST HUH721008AL (scsi)
Disk /dev/sda: 8002GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system  Name     Flags
 4      1049kB  2097kB  1049kB                        bios_grub
 1      2097kB  17.2GB  17.2GB                        raid
 2      17.2GB  17.7GB  537MB                         raid
 3      17.7GB  71.4GB  53.7GB                        raid
 5      71.4GB  8002GB  7930GB               primary

(parted) q
[root@static ~]# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ATA HGST HUH721008AL (scsi)
Disk /dev/sdb: 8002GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system  Name     Flags
 4      1049kB  2097kB  1049kB                        bios_grub
 1      2097kB  17.2GB  17.2GB                        raid
 2      17.2GB  17.7GB  537MB                         raid
 3      17.7GB  71.4GB  53.7GB                        raid
 5      71.4GB  8002GB  7930GB               primary

(parted) q
[root@static ~]# parted /dev/nvme0n1
GNU Parted 3.1
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  922GB  922GB               primary

(parted) q
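If the partitions do not exist yet on your system, they could be created with parted along these lines (a hedged sketch – the start values are taken from the layout above and the NVMe partition ends at 90% for over-provisioning; adjust everything to your own disks before running it):

# partition 5 on each hard drive, using the space after the existing system partitions
parted -s /dev/sda mkpart primary 71.4GB 100%
parted -s /dev/sdb mkpart primary 71.4GB 100%
# a single partition on the NVMe SSD covering roughly 90% of the device
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart primary 1MiB 90%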
Add partitions to the LVM2 (as physical volumes) and create an LVM Volume Group.
[root@static ~]# pvcreate /dev/sda5 /dev/sdb5 /dev/nvme0n1p1
  Physical volume "/dev/sda5" successfully created.
  Physical volume "/dev/sdb5" successfully created.
  Physical volume "/dev/nvme0n1p1" successfully created.
[root@static ~]# pvdisplay
  "/dev/nvme0n1p1" is a new physical volume of "858.48 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/nvme0n1p1
  VG Name
  PV Size               858.48 GiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               bwwHtj-yzZB-UMNk-2alQ-uuDa-B0V1-30k6GO

  "/dev/sdb5" is a new physical volume of "7.21 TiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdb5
  VG Name
  PV Size               7.21 TiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               9E0L1I-PSRZ-tDRE-FctM-pyuZ-sq12-FHzQcR

  "/dev/sda5" is a new physical volume of "7.21 TiB"
  --- NEW Physical volume ---
  PV Name               /dev/sda5
  VG Name
  PV Size               7.21 TiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               WrfJjt-tR8P-WHNe-A3xq-51q5-UI3K-ozbypk
It can be done in one line with pvcreate. pvdisplay will show meta information for the physical volumes (the partitions we've just added).
Then create the LVM Volume Group. All three physical volumes must be in the same group.
[root@static ~]# vgcreate VG_storage /dev/sda5 /dev/sdb5 /dev/nvme0n1p1
  Volume group "VG_storage" successfully created
[root@static ~]# vgdisplay
  --- Volume group ---
  VG Name               VG_storage
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               15.26 TiB
  PE Size               4.00 MiB
  Total PE              4001163
  Alloc PE / Size       0 / 0
  Free  PE / Size       4001163 / 15.26 TiB
  VG UUID               kuEhlE-SVFu-fS0V-tlGU-YieW-jf7g-zVc590
Successfully created; you may verify it with vgdisplay.
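For a more compact view of the same information (not shown in the original output, just a convenience), the short-form reporting commands can be used:

# one-line summaries of the volume group and its physical volumes
vgs VG_storage
pvs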
STEP 3) Create the mirror device.
First, create the mirrored device using the two slow hard disk drives and their partitions “/dev/sda5” and “/dev/sdb5”. Because we want to use all the available space on our slow disks in one logical storage device, we use “100%FREE”. The name of the logical device is “lv_slow”, hinting that it consists of slow disks.
[root@static ~]# lvcreate --mirrors 1 --type raid1 -l 100%FREE -n lv_slow VG_storage /dev/sda5 /dev/sdb5
  Logical volume "lv_slow" created.
[root@static ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_slow
  LV Name                lv_slow
  VG Name                VG_storage
  LV UUID                QdnHsj-pbYn-3sSv-97G5-ZTop-j5g9-VIFcGv
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:10:33 +0000
  LV Status              available
  # open                 0
  LV Size                7.21 TiB
  Current LE             1890695
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
lvdisplay will show meta information for the successfully created logical volume. Because it is a mirror (RAID1), the usable space is half of the total space of the two slow disks.
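Note that the mirror needs some time to synchronize its two legs after creation; an easy way to follow the progress (a small convenience, not part of the original steps) is the Cpy%Sync column of lvs:

# watch the initial RAID1 synchronization of the slow mirror
lvs -a VG_storage
# or refresh it automatically every few seconds
watch -n 5 lvs -a VG_storage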
STEP 4) Create the cache pool logical volume and then convert the slow logical volume to use the newly created cache pool.
First, create the cache pool logical volume with the name “lv_cache” (to show it’s a fast SSD device). Again, we use 100% of the available space on the physical volume (100% of the partition we’ve created).
[root@static ~]# lvcreate --type cache-pool -l 100%FREE -c 1M --cachemode writeback -n lv_cache VG_storage /dev/nvme0n1p1
  Logical volume "lv_cache" created.
[root@static ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_slow
  LV Name                lv_slow
  VG Name                VG_storage
  LV UUID                QdnHsj-pbYn-3sSv-97G5-ZTop-j5g9-VIFcGv
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:10:33 +0000
  LV Status              available
  # open                 0
  LV Size                7.21 TiB
  Current LE             1890695
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4

  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_cache
  LV Name                lv_cache
  VG Name                VG_storage
  LV UUID                Rx3C7i-sTuY-D6B1-Bmcf-Gcuk-vVfM-bxLrSd
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:22:26 +0000
  LV Pool metadata       lv_cache_cmeta
  LV Pool data           lv_cache_cdata
  LV Status              NOT available
  LV Size                858.39 GiB
  Current LE             219749
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
Verify with “lvdisplay” that the cache pool has been created. We set two important parameters: write-back mode is enabled and the chunk size is 1 MByte – tune these properties for your workload.
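To double-check these properties in a compact form, the reporting fields of lvs can be queried (a sketch; the chunk_size field name is assumed to be available in your lvm2 version):

# show the cache pool together with its chunk size
lvs -a -o name,size,chunk_size VG_storage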
And now do the conversion – the slow device (logical volume lv_slow) will get a cache device (logical volume lv_cache):
[root@static ~]# lvconvert --type cache --cachepool VG_storage/lv_cache VG_storage/lv_slow
Do you want wipe existing metadata of cache pool VG_storage/lv_cache? [y/n]: y
  WARNING: Data redundancy could be lost with writeback caching of raid logical volume!
  Logical volume VG_storage/lv_slow is now cached.
[root@static ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_slow
  LV Name                lv_slow
  VG Name                VG_storage
  LV UUID                QdnHsj-pbYn-3sSv-97G5-ZTop-j5g9-VIFcGv
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:10:33 +0000
  LV Cache pool name     lv_cache
  LV Cache origin name   lv_slow_corig
  LV Status              available
  # open                 0
  LV Size                7.21 TiB
  Cache used blocks      0.01%
  Cache metadata blocks  15.78%
  Cache dirty blocks     0.00%
  Cache read hits/misses 14 / 29
  Cache wrt hits/misses  0 / 0
  Cache demotions        0
  Cache promotions       2
  Current LE             1890695
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
Note there is only one logical volume device with the name “lv_slow”, but you can still see there is an additional logical device “inside” the lv_slow device – “lv_cache”. The properties (chunk size and write-back mode) we set earlier when creating lv_cache are preserved for the new cached lv_slow device, which is why on conversion the command warns us that write-back mode breaks the data redundancy of the RAID1 (mirror)! Be careful with such setups – if write-back is enabled and there is a problem with the cache device (the SSD), you might lose all your data! Here we are going to use it for a proxy cache server and we can live without the cache if something happens to the single point of failure – the SSD cache device. You could always use write-through (“writethrough” is the LVM property) to have the reads cached and keep the redundancy.
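If you later decide that redundancy matters more than write latency, the cache mode can be switched on the already cached volume without rebuilding anything – a minimal sketch using lvchange (supported by reasonably recent lvm2 releases):

# switch the cached logical volume from write-back to write-through
lvchange --cachemode writethrough VG_storage/lv_slow
# inspect the volume afterwards
lvdisplay VG_storage/lv_slow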
STEP 5) Format and use the volume
Format the volume and do not forget to include it in /etc/fstab to mount it automatically on boot.
[root@static ~]# mkfs.ext4 /dev/VG_storage/lv_slow
mke2fs 1.42.9 (28-Dec-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=256 blocks, Stripe width=256 blocks
242012160 inodes, 1936071680 blocks
96803584 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4085252096
59085 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

[root@static ~]# blkid |grep lv_slow
/dev/mapper/VG_storage-lv_slow_corig_rimage_0: UUID="7e1093ff-cdd5-4033-a15c-7af21e504fd9" TYPE="ext4"
/dev/mapper/VG_storage-lv_slow_corig_rimage_1: UUID="7e1093ff-cdd5-4033-a15c-7af21e504fd9" TYPE="ext4"
/dev/mapper/VG_storage-lv_slow: UUID="7e1093ff-cdd5-4033-a15c-7af21e504fd9" TYPE="ext4"
And add it to the /etc/fstab:
UUID=7e1093ff-cdd5-4033-a15c-7af21e504fd9 /mnt/storage ext4 defaults,discard 1 2
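The mount point itself is not created anywhere above, so make sure it exists before mounting (a trivial but easy-to-miss step):

# create the mount point used in the /etc/fstab entry above
mkdir -p /mnt/storage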
Then just execute the mount command with “/mnt/storage” and you are ready to use your RAID1 with an SSD cache device:
[root@static ~]# mount /mnt/storage
[root@static ~]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
devtmpfs                         32G     0   32G   0% /dev
tmpfs                            32G     0   32G   0% /dev/shm
tmpfs                            32G  976K   32G   1% /run
tmpfs                            32G     0   32G   0% /sys/fs/cgroup
/dev/md2                         49G  1.5G   46G   4% /
/dev/md1                        487M  218M  245M  48% /boot
tmpfs                           6.3G     0  6.3G   0% /run/user/0
/dev/mapper/VG_storage-lv_slow  7.2T   93M  6.8T   1% /mnt/storage
Additional LVM information with lvs
[root@static ~]# lvs -a
  LV                       VG         Attr       LSize   Pool       Origin          Data%  Meta%  Move Log Cpy%Sync Convert
  [lv_cache]               VG_storage Cwi---C--- 858.39g                            0.07   15.78           0.00
  [lv_cache_cdata]         VG_storage Cwi-ao---- 858.39g
  [lv_cache_cmeta]         VG_storage ewi-ao----  44.00m
  lv_slow                  VG_storage Cwi-aoC---   7.21t [lv_cache] [lv_slow_corig] 0.07   15.78           0.00
  [lv_slow_corig]          VG_storage rwi-aoC---   7.21t                                                   4.69
  [lv_slow_corig_rimage_0] VG_storage Iwi-aor---   7.21t
  [lv_slow_corig_rimage_1] VG_storage Iwi-aor---   7.21t
  [lv_slow_corig_rmeta_0]  VG_storage ewi-aor---   4.00m
  [lv_slow_corig_rmeta_1]  VG_storage ewi-aor---   4.00m
  [lvol0_pmspare]          VG_storage ewi-------  44.00m
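For ongoing monitoring of how effective the cache is, lvs can report the cache counters directly (a sketch; these cache_* reporting field names are assumed to be available in your lvm2 version):

# per-volume cache hit/miss and dirty block statistics
lvs -o name,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses,cache_dirty_blocks VG_storage/lv_slow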
Excellent article, thank you. Question: why would partitioning the NVMe SSD to occupy 90% of the space give “better SSD endurance and in many cases performance”?
We partition to use only 90% because of “write amplification”, which may impact the write speed – you may read more on the topic here: https://en.wikipedia.org/wiki/Write_amplification#Over-provisioning. Many cheap SSDs do not include much extra over-provisioned capacity! In general, it is better to use at most 90% of any SSD.
When the SSD cache drive dies, then what?
You’ve specified writeback mode, which is going to result in a total loss of all your data on the volume if the SSD dies – which defeats the purpose of having RAID redundancy.
See: https://forum.proxmox.com/threads/lvm-failure-caused-by-cache-ssd-failure.73314/
It is stated in the beginning – “you may use writethrough, too, to maintain the redundancy of the whole storage”. If the storage is critical for you, you must use writethrough, not writeback.
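A practical addition to this reply: if the SSD cache device starts to fail or simply needs replacing, the cache can be detached while keeping the slow mirror intact. A minimal sketch, assuming the cache device is still readable so that dirty write-back blocks can be flushed first:

# flush dirty blocks and remove the cache pool, leaving lv_slow as a plain RAID1 LV
lvconvert --uncache VG_storage/lv_slow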
Missed that while focusing on the instructions. You may want to add that note at the step where this command is issued. One could also try setting up a nested RAID1 of 2 SSD drives to use as the SSD cache if they want to use writeback. I haven’t tested it; perhaps you have.