Inspired by our article – SSD cache device to a hard disk drive using LVM, which uses an SSD drive as a cache device for a single hard drive, we decided to write a new article, this time using two hard drives in a RAID setup (RAID1 for redundancy in our case) and a single NVMe SSD drive.
The goal:
Cache a RAID1 array consisting of two 8T hard drives with a single 1T NVMe SSD drive. Both reads and writes are cached, i.e. write-back mode is enabled.
Our setup:
- 1 x Samsung 1T NVMe SSD. It will be used as the write-back cache device (you may use writethrough, too, to maintain the redundancy of the whole storage)!
- 2 x 8T hard disk drives grouped in RAID1 for redundancy.
STEP 1) Install lvm2 and enable the lvm2 service
Only this step differs between Linux distributions. We include three of them:
Ubuntu 16+:
sudo apt update && sudo apt upgrade -y
sudo apt install lvm2 -y
sudo systemctl enable lvm2-lvmetad
sudo systemctl start lvm2-lvmetad
CentOS 7:
yum update
yum install -y lvm2
systemctl enable lvm2-lvmetad
systemctl start lvm2-lvmetad
Gentoo:
emerge --sync
emerge -v sys-fs/lvm2
/etc/init.d/lvm start
rc-update add lvm default
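Whichever distribution you use, a quick sanity check that the LVM tools are in place (just a convenience, not part of the original steps) is:

# confirm the LVM userspace tools are installed and responding
lvm version
# confirm the metadata daemon is running (systemd-based distributions)
systemctl status lvm2-lvmetad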
STEP 2) Add the three partitions to LVM2.
Two partitions on the hard drives and one on the NVMe SSD (the cache device). We have set up the partition on the NVMe SSD to occupy 90% of the space (for better SSD endurance and, in many cases, better performance).
The devices are “/dev/sda5”, “/dev/sdb5” and “/dev/nvme0n1p1”:
[root@static ~]# parted /dev/sda
GNU Parted 3.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ATA HGST HUH721008AL (scsi)
Disk /dev/sda: 8002GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system  Name     Flags
 4      1049kB  2097kB  1049kB                        bios_grub
 1      2097kB  17.2GB  17.2GB                        raid
 2      17.2GB  17.7GB  537MB                         raid
 3      17.7GB  71.4GB  53.7GB                        raid
 5      71.4GB  8002GB  7930GB               primary

(parted) q
[root@static ~]# parted /dev/sdb
GNU Parted 3.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ATA HGST HUH721008AL (scsi)
Disk /dev/sdb: 8002GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: pmbr_boot

Number  Start   End     Size    File system  Name     Flags
 4      1049kB  2097kB  1049kB                        bios_grub
 1      2097kB  17.2GB  17.2GB                        raid
 2      17.2GB  17.7GB  537MB                         raid
 3      17.7GB  71.4GB  53.7GB                        raid
 5      71.4GB  8002GB  7930GB               primary

(parted) q
[root@static ~]# parted /dev/nvme0n1
GNU Parted 3.1
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 1024GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  922GB  922GB               primary

(parted) q
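If the partitions do not exist yet on your system, they could be created with parted along these lines (a hedged sketch – the start values are taken from the layout above and the NVMe partition ends at 90% for over-provisioning; adjust everything to your own disks before running it):

# partition 5 on each hard drive, using the space after the existing system partitions
parted -s /dev/sda mkpart primary 71.4GB 100%
parted -s /dev/sdb mkpart primary 71.4GB 100%
# a single partition on the NVMe SSD covering roughly 90% of the device
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart primary 1MiB 90%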
Add partitions to the LVM2 (as physical volumes) and create an LVM Volume Group.
[root@static ~]# pvcreate /dev/sda5 /dev/sdb5 /dev/nvme0n1p1
  Physical volume "/dev/sda5" successfully created.
  Physical volume "/dev/sdb5" successfully created.
  Physical volume "/dev/nvme0n1p1" successfully created.
[root@static ~]# pvdisplay
  "/dev/nvme0n1p1" is a new physical volume of "858.48 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/nvme0n1p1
  VG Name
  PV Size               858.48 GiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               bwwHtj-yzZB-UMNk-2alQ-uuDa-B0V1-30k6GO

  "/dev/sdb5" is a new physical volume of "7.21 TiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdb5
  VG Name
  PV Size               7.21 TiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               9E0L1I-PSRZ-tDRE-FctM-pyuZ-sq12-FHzQcR

  "/dev/sda5" is a new physical volume of "7.21 TiB"
  --- NEW Physical volume ---
  PV Name               /dev/sda5
  VG Name
  PV Size               7.21 TiB
  Allocatable           NO
  PE Size               0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               WrfJjt-tR8P-WHNe-A3xq-51q5-UI3K-ozbypk
It can be done in one line with pvcreate. pvdisplay will show meta information for the physical volumes (the partitions we've just added).
Then create the LVM Volume Group. All three physical volumes must be in the same group.
[root@static ~]# vgcreate VG_storage /dev/sda5 /dev/sdb5 /dev/nvme0n1p1
  Volume group "VG_storage" successfully created
[root@static ~]# vgdisplay
  --- Volume group ---
  VG Name               VG_storage
  System ID
  Format                lvm2
  Metadata Areas        3
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                3
  Act PV                3
  VG Size               15.26 TiB
  PE Size               4.00 MiB
  Total PE              4001163
  Alloc PE / Size       0 / 0
  Free  PE / Size       4001163 / 15.26 TiB
  VG UUID               kuEhlE-SVFu-fS0V-tlGU-YieW-jf7g-zVc590
Successfully created; you may verify it with vgdisplay.
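For a more compact view of the same information (not shown in the original output, just a convenience), the short-form reporting commands can be used:

# one-line summaries of the volume group and its physical volumes
vgs VG_storage
pvs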
STEP 3) Create the mirror device.
First, create the mirrored device using the two slow hard disk drives and their partitions “/dev/sda5” and “/dev/sdb5”. Because we want to use all the available space on our slow disks in one logical storage device, we use “100%FREE”. The name of the logical device is “lv_slow”, hinting that it consists of slow disks.
[root@static ~]# lvcreate --mirrors 1 --type raid1 -l 100%FREE -n lv_slow VG_storage /dev/sda5 /dev/sdb5
  Logical volume "lv_slow" created.
[root@static ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_slow
  LV Name                lv_slow
  VG Name                VG_storage
  LV UUID                QdnHsj-pbYn-3sSv-97G5-ZTop-j5g9-VIFcGv
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:10:33 +0000
  LV Status              available
  # open                 0
  LV Size                7.21 TiB
  Current LE             1890695
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
lvdisplay will show meta information for the successfully created logical volume. Because it is a mirror (RAID1), the usable space is half of the total space of the two slow disks.
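Note that the mirror needs some time to synchronize its two legs after creation; an easy way to follow the progress (a small convenience, not part of the original steps) is the Cpy%Sync column of lvs:

# watch the initial RAID1 synchronization of the slow mirror
lvs -a VG_storage
# or refresh it automatically every few seconds
watch -n 5 lvs -a VG_storage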
STEP 4) Create the cache pool logical volume and then convert the slow logical volume to use the newly created cache pool.
First, create the cache pool logical volume with the name “lv_cache” (to show it’s a fast SSD device). Again, we use 100% of the available space on the physical volume (100% of the partition we’ve created).
[root@static ~]# lvcreate --type cache-pool -l 100%FREE -c 1M --cachemode writeback -n lv_cache VG_storage /dev/nvme0n1p1
  Logical volume "lv_cache" created.
[root@static ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_slow
  LV Name                lv_slow
  VG Name                VG_storage
  LV UUID                QdnHsj-pbYn-3sSv-97G5-ZTop-j5g9-VIFcGv
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:10:33 +0000
  LV Status              available
  # open                 0
  LV Size                7.21 TiB
  Current LE             1890695
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4

  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_cache
  LV Name                lv_cache
  VG Name                VG_storage
  LV UUID                Rx3C7i-sTuY-D6B1-Bmcf-Gcuk-vVfM-bxLrSd
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:22:26 +0000
  LV Pool metadata       lv_cache_cmeta
  LV Pool data           lv_cache_cdata
  LV Status              NOT available
  LV Size                858.39 GiB
  Current LE             219749
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
Verify with “lvdisplay” that the cache pool has been created. We set two important parameters: write-back mode is enabled and the chunk size is 1 MByte – tune these properties for your workload.
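To double-check these properties in a compact form, the reporting fields of lvs can be queried (a sketch; the chunk_size field name is assumed to be available in your lvm2 version):

# show the cache pool together with its chunk size
lvs -a -o name,size,chunk_size VG_storage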
And now do the conversion – the slow device (logical volume lv_slow) will get a cache device (logical volume lv_cache):
[root@static ~]# lvconvert --type cache --cachepool VG_storage/lv_cache VG_storage/lv_slow
Do you want wipe existing metadata of cache pool VG_storage/lv_cache? [y/n]: y
  WARNING: Data redundancy could be lost with writeback caching of raid logical volume!
  Logical volume VG_storage/lv_slow is now cached.
[root@static ~]# lvdisplay
  --- Logical volume ---
  LV Path                /dev/VG_storage/lv_slow
  LV Name                lv_slow
  VG Name                VG_storage
  LV UUID                QdnHsj-pbYn-3sSv-97G5-ZTop-j5g9-VIFcGv
  LV Write Access        read/write
  LV Creation host, time static.proxy1.com, 2019-10-02 00:10:33 +0000
  LV Cache pool name     lv_cache
  LV Cache origin name   lv_slow_corig
  LV Status              available
  # open                 0
  LV Size                7.21 TiB
  Cache used blocks      0.01%
  Cache metadata blocks  15.78%
  Cache dirty blocks     0.00%
  Cache read hits/misses 14 / 29
  Cache wrt hits/misses  0 / 0
  Cache demotions        0
  Cache promotions       2
  Current LE             1890695
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
Note there is only one logical volume device with the name “lv_slow”, but you can still see there is an additional logical device “inside” the lv_slow device – “lv_cache”. The properties (chunk size and write-back mode) we set earlier when creating lv_cache are preserved for the new cached lv_slow device, which is why on conversion the command warns us that write-back mode breaks the data redundancy of the RAID1 (mirror)! Be careful with such setups – if write-back is enabled and there is a problem with the cache device (the SSD), you might lose all your data! Here we are going to use it for a proxy cache server and we can live without the cache if something happens to the single point of failure – the SSD cache device. You could always use write-through (“writethrough” is the LVM property) to have the reads cached and keep the redundancy.
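If you later decide that redundancy matters more than write latency, the cache mode can be switched on the already cached volume without rebuilding anything – a minimal sketch using lvchange (supported by reasonably recent lvm2 releases):

# switch the cached logical volume from write-back to write-through
lvchange --cachemode writethrough VG_storage/lv_slow
# inspect the volume afterwards
lvdisplay VG_storage/lv_slow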
STEP 5) Format and use the volume
Format the volume and do not forget to include it in /etc/fstab to mount it automatically on boot.
[root@static ~]# mkfs.ext4 /dev/VG_storage/lv_slow
mke2fs 1.42.9 (28-Dec-2013)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=256 blocks, Stripe width=256 blocks
242012160 inodes, 1936071680 blocks
96803584 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4085252096
59085 block groups
32768 blocks per group, 32768 fragments per group
4096 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

[root@static ~]# blkid |grep lv_slow
/dev/mapper/VG_storage-lv_slow_corig_rimage_0: UUID="7e1093ff-cdd5-4033-a15c-7af21e504fd9" TYPE="ext4"
/dev/mapper/VG_storage-lv_slow_corig_rimage_1: UUID="7e1093ff-cdd5-4033-a15c-7af21e504fd9" TYPE="ext4"
/dev/mapper/VG_storage-lv_slow: UUID="7e1093ff-cdd5-4033-a15c-7af21e504fd9" TYPE="ext4"
And add it to the /etc/fstab:
UUID=7e1093ff-cdd5-4033-a15c-7af21e504fd9 /mnt/storage ext4 defaults,discard 1 2
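The mount point itself is not created anywhere above, so make sure it exists before mounting (a trivial but easy-to-miss step):

# create the mount point used in the /etc/fstab entry above
mkdir -p /mnt/storage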
Then just execute the mount command with “/mnt/storage” and you are ready to use your RAID1 with an SSD cache device:
[root@static ~]# mount /mnt/storage
[root@static ~]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
devtmpfs                         32G     0   32G   0% /dev
tmpfs                            32G     0   32G   0% /dev/shm
tmpfs                            32G  976K   32G   1% /run
tmpfs                            32G     0   32G   0% /sys/fs/cgroup
/dev/md2                         49G  1.5G   46G   4% /
/dev/md1                        487M  218M  245M  48% /boot
tmpfs                           6.3G     0  6.3G   0% /run/user/0
/dev/mapper/VG_storage-lv_slow  7.2T   93M  6.8T   1% /mnt/storage
Additional LVM information with lvs
[root@static ~]# lvs -a
  LV                       VG         Attr       LSize   Pool       Origin          Data%  Meta%  Move Log Cpy%Sync Convert
  [lv_cache]               VG_storage Cwi---C--- 858.39g                            0.07   15.78           0.00
  [lv_cache_cdata]         VG_storage Cwi-ao---- 858.39g
  [lv_cache_cmeta]         VG_storage ewi-ao----  44.00m
  lv_slow                  VG_storage Cwi-aoC---   7.21t [lv_cache] [lv_slow_corig] 0.07   15.78           0.00
  [lv_slow_corig]          VG_storage rwi-aoC---   7.21t                                                   4.69
  [lv_slow_corig_rimage_0] VG_storage Iwi-aor---   7.21t
  [lv_slow_corig_rimage_1] VG_storage Iwi-aor---   7.21t
  [lv_slow_corig_rmeta_0]  VG_storage ewi-aor---   4.00m
  [lv_slow_corig_rmeta_1]  VG_storage ewi-aor---   4.00m
  [lvol0_pmspare]          VG_storage ewi-------  44.00m
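For ongoing monitoring of how effective the cache is, lvs can report the cache counters directly (a sketch; these cache_* reporting field names are assumed to be available in your lvm2 version):

# per-volume cache hit/miss and dirty block statistics
lvs -o name,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses,cache_dirty_blocks VG_storage/lv_slow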
Excellent article, thank you. Question: why would partitioning the NVMe SSD to occupy 90% of the space give “better SSD endurance and in many cases performance”?
We partition to use only 90% because of “write amplification”, which may impact the write speed – you may read more on the topic here: https://en.wikipedia.org/wiki/Write_amplification#Over-provisioning. Many cheap SSDs do not include much extra over-provisioned capacity! In general, it is better to use at most 90% of any SSD.
When the SSD cache drive dies, then what?
You’ve specified writeback mode, which is going to result in a total loss of all your data on the volume if the SSD dies – which defeats the purpose of having RAID redundancy.
See: https://forum.proxmox.com/threads/lvm-failure-caused-by-cache-ssd-failure.73314/
It is stated in the beginning – “you may use writethrough, too, to maintain the redundancy of the whole storage”. If the storage is critical for you, you must use writethrough, not writeback.
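A practical addition to this reply: if the SSD cache device starts to fail or simply needs replacing, the cache can be detached while keeping the slow mirror intact. A minimal sketch, assuming the cache device is still readable so that dirty write-back blocks can be flushed first:

# flush dirty blocks and remove the cache pool, leaving lv_slow as a plain RAID1 LV
lvconvert --uncache VG_storage/lv_slow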
Missed that while focusing on the instructions. You may want to add that note at the step where this command is issued. One could also try setting up a nested RAID1 of 2 SSD drives to use as the SSD cache if they want to use writeback. I haven’t tested it; perhaps you have.