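As SSD prices keep falling, many enthusiasts and businesses are considering building SSD-based storage pools with Ceph in pursuit of higher performance. To get good performance out of Ceph, however, choosing the right SSD is critical. This post looks at how to choose an SSD that suits Ceph.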
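Why does the choice of SSD matter so much?
Consumers can now buy high-capacity NVMe SSDs at relatively low prices, which tempts many enthusiasts and cost-conscious business owners to invest in all-SSD Ceph deployments. However, an SSD chosen on price alone may not deliver the expected performance.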
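Ceph's write characteristics
To see why, we need to look at how Ceph handles writes. All writes in Ceph are transactional: a write is not considered complete until it has been written to all of the OSDs that store it and synced to disk with fsync(). In effect, Ceph cannot rely on the drive's volatile write cache; it expects the data to actually reach the NAND.
This write pattern is a serious challenge for some consumer SSDs. A drive advertised at 80,000 IOPS, for example, may manage only 500-1000 IOPS in practice.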
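The write-then-sync pattern described above can be imitated in a few lines of Python. This is only a rough sketch, not how Ceph itself is implemented: the helper name `synced_write_iops`, the 4 KiB block size, and the count of 200 writes are all choices made for illustration. It also appends to a regular file through the filesystem instead of writing randomly to a raw device, so the number it prints depends on where the temporary directory lives (on a RAM-backed filesystem fsync is nearly free).

```python
import os
import tempfile
import time


def synced_write_iops(path, block_size=4096, count=200):
    """Write `count` blocks, calling fsync after each, and return writes per second.

    Hypothetical helper for illustration; it is not part of Ceph or fio.
    """
    block = os.urandom(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            # Do not issue the next write until this one is reported durable.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed


with tempfile.TemporaryDirectory() as tmp:
    iops = synced_write_iops(os.path.join(tmp, "fsync-test.bin"))
    print(f"{iops:.0f} synced 4 KiB writes per second")
```

Because every iteration waits for fsync to return, the loop can never run faster than the drive can flush, which is the property that hurts drives relying on a volatile cache.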
In the test result below, a Samsung 980 SSD reaches only around 574 IOPS once fsync is enabled.
fio -ioengine=libaio -name=test -filename=/dev/nvme0n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=2046KiB/s][w=511 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3054244: Wed Oct 18 08:03:48 2023
write: IOPS=574, BW=2297KiB/s (2352kB/s)(33.7MiB/15001msec); 0 zone resets
slat (nsec): min=1804, max=27462, avg=4744.29, stdev=1192.14
clat (nsec): min=1082, max=72217, avg=16433.32, stdev=2111.01
lat (nsec): min=12904, max=76816, avg=21177.61, stdev=2867.60
clat percentiles (nsec):
| 1.00th=[13248], 5.00th=[14912], 10.00th=[15040], 20.00th=[15296],
| 30.00th=[15424], 40.00th=[15680], 50.00th=[15808], 60.00th=[16064],
| 70.00th=[16512], 80.00th=[17792], 90.00th=[18560], 95.00th=[19584],
| 99.00th=[22912], 99.50th=[25984], 99.90th=[31616], 99.95th=[43264],
| 99.99th=[72192]
bw ( KiB/s): min= 2040, max= 2440, per=100.00%, avg=2305.38, stdev=85.23, samples=29
iops : min= 510, max= 610, avg=576.34, stdev=21.31, samples=29
lat (usec) : 2=0.01%, 20=96.45%, 50=3.51%, 100=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=1108, max=6437, avg=1735.07, stdev=240.93
sync percentiles (usec):
| 1.00th=[ 1188], 5.00th=[ 1565], 10.00th=[ 1582], 20.00th=[ 1631],
| 30.00th=[ 1663], 40.00th=[ 1680], 50.00th=[ 1696], 60.00th=[ 1713],
| 70.00th=[ 1778], 80.00th=[ 1827], 90.00th=[ 1893], 95.00th=[ 2024],
| 99.00th=[ 2638], 99.50th=[ 2966], 99.90th=[ 4047], 99.95th=[ 5473],
| 99.99th=[ 6456]
cpu : usr=0.29%, sys=0.89%, ctx=25839, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,8615,0,8615 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=2297KiB/s (2352kB/s), 2297KiB/s-2297KiB/s (2352kB/s-2352kB/s), io=33.7MiB (35.3MB), run=15001-15001msec
Disk stats (read/write):
nvme0n1: ios=46/17115, merge=0/0, ticks=10/14751, in_queue=29383, util=99.46%
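Power-loss protection: a feature you cannot ignore
Power-loss protection ensures that data still sitting in the drive's cache can be written to NAND if power is cut unexpectedly. With this feature the SSD's cache behaves more like non-volatile storage, so the controller can safely ignore fsync(), confident that the data will be written correctly even after a sudden power loss.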
Compare the result for an SSD with power-loss protection (Kioxia CD6) against the Samsung 980 above: it averages 36.6k IOPS.
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=144MiB/s][w=36.9k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3720075: Wed Oct 18 17:04:25 2023
write: IOPS=36.6k, BW=143MiB/s (150MB/s)(2144MiB/15001msec); 0 zone resets
slat (usec): min=3, max=1828, avg= 5.84, stdev= 3.94
clat (nsec): min=880, max=1772.2k, avg=17675.95, stdev=5502.60
lat (usec): min=14, max=1867, avg=23.52, stdev= 6.86
clat percentiles (usec):
| 1.00th=[ 16], 5.00th=[ 17], 10.00th=[ 18], 20.00th=[ 18],
| 30.00th=[ 18], 40.00th=[ 18], 50.00th=[ 18], 60.00th=[ 18],
| 70.00th=[ 18], 80.00th=[ 19], 90.00th=[ 19], 95.00th=[ 20],
| 99.00th=[ 24], 99.50th=[ 28], 99.90th=[ 50], 99.95th=[ 69],
| 99.99th=[ 151]
bw ( KiB/s): min=134970, max=167288, per=100.00%, avg=146491.97, stdev=6017.07, samples=29
iops : min=33742, max=41822, avg=36622.90, stdev=1504.33, samples=29
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=96.92%, 50=2.95%
lat (usec) : 100=0.08%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=2, max=1790, avg=20.43, stdev= 5.76
sync percentiles (usec):
| 1.00th=[ 17], 5.00th=[ 20], 10.00th=[ 20], 20.00th=[ 20],
| 30.00th=[ 20], 40.00th=[ 20], 50.00th=[ 21], 60.00th=[ 21],
| 70.00th=[ 21], 80.00th=[ 22], 90.00th=[ 22], 95.00th=[ 22],
| 99.00th=[ 28], 99.50th=[ 32], 99.90th=[ 55], 99.95th=[ 74],
| 99.99th=[ 159]
cpu : usr=19.77%, sys=40.01%, ctx=549142, majf=0, minf=24
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,548737,0,548736 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=143MiB/s (150MB/s), 143MiB/s-143MiB/s (150MB/s-150MB/s), io=2144MiB (2248MB), run=15001-15001msec
Disk stats (read/write):
nvme2n1: ios=0/542994, merge=0/0, ticks=0/7889, in_queue=7890, util=99.36%
Even the SATA-based Intel DC S3500 SSD performs far better than the Samsung 980, at roughly 8.7k IOPS:
fio -ioengine=libaio -name=test -filename=/dev/sda -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=34.0MiB/s][w=8712 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3172: Thu Oct 19 04:58:04 2023
write: IOPS=8713, BW=34.0MiB/s (35.7MB/s)(511MiB/15001msec); 0 zone resets
slat (nsec): min=5381, max=60223, avg=5695.98, stdev=404.63
clat (usec): min=35, max=289, avg=49.89, stdev=15.41
lat (usec): min=44, max=295, avg=55.59, stdev=15.42
clat percentiles (usec):
| 1.00th=[ 41], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42],
| 30.00th=[ 43], 40.00th=[ 43], 50.00th=[ 44], 60.00th=[ 47],
| 70.00th=[ 49], 80.00th=[ 53], 90.00th=[ 69], 95.00th=[ 82],
| 99.00th=[ 113], 99.50th=[ 127], 99.90th=[ 182], 99.95th=[ 206],
| 99.99th=[ 265]
bw ( KiB/s): min=34672, max=34992, per=100.00%, avg=34877.79, stdev=57.17, samples=29
iops : min= 8668, max= 8748, avg=8719.45, stdev=14.29, samples=29
lat (usec) : 50=74.10%, 100=24.24%, 250=1.64%, 500=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=81, max=432, avg=108.51, stdev=32.18
sync percentiles (usec):
| 1.00th=[ 87], 5.00th=[ 88], 10.00th=[ 89], 20.00th=[ 90],
| 30.00th=[ 91], 40.00th=[ 92], 50.00th=[ 96], 60.00th=[ 100],
| 70.00th=[ 110], 80.00th=[ 124], 90.00th=[ 149], 95.00th=[ 163],
| 99.00th=[ 265], 99.50th=[ 297], 99.90th=[ 355], 99.95th=[ 367],
| 99.99th=[ 392]
cpu : usr=1.86%, sys=14.13%, ctx=392131, majf=0, minf=12
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,130710,0,130709 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=34.0MiB/s (35.7MB/s), 34.0MiB/s-34.0MiB/s (35.7MB/s-35.7MB/s), io=511MiB (535MB), run=15001-15001msec
Disk stats (read/write):
sda: ios=0/258623, merge=0/0, ticks=0/13669, in_queue=20496, util=99.21%
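To put the three results side by side, the small Python sketch below converts each reported IOPS figure into the average time one write takes at queue depth 1 and shows it next to the average sync latency fio reported. The numbers are copied from the three runs above; the function names are made up for this example, and the comparison is only indicative because the three drives were tested on different machines.

```python
# (drive, reported write IOPS, reported average sync latency in microseconds)
RESULTS = [
    ("Samsung 980", 574, 1735.07),
    ("Kioxia CD6", 36_600, 20.43),
    ("Intel DC S3500", 8713, 108.51),
]


def time_per_write_us(iops):
    """Average time per write in microseconds when only one write is in flight."""
    return 1_000_000 / iops


def speedup(iops, baseline_iops):
    """How many times faster a drive is than the baseline drive."""
    return iops / baseline_iops


baseline = RESULTS[0][1]  # use the Samsung 980 as the baseline
for name, iops, sync_avg_us in RESULTS:
    print(
        f"{name}: {time_per_write_us(iops):.1f} us per write, "
        f"avg sync {sync_avg_us:.1f} us, "
        f"{speedup(iops, baseline):.1f}x the Samsung 980"
    )
```

With these figures, one synced write takes roughly 1742 µs on the Samsung 980, almost exactly its average sync latency, against about 27 µs on the Kioxia CD6 and about 115 µs on the Intel DC S3500.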
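How do you test an SSD's performance?
A good way to confirm whether an SSD is really suitable for Ceph is to benchmark it. The following command is recommended. Note that it writes directly to the raw device given in -filename and will overwrite data on it: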
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
The -fsync=1 option makes fio call fsync after every write so that each write is synced to the SSD, which mirrors how Ceph actually operates.
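When testing several drives, it can help to pull the two most telling numbers, write IOPS and average sync latency, out of fio's text output automatically. The Python sketch below does this with regular expressions written against the output format shown in this post (fio-3.33). The function names are hypothetical, and the handling of the `k`/`M` suffixes and the `nsec`/`usec`/`msec` units is an assumption based on that output, so other fio versions may need adjustments.

```python
import re

# Two lines copied from the Kioxia CD6 run shown earlier in this post.
SAMPLE = """\
write: IOPS=36.6k, BW=143MiB/s (150MB/s)(2144MiB/15001msec); 0 zone resets
sync (usec): min=2, max=1790, avg=20.43, stdev= 5.76
"""

_SUFFIX = {"": 1, "k": 1_000, "M": 1_000_000}
_TO_USEC = {"nsec": 0.001, "usec": 1.0, "msec": 1000.0}


def parse_write_iops(text):
    """Return the write IOPS from fio's summary line, expanding a k/M suffix."""
    m = re.search(r"write: IOPS=([\d.]+)([kM]?)", text)
    if m is None:
        raise ValueError("no write summary line found")
    return float(m.group(1)) * _SUFFIX[m.group(2)]


def parse_sync_avg_usec(text):
    """Return the average sync latency in microseconds."""
    m = re.search(r"sync \((nsec|usec|msec)\):.*?avg=\s*([\d.]+)", text)
    if m is None:
        raise ValueError("no sync latency line found")
    return float(m.group(2)) * _TO_USEC[m.group(1)]


print(f"write IOPS: {parse_write_iops(SAMPLE):.0f}")
print(f"avg sync latency: {parse_sync_avg_usec(SAMPLE):.2f} us")
```

In practice the `SAMPLE` string would be replaced by the captured output of the fio command above, for example after redirecting it to a file.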
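Conclusion: the best strategy for choosing an SSD
Based on the discussion above, two basic principles apply when choosing an SSD for Ceph:
- Enterprise NVMe > enterprise SATA/SAS >>>>>> consumer NVMe/SATA/SAS.
- The SSD should have power-loss protection.
Choosing according to these principles gives your Ceph environment the best chance of reaching its full performance.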
Appendix
Test environment
Samsung 980 1TB
- CPU: AMD Epyc 7413
- RAM: 8 x 32GB DDR4 3200 RDIMM
- kernel: 6.1.0-9-amd64
Kioxia CD6 3.84TB
- CPU: 2 x Ampere Altra Q80-30
- RAM: 4 x 32GB DDR4 3200 RDIMM
- kernel: 6.1.0-12-arm64
Intel DC S3500 1.6TB
- CPU: AMD Epyc 7302P
- RAM: 4 x 32GB DDR4 2933 RDIMM
- kernel: 6.2.16-15-pve
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.