How to Choose the Right SSD for Ceph

As SSD prices continue to fall, many tech enthusiasts and businesses are considering using Ceph to build SSD-based storage pools in pursuit of higher performance. But getting excellent performance out of Ceph depends heavily on choosing the right SSD. In this post, we will look at how to pick an SSD that suits Ceph.

Why Does SSD Choice Matter So Much?

These days consumers can buy large-capacity NVMe SSDs at fairly low prices, which tempts many enthusiasts and cost-conscious business owners to invest in all-SSD Ceph deployments. However, picking SSDs on price alone may not deliver the performance you expect.

How Ceph Handles Writes

To talk about Ceph, we must first understand how it handles writes. All writes in Ceph are transactional: every write waits until the data has actually been committed to every OSD and flushed to disk with fsync(). In other words, Ceph does not rely on the drive's volatile cache; it expects data to be written straight to NAND.
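
Before reaching for fio, a quick-and-dirty way to get a feel for what this per-write flush costs is dd with oflag=dsync, which opens the target with O_DSYNC so each block must reach stable media before the next is issued (the file path below is just a placeholder; put it on the drive you want to test):

# Write 1000 x 4k blocks, forcing each one to stable media (O_DSYNC),
# roughly mimicking Ceph's flush-per-write behaviour.
# /mnt/ssd/ddtest is an example path on the drive under test.
dd if=/dev/zero of=/mnt/ssd/ddtest bs=4k count=1000 oflag=dsync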

This write pattern is a real challenge for some consumer SSDs. A drive advertised at 80,000 IOPS, for example, may only manage 500-1000 IOPS in practice.

In the test results below you can see that with fsync enabled, a Samsung 980 SSD delivers only about 600 IOPS:

fio -ioengine=libaio -name=test -filename=/dev/nvme0n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=2046KiB/s][w=511 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3054244: Wed Oct 18 08:03:48 2023
  write: IOPS=574, BW=2297KiB/s (2352kB/s)(33.7MiB/15001msec); 0 zone resets
    slat (nsec): min=1804, max=27462, avg=4744.29, stdev=1192.14
    clat (nsec): min=1082, max=72217, avg=16433.32, stdev=2111.01
     lat (nsec): min=12904, max=76816, avg=21177.61, stdev=2867.60
    clat percentiles (nsec):
     |  1.00th=[13248],  5.00th=[14912], 10.00th=[15040], 20.00th=[15296],
     | 30.00th=[15424], 40.00th=[15680], 50.00th=[15808], 60.00th=[16064],
     | 70.00th=[16512], 80.00th=[17792], 90.00th=[18560], 95.00th=[19584],
     | 99.00th=[22912], 99.50th=[25984], 99.90th=[31616], 99.95th=[43264],
     | 99.99th=[72192]
   bw (  KiB/s): min= 2040, max= 2440, per=100.00%, avg=2305.38, stdev=85.23, samples=29
   iops        : min=  510, max=  610, avg=576.34, stdev=21.31, samples=29
  lat (usec)   : 2=0.01%, 20=96.45%, 50=3.51%, 100=0.03%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1108, max=6437, avg=1735.07, stdev=240.93
    sync percentiles (usec):
     |  1.00th=[ 1188],  5.00th=[ 1565], 10.00th=[ 1582], 20.00th=[ 1631],
     | 30.00th=[ 1663], 40.00th=[ 1680], 50.00th=[ 1696], 60.00th=[ 1713],
     | 70.00th=[ 1778], 80.00th=[ 1827], 90.00th=[ 1893], 95.00th=[ 2024],
     | 99.00th=[ 2638], 99.50th=[ 2966], 99.90th=[ 4047], 99.95th=[ 5473],
     | 99.99th=[ 6456]
  cpu          : usr=0.29%, sys=0.89%, ctx=25839, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8615,0,8615 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2297KiB/s (2352kB/s), 2297KiB/s-2297KiB/s (2352kB/s-2352kB/s), io=33.7MiB (35.3MB), run=15001-15001msec

Disk stats (read/write):
  nvme0n1: ios=46/17115, merge=0/0, ticks=10/14751, in_queue=29383, util=99.46%

Power-Loss Protection: A Feature You Cannot Ignore

Power-loss protection (PLP) ensures that data still sitting in the drive's cache when power is suddenly cut can still be written to NAND. With PLP, the SSD's cache effectively behaves like non-volatile storage, so the controller can safely ignore fsync(), confident that the data will be persisted even after an unexpected power loss.
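
If you are unsure whether a drive exposes a volatile write cache at all, you can query it directly. Below is a minimal sketch using standard tools (the device paths are examples); note that PLP itself is usually documented only in the vendor's datasheet, so these commands only show how the drive presents its cache:

# NVMe: the VWC (Volatile Write Cache) field in the controller identify data.
nvme id-ctrl /dev/nvme0 | grep -i vwc

# SATA: show whether write caching is currently enabled on the drive.
hdparm -W /dev/sda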

Comparing an SSD with power-loss protection (Kioxia CD6) against the Samsung 980 above, you can see it averages about 36.6k IOPS:

fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=144MiB/s][w=36.9k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3720075: Wed Oct 18 17:04:25 2023
  write: IOPS=36.6k, BW=143MiB/s (150MB/s)(2144MiB/15001msec); 0 zone resets
    slat (usec): min=3, max=1828, avg= 5.84, stdev= 3.94
    clat (nsec): min=880, max=1772.2k, avg=17675.95, stdev=5502.60
     lat (usec): min=14, max=1867, avg=23.52, stdev= 6.86
    clat percentiles (usec):
     |  1.00th=[   16],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   18], 40.00th=[   18], 50.00th=[   18], 60.00th=[   18],
     | 70.00th=[   18], 80.00th=[   19], 90.00th=[   19], 95.00th=[   20],
     | 99.00th=[   24], 99.50th=[   28], 99.90th=[   50], 99.95th=[   69],
     | 99.99th=[  151]
   bw (  KiB/s): min=134970, max=167288, per=100.00%, avg=146491.97, stdev=6017.07, samples=29
   iops        : min=33742, max=41822, avg=36622.90, stdev=1504.33, samples=29
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.02%, 4=0.01%, 10=0.01%, 20=96.92%, 50=2.95%
  lat (usec)   : 100=0.08%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=2, max=1790, avg=20.43, stdev= 5.76
    sync percentiles (usec):
     |  1.00th=[   17],  5.00th=[   20], 10.00th=[   20], 20.00th=[   20],
     | 30.00th=[   20], 40.00th=[   20], 50.00th=[   21], 60.00th=[   21],
     | 70.00th=[   21], 80.00th=[   22], 90.00th=[   22], 95.00th=[   22],
     | 99.00th=[   28], 99.50th=[   32], 99.90th=[   55], 99.95th=[   74],
     | 99.99th=[  159]
  cpu          : usr=19.77%, sys=40.01%, ctx=549142, majf=0, minf=24
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,548737,0,548736 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=143MiB/s (150MB/s), 143MiB/s-143MiB/s (150MB/s-150MB/s), io=2144MiB (2248MB), run=15001-15001msec

Disk stats (read/write):
  nvme2n1: ios=0/542994, merge=0/0, ticks=0/7889, in_queue=7890, util=99.36%

Even the SATA Intel DC S3500 performs far better than the Samsung 980:

fio -ioengine=libaio -name=test -filename=/dev/sda -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=34.0MiB/s][w=8712 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3172: Thu Oct 19 04:58:04 2023
  write: IOPS=8713, BW=34.0MiB/s (35.7MB/s)(511MiB/15001msec); 0 zone resets
    slat (nsec): min=5381, max=60223, avg=5695.98, stdev=404.63
    clat (usec): min=35, max=289, avg=49.89, stdev=15.41
     lat (usec): min=44, max=295, avg=55.59, stdev=15.42
    clat percentiles (usec):
     |  1.00th=[   41],  5.00th=[   41], 10.00th=[   42], 20.00th=[   42],
     | 30.00th=[   43], 40.00th=[   43], 50.00th=[   44], 60.00th=[   47],
     | 70.00th=[   49], 80.00th=[   53], 90.00th=[   69], 95.00th=[   82],
     | 99.00th=[  113], 99.50th=[  127], 99.90th=[  182], 99.95th=[  206],
     | 99.99th=[  265]
   bw (  KiB/s): min=34672, max=34992, per=100.00%, avg=34877.79, stdev=57.17, samples=29
   iops        : min= 8668, max= 8748, avg=8719.45, stdev=14.29, samples=29
  lat (usec)   : 50=74.10%, 100=24.24%, 250=1.64%, 500=0.03%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=81, max=432, avg=108.51, stdev=32.18
    sync percentiles (usec):
     |  1.00th=[   87],  5.00th=[   88], 10.00th=[   89], 20.00th=[   90],
     | 30.00th=[   91], 40.00th=[   92], 50.00th=[   96], 60.00th=[  100],
     | 70.00th=[  110], 80.00th=[  124], 90.00th=[  149], 95.00th=[  163],
     | 99.00th=[  265], 99.50th=[  297], 99.90th=[  355], 99.95th=[  367],
     | 99.99th=[  392]
  cpu          : usr=1.86%, sys=14.13%, ctx=392131, majf=0, minf=12
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,130710,0,130709 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=34.0MiB/s (35.7MB/s), 34.0MiB/s-34.0MiB/s (35.7MB/s-35.7MB/s), io=511MiB (535MB), run=15001-15001msec

Disk stats (read/write):
  sda: ios=0/258623, merge=0/0, ticks=0/13669, in_queue=20496, util=99.21%

How to Test SSD Performance

To confirm whether an SSD is truly suitable for Ceph, the best approach is to benchmark it yourself. Here is a recommended test command:

fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -fsync=1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15

The -fsync=1 option ensures that every write is flushed to the SSD, which mirrors how Ceph actually operates.
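
For comparison, you can drop -fsync=1 and run the same workload again. Without the per-write flush the drive can answer out of its volatile cache, which is roughly the condition under which spec-sheet IOPS figures are measured, so the gap between the two runs shows how much the drive depends on that cache:

# Same 4k random-write test, but without forcing a flush after each write.
fio -ioengine=libaio -name=test -filename=/dev/nvme2n1 -direct=1 -bs=4k -iodepth=1 -rw=randwrite -runtime=15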

Conclusion: The Best Strategy for Choosing SSDs

Based on the discussion above, follow these two basic principles when choosing SSDs for Ceph:

  1. Enterprise NVMe > enterprise SATA/SAS >>>>>> consumer NVMe/SATA/SAS.
  2. The SSD should have power-loss protection.

Select drives according to these principles and your Ceph environment will be able to achieve its best performance.

Appendix

Test Environment

Samsung 980 1TB

  • CPU: AMD Epyc 7413
  • RAM: 8 x 32GB DDR4 3200 RDIMM
  • kernel: 6.1.0-9-amd64

Kioxia CD6 3.84TB

  • CPU: 2 x Ampere Altra Q80-30
  • RAM: 4 x 32GB DDR4 3200 RDIMM
  • kernel: 6.1.0-12-arm64

Intel DC S3500 1.6TB

  • CPU: AMD Epyc 7302P
  • RAM: 4 x 32GB DDR4 2933 RDIMM
  • kernel: 6.2.16-15-pve
