The cluster consists of seven nodes: three pure storage nodes and four combined storage/compute nodes, all on the same internal network. Per the tests in "PVE Ceph Cluster Setup (Part 1): Cluster 40GbE x2 Bonding Test", the 40GbE x2 bond between nodes delivers roughly 50GbE of effective interconnect bandwidth.
Software versions
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
ceph: 17.2.5-pve1 quincy (stable)
iperf: 2.0.14a (2 October 2020) pthreads
fio: 3.25
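The ~50GbE bond figure from Part 1 can be spot-checked with the iperf version listed above; a minimal sketch (IPs come from the node table below, the stream count and duration are only illustrative):
# on the receiving node (node1 here)
iperf -s
# on the sending node: 4 parallel streams for 10 seconds
iperf -c 192.168.1.11 -P 4 -t 10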
Node name | Node IP | Motherboard | CPU | Memory | Network |
---|---|---|---|---|---|
node1 | 192.168.1.11 | H12SSL-i | EPYC 7502QS | Samsung 3200 32G 2R8 x8 | 40GbE x2 |
node2 | 192.168.1.12 | X9DRi-F | E5-2670 x2 | Samsung 1600 16G 2R4 x8 | 40GbE x2 |
node3 | 192.168.1.13 | Dell R730xd | E5-2666v3 x2 | Samsung 2133 16G 2R4 x6 | 40GbE x2 |
node4 | 192.168.1.14 | Dell R720xd | E5-2630L x2 | Samsung 1600 16G 2R4 x4 | 40GbE x2 |
node5 | 192.168.1.15 | Dell R720xd | E5-2696v2 x2 | Samsung 1600 16G 2R4 x4 | 40GbE x2 |
node6 | 192.168.1.16 | Dell R730xd | E5-2680v3 x2 | Samsung 2133 16G 2R4 x4 | 40GbE x2 |
node7 | 192.168.1.17 | Dell R730xd | E5-2680v3 x2 | Samsung 2133 16G 2R4 x4 | 40GbE x2 |
Node name | HDD | NVME | OSD Setting |
---|---|---|---|
node1 | 10T x6 | PM9A1 512G x2 | HDD + 1% NVME DB/WAL |
node2 | 6T x12 | PM983A 900G x1 | HDD + 1% NVME DB/WAL |
node4 | 300G x2+ 3T x4+ 4T x3 | PM983A 900G x1 | HDD + 1% NVME DB/WAL |
node6 | 6T x4 | PM983A 900G x1 | HDD + 1% NVME DB/WAL |
node7 | 6T x4 | PM983A 900G x1 | HDD + 1% NVME DB/WAL |
The Ceph cluster has 36 OSDs in total.
mon: node1, node3, node4, node6, node7
mgr: node1, node2, node3, node4, node5, node6, node7
mds: node1, node3, node5, node6, node7 (node1 active)
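The daemon placement and OSD count above can be cross-checked from any node; a quick sanity check using standard Ceph commands:
# cluster health plus mon/mgr/mds/osd summary
ceph -s
# OSDs grouped by host, to confirm the 36-OSD layout
ceph osd tree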
ceph osd pool create testbench 100 100 (create the test pool testbench with pg_num=100, pgp_num=100)
ceph osd pool application enable testbench rbd (tag the pool with the rbd application)
rbd create testbench/disk1 --size 1024000
rbd map testbench/disk1
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/rbd
/mnt/rbd is the target directory for the fio tests against RBD.
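Once the RBD tests are finished, the test image can be released again; a minimal teardown sketch (the /dev/rbd0 device name comes from the map step above):
# confirm which device the image is mapped to
rbd showmapped
# unmount and unmap after testing
umount /mnt/rbd
rbd unmap testbench/disk1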
ceph fs new cephfs cephfs_metadata cephfs_data (ceph fs new takes the metadata pool first, then the data pool)
mount -t ceph :/ /mnt/pve/cephfs -o name=admin,secret=
/mnt/pve/cephfs is the target directory for the fio tests against CephFS.
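For reference, the two pools passed to ceph fs new above must already exist; a minimal sketch assuming the pool names used here and illustrative PG counts (a PVE-managed cluster creates equivalent pools via pveceph fs create):
# metadata and data pools for CephFS (PG counts are only examples)
ceph osd pool create cephfs_metadata 32
ceph osd pool create cephfs_data 128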
Ceph recommends NFS-Ganesha for serving NFS. A Ceph cluster deployed through PVE cannot expose an NFS export configuration directly, so NFS has not been tested yet and will be covered in a later post.
This performance test covers the following: rados bench against the testbench pool, fio against the RBD image mounted at /mnt/rbd, and fio against CephFS mounted at /mnt/pve/cephfs.
rados bench -p testbench 30 write -b 4M -t 16 --no-cleanup | tee rados_write.log
rados bench -p testbench 30 rand -t 16 | tee rados_rand.log
rados bench -p testbench 30 seq -t 16 | tee rados_seq.log
Test | Bandwidth (MB/s) | IOPS | Avg latency (ms) |
---|---|---|---|
write | 738.117 | 184 | 86 |
rand | 998.687 | 249 | 64 |
seq | 998.544 | 249 | 63 |
The rados bench tool that ships with Ceph is run here with 4 MB objects and 16 concurrent operations for 30 seconds.
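The write run uses --no-cleanup so that the subsequent seq and rand runs have objects to read back; after all three runs, the benchmark objects can be removed:
# delete the objects left behind by "rados bench ... write --no-cleanup"
rados -p testbench cleanup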
fio_rbd.conf:
[global]
ioengine=libaio
direct=1
size=5g
lockmem=1G
runtime=30
group_reporting
directory=/mnt/rbd
numjobs=1
iodepth=1
[4k_randwrite]
stonewall
rw=randwrite
bs=4k
[4k_randread]
stonewall
rw=randread
bs=4k
[64k_write]
stonewall
rw=write
bs=64k
[64k_read]
stonewall
rw=read
bs=64k
[1M_write]
stonewall
rw=write
bs=1M
[1M_read]
stonewall
rw=read
bs=1M
for iodepth in 1 2 4 8 16 32; do
sed -i "/^iodepth/c iodepth=${iodepth}" fio_rbd.conf && fio fio_rbd.conf | tee rbd_i$(printf "%02d" ${iodepth}).log && sleep 20s
done
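To compare the runs, the headline numbers can be pulled out of the saved logs; a minimal grep-based sketch (the same pattern works for the cephfs_i*.log files produced below):
for f in rbd_i*.log; do
  echo "== ${f} =="
  # each job prints a summary line such as "write: IOPS=..., BW=..."
  grep -E '^(4k|64k|1M)_|IOPS=' "${f}"
done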
fio_cephfs.conf:
[global]
ioengine=libaio
direct=1
size=5g
lockmem=1G
runtime=30
group_reporting
directory=/mnt/pve/cephfs
numjobs=1
iodepth=1
[4k_randwrite]
stonewall
rw=randwrite
bs=4k
[4k_randread]
stonewall
rw=randread
bs=4k
[64k_write]
stonewall
rw=write
bs=64k
[64k_read]
stonewall
rw=read
bs=64k
[1M_write]
stonewall
rw=write
bs=1M
[1M_read]
stonewall
rw=read
bs=1M
for iodepth in 1 2 4 8 16 32; do
sed -i "/^iodepth/c iodepth=${iodepth}" fio_cephfs.conf && fio fio_cephfs.conf | tee cephfs_i$(printf "%02d" ${iodepth}).log && sleep 20s
done
For my current Ceph configuration, 36 OSDs with NVMe DB/WAL sized at 1% of capacity and roughly 200 TB of raw space in total, the test results above inform the parameter settings used to get better performance:
The Ceph documentation describes cache tiering as follows: a cache tier provides Ceph clients with better I/O performance for a subset of the data stored in a backing storage tier. Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g. solid-state drives) configured to act as a cache tier, and a backing pool of erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier. The Ceph objecter handles where to place the objects, and the tiering agent determines when to flush objects from the cache to the backing storage tier, so the cache tier and the backing storage tier are completely transparent to Ceph clients.
The Ceph documentation also points out a few caveats worth noting:
Since the NVMe drives are almost entirely occupied by the OSD DB/WAL devices, cache tier testing is deferred until additional SSD/NVMe devices are added.
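Once that hardware is available, the basic wiring follows the Ceph documentation; a minimal sketch, assuming a hypothetical fast pool named cache_pool layered over testbench:
# attach the cache pool to the backing pool and enable writeback mode
ceph osd tier add testbench cache_pool
ceph osd tier cache-mode cache_pool writeback
# route client I/O through the cache tier
ceph osd tier set-overlay testbench cache_pool
# a hit set must be configured before the tiering agent can flush/evict
ceph osd pool set cache_pool hit_set_type bloom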
The next post in the series will be "PVE Ceph Cluster Setup (Part 3): CephFS, RBD, and NFS Storage Pool Performance Tuning".