Creating Ceph Storage with 1 TiB/s Bandwidth

News Report: Engineer Builds Record-Breaking Ceph Storage Cluster

An engineer from Clyso has summarized the experience of building a storage cluster based on the fault-tolerant distributed storage system Ceph that achieves throughput exceeding one tebibyte per second (1 TiB/s). This is the first Ceph-based cluster to reach that level of performance, but getting there required overcoming several unexpected problems.

The engineers found that adjusting the power-management settings in the server BIOS and disabling C-states (the mechanism that scales CPU power saving with load) increased performance by 10-20%. In addition, when working with the NVMe drives, the Linux kernel spent a significant amount of time in spin locks during IOMMU mapping; disabling the IOMMU in the kernel brought a noticeable improvement in tests with 4 MB block reads and writes.
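The write-up does not include the exact commands involved, so the following is only a minimal sketch (my assumption, using standard Linux sysfs paths) of how one might inspect a node's current IOMMU and C-state configuration before and after changing the BIOS and kernel settings:

```python
#!/usr/bin/env python3
"""Illustrative sketch only (not part of the original report): check from a
running Linux host whether the IOMMU is active and which C-states the kernel
exposes. The tuning described above was done in the server BIOS and kernel
configuration; this merely inspects the resulting state via sysfs."""

from pathlib import Path


def iommu_group_count() -> int:
    """Number of IOMMU groups; zero usually means the IOMMU is off or absent."""
    return len(list(Path("/sys/kernel/iommu_groups").glob("*")))


def cstate_names(cpu: int = 0) -> list[str]:
    """Idle-state (C-state) names the kernel advertises for one CPU."""
    cpuidle = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpuidle")
    return [(state / "name").read_text().strip()
            for state in sorted(cpuidle.glob("state[0-9]*"))]


if __name__ == "__main__":
    print(f"IOMMU groups:    {iommu_group_count()}")
    print(f"C-states (cpu0): {', '.join(cstate_names()) or 'none exposed'}")
```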

Disabling the IOMMU, however, did not resolve the poor performance of random writes with 4 KB blocks. Further investigation showed that the Ceph build scripts used by the Gentoo and Ubuntu projects needed corrections: building Ceph with the RelWithDebInfo option, which enables the GCC "-O2" optimization level, noticeably improved performance. Building without the TCMalloc library also degraded performance, so adjusting the compilation flags and enabling TCMalloc cut the package build time roughly threefold and increased the throughput of random operations with 4K blocks. In the final stage, the engineers additionally tuned RocksDB (applying the settings from the Ceph Reef branch) and the placement groups (PGs).
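As a hedged sketch of the kind of build change this implies (the option names below are assumptions and may differ between Ceph releases; check the project's CMakeLists.txt for your version), a configure step along these lines selects an optimized RelWithDebInfo build and the TCMalloc allocator:

```python
#!/usr/bin/env python3
"""Minimal sketch, assuming a Ceph source checkout and CMake on PATH: run the
configure step with an optimized RelWithDebInfo build type and the TCMalloc
allocator. CMAKE_BUILD_TYPE is standard CMake; ALLOCATOR=tcmalloc is assumed
from Ceph's build system and should be verified against your release."""

import subprocess
from pathlib import Path


def configure_ceph(src_dir: Path, build_dir: Path) -> None:
    """Configure (but do not compile) a Ceph tree with the flags discussed above."""
    build_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "cmake",
            "-DCMAKE_BUILD_TYPE=RelWithDebInfo",  # enables -O2 -g with GCC
            "-DALLOCATOR=tcmalloc",               # assumed Ceph CMake option
            str(src_dir.resolve()),
        ],
        cwd=build_dir,
        check=True,
    )


if __name__ == "__main__":
    configure_ceph(Path("ceph"), Path("ceph/build"))
```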

The storage cluster consists of 68 nodes based on Dell PowerEdge R6615 servers with AMD EPYC 9454P (48C/96T) CPUs. Each node is equipped with 10 NVMe drives (Dell, 15.36 TB each), two 100 GbE Mellanox ConnectX-6 Ethernet adapters, and 192 GB of RAM. The software stack is based on Ubuntu 20.04.6 and Ceph 17.2.7. The cluster runs 630 Object Storage Daemon (OSD) processes, one OSD per NVMe drive, three monitor (mon) processes that track the cluster's state, and one manager (mgr) process for service management. The total storage size reaches 8.2 petabytes.
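For context only, a bit of back-of-the-envelope arithmetic on the figures above (a derived estimate, not a measurement from the report) gives the aggregate network line rate of the cluster:

```python
#!/usr/bin/env python3
"""Back-of-the-envelope arithmetic on the hardware listed above; these are
derived estimates, not measurements from the report."""

NODES = 68
NIC_GBITS_PER_NODE = 2 * 100          # two 100 GbE ConnectX-6 ports per node

aggregate_gbits = NODES * NIC_GBITS_PER_NODE            # 13,600 Gbit/s total
aggregate_tib_per_s = aggregate_gbits * 1e9 / 8 / 2**40  # convert to TiB/s

print(f"Aggregate NIC line rate: {aggregate_gbits / 1000:.1f} Tbit/s "
      f"(~{aggregate_tib_per_s:.2f} TiB/s per direction)")
```

That theoretical ceiling of roughly 1.5 TiB/s per direction helps put the reported throughput of more than 1 TiB/s into perspective.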
