Google has published an implementation of the Device Memory TCP (DEVMEM TCP) mechanism, which enables direct transmission of data from the memory of one device to the memory of another without intermediate copying. The implementation is currently at the RFC stage, intended for discussion and review by the community, and has not yet been merged into the mainline Linux kernel.
Device Memory TCP is expected to significantly improve the efficiency of data exchange in clusters and distributed machine learning systems that use dedicated accelerator boards. As machine learning accelerators are adopted more widely, the volume of data transferred during model training grows substantially, so higher throughput and more efficient data transfer are needed to fully utilize the available compute resources of GPUs and TPUs.
Traditionally, transferring data between devices located on different hosts requires copying the data from device memory into host memory, sending it over the network to the other host, and then copying it from that host's memory into the memory of the target device. This approach is suboptimal and, when large volumes of data are involved, puts additional load on host memory and the PCIe bus.
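For illustration, below is a minimal sketch of this conventional bounce-buffer path on the sending side, assuming a CUDA-style accelerator runtime; cudaMemcpy() stands in for whatever device-to-host copy the accelerator provides, and GPU and socket setup are omitted:

```c
/* Illustrative sketch of the conventional path that Device Memory TCP removes:
 * the payload is first staged from accelerator memory into host RAM and only
 * then handed to the TCP stack, which copies it again into socket buffers.
 */
#include <stdlib.h>
#include <sys/socket.h>
#include <cuda_runtime.h>

static int send_via_host_bounce(int sock_fd, const void *dev_ptr, size_t len)
{
    void *bounce = malloc(len);
    if (!bounce)
        return -1;

    /* Copy 1: device memory -> host memory over PCIe. */
    if (cudaMemcpy(bounce, dev_ptr, len, cudaMemcpyDeviceToHost) != cudaSuccess) {
        free(bounce);
        return -1;
    }

    /* Copy 2: host memory -> kernel socket buffers, then onto the wire.
     * The receiving host performs the mirror image of these steps. */
    ssize_t sent = send(sock_fd, bounce, len, 0);
    free(bounce);
    return sent == (ssize_t)len ? 0 : -1;
}
```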
The Device Memory TCP mechanism removes host memory from this chain by allowing the payloads of network packets to be placed directly in device memory. It requires a network card capable of header-data split, i.e. writing packet headers and packet payloads into separate buffers. The dma-buf mechanism is used to expose the device's memory as the network card's payload buffers, while the headers land in main memory and are processed by the kernel's TCP/IP stack. In addition, the ability of network cards to steer individual flows into separate RX queues can be used so that only selected traffic is delivered into device memory.
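As a rough sketch of the receive side described in the RFC, the fragment below shows the general shape of the proposed API: payload fragments stay in the dma-buf bound to the NIC's RX queues, while the application receives only offset/size/token metadata through control messages and hands each token back once the data has been consumed. The constant and structure names (MSG_SOCK_DEVMEM, SCM_DEVMEM_DMABUF, SO_DEVMEM_DONTNEED, struct dmabuf_cmsg, struct dmabuf_token) are taken from the patch series and may change before any mainline inclusion; building it requires UAPI headers from a kernel tree with the patches applied. Binding the dma-buf to particular RX queues is done beforehand through a separate netlink request, which is not shown here.

```c
/* Sketch of the devmem TCP receive path as proposed in the RFC.
 * The devmem-specific constants and structures below come from the patched
 * kernel's UAPI headers and are not part of a stock userspace toolchain. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/socket.h>
#include <linux/uio.h>

static void drain_devmem(int fd)
{
    char ctrl[4096];
    char linear[4096];                 /* only headers / small data land here */
    struct iovec iov = { .iov_base = linear, .iov_len = sizeof(linear) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };

    /* MSG_SOCK_DEVMEM asks the kernel to describe device-memory payload
     * fragments via control messages instead of copying them to user space. */
    ssize_t ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
    if (ret < 0) {
        perror("recvmsg");
        return;
    }

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_DEVMEM_DMABUF)
            continue;

        /* The fragment itself stays in the dma-buf bound to the RX queue;
         * user space only learns its offset, size, and reference token. */
        struct dmabuf_cmsg *frag = (struct dmabuf_cmsg *)CMSG_DATA(cm);
        printf("payload fragment: offset=%llu size=%u token=%u\n",
               (unsigned long long)frag->frag_offset,
               frag->frag_size, frag->frag_token);

        /* Once the accelerator has consumed the fragment, the token is
         * returned so the kernel can reuse that region of the dma-buf. */
        struct dmabuf_token tok = {
            .token_start = frag->frag_token,
            .token_count = 1,
        };
        if (setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &tok, sizeof(tok)) < 0)
            perror("setsockopt(SO_DEVMEM_DONTNEED)");
    }
}
```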
Demand for high-performance data exchange between devices continues to grow, particularly in distributed machine learning systems where accelerators reside on different hosts, as well as when training models using data from external SSD drives. Performance testing on a configuration with 4 GPUs and 4 network cards showed that Device Memory TCP achieves up to 96.6% of line rate when transferring data directly between device memories.