Yandex has announced the release of the source code of YTsaurus, its platform for distributed storage and processing of data. The platform supports data manipulation using the MapReduce paradigm, SQL queries, a distributed file system, and NoSQL storage in the key-value format. It can scale to clusters of more than 10,000 nodes, spanning up to a million processor cores and thousands of GPUs for machine learning workloads. The project is open source, written in C/C++, and publicly available on GitHub under the Apache 2.0 license.
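As an illustration of the MapReduce paradigm only, here is a minimal in-process word-count sketch of the map, shuffle, and reduce phases in plain Python. This is not the YTsaurus API: on a real cluster these phases run distributed across many nodes, with the shuffle handled by the framework.

```python
from collections import defaultdict
from typing import Iterable

def map_phase(lines: Iterable[str]):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {key: sum(values) for key, values in groups}

# Toy input standing in for a large distributed table of log lines.
logs = ["error timeout", "error retry", "ok"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["error"])  # 2
```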
YTsaurus is used in Yandex's infrastructure to make efficient use of the computing power of the company's supercomputers. Isolated containers launched on physical servers serve as the unit from which a cluster is formed. Data in the storage can be placed on various media, such as hard drives, SSDs, NVMe drives, and RAM. The cluster supports dynamic addition and removal of nodes, reservation, automatic replication, updating the cluster software without stopping work, and automatic restoration of redundancy when nodes fail.
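The restoration of redundancy after a node failure can be sketched with a toy model: each chunk of data carries a target replica count, and when a node drops out, under-replicated chunks are copied onto surviving nodes. All names and the replication factor of 3 below are illustrative assumptions, not YTsaurus internals.

```python
REPLICATION_FACTOR = 3  # assumed target replica count per chunk

# Hypothetical cluster state: chunk id -> set of nodes holding a replica.
replicas = {
    "chunk-1": {"node-a", "node-b", "node-c"},
    "chunk-2": {"node-a", "node-c", "node-d"},
}
nodes = {"node-a", "node-b", "node-c", "node-d", "node-e"}

def fail_node(node: str) -> None:
    # A failed node leaves the cluster and drops out of every replica set.
    nodes.discard(node)
    for held in replicas.values():
        held.discard(node)

def restore_redundancy() -> None:
    # Re-replicate every under-replicated chunk onto surviving nodes.
    for chunk, held in replicas.items():
        for spare in sorted(nodes - held):
            if len(held) >= REPLICATION_FACTOR:
                break
            held.add(spare)

fail_node("node-a")
restore_redundancy()
# Every chunk is back to its target replica count on the surviving nodes.
```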
YTsaurus has three types of clusters: computing clusters, clusters for dynamic tables and key-value storage, and auxiliary clusters. The platform can provide storage and processing of data for tens of thousands of users. At Yandex, YTsaurus is used in many areas, including storing information about users of the advertising network, training machine learning models, building the search index, and building data warehouses for services such as Yandex Taxi, Eats, Market, and Delivery.
The basic usage scenarios for YTsaurus include: batch processing with MapReduce and SPYT (Spark over YTsaurus) for structured and semi-structured data such as logs or financial transactions; ad hoc analytics with fast queries via CHYT (a cluster of ClickHouse servers running on YTsaurus compute nodes) without copying data into a separate analytical system; machine learning with GPU cluster management for training models with billions of parameters; transactional storage of metadata together with a reliable distributed lock service; and building data warehouses and ETL pipelines with multi-level data processing using standard tools such as Apache Spark, SQL, or MapReduce.
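The distributed lock service mentioned above can be illustrated with a toy in-process model of exclusive-lock semantics: a transaction acquires a lock on a path, and other transactions are rejected until it releases the lock. The class, method names, and paths here are hypothetical, not the YTsaurus API, which implements locks on its transactional metadata storage.

```python
class LockService:
    """Toy in-process stand-in for a distributed lock service (hypothetical API)."""

    def __init__(self):
        self._owners = {}  # path -> id of the transaction holding the exclusive lock

    def acquire(self, path: str, tx: str) -> bool:
        # An exclusive lock succeeds only if no other transaction holds the path.
        if self._owners.get(path, tx) != tx:
            return False
        self._owners[path] = tx
        return True

    def release(self, path: str, tx: str) -> None:
        # Only the current owner may release the lock.
        if self._owners.get(path) == tx:
            del self._owners[path]

locks = LockService()
assert locks.acquire("//etl/daily", "tx-1")      # first writer takes the lock
assert not locks.acquire("//etl/daily", "tx-2")  # second writer is rejected
locks.release("//etl/daily", "tx-1")
assert locks.acquire("//etl/daily", "tx-2")      # the lock is free again
```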