IBM has introduced the CodeNet initiative, which aims to give researchers a data set for experimenting with machine learning methods to build translators from one programming language to another, as well as code generators and analyzers. CodeNet includes a collection of 14 million code samples that solve 4053 typical programming problems. In total, the collection contains about 500 million lines of code and covers 55 programming languages, both modern ones such as C++, Java, Python and Go, and legacy ones, including COBOL, Pascal and Fortran. The project's tooling is distributed under the Apache 2.0 license, and the data sets are planned for release into the public domain.
The samples are annotated and implement identical algorithms in different programming languages. The proposed set is expected to help train machine learning systems and drive innovation in code translation and machine analysis of code, much as the annotated image database ImageNet helped advance image recognition and machine vision systems. Various programming competitions are named as one of the main sources for the collection.
Unlike traditional translators built on transformation rules, machine learning systems can capture and take into account the context in which code is used. When converting from one programming language to another, context is no less important than when translating from one human language to another. It is precisely the failure to account for context that hinders the conversion of code from legacy languages such as COBOL.
A large base of algorithm implementations in various languages will help in building universal machine learning systems that, instead of translating directly between specific languages, work with a more abstract representation of the code that is independent of any particular programming language. Such a system can act as a translator that converts code written in any of the supported languages into its internal abstract representation, from which code can then be generated in many languages.
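The pivot idea described above can be illustrated with a deliberately tiny sketch. It is hand-written and rule-based, not a machine learning system and not part of CodeNet: the `Assign` node, the `parse_*` functions and `translate` are hypothetical names invented for this illustration, and the "languages" are reduced to a single statement form (adding two operands and storing the result). It shows only the architecture, in which each language needs one parser into the shared representation and one emitter out of it, rather than a separate translator for every language pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assign:
    """Hypothetical language-neutral node: target receives left + right."""
    target: str
    left: str
    right: str


def parse_python(src: str) -> Assign:
    # Handles only the form "target = left + right".
    target, expr = src.split("=")
    left, right = expr.split("+")
    return Assign(target.strip(), left.strip(), right.strip())


def parse_cobol(src: str) -> Assign:
    # Handles only the form "ADD left TO right GIVING target."
    words = src.rstrip(".").split()
    return Assign(target=words[5], left=words[1], right=words[3])


# One parser per source language, one emitter per target language.
PARSERS = {"python": parse_python, "cobol": parse_cobol}
EMITTERS = {
    "python": lambda n: f"{n.target} = {n.left} + {n.right}",
    "pascal": lambda n: f"{n.target} := {n.left} + {n.right};",
    "cobol": lambda n: f"ADD {n.left} TO {n.right} GIVING {n.target}.",
}


def translate(src: str, source_lang: str, target_lang: str) -> str:
    """Translate via the shared representation, never language-to-language."""
    node = PARSERS[source_lang](src)
    return EMITTERS[target_lang](node)


if __name__ == "__main__":
    print(translate("ADD A TO B GIVING TOTAL.", "cobol", "python"))
    print(translate("total = a + b", "python", "pascal"))
```

With N supported languages, this layout needs on the order of N parsers and N emitters instead of a translator for each of the N×(N-1) ordered language pairs. In the systems the article describes, the shared representation would be learned from data rather than defined by hand as it is here.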