Microsoft has introduced Kosmos-1, a neural network that combines several input modalities (text, audio, images and video) and is meant to serve as a foundation for building artificial general intelligence. The researchers call the system a “multimodal large language model” (MLLM). The model can:
- Analyze images;
- Solve visual puzzles;
- Recognize text in images;
- Pass visual IQ tests with an accuracy of 22-26%;
- Understand natural-language instructions.
1-2: visual explanation; 3-4: question answering; 5: web page question answering; 6: simple math equation; 7-8: number recognition.
Microsoft trained Kosmos-1 on data from the web, including excerpts from The Pile (an 800 GB English-language text dataset) and the Common Crawl web archive.
After training, the researchers evaluated the abilities of Kosmos-1 in several tests, namely:
- Language understanding;
- Text generation;
- Text classification without optical character recognition (OCR);
- Image captioning;
- Visual question answering;
- Web page question answering;
- Image classification.
The researchers note that Kosmos-1 outperformed current models on many of these tests.
At the same time, Kosmos-1 answered questions from the Raven IQ test correctly in only 22% of cases (26% after fine-tuning).
1-2: image captioning; 3-6: visual question answering; 7-8: recognizing text in an image; 9-11: holding a dialogue.
The researchers plan to scale up the model and to add speech capabilities. In addition, Kosmos-1 will soon be made available to developers.