Version 5.3.4 of Tesseract, an optical character recognition (OCR) system, has been released. Tesseract supports the recognition of UTF-8 characters and text in over 100 languages, including Russian, Kazakh, Belarusian, and Ukrainian. Results can be saved as plain text or in the hOCR (HTML), ALTO (XML), PDF, and TSV formats. Originally developed at Hewlett-Packard laboratories between 1985 and 1995, the code was released as open source under the Apache license in 2005 and has since been developed further with the participation of Google employees. The project's source code is available on GitHub.
Tesseract provides a console utility and the libtesseract library, which lets developers embed text recognition in other applications. Several third-party projects, such as gImageReader, VietOCR, and YAGF, also rely on Tesseract for text recognition. Two recognition engines are offered: a classic engine that recognizes text at the level of individual characters, and a newer engine based on LSTM recurrent neural networks, which allows accurate recognition of entire lines. Ready-made trained models are available for 123 languages. To improve performance, Tesseract includes modules that use OpenMP and the SIMD instruction sets AVX2, AVX, AVX512F, NEON, or SSE4.1.
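As an illustration of how the console utility, the two engines, and the output formats fit together, here is a minimal Python sketch that assembles a `tesseract` command line. The helper names (`build_tesseract_command`, `run_ocr`) and the file names are hypothetical and exist only for this example. The `-l` and `--oem` options and the `txt`, `hocr`, `alto`, `pdf`, and `tsv` output configs belong to the command-line utility itself. The sketch assumes the `tesseract` binary and the relevant language models are installed.

```python
import shutil
import subprocess

# Engine modes accepted by the tesseract CLI's --oem option.
OEM_LEGACY = 0  # classic engine working at the level of individual characters
OEM_LSTM = 1    # LSTM-based engine that recognizes whole lines

# Output formats named in the article, mapped to the CLI's config names.
OUTPUT_CONFIGS = {
    "text": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}


def build_tesseract_command(image, output_base, lang="eng",
                            oem=OEM_LSTM, formats=("text",)):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    unknown = [f for f in formats if f not in OUTPUT_CONFIGS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    # General shape: tesseract IMAGE OUTPUTBASE [options] [configs]
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]
    cmd += [OUTPUT_CONFIGS[f] for f in formats]
    return cmd


def run_ocr(image, output_base, **kwargs):
    """Run tesseract if it is on PATH; return None when it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    cmd = build_tesseract_command(image, output_base, **kwargs)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example: Russian + Ukrainian models, hOCR and PDF output in one pass.
    print(build_tesseract_command("scan.png", "scan",
                                  lang="rus+ukr", formats=("hocr", "pdf")))
```

Note that selecting the classic engine (`--oem 0`) only works with trained data files that include the legacy model, so the LSTM engine is the safer default in this sketch.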
The main improvements in Tesseract 5.3.4 are:
- Improved recognition of images specified by URL, which are downloaded using the libcurl library. When loading, the “User-Agent”