Release of the Tesseract 5.2 text recognition system

Tesseract 5.2, an optical character recognition system, has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in the HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at the Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and it was further developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding text recognition functions into other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one, which recognizes text at the level of individual character patterns, and a new one based on a machine learning system built on an LSTM recurrent neural network, which is optimized for recognizing whole lines and gives a significant increase in accuracy. Ready-made trained models are published for 123 languages. To optimize performance, modules are offered that use OpenMP and the SIMD instructions AVX2, AVX, AVX512F, NEON or SSE4.1.
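
As a rough illustration of how the console utility, the two engines and the output formats fit together, the following minimal sketch assembles a command line for the tesseract binary. The helper name build_command and the file names are hypothetical; the sketch assumes the standard CLI form (image, output base, -l for languages, --oem for the engine, then output config names) and that trained data for the chosen languages is installed. The classic engine additionally needs trained data that contains the legacy model.

```python
import os
import shutil
import subprocess

# Output config names of the tesseract CLI for the formats named in the article.
OUTPUT_CONFIGS = {"text": "txt", "hocr": "hocr", "alto": "alto", "pdf": "pdf", "tsv": "tsv"}

# --oem values: 0 selects the classic (legacy) engine, 1 the LSTM engine.
ENGINE_MODES = {"classic": 0, "lstm": 1}


def build_command(image, output_base, lang="eng", engine="lstm", formats=("text",)):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", str(ENGINE_MODES[engine])]
    # Each requested format adds one config name; tesseract writes one file per format.
    cmd += [OUTPUT_CONFIGS[name] for name in formats]
    return cmd


if __name__ == "__main__":
    # "page.png" is a placeholder input; several languages are joined with "+".
    cmd = build_command("page.png", "page", lang="rus+ukr", formats=("hocr", "pdf"))
    print(" ".join(cmd))
    # Run OCR only if the binary and the placeholder input actually exist.
    if shutil.which("tesseract") and os.path.exists("page.png"):
        subprocess.run(cmd, check=True)
```

With the arguments shown, a successful run would be expected to produce page.hocr and page.pdf next to the input.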

The main improvements in Tesseract 5.2:

  • Added optimizations implemented using Intel AVX512F instructions.
  • The C API now has a function for initializing Tesseract with a machine learning model loaded from memory.
  • Added the invert_threshold parameter, which sets the threshold for inverting text lines. The default value is 0.7. To disable inversion, set the value to 0.
  • Fixed the processing of very large documents on 32-bit hosts.
  • Switched from using std::regex to std::string.
  • Improved build scripts for Autotools, CMake and continuous integration systems.
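
The new parameter from the list above can be passed to the console utility like any other configuration variable. The sketch below is only an illustration: the helper names config_args and with_invert_threshold and the file names are hypothetical, and it assumes the usual -c NAME=VALUE option of the tesseract CLI.

```python
def config_args(**variables):
    """Turn keyword arguments into tesseract "-c NAME=VALUE" option pairs."""
    args = []
    for name, value in variables.items():
        args += ["-c", f"{name}={value}"]
    return args


def with_invert_threshold(image, output_base, invert_threshold=0.7):
    """Build a command line that sets invert_threshold explicitly.

    0.7 is the default reported for Tesseract 5.2; 0 turns inversion off.
    """
    return ["tesseract", image, output_base, *config_args(invert_threshold=invert_threshold)]


if __name__ == "__main__":
    # "scan.png" is a placeholder; this only prints the command, it does not run OCR.
    print(" ".join(with_invert_threshold("scan.png", "scan", invert_threshold=0)))
```

Setting the value to 0 as shown corresponds to disabling inversion, which may be useful when it is known that a scan contains no light-on-dark text.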