Release of the Tesseract 5.0 text recognition system

The Tesseract 5.0 optical text recognition system has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at a Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and has since been developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding text recognition functions in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one that recognizes text at the level of individual character patterns, and a new one based on a machine learning system built on an LSTM recurrent neural network, which is optimized to recognize whole lines and delivers a significant increase in accuracy. Ready-made trained models are published for 123 languages. To optimize performance, modules are offered that use OpenMP and the SIMD instruction sets AVX2, AVX, NEON or SSE4.1.
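
As a rough illustration of how the console utility can be driven from another program, the Python sketch below builds a tesseract command line that requests one or more of the output formats mentioned above. The helper names build_tesseract_command and run_ocr, and the file names in the usage note, are hypothetical and chosen only for this example. The sketch assumes the tesseract executable is on PATH and that the requested language model (for example rus) is installed.

```python
import shutil
import subprocess

# Output formats mapped to the config-file names that the tesseract
# command line accepts after the output base name.
FORMAT_CONFIGS = {
    "txt": "txt",
    "hocr": "hocr",
    "alto": "alto",
    "pdf": "pdf",
    "tsv": "tsv",
}


def build_tesseract_command(image_path, output_base, lang="eng", formats=("txt",)):
    """Return the argument list for a tesseract call (hypothetical helper)."""
    unknown = [name for name in formats if name not in FORMAT_CONFIGS]
    if unknown:
        raise ValueError("unsupported output format(s): " + ", ".join(unknown))
    command = ["tesseract", image_path, output_base, "-l", lang]
    command.extend(FORMAT_CONFIGS[name] for name in formats)
    return command


def run_ocr(image_path, output_base, lang="eng", formats=("txt",)):
    """Run tesseract if it is installed; raise RuntimeError otherwise."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, formats)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, check=True)
```

For instance, run_ocr("scan.png", "result", lang="rus", formats=("hocr", "pdf")) would be expected to write result.hocr and result.pdf next to the working directory, provided the Russian model is available.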

Key improvements in Tesseract 5.0:

  • The major change in the version number reflects API changes that break compatibility. In particular, the public libtesseract API is no longer tied to the project's own data types GenericVector and STRING; std::string and std::vector are used in their place.
  • The source tree has been reorganized. Public header files have been moved to the include/tesseract directory.
  • Memory management has been reworked: all malloc and free calls have been replaced with C++ code. The code has been modernized overall.
  • Optimizations have been added for the ARM and ARM64 architectures; ARM NEON instructions are used to speed up calculations. Performance has also been optimized for all architectures.
  • New model training and text recognition modes based on floating-point numbers have been implemented. The new modes offer higher performance and lower memory consumption. In the LSTM engine, the fast float32 mode is enabled by default.
  • Unicode normalization now uses the NFC form (Normalization Form C, canonical composition).
  • An option has been added to configure the logging level (--loglevel).
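
To make the NFC item above more concrete, here is a minimal Python sketch of what normalization to NFC does to text. It uses Python's standard unicodedata module as a stand-in and is not Tesseract's own code; the helper name to_nfc is hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# Cyrillic "и" (U+0438) followed by a combining breve (U+0306) is the
# decomposed spelling of "й" (U+0439). NFC merges the pair into the
# single precomposed code point.
decomposed = "\u0438\u0306"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u0439")            # True
```

Both spellings render identically, but after normalization recognized text and reference text compare equal code point by code point, which matters when OCR output is compared against ground truth.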