Release of the Tesseract 5.5.0 Text Recognition System

Tesseract 5.5.0, an optical character recognition system, has been released. It supports Unicode and recognizes text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at the Hewlett Packard laboratory; in 2005 the code was opened under the Apache license and has since been developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding text recognition functions in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: a classic one, which recognizes text at the level of individual character patterns, and a new one, based on a machine learning system using an LSTM recurrent neural network, which is optimized for recognizing entire lines and gives a significant increase in accuracy. Ready-made trained models are published for 123 languages. To optimize performance, modules using OpenMP and the SIMD instructions AVX2, AVX, AVX512F, NEON or SSE4.1 are offered.

Main improvements:

  • Added support for the RISC-V V vector extension, on the basis of which assembly optimizations for systems with RISC-V processors have been prepared.
  • When writing results in hOCR format, the ocrp_dir and ocrp_lang parameters are now included in the generated output.
  • Modernized the code for detecting available language models.
  • Improved the code for generating files in hOCR format and removed file conversion on the Windows platform.
  • Symbolic values may now be specified in the "--oem" and "--psm" options.
  • In the code, the access and _access functions have been replaced with std::filesystem::exists(). Calls to tprintf have been replaced with the tesserr stream.
  • Removed support for the TensorFlow machine learning platform, which was implemented at one point but was never used in the recognition models.
  • Improved the installer for the Windows platform.
  • The googletest submodule has been updated to version 1.15.2.
