Release of the Tesseract 5.1 text recognition system

The Tesseract 5.1 optical text recognition system has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian and Ukrainian. The result can be saved as plain text or in HTML (hOCR), ALTO (XML), PDF and TSV formats. The system was originally created in 1985-1995 at the Hewlett Packard laboratory; in 2005 the code was opened under the Apache license, and it was later developed with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding text recognition functions in other applications. Third-party GUI front ends that support Tesseract include gImageReader, VietOCR and YAGF. Two recognition engines are offered: the classic one, which recognizes text at the level of individual character patterns, and a new one built on a machine learning system with an LSTM recurrent neural network, which is optimized to recognize whole lines and gives a significant increase in accuracy. Ready-made trained models are published for 123 languages. To optimize performance, modules are offered that use OpenMP and the SIMD instructions AVX2, AVX, NEON or SSE4.1.
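As a rough illustration of how the console utility combines a language, an engine and one or more output formats, the Python sketch below only assembles the command line. The helper names build_tesseract_command and run_ocr, the file names and the language codes in the usage note are assumptions made for this example; the options -l and --oem and the output config names txt, hocr, alto, pdf and tsv are those of the tesseract command line.

```python
import shutil
import subprocess

# Output config names accepted by the tesseract CLI; "txt" is the default.
KNOWN_FORMATS = ("txt", "hocr", "alto", "pdf", "tsv")


def build_tesseract_command(image, output_base, lang="eng", oem=1, formats=("txt",)):
    """Return the argv list for one tesseract run; nothing is executed here.

    oem selects the engine: 0 = classic (legacy), 1 = LSTM,
    2 = both combined, 3 = whatever is available (default).
    """
    for fmt in formats:
        if fmt not in KNOWN_FORMATS:
            raise ValueError(f"unsupported output format: {fmt}")
    if oem not in (0, 1, 2, 3):
        raise ValueError("oem must be 0, 1, 2 or 3")
    return ["tesseract", image, output_base, "-l", lang, "--oem", str(oem), *formats]


def run_ocr(image, output_base, **options):
    """Hypothetical wrapper: run the command only if tesseract is installed."""
    if shutil.which("tesseract") is None:
        return None  # tesseract is not on PATH, so nothing is run
    return subprocess.run(build_tesseract_command(image, output_base, **options), check=False)
```

For example, build_tesseract_command("page.png", "page", lang="rus+ukr", formats=("txt", "hocr")) describes a run that would write page.txt and page.hocr with the LSTM engine, provided the corresponding trained models are installed.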

Main improvements in Tesseract 5.1:

  • Implemented the ability to process image and line regions when producing output in ALTO, hOCR and text formats.
  • Added a new curl_timeout parameter for curl_easy_setopt.
  • Improved the build system.
  • Carried out work to remove unused code.
  • Eliminated crashes caused by incorrect handling of null pointers in PageIterator::Orientation.
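A minimal sketch of how the new timeout parameter might be passed on the command line, assuming a Tesseract build with libcurl support that accepts a URL in place of an image file, and assuming the value is interpreted in seconds. The helper name build_url_ocr_command and the example URL are hypothetical; -c VAR=VALUE is the generic option of the tesseract command line for setting a parameter.

```python
def build_url_ocr_command(url, output_base, timeout_seconds, lang="eng"):
    """Return the argv list for recognizing an image fetched from a URL.

    Assumption: curl_timeout takes a positive whole number of seconds.
    Nothing is executed here; the list can be handed to subprocess.run.
    """
    if not isinstance(timeout_seconds, int) or timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be a positive integer")
    return [
        "tesseract", url, output_base,
        "-l", lang,
        "-c", f"curl_timeout={timeout_seconds}",  # generic parameter override
    ]
```

For example, build_url_ocr_command("https://example.org/scan.png", "scan", 10) yields a command that would stop waiting for the download after the configured timeout.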