TESSERACT 5.4.0 TEXT RECOGNITION SYSTEM RELEASED

The optical character recognition system Tesseract 5.4.0 has been released. It supports recognition of UTF-8 characters and text in more than 100 languages, including Russian, Kazakh, Belarusian, and Ukrainian. Results can be saved as plain text, HTML (hOCR), ALTO (XML), PDF, or TSV. The system was originally developed at Hewlett-Packard laboratories between 1985 and 1995; in 2005 the code was opened under the Apache license and was then developed further with the participation of Google employees. The project's source code is distributed under the Apache 2.0 license.

Tesseract includes a console utility and the libtesseract library for embedding text recognition into other applications. Third-party graphical front ends that support Tesseract include gImageReader, VietOCR, and YAGF. Two recognition engines are available: a classic engine that recognizes text at the level of individual character patterns, and a newer engine based on an LSTM recurrent neural network that recognizes whole lines at once and achieves significantly higher accuracy. Pre-trained models are available for 123 languages. To improve performance, modules are provided that use OpenMP and the AVX2, AVX, AVX512F, NEON, or SSE4.1 SIMD instructions.
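As a rough illustration of how the console utility might be driven from another program, the sketch below assembles a command line that selects the languages, the recognition engine, and the output formats. It assumes the utility is installed under the name `tesseract`, that the output config names `txt`, `hocr`, `alto`, `pdf`, and `tsv` are present in the installed configs directory, and that the requested language models are available; the helper names `build_command` and `run_ocr` are hypothetical and not part of the project.

```python
import shutil
import subprocess

# Output config names passed to the CLI after the options.
# Assumption: these config files ship with the installed tessdata.
OUTPUT_CONFIGS = {"txt", "hocr", "alto", "pdf", "tsv"}


def build_command(image, output_base, languages=("eng",), outputs=("txt",),
                  use_lstm=True):
    """Build an argument list: tesseract IMAGE OUTPUTBASE [options] [configs]."""
    unknown = sorted(set(outputs) - OUTPUT_CONFIGS)
    if unknown:
        raise ValueError(f"unsupported output format(s): {', '.join(unknown)}")
    return [
        "tesseract",
        image,
        output_base,
        "-l", "+".join(languages),          # several languages are joined with '+'
        "--oem", "1" if use_lstm else "0",  # 1 = LSTM engine, 0 = classic engine
        *outputs,
    ]


def run_ocr(image, output_base, **kwargs):
    """Run the recognition; raises if the utility is missing or exits with an error."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the 'tesseract' executable was not found on PATH")
    subprocess.run(build_command(image, output_base, **kwargs), check=True)
```

For example, `run_ocr("scan.png", "result", languages=("rus", "ukr"), outputs=("hocr", "pdf"))` would be expected to write `result.hocr` and `result.pdf`. Note that the classic engine (`use_lstm=False`) only works with language models that include classic-engine data.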

Main improvements include:

  • Added support for exporting and rendering data in the PAGE XML format.
  • Implemented the ability to train models using PNG files instead of LSTMF files.
  • Improved rendering to PDF.
  • Extended the API for determining the skew (slope) of text.
  • Fixed performance issues identified by Coverity scans.