Tesseract version 4.0 releases with a new LSTM-based engine and an updated build system

Google released version 4.0 of its OCR engine, Tesseract, yesterday. Tesseract 4.0 comes with a new neural net (LSTM) based OCR engine, an updated build system, other improvements, and bug fixes.

Tesseract is an OCR engine that supports Unicode (a standard that aims to encode the characters of all writing systems) and can recognize more than 100 languages out of the box. It can be trained to recognize other languages and is used for text detection on mobile devices, in videos, and in Gmail image spam detection.

Let’s have a look at what’s new in Tesseract 4.0.

New neural net (LSTM) based OCR engine

The new OCR engine uses a neural network system based on LSTMs and delivers major accuracy gains. The release also includes new training tools for the LSTM OCR engine: you can train a new model from scratch or fine-tune an existing one. Trained data, including LSTM models, has been added for 123 languages. Optional accelerated code paths have also been added for the LSTM recognizer.

In addition, a new parameter, lstm_choice_mode, makes it possible to include alternative symbol choices in the hOCR output.
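As a rough illustration of how that parameter might be passed on the command line, here is a minimal Python sketch that only builds the argument list. The helper name `hocr_command`, the file names, and the value `1` for lstm_choice_mode are assumptions made for this example; check the Tesseract documentation for the values your build accepts.

```python
def hocr_command(image_path, output_base, choice_mode=1, lang="eng"):
    """Build a tesseract command line that requests hOCR output
    with alternative symbol choices (hypothetical helper)."""
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "--oem", "1",                             # 1 selects the LSTM engine only
        "-c", f"lstm_choice_mode={choice_mode}",  # value 1 is an assumed setting
        "hocr",                                   # config file that enables hOCR output
    ]


# Example: print the command for a hypothetical input image.
print(" ".join(hocr_command("page.png", "page")))
```

The returned list can be handed to `subprocess.run` on a machine where the tesseract binary is installed.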

Updated Build System

Tesseract 4.0 uses semantic versioning and requires Leptonica 1.74.0 or higher. Building Tesseract from source requires a compiler with strong C++11 support.
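The Leptonica requirement can be checked with a simple version comparison. The sketch below is illustrative only: the helper names and the `leptonica-1.76.0` style of version string are assumptions, not something the release notes specify.

```python
import re

# Minimum Leptonica version named in the release notes.
MIN_LEPTONICA = (1, 74, 0)


def parse_version(text):
    """Extract the first major.minor[.patch] version found in text as a tuple."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def leptonica_ok(version_text):
    """Return True if the reported Leptonica version meets the minimum."""
    return parse_version(version_text) >= MIN_LEPTONICA


# Assumed example string; a real build may report its version differently.
print(leptonica_ok("leptonica-1.76.0"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "1.100.0" would sort before "1.74.0".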

Unit tests have been added to the main repo. Tesseract’s source tree has been reorganized in version 4.0. A new option has been added that lets you compile Tesseract without the code of the legacy OCR engine.

Bug Fixes

  • Issues in training data rendering have been fixed.
  • Damage caused to binary images when processing PDFs has been fixed.
  • Issues in the OpenCL code have been fixed. OpenCL now works for the legacy Tesseract OCR engine, although performance has not improved yet.

Other Improvements

  • Multi-page TIFF handling is improved in Tesseract 4.0.
  • PDF rendering has been improved.
  • Version information and improved help text have been added to the training tools.
  • tessedit_pageseg_mode 1 has been removed from the hocr, pdf, and tsv config files. Users now have to pass --psm 1 explicitly if that behavior is desired.
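For scripts that relied on the old behavior of the hocr, pdf, and tsv config files, the page segmentation mode now has to be set on the command line. A minimal sketch, assuming a hypothetical helper `tesseract_args` and placeholder file names:

```python
OUTPUT_CONFIGS = {"hocr", "pdf", "tsv"}


def tesseract_args(image_path, output_base, output_config, psm=None):
    """Build a tesseract argument list, adding --psm only when requested."""
    if output_config not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output config: {output_config!r}")
    args = ["tesseract", image_path, output_base]
    if psm is not None:
        # The config files no longer set page segmentation mode 1,
        # so it has to be passed explicitly.
        args += ["--psm", str(psm)]
    args.append(output_config)  # config file names go last
    return args


# Example: request searchable PDF output with page segmentation mode 1.
print(" ".join(tesseract_args("scan.tif", "scan", "pdf", psm=1)))
```

Leaving `psm` unset keeps Tesseract's default page segmentation mode.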

For more information, check out the official release notes.

Natasha Mathur

Tech writer at the Packt Hub. Dreamer, book nerd, lover of scented candles, karaoke, and Gilmore Girls.
