OCR Engine Tesseract 5.0 converts to float for faster training and recognition

AI/ML

By Julia Schmidt

December 1, 2021

After more than 2.5 years in alpha, version 5.0 of the popular optical character recognition engine Tesseract has finally made it across the finish line.

Tesseract was initially developed at Hewlett-Packard and has been around since 1985. The company open-sourced the project in 2005, with Google taking over most of the development work the following year. Since 2018, Tesseract has again depended on unfunded community contributions, which might be the reason for the long alpha phase.

Optical character recognition is the capability of analysing images of text and numbers and turning them into machine-readable text or digit sequences. This can be useful for further processing or simply for turning paper documents into a searchable digital representation. Tesseract promises to recognise more than 100 languages and supports a number of output formats including plain text, HTML, and PDF.

While the last major release added neural networks to improve recognition results, Tesseract 5.0 looks to impress users with faster training and recognition out of the box. The speed-up is mainly thanks to the Tesseract authors switching from double-precision to single-precision floating-point calculations, which is said to come with the added bonus of needing less system memory.
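To make the memory argument concrete, here is a minimal Python sketch using only the standard library. It is not Tesseract code: the `values` list is a hypothetical stand-in for network weights, and the sketch only compares how much storage the same numbers need as 64-bit doubles and as 32-bit floats, and how much rounding error the narrower type introduces.

```python
from array import array

# Hypothetical weight vector standing in for one layer of a network.
values = [0.1 * i for i in range(1000)]

as_double = array("d", values)  # 64-bit IEEE 754 doubles
as_float = array("f", values)   # 32-bit IEEE 754 floats

double_bytes = as_double.itemsize * len(as_double)
float_bytes = as_float.itemsize * len(as_float)

# Largest rounding error introduced by storing the values as floats.
max_error = max(abs(d - f) for d, f in zip(as_double, as_float))

print(f"double: {double_bytes} bytes, float: {float_bytes} bytes")
print(f"largest rounding error: {max_error:.2e}")
```

On common platforms the float array takes half the space of the double array, at the cost of a small rounding error per value. Halving the size of each value also means more values fit into a cache line or SIMD register, which is one plausible reason such a switch can speed up computation.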

The update also brings better support for Arm NEON and additional binarisation options, so that users can check whether adaptive Otsu thresholding or the Sauvola method for local binarisation yields more accurate results for their use case. New options -dl and -ld in combine_tessdata are meant to provide developers with details of traineddata files.
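For readers unfamiliar with the two binarisation approaches, the following Python sketch shows textbook versions of both. It is an illustration under stated assumptions, not Tesseract's implementation: `otsu_threshold` computes a single global Otsu threshold (not the adaptive variant mentioned above), `sauvola_threshold` applies the Sauvola formula T = m * (1 + k * (s / R - 1)) to one local window, the parameter values `k=0.2` and `r=128.0` are illustrative defaults, and the two pixel patches are made up.

```python
from math import sqrt


def otsu_threshold(pixels):
    """Return a global threshold t that maximises the between-class
    variance; pixels with value <= t count as dark (text)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # all pixels are at or below t
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def sauvola_threshold(window, k=0.2, r=128.0):
    """Return the Sauvola threshold for one local window of pixels,
    based on the window's mean and standard deviation."""
    n = len(window)
    mean = sum(window) / n
    std = sqrt(sum((p - mean) ** 2 for p in window) / n)
    return mean * (1 + k * (std / r - 1))


# Hypothetical 3x3 patches (flattened) from a page with uneven lighting.
bright_patch = [220] * 6 + [110] * 3  # well-lit paper with grey text
shadow_patch = [100] * 6 + [30] * 3   # shadowed paper with dark text

global_t = otsu_threshold(bright_patch + shadow_patch)
bright_t = sauvola_threshold(bright_patch)
shadow_t = sauvola_threshold(shadow_patch)

print("global Otsu threshold:", global_t)
print("Sauvola threshold, bright patch:", round(bright_t, 1))
print("Sauvola threshold, shadow patch:", round(shadow_t, 1))
```

In this made-up example the shadowed paper (100) is darker than the text in the well-lit area (110), so no single global threshold can classify both correctly, whereas each local Sauvola threshold falls between the text and paper values of its own patch. Real scans will behave differently, which is why trying both methods on representative pages is worthwhile.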

According to the change notes, the team also put some effort into modernising the Tesseract codebase, cleaning up renderers, and getting rid of proprietary data types in its public API, which could break older software that uses the engine. The same goes for any code that uses pdf.ttf directly, as the file is no longer needed and has therefore been removed.

Apart from these changes, the contributors got rid of bugs, improved unit and fuzzing tests, and clarified training messages.
