Tesseract was initially developed at Hewlett-Packard and has been around since 1985. The company open-sourced the project in 2005, with Google taking over most of the development work the following year. Since 2018, Tesseract has again depended on unfunded community contributions, which might explain the long alpha phase.
Optical character recognition is the capability of analysing images of text and numbers and turning them into actual text or digit series. This can be useful for further processing or just turning paper documents into a searchable digital representation. Tesseract promises to recognise more than 100 languages and supports a number of output formats including plain text, HTML, and PDF.
While the last major release added neural networks to improve recognition results, Tesseract 5.0 looks to impress users with faster training and recognition out of the box. The speed-up is mainly thanks to the Tesseract authors switching from double-precision calculations to single-precision floats, which is said to come with the added bonus of needing less system memory.
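To give a rough feel for the memory side of that trade-off, here is a minimal Python sketch. It is not Tesseract code: the `weights` list is a made-up stand-in for network parameters, and the sizes assume the usual 8-byte double and 4-byte float found on common platforms.

```python
from array import array

# Hypothetical values standing in for a layer of network weights.
weights = [i / 10_000 for i in range(10_000)]

as_double = array("d", weights)  # C double, double precision
as_float = array("f", weights)   # C float, single precision

double_bytes = as_double.itemsize * len(as_double)
float_bytes = as_float.itemsize * len(as_float)

# Largest rounding error introduced by storing the values as floats.
max_error = max(abs(d - f) for d, f in zip(as_double, as_float))

print(f"double: {double_bytes} bytes, float: {float_bytes} bytes")
print(f"largest rounding error: {max_error:.2e}")
```

Storing the same values as floats halves the footprint at the cost of a small rounding error per value. Whether that error matters depends on the computation, so the sketch says nothing about recognition accuracy.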
With the update comes better support for Arm NEON and additional binarisation options, so that users can check whether adaptive Otsu thresholding or the Sauvola method for local binarisation yields more accurate results for their use case. New options for combine_tessdata are meant to provide developers with details of traineddata files.
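The difference between a global and a local threshold can be illustrated with a small, self-contained Python sketch. It uses textbook formulas rather than Tesseract's own implementation: Otsu here is the plain global variant (not the adaptive one mentioned above), and the Sauvola parameters `radius`, `k` and `r` as well as the toy image are assumptions chosen purely for illustration.

```python
def otsu_threshold(image):
    """Global Otsu: pick the grey level that maximises between-class variance."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def sauvola_ink_mask(image, radius=2, k=0.2, r=128.0):
    """Local Sauvola: per-pixel threshold T = m * (1 + k * (s / r - 1))."""
    height, width = len(image), len(image[0])
    mask = []
    for y in range(height):
        mask_row = []
        for x in range(width):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            m = sum(window) / len(window)                                # local mean
            s = (sum((v - m) ** 2 for v in window) / len(window)) ** 0.5  # local std dev
            mask_row.append(image[y][x] <= m * (1 + k * (s / r - 1)))
        mask.append(mask_row)
    return mask


def make_row():
    """One row of a toy scan: well-lit left half, shadowed right half."""
    left = [200] * 8
    left[3] = 120    # ink stroke on bright paper
    right = [90] * 8
    right[3] = 30    # ink stroke on shadowed paper
    return left + right


image = [make_row() for _ in range(8)]
truth = [[p in (120, 30) for p in row] for row in image]  # True marks ink


def count_errors(mask):
    return sum(a != b for row_a, row_b in zip(mask, truth) for a, b in zip(row_a, row_b))


t = otsu_threshold(image)
otsu_mask = [[p <= t for p in row] for row in image]
sauvola_mask = sauvola_ink_mask(image)

otsu_errors = count_errors(otsu_mask)
sauvola_errors = count_errors(sauvola_mask)
print(f"Otsu threshold {t}: {otsu_errors} misclassified pixels")
print(f"Sauvola: {sauvola_errors} misclassified pixels")
```

On this deliberately unevenly lit image, no single global threshold can separate ink from paper in both halves, so the local method makes fewer mistakes. On evenly lit scans the picture can look different, which is why being able to compare both methods on real input is useful.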
According to the change notes, the team also put some effort into modernising the Tesseract codebase, cleaning up renderers, and getting rid of proprietary data types in its public API, which could lead to some breakage in old software using the engine. The same goes for any code that uses pdf.ttf directly, as the file is no longer needed and has therefore been removed.
Apart from these changes, the contributors got rid of bugs, improved unit and fuzzing tests, and clarified training messages.