How to Install The Latest Tesseract OCR 5 in Ubuntu 20.04 / 18.04 / 21.10

This simple tutorial shows how to install the latest Tesseract OCR engine in all current Ubuntu releases via PPA.

Tesseract is an open-source OCR engine, widely regarded as one of the most accurate, that reads a wide variety of image formats and converts them to text in more than 100 languages. Tesseract 5.0.0 was officially released a few days ago with the following features:

  • Faster training and OCR performance with lower memory usage via ‘fast floats’.
  • Support for the latest macOS and Apple Silicon.
  • Better ARM/ARM64 support.
  • API improvements and more.

How to Install Tesseract OCR in Ubuntu:

The optical character recognition engine is available in the Ubuntu repositories, though the packaged version is usually outdated.

Alexander Pozdnyakov, the maintainer of Tesseract OCR in the official Debian/Ubuntu repositories, also maintains a few PPAs with the latest packages. Most CPU architectures (amd64, i386, arm64/armhf, ppc64el, s390x) are supported.

Option 1: Add Tesseract 4.x PPA

For the latest release of Tesseract OCR 4 (v4.1.3 so far), the stable PPA provides packages for Ubuntu 18.04, Ubuntu 20.04, Ubuntu 21.10, and the older Ubuntu 16.04/14.04.

Press Ctrl+Alt+T on the keyboard to open a terminal. When it opens, run the command below to add the PPA:

sudo add-apt-repository ppa:alex-p/tesseract-ocr

Type your user password when prompted (there is no visual feedback while typing) and hit Enter to continue.

Option 2: Add Tesseract 5 PPA

The new 5.x release series is available in the Devel PPA for Ubuntu 18.04, Ubuntu 20.04, and Ubuntu 21.04. Ubuntu 21.10 is not supported at the moment.

As before, press Ctrl+Alt+T to open a terminal and run the command:

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel

NOTE: installing the OCR engine from the Devel PPA will replace the old 4.x packages, and v5 is not 100% API compatible with v4.0.

Option 3: Add Tesseract repository for Debian:

For Debian Stretch, Buster, Bullseye, and Sid, there are apt repositories for both Tesseract v4 and v5. Debian users, along with Ubuntu 21.10 users, may follow the link below to add the repository:

Tesseract repository for Debian

Update and Install Tesseract:

After adding a PPA or repository from the options above, run the command below in a terminal to refresh the system package cache, in case it was not refreshed automatically (as may happen on Ubuntu 18.04 and earlier):

sudo apt update
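Before installing, it can help to confirm which version apt would actually pick, since the candidate depends on which PPA was added. The following is a minimal sketch, not part of the original steps: candidate_version is a hypothetical helper name, and it assumes the usual apt-cache policy output, which contains a "Candidate:" line.

```shell
# Hypothetical helper: read `apt-cache policy` output on standard input
# and print only the candidate version (the one apt would install).
candidate_version() {
    awk '/Candidate:/ {print $2}'
}

# Run the check only where apt-cache is available.
if command -v apt-cache >/dev/null 2>&1; then
    apt-cache policy tesseract-ocr | candidate_version
fi
```

With the Devel PPA enabled, the printed version would be expected to start with 5; with the stable PPA, with 4.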

Finally, install the OCR engine with the command:

sudo apt install tesseract-ocr

Or, if an older version is already installed, upgrade the package using Software Updater.
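Once the package is installed or upgraded, a quick check can confirm that the expected version is in place and that the engine runs. The sketch below is only an illustration: ocr_to_text is a hypothetical helper name, scan.png is a placeholder file, and eng is assumed as the default language.

```shell
# Print the installed version and the available language data,
# but only when the tesseract command is present.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
    tesseract --list-langs
fi

# Hypothetical helper: OCR an image and print the recognized text.
# Arguments: image path, then an optional language code (default: eng).
# Passing "stdout" as the output base makes tesseract print the text
# instead of writing a .txt file.
ocr_to_text() {
    tesseract "$1" stdout -l "${2:-eng}"
}

# Usage (scan.png is a placeholder): ocr_to_text scan.png eng
```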

How to Remove PPAs & uninstall Tesseract OCR:

To remove the PPAs, either run the earlier add-apt-repository command with the --remove flag, or use the Software & Updates utility under the ‘Other Software’ tab.

To remove the OCR engine, use the command:
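For reference, the removal commands for the two PPAs named in this guide can be spelled out as below. This is a cautious sketch that only prints the commands rather than running them, so the matching line can be copied and run by hand; ppa_remove_commands is a hypothetical helper name.

```shell
# Hypothetical helper: print (not run) the removal command for each
# PPA from this guide. Copy and run the line matching the PPA you added.
ppa_remove_commands() {
    for ppa in alex-p/tesseract-ocr alex-p/tesseract-ocr-devel; do
        echo "sudo add-apt-repository --remove ppa:$ppa"
    done
}

ppa_remove_commands
```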

sudo apt remove --autoremove tesseract-ocr 'tesseract-ocr-*'

You may also remove the libtesseract* packages, though this will also remove other applications (e.g., gImageReader) that depend on them.