OCR in scanned PDF files with OCRmyPDF

April 2, 2021

If you’ve ever tried to copy or search text in a scanned PDF, you’ve probably noticed it doesn’t work: the file is basically just a collection of page images with no text behind them. To fix this, you need to run optical character recognition (OCR) on it.

My go-to tool for this is OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/), which adds a searchable text layer to the scanned pages using the Tesseract engine (https://github.com/tesseract-ocr/tesseract).

1. Installation

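On Debian-based systems, you can install everything you need with the command below (tesseract-ocr-spa is the Spanish language pack; swap in the one for your language):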

sudo apt install ocrmypdf pngquant unpaper tesseract-ocr tesseract-ocr-spa

Note: If you want to use the advanced --jbig2-lossy optimization, you’ll need to install the jbig2enc encoder manually (see https://ocrmypdf.readthedocs.io/en/latest/jbig2.html), as it’s usually not included in official repositories due to licensing issues.
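
A quick way to confirm the install is to check that each tool is actually on the PATH. This is only a sketch: missing_tools is a made-up helper name, and the list of binaries assumes the package names above install commands called ocrmypdf, tesseract, unpaper and pngquant, with jbig2 being the binary that jbig2enc provides.

```shell
# missing_tools is a hypothetical helper: it prints every command name
# passed to it that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# jbig2 is optional (it comes from jbig2enc); the rest come from apt.
missing_tools ocrmypdf tesseract unpaper pngquant jbig2

# If tesseract is installed, this lists its language packs ("spa" should appear):
#   tesseract --list-langs
```

If the call prints nothing, everything was found. Any name it prints still needs installing.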

2. Basic usage: single file

Here’s the command I usually use to process a single file. It cleans up the scan, optimizes it, and outputs a searchable PDF:

ocrmypdf -l spa --output-type pdf \
    -r --remove-background --clean-final --optimize 3 \
    -d -c -i --remove-vectors --jbig2-lossy \
    scanned.pdf output_with_ocr.pdf

2.1. What the main options do:

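-l spa: sets the OCR language (Spanish in this case); the matching Tesseract language pack must be installed.
--output-type pdf: writes a regular PDF instead of the default PDF/A.
-r: automatically rotates pages whose text is detected as sideways or upside down.
-d: deskews pages that were scanned slightly crooked.
-c: cleans each page with unpaper before OCR, without changing the page image in the output.
--clean-final: does the same cleanup but keeps the cleaned image in the output PDF. Its short form is -i, so passing both is redundant but harmless.
--optimize 3: applies the most aggressive optimization level (the scale runs from 0 to 3), which includes lossy image compression.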
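
The short flags are compact but hard to read back a few months later. As a sketch, the same invocation can be wrapped in a small shell function that uses the long option names. ocr_one is a hypothetical name, not part of OCRmyPDF; the long forms are the documented equivalents of the short flags above (-l is --language, -r is --rotate-pages, -d is --deskew, -c is --clean), and the duplicate -i is left out because --clean-final is already there.

```shell
# ocr_one is a hypothetical helper name, not an OCRmyPDF command.
# Usage: ocr_one scanned.pdf output_with_ocr.pdf
ocr_one() {
    ocrmypdf --language spa --output-type pdf \
        --rotate-pages --deskew \
        --remove-background --clean --clean-final \
        --remove-vectors --optimize 3 --jbig2-lossy \
        "$1" "$2"
}
```

Dropping this into ~/.bashrc would make ocr_one available in every new terminal, which saves retyping the whole flag list.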

3. Batch processing

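If you’ve got a lot of PDFs, this small loop finds them recursively from the current directory and writes each result next to the original with an _ocr suffix:

# Skip files that are already OCR outputs, so re-runs don't create *_ocr_ocr.pdf
find . -type f -name '*.pdf' ! -name '*_ocr.pdf' | while IFS= read -r pdf; do
    ocrmypdf -l spa --output-type pdf \
        -r --remove-background --clean-final --optimize 3 \
        -d -c -i --remove-vectors --jbig2-lossy \
        "$pdf" "${pdf%.pdf}_ocr.pdf"
done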
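
If a batch gets interrupted, re-running the loop above redoes every file. A slightly more defensive sketch wraps it in a function that skips inputs whose _ocr.pdf already exists and reports failures instead of silently moving on. ocr_tree is a hypothetical name, and the sketch assumes filenames contain no newline characters.

```shell
# ocr_tree is a hypothetical helper name.
# Usage: ocr_tree ~/scans      (defaults to the current directory)
ocr_tree() {
    root=${1:-.}
    # Exclude previous outputs so they are never fed back in as inputs.
    find "$root" -type f -name '*.pdf' ! -name '*_ocr.pdf' |
    while IFS= read -r pdf; do
        out="${pdf%.pdf}_ocr.pdf"
        if [ -e "$out" ]; then
            echo "skip: $pdf"
            continue
        fi
        # stdin is redirected so the command cannot swallow the file list.
        if ocrmypdf -l spa --output-type pdf \
            -r --remove-background --clean-final --optimize 3 \
            -d -c --remove-vectors --jbig2-lossy \
            "$pdf" "$out" < /dev/null; then
            echo "ok: $pdf"
        else
            echo "failed: $pdf (exit $?)"
        fi
    done
}
```

Because existing outputs are skipped, the function can be re-run after fixing whatever made a file fail, and only the missing ones get processed.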

That’s it. Nothing fancy, just a handy reference that I use myself and that might be useful to others.