OCR in scanned PDF files with OCRmyPDF
If you’ve ever tried to copy or search text in a scanned PDF, you’ve probably noticed it doesn’t work: each page is just an image, with no text layer behind it. To fix this, you need to run optical character recognition (OCR) on the file.
My go-to tool for this is OCRmyPDF (https://ocrmypdf.readthedocs.io/en/latest/), which adds a searchable text layer to the original file using the Tesseract engine (https://github.com/tesseract-ocr/tesseract).
1. Installation
On Debian-based systems, you can install everything you need with:
sudo apt install ocrmypdf pngquant unpaper tesseract-ocr tesseract-ocr-spa
Note: If you want to use the advanced --jbig2-lossy optimization, you’ll need to install the jbig2enc encoder manually (see https://ocrmypdf.readthedocs.io/en/latest/jbig2.html), as it’s usually not included in official repositories, historically because of patent concerns around JBIG2 encoding.
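Before running anything, it can help to confirm that the tools really are on your PATH. This is only a minimal sketch: the helper name missing_tools is my own invention, not part of OCRmyPDF, and it checks that the commands exist, not which versions they are.

```shell
# Print the name of every command passed as an argument that is NOT on PATH.
# missing_tools is a hypothetical helper name, not part of OCRmyPDF.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage (prints nothing when everything is installed):
#   missing_tools ocrmypdf tesseract pngquant unpaper
# To confirm the Spanish language data is available:
#   tesseract --list-langs    # the list should include "spa"
```

If the first command prints a name, that package did not install correctly and the later steps will probably fail.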
2. Basic usage: single file
Here’s the command I usually use to process a single file. It cleans up the scan, optimizes it, and outputs a searchable PDF:
ocrmypdf -l spa --output-type pdf \
-r --remove-background --clean-final --optimize 3 \
-d -c -i --remove-vectors --jbig2-lossy \
scanned.pdf output_with_ocr.pdf
2.1. What the main options do:
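-l spa: sets the OCR language (Spanish in this case).
-r: automatically rotates pages to the correct orientation if needed.
--optimize 3: applies the most aggressive optimization level, which allows lossy compression.
--clean-final: cleans scanning artifacts from each page with unpaper and keeps the cleaned image in the final PDF. Plain -c (--clean) only cleans the image that is sent to OCR. Note that -i is the short form of --clean-final, so the command above specifies it twice.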
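The short flags are easy to forget, so here is the same command spelled out with long option names and wrapped in a function. Treat it as a sketch: ocr_one is a name I made up, and I have listed --clean-final only once, since -i is its short form.

```shell
# ocr_one INPUT OUTPUT
# Same flags as the command above, written with long option names.
# ocr_one is a hypothetical wrapper name, not something OCRmyPDF provides.
ocr_one() {
    ocrmypdf --language spa --output-type pdf \
        --rotate-pages --deskew \
        --clean --clean-final \
        --remove-background --remove-vectors \
        --optimize 3 --jbig2-lossy \
        "$1" "$2"
}

# Usage:
#   ocr_one scanned.pdf output_with_ocr.pdf
```

Depending on your OCRmyPDF version, some of these options may be rejected or ignored; `ocrmypdf --help` shows what your installed version accepts.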
3. Batch processing
If you’ve got a lot of PDFs, this small loop processes all of them recursively:
find . -name '*.pdf' ! -name '*_ocr.pdf' | while IFS= read -r pdf; do
ocrmypdf -l spa --output-type pdf \
-r --remove-background --clean-final --optimize 3 \
-d -c -i --remove-vectors --jbig2-lossy \
"$pdf" "${pdf%.pdf}_ocr.pdf"
done
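One thing to watch with the loop above: if you run it a second time, it will process everything again. A possible way around that, sketched here with two helper names of my own (ocr_output_name and needs_ocr are not part of OCRmyPDF), is to skip files that are already outputs or that already have an output next to them.

```shell
# ocr_output_name PATH
# Derive the output path by replacing the ".pdf" suffix with "_ocr.pdf",
# using the same parameter expansion as the loop above.
ocr_output_name() {
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# needs_ocr PATH
# Succeeds only if PATH is not itself an "_ocr.pdf" output and its
# corresponding output file does not exist yet.
needs_ocr() {
    case "$1" in
        *_ocr.pdf) return 1 ;;
    esac
    [ ! -e "$(ocr_output_name "$1")" ]
}

# Usage, as the first line inside the loop body:
#   needs_ocr "$pdf" || continue
```

This assumes you keep the _ocr.pdf naming convention; if you rename the outputs, the check no longer recognizes them.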
That’s it. Nothing fancy, just a handy reference that I use myself and that might be useful to others.