This is a Python-based application for extracting text from scanned PDF files. It uses Tesseract OCR and Poppler to process each page, and outputs the recognized text to a .txt file. The tool includes a simple file selector and displays progress in the terminal. It's also packaged as a standalone .exe using PyInstaller, so it can run on any Windows machine without requiring Python or additional installations.
- Converts scanned PDF pages to images using Poppler
- Extracts text using Tesseract OCR (supports Spanish)
- Displays progress using a terminal progress bar
- Saves the extracted text to texto_extraido.txt
- Automatically opens the text file after processing
- Can be compiled into a single-file .exe
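The features above can be sketched as a small pipeline. This is a minimal illustration, not the project's actual source: the function names ocr_pages and extract_text and the page separator format are assumptions, while convert_from_path, image_to_string, and tqdm are the entry points of the packages listed in the requirements.

```python
from pathlib import Path
from typing import Callable, Iterable


def ocr_pages(pages: Iterable, recognize: Callable[[object], str]) -> str:
    """Run `recognize` on every page and join the results in page order."""
    chunks = []
    for number, page in enumerate(pages, start=1):
        text = recognize(page)
        # The separator format is an assumption made for this sketch.
        chunks.append(f"--- Page {number} ---\n{text.strip()}\n")
    return "\n".join(chunks)


def extract_text(pdf_path: str,
                 output_path: str = "texto_extraido.txt",
                 lang: str = "spa") -> Path:
    """Convert a scanned PDF to images, OCR each page, and save the text."""
    # Third-party imports are local so ocr_pages stays usable without them.
    import pytesseract
    from pdf2image import convert_from_path
    from tqdm import tqdm

    pages = convert_from_path(pdf_path)  # Poppler renders each page to an image
    text = ocr_pages(
        tqdm(pages, desc="OCR", unit="page"),  # terminal progress bar
        lambda image: pytesseract.image_to_string(image, lang=lang),
    )
    out = Path(output_path)
    out.write_text(text, encoding="utf-8")
    return out
```

Keeping the per-page loop separate from the library calls makes it possible to try the loop with any stand-in recognizer before Tesseract and Poppler are installed.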
- Python 3.8 or later
- Tesseract OCR (installed locally)
- Poppler for Windows
- Python packages: pytesseract, pdf2image, tqdm
- spa.traineddata file for Spanish OCR
This app uses Tesseract OCR with support for both English and Spanish.
Make sure your Tesseract installation includes the following language data files:
- eng.traineddata (included by default)
- spa.traineddata (must be downloaded manually if not present)
To install Spanish language support, download spa.traineddata from the official repo:
https://github.com/tesseract-ocr/tessdata
Place the file inside the tessdata folder of your Tesseract installation directory.
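A quick way to confirm that the language files are in place is to look for them in the tessdata folder. The helper below is a hypothetical sketch (missing_language_files is not part of the project); it only checks that the files exist, not that they are valid.

```python
from pathlib import Path
from typing import List, Sequence


def missing_language_files(tessdata_dir: str,
                           languages: Sequence[str] = ("eng", "spa")) -> List[str]:
    """Return the <lang>.traineddata file names that are absent from tessdata_dir."""
    folder = Path(tessdata_dir)
    return [
        f"{lang}.traineddata"
        for lang in languages
        if not (folder / f"{lang}.traineddata").is_file()
    ]
```

For example, calling missing_language_files(r"C:\Tesseract-OCR\tessdata") on a default installation without Spanish support would be expected to report spa.traineddata as missing.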
- Download the Windows installer from: https://github.com/UB-Mannheim/tesseract/wiki
- Install it to: C:\Tesseract-OCR or a similar folder
- Copy the full path to tesseract.exe for later use
- Make sure the folder tessdata/ includes spa.traineddata (for Spanish).
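The copied path to tesseract.exe is what pytesseract needs when Tesseract is not on the system PATH. A minimal sketch, assuming a helper named configure_tesseract (hypothetical); the attribute pytesseract.pytesseract.tesseract_cmd is the setting the pytesseract package reads.

```python
from pathlib import Path


def configure_tesseract(tesseract_exe: str) -> str:
    """Point pytesseract at a specific tesseract.exe, failing early if it is missing."""
    exe = Path(tesseract_exe)
    if not exe.is_file():
        raise FileNotFoundError(f"tesseract executable not found: {exe}")

    # Imported here so the path check above works even without pytesseract installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = str(exe)
    return str(exe)
```

With the install location suggested above, the call would look like configure_tesseract(r"C:\Tesseract-OCR\tesseract.exe").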
- Download from: https://github.com/oschwartz10612/poppler-windows/releases/
- Extract it to a folder such as: C:\poppler
- The path to use in code is: C:\poppler\Library\bin
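In code, this path is typically handed to pdf2image through the poppler_path argument of convert_from_path. The wrapper below is an illustrative sketch: the name pdf_to_images and the dpi value of 300 are assumptions, not settings taken from the project.

```python
from pathlib import Path


def pdf_to_images(pdf_path: str,
                  poppler_bin: str = r"C:\poppler\Library\bin",
                  dpi: int = 300):
    """Render each PDF page to an image using the Poppler binaries in poppler_bin."""
    if not Path(poppler_bin).is_dir():
        raise NotADirectoryError(f"Poppler bin folder not found: {poppler_bin}")

    # Imported here so the folder check works even without pdf2image installed.
    from pdf2image import convert_from_path

    # dpi=300 is an assumed value; higher values can help OCR but cost time and memory.
    return convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_bin)
```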
pip install pytesseract pdf2image tqdm
To use the application from source:
- Make sure Python and all dependencies are installed.
- Open a terminal in the project directory.
- Run the script:
python ocr_pdf.py
- A file picker window will appear. Select a scanned PDF file.
- The extracted text will be saved to texto_extraido.txt and opened automatically.
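The file picker and the automatic opening of the result can be sketched with the standard library alone. This is an assumption about how such a flow might be written, not the project's code: choose_pdf, is_pdf, and open_in_default_app are hypothetical names, and os.startfile is only available on Windows.

```python
import os
import sys


def is_pdf(path: str) -> bool:
    """True when a file was actually selected and it has a .pdf extension."""
    return bool(path) and path.lower().endswith(".pdf")


def choose_pdf() -> str:
    """Show a file picker and return the chosen path ('' if the dialog is cancelled)."""
    # Imported here because tkinter needs a display and is not always installed.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window, show only the dialog
    path = filedialog.askopenfilename(
        title="Select a scanned PDF",
        filetypes=[("PDF files", "*.pdf")],
    )
    root.destroy()
    return path


def open_in_default_app(path: str) -> None:
    """Open the file with its associated program (Windows only in this sketch)."""
    if sys.platform.startswith("win"):
        os.startfile(path)
    else:
        print(f"Output written to {path}")
```

A script could call choose_pdf(), stop if is_pdf() returns False, and call open_in_default_app("texto_extraido.txt") once the OCR step has finished.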
You can compile the project as a portable .exe using PyInstaller.
- Project structure: your project folder should look like this:
ocr_pdf/
├── ocr_pdf.py
├── README.md
├── requirements.txt
├── Tesseract-OCR/
│   └── tesseract.exe, tessdata/, etc.
└── poppler/
    └── Library/
        └── bin/
            └── pdfinfo.exe, other DLLs...
- Run PyInstaller: inside the ocr_pdf folder, run:
pyinstaller --onefile ^
--add-data "Tesseract-OCR;Tesseract-OCR" ^
--add-data "poppler\\Library\\bin;poppler\\Library\\bin" ^
ocr_pdf.py
Notes: The --add-data arguments bundle the Tesseract-OCR and poppler\Library\bin folders into the executable. The resulting executable is created in the dist/ folder as ocr_pdf.exe.
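At run time, a one-file PyInstaller executable unpacks the bundled folders into a temporary directory and exposes its location as sys._MEIPASS. The script therefore needs to resolve the Tesseract and Poppler paths relative to that directory. A minimal sketch, assuming a helper named resource_path (hypothetical) and falling back to the current directory when running from source:

```python
import os
import sys
from pathlib import Path


def resource_path(relative: str) -> Path:
    """Resolve a bundled file or folder both when frozen and when run from source."""
    # PyInstaller sets sys._MEIPASS in a frozen app; from source we assume the
    # script is started from the project folder and use the current directory.
    base = getattr(sys, "_MEIPASS", None) or os.path.abspath(".")
    return Path(base) / relative
```

With the --add-data arguments shown above, resource_path("Tesseract-OCR/tesseract.exe") and resource_path("poppler/Library/bin") would be the locations to pass to pytesseract and pdf2image respectively.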
You can share the .exe file from dist/ directly. The end user can:
- Double-click to open the application.
- Select a PDF file.
- Receive the extracted text in a plain .txt file, opened automatically.
No Python, Tesseract, or Poppler installations are needed on the target machine.
This application was developed for a freelance client who needed to extract Spanish-language text from scanned PDF documents. The final product is a self-contained .exe that works on any Windows machine and outputs the OCR results to a text file with no installation required.