Intelligent OCR with Post-Processing Downsampling
Transform scanned documents into searchable PDFs while maintaining maximum OCR accuracy and flexible output resolutions.
Traditional OCR tools force you to choose between accuracy and file size. Morphic gives you both.
❌ OCR at 300 DPI → Poor accuracy, small files
❌ OCR at 600 DPI → Great accuracy, huge files
✅ OCR at 600 DPI → Downsample to 300 DPI → Great accuracy, small files
OCR always runs on the highest resolution images for maximum accuracy. Downsampling happens AFTER OCR is complete, preserving text recognition quality while reducing file size.
- 🧠 Smart DPI Handling: Auto-detects DPI from image EXIF metadata
- 📉 Post-OCR Downsampling: OCR on full resolution, output at any target DPI
- 🗜️ JPEG2000 Support: Native JP2/JPX embedding via PyMuPDF for superior compression
- 📁 Flexible Input: Process PDFs or folders of images
- 🔤 Natural Sorting: Images sorted correctly (page-2 before page-10)
- 💾 Memory Efficient: Configurable batch processing
- 📊 Rich Logging: Detailed progress with CPU/memory monitoring
# Clone the repository
git clone https://github.com/yourusername/morphic.git
cd morphic
# Install dependencies
pip install -r requirements.txt
# Install poppler for PDF processing
# macOS:
brew install poppler
# Ubuntu/Debian:
sudo apt-get install poppler-utils
UV is a fast Python package installer written in Rust.
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone repository
git clone https://github.com/yourusername/morphic.git
cd morphic
# Install dependencies (much faster!)
uv pip install -r requirements.txt
# Install poppler (same as above)
brew install poppler # macOS
# or
sudo apt-get install poppler-utils # Linux
See UV_INSTALL.md for detailed UV setup and advanced features.
# Show quick help (no installation needed)
python morphic.py
# Show full help
python morphic.py --help
# Maximum quality (no downsampling)
python morphic.py \
--input-pdf-file scan.pdf \
--output-pdf-file output.pdf \
--source-dpi 600 \
--output-pdf-dpi 600
# Web-optimized (OCR at 600, output at 300)
python morphic.py \
--input-pdf-file scan.pdf \
--output-pdf-file web.pdf \
--source-dpi 600 \
--output-pdf-dpi 300 \
--output-pdf-images-format jp2
When you run python morphic.py with no arguments, you'll see:
╔══════════════════════════════════════════════════════════╗
║ MORPHIC ║
║ Intelligent OCR with Downsampling ║
╚══════════════════════════════════════════════════════════╝
Usage: python morphic.py [OPTIONS]
Required: Choose ONE input source
--input-pdf-file PATH OCR a PDF file
--input-image-folder PATH OCR a folder of images
...
# Master archive (full quality)
python morphic.py \
--input-image-folder ./scans/ \
--output-pdf-file master_600dpi.pdf \
--output-pdf-dpi 600 \
--output-pdf-images-format jp2
# Web distribution (smaller, same OCR quality)
python morphic.py \
--input-image-folder ./scans/ \
--output-pdf-file web_300dpi.pdf \
--output-pdf-dpi 300 \
--output-pdf-images-format jp2
# Email-friendly (tiny, same OCR quality)
python morphic.py \
--input-image-folder ./scans/ \
--output-pdf-file email_150dpi.pdf \
--output-pdf-dpi 150 \
--output-pdf-images-format jpeg
All three PDFs have identical OCR quality - only the embedded image resolution differs!
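If you produce these three variants regularly, the commands above can be generated from a small table instead of being typed by hand. The sketch below only uses the flags shown in this README; the helper name `morphic_command` and the `VARIANTS` table are illustrative and not part of Morphic. Each run repeats the OCR step, so expect three full OCR passes.

```python
import shlex


def morphic_command(input_folder, output_file, output_dpi, image_format):
    """Build the argv list for one Morphic run over a folder of images."""
    return [
        "python", "morphic.py",
        "--input-image-folder", input_folder,
        "--output-pdf-file", output_file,
        "--output-pdf-dpi", str(output_dpi),
        "--output-pdf-images-format", image_format,
    ]


# (output file, output DPI, image format) for each variant listed above.
VARIANTS = [
    ("master_600dpi.pdf", 600, "jp2"),
    ("web_300dpi.pdf", 300, "jp2"),
    ("email_150dpi.pdf", 150, "jpeg"),
]

commands = [
    morphic_command("./scans/", out, dpi, fmt) for out, dpi, fmt in VARIANTS
]

if __name__ == "__main__":
    for argv in commands:
        # Print a copy-pasteable shell command for each variant.
        print(shlex.join(argv))
        # To actually run it: subprocess.run(argv, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when folder or file names contain spaces.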
# Your scanner produced a 500MB PDF at 600 DPI
python morphic.py \
--input-pdf-file huge_scan.pdf \
--output-pdf-file optimized.pdf \
--source-dpi 600 \
--output-pdf-dpi 300 \
--output-pdf-images-format jp2
# Result: ~125MB file (about a quarter of the pixels) with OCR accuracy unchanged
# Images have DPI in EXIF - Morphic auto-detects!
python morphic.py \
--input-image-folder ./photos/ \
--output-pdf-file result.pdf \
--output-pdf-dpi 300
# Images without EXIF default to 600 DPI
- `--input-pdf-file PATH` - Input PDF file to OCR
- `--input-image-folder PATH` - Folder of page images (auto-sorted)
- `--output-pdf-file PATH` - Where to save searchable PDF
- `--source-dpi INT` - DPI for OCR processing (default: 600)
  - PDFs: Rasterization resolution
  - Images: Fallback if no EXIF DPI (auto-detected when available)
- `--output-pdf-dpi INT` - Target DPI for output (default: same as source)
  - Lower than source = downsampling for smaller files
  - Same as source = no downsampling
- `--output-pdf-images-format {jp2,jpx,png,jpeg}` - Image codec (default: jp2)
  - `jp2`/`jpx`: JPEG2000 - best for high-DPI (recommended)
  - `png`: Lossless, larger files
  - `jpeg`: Lossy, smallest files
- `--page-queue-depth INT` - Pages in memory at once (1-10, default: 5)
  - Lower = less RAM
  - Higher = faster on powerful systems
- `--debug` - Verbose logging with resource monitoring
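To see why a lower `--page-queue-depth` means less RAM, it can help to picture a bounded queue between the stage that loads pages and the stage that processes them. The following is a minimal sketch of that idea using only the standard library. It is not Morphic's actual implementation, and `process_pages` and `handle_page` are hypothetical names.

```python
import queue
import threading


def process_pages(pages, handle_page, queue_depth=5):
    """Apply handle_page to each page, queueing at most queue_depth pages."""
    if not 1 <= queue_depth <= 10:
        raise ValueError("queue_depth must be between 1 and 10")

    page_queue = queue.Queue(maxsize=queue_depth)
    done = object()  # sentinel telling the consumer to stop
    results = []
    errors = []

    def consumer():
        while True:
            item = page_queue.get()
            if item is done:
                break
            if errors:
                continue  # keep draining so the producer never blocks forever
            try:
                results.append(handle_page(item))
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=consumer)
    worker.start()
    for page in pages:
        # put() blocks while the queue is full, which caps memory use.
        page_queue.put(page)
    page_queue.put(done)
    worker.join()

    if errors:
        raise errors[0]
    return results
```

With a single consumer the results come back in page order. A smaller `queue_depth` keeps fewer pages waiting in memory, at the cost of the producer stalling more often.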
┌─────────────────┐
│ Input Source │ (PDF or images)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Auto-Detect DPI │ (from EXIF or --source-dpi)
└────────┬────────┘
│
▼
┌─────────────────┐
│ EasyOCR at │ (Always maximum resolution)
│ Native/Max DPI │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Downsample │ (If output-dpi < source-dpi)
│ (Optional) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Embed in PDF │ (JPEG2000 or PNG/JPEG)
│ + OCR Text │ (Coordinates auto-scaled)
└─────────────────┘
When downsampling, OCR bounding boxes are automatically scaled:
# OCR detects text at 600 DPI: bbox = (1200, 800)
# Output is 300 DPI: scale_factor = 300/600 = 0.5
# Final coords: (600, 400)
Text remains perfectly aligned with downsampled images!
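The scaling step can be written as a small function. This is a sketch of the arithmetic only, assuming the bounding box is a list of `(x, y)` corner points in source-image pixels (the shape EasyOCR returns). The name `scale_points` is illustrative, not Morphic's API.

```python
def scale_points(points, source_dpi, output_dpi):
    """Scale (x, y) pixel coordinates from source_dpi to output_dpi."""
    if source_dpi <= 0 or output_dpi <= 0:
        raise ValueError("DPI values must be positive")
    scale = output_dpi / source_dpi
    return [(x * scale, y * scale) for x, y in points]


# One corner detected at 600 DPI, mapped onto a 300 DPI output image.
print(scale_points([(1200, 800)], source_dpi=600, output_dpi=300))
# -> [(600.0, 400.0)]
```

Because every corner is multiplied by the same factor, the box keeps its shape and its position relative to the page.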
For a 200-page 8×10" book:
| Configuration | File Size | OCR Quality | Use Case |
|---|---|---|---|
| 600 DPI JP2 | ~800 MB | ⭐⭐⭐⭐⭐ | Master archive |
| 300 DPI JP2 | ~200 MB | ⭐⭐⭐⭐⭐ | Web distribution |
| 150 DPI JPEG | ~50 MB | ⭐⭐⭐⭐⭐ | Email attachments |
Note: OCR quality is identical in all cases - only image resolution differs!
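The sizes in the table follow roughly from pixel counts: halving the DPI halves both the width and the height, so the pixel count drops to a quarter. The sketch below reproduces that arithmetic for an 8×10" page. It is a rough estimate that ignores codec differences (the 150 DPI row uses JPEG rather than JP2), and the function names are illustrative.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel count of a page scanned at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


def estimated_size_mb(base_size_mb, base_dpi, target_dpi):
    """Scale a known file size by the ratio of pixel counts."""
    return base_size_mb * (target_dpi / base_dpi) ** 2


for dpi in (600, 300, 150):
    pixels = page_pixels(8, 10, dpi)
    size = estimated_size_mb(800, 600, dpi)
    print(f"{dpi} DPI: {pixels:,} pixels/page, ~{size:.0f} MB")
# 600 DPI: 28,800,000 pixels/page, ~800 MB
# 300 DPI: 7,200,000 pixels/page, ~200 MB
# 150 DPI: 1,800,000 pixels/page, ~50 MB
```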
- OCR Speed: ~3-5 seconds per page (GPU) or ~10-15 seconds (CPU)
- Memory Usage: ~500MB - 1GB with default queue depth
- Compression: JPEG2000 at 600 DPI ≈ 60-70% size of PNG
JPEG2000 (JP2/JPX) provides superior compression for high-DPI scans:
- 2-4× smaller than PNG at equivalent quality
- Native PDF support via `JPXDecode` filter
- No quality loss from re-encoding (unlike reportlab)
Morphic checks image EXIF data for DPI:
# Pillow extracts DPI from EXIF tags
img.info['dpi'] # e.g., (600, 600)
If there is no EXIF data, Morphic falls back to --source-dpi.
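The detect-then-fall-back logic can be sketched as follows. This is an illustration, not Morphic's code. It assumes Pillow is installed for `detect_dpi`, and both function names are hypothetical. Pillow fills `img.info["dpi"]` from whatever resolution metadata the file carries, and the key is simply absent when there is none.

```python
def resolve_dpi(info, fallback_dpi=600):
    """Return (x_dpi, y_dpi) from a Pillow-style info mapping, else the fallback."""
    fallback = (float(fallback_dpi), float(fallback_dpi))
    dpi = info.get("dpi")
    if dpi is None:
        return fallback
    try:
        x_dpi, y_dpi = (float(value) for value in dpi)
    except (TypeError, ValueError):
        return fallback
    # Reject zero, negative, or NaN values that some scanners write.
    if not (x_dpi > 0 and y_dpi > 0):
        return fallback
    return (x_dpi, y_dpi)


def detect_dpi(path, fallback_dpi=600):
    """Open an image with Pillow and resolve its DPI (requires Pillow)."""
    from PIL import Image  # imported lazily so resolve_dpi works without Pillow

    with Image.open(path) as img:
        return resolve_dpi(img.info, fallback_dpi)
```

Splitting the logic this way keeps the fallback rules testable without any image files: `resolve_dpi({})` returns `(600.0, 600.0)`, matching the `--source-dpi` default.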
Confirm JPEG2000 is properly embedded:
qpdf --stream-data=uncompress output.pdf - | grep -a JPXDecode
You should see JPXDecode filters in the PDF structure.
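If qpdf is not available, a rough check is possible from Python by scanning the raw PDF bytes for the `/JPXDecode` filter name. This is a heuristic sketch using only the standard library, not a PDF parser. It relies on image stream dictionaries being stored uncompressed in the file, and the function names are illustrative.

```python
import re

# Match the PDF name /JPXDecode, but not a longer name that merely starts with it.
_JPX_FILTER = re.compile(rb"/JPXDecode(?![A-Za-z0-9])")


def count_jpx_filters(pdf_bytes):
    """Count occurrences of the /JPXDecode filter name in raw PDF bytes."""
    return len(_JPX_FILTER.findall(pdf_bytes))


def pdf_has_jpeg2000(path):
    """Return True if the file at path appears to embed JPEG2000 images."""
    with open(path, "rb") as handle:
        return count_jpx_filters(handle.read()) > 0
```

A count of zero for a file written with `--output-pdf-images-format jp2` suggests the images were re-encoded with another codec and is worth confirming with qpdf.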
- GPU mode: Check CUDA installation
- Auto-fallback: Tool switches to CPU automatically
- PDF specification doesn't support WebP codec
- Use `jp2`, `png`, or `jpeg` instead
- Reduce `--page-queue-depth` to 2 or 3
- Lower `--source-dpi` if OCR quality permits
- Images use natural sort (1, 2, 10 not 1, 10, 2)
- Zero-pad filenames if needed: `page-001.png`
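For reference, natural sorting is commonly done with a sort key that splits each filename into text and number runs and compares the numbers as integers. The sketch below shows that common approach. It is not necessarily how Morphic implements it, and `natural_key` is an illustrative name.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically and other text case-insensitively."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indexes.
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


files = ["page-10.png", "page-2.png", "page-1.png"]
print(sorted(files))                   # plain string sort
# -> ['page-1.png', 'page-10.png', 'page-2.png']
print(sorted(files, key=natural_key))  # natural sort
# -> ['page-1.png', 'page-2.png', 'page-10.png']
```

Zero-padded names such as `page-001.png` sort the same way under both methods, which is why padding is a safe fallback.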
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
MIT License - see LICENSE file for details.
Built with:
- EasyOCR - Neural network OCR
- PyMuPDF - PDF manipulation
- pdf2image - PDF to image conversion
- Pillow - Image processing
Questions? Open an issue or reach out!
Happy OCR'ing! 🔮