
High-fidelity OCR + pre-RAG pipeline processor featuring: 1.) Tesseract OCR 2.) Built-in cross-line dehyphenation + real word verification 3.) Support for TIFF series, & JPEG2000 (jpx) for hi-fidelity pdf sources with logistically significant size savings. Morphic assists in pre-RAG PDF prep for analysis, large-scale ingest & agentic analysis


Node0/morphic


🔮 Morphic

Version 0.1

Intelligent OCR with Post-Processing Downsampling

Transform scanned documents into searchable PDFs with maximum OCR accuracy and a flexible choice of output resolution.

Python 3.11+ · License: MIT


✨ What Makes Morphic Different?

Traditional OCR tools force you to choose between accuracy and file size. Morphic gives you both.

The Problem

❌ OCR at 300 DPI → Poor accuracy, small files
❌ OCR at 600 DPI → Great accuracy, huge files

The Morphic Solution

✅ OCR at 600 DPI → Downsample to 300 DPI → Great accuracy, small files

OCR always runs on the highest resolution images for maximum accuracy. Downsampling happens AFTER OCR is complete, preserving text recognition quality while reducing file size.


🎯 Key Features

  • 🧠 Smart DPI Handling: Auto-detects DPI from image EXIF metadata
  • 📉 Post-OCR Downsampling: OCR on full resolution, output at any target DPI
  • 🗜️ JPEG2000 Support: Native JP2/JPX embedding via PyMuPDF for superior compression
  • 📁 Flexible Input: Process PDFs or folders of images
  • 🔤 Natural Sorting: Images sorted correctly (page-2 before page-10)
  • 💾 Memory Efficient: Configurable batch processing
  • 📊 Rich Logging: Detailed progress with CPU/memory monitoring

🚀 Quick Start

Installation

Option 1: Traditional pip

# Clone the repository
git clone https://github.com/Node0/morphic.git
cd morphic

# Install dependencies
pip install -r requirements.txt

# Install poppler for PDF processing
# macOS:
brew install poppler

# Ubuntu/Debian:
sudo apt-get install poppler-utils

Option 2: UV (10-100× Faster!) ⚡

UV is a fast Python package installer written in Rust.

# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/Node0/morphic.git
cd morphic

# Install dependencies (much faster!)
uv pip install -r requirements.txt

# Install poppler (same as above)
brew install poppler  # macOS
# or
sudo apt-get install poppler-utils  # Linux

See UV_INSTALL.md for detailed UV setup and advanced features.

Basic Usage

# Show quick help (no installation needed)
python morphic.py

# Show full help
python morphic.py --help

# Maximum quality (no downsampling)
python morphic.py \
  --input-pdf-file scan.pdf \
  --output-pdf-file output.pdf \
  --source-dpi 600 \
  --output-pdf-dpi 600

# Web-optimized (OCR at 600, output at 300)
python morphic.py \
  --input-pdf-file scan.pdf \
  --output-pdf-file web.pdf \
  --source-dpi 600 \
  --output-pdf-dpi 300 \
  --output-pdf-images-format jp2

When you run python morphic.py with no arguments, you'll see:

╔══════════════════════════════════════════════════════════╗
║                    MORPHIC                               ║
║          Intelligent OCR with Downsampling               ║
╚══════════════════════════════════════════════════════════╝

Usage: python morphic.py [OPTIONS]

Required: Choose ONE input source
  --input-pdf-file PATH         OCR a PDF file
  --input-image-folder PATH     OCR a folder of images
...

📖 Use Cases

1. Create Multiple Versions from High-Res Scans

# Master archive (full quality)
python morphic.py \
  --input-image-folder ./scans/ \
  --output-pdf-file master_600dpi.pdf \
  --output-pdf-dpi 600 \
  --output-pdf-images-format jp2

# Web distribution (smaller, same OCR quality)
python morphic.py \
  --input-image-folder ./scans/ \
  --output-pdf-file web_300dpi.pdf \
  --output-pdf-dpi 300 \
  --output-pdf-images-format jp2

# Email-friendly (tiny, same OCR quality)
python morphic.py \
  --input-image-folder ./scans/ \
  --output-pdf-file email_150dpi.pdf \
  --output-pdf-dpi 150 \
  --output-pdf-images-format jpeg

All three PDFs have identical OCR quality - only the embedded image resolution differs!

2. Optimize Existing PDF Scans

# Your scanner produced a 500MB PDF at 600 DPI
python morphic.py \
  --input-pdf-file huge_scan.pdf \
  --output-pdf-file optimized.pdf \
  --source-dpi 600 \
  --output-pdf-dpi 300 \
  --output-pdf-images-format jp2

# Result: roughly 125 MB (a quarter of the pixels), with OCR text still taken from the 600 DPI pass

3. Process Folder of Mixed-DPI Images

# Images have DPI in EXIF - Morphic auto-detects!
python morphic.py \
  --input-image-folder ./photos/ \
  --output-pdf-file result.pdf \
  --output-pdf-dpi 300

# Images without EXIF default to 600 DPI

🎛️ Command-Line Options

Input (Required - Pick One)

  • --input-pdf-file PATH - Input PDF file to OCR
  • --input-image-folder PATH - Folder of page images (auto-sorted)

Output (Required)

  • --output-pdf-file PATH - Where to save searchable PDF

DPI Control (Optional)

  • --source-dpi INT - DPI for OCR processing (default: 600)

    • PDFs: Rasterization resolution
    • Images: Fallback if no EXIF DPI (auto-detected when available)
  • --output-pdf-dpi INT - Target DPI for output (default: same as source)

    • Lower than source = downsampling for smaller files
    • Same as source = no downsampling

Format & Performance

  • --output-pdf-images-format {jp2,jpx,png,jpeg} - Image codec (default: jp2)

    • jp2/jpx: JPEG2000 - best for high-DPI (recommended)
    • png: Lossless, larger files
    • jpeg: Lossy, smallest files
  • --page-queue-depth INT - Pages in memory at once (1-10, default: 5)

    • Lower = less RAM
    • Higher = faster on powerful systems
  • --debug - Verbose logging with resource monitoring
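The options above could be declared with `argparse` roughly as follows. This is an illustrative sketch only: the flag names and defaults come from this README, but `build_parser` is a hypothetical name, and the sketch does not reproduce the quick-help banner Morphic prints when run without arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="morphic.py")

    # Exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input-pdf-file", metavar="PATH")
    source.add_argument("--input-image-folder", metavar="PATH")

    parser.add_argument("--output-pdf-file", metavar="PATH", required=True)
    parser.add_argument("--source-dpi", type=int, default=600)
    # None means "same as source"; resolved after parsing.
    parser.add_argument("--output-pdf-dpi", type=int, default=None)
    parser.add_argument(
        "--output-pdf-images-format",
        choices=["jp2", "jpx", "png", "jpeg"],
        default="jp2",
    )
    parser.add_argument(
        "--page-queue-depth", type=int, choices=range(1, 11), default=5, metavar="INT"
    )
    parser.add_argument("--debug", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--input-pdf-file", "scan.pdf", "--output-pdf-file", "out.pdf", "--output-pdf-dpi", "300"]
)
output_dpi = args.output_pdf_dpi or args.source_dpi
print(args.source_dpi, output_dpi, args.output_pdf_images_format)  # 600 300 jp2
```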


📐 How It Works

The Workflow

┌─────────────────┐
│  Input Source   │ (PDF or images)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Detect DPI │ (from EXIF or --source-dpi)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   EasyOCR at    │ (Always maximum resolution)
│  Native/Max DPI │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Downsample    │ (If output-dpi < source-dpi)
│  (Optional)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Embed in PDF   │ (JPEG2000 or PNG/JPEG)
│  + OCR Text     │ (Coordinates auto-scaled)
└─────────────────┘

Coordinate Scaling

When downsampling, OCR bounding boxes are automatically scaled:

# OCR detects text at 600 DPI: bbox = (1200, 800)
# Output is 300 DPI: scale_factor = 300/600 = 0.5
# Final coords: (600, 400)

Text remains perfectly aligned with downsampled images!
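The scaling described above can be sketched in a few lines of Python. The helper names `scale_bbox` and `pixels_to_points` are hypothetical, and whether Morphic works in pixels or PDF points internally is an assumption; the only fixed fact used here is that PDF user space defaults to 72 units per inch.

```python
def scale_bbox(bbox, source_dpi, output_dpi):
    """Scale pixel coordinates from the OCR resolution to the output resolution."""
    scale = output_dpi / source_dpi
    return tuple(coord * scale for coord in bbox)


def pixels_to_points(value_px, dpi):
    """Convert a pixel offset to PDF points (72 points per inch)."""
    return value_px * 72.0 / dpi


ocr_xy = (1200, 800)                      # detected at 600 DPI
out_xy = scale_bbox(ocr_xy, 600, 300)
print(out_xy)                             # (600.0, 400.0)

# The physical position on the page is the same before and after scaling.
print(pixels_to_points(ocr_xy[0], 600))   # 144.0
print(pixels_to_points(out_xy[0], 300))   # 144.0
```

Because both resolutions map to the same point on the page, the text layer stays aligned with the downsampled image.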


📊 Expected Results

File Size Comparison

For a 200-page 8×10" book:

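| Configuration | File Size | OCR Quality | Use Case          |
|---------------|-----------|-------------|-------------------|
| 600 DPI JP2   | ~800 MB   | ⭐⭐⭐⭐⭐  | Master archive    |
| 300 DPI JP2   | ~200 MB   | ⭐⭐⭐⭐⭐  | Web distribution  |
| 150 DPI JPEG  | ~50 MB    | ⭐⭐⭐⭐⭐  | Email attachments |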

Note: OCR quality is identical in all cases - only image resolution differs!
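The sizes in the table follow a simple first-order rule: pixel count, and so approximate file size, scales with the square of the DPI ratio. The sketch below shows that arithmetic. `estimate_size_mb` is a hypothetical helper, and the estimate ignores codec and page-content differences, so treat its results as rough guides only.

```python
def estimate_size_mb(size_mb: float, source_dpi: int, output_dpi: int) -> float:
    """Rough size estimate: scale by the square of the DPI ratio."""
    return size_mb * (output_dpi / source_dpi) ** 2


print(estimate_size_mb(800, 600, 300))  # 200.0
print(estimate_size_mb(800, 600, 150))  # 50.0
```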

Performance

  • OCR Speed: ~3-5 seconds per page (GPU) or ~10-15 seconds (CPU)
  • Memory Usage: ~500MB - 1GB with default queue depth
  • Compression: JPEG2000 at 600 DPI ≈ 60-70% of the size of the equivalent PNG

🔧 Technical Details

Why JPEG2000?

JPEG2000 (JP2/JPX) provides superior compression for high-DPI scans:

  • 2-4× smaller than PNG at equivalent quality
  • Native PDF support via JPXDecode filter
  • No quality loss from re-encoding (unlike reportlab)

DPI Auto-Detection

Morphic checks image EXIF data for DPI:

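# Pillow exposes resolution metadata, when present, under the "dpi" key
from PIL import Image
img = Image.open("page-001.png")  # example filename
img.info.get("dpi")  # e.g., (600, 600); None if the file carries no DPI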

If an image carries no DPI metadata, Morphic falls back to --source-dpi.
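The fallback logic can be sketched as a small function. It takes a plain mapping shaped like Pillow's `Image.info`, so it runs without Pillow installed. The name `detect_dpi` and the choice to treat zero or malformed values as missing are assumptions of this sketch, not confirmed Morphic behavior.

```python
import math


def detect_dpi(info: dict, fallback_dpi: int = 600) -> int:
    """Pick a DPI from a Pillow-style `Image.info` mapping, or fall back."""
    dpi = info.get("dpi")
    if dpi is None:
        return fallback_dpi
    try:
        # Pillow reports (x_dpi, y_dpi); this sketch uses the horizontal value.
        x_dpi = float(dpi[0])
    except (TypeError, ValueError, IndexError):
        return fallback_dpi
    if not math.isfinite(x_dpi) or x_dpi <= 0:
        # Assumption: unusable values are treated like missing metadata.
        return fallback_dpi
    return round(x_dpi)


print(detect_dpi({"dpi": (300, 300)}))  # 300
print(detect_dpi({}))                   # 600
```

In real use the mapping would come from `Image.open(path).info`.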

Verification

Confirm JPEG2000 is properly embedded:

qpdf --stream-data=uncompress output.pdf - | grep -a JPXDecode

You should see JPXDecode filters in the PDF structure.
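If qpdf is not available, a rough check is possible from Python by scanning the raw file bytes. This is only a sketch and not a PDF parser: `count_jpx_filters` is a hypothetical helper, it assumes the filter name is written literally in each image stream's dictionary, and it reads the whole file into memory, which may be impractical for very large scans.

```python
import tempfile
from pathlib import Path


def count_jpx_filters(pdf_path) -> int:
    """Count literal occurrences of the /JPXDecode filter name in a file."""
    data = Path(pdf_path).read_bytes()
    return data.count(b"/JPXDecode")


# Demonstration on stand-in bytes (not a valid PDF, just the relevant markers).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(
        b"<< /Type /XObject /Filter /JPXDecode >>\n"
        b"<< /Type /XObject /Filter /DCTDecode >>\n"
    )
    print(count_jpx_filters(sample))  # 1
```

A count of zero on a file produced with the jp2 or jpx format suggests the images were not embedded as JPEG2000.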


🐛 Troubleshooting

"EasyOCR failed to initialize"

  • GPU mode: Check CUDA installation
  • Auto-fallback: Tool switches to CPU automatically

"WebP is not supported"

  • The PDF specification does not support the WebP codec
  • Use jp2, png, or jpeg instead

"Out of memory"

  • Reduce --page-queue-depth to 2 or 3
  • Lower --source-dpi if OCR quality permits
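To illustrate why a smaller queue depth lowers memory use, here is a minimal sketch of bounded batching: at most `queue_depth` pages are held at a time. The name `batched_pages` is hypothetical, and how Morphic actually schedules pages is not shown in this README.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched_pages(pages: Iterable[T], queue_depth: int = 5) -> Iterator[list[T]]:
    """Yield pages in lists of at most `queue_depth` items."""
    if not 1 <= queue_depth <= 10:
        raise ValueError("queue_depth must be between 1 and 10")
    iterator = iter(pages)
    # Pull the next batch lazily; stop when the source is exhausted.
    while batch := list(islice(iterator, queue_depth)):
        yield batch


for batch in batched_pages(range(1, 8), queue_depth=3):
    print(batch)
# [1, 2, 3]
# [4, 5, 6]
# [7]
```

Because pages are pulled lazily, only one batch needs to be resident while it is processed.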

"Pages out of order"

  • Images use natural sort (1, 2, 10 not 1, 10, 2)
  • Zero-pad filenames if needed: page-001.png
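For reference, natural sorting can be implemented with a sort key that compares digit runs as integers. This is a common technique rather than Morphic's confirmed implementation, and `natural_key` is a hypothetical name.

```python
import re


def natural_key(name: str) -> list:
    """Split a name into text and number chunks so digit runs sort numerically."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd indexes hold the digit runs.
    return [int(part) if index % 2 else part.lower() for index, part in enumerate(parts)]


files = ["page-10.png", "page-2.png", "page-1.png"]
print(sorted(files))                   # ['page-1.png', 'page-10.png', 'page-2.png']
print(sorted(files, key=natural_key))  # ['page-1.png', 'page-2.png', 'page-10.png']
```

Zero-padded names such as page-001.png sort the same way under both orderings, which is why padding is a safe fallback.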

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

📄 License

MIT License - see LICENSE file for details.


🙏 Acknowledgments

Built with:


📬 Contact

Questions? Open an issue or reach out!

Happy OCR'ing! 🔮
