Docket Analyzer OCR

Installation

pip install 'docketanalyzer[ocr]'

Local Usage

Process a document:

from docketanalyzer.ocr import pdf_document

path = 'path/to/doc.pdf'
doc = pdf_document(path) # the input can also be raw bytes
doc.process()

for page in doc:
    for block in page:
        for line in block:
            pass

You can also stream pages as they are processed:

doc = pdf_document(path)

for page in doc.stream():
    print(page.text)
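
Because pages arrive one at a time, a consumer can stop reading as soon as it has what it needs. The following is a minimal sketch, assuming only that each streamed page exposes `.text` as described in this README; `first_page_containing` is a hypothetical helper and not part of docketanalyzer.

```python
def first_page_containing(pages, needle):
    """Return the first page whose text contains `needle`, or None.

    `pages` is any iterable of objects with a `.text` attribute, such as
    the pages yielded by `doc.stream()`. Hypothetical helper, not part of
    docketanalyzer.
    """
    needle = needle.lower()
    for page in pages:
        # Case-insensitive substring match on the page text.
        if needle in page.text.lower():
            return page  # stop consuming the iterable here
    return None
```

It could be called as `first_page_containing(doc.stream(), 'motion to dismiss')`. Whether abandoning the stream early also stops processing of the remaining pages is not stated here, so treat that as something to check.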

Pages, blocks, and lines have common attributes:

# where item is a page, block, or line

item.data # A dictionary representation of the item and its children
item.text # The item's text content
item.page_num # The page the item appears on
item.i # The item-level index
item.id # A unique id built from the indices of the item and its parents (e.g. 3-2-1 for the first line in the second block on the third page).
item.bbox # Bounding box (blocks and lines only)
item.clip() # Extract element as an image from the original pdf
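
Since the id encodes the position of an item in the page/block/line hierarchy, it can be split back into its components. This is a small sketch that assumes ids are dash-separated integers exactly as in the `3-2-1` example, and that a page id has one component and a block id two; `parse_item_id` and `parent_id` are hypothetical helpers, not library functions.

```python
def parse_item_id(item_id):
    """Split an id such as '3-2-1' into a tuple of integer indices.

    Assumes the dash-separated integer format shown in the README example.
    """
    return tuple(int(part) for part in item_id.split("-"))


def parent_id(item_id):
    """Return the id of the enclosing item, or None for a top-level id."""
    parts = item_id.split("-")
    # Dropping the last component yields the parent (line -> block -> page).
    return "-".join(parts[:-1]) if len(parts) > 1 else None
```

For example, `parent_id(line.id)` would give the id of the block containing a line, under the assumptions above.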

Blocks also have a block type attribute:

print(block.block_type) # 'title', 'text', 'figure', etc.
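
One way to use the block type is to group a document's text by it, for instance to pull out all titles. This is a minimal sketch relying only on the iteration pattern and the `block_type` and `text` attributes shown above; `text_by_block_type` is a hypothetical helper.

```python
from collections import defaultdict


def text_by_block_type(doc):
    """Group block text by block type, preserving document order.

    `doc` is any iterable of pages, each an iterable of blocks that expose
    `.block_type` and `.text`. Hypothetical helper, not part of docketanalyzer.
    """
    grouped = defaultdict(list)
    for page in doc:
        for block in page:
            grouped[block.block_type].append(block.text)
    return dict(grouped)
```

After `doc.process()`, something like `text_by_block_type(doc).get('title', [])` would list the text of every block labelled as a title.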

Save and load data:

# Saving a document
doc.save('doc.json')

# Loading a document
doc = pdf_document(path, load='doc.json')
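
Saving and loading can be combined into a simple cache so that a document is only processed once. The sketch below assumes that loading from JSON returns an already processed document, as the example above suggests. `load_or_process` is a hypothetical helper, and it takes `pdf_document` as an argument purely so the sketch does not depend on the package being installed.

```python
from pathlib import Path


def load_or_process(pdf_path, cache_path, pdf_document):
    """Load a saved document if the cache exists, else process and save it.

    `pdf_document` is the factory imported from docketanalyzer.ocr (passed in
    here so the helper stays import-free). Hypothetical helper.
    """
    cache_path = str(cache_path)
    if Path(cache_path).exists():
        # Reuse earlier results instead of running OCR again.
        return pdf_document(pdf_path, load=cache_path)
    doc = pdf_document(pdf_path)
    doc.process()
    doc.save(cache_path)
    return doc
```

A call might look like `doc = load_or_process(path, 'doc.json', pdf_document)`.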

Remote Usage

You can also serve this tool with Docker.

docker pull nadahlberg/docketanalyzer-ocr:latest
docker run --gpus all -p 8000:8000 nadahlberg/docketanalyzer-ocr:latest

Then process the document in remote mode:

doc = pdf_document(path, remote=True) # pass endpoint_url if not using localhost

for page in doc.stream():
    print(page.text)

S3 Support

When using the remote service, if you want to avoid sending the file in a POST request, configure your S3 credentials. Your document will be temporarily pushed to your bucket to be retrieved by the service.

To configure your S3 credentials run:

da configure s3

Or set the following in your env:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_S3_BUCKET_NAME
AWS_S3_ENDPOINT_URL

Usage is identical. We default to using S3 if credentials are available. You can control this explicitly by passing use_s3=False to pdf_document.
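
To make the S3 decision explicit rather than relying on the default, you can check the environment yourself before creating the document. This sketch assumes all four variables listed above are required; how the library itself detects credentials is not described here. `missing_s3_env` is a hypothetical helper.

```python
import os

# Variable names taken from the list above.
S3_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET_NAME",
    "AWS_S3_ENDPOINT_URL",
)


def missing_s3_env(environ=None):
    """Return the S3 variable names that are unset or empty.

    Assumes every variable in S3_ENV_VARS is required. Hypothetical helper,
    not part of docketanalyzer.
    """
    environ = os.environ if environ is None else environ
    return [name for name in S3_ENV_VARS if not environ.get(name)]
```

The result can then drive the documented flag, for example `pdf_document(path, remote=True, use_s3=not missing_s3_env())`.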

Serverless Support

For serverless usage you can deploy this to RunPod. To get set up:

  1. Create a serverless worker on RunPod using the Docker container:
     nadahlberg/docketanalyzer-ocr:latest
  2. Add the following custom run command:
     python -u handler.py
  3. Add your S3 credentials to the RunPod worker:
     AWS_ACCESS_KEY_ID
     AWS_SECRET_ACCESS_KEY
     AWS_S3_BUCKET_NAME
     AWS_S3_ENDPOINT_URL
  4. On your local machine, configure your RunPod key and the worker id.

You can run:

da configure runpod

Or set the following in your env:

RUNPOD_API_KEY
RUNPOD_OCR_ENDPOINT_ID

Usage is otherwise identical: pass remote=True to pdf_document.
