Better document that setting an ocr_engine means that if no text layer is found, it will use this engine, regardless of force_ocr #335

@duckduckgrayduck

While testing some improvements to the python-documentcloud library, I discovered unintended behavior that should be fixed.

2025-09-22 13:55:08,772 INFO squarelet request: post - documents/process/ - {'json': [{'id': 26105881, 'force_ocr': False, 'ocr_engine': 'textract'}, {'id': 26105882, 'force_ocr': False, 'ocr_engine': 'textract'}]}
2025-09-22 13:55:08,774 DEBUG urllib3.connectionpool Starting new HTTPS connection (1): api.www.documentcloud.org:443
2025-09-22 13:55:09,088 DEBUG urllib3.connectionpool https://api.www.documentcloud.org:443 "POST /api/documents/process/ HTTP/1.1" 200 None
2025-09-22 13:55:09,088 DEBUG squarelet response: 200 - b'"OK"'
2025-09-22 13:55:09,094 INFO documentcloud Process payload: [{'id': 26105881, 'force_ocr': False, 'ocr_engine': 'textract'}, {'id': 26105882, 'force_ocr': False, 'ocr_engine': 'textract'}]
2025-09-22 13:55:09,094 INFO documentcloud Upload directory complete

This document, https://www.documentcloud.org/documents/26105882-test/, was submitted with force_ocr: False and an ocr_engine specified. Because the document had no text layer, DocumentCloud ran OCR anyway, using the specified engine. Our policy is to run OCR whenever no text layer is detected, but since force_ocr was explicitly set to False, that fallback should use tesseract and ignore the provided ocr_engine.
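
To make the proposed rule concrete, here is a minimal sketch of the engine selection described above. It is not the backend's actual code: `choose_ocr_engine`, `has_text_layer`, and `DEFAULT_ENGINE` are hypothetical names used only for illustration.

```python
# Hypothetical sketch of the proposed rule; not DocumentCloud's real backend code.
DEFAULT_ENGINE = "tesseract"  # fallback engine named in this issue


def choose_ocr_engine(has_text_layer, force_ocr, ocr_engine=None):
    """Return the engine to run, or None when no OCR is needed."""
    if force_ocr:
        # OCR was explicitly requested, so honor the requested engine.
        return ocr_engine or DEFAULT_ENGINE
    if not has_text_layer:
        # Policy fallback: OCR is required, but ocr_engine is ignored
        # because force_ocr was not set.
        return DEFAULT_ENGINE
    # A text layer exists and OCR was not forced.
    return None
```

Under this sketch, the request in the log above (no text layer, force_ocr False, ocr_engine 'textract') would resolve to tesseract.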

I am handling this in the python-documentcloud library so that it can't happen from the client side, but we should probably also fix this behavior on the backend.
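
For reference, a client-side guard could look roughly like the sketch below. This is an assumption about one possible approach, not the code that is in python-documentcloud; `sanitize_process_payload` is a hypothetical helper name. It drops ocr_engine from any entry that does not set force_ocr, so the backend never receives an engine it might apply unintentionally.

```python
def sanitize_process_payload(entries):
    """Hypothetical helper: strip ocr_engine from entries that do not force OCR."""
    cleaned = []
    for entry in entries:
        item = dict(entry)  # shallow copy so the caller's dicts are not mutated
        if not item.get("force_ocr", False):
            item.pop("ocr_engine", None)
        cleaned.append(item)
    return cleaned


# Illustrative payload: the second entry sets force_ocr to True here
# (unlike the log above) to show both branches.
payload = [
    {"id": 26105881, "force_ocr": False, "ocr_engine": "textract"},
    {"id": 26105882, "force_ocr": True, "ocr_engine": "textract"},
]
print(sanitize_process_payload(payload))
```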

Labels: documentation (Improvements or additions to documentation)