Better document that setting an ocr_engine means that if no text layer is found, it will use this engine, regardless of force_ocr

While testing some improvements to the python-documentcloud library, I discovered an unintended behavior that should be fixed.

2025-09-22 13:55:08,772 INFO     squarelet                 request: post - documents/process/ - {'json': [{'id': 26105881, 'force_ocr': False, 'ocr_engine': 'textract'}, {'id': 26105882, 'force_ocr': False, 'ocr_engine': 'textract'}]}
2025-09-22 13:55:08,774 DEBUG    urllib3.connectionpool    Starting new HTTPS connection (1): api.www.documentcloud.org:443
2025-09-22 13:55:09,088 DEBUG    urllib3.connectionpool    https://api.www.documentcloud.org:443 "POST /api/documents/process/ HTTP/1.1" 200 None
2025-09-22 13:55:09,088 DEBUG    squarelet                 response: 200 - b'"OK"'
2025-09-22 13:55:09,094 INFO     documentcloud             Process payload: [{'id': 26105881, 'force_ocr': False, 'ocr_engine': 'textract'}, {'id': 26105882, 'force_ocr': False, 'ocr_engine': 'textract'}]
2025-09-22 13:55:09,094 INFO     documentcloud             Upload directory complete


This document (https://www.documentcloud.org/documents/26105882-test/) was submitted with force_ocr: False and an ocr_engine specified. Because the document had no text layer, DocumentCloud ran OCR anyway, using the provided engine. Our policy is to run OCR if no text layer is detected, but in that case it should use tesseract and ignore the provided ocr_engine, since force_ocr was explicitly set to False.

I am handling this in the python-documentcloud library so that it can't happen, but we should probably also fix this behavior on the backend.
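
For illustration, here is a rough sketch of what a client-side guard could look like. This is not the actual python-documentcloud code: `build_process_payload` is a hypothetical helper name, and the sketch assumes that omitting `ocr_engine` from a payload entry makes the backend fall back to its default engine when no text layer is found.

```python
def build_process_payload(doc_ids, force_ocr=False, ocr_engine=None):
    """Hypothetical helper: build the JSON body for documents/process/.

    Only includes ocr_engine when force_ocr is True, so a non-forced
    request never carries an explicit engine choice.
    """
    payload = []
    for doc_id in doc_ids:
        entry = {"id": doc_id, "force_ocr": force_ocr}
        # Assumption: leaving ocr_engine out lets the backend use its default
        # engine if it decides to OCR a document that has no text layer.
        if force_ocr and ocr_engine is not None:
            entry["ocr_engine"] = ocr_engine
        payload.append(entry)
    return payload


# Same inputs as the logged request above, but the engine is dropped
# because force_ocr is False.
payload = build_process_payload([26105881, 26105882], force_ocr=False, ocr_engine="textract")
```

With `force_ocr=True`, the same call would keep `'ocr_engine': 'textract'` in each entry.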


Better document that setting an ocr_engine means that if no text layer is found, it will use this engine, regardless of force_ocr #335