Description
While testing some improvements to the python-documentcloud library, I discovered an unintended behavior that should be fixed.
2025-09-22 13:55:08,772 INFO squarelet request: post - documents/process/ - {'json': [{'id': 26105881, 'force_ocr': False, 'ocr_engine': 'textract'}, {'id': 26105882, 'force_ocr': False, 'ocr_engine': 'textract'}]}
2025-09-22 13:55:08,774 DEBUG urllib3.connectionpool Starting new HTTPS connection (1): api.www.documentcloud.org:443
2025-09-22 13:55:09,088 DEBUG urllib3.connectionpool https://api.www.documentcloud.org:443 "POST /api/documents/process/ HTTP/1.1" 200 None
2025-09-22 13:55:09,088 DEBUG squarelet response: 200 - b'"OK"'
2025-09-22 13:55:09,094 INFO documentcloud Process payload: [{'id': 26105881, 'force_ocr': False, 'ocr_engine': 'textract'}, {'id': 26105882, 'force_ocr': False, 'ocr_engine': 'textract'}]
2025-09-22 13:55:09,094 INFO documentcloud Upload directory complete
This document, https://www.documentcloud.org/documents/26105882-test/, was submitted with force_ocr set to False and an ocr_engine specified. Because the document didn't have a text layer, DocumentCloud ran OCR anyway, and it used the provided OCR engine (textract) to do so. Since our policy is to run OCR if no text layer is detected, it should use tesseract in that case and ignore the provided ocr_engine, since force_ocr was explicitly set to False.
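To make the expected behavior concrete, here is a minimal sketch of the engine-selection rule described above. It is not the actual backend code: the function name `choose_ocr_engine`, the `has_text_layer` flag, and the `DEFAULT_ENGINE` constant are all hypothetical, and the real identifier the backend uses for tesseract may differ.

```python
from typing import Optional

# Assumption: placeholder name for the default engine; the backend's
# real identifier for tesseract may be different.
DEFAULT_ENGINE = "tesseract"


def choose_ocr_engine(
    has_text_layer: bool,
    force_ocr: bool,
    requested_engine: Optional[str] = None,
) -> Optional[str]:
    """Return the engine to run OCR with, or None if OCR should be skipped.

    Hypothetical helper illustrating the policy in this issue.
    """
    if force_ocr:
        # OCR was explicitly requested, so honor the requested engine.
        return requested_engine or DEFAULT_ENGINE
    if not has_text_layer:
        # Policy fallback: OCR runs because no text layer was detected,
        # but the requested engine is ignored since force_ocr is False.
        return DEFAULT_ENGINE
    # A text layer exists and OCR was not forced: nothing to do.
    return None
```

Under this rule, the document above (no text layer, force_ocr False, ocr_engine textract) would be processed with the default engine rather than textract.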
I am handling this in the python-documentcloud library so that it can't happen from that client, but we should probably also fix this behavior on the backend.
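For illustration, a client-side guard could look roughly like the sketch below. This is not the actual change made in python-documentcloud: `drop_unforced_ocr_engine` is a hypothetical helper, and the payload shape is taken from the log above.

```python
from typing import Any, Dict, List


def drop_unforced_ocr_engine(payload: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of a process payload without ocr_engine on unforced items.

    Hypothetical helper: when force_ocr is falsy or missing, the
    ocr_engine key is removed so the backend cannot pick it up.
    """
    cleaned = []
    for item in payload:
        item = dict(item)  # shallow copy so the caller's data is not mutated
        if not item.get("force_ocr", False):
            item.pop("ocr_engine", None)
        cleaned.append(item)
    return cleaned


# Payload shape copied from the log in this issue.
payload = [
    {"id": 26105881, "force_ocr": False, "ocr_engine": "textract"},
    {"id": 26105882, "force_ocr": False, "ocr_engine": "textract"},
]
safe_payload = drop_unforced_ocr_engine(payload)
```

This only protects requests that go through the client; direct API calls would still hit the backend behavior described above.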