@vblagoje vblagoje commented Jan 7, 2026

Why

Azure deprecated azure-ai-formrecognizer in favor of azure-ai-documentintelligence (v1.0.0, GA Dec 2024). The new package supports a Markdown output format (GitHub Flavored Markdown), which is better suited for RAG/LLM applications: tables stay inline with their context, document structure (headings, lists) is preserved, and no manual assembly is required.

What

Added AzureDocumentIntelligenceConverter component:

  • Uses azure-ai-documentintelligence>=1.0.0 package (2024-11-30 API)
  • Markdown output mode (default): single document with inline tables, preserved structure
  • Text output mode (backward compat): separate CSV table documents or markdown tables
  • Simplified API: removed page_layout, threshold_y, preceding_context_len, following_context_len, merge_multiple_column_headers
  • Added output_format (markdown/text), table_format (csv/markdown)

Deprecated AzureOCRDocumentConverter (removal in Haystack 2.25)

How can it be used

  import os

  from haystack.components.converters import AzureDocumentIntelligenceConverter
  from haystack.utils import Secret

  # Markdown mode (recommended for RAG)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="markdown"
  )
  results = converter.run(sources=["invoice.pdf"])
  # Returns single document with markdown, tables inline

  # Text mode (backward compat)
  converter = AzureDocumentIntelligenceConverter(
      endpoint=os.environ["AZURE_DI_ENDPOINT"],
      api_key=Secret.from_env_var("AZURE_AI_API_KEY"),
      output_format="text",
      table_format="csv"
  )
  # Returns separate CSV table documents + text document

How did you test it

  • 3 unit tests (init, to_dict, from_dict)
  • 4 integration tests with real Azure API (markdown output, text+CSV tables, metadata handling, multiple files)

Notes for the reviewer

Migration path from old converter:

  • page_layout="natural" → output_format="markdown"
  • Remove context/layout params (Azure API handles this now)
  • Tables inline in markdown mode vs separate CSV docs
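
The migration bullets above could be sketched as a small keyword-argument translation. This is only an illustration: `migrate_converter_kwargs` and `REMOVED_PARAMS` are hypothetical names that are not part of this PR or of Haystack, and the sketch assumes every other argument (such as `endpoint` or `api_key`) carries over unchanged. Only the `page_layout="natural"` mapping is taken from the notes above; other `page_layout` values are simply dropped without choosing an `output_format`.

```python
from typing import Any

# Parameters this PR lists as removed from the new converter.
REMOVED_PARAMS = {
    "page_layout",
    "threshold_y",
    "preceding_context_len",
    "following_context_len",
    "merge_multiple_column_headers",
}


def migrate_converter_kwargs(old_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: translate old converter kwargs into new ones.

    Assumption: arguments not listed in REMOVED_PARAMS are passed through as-is.
    """
    # Drop the context/layout parameters that the Azure API now handles itself.
    new_kwargs = {k: v for k, v in old_kwargs.items() if k not in REMOVED_PARAMS}

    # Mapping stated in the PR notes: page_layout="natural" -> output_format="markdown".
    if old_kwargs.get("page_layout") == "natural":
        new_kwargs.setdefault("output_format", "markdown")

    return new_kwargs
```

The returned dictionary could then be unpacked into the new converter's constructor. An explicitly supplied `output_format` is left untouched because of `setdefault`.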

vblagoje commented Jan 7, 2026

Waiting for secrets to be set. I'll double check everything once again and convert draft to PR

@vblagoje vblagoje marked this pull request as ready for review January 8, 2026 11:13
@vblagoje vblagoje requested a review from a team as a code owner January 8, 2026 11:13
@vblagoje vblagoje requested review from sjrl and removed request for a team January 8, 2026 11:13

vblagoje commented Jan 8, 2026

@sjrl please don't review yet, as I'm trying to figure out whether the new GitHub secrets will get picked up with merge + open state


vblagoje commented Jan 8, 2026

Closing this PR to see if new secrets get picked up in a new PR

@vblagoje vblagoje closed this Jan 8, 2026
@vblagoje vblagoje deleted the azure_doc_intelligence branch January 8, 2026 11:21

Development

Successfully merging this pull request may close these issues.

feat: Add a Azure OCR Converter that uses the azure-ai-documentintelligence library