
Conversation


@nishika26 nishika26 commented Dec 24, 2025

Summary

Target issue is #489

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added new code, ensure it is covered by test cases.

Notes

Please add any other information the reviewer may need here.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced provider-based collection management system to support multiple LLM providers
    • Collections now store provider-specific configuration data
    • Added OpenAI provider with support for customizable collection parameters
  • Refactor

    • Restructured collection creation and deletion workflows to use provider abstraction
    • Extended public model exports for improved API accessibility


coderabbitai bot commented Dec 24, 2025

📝 Walkthrough


A new provider abstraction system is introduced for collection management, replacing direct provider-specific logic with a unified interface. Changes include database schema extensions (provider enum, collection_blob column), reorganized data models for request/response handling, and refactored service layer operations through a registry-based provider pattern supporting OpenAI backend.

Changes

Cohort / File(s) Summary
Database Schema Migration
backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
Adds PostgreSQL ENUM provider_enum with "openai" value; introduces collection_blob (JSONB) and provider columns to collection table; populates existing NULL providers to "openai"; enforces NOT NULL constraint post-population; updates llm_service_name comment.
Data Models - Collection Definitions
backend/app/models/collection/request.py
Introduces ProviderType enum; adds provider and collection_blob fields to Collection; replaces document handling with DocumentInput type; introduces CreateCollectionParams, CreationRequest, DeletionRequest, CallbackRequest, and ProviderOptions types; includes document deduplication and provider value normalization.
Data Models - Collection Responses
backend/app/models/collection/response.py
New module defining public response types: CreateCollectionResult, CollectionIDPublic, CollectionPublic, and CollectionWithDocsPublic with appropriate fields and inheritance structure.
Model Exports
backend/app/models/collection/__init__.py, backend/app/models/__init__.py
Aggregates and re-exports collection-related types (CreationRequest, DeletionRequest, ProviderType, CreateCollectionParams, CreateCollectionResult, etc.) for simplified public API imports.
Collection Services
backend/app/services/collections/create_collection.py, backend/app/services/collections/delete_collection.py
Replaces direct provider-specific logic with provider abstraction via get_llm_provider; removes with_assistant parameter from execute_job; delegates creation/deletion operations to provider interface; stores provider-specific metadata (llm_service_id, llm_service_name, collection_blob) in database.
Collection Helpers
backend/app/services/collections/helpers.py
Introduces get_service_name(provider: str) helper function mapping providers to service names; removes OPENAI_VECTOR_STORE constant; updates logic to use helper function.
Provider Abstraction - Foundation
backend/app/services/collections/providers/base.py, backend/app/services/collections/providers/openai.py
Introduces abstract BaseProvider class with create, delete, and cleanup methods; implements OpenAIProvider handling vector store creation, optional assistant creation, and resource cleanup.
Provider Abstraction - Registry
backend/app/services/collections/providers/registry.py, backend/app/services/collections/providers/__init__.py
Introduces LLMProvider registry mapping provider names to provider classes; adds get_llm_provider factory function resolving provider class and retrieving credentials; exports public provider APIs.
Test Updates
backend/app/tests/api/routes/collections/test_collection_info.py, backend/app/tests/api/routes/collections/test_collection_list.py, backend/app/tests/utils/collection.py
Updates test assertions to use get_service_name("openai") helper; sets provider=ProviderType.OPENAI on test Collection instances.

Sequence Diagrams

sequenceDiagram
    participant Client
    participant CollectionService as Collection Service
    participant Provider as LLMProvider
    participant OpenAIAPI as OpenAI API
    participant Database as Database
    
    Client->>CollectionService: execute_job(CreationRequest)
    activate CollectionService
    
    rect rgb(200, 220, 255)
        Note over CollectionService: Initialize Provider
        CollectionService->>Provider: get_llm_provider(provider="openai")
        Provider->>Provider: Lookup credentials via registry
        Provider->>OpenAIAPI: Initialize OpenAI client
        activate Provider
    end
    
    rect rgb(220, 240, 220)
        Note over CollectionService: Delegate Creation
        CollectionService->>Provider: provider.create(CreationRequest, storage, DocumentCrud)
        Provider->>OpenAIAPI: Create vector store from batched documents
        Provider->>OpenAIAPI: Optionally create assistant if model/instructions provided
        Provider-->>CollectionService: CreateCollectionResult (llm_service_id, llm_service_name, collection_blob)
        deactivate Provider
    end
    
    rect rgb(240, 220, 220)
        Note over CollectionService: Persist to Database
        CollectionService->>Database: Store Collection with provider, collection_blob, llm_service_id
        Database-->>CollectionService: Stored
    end
    
    alt Success
        CollectionService-->>Client: Job completed
    else Failure
        rect rgb(255, 200, 200)
            Note over CollectionService: Cleanup on Failure
            CollectionService->>Provider: provider.cleanup(result)
            Provider->>OpenAIAPI: Delete created vector store/assistant
        end
        CollectionService-->>Client: Job failed
    end
    
    deactivate CollectionService
sequenceDiagram
    participant Client
    participant CollectionService as Collection Service
    participant Provider as LLMProvider
    participant OpenAIAPI as OpenAI API
    participant Database as Database
    
    Client->>CollectionService: delete_collection(collection_id)
    activate CollectionService
    
    rect rgb(240, 240, 240)
        Note over CollectionService: Fetch Collection & Initialize Provider
        CollectionService->>Database: Fetch Collection (includes llm_service_name, provider)
        Database-->>CollectionService: Collection object
        CollectionService->>Provider: get_llm_provider(provider)
        activate Provider
    end
    
    rect rgb(220, 240, 220)
        Note over CollectionService: Delete External Resource
        CollectionService->>Provider: provider.delete(collection)
        alt llm_service_name != "openai vector store"
            Provider->>OpenAIAPI: Delete assistant
        else llm_service_name == "openai vector store"
            Provider->>OpenAIAPI: Delete vector store
        end
        Provider-->>CollectionService: Deleted
        deactivate Provider
    end
    
    rect rgb(240, 220, 220)
        Note over CollectionService: Remove from Database
        CollectionService->>Database: Delete collection record
        Database-->>CollectionService: Deleted
    end
    
    CollectionService-->>Client: Deletion complete
    deactivate CollectionService

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested reviewers

  • avirajsingh7
  • Prajna1999
  • kartpop

Poem

🐰 A provider's registry hops into place,
Abstract and clean, with elegant grace!
Vector stores dance with assistants alike,
Through OpenAI's tunnels—oh what a sight!
Collections now bloom with their provider's light. 🌸

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: Passed. Skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The pull request title clearly summarizes the main change: refactoring the collection module to be provider-agnostic by introducing a provider abstraction layer and registry system.
  • Docstring Coverage: Passed. Docstring coverage is 83.33%, which meets the required threshold of 80.00%.


@nishika26 nishika26 self-assigned this Dec 24, 2025
@nishika26 nishika26 added the enhancement New feature or request label Dec 24, 2025
@nishika26 nishika26 linked an issue Dec 24, 2025 that may be closed by this pull request
@nishika26 nishika26 marked this pull request as ready for review December 26, 2025 04:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)

269-270: Potential NameError if CreationRequest parsing fails.

If CreationRequest(**request) on line 156 raises an exception, creation_request is never assigned. The check on line 269 will then raise a NameError.

Proposed fix: initialize creation_request before try block or guard the check
+    creation_request = None
+
     try:
         creation_request = CreationRequest(**request)
         # ...

     except Exception as err:
         # ...

-        if creation_request and creation_request.callback_url and collection_job:
+        if creation_request is not None and creation_request.callback_url and collection_job:
             failure_payload = build_failure_payload(collection_job, str(err))
             send_callback(creation_request.callback_url, failure_payload)
🧹 Nitpick comments (11)
backend/app/services/collections/helpers.py (1)

17-25: Consider raising an error or logging for unknown providers.

Returning an empty string for unknown providers could lead to silent failures downstream. Consider logging a warning or raising a ValueError for unsupported providers to make debugging easier.

🔎 Suggested improvement
 def get_service_name(provider: str) -> str:
     """Get the collection service name for a provider."""
     names = {
         "openai": "openai vector store",
         #   "bedrock": "bedrock knowledge base",
         #  "gemini": "gemini file search store",
     }
-    return names.get(provider.lower(), "")
+    service_name = names.get(provider.lower())
+    if service_name is None:
+        logger.warning(f"[get_service_name] Unknown provider: {provider}")
+        return ""
+    return service_name
backend/app/services/collections/providers/base.py (3)

30-53: Docstring parameters don't match the method signature.

The docstring mentions batch_size, with_assistant, and assistant_options parameters that don't exist in the actual method signature. Also:

  • Line 48: "CreateCollectionresult" → "CreateCollectionResult" (typo)
  • Line 51: "kb_blob" → "collection_blob" (field name mismatch)
  • Line 53: error message says "execute method" but method is named "create"
Proposed fix
     @abstractmethod
     def create(
         self,
         collection_request: CreationRequest,
         storage: CloudStorage,
         document_crud: DocumentCrud,
     ) -> CreateCollectionResult:
         """Create collection with documents and optionally an assistant.

         Args:
-            collection_params: Collection parameters (name, description, chunking_params, etc.)
+            collection_request: Creation request containing collection params and options
             storage: Cloud storage instance for file access
             document_crud: DocumentCrud instance for fetching documents
-            batch_size: Number of documents to process per batch
-            with_assistant: Whether to create an assistant/agent
-            assistant_options: Options for assistant creation (provider-specific)

         Returns:
-            CreateCollectionresult containing:
+            CreateCollectionResult containing:
             - llm_service_id: ID of the created resource (vector store or assistant)
             - llm_service_name: Name of the service
-            - kb_blob: All collection params except documents
+            - collection_blob: All collection params except documents
         """
-        raise NotImplementedError("Providers must implement execute method")
+        raise NotImplementedError("Providers must implement create method")

55-65: Docstring Args don't match the method signature.

The docstring mentions llm_service_id and llm_service_name as parameters, but the actual signature only accepts collection: Collection.

Proposed fix
     @abstractmethod
     def delete(self, collection: Collection) -> None:
         """Delete remote resources associated with a collection.

         Called when a collection is being deleted and remote resources need to be cleaned up.

         Args:
-            llm_service_id: ID of the resource to delete
-            llm_service_name: Name of the service (determines resource type)
+            collection: The collection whose remote resources should be deleted
         """
         raise NotImplementedError("Providers must implement delete method")

67-76: Typo in docstring.

Line 74: "CreateCollectionresult" should be "CreateCollectionResult".

Proposed fix
-            collection_result: The CreateCollectionresult returned from execute, containing resource IDs
+            collection_result: The CreateCollectionResult returned from create, containing resource IDs
backend/app/services/collections/create_collection.py (1)

35-42: Unused with_assistant parameter.

The with_assistant parameter is accepted but never used in start_job. The assistant creation logic is now determined by checking model and instructions in the provider. Consider removing this parameter if it's no longer needed.

Proposed fix
 def start_job(
     db: Session,
     request: CreationRequest,
     project_id: int,
     collection_job_id: UUID,
-    with_assistant: bool,
     organization_id: int,
 ) -> str:
backend/app/services/collections/providers/openai.py (4)

2-2: Unused import: Any.

The Any type is imported but not used in this file.

Proposed fix
 import logging
-from typing import Any
 
 from openai import OpenAI

24-26: Redundant self.client assignment.

super().__init__(client) already assigns self.client = client in BaseProvider.__init__. The second assignment on line 26 is redundant.

Proposed fix
     def __init__(self, client: OpenAI):
         super().__init__(client)
-        self.client = client

62-65: Log messages reference wrong method name.

The log prefix says [OpenAIProvider.execute] but the method is named create. Per coding guidelines, log messages should be prefixed with the function name.

Proposed fix for all occurrences in create method
             logger.info(
-                "[OpenAIProvider.execute] Vector store created | "
+                "[OpenAIProvider.create] Vector store created | "
                 f"vector_store_id={vector_store.id}, batches={len(docs_batches)}"
             )

Apply similar changes to lines 93-95, 104-105, and 114-118.


60-60: Consider explicit loop for generator consumption.

Using list() to consume a generator whose result is discarded can be unclear. A for loop or collections.deque(maxlen=0) pattern would make intent clearer.

Proposed alternative
-            list(vector_store_crud.update(vector_store.id, storage, docs_batches))
+            for _ in vector_store_crud.update(vector_store.id, storage, docs_batches):
+                pass
backend/app/services/collections/providers/registry.py (1)

61-69: Unreachable else branch and logging format.

The else branch (lines 65-69) is unreachable because LLMProvider.get(provider) on line 47 already raises ValueError for unsupported providers. Also, the log message on line 67 should use square brackets per coding guidelines: [get_llm_provider].

Proposed fix: remove unreachable code or convert to assertion
     if provider == LLMProvider.OPENAI:
         if "api_key" not in credentials:
             raise ValueError("OpenAI credentials not configured for this project.")
         client = OpenAI(api_key=credentials["api_key"])
-    else:
-        logger.error(
-            f"[get_llm_provider] Unsupported provider type requested: {provider}"
-        )
-        raise ValueError(f"Provider '{provider}' is not supported.")
+    else:
+        # This branch is unreachable as LLMProvider.get validates the provider,
+        # but kept as defensive programming for future provider additions.
+        raise AssertionError(f"Unhandled provider: {provider}")

     return provider_class(client=client)
backend/app/models/collection/response.py (1)

20-29: Add provider field to CollectionPublic.

The Collection database model includes a provider field (ProviderType enum) that represents the LLM provider (e.g., "openai"). This field is missing from CollectionPublic and should be exposed in the response schema. Per learnings, provider and llm_service_name serve different purposes—provider indicates the LLM provider name while llm_service_name specifies the particular service from that provider. Exposing both fields provides complete information to API consumers about the collection's LLM configuration.
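A minimal sketch of what exposing both fields might look like (the field set and types are assumptions, and a plain dataclass stands in for the project's SQLModel/Pydantic schema):

```python
# Hypothetical response schema exposing both fields; names follow the review
# comment above, everything else is illustrative.
from dataclasses import dataclass


@dataclass
class CollectionPublic:
    id: str
    llm_service_id: str
    llm_service_name: str  # specific service, e.g. "openai vector store"
    provider: str          # LLM provider name, e.g. "openai"


c = CollectionPublic("col_1", "vs_123", "openai vector store", "openai")
print(c.provider)  # openai
```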

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91941f9 and 946e7c7.

📒 Files selected for processing (15)
  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/request.py
  • backend/app/models/collection/response.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/helpers.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/tests/utils/collection.py
🧰 Additional context used
📓 Path-based instructions (6)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/helpers.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/tests/utils/collection.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/helpers.py
  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/response.py
  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/request.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/utils/collection.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sa_column_kwargs["comment"] to describe database columns, especially for non-obvious purposes, status/type fields, JSON/metadata columns, and foreign keys

Files:

  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/response.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/request.py
backend/app/alembic/versions/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Generate database migrations using alembic revision --autogenerate -m "Description" --rev-id <number> where rev-id is the latest existing revision ID + 1

Files:

  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
backend/app/models/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use SQLModel for database models located in backend/app/models/

Files:

  • backend/app/models/__init__.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/tests/utils/collection.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/helpers.py
  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/crud/*.py : Use CRUD pattern for database access operations located in `backend/app/crud/`

Applied to files:

  • backend/app/services/collections/delete_collection.py
📚 Learning: 2025-12-17T10:16:16.173Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:16.173Z
Learning: In backend/app/models/collection.py, treat provider as the LLM provider name (e.g., 'openai') and llm_service_name as the specific service from that provider. These fields serve different purposes and should remain non-redundant. Document their meanings, add clear type hints (e.g., provider: str, llm_service_name: str), and consider a small unit test or validation to ensure they are distinct and used appropriately, preventing accidental aliasing or duplication across the model or serializers.

Applied to files:

  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/response.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/request.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/models/**/*.py : Use `sa_column_kwargs["comment"]` to describe database columns, especially for non-obvious purposes, status/type fields, JSON/metadata columns, and foreign keys

Applied to files:

  • backend/app/models/collection/request.py
🧬 Code graph analysis (11)
backend/app/services/collections/delete_collection.py (4)
backend/app/services/collections/providers/registry.py (1)
  • get_llm_provider (44-71)
backend/app/services/collections/providers/openai.py (1)
  • delete (121-148)
backend/app/services/collections/providers/base.py (1)
  • delete (56-65)
backend/app/crud/collection/collection.py (1)
  • delete (103-111)
backend/app/services/collections/providers/base.py (5)
backend/app/crud/document/document.py (1)
  • DocumentCrud (13-134)
backend/app/core/cloud/storage.py (1)
  • CloudStorage (113-141)
backend/app/models/collection/request.py (2)
  • CreationRequest (224-236)
  • Collection (26-92)
backend/app/models/collection/response.py (1)
  • CreateCollectionResult (10-13)
backend/app/services/collections/providers/openai.py (3)
  • create (28-119)
  • delete (121-148)
  • cleanup (150-160)
backend/app/tests/utils/collection.py (2)
backend/app/models/collection/request.py (1)
  • ProviderType (16-19)
backend/app/services/collections/helpers.py (1)
  • get_service_name (18-25)
backend/app/tests/api/routes/collections/test_collection_list.py (1)
backend/app/services/collections/helpers.py (1)
  • get_service_name (18-25)
backend/app/services/collections/providers/__init__.py (3)
backend/app/services/collections/providers/base.py (1)
  • BaseProvider (9-84)
backend/app/services/collections/providers/openai.py (1)
  • OpenAIProvider (21-160)
backend/app/services/collections/providers/registry.py (2)
  • LLMProvider (14-41)
  • get_llm_provider (44-71)
backend/app/services/collections/helpers.py (1)
backend/app/services/collections/providers/registry.py (1)
  • get (28-36)
backend/app/models/collection/__init__.py (2)
backend/app/models/collection/request.py (7)
  • Collection (26-92)
  • CreationRequest (224-236)
  • DeletionRequest (239-243)
  • CallbackRequest (197-203)
  • AssistantOptions (141-194)
  • CreateCollectionParams (106-138)
  • ProviderType (16-19)
backend/app/models/collection/response.py (4)
  • CollectionIDPublic (16-17)
  • CollectionPublic (20-29)
  • CollectionWithDocsPublic (32-33)
  • CreateCollectionResult (10-13)
backend/app/models/collection/response.py (1)
backend/app/models/document.py (1)
  • DocumentPublic (72-85)
backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py (2)
backend/app/services/collections/providers/openai.py (1)
  • create (28-119)
backend/app/services/collections/providers/base.py (1)
  • create (31-53)
backend/app/tests/api/routes/collections/test_collection_info.py (1)
backend/app/services/collections/helpers.py (1)
  • get_service_name (18-25)
backend/app/models/collection/request.py (3)
backend/app/core/util.py (1)
  • now (11-12)
backend/app/models/organization.py (1)
  • Organization (44-82)
backend/app/models/project.py (1)
  • Project (51-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (25)
backend/app/services/collections/helpers.py (1)

108-111: LGTM!

The refactor to use get_service_name("openai") instead of the removed constant is correct and maintains the existing behavior while aligning with the new provider abstraction.

backend/app/models/collection/request.py (4)

130-138: LGTM!

The deduplication logic in model_post_init correctly removes duplicate documents by ID while preserving order.
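The order-preserving dedup-by-id behavior can be sketched as a plain function (DocumentInput here is a stand-in for the real model, and the validator wiring is omitted):

```python
# Minimal sketch of order-preserving deduplication by id; illustrative only.
from dataclasses import dataclass


@dataclass
class DocumentInput:
    id: str
    name: str


def dedupe_documents(docs: list[DocumentInput]) -> list[DocumentInput]:
    seen: set[str] = set()
    unique: list[DocumentInput] = []
    for doc in docs:
        if doc.id not in seen:  # keep only the first occurrence of each id
            seen.add(doc.id)
            unique.append(doc)
    return unique


docs = [DocumentInput("a", "one"), DocumentInput("b", "two"), DocumentInput("a", "dup")]
print([d.name for d in dedupe_documents(docs)])  # ['one', 'two']
```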


214-221: LGTM!

The normalize_provider validator correctly handles case-insensitive provider matching by normalizing to lowercase before validation.
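The case-insensitive normalization can be sketched as follows, assuming the real code runs this logic inside a Pydantic field validator (reduced here to a plain function):

```python
# Illustrative stand-in for the normalize_provider validator.
from enum import Enum


class ProviderType(str, Enum):
    OPENAI = "openai"


def normalize_provider(value: str) -> ProviderType:
    # Lowercase first so "OpenAI", "OPENAI", and "openai" all resolve,
    # and unknown providers raise ValueError via the enum constructor.
    return ProviderType(value.lower())


print(normalize_provider("OpenAI").value)  # openai
```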


224-243: LGTM!

The CreationRequest and DeletionRequest models are well-structured, properly compose their parent classes, and provide clear field descriptions.


35-47: ENUM type name is consistent. The provider field correctly uses name="providertype" (lowercase) in both the model definition and the Alembic migration at 041_adding_blob_column_in_collection_table.py. The create_type difference is intentional—the migration uses create_type=True to create the type, while the model uses create_type=False to use the existing type.

backend/app/tests/api/routes/collections/test_collection_list.py (2)

10-10: LGTM!

Importing get_service_name from the helpers module is correct and aligns with the provider abstraction changes.


105-106: LGTM!

Using get_service_name("openai") instead of a hardcoded string improves maintainability and ensures consistency with the service layer.

backend/app/tests/api/routes/collections/test_collection_info.py (2)

12-12: LGTM!

Import aligns with the provider abstraction pattern used across test files.


167-168: LGTM!

Assertion updated consistently with other test files to use the helper function.

backend/app/tests/utils/collection.py (3)

11-14: LGTM!

Imports correctly added to support the provider abstraction in test utilities.


42-50: LGTM!

The get_collection function correctly sets provider=ProviderType.OPENAI on the created Collection, aligning with the new provider-based model.


67-75: LGTM!

The get_vector_store_collection function correctly uses both get_service_name("openai") for the service name and ProviderType.OPENAI for the provider field, maintaining consistency with the provider abstraction.

backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py (1)

24-54: LGTM on the safe migration pattern.

The upgrade correctly follows the safe pattern for adding a NOT NULL column with existing data:

  1. Add column as nullable
  2. Backfill existing rows with default value
  3. Alter column to NOT NULL
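The three-step pattern above can be sketched as an Alembic fragment (identifiers follow the PR summary; the exact revision wiring and defaults are assumptions):

```python
# Illustrative Alembic upgrade following the safe add-NOT-NULL pattern.
import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql


def upgrade() -> None:
    # 1. Add the column as nullable so existing rows don't violate constraints.
    provider_enum = postgresql.ENUM("openai", name="providertype")
    provider_enum.create(op.get_bind(), checkfirst=True)
    op.add_column("collection", sa.Column("provider", provider_enum, nullable=True))

    # 2. Backfill existing rows with the only supported provider.
    op.execute("UPDATE collection SET provider = 'openai' WHERE provider IS NULL")

    # 3. Enforce NOT NULL only after every row has a value.
    op.alter_column("collection", "provider", nullable=False)
```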
backend/app/services/collections/delete_collection.py (2)

17-20: LGTM!

Imports correctly updated to use the new provider registry pattern, removing direct OpenAI CRUD dependencies.


159-180: Session management looks correct.

The provider is obtained within a session context (for credential lookup), but provider.delete(collection) is called outside the session block. This is appropriate since the external API call shouldn't hold the database session open.

backend/app/models/__init__.py (1)

9-19: LGTM!

The expanded exports correctly expose the new collection-related types (CreateCollectionParams, CreateCollectionResult, CreationRequest, DeletionRequest, ProviderType) needed for the provider-agnostic collection management.

backend/app/services/collections/providers/base.py (1)

78-84: LGTM!

The get_provider_name utility cleanly derives a lowercase provider name from the class name by convention.
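The abstract contract discussed for this file can be sketched as follows; signatures are simplified to Any (the real methods take CreationRequest, CloudStorage, DocumentCrud, and Collection), and the subclass is only a demonstration:

```python
# Hedged sketch of the BaseProvider interface; real signatures are richer.
from abc import ABC, abstractmethod
from typing import Any


class BaseProvider(ABC):
    def __init__(self, client: Any) -> None:
        self.client = client

    @abstractmethod
    def create(self, request: Any, storage: Any, document_crud: Any) -> Any:
        """Create remote resources and return their identifiers."""

    @abstractmethod
    def delete(self, collection: Any) -> None:
        """Delete remote resources for an existing collection."""

    @abstractmethod
    def cleanup(self, result: Any) -> None:
        """Best-effort removal of partially created resources on failure."""

    @classmethod
    def get_provider_name(cls) -> str:
        # Derive "openai" from "OpenAIProvider" by naming convention.
        return cls.__name__.removesuffix("Provider").lower()


class OpenAIProvider(BaseProvider):  # demonstration subclass only
    def create(self, request: Any, storage: Any, document_crud: Any) -> Any:
        return None

    def delete(self, collection: Any) -> None:
        return None

    def cleanup(self, result: Any) -> None:
        return None


print(OpenAIProvider.get_provider_name())  # openai
```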

backend/app/services/collections/providers/__init__.py (1)

1-6: LGTM!

The re-exports consolidate the provider package's public API cleanly. Consider adding an __all__ list for explicit export control, though this is optional.

backend/app/models/collection/response.py (2)

10-17: LGTM!

CreateCollectionResult and CollectionIDPublic models are well-defined with proper type hints.


32-33: LGTM!

CollectionWithDocsPublic correctly extends CollectionPublic with an optional documents list.

backend/app/services/collections/create_collection.py (1)

254-260: LGTM!

The provider cleanup is properly guarded—only attempts cleanup if both provider and result are available, with error handling to prevent masking the original exception.
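The guarded-cleanup shape can be sketched like this (names and the `vs-123` id are hypothetical, not from the PR):

```python
class ProviderError(Exception):
    """Stands in for an error raised partway through collection creation."""

messages = []

def cleanup(result):
    # Cleanup of the partially created resource may itself fail.
    raise RuntimeError("cleanup failed too")

def create_collection():
    provider, result = object(), None
    try:
        result = "vs-123"  # hypothetical id of a partially created resource
        raise ProviderError("downstream step failed")
    except ProviderError:
        # Only attempt cleanup when both provider and result exist, and
        # swallow cleanup errors so the original exception is not masked.
        if provider is not None and result is not None:
            try:
                cleanup(result)
            except Exception as cleanup_err:
                messages.append(f"cleanup error suppressed: {cleanup_err}")
        raise

try:
    create_collection()
except ProviderError as err:
    messages.append(f"original error preserved: {err}")

print(messages)
```

The caller still sees `ProviderError`, not the cleanup failure, which is the masking-prevention property being approved.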

backend/app/models/collection/__init__.py (1)

1-15: LGTM!

The package correctly aggregates and re-exports public types from the request and response submodules, providing a clean import surface.

backend/app/services/collections/providers/openai.py (1)

121-160: LGTM!

The delete and cleanup methods correctly handle both assistant and vector store resources with proper error handling and logging.

backend/app/services/collections/providers/registry.py (2)

14-41: LGTM!

The LLMProvider registry pattern is well-structured and extensible, with clear methods for provider lookup and listing supported providers.


44-59: LGTM!

The factory function properly validates credentials existence and provider-specific requirements before constructing the client.
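A minimal sketch of the registry-plus-factory shape (class and function names mirror the review but the bodies are illustrative, not the PR's actual code):

```python
class UnsupportedProviderError(ValueError):
    """Raised when no provider is registered under the requested name."""

class LLMProvider:
    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, provider_cls: type) -> None:
        cls._registry[name] = provider_cls

    @classmethod
    def get(cls, name: str) -> type:
        if name not in cls._registry:
            raise UnsupportedProviderError(name)
        return cls._registry[name]

    @classmethod
    def supported(cls) -> list[str]:
        return sorted(cls._registry)

class OpenAIProvider:
    def __init__(self, api_key: str):
        self.api_key = api_key

LLMProvider.register("openai", OpenAIProvider)

def get_llm_provider(name: str, credentials: dict):
    # Validate that credentials exist before constructing the client.
    if not credentials.get("api_key"):
        raise ValueError(f"missing credentials for provider {name!r}")
    return LLMProvider.get(name)(api_key=credentials["api_key"])

provider = get_llm_provider("openai", {"api_key": "sk-test"})
print(type(provider).__name__, LLMProvider.supported())
```

Adding a second backend then only requires defining a class and one `register` call; callers keep using `get_llm_provider` unchanged.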

Comment on lines +66 to +76
def downgrade():
op.alter_column(
"collection",
"llm_service_name",
existing_type=sa.VARCHAR(),
comment="Name of the LLM service provider",
existing_comment="Name of the LLM service",
existing_nullable=False,
)
op.drop_column("collection", "provider")
op.drop_column("collection", "collection_blob")

⚠️ Potential issue | 🟡 Minor

Missing ENUM type drop in downgrade.

The downgrade() function drops the provider and collection_blob columns but doesn't drop the providertype ENUM type. This could leave orphaned types in the database after a rollback.

🔎 Proposed fix
 def downgrade():
     op.alter_column(
         "collection",
         "llm_service_name",
         existing_type=sa.VARCHAR(),
         comment="Name of the LLM service provider",
         existing_comment="Name of the LLM service",
         existing_nullable=False,
     )
     op.drop_column("collection", "provider")
     op.drop_column("collection", "collection_blob")
+    provider_enum.drop(op.get_bind(), checkfirst=True)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
around lines 66 to 76, the downgrade drops the provider and collection_blob
columns but does not remove the providertype ENUM type, leaving an orphaned type
in the database; update downgrade to drop the providertype ENUM after dropping
the provider column by using op.execute or sa.Enum(...).drop(op.get_bind(),
checkfirst=True) (or op.execute('DROP TYPE IF EXISTS providertype') depending on
DB) to remove the ENUM type safely and ensure checkfirst behavior so downgrade
is idempotent.

Comment on lines +95 to +103
class DocumentInput(SQLModel):
    """Document to be added to knowledge base."""

    name: str | None = Field(
        description="Display name for the document",
    )
    id: UUID = Field(
        description="Reference to uploaded file/document in Kaapi",
    )

⚠️ Potential issue | 🟡 Minor

DocumentInput.name field missing default value.

The name field is typed as str | None but lacks a default=None, which will make it a required field despite being optional by type.

🔎 Proposed fix
 class DocumentInput(SQLModel):
     """Document to be added to knowledge base."""
 
     name: str | None = Field(
+        default=None,
         description="Display name for the document",
     )
     id: UUID = Field(
         description="Reference to uploaded file/document in Kaapi",
     )
🤖 Prompt for AI Agents
In backend/app/models/collection/request.py around lines 95 to 103, the
DocumentInput.name is annotated as optional (str | None) but lacks a default so
Pydantic/SQLModel treats it as required; update the Field declaration to include
default=None (i.e., name: str | None = Field(default=None, description="Display
name for the document")) so the field is truly optional and behaves as intended.

Comment on lines +109 to +112
name: str | None = Field(
min_length=1,
description="Name of the knowledge base to create or update",
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find and read the file to see the actual context
fd backend/app/models/collection/request.py --type f

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Read the specific file around lines 109-112
if [ -f "backend/app/models/collection/request.py" ]; then
  wc -l backend/app/models/collection/request.py
fi

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 119


🏁 Script executed:

# Get the file structure to understand imports and class definition
head -50 backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1386


🏁 Script executed:

# Get lines 100-120 to see the name field and context
sed -n '100,120p' backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 729


🏁 Script executed:

python3 << 'EOF'
from pydantic import Field
from sqlmodel import SQLModel

class CreateCollectionParams(SQLModel):
    name: str | None = Field(
        min_length=1,
        description="Name of the knowledge base to create or update",
    )

# Test 1: None value
try:
    t1 = CreateCollectionParams(name=None)
    print(f'Test 1 - None accepted: {t1.name}')
except Exception as e:
    print(f'Test 1 - Validation error with None: {type(e).__name__}: {e}')

# Test 2: Empty string
try:
    t2 = CreateCollectionParams(name="")
    print(f'Test 2 - Empty string accepted: {t2.name}')
except Exception as e:
    print(f'Test 2 - Validation error with empty string: {type(e).__name__}: {e}')

# Test 3: Valid string
try:
    t3 = CreateCollectionParams(name="test")
    print(f'Test 3 - Valid string accepted: {t3.name}')
except Exception as e:
    print(f'Test 3 - Validation error with valid string: {type(e).__name__}: {e}')

# Test 4: No name provided
try:
    t4 = CreateCollectionParams()
    print(f'Test 4 - No name provided, defaults to: {t4.name}')
except Exception as e:
    print(f'Test 4 - Validation error when name not provided: {type(e).__name__}: {e}')
EOF

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 161


🏁 Script executed:

# Check the full CreateCollectionParams class and surrounding context
sed -n '105,145p' backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1494


🏁 Script executed:

# Check if there are imports or other usage patterns in the file
grep -n "sa_column_kwargs" backend/app/models/collection/request.py | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 659


Add explicit default=None to the name field.

The name field is typed as str | None but lacks an explicit default=None, while other Optional fields in this class (description, chunking_params, additional_params) all include it. This inconsistency can cause unexpected validation behavior in Pydantic. Add default=None to match the pattern: name: str | None = Field(min_length=1, default=None, description="...").

🤖 Prompt for AI Agents
In backend/app/models/collection/request.py around lines 109 to 112, the name
field is annotated as str | None but lacks an explicit default=None whereas
other optional fields include it; update the Field call to add default=None
(i.e., Field(min_length=1, default=None, description="Name of the knowledge base
to create or update")) so Pydantic treats it consistently and avoids unexpected
validation behavior.

Comment on lines +180 to +184
result = provider.create(
collection_request=creation_request,
storage=storage,
document_crud=document_crud,
)

⚠️ Potential issue | 🟠 Major

provider.create called outside session context—verify document_crud remains usable.

document_crud is created inside a with Session(engine) block (lines 159-178), but provider.create is called after the block exits. Since DocumentCrud holds a reference to the session, operations like read_each inside batch_documents may fail if the session is closed.

Proposed fix: move provider.create inside the session block
         with Session(engine) as session:
             collection_job_crud = CollectionJobCrud(session, project_id)
             collection_job = collection_job_crud.read_one(job_uuid)
             collection_job = collection_job_crud.update(
                 job_uuid,
                 CollectionJobUpdate(
                     task_id=task_id,
                     status=CollectionJobStatus.PROCESSING,
                 ),
             )

             storage = get_cloud_storage(session=session, project_id=project_id)
             document_crud = DocumentCrud(session, project_id)

             provider = get_llm_provider(
                 session=session,
                 provider=creation_request.provider,
                 project_id=project_id,
                 organization_id=organization_id,
             )

-        result = provider.create(
-            collection_request=creation_request,
-            storage=storage,
-            document_crud=document_crud,
-        )
+            result = provider.create(
+                collection_request=creation_request,
+                storage=storage,
+                document_crud=document_crud,
+            )
🤖 Prompt for AI Agents
In backend/app/services/collections/create_collection.py around lines 180 to
184, provider.create is being invoked after the with Session(engine) block that
constructed document_crud has closed, which can leave DocumentCrud holding a
closed session; move the provider.create(...) call (and any dependent logic like
batch_documents or document reads) inside the with Session(engine): block where
document_crud is created so the CRUD instance uses an open session, or
alternatively refactor DocumentCrud so it does not capture the session (e.g.,
pass sessions per-call) if keeping the call outside is required.

Comment on lines +173 to +180
provider = get_llm_provider(
    session=session,
    provider=deletion_request.provider,
    project_id=project_id,
    organization_id=organization_id,
)

provider.delete(collection)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and content around the problematic lines
head -n 200 backend/app/services/collections/delete_collection.py | tail -n +150

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1840


🏁 Script executed:

# Also get the full file to understand the context
wc -l backend/app/services/collections/delete_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 132


🏁 Script executed:

# Search for provider validation patterns in this file and related files
rg "collection\.provider" backend/app/services/collections/delete_collection.py -A3 -B3

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Search for deletion_request and how it's used
rg "deletion_request" backend/app/services/collections/delete_collection.py -A2 -B2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 952


🏁 Script executed:

# Get the full delete_collection.py file content
cat -n backend/app/services/collections/delete_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 8238


🏁 Script executed:

# Find the Collection and DeletionRequest models
fd -e py | xargs rg "class Collection\b|class DeletionRequest" -A10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1228


🏁 Script executed:

# Check the get_llm_provider function to see if it validates provider
rg "def get_llm_provider" -A20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3605


Add provider mismatch validation to prevent deletion with wrong provider.

The code uses deletion_request.provider to instantiate the LLM provider, but collection has its own provider attribute (defined in the Collection model). Without validating these match, deletion could attempt to use the wrong provider's client to delete a collection.

Add a validation check after retrieving the collection:

if deletion_request.provider != collection.provider:
    raise ValueError(f"Provider mismatch: request={deletion_request.provider}, collection={collection.provider}")
🤖 Prompt for AI Agents
In backend/app/services/collections/delete_collection.py around lines 173 to
180, after you retrieve the collection and before calling
get_llm_provider()/provider.delete(), validate that deletion_request.provider
matches collection.provider and raise a ValueError when they differ; add a
conditional that compares deletion_request.provider and collection.provider and
raises a ValueError with a clear message like "Provider mismatch: request=<...>,
collection=<...>" so provider.delete is only invoked when the providers match.
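The suggested guard can be sketched as a small standalone check (the function name is illustrative; the message format follows the review's suggestion):

```python
def ensure_provider_match(request_provider: str, collection_provider: str) -> None:
    # Refuse to proceed when the deletion request names a different
    # provider than the one stored on the collection row.
    if request_provider != collection_provider:
        raise ValueError(
            f"Provider mismatch: request={request_provider}, "
            f"collection={collection_provider}"
        )

ensure_provider_match("openai", "openai")  # matching providers pass silently

try:
    ensure_provider_match("openai", "anthropic")
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Failing loudly here is preferable to silently invoking the wrong provider's delete call against an external service.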


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Collections: making this module llm provider agnostic

2 participants