Collection: making the module provider agnostic #508
base: main
Conversation
📝 Walkthrough

A new provider abstraction system is introduced for collection management, replacing direct provider-specific logic with a unified interface. Changes include database schema extensions (a provider enum and a collection_blob column), reorganized data models for request/response handling, and refactored service-layer operations through a registry-based provider pattern supporting an OpenAI backend.
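To make the abstraction concrete, here is a minimal sketch of the registry-based pattern described above. The names `BaseProvider`, `OpenAIProvider`, and `get_llm_provider` come from this PR, but the method bodies, signatures, and the `_PROVIDERS` dict are simplified assumptions rather than the actual implementation (which resolves project credentials and builds a real OpenAI client):

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseProvider(ABC):
    """Common interface each LLM provider implementation must satisfy."""

    def __init__(self, client: Any) -> None:
        self.client = client

    @abstractmethod
    def create(self, collection_request: Any, storage: Any, document_crud: Any) -> Any:
        """Create the remote collection and return the created-resource metadata."""

    @abstractmethod
    def delete(self, collection: Any) -> None:
        """Delete the remote resources backing a collection."""

    @abstractmethod
    def cleanup(self, collection_result: Any) -> None:
        """Best-effort removal of partially created remote resources after a failure."""


class OpenAIProvider(BaseProvider):
    def create(self, collection_request, storage, document_crud):
        ...  # build a vector store (and optionally an assistant) via self.client

    def delete(self, collection):
        ...  # delete the vector store or assistant recorded on the collection

    def cleanup(self, collection_result):
        ...  # undo whatever was created remotely before the failure


# Hypothetical registry table; the PR exposes this through an LLMProvider registry.
_PROVIDERS = {"openai": OpenAIProvider}


def get_llm_provider(provider: str, client: Any) -> BaseProvider:
    """Resolve a provider name to a concrete implementation wired to an API client."""
    try:
        provider_class = _PROVIDERS[provider.lower()]
    except KeyError:
        raise ValueError(f"Provider '{provider}' is not supported.") from None
    return provider_class(client=client)
```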
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Client
    participant CollectionService as Collection Service
    participant Provider as LLMProvider
    participant OpenAIAPI as OpenAI API
    participant Database as Database
    Client->>CollectionService: execute_job(CreationRequest)
    activate CollectionService
    rect rgb(200, 220, 255)
        Note over CollectionService: Initialize Provider
        CollectionService->>Provider: get_llm_provider(provider="openai")
        Provider->>Provider: Lookup credentials via registry
        Provider->>OpenAIAPI: Initialize OpenAI client
        activate Provider
    end
    rect rgb(220, 240, 220)
        Note over CollectionService: Delegate Creation
        CollectionService->>Provider: provider.create(CreationRequest, storage, DocumentCrud)
        Provider->>OpenAIAPI: Create vector store from batched documents
        Provider->>OpenAIAPI: Optionally create assistant if model/instructions provided
        Provider-->>CollectionService: CreateCollectionResult (llm_service_id, llm_service_name, collection_blob)
        deactivate Provider
    end
    rect rgb(240, 220, 220)
        Note over CollectionService: Persist to Database
        CollectionService->>Database: Store Collection with provider, collection_blob, llm_service_id
        Database-->>CollectionService: Stored
    end
    alt Success
        CollectionService-->>Client: Job completed
    else Failure
        rect rgb(255, 200, 200)
            Note over CollectionService: Cleanup on Failure
            CollectionService->>Provider: provider.cleanup(result)
            Provider->>OpenAIAPI: Delete created vector store/assistant
        end
        CollectionService-->>Client: Job failed
    end
    deactivate CollectionService
```
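In code, the creation flow in the diagram above roughly corresponds to the following sketch. It reuses the hypothetical `get_llm_provider` from the earlier snippet; the `save_collection` callback and function name are placeholders, not the PR's real helpers:

```python
def run_creation_job(creation_request, storage, document_crud, client, save_collection):
    """Sketch of the job flow: create remotely, persist locally, clean up on failure."""
    provider = get_llm_provider(creation_request.provider, client)
    result = None
    try:
        result = provider.create(
            collection_request=creation_request,
            storage=storage,
            document_crud=document_crud,
        )
        # Persist provider, collection_blob and llm_service_id on the Collection row.
        save_collection(result)
    except Exception:
        # Roll back the remote vector store / assistant if anything was created.
        if result is not None:
            provider.cleanup(result)
        raise
```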
```mermaid
sequenceDiagram
    participant Client
    participant CollectionService as Collection Service
    participant Provider as LLMProvider
    participant OpenAIAPI as OpenAI API
    participant Database as Database
    Client->>CollectionService: delete_collection(collection_id)
    activate CollectionService
    rect rgb(240, 240, 240)
        Note over CollectionService: Fetch Collection & Initialize Provider
        CollectionService->>Database: Fetch Collection (includes llm_service_name, provider)
        Database-->>CollectionService: Collection object
        CollectionService->>Provider: get_llm_provider(provider)
        activate Provider
    end
    rect rgb(220, 240, 220)
        Note over CollectionService: Delete External Resource
        CollectionService->>Provider: provider.delete(collection)
        alt llm_service_name != "openai vector store"
            Provider->>OpenAIAPI: Delete assistant
        else llm_service_name == "openai vector store"
            Provider->>OpenAIAPI: Delete vector store
        end
        Provider-->>CollectionService: Deleted
        deactivate Provider
    end
    rect rgb(240, 220, 220)
        Note over CollectionService: Remove from Database
        CollectionService->>Database: Delete collection record
        Database-->>CollectionService: Deleted
    end
    CollectionService-->>Client: Deletion complete
    deactivate CollectionService
```
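The deletion branch in the second diagram can be sketched as a small dispatch on `llm_service_name`. This is illustrative only: the `"openai vector store"` service name appears in the PR, while the exact SDK calls below assume a recent `openai` Python client (older releases expose vector stores under `client.beta.vector_stores`):

```python
from openai import OpenAI


def delete_remote_resource(collection, client: OpenAI) -> None:
    # Vector-store-backed collections delete the vector store; anything else is
    # treated as an assistant-backed collection, mirroring the alt branch above.
    if collection.llm_service_name == "openai vector store":
        client.vector_stores.delete(collection.llm_service_id)
    else:
        client.beta.assistants.delete(collection.llm_service_id)
```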
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)
269-270: Potential `NameError` if `CreationRequest` parsing fails.

If `CreationRequest(**request)` on line 156 raises an exception, `creation_request` is never assigned. The check on line 269 will then raise a `NameError`.

Proposed fix: initialize `creation_request` before the try block or guard the check

```diff
+    creation_request = None
     try:
         creation_request = CreationRequest(**request)
         # ...
     except Exception as err:
         # ...
-        if creation_request and creation_request.callback_url and collection_job:
+        if creation_request is not None and creation_request.callback_url and collection_job:
             failure_payload = build_failure_payload(collection_job, str(err))
             send_callback(creation_request.callback_url, failure_payload)
```
🧹 Nitpick comments (11)
backend/app/services/collections/helpers.py (1)
17-25: Consider raising an error or logging for unknown providers.

Returning an empty string for unknown providers could lead to silent failures downstream. Consider logging a warning or raising a `ValueError` for unsupported providers to make debugging easier.

🔎 Suggested improvement

```diff
 def get_service_name(provider: str) -> str:
     """Get the collection service name for a provider."""
     names = {
         "openai": "openai vector store",
         # "bedrock": "bedrock knowledge base",
         # "gemini": "gemini file search store",
     }
-    return names.get(provider.lower(), "")
+    service_name = names.get(provider.lower())
+    if service_name is None:
+        logger.warning(f"[get_service_name] Unknown provider: {provider}")
+        return ""
+    return service_name
```

backend/app/services/collections/providers/base.py (3)
30-53: Docstring parameters don't match the method signature.

The docstring mentions `batch_size`, `with_assistant`, and `assistant_options` parameters that don't exist in the actual method signature. Also:

- Line 48: "CreateCollectionresult" → "CreateCollectionResult" (typo)
- Line 51: "kb_blob" → "collection_blob" (field name mismatch)
- Line 53: error message says "execute method" but method is named "create"

Proposed fix

```diff
 @abstractmethod
 def create(
     self,
     collection_request: CreationRequest,
     storage: CloudStorage,
     document_crud: DocumentCrud,
 ) -> CreateCollectionResult:
     """Create collection with documents and optionally an assistant.

     Args:
-        collection_params: Collection parameters (name, description, chunking_params, etc.)
+        collection_request: Creation request containing collection params and options
         storage: Cloud storage instance for file access
         document_crud: DocumentCrud instance for fetching documents
-        batch_size: Number of documents to process per batch
-        with_assistant: Whether to create an assistant/agent
-        assistant_options: Options for assistant creation (provider-specific)

     Returns:
-        CreateCollectionresult containing:
+        CreateCollectionResult containing:
         - llm_service_id: ID of the created resource (vector store or assistant)
         - llm_service_name: Name of the service
-        - kb_blob: All collection params except documents
+        - collection_blob: All collection params except documents
     """
-    raise NotImplementedError("Providers must implement execute method")
+    raise NotImplementedError("Providers must implement create method")
```
55-65: Docstring Args don't match the method signature.

The docstring mentions `llm_service_id` and `llm_service_name` as parameters, but the actual signature only accepts `collection: Collection`.

Proposed fix

```diff
 @abstractmethod
 def delete(self, collection: Collection) -> None:
     """Delete remote resources associated with a collection.

     Called when a collection is being deleted and remote resources need to be cleaned up.

     Args:
-        llm_service_id: ID of the resource to delete
-        llm_service_name: Name of the service (determines resource type)
+        collection: The collection whose remote resources should be deleted
     """
     raise NotImplementedError("Providers must implement delete method")
```
67-76: Typo in docstring.

Line 74: "CreateCollectionresult" should be "CreateCollectionResult".

Proposed fix

```diff
-        collection_result: The CreateCollectionresult returned from execute, containing resource IDs
+        collection_result: The CreateCollectionResult returned from create, containing resource IDs
```

backend/app/services/collections/create_collection.py (1)
35-42: Unused `with_assistant` parameter.

The `with_assistant` parameter is accepted but never used in `start_job`. The assistant creation logic is now determined by checking `model` and `instructions` in the provider. Consider removing this parameter if it's no longer needed.

Proposed fix

```diff
 def start_job(
     db: Session,
     request: CreationRequest,
     project_id: int,
     collection_job_id: UUID,
-    with_assistant: bool,
     organization_id: int,
 ) -> str:
```

backend/app/services/collections/providers/openai.py (4)
2-2: Unused import: `Any`.

The `Any` type is imported but not used in this file.

Proposed fix

```diff
 import logging
-from typing import Any

 from openai import OpenAI
```
24-26: Redundant `self.client` assignment.

`super().__init__(client)` already assigns `self.client = client` in `BaseProvider.__init__`. The second assignment on line 26 is redundant.

Proposed fix

```diff
 def __init__(self, client: OpenAI):
     super().__init__(client)
-    self.client = client
```
62-65: Log messages reference wrong method name.

The log prefix says `[OpenAIProvider.execute]` but the method is named `create`. Per coding guidelines, log messages should be prefixed with the function name.

Proposed fix for all occurrences in the create method

```diff
 logger.info(
-    "[OpenAIProvider.execute] Vector store created | "
+    "[OpenAIProvider.create] Vector store created | "
     f"vector_store_id={vector_store.id}, batches={len(docs_batches)}"
 )
```

Apply similar changes to lines 93-95, 104-105, and 114-118.
60-60: Consider explicit loop for generator consumption.

Using `list()` to consume a generator whose result is discarded can be unclear. A `for` loop or `collections.deque(maxlen=0)` pattern would make intent clearer.

Proposed alternative

```diff
-list(vector_store_crud.update(vector_store.id, storage, docs_batches))
+for _ in vector_store_crud.update(vector_store.id, storage, docs_batches):
+    pass
```

backend/app/services/collections/providers/registry.py (1)
61-69: Unreachable else branch and logging format.

The `else` branch (lines 65-69) is unreachable because `LLMProvider.get(provider)` on line 47 already raises `ValueError` for unsupported providers. Also, the log message on line 67 should use square brackets per coding guidelines: `[get_llm_provider]`.

Proposed fix: remove unreachable code or convert to assertion

```diff
 if provider == LLMProvider.OPENAI:
     if "api_key" not in credentials:
         raise ValueError("OpenAI credentials not configured for this project.")
     client = OpenAI(api_key=credentials["api_key"])
-else:
-    logger.error(
-        f"[get_llm_provider] Unsupported provider type requested: {provider}"
-    )
-    raise ValueError(f"Provider '{provider}' is not supported.")
+else:
+    # This branch is unreachable as LLMProvider.get validates the provider,
+    # but kept as defensive programming for future provider additions.
+    raise AssertionError(f"Unhandled provider: {provider}")

 return provider_class(client=client)
```

backend/app/models/collection/response.py (1)
20-29: Add `provider` field to `CollectionPublic`.

The `Collection` database model includes a `provider` field (ProviderType enum) that represents the LLM provider (e.g., "openai"). This field is missing from `CollectionPublic` and should be exposed in the response schema. Per learnings, `provider` and `llm_service_name` serve different purposes: `provider` indicates the LLM provider name while `llm_service_name` specifies the particular service from that provider. Exposing both fields provides complete information to API consumers about the collection's LLM configuration.
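A minimal sketch of what exposing the field could look like; the real `CollectionPublic` model has more fields than shown here, and the plain `str` type is a simplification of the `ProviderType` enum used in the PR:

```python
from sqlmodel import SQLModel


class CollectionPublic(SQLModel):
    # Other response fields from the actual model are elided in this sketch.
    llm_service_id: str
    llm_service_name: str
    provider: str  # e.g. "openai"; the PR models this as the ProviderType enum
```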
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
- backend/app/models/__init__.py
- backend/app/models/collection/__init__.py
- backend/app/models/collection/request.py
- backend/app/models/collection/response.py
- backend/app/services/collections/create_collection.py
- backend/app/services/collections/delete_collection.py
- backend/app/services/collections/helpers.py
- backend/app/services/collections/providers/__init__.py
- backend/app/services/collections/providers/base.py
- backend/app/services/collections/providers/openai.py
- backend/app/services/collections/providers/registry.py
- backend/app/tests/api/routes/collections/test_collection_info.py
- backend/app/tests/api/routes/collections/test_collection_list.py
- backend/app/tests/utils/collection.py
🧰 Additional context used
📓 Path-based instructions (6)
backend/app/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement business logic in services located in
backend/app/services/
Files:
- backend/app/services/collections/delete_collection.py
- backend/app/services/collections/providers/openai.py
- backend/app/services/collections/providers/base.py
- backend/app/services/collections/providers/registry.py
- backend/app/services/collections/create_collection.py
- backend/app/services/collections/providers/__init__.py
- backend/app/services/collections/helpers.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: `logger.info(f"[function_name] Message {mask_string(sensitive_value)}")`
Use Python 3.11+ with type hints throughout the codebase
Files:
- backend/app/services/collections/delete_collection.py
- backend/app/services/collections/providers/openai.py
- backend/app/services/collections/providers/base.py
- backend/app/services/collections/providers/registry.py
- backend/app/tests/utils/collection.py
- backend/app/tests/api/routes/collections/test_collection_list.py
- backend/app/services/collections/create_collection.py
- backend/app/services/collections/providers/__init__.py
- backend/app/services/collections/helpers.py
- backend/app/models/collection/__init__.py
- backend/app/models/collection/response.py
- backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
- backend/app/tests/api/routes/collections/test_collection_info.py
- backend/app/models/__init__.py
- backend/app/models/collection/request.py
backend/app/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use factory pattern for test fixtures in
backend/app/tests/
Files:
- backend/app/tests/utils/collection.py
- backend/app/tests/api/routes/collections/test_collection_list.py
- backend/app/tests/api/routes/collections/test_collection_info.py
backend/app/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `sa_column_kwargs["comment"]` to describe database columns, especially for non-obvious purposes, status/type fields, JSON/metadata columns, and foreign keys
Files:
- backend/app/models/collection/__init__.py
- backend/app/models/collection/response.py
- backend/app/models/__init__.py
- backend/app/models/collection/request.py
backend/app/alembic/versions/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Generate database migrations using `alembic revision --autogenerate -m "Description" --rev-id <number>` where rev-id is the latest existing revision ID + 1
Files:
backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
backend/app/models/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use SQLModel for database models located in
backend/app/models/
Files:
backend/app/models/__init__.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.
Applied to files:
- backend/app/services/collections/delete_collection.py
- backend/app/services/collections/providers/openai.py
- backend/app/services/collections/providers/registry.py
- backend/app/tests/utils/collection.py
- backend/app/tests/api/routes/collections/test_collection_list.py
- backend/app/services/collections/providers/__init__.py
- backend/app/services/collections/helpers.py
- backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
- backend/app/tests/api/routes/collections/test_collection_info.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/crud/*.py : Use CRUD pattern for database access operations located in `backend/app/crud/`
Applied to files:
backend/app/services/collections/delete_collection.py
📚 Learning: 2025-12-17T10:16:16.173Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:16.173Z
Learning: In backend/app/models/collection.py, treat provider as the LLM provider name (e.g., 'openai') and llm_service_name as the specific service from that provider. These fields serve different purposes and should remain non-redundant. Document their meanings, add clear type hints (e.g., provider: str, llm_service_name: str), and consider a small unit test or validation to ensure they are distinct and used appropriately, preventing accidental aliasing or duplication across the model or serializers.
Applied to files:
- backend/app/models/collection/__init__.py
- backend/app/models/collection/response.py
- backend/app/models/__init__.py
- backend/app/models/collection/request.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/models/**/*.py : Use `sa_column_kwargs["comment"]` to describe database columns, especially for non-obvious purposes, status/type fields, JSON/metadata columns, and foreign keys
Applied to files:
backend/app/models/collection/request.py
🧬 Code graph analysis (11)
backend/app/services/collections/delete_collection.py (4)
  - backend/app/services/collections/providers/registry.py (1): get_llm_provider (44-71)
  - backend/app/services/collections/providers/openai.py (1): delete (121-148)
  - backend/app/services/collections/providers/base.py (1): delete (56-65)
  - backend/app/crud/collection/collection.py (1): delete (103-111)
backend/app/services/collections/providers/base.py (5)
  - backend/app/crud/document/document.py (1): DocumentCrud (13-134)
  - backend/app/core/cloud/storage.py (1): CloudStorage (113-141)
  - backend/app/models/collection/request.py (2): CreationRequest (224-236), Collection (26-92)
  - backend/app/models/collection/response.py (1): CreateCollectionResult (10-13)
  - backend/app/services/collections/providers/openai.py (3): create (28-119), delete (121-148), cleanup (150-160)
backend/app/tests/utils/collection.py (2)
  - backend/app/models/collection/request.py (1): ProviderType (16-19)
  - backend/app/services/collections/helpers.py (1): get_service_name (18-25)
backend/app/tests/api/routes/collections/test_collection_list.py (1)
  - backend/app/services/collections/helpers.py (1): get_service_name (18-25)
backend/app/services/collections/providers/__init__.py (3)
  - backend/app/services/collections/providers/base.py (1): BaseProvider (9-84)
  - backend/app/services/collections/providers/openai.py (1): OpenAIProvider (21-160)
  - backend/app/services/collections/providers/registry.py (2): LLMProvider (14-41), get_llm_provider (44-71)
backend/app/services/collections/helpers.py (1)
  - backend/app/services/collections/providers/registry.py (1): get (28-36)
backend/app/models/collection/__init__.py (2)
  - backend/app/models/collection/request.py (7): Collection (26-92), CreationRequest (224-236), DeletionRequest (239-243), CallbackRequest (197-203), AssistantOptions (141-194), CreateCollectionParams (106-138), ProviderType (16-19)
  - backend/app/models/collection/response.py (4): CollectionIDPublic (16-17), CollectionPublic (20-29), CollectionWithDocsPublic (32-33), CreateCollectionResult (10-13)
backend/app/models/collection/response.py (1)
  - backend/app/models/document.py (1): DocumentPublic (72-85)
backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py (2)
  - backend/app/services/collections/providers/openai.py (1): create (28-119)
  - backend/app/services/collections/providers/base.py (1): create (31-53)
backend/app/tests/api/routes/collections/test_collection_info.py (1)
  - backend/app/services/collections/helpers.py (1): get_service_name (18-25)
backend/app/models/collection/request.py (3)
  - backend/app/core/util.py (1): now (11-12)
  - backend/app/models/organization.py (1): Organization (44-82)
  - backend/app/models/project.py (1): Project (51-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.12, 6)
🔇 Additional comments (25)
backend/app/services/collections/helpers.py (1)
108-111: LGTM! The refactor to use `get_service_name("openai")` instead of the removed constant is correct and maintains the existing behavior while aligning with the new provider abstraction.

backend/app/models/collection/request.py (4)
130-138: LGTM! The deduplication logic in `model_post_init` correctly removes duplicate documents by ID while preserving order.

214-221: LGTM! The `normalize_provider` validator correctly handles case-insensitive provider matching by normalizing to lowercase before validation.

224-243: LGTM! The `CreationRequest` and `DeletionRequest` models are well-structured, properly compose their parent classes, and provide clear field descriptions.

35-47: ENUM type name is consistent. The `provider` field correctly uses `name="providertype"` (lowercase) in both the model definition and the Alembic migration at `041_adding_blob_column_in_collection_table.py`. The `create_type` difference is intentional: the migration uses `create_type=True` to create the type, while the model uses `create_type=False` to use the existing type.
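As a rough illustration of that `create_type` split (the enum value and names here are assumptions for the sketch, not the PR's exact code), the migration owns creation of the PostgreSQL type while the model reuses it:

```python
from sqlalchemy.dialects import postgresql

# In the Alembic migration: creating the column also creates the PostgreSQL type.
migration_enum = postgresql.ENUM("openai", name="providertype", create_type=True)

# In the model definition: reuse the type the migration created instead of recreating it.
model_enum = postgresql.ENUM("openai", name="providertype", create_type=False)
```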
backend/app/tests/api/routes/collections/test_collection_list.py (2)

10-10: LGTM! Importing `get_service_name` from the helpers module is correct and aligns with the provider abstraction changes.

105-106: LGTM! Using `get_service_name("openai")` instead of a hardcoded string improves maintainability and ensures consistency with the service layer.

backend/app/tests/api/routes/collections/test_collection_info.py (2)
12-12: LGTM! Import aligns with the provider abstraction pattern used across test files.

167-168: LGTM! Assertion updated consistently with other test files to use the helper function.
backend/app/tests/utils/collection.py (3)
11-14: LGTM! Imports correctly added to support the provider abstraction in test utilities.

42-50: LGTM! The `get_collection` function correctly sets `provider=ProviderType.OPENAI` on the created Collection, aligning with the new provider-based model.

67-75: LGTM! The `get_vector_store_collection` function correctly uses both `get_service_name("openai")` for the service name and `ProviderType.OPENAI` for the provider field, maintaining consistency with the provider abstraction.

backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py (1)
24-54: LGTM on the safe migration pattern. The upgrade correctly follows the safe pattern for adding a NOT NULL column with existing data (see the sketch after this list):
- Add column as nullable
- Backfill existing rows with default value
- Alter column to NOT NULL
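A sketch of that add-nullable / backfill / tighten sequence is shown below; the column type and backfill value are assumptions for illustration, not copied from the actual revision:

```python
import sqlalchemy as sa
from alembic import op


def upgrade():
    # 1. Add the new column as nullable so existing rows don't violate NOT NULL.
    op.add_column("collection", sa.Column("collection_blob", sa.JSON(), nullable=True))
    # 2. Backfill existing rows with a default value.
    op.execute("UPDATE collection SET collection_blob = '{}' WHERE collection_blob IS NULL")
    # 3. Tighten the constraint once every row has a value.
    op.alter_column("collection", "collection_blob", nullable=False)
```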
backend/app/services/collections/delete_collection.py (2)
17-20: LGTM! Imports correctly updated to use the new provider registry pattern, removing direct OpenAI CRUD dependencies.

159-180: Session management looks correct. The provider is obtained within a session context (for credential lookup), but `provider.delete(collection)` is called outside the session block. This is appropriate since the external API call shouldn't hold the database session open.
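A compressed sketch of that scoping, with a hypothetical `fetch_collection` helper standing in for the real CRUD call (the `get_llm_provider` keyword arguments mirror the snippet quoted later in this review):

```python
from sqlmodel import Session


def delete_collection_flow(engine, collection_id, project_id, organization_id):
    # Hold the database session only for lookups and credential resolution.
    with Session(engine) as session:
        collection = fetch_collection(session, collection_id, project_id)  # hypothetical helper
        provider = get_llm_provider(
            session=session,
            provider=collection.provider,
            project_id=project_id,
            organization_id=organization_id,
        )
    # The session is closed here; the slow external API call runs outside it.
    provider.delete(collection)
```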
backend/app/models/__init__.py (1)

9-19: LGTM! The expanded exports correctly expose the new collection-related types (`CreateCollectionParams`, `CreateCollectionResult`, `CreationRequest`, `DeletionRequest`, `ProviderType`) needed for the provider-agnostic collection management.

backend/app/services/collections/providers/base.py (1)
78-84: LGTM! The `get_provider_name` utility cleanly derives a lowercase provider name from the class name by convention.

backend/app/services/collections/providers/__init__.py (1)
1-6: LGTM!The re-exports consolidate the provider package's public API cleanly. Consider adding an
__all__list for explicit export control, though this is optional.backend/app/models/collection/response.py (2)
10-17: LGTM!
CreateCollectionResultandCollectionIDPublicmodels are well-defined with proper type hints.
32-33: LGTM!
CollectionWithDocsPubliccorrectly extendsCollectionPublicwith an optional documents list.backend/app/services/collections/create_collection.py (1)
254-260: LGTM!The provider cleanup is properly guarded—only attempts cleanup if both
providerandresultare available, with error handling to prevent masking the original exception.backend/app/models/collection/__init__.py (1)
1-15: LGTM!The package correctly aggregates and re-exports public types from the request and response submodules, providing a clean import surface.
backend/app/services/collections/providers/openai.py (1)
121-160: LGTM!The
deleteandcleanupmethods correctly handle both assistant and vector store resources with proper error handling and logging.backend/app/services/collections/providers/registry.py (2)
14-41: LGTM! The `LLMProvider` registry pattern is well-structured and extensible, with clear methods for provider lookup and listing supported providers.

44-59: LGTM! The factory function properly validates credentials existence and provider-specific requirements before constructing the client.
```python
def downgrade():
    op.alter_column(
        "collection",
        "llm_service_name",
        existing_type=sa.VARCHAR(),
        comment="Name of the LLM service provider",
        existing_comment="Name of the LLM service",
        existing_nullable=False,
    )
    op.drop_column("collection", "provider")
    op.drop_column("collection", "collection_blob")
```
Missing ENUM type drop in downgrade.
The downgrade() function drops the provider and collection_blob columns but doesn't drop the providertype ENUM type. This could leave orphaned types in the database after a rollback.
🔎 Proposed fix
```diff
 def downgrade():
     op.alter_column(
         "collection",
         "llm_service_name",
         existing_type=sa.VARCHAR(),
         comment="Name of the LLM service provider",
         existing_comment="Name of the LLM service",
         existing_nullable=False,
     )
     op.drop_column("collection", "provider")
     op.drop_column("collection", "collection_blob")
+    provider_enum.drop(op.get_bind(), checkfirst=True)
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
around lines 66 to 76, the downgrade drops the provider and collection_blob
columns but does not remove the providertype ENUM type, leaving an orphaned type
in the database; update downgrade to drop the providertype ENUM after dropping
the provider column by using op.execute or sa.Enum(...).drop(op.get_bind(),
checkfirst=True) (or op.execute('DROP TYPE IF EXISTS providertype') depending on
DB) to remove the ENUM type safely and ensure checkfirst behavior so downgrade
is idempotent.
```python
class DocumentInput(SQLModel):
    """Document to be added to knowledge base."""

    name: str | None = Field(
        description="Display name for the document",
    )
    batch_size: int = Field(
        default=1,
        description=(
            "Number of documents to send to OpenAI in a single "
            "transaction. See the `file_ids` parameter in the "
            "vector store [create batch](https://platform.openai.com/docs/api-reference/vector-stores-file-batches/createBatch)."
        ),
    )
    id: UUID = Field(
        description="Reference to uploaded file/document in Kaapi",
    )
```
DocumentInput.name field missing default value.
The name field is typed as str | None but lacks a default=None, which will make it a required field despite being optional by type.
🔎 Proposed fix
```diff
 class DocumentInput(SQLModel):
     """Document to be added to knowledge base."""
     name: str | None = Field(
+        default=None,
         description="Display name for the document",
     )
     id: UUID = Field(
         description="Reference to uploaded file/document in Kaapi",
     )
```

🤖 Prompt for AI Agents
In backend/app/models/collection/request.py around lines 95 to 103, the
DocumentInput.name is annotated as optional (str | None) but lacks a default so
Pydantic/SQLModel treats it as required; update the Field declaration to include
default=None (i.e., name: str | None = Field(default=None, description="Display
name for the document")) so the field is truly optional and behaves as intended.
```python
name: str | None = Field(
    min_length=1,
    description="Name of the knowledge base to create or update",
)
```
🧩 Analysis chain
🏁 Script executed:
# Find and read the file to see the actual context
fd backend/app/models/collection/request.py --type f

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 57
🏁 Script executed:
# Read the specific file around lines 109-112
if [ -f "backend/app/models/collection/request.py" ]; then
wc -l backend/app/models/collection/request.py
fi

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 119
🏁 Script executed:
# Get the file structure to understand imports and class definition
head -50 backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 1386
🏁 Script executed:
# Get lines 100-120 to see the name field and context
sed -n '100,120p' backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 729
🏁 Script executed:
python3 << 'EOF'
from pydantic import Field
from sqlmodel import SQLModel
class CreateCollectionParams(SQLModel):
name: str | None = Field(
min_length=1,
description="Name of the knowledge base to create or update",
)
# Test 1: None value
try:
t1 = CreateCollectionParams(name=None)
print(f'Test 1 - None accepted: {t1.name}')
except Exception as e:
print(f'Test 1 - Validation error with None: {type(e).__name__}: {e}')
# Test 2: Empty string
try:
t2 = CreateCollectionParams(name="")
print(f'Test 2 - Empty string accepted: {t2.name}')
except Exception as e:
print(f'Test 2 - Validation error with empty string: {type(e).__name__}: {e}')
# Test 3: Valid string
try:
t3 = CreateCollectionParams(name="test")
print(f'Test 3 - Valid string accepted: {t3.name}')
except Exception as e:
print(f'Test 3 - Validation error with valid string: {type(e).__name__}: {e}')
# Test 4: No name provided
try:
t4 = CreateCollectionParams()
print(f'Test 4 - No name provided, defaults to: {t4.name}')
except Exception as e:
print(f'Test 4 - Validation error when name not provided: {type(e).__name__}: {e}')
EOF

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 161
🏁 Script executed:
# Check the full CreateCollectionParams class and surrounding context
sed -n '105,145p' backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 1494
🏁 Script executed:
# Check if there are imports or other usage patterns in the file
grep -n "sa_column_kwargs" backend/app/models/collection/request.py | head -20Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 659
Add explicit default=None to the name field.
The name field is typed as str | None but lacks an explicit default=None, while other Optional fields in this class (description, chunking_params, additional_params) all include it. This inconsistency can cause unexpected validation behavior in Pydantic. Add default=None to match the pattern: name: str | None = Field(min_length=1, default=None, description="...").
🤖 Prompt for AI Agents
In backend/app/models/collection/request.py around lines 109 to 112, the name
field is annotated as str | None but lacks an explicit default=None whereas
other optional fields include it; update the Field call to add default=None
(i.e., Field(min_length=1, default=None, description="Name of the knowledge base
to create or update")) so Pydantic treats it consistently and avoids unexpected
validation behavior.
```python
result = provider.create(
    collection_request=creation_request,
    storage=storage,
    document_crud=document_crud,
)
```
provider.create called outside session context—verify document_crud remains usable.
document_crud is created inside a with Session(engine) block (lines 159-178), but provider.create is called after the block exits. Since DocumentCrud holds a reference to the session, operations like read_each inside batch_documents may fail if the session is closed.
Proposed fix: move provider.create inside the session block
```diff
 with Session(engine) as session:
     collection_job_crud = CollectionJobCrud(session, project_id)
     collection_job = collection_job_crud.read_one(job_uuid)
     collection_job = collection_job_crud.update(
         job_uuid,
         CollectionJobUpdate(
             task_id=task_id,
             status=CollectionJobStatus.PROCESSING,
         ),
     )
     storage = get_cloud_storage(session=session, project_id=project_id)
     document_crud = DocumentCrud(session, project_id)
     provider = get_llm_provider(
         session=session,
         provider=creation_request.provider,
         project_id=project_id,
         organization_id=organization_id,
     )

-result = provider.create(
-    collection_request=creation_request,
-    storage=storage,
-    document_crud=document_crud,
-)
+    result = provider.create(
+        collection_request=creation_request,
+        storage=storage,
+        document_crud=document_crud,
+    )
```
In backend/app/services/collections/create_collection.py around lines 180 to
184, provider.create is being invoked after the with Session(engine) block that
constructed document_crud has closed, which can leave DocumentCrud holding a
closed session; move the provider.create(...) call (and any dependent logic like
batch_documents or document reads) inside the with Session(engine): block where
document_crud is created so the CRUD instance uses an open session, or
alternatively refactor DocumentCrud so it does not capture the session (e.g.,
pass sessions per-call) if keeping the call outside is required.
```diff
 provider = get_llm_provider(
     session=session,
     provider=deletion_request.provider,
     project_id=project_id,
     organization_id=organization_id,
 )

-# Delete the corresponding OpenAI resource (vector store or assistant)
-if is_vector:
-    OpenAIVectorStoreCrud(client).delete(llm_service_id)
-else:
-    OpenAIAssistantCrud(client).delete(llm_service_id)
+provider.delete(collection)
```
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file structure and content around the problematic lines
head -n 200 backend/app/services/collections/delete_collection.py | tail -n +150

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 1840
🏁 Script executed:
# Also get the full file to understand the context
wc -l backend/app/services/collections/delete_collection.py

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 132
🏁 Script executed:
# Search for provider validation patterns in this file and related files
rg "collection\.provider" backend/app/services/collections/delete_collection.py -A3 -B3Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 57
🏁 Script executed:
# Search for deletion_request and how it's used
rg "deletion_request" backend/app/services/collections/delete_collection.py -A2 -B2Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 952
🏁 Script executed:
# Get the full delete_collection.py file content
cat -n backend/app/services/collections/delete_collection.py

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 8238
🏁 Script executed:
# Find the Collection and DeletionRequest models
fd -e py | xargs rg "class Collection\b|class DeletionRequest" -A10

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 1228
🏁 Script executed:
# Check the get_llm_provider function to see if it validates provider
rg "def get_llm_provider" -A20Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 3605
Add provider mismatch validation to prevent deletion with wrong provider.
The code uses deletion_request.provider to instantiate the LLM provider, but collection has its own provider attribute (defined in the Collection model). Without validating these match, deletion could attempt to use the wrong provider's client to delete a collection.
Add a validation check after retrieving the collection:
```python
if deletion_request.provider != collection.provider:
    raise ValueError(
        f"Provider mismatch: request={deletion_request.provider}, collection={collection.provider}"
    )
```
In backend/app/services/collections/delete_collection.py around lines 173 to
180, after you retrieve the collection and before calling
get_llm_provider()/provider.delete(), validate that deletion_request.provider
matches collection.provider and raise a ValueError when they differ; add a
conditional that compares deletion_request.provider and collection.provider and
raises a ValueError with a clear message like "Provider mismatch: request=<...>,
collection=<...>" so provider.delete is only invoked when the providers match.
Summary
Target issue is #489
Checklist
Before submitting a pull request, please ensure that you mark these tasks.

`fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
Release Notes
New Features
Refactor