Copilot AI commented Dec 29, 2025

Plan: Harden and Enable Quartz.NET Clustering Support ✅

  • Add convenience API for clustering configuration
    • Add ConfigureClusteringIdentity method to QuartzFeature with sensible defaults (renamed from EnableClustering)
    • Configure instanceId = AUTO and scheduler name for cluster-safe operation
    • Add comprehensive documentation about requirements and usage
    • Clarify that method only configures identity, not full clustering
  • Fix race conditions in job registration
    • Update RegisterJobsTask.cs to handle ObjectAlreadyExistsException
    • Make job registration idempotent and cluster-safe
    • Add clear documentation about replace parameter
    • Add debug logging for visibility
  • Fix race conditions in trigger registration
    • Update QuartzWorkflowScheduler.cs to handle ObjectAlreadyExistsException
    • Make trigger registration idempotent and cluster-safe
    • Document behavior differences from AddJob
    • Add debug logging for visibility
    • Document that trigger updates require unscheduling first
  • Update EFCore extensions
    • Add useClustering parameter to SQLite extension (defaults to false)
    • SqlServer, PostgreSQL, and MySQL already have the parameter (defaults to true)
  • Build and test the changes
    • Build the Quartz module successfully
    • Build all EFCore provider modules successfully
    • Address all code review feedback
    • Verify no regressions

Summary

All required changes have been implemented successfully, including addressing PR review feedback:

1. ConfigureClusteringIdentity Convenience API (renamed from EnableClustering)

  • Added to QuartzFeature class
  • Configures SchedulerId = "AUTO" and SchedulerName = "ElsaScheduler" by default
  • Updated documentation to clarify it only configures identity settings, not full clustering
  • Method name now better reflects what it actually does
  • Works seamlessly with existing EFCore extensions (UseSqlServer, UsePostgreSql, etc.)
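
As a rough illustration of the identity settings described above, the following sketch sets the same two values through Quartz.NET's Microsoft DI integration. It is not the PR's actual ConfigureClusteringIdentity implementation: the class name ClusterIdentitySketch and the method AddClusterIdentity are hypothetical, and only SchedulerId, SchedulerName, and the two default values come from the summary. A clustered persistent job store still has to be configured separately.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Quartz;

// Hypothetical helper, not part of Elsa: shows which Quartz settings
// make up the "identity" half of a clustered setup.
public static class ClusterIdentitySketch
{
    public static IServiceCollection AddClusterIdentity(
        this IServiceCollection services,
        string schedulerName = "ElsaScheduler")
    {
        return services.AddQuartz(quartz =>
        {
            // "AUTO" lets Quartz generate a unique instance ID per node.
            quartz.SchedulerId = "AUTO";

            // Every node in the same cluster must use the same scheduler name.
            quartz.SchedulerName = schedulerName;
        });
    }
}
```

These two settings only identify the node and the cluster it belongs to. Nodes coordinate through a shared, clustered job store, which is why the summary stresses that this method does not enable full clustering by itself.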

2. Race Condition Fixes

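  • RegisterJobsTask.cs: Replaced the check-then-act sequence with a try-catch around job registration, with debug logging
  • QuartzWorkflowScheduler.cs: Replaced the check-then-act sequence with a try-catch around trigger scheduling, with debug logging and updated documentation
  • Both now catch ObjectAlreadyExistsException and log it at debug level instead of failing
  • Registration is idempotent and safe for concurrent access in multi-pod Kubernetes deployments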
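
The try-catch pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from RegisterJobsTask.cs or QuartzWorkflowScheduler.cs: IdempotentRegistration, TryAddJobAsync, TryScheduleAsync, and NoOpJob are hypothetical names, and the debug logging added in the PR is replaced here by a boolean return value. Only AddJob, ScheduleJob, and ObjectAlreadyExistsException are taken from the Quartz.NET API.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Quartz;

// Hypothetical placeholder job, used only so the sketch has something to register.
public sealed class NoOpJob : IJob
{
    public Task Execute(IJobExecutionContext context) => Task.CompletedTask;
}

// Hypothetical helper illustrating "attempt, then tolerate the duplicate"
// instead of "check existence, then add".
public static class IdempotentRegistration
{
    // Returns true if this caller stored the job, false if it already existed.
    // The job must be durable, because AddJob stores it without a trigger.
    public static async Task<bool> TryAddJobAsync(
        IScheduler scheduler, IJobDetail job, CancellationToken ct = default)
    {
        try
        {
            // replace: false means an existing definition is never overwritten.
            await scheduler.AddJob(job, replace: false, ct);
            return true;
        }
        catch (ObjectAlreadyExistsException)
        {
            // Another node (or an earlier start) registered the job first.
            return false;
        }
    }

    // Returns true if this caller stored the trigger, false if it already existed.
    // The trigger must already reference its job (for example via ForJob).
    public static async Task<bool> TryScheduleAsync(
        IScheduler scheduler, ITrigger trigger, CancellationToken ct = default)
    {
        try
        {
            await scheduler.ScheduleJob(trigger, ct);
            return true;
        }
        catch (ObjectAlreadyExistsException)
        {
            // The trigger key is already taken; treat it as already scheduled.
            return false;
        }
    }
}
```

Because the existence check and the write happen as a single store operation, two pods starting at the same time cannot both pass a check and then collide. One succeeds and the other takes the catch branch. As the summary notes, an existing trigger is left untouched, so changing it requires unscheduling it first.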

3. EFCore Extensions Updated

  • SQLite: Added useClustering parameter (defaults to false for backwards compatibility)
  • SqlServer, PostgreSQL, MySQL: Already have useClustering parameter (defaults to true)
  • All extensions now consistently support the clustering configuration

4. Review Feedback Addressed

  • Renamed EnableClustering to ConfigureClusteringIdentity to better reflect functionality
  • Kept comprehensive documentation about requirements
  • Added useClustering parameter to SQLite extension
  • Added debug-level logging in both exception handlers for operational visibility
  • Documented that trigger updates require calling UnscheduleAsync first

5. Security Summary

  • No new security vulnerabilities introduced
  • Changes are defensive and improve system resilience
  • Changes are limited to exception handling, logging, configuration additions, and API naming improvements
  • No data access or authentication changes
Original prompt

This section details the original issue you should resolve

<issue_title>Harden and Document Quartz.NET Clustering Support</issue_title>
<issue_description>Elsa’s Quartz.NET integration must be enhanced and documented to properly support clustered, multi-instance deployments and to prevent race conditions observed in real-world Kubernetes environments.

Cluster Mode

While the current Quartz.NET integration already exposes direct access to Quartz.NET’s configuration APIs—allowing advanced users to manually configure clustered mode—there is no first-class, opinionated configuration path. To improve usability and reduce misconfiguration risk, the integration should expose a convenience method (e.g. EnableClustering) that configures sensible defaults for clustered operation (such as instanceId = AUTO, clustered job store, and check-in intervals), while still allowing optional parameters to be overridden where appropriate.

Documentation

In parallel, Elsa’s documentation must be updated to clearly explain clustered Quartz.NET usage in distributed hosting scenarios. Specifically, the existing documentation section on Distributed Hosting and Quartz.NET clustered mode https://docs.elsaworkflows.io/hosting/distributed-hosting#id-4.-quartz.net-clustered-mode should be revised to reflect the current Elsa 3.6 architecture (optional add-on in elsa-extensions), clearly describe when clustering is required, and provide guidance aligned with the new convenience configuration API.

Race Conditions

Additionally, the Quartz.NET Elsa integration module contains potential race conditions during job and trigger registration when multiple pods start concurrently. Current logic performs a “check-then-act” sequence without distributed locking, for example:

  • In RegisterJobsTask.cs, checking for job existence before calling AddJob.
  • In QuartzWorkflowScheduler.cs, checking for trigger existence before calling ScheduleJob.

In clustered or multi-pod deployments, this can result in multiple instances attempting to register the same job or trigger simultaneously, leading to ObjectAlreadyExistsException errors. This behavior was observed in production during an incident, where multiple pods attempted to register identical Quartz triggers during tenant activation. The lack of effective cluster coordination (and/or insufficient guarding against concurrent registration) directly contributed to workflow interruptions and scheduler instability.

This requirement therefore covers:

  • Introducing a clustering convenience API to reduce configuration errors.
  • Updating official Elsa documentation to clearly describe clustered Quartz.NET usage in distributed hosting.
  • Hardening the Quartz.NET integration against concurrent job/trigger registration across multiple nodes, ensuring idempotent and cluster-safe behavior even during simultaneous startup or tenant activation.

Addressing these items will significantly improve reliability, reduce operational risk in Kubernetes and other distributed environments, and align Elsa Workflows’ Quartz.NET integration with real-world clustered deployment expectations.</issue_description>

<agent_instructions>Ignore the part about documentation - we'll work on that separately.</agent_instructions>

Comments on the Issue (you are @copilot in this section)


Copilot AI and others added 2 commits December 29, 2025 08:34
…ation

Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Enhance and document Quartz.NET clustering support" to "Harden Quartz.NET integration for clustered deployments" Dec 29, 2025
Copilot AI requested a review from sfmskywalker December 29, 2025 08:41
@sfmskywalker sfmskywalker marked this pull request as ready for review December 29, 2025 08:49
Copilot AI review requested due to automatic review settings December 29, 2025 08:49

Copilot AI left a comment

Pull request overview

This PR hardens the Quartz.NET integration for safe operation in clustered Kubernetes deployments by eliminating race conditions during concurrent job and trigger registration. The changes introduce a convenience method for clustering configuration and replace check-then-act patterns with idempotent try-catch error handling.

  • Added EnableClustering() convenience method to simplify cluster configuration
  • Fixed race conditions in job registration by handling ObjectAlreadyExistsException
  • Fixed race conditions in trigger scheduling by handling ObjectAlreadyExistsException

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File: Description

  • src/modules/scheduling/Elsa.Scheduling.Quartz/Features/QuartzFeature.cs: Adds EnableClustering() method to configure scheduler ID and name for cluster coordination
  • src/modules/scheduling/Elsa.Scheduling.Quartz/Tasks/RegisterJobsTask.cs: Replaces check-then-act with try-catch pattern to handle concurrent job registration safely
  • src/modules/scheduling/Elsa.Scheduling.Quartz/Services/QuartzWorkflowScheduler.cs: Replaces check-then-act with try-catch pattern to handle concurrent trigger scheduling safely

…ew feedback

Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
…stering param to SQLite

Co-authored-by: sfmskywalker <938393+sfmskywalker@users.noreply.github.com>
@sfmskywalker sfmskywalker changed the base branch from main to release/3.6.0 December 29, 2025 19:13
@j03y-nxxbz j03y-nxxbz left a comment

Looking good!

@sfmskywalker sfmskywalker merged commit fd3ae2f into release/3.6.0 Dec 30, 2025
7 checks passed
@sfmskywalker sfmskywalker deleted the copilot/harden-quartz-net-clustering branch December 30, 2025 08:12
Development

Successfully merging this pull request may close these issues.

Harden and Document Quartz.NET Clustering Support

3 participants