@ruslandoga ruslandoga commented Dec 22, 2025

  • ANL-1154 -- Periodically perform an Alerts scheduler sync, every 60 minutes.
  • ANL-1156 -- Fetch the alert definition just before executing the query, to avoid using stale data. This is done by storing and using alert_query_id instead of a %AlertQuery{} struct in the Quantum job.

@ruslandoga ruslandoga changed the title draft: sync altert jobs draft: sync alert jobs Dec 22, 2025
:ok | {:error, :not_enabled} | {:error, :below_min_cluster_size}
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
# sync the alert job for the next run
sync_alert_job(alert_id)
Contributor

It would be slightly better to return the alert job (if present) in an :ok tuple from the sync job; that would let us avoid a second query to re-fetch the alert job.

Contributor Author

@ruslandoga ruslandoga Jan 5, 2026

Which second query do you mean? The run_alert/2 below seems to operate on AlertQuery, not Quantum.Job, and the job only knows about alert_id now (since we don't really need to put %AlertQuery{} in the job anymore if we refetch the alert definition each time).

Contributor

sync_alert_job would perform one DB query on the scheduler_node, and then get_alert_query_by at line 285 would perform another, resulting in two DB queries.

An alternative is to fetch the alert_query first and then pass it to sync_alert_job, reversing the order and reducing the DB queries to one.

Contributor Author

@ruslandoga ruslandoga Jan 7, 2026

Do we actually need to call sync_alert_job, which "re-adds" the job from run_query, or is it enough to unschedule the job for alerts that no longer exist, as in 4519e14? Or maybe it can just be a no-op, with the sync then done periodically (every 60 minutes), but that has another problem: #3059 (comment)

Re-adding the job from run_query/1 feels off for cron jobs somehow, but in light of #3059 (comment) it might be the only way to reliably re-sync active jobs. One possible problem with that approach: when a rare alert gets modified to become less rare, the update won't take effect until the alert has run at least once, which might be surprising.

Contributor Author

I think I'd like to go with no syncing in run_alert/1, since it already uses the up-to-date alert query thanks to fetching it by id on each run (and is a no-op if the alert doesn't exist). I'd then try to update sync_alert_jobs to be a bit smarter (instead of a "full" delete followed by a "full" insert) to avoid the problem in #3059 (comment).
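
For illustration only, here is a minimal sketch of what a "smarter" sync could look like: diff the desired jobs against the currently scheduled ones and touch only what changed. This is not the code in this PR. The module name `AlertJobDiffSketch`, the `diff/2` function, and the representation of jobs as a map of job name to cron string are all assumptions made for the example; no Quantum calls are involved.

```elixir
defmodule AlertJobDiffSketch do
  @moduledoc false
  # Hypothetical helper, not part of Logflare or Quantum.
  # `desired` and `current` are assumed to be maps of job name => cron string.

  def diff(desired, current) when is_map(desired) and is_map(current) do
    # Jobs that are scheduled but no longer have a matching alert.
    to_remove = for {name, _schedule} <- current, not is_map_key(desired, name), do: name

    # Jobs that are new, or whose schedule differs from what is scheduled now.
    to_upsert =
      for {name, schedule} <- desired,
          Map.get(current, name) != schedule,
          into: %{},
          do: {name, schedule}

    # Jobs whose schedule is unchanged; these would be left alone entirely.
    unchanged =
      for {name, schedule} <- desired, Map.get(current, name) == schedule, do: name

    %{remove: Enum.sort(to_remove), upsert: to_upsert, unchanged: Enum.sort(unchanged)}
  end
end

desired = %{alert_1: "*/5 * * * *", alert_2: "0 0 * * *"}
current = %{alert_2: "0 0 * * *", alert_3: "0 * * * *"}

result = AlertJobDiffSketch.diff(desired, current)
IO.inspect(result)
```

In this example `:alert_3` would be removed, `:alert_1` would be added, and `:alert_2` would be left untouched. If re-adding a job does postpone its next run, as suspected in the thread, leaving unchanged jobs alone would keep the hourly sync from pushing back alerts that run less often than the sync itself.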

@ruslandoga ruslandoga marked this pull request as ready for review January 6, 2026 20:07
@ruslandoga ruslandoga changed the title draft: sync alert jobs feat: sync alert jobs Jan 6, 2026

Comment on lines 417 to 420
# ensure config allows execution
old_config = Application.get_env(:logflare, Logflare.Alerting)
Application.put_env(:logflare, Logflare.Alerting, min_cluster_size: 0, enabled: true)
on_exit(fn -> Application.put_env(:logflare, Logflare.Alerting, old_config) end)
Contributor

This should go in a common setup block.

Contributor Author

Done in 55fc122

end

test "run_alert/2 unschedules job if alert is missing", %{user: user} do
{:ok, alert} =
Contributor

This test is missing the configuration setup; if alerting is disabled in the config, it would pass without verifying the logic.

Contributor Author

Done in 55fc122

Cluster.Utils.rpc_call(node, func)

nil ->
raise "Alerting scheduler node not found"
Contributor Author

@ruslandoga ruslandoga Jan 7, 2026

Previously that function could silently be a no-op if the scheduler node wasn't found. I wonder if this raise is a good idea? It would make these cases more noticeable, if they ever happen.

@ruslandoga ruslandoga requested a review from Ziinc January 7, 2026 14:20
],
alerts_scheduler_sync: [
run_strategy: Quantum.RunStrategy.Local,
schedule: "0 * * * *",
Contributor Author

@ruslandoga ruslandoga Jan 7, 2026

I wonder if it's possible for an alert's schedule to be rarer than the alerts_scheduler_sync schedule, i.e. to run less than once every 60 minutes? Those jobs would probably never run, since the current do_sync_alert_jobs would keep re-adding them (which in Quantum doesn't seem (?) to execute the job right away, but rather reschedules it, effectively postponing it indefinitely).

Contributor

Yes, there are jobs that run every minute, or every hour.

Contributor

Hmm, that behaviour would not be good. Perhaps use separate sync jobs: one syncing once a day for hourly schedules, and one syncing once a minute for jobs that run more often than once an hour.

AlertsScheduler.delete_job(job.name)
end
AlertsScheduler.delete_job(to_job_name(alert_id))
{:error, :not_found}
Contributor Author

It's not an error though, since sync completed as intended. More like {:ok, :removed} or :ok or :removed maybe?
