feat: sync alert jobs #3059
Conversation
lib/logflare/alerting.ex (outdated)
```elixir
          :ok | {:error, :not_enabled} | {:error, :below_min_cluster_size}
  def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
    # sync the alert job for the next run
    sync_alert_job(alert_id)
```
It would be slightly better to return the alert job (if present) as an `:ok` tuple from the sync job; that would allow us to avoid a 2nd query to re-fetch the alert job.
Which second query do you mean? The `run_alert/2` below seems to operate on `AlertQuery`, not `Quantum.Job`, and the job only knows about `alert_id` now (since we don't really need to put `%AlertQuery{}` in the job anymore if we re-fetch the alert definition each time).
`sync_alert_job` would perform one db query on the scheduler_node, then `get_alert_query_by` at line 285 would perform another, so that would result in 2 db queries being performed.

An alternative is to fetch the `alert_query` first, then pass it to `sync_alert_job`; reversing the order reduces the db queries to 1.
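A minimal sketch of that reversed order, assuming `sync_alert_job/1` can take the already-loaded struct (the function names are the ones used elsewhere in this diff):

```elixir
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
  # one db query: fetch the alert first, then sync using the struct
  with %AlertQuery{} = alert_query <- get_alert_query_by(id: alert_id) do
    sync_alert_job(alert_query)
    run_alert(alert_query)
  else
    nil -> {:error, :not_found}
  end
end
```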
Do we actually need to call `sync_alert_job`, which "re-adds" the job from `run_query`, or is it enough to unschedule the job for alerts that no longer exist, i.e. 4519e14? Or maybe it can just be a no-op, with sync done periodically (every 60 minutes) instead, but that has another problem: #3059 (comment)

Re-adding the job from `run_query/1` feels off for cron jobs somehow, but in light of #3059 (comment) it might be the only way to reliably re-sync active jobs. One possible problem with that approach: when a rare alert gets modified to become less rare, the update won't take effect until it runs at least once, which might be surprising.
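The unschedule-only variant would presumably be along these lines (a sketch; 4519e14's exact contents aren't shown in this thread, but the deleted-alert branch matches the diff further down):

```elixir
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
  case get_alert_query_by(id: alert_id) do
    nil ->
      # alert was deleted: drop its Quantum job rather than re-syncing
      AlertsScheduler.delete_job(to_job_name(alert_id))
      {:error, :not_found}

    %AlertQuery{} = alert_query ->
      run_alert(alert_query)
  end
end
```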
I think I'd like to go with no syncing in `run_alert/1`, since it already uses the up-to-date alert query thanks to fetching it by id on each run (and being a no-op if it doesn't exist), and try to update `sync_alert_jobs` to be a bit smarter (instead of a "full" delete followed by a "full" insert) to avoid the problem in #3059 (comment).
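A "smarter" sync could diff the desired jobs against Quantum's current ones instead of wiping and re-inserting everything, so untouched jobs keep their next-run times. A rough sketch with several assumptions: `list_alert_queries/0` and an `alert_job/1` helper that builds the `Quantum.Job`, a `cron` string field on the alert, and `AlertsScheduler.jobs/0` returning `Quantum.Job` structs with a `:name` field:

```elixir
def sync_alert_jobs do
  wanted =
    for alert <- list_alert_queries(), into: %{} do
      {to_job_name(alert.id), alert}
    end

  current =
    for job <- AlertsScheduler.jobs(), into: %{} do
      {job.name, job}
    end

  # drop jobs whose alert no longer exists
  for name <- Map.keys(current), not Map.has_key?(wanted, name) do
    AlertsScheduler.delete_job(name)
  end

  # add missing jobs; replace only those whose schedule changed, so the
  # rest keep their next-run time instead of being postponed by a re-add
  for {name, alert} <- wanted do
    new_schedule = Crontab.CronExpression.Parser.parse!(alert.cron)

    case current[name] do
      nil ->
        AlertsScheduler.add_job(alert_job(alert))

      %Quantum.Job{schedule: ^new_schedule} ->
        :ok

      _changed ->
        AlertsScheduler.delete_job(name)
        AlertsScheduler.add_job(alert_job(alert))
    end
  end

  :ok
end
```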
test/logflare/alerting_test.exs (outdated)
```elixir
      # ensure config allows execution
      old_config = Application.get_env(:logflare, Logflare.Alerting)
      Application.put_env(:logflare, Logflare.Alerting, min_cluster_size: 0, enabled: true)
      on_exit(fn -> Application.put_env(:logflare, Logflare.Alerting, old_config) end)
```
Should put this in a common setup.
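Something like the following shared setup would cover it (the helper name here is hypothetical):

```elixir
# run via `setup :setup_alerting_config` in each describe block
# that exercises alert execution
defp setup_alerting_config(_context) do
  # override config for the test and restore the original afterwards
  old_config = Application.get_env(:logflare, Logflare.Alerting)
  Application.put_env(:logflare, Logflare.Alerting, min_cluster_size: 0, enabled: true)
  on_exit(fn -> Application.put_env(:logflare, Logflare.Alerting, old_config) end)
  :ok
end
```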
Done in 55fc122
```elixir
    end

    test "run_alert/2 unschedules job if alert is missing", %{user: user} do
      {:ok, alert} =
```
This is missing the configuration setup; if config is disabled then this would pass without verifying the logic.
Done in 55fc122
```elixir
        Cluster.Utils.rpc_call(node, func)

      nil ->
        raise "Alerting scheduler node not found"
```
Previously that function could silently be a no-op if the scheduler node wasn't found; I wonder if this raise is a good idea? It would make these cases more noticeable, if they ever happen.
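If the raise turns out to be too aggressive, a logged no-op would keep the old behaviour while still making the miss visible. A sketch, assuming `scheduler_node` and `func` are bound as in the surrounding function:

```elixir
require Logger

case scheduler_node do
  nil ->
    # previously a silent no-op; a warning is non-fatal but shows up in logs
    Logger.warning("Alerting scheduler node not found, skipping alert sync")
    {:error, :no_scheduler_node}

  node ->
    Cluster.Utils.rpc_call(node, func)
end
```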
```elixir
      ],
      alerts_scheduler_sync: [
        run_strategy: Quantum.RunStrategy.Local,
        schedule: "0 * * * *",
```
I wonder if it's possible for an alert schedule to be rarer than the `alerts_scheduler_sync` schedule, i.e. less than once in 60 minutes? That would probably mean those jobs would never run, since in the current `do_sync_alert_jobs` they would keep getting re-added (which in Quantum doesn't seem (?) to execute the job right away, but kind of reschedules and effectively postpones them indefinitely).
Yes, there are jobs that run on a minutely basis, or every hour.
Hmm, that behaviour would not be good. Perhaps separate sync schedule jobs: sync once a day for hourly schedules, and once a minute for jobs that run more often than hourly.
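In Quantum's config syntax that split could look like this (the scheduler module name and the two sync functions are assumptions, mirroring the single `alerts_scheduler_sync` job above):

```elixir
config :logflare, Logflare.AlertsScheduler,
  jobs: [
    alerts_scheduler_sync_frequent: [
      run_strategy: Quantum.RunStrategy.Local,
      # every minute, for alerts scheduled more often than hourly
      schedule: "* * * * *",
      task: {Logflare.Alerting, :sync_frequent_alert_jobs, []}
    ],
    alerts_scheduler_sync_rare: [
      run_strategy: Quantum.RunStrategy.Local,
      # once a day, for hourly-or-rarer alerts
      schedule: "0 0 * * *",
      task: {Logflare.Alerting, :sync_rare_alert_jobs, []}
    ]
  ]
```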
```elixir
      AlertsScheduler.delete_job(job.name)
    end
    AlertsScheduler.delete_job(to_job_name(alert_id))
    {:error, :not_found}
```
It's not an error though, since the sync completed as intended. More like `{:ok, :removed}` or `:ok` or `:removed` maybe?
Use `alert_query_id` instead of the `%AlertQuery{}` struct in the Quantum job.