fix: reduce cost of prebuild failure (#17697)

Relates to https://github.com/coder/coder/issues/17432

### Part 1:

Notes:
- `GetPresetsAtFailureLimit` SQL query is added, which is similar to
`GetPresetsBackoff`, they use same CTEs: `filtered_builds`,
`time_sorted_builds`, but they are still different.

- Query is executed on every loop iteration. We can consider marking
specific preset as permanently failed as an optimization to avoid
executing query on every loop iteration. But I decided don't do it for
now.

- By default `FailureHardLimit` is set to 3.

- `FailureHardLimit` is configurable. Setting it to zero - means that
hard limit is disabled.

### Part 2

Notes:
- `PrebuildFailureLimitReached` notification is added.
- Notification is sent to template admins.
- Notification is sent only the first time, when hard limit is reached.
But it will `log.Warn` on every loop iteration.
- I introduced this enum:
```sql
CREATE TYPE prebuild_status AS ENUM (
  'normal',           -- Prebuilds are working as expected; this is the default, healthy state.
  'hard_limited',     -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore.
  'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried.
);
```
`validation_failed` not used in this PR, but I think it will be used in
next one, so I wanted to save us an extra migration.

- Notification looks like this:
<img width="472" alt="image"
src="https://github.com/user-attachments/assets/e10efea0-1790-4e7f-a65c-f94c40fced27"
/>

### Latest notification views:
<img width="463" alt="image"
src="https://github.com/user-attachments/assets/11310c58-68d1-4075-a497-f76d854633fe"
/>
<img width="725" alt="image"
src="https://github.com/user-attachments/assets/6bbfe21a-91ac-47c3-a9d1-21807bb0c53a"
/>
This commit is contained in:
Yevhenii Shcherbina
2025-05-21 15:16:38 -04:00
committed by GitHub
parent e1934fe119
commit 53e8e9c7cd
32 changed files with 1160 additions and 60 deletions

View File

@ -27,6 +27,7 @@ RETURNING w.id, w.name;
SELECT
t.id AS template_id,
t.name AS template_name,
o.id AS organization_id,
o.name AS organization_name,
tv.id AS template_version_id,
tv.name AS template_version_name,
@ -34,6 +35,7 @@ SELECT
tvp.id,
tvp.name,
tvp.desired_instances AS desired_instances,
tvp.prebuild_status,
t.deleted,
t.deprecated != '' AS deprecated
FROM templates t
@ -129,6 +131,42 @@ WHERE tsb.rn <= tsb.desired_instances -- Fetch the last N builds, where N is the
AND created_at >= @lookback::timestamptz
GROUP BY tsb.template_version_id, tsb.preset_id, fc.num_failed;
-- GetPresetsAtFailureLimit groups workspace builds by preset ID.
-- Each preset is associated with exactly one template version ID.
-- For each preset, the query checks the last hard_limit builds.
-- If all of them failed, the preset is considered to have hit the hard failure limit.
-- The query returns a list of preset IDs that have reached this failure threshold.
-- Only active template versions with configured presets are considered.
-- name: GetPresetsAtFailureLimit :many
WITH filtered_builds AS (
-- Only select builds which are for prebuild creations
SELECT wlb.template_version_id, wlb.created_at, tvp.id AS preset_id, wlb.job_status, tvp.desired_instances
FROM template_version_presets tvp
INNER JOIN workspace_latest_builds wlb ON wlb.template_version_preset_id = tvp.id
INNER JOIN workspaces w ON wlb.workspace_id = w.id
INNER JOIN template_versions tv ON wlb.template_version_id = tv.id
INNER JOIN templates t ON tv.template_id = t.id AND t.active_version_id = tv.id
WHERE tvp.desired_instances IS NOT NULL -- Consider only presets that have a prebuild configuration.
AND wlb.transition = 'start'::workspace_transition
AND w.owner_id = 'c42fdf75-3097-471c-8c33-fb52454d81c0'
),
time_sorted_builds AS (
-- Group builds by preset, then sort each group by created_at.
SELECT fb.template_version_id, fb.created_at, fb.preset_id, fb.job_status, fb.desired_instances,
ROW_NUMBER() OVER (PARTITION BY fb.preset_id ORDER BY fb.created_at DESC) as rn
FROM filtered_builds fb
)
SELECT
tsb.template_version_id,
tsb.preset_id
FROM time_sorted_builds tsb
-- For each preset, check the last hard_limit builds.
-- If all of them failed, the preset is considered to have hit the hard failure limit.
WHERE tsb.rn <= @hard_limit::bigint
AND tsb.job_status = 'failed'::provisioner_job_status
GROUP BY tsb.template_version_id, tsb.preset_id
HAVING COUNT(*) = @hard_limit::bigint;
-- name: GetPrebuildMetrics :many
SELECT
t.name as template_name,