Relates to https://github.com/coder/coder/issues/17432
### Part 1:
Notes:
- `GetPresetsAtFailureLimit` SQL query is added, which is similar to
`GetPresetsBackoff`, they use same CTEs: `filtered_builds`,
`time_sorted_builds`, but they are still different.
- Query is executed on every loop iteration. We can consider marking
specific preset as permanently failed as an optimization to avoid
executing query on every loop iteration. But I decided don't do it for
now.
- By default `FailureHardLimit` is set to 3.
- `FailureHardLimit` is configurable. Setting it to zero - means that
hard limit is disabled.
### Part 2
Notes:
- `PrebuildFailureLimitReached` notification is added.
- Notification is sent to template admins.
- Notification is sent only the first time, when hard limit is reached.
But it will `log.Warn` on every loop iteration.
- I introduced this enum:
```sql
CREATE TYPE prebuild_status AS ENUM (
'normal', -- Prebuilds are working as expected; this is the default, healthy state.
'hard_limited', -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore.
'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried.
);
```
`validation_failed` not used in this PR, but I think it will be used in
next one, so I wanted to save us an extra migration.
- Notification looks like this:
<img width="472" alt="image"
src="https://github.com/user-attachments/assets/e10efea0-1790-4e7f-a65c-f94c40fced27"
/>
### Latest notification views:
<img width="463" alt="image"
src="https://github.com/user-attachments/assets/11310c58-68d1-4075-a497-f76d854633fe"
/>
<img width="725" alt="image"
src="https://github.com/user-attachments/assets/6bbfe21a-91ac-47c3-a9d1-21807bb0c53a"
/>
This PR refactors the CompleteJob function to use database transactions
more consistently for better atomicity guarantees. The large function
was broken down into three specialized handlers:
- completeTemplateImportJob
- completeWorkspaceBuildJob
- completeTemplateDryRunJob
Each handler now uses the Database.InTx wrapper to ensure all database
operations for a job completion are performed within a single
transaction, preventing partial updates in case of failures.
Added comprehensive tests for transaction behavior for each job type.
Fixes#17694🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
Dynamic params skip parameter validation in coder/coder.
This is because conditional parameters cannot be validated
with the static parameters in the database.
# Use workspace.OwnerUsername instead of fetching the owner
This PR optimizes the agent API by using the `workspace.OwnerUsername` field directly instead of making an additional database query to fetch the owner's username. The change removes the need to call `GetUserByID` in the manifest API and workspace agent RPC endpoints.
An issue arose when the agent token was scoped without access to user data (`api_key_scope = "no_user_data"`), causing the agent to fail to fetch the manifest due to an RBAC issue.
Change-Id: I3b6e7581134e2374b364ee059e3b18ece3d98b41
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Pass through the user input as is. The previous code only passed through
parameters that existed in the db (static params). This would omit
conditional params.
Validation is enforced by the dynamic params websocket, so validation at
this point is not required.
Closes https://github.com/coder/internal/issues/648
This change introduces a new `ParentId` field to the agent's manifest.
This will allow an agent to know if it is a child or not, as well as
knowing who the owner is.
This is part of the Dev Container Agents work
# Description
This PR adds the `worker_name` field to the provisioner jobs endpoint.
To achieve this, the following SQL query was updated:
-
`GetProvisionerJobsByOrganizationAndStatusWithQueuePositionAndProvisioner`
As a result, the `codersdk.ProvisionerJob` type, which represents the
provisioner job API response, was modified to include the new field.
**Notes:**
* As mentioned in
[comment](https://github.com/coder/coder/pull/17877#discussion_r2093218206),
the `GetProvisionerJobsByIDsWithQueuePosition` query was not changed due
to load concerns. This means that for template and template version
endpoints, `worker_id` will still be returned, but `worker_name` will
not.
* Similar to `worker_id`, the `worker_name` is only present once a job
is assigned to a provisioner daemon. For jobs in a pending state (not
yet assigned), neither `worker_id` nor `worker_name` will be returned.
---
# Affected Endpoints
- `/organizations/{organization}/provisionerjobs`
- `/organizations/{organization}/provisionerjobs/{job}`
---
# Testing
- Added new tests verifying that both `worker_id` and `worker_name` are
returned once a provisioner job reaches the **succeeded** state.
- Existing tests covering state transitions and other logic remain
unchanged, as they test different scenarios.
---
# Front-end Changes
Admin provisioner jobs dashboard:
<img width="1088" alt="Screenshot 2025-05-16 at 11 51 33"
src="https://github.com/user-attachments/assets/0e20e360-c615-4497-84b7-693777c5443e"
/>
Fixes: https://github.com/coder/coder/issues/16982
closes https://github.com/coder/internal/issues/632
`pubsubReinitSpy` used to signal that a subscription had happened before
it actually had.
This created a slight opportunity for the main goroutine to publish
before the actual subscription was listening. The published event was
then dropped, leading to a failed test.
fixes#17070
Cleans up our handling of APIKey expiration and OIDC to keep them separate concepts. For an OIDC-login APIKey, both the APIKey and OIDC link must be valid to login. If the OIDC link is expired and we have a refresh token, we will attempt to refresh.
OIDC refreshes do not have any effect on APIKey expiry.
https://github.com/coder/coder/issues/17070#issuecomment-2886183613 explains why this is the correct behavior.
Existing template versions do not have the metadata (modules + plan) in
the db. So revert to using static parameter information from the
original template import.
This data will still be served over the websocket.
Also add some clarification about the lack of database constraints for
soft template deletion.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Closes https://github.com/coder/internal/issues/369
We can't know whether a replacement (i.e. drift of terraform state
leading to a resource needing to be deleted/recreated) will take place
apriori; we can only detect it at `plan` time, because the provider
decides whether a resource must be replaced and it cannot be inferred
through static analysis of the template.
**This is likely to be the most common gotcha with using prebuilds,
since it requires a slight template modification to use prebuilds
effectively**, so let's head this off before it's an issue for
customers.
Drift details will now be logged in the workspace build logs:

Plus a notification will be sent to template admins when this situation
arises:

A new metric - `coderd_prebuilt_workspaces_resource_replacements_total`
- will also increment each time a workspace encounters replacements.
We only track _that_ a resource replacement occurred, not how many. Just
one is enough to ruin a prebuild, but we can't know apriori which
replacement would cause this.
For example, say we have 2 replacements: a `docker_container` and a
`null_resource`; we don't know which one might
cause an issue (or indeed if either would), so we just track the
replacement.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
This pull request allows coder workspace agents to be reinitialized when
a prebuilt workspace is claimed by a user. This facilitates the transfer
of ownership between the anonymous prebuilds system user and the new
owner of the workspace.
Only a single agent per prebuilt workspace is supported for now, but
plumbing has already been done to facilitate the seamless transition to
multi-agent support.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Avoids two sequential scans of massive tables (`workspace_builds`,
`provisioner_jobs`) and uses index scans instead. This new view largely
replicates our already optimized query `GetWorkspaces` to fetch the
latest build.
The original query and the new query were compared against the dogfood
database to ensure they return the exact same data in the exact same
order (minus the new `workspaces.deleted = false` filter to improve
performance even more). The performance is massively improved even
without the `workspaces.deleted = false` filter, but it was added to
improve it even more.
Note: these query times are probably inflated due to high database load
on our dogfood environment that this intends to partially resolve.
Before: 2,139ms
([explain](https://explain.dalibo.com/plan/997e4fch241b46e6))
After: 33ms
([explain](https://explain.dalibo.com/plan/c888dc223870f181))
Co-authored-by: Cian Johnston <cian@coder.com>
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
## Description
Modifies the behaviour of the "list templates" API endpoints to return
non-deprecated templates by default. Users can still query for
deprecated templates by specifying the `deprecated=true` query
parameter.
**Note:** The deprecation feature is an enterprise-level feature
## Affected Endpoints
* /api/v2/organizations/{organization}/templates
* /api/v2/templates
Fixes#17565
The changes in `coder/preview` necessitated the changes in
`codersdk/richparameters.go` & `provisioner/terraform/resources.go`.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Closes https://github.com/coder/coder/issues/17691
`ExtractOrganizationMembersParam` will allow fetching a user with only
organization permissions. If the user belongs to 0 orgs, then the user "does not exist"
from an org perspective. But if you are a site-wide admin, then the user does exist.
Closes https://github.com/coder/internal/issues/609.
As seen in the below logs, the `last_used_at` time was updating, but just to the same value that it was on creation; `dbtime.Now` was called in quick succession.
```
t.go:106: 2025-05-05 12:11:54.166 [info] coderd.workspace_usage_tracker: updated workspaces last_used_at count=1 now="2025-05-05T12:11:54.161329Z"
t.go:106: 2025-05-05 12:11:54.172 [debu] coderd: GET host=localhost:50422 path=/api/v2/workspaces/745b7ff3-47f2-4e1a-9452-85ea48ba5c46 proto=HTTP/1.1 remote_addr=127.0.0.1 start="2025-05-05T12:11:54.1669073Z" workspace_name=peaceful_faraday34 requestor_id=b2cf02ae-2181-480b-bb1f-95dc6acb6497 requestor_name=testuser requestor_email="" took=5.2105ms status_code=200 latency_ms=5 params_workspace=745b7ff3-47f2-4e1a-9452-85ea48ba5c46 request_id=7fd5ea90-af7b-4104-91c5-9ca64bc2d5e6
workspaceagentsrpc_test.go:70:
Error Trace: C:/actions-runner/coder/coder/coderd/workspaceagentsrpc_test.go:70
Error: Should be true
Test: TestWorkspaceAgentReportStats
Messages: 2025-05-05 12:11:54.161329 +0000 UTC is not after 2025-05-05 12:11:54.161329 +0000 UTC
```
If we change the initial `LastUsedAt` time to be a time in the past, ticking with a `dbtime.Now` will always update it to a later value. If it never updates, the condition will still fail.
* Refactors toolsdk.Tools to remove opaque `map[string]any` argument in
favour of typed args structs.
* Refactors toolsdk.Tools to remove opaque passing of dependencies via
`context.Context` in favour of a tool dependencies struct.
* Adds panic recovery and clean context middleware to all tools.
* Adds `GenericTool` implementation to allow keeping `toolsdk.All` with
uniform type signature while maintaining type information in handlers.
* Adds stricter checks to `patchWorkspaceAgentAppStatus` handler.
There were too many ways to configure the agentcontainers API resulting
in inconsistent behavior or features not being enabled. This refactor
introduces a control flag for enabling or disabling the containers API.
When disabled, all implementations are no-op and explicit endpoint
behaviors are defined. When enabled, concrete implementations are used
by default but can be overridden by passing options.