Commit Graph

552 Commits

Author SHA1 Message Date
1e1e6f3bd1 fix: skip TestReinitializeAgent until we can adapt it for windows (#17968)
relates to https://github.com/coder/internal/issues/642

I've reached a timebox trying to get a script for windows to work, so
I'm skipping it for now.
2025-05-22 08:48:40 +02:00
53e8e9c7cd fix: reduce cost of prebuild failure (#17697)
Relates to https://github.com/coder/coder/issues/17432

### Part 1:

Notes:
- `GetPresetsAtFailureLimit` SQL query is added, which is similar to
`GetPresetsBackoff`, they use same CTEs: `filtered_builds`,
`time_sorted_builds`, but they are still different.

- Query is executed on every loop iteration. We can consider marking
specific preset as permanently failed as an optimization to avoid
executing query on every loop iteration. But I decided don't do it for
now.

- By default `FailureHardLimit` is set to 3.

- `FailureHardLimit` is configurable. Setting it to zero - means that
hard limit is disabled.

### Part 2

Notes:
- `PrebuildFailureLimitReached` notification is added.
- Notification is sent to template admins.
- Notification is sent only the first time, when hard limit is reached.
But it will `log.Warn` on every loop iteration.
- I introduced this enum:
```sql
CREATE TYPE prebuild_status AS ENUM (
  'normal',           -- Prebuilds are working as expected; this is the default, healthy state.
  'hard_limited',     -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore.
  'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried.
);
```
`validation_failed` not used in this PR, but I think it will be used in
next one, so I wanted to save us an extra migration.

- Notification looks like this:
<img width="472" alt="image"
src="https://github.com/user-attachments/assets/e10efea0-1790-4e7f-a65c-f94c40fced27"
/>

### Latest notification views:
<img width="463" alt="image"
src="https://github.com/user-attachments/assets/11310c58-68d1-4075-a497-f76d854633fe"
/>
<img width="725" alt="image"
src="https://github.com/user-attachments/assets/6bbfe21a-91ac-47c3-a9d1-21807bb0c53a"
/>
2025-05-21 15:16:38 -04:00
a123900fe8 chore: remove coder/preview dependency from codersdk (#17939) 2025-05-20 10:45:12 -05:00
e76d58f2b6 chore: disable parameter validatation for dynamic params for all transitions (#17926)
Dynamic params skip parameter validation in coder/coder.
This is because conditional parameters cannot be validated 
with the static parameters in the database.
2025-05-20 10:09:53 -05:00
f36fb67f57 chore: use static params when dynamic param metadata is missing (#17836)
Existing template versions do not have the metadata (modules + plan) in
the db. So revert to using static parameter information from the
original template import.

This data will still be served over the websocket.
2025-05-16 11:47:59 -05:00
2aa8cbebd7 fix: exclude deleted templates from metrics collection (#17839)
Also add some clarification about the lack of database constraints for
soft template deletion.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
2025-05-15 13:33:58 +02:00
789c4beba7 chore: add dynamic parameter error if missing metadata from provisioner (#17809) 2025-05-14 12:21:36 -05:00
6e967780c9 feat: track resource replacements when claiming a prebuilt workspace (#17571)
Closes https://github.com/coder/internal/issues/369

We can't know whether a replacement (i.e. drift of terraform state
leading to a resource needing to be deleted/recreated) will take place
apriori; we can only detect it at `plan` time, because the provider
decides whether a resource must be replaced and it cannot be inferred
through static analysis of the template.

**This is likely to be the most common gotcha with using prebuilds,
since it requires a slight template modification to use prebuilds
effectively**, so let's head this off before it's an issue for
customers.

Drift details will now be logged in the workspace build logs:


![image](https://github.com/user-attachments/assets/da1988b6-2cbe-4a79-a3c5-ea29891f3d6f)

Plus a notification will be sent to template admins when this situation
arises:


![image](https://github.com/user-attachments/assets/39d555b1-a262-4a3e-b529-03b9f23bf66a)

A new metric - `coderd_prebuilt_workspaces_resource_replacements_total`
- will also increment each time a workspace encounters replacements.

We only track _that_ a resource replacement occurred, not how many. Just
one is enough to ruin a prebuild, but we can't know apriori which
replacement would cause this.
For example, say we have 2 replacements: a `docker_container` and a
`null_resource`; we don't know which one might
cause an issue (or indeed if either would), so we just track the
replacement.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-05-14 14:52:22 +02:00
425ee6fa55 feat: reinitialize agents when a prebuilt workspace is claimed (#17475)
This pull request allows coder workspace agents to be reinitialized when
a prebuilt workspace is claimed by a user. This facilitates the transfer
of ownership between the anonymous prebuilds system user and the new
owner of the workspace.

Only a single agent per prebuilt workspace is supported for now, but
plumbing has already been done to facilitate the seamless transition to
multi-agent support.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
2025-05-14 14:15:36 +02:00
b2a1de9e2a feat: fetch prebuilds metrics state in background (#17792)
`Collect()` is called whenever the `/metrics` endpoint is hit to
retrieve metrics.

The queries used in prebuilds metrics collection are quite heavy, and we
want to avoid having them running concurrently / too often to keep db
load down.

Here I'm moving towards a background retrieval of the state required to
set the metrics, which gets invalidated every interval.

Also introduces `coderd_prebuilt_workspaces_metrics_last_updated` which
operators can use to determine when these metrics go stale.

See https://github.com/coder/coder/pull/17789 as well.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-05-13 20:27:41 +02:00
64807e1d61 chore: apply the 4mb max limit on drpc protocol message size (#17771)
Respect the 4mb max limit on proto messages
2025-05-13 11:24:51 -05:00
37832413ba chore: resolve internal drpc package conflict (#17770)
Our internal drpc package name conflicts with the external one in usage. 
`drpc.*` == external
`drpcsdk.*` == internal
2025-05-12 10:31:38 -05:00
d5360a6da0 chore: fetch workspaces by username with organization permissions (#17707)
Closes https://github.com/coder/coder/issues/17691

`ExtractOrganizationMembersParam` will allow fetching a user with only
organization permissions. If the user belongs to 0 orgs, then the user "does not exist" 
from an org perspective. But if you are a site-wide admin, then the user does exist.
2025-05-08 14:41:17 -05:00
a646478aed fix: move pubsub publishing out of database transactions to avoid conn exhaustion (#17648)
Database transactions hold onto connections, and `pubsub.Publish` tries
to acquire a connection of its own. If the latter is called within a
transaction, this can lead to connection exhaustion.

I plan two follow-ups to this PR:

1. Make connection counts tuneable

https://github.com/coder/coder/blob/main/cli/server.go#L2360-L2376

We will then be able to write tests showing how connection exhaustion
occurs.

2. Write a linter/ruleguard to prevent `pubsub.Publish` from being
called within a transaction.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-05-05 11:54:18 +02:00
ef11d4f769 fix: fix bug with deletion of prebuilt workspaces (#17652)
Don't specify the template version for a delete transition, because the
prebuilt workspace may have been created using an older template
version.
If the template version isn't explicitly set, the builder will
automatically use the version from the last workspace build - which is
the desired behavior.
2025-05-01 17:26:30 -04:00
98e5611e16 fix: fix for prebuilds claiming and deletion (#17624)
PR contains:
- fix for claiming & deleting prebuilds with immutable params
- unit test for claiming scenario
- unit test for deletion scenario

The parameter resolver was failing when deleting/claiming prebuilds
because a value for a previously-used parameter was provided to the
resolver, but since the value was unchanged (it's coming from the
preset) it failed in the resolver. The resolver was missing a check to
see if the old value != new value; if the values match then there's no
mutation of an immutable parameter.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-05-01 08:52:23 +00:00
6936a7b5a2 fix: fix prebuild omissions (#17579)
Fixes accidental omission from https://github.com/coder/coder/pull/17527

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-04-30 14:26:30 +00:00
02b2de9ae4 refactor: skip reconciliation for some presets (#17595) 2025-04-29 07:55:37 -04:00
a78f0fc4e1 refactor: use specific error for agpl and prebuilds (#17591)
Follow-up PR to https://github.com/coder/coder/pull/17458
Addresses this discussion:
https://github.com/coder/coder/pull/17458#discussion_r2055940797
2025-04-28 16:37:41 -04:00
9167cbfe4c refactor: claim prebuilt workspace tests (#17567)
Follow-up to: https://github.com/coder/coder/pull/17458
Specifically it addresses these discussions:
- https://github.com/coder/coder/pull/17458#discussion_r2053531445
2025-04-28 12:49:23 -04:00
e0483e3136 feat: add prebuilds metrics collector (#17547)
Closes https://github.com/coder/internal/issues/509

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-04-28 12:28:56 +02:00
08ad910171 feat: add prebuilds configuration & bootstrapping (#17527)
Closes https://github.com/coder/internal/issues/508

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Cian Johnston <cian@coder.com>
2025-04-25 11:07:15 +02:00
118f12ac3a feat: implement claiming of prebuilt workspaces (#17458)
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <danny@coder.com>
Co-authored-by: Edward Angert <EdwardAngert@users.noreply.github.com>
Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com>
Co-authored-by: Ethan <39577870+ethanndickson@users.noreply.github.com>
Co-authored-by: M Atif Ali <atif@coder.com>
Co-authored-by: Aericio <16523741+Aericio@users.noreply.github.com>
Co-authored-by: M Atif Ali <me@matifali.dev>
Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
2025-04-24 09:39:38 -04:00
27bc60d1b9 feat: implement reconciliation loop (#17261)
Closes https://github.com/coder/internal/issues/510

<details>
<summary> Refactoring Summary </summary>

### 1) `CalculateActions` Function

#### Issues Before Refactoring:

- Large function (~150 lines), making it difficult to read and maintain.
- The control flow is hard to follow due to complex conditional logic.
- The `ReconciliationActions` struct was partially initialized early,
then mutated in multiple places, making the flow error-prone.

Original source:  

fe60b569ad/coderd/prebuilds/state.go (L13-L167)

#### Improvements After Refactoring:

- Simplified and broken down into smaller, focused helper methods.
- The flow of the function is now more linear and easier to understand.
- Struct initialization is cleaner, avoiding partial and incremental
mutations.

Refactored function:  

eeb0407d78/coderd/prebuilds/state.go (L67-L84)

---

### 2) `ReconciliationActions` Struct

#### Issues Before Refactoring:

- The struct mixed both actionable decisions and diagnostic state, which
blurred its purpose.
- It was unclear which fields were necessary for reconciliation logic,
and which were purely for logging/observability.

#### Improvements After Refactoring:

- Split into two clear, purpose-specific structs:
- **`ReconciliationActions`** — defines the intended reconciliation
action.
- **`ReconciliationState`** — captures runtime state and metadata,
primarily for logging and diagnostics.

Original struct:  

fe60b569ad/coderd/prebuilds/reconcile.go (L29-L41)

</details>

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Dean Sheather <dean@deansheather.com>
Co-authored-by: Spike Curtis <spike@coder.com>
Co-authored-by: Danny Kopping <danny@coder.com>
2025-04-17 09:29:29 -04:00
669e790df6 test: add unit test to excercise bug when idp sync hits deleted orgs (#17405)
Deleted organizations are still attempting to sync members. This causes
an error on inserting the member, and would likely cause issues later in
the sync process even if that member is inserted. Deleted orgs should be
skipped.
2025-04-16 09:27:35 -05:00
06d39151dc feat: extend request logs with auth & DB info (#17304)
Closes #16903
2025-04-15 13:27:23 +02:00
c06ef7c1eb chore!: remove JFrog integration (#17353)
- Removes displaying XRay scan results in the dashboard. I'm not sure
  anyone was even using this integration so it's just debt for us to
  maintain. We can open up a separate issue to get rid of the db tables
  once we know for sure that we haven't broken anyone.
2025-04-11 14:45:21 -04:00
46d4b28384 chore: add x-authz-checks debug header when running in dev mode (#16873) 2025-04-10 11:36:27 -06:00
0b58798a1a feat: remove site wide perms from creating a workspace (#17296)
Creating a workspace required `read` on site wide `user`. 
Only organization permissions should be required.
2025-04-09 14:35:43 -05:00
52d555880c chore: add custom samesite options to auth cookies (#16885)
Allows controlling `samesite` cookie settings from the deployment config
2025-04-08 14:15:14 -05:00
ce22de8d15 feat: log long-lived connections acceptance (#17219)
Closes #16904
2025-04-08 08:30:05 +00:00
0b2b643ce2 feat: persist prebuild definitions on template import (#16951)
This PR allows provisioners to recognise and report prebuild definitions
to the coder control plane. It also allows the coder control plane to
then persist these to its store.

closes https://github.com/coder/internal/issues/507

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: evgeniy-scherbina <evgeniy.shcherbina.es@gmail.com>
2025-04-07 10:35:28 +02:00
e8b7ce80de ci: re-enable revive and gosec linters (#17225)
* Reenables revive linter for test files (with an exception for the
`unused-parameter` rule)
* Reenables gosec linter for test files
2025-04-02 16:19:23 +01:00
17ddee05e5 chore: update golang to 1.24.1 (#17035)
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
  - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <claude@anthropic.com>
2025-03-26 01:56:39 -05:00
cf10d98aab fix: improve error message when deleting organization with resources (#17049)
Closes
[coder/internal#477](https://github.com/coder/internal/issues/477)

![Screenshot 2025-03-21 at 11 25
57 AM](https://github.com/user-attachments/assets/50cc03e9-395d-4fc7-8882-18cb66b1fac9)

I'm solving this issue in two parts:

1. Updated the postgres function so that it doesn't omit 0 values in the
error
2. Created a new query to fetch the number of resources associated with
an organization and using that information to provider a cleaner error
message to the frontend

> **_NOTE:_** SQL is not my strong suit, and the code was created with
the help of AI. So I'd take extra time looking over what I wrote there
2025-03-25 15:31:24 -04:00
4c33846f6d chore: add prebuilds system user (#16916)
Pre-requisite for https://github.com/coder/coder/pull/16891

Closes https://github.com/coder/internal/issues/515

This PR introduces a new concept of a "system" user.

Our data model requires that all workspaces have an owner (a `users`
relation), and prebuilds is a feature that will spin up workspaces to be
claimed later by actual users - and thus needs to own the workspaces in
the interim.

Naturally, introducing a change like this touches a few aspects around
the codebase and we've taken the approach _default hidden_ here; in
other words, queries for users will by default _exclude_ all system
users, but there is a flag to ensure they can be displayed. This keeps
the changeset relatively small.

This user has minimal permissions (it's equivalent to a `member` since
it has no roles). It will be associated with the default org in the
initial migration, and thereafter we'll need to somehow ensure its
membership aligns with templates (which are org-scoped) for which it'll
need to provision prebuilds; that's a solution we'll have in a
subsequent PR.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
2025-03-25 12:18:06 +00:00
092c129de0 chore: perform several small frontend permissions refactors (#16735) 2025-03-07 10:33:09 -07:00
522181fead feat(coderd): add new dispatch logic for coder inbox (#16764)
This PR is [resolving the dispatch part of Coder
Inbocx](https://github.com/coder/internal/issues/403).

Since the DB layer has been merged - we now want to insert notifications
into Coder Inbox in parallel of the other delivery target.

To do so, we push two messages instead of one using the `Enqueue`
method.
2025-03-05 22:43:18 +01:00
04c33968cf refactor: replace golang.org/x/exp/slices with slices (#16772)
The experimental functions in `golang.org/x/exp/slices` are now
available in the standard library since Go 1.21.

Reference: https://go.dev/doc/go1.21#slices

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2025-03-04 00:46:49 +11:00
91a4a98c27 chore: add an unassign action for roles (#16728) 2025-02-27 10:39:06 -07:00
cccdf1ecac feat: implement WorkspaceCreationBan org role (#16686)
Using negative permissions, this role prevents a user's ability to
create & delete a workspace within a given organization.

Workspaces are uniquely owned by an org and a user, so the org has to
supercede the user permission with a negative permission.

# Use case

Organizations must be able to restrict a member's ability to create a
workspace. This permission is implicitly granted (see
https://github.com/coder/coder/issues/16546#issuecomment-2655437860).

To revoke this permission, the solution chosen was to use negative
permissions in a built in role called `WorkspaceCreationBan`.

# Rational

Using negative permissions is new territory, and not ideal. However,
workspaces are in a unique position.

Workspaces have 2 owners. The organization and the user. To prevent
users from creating a workspace in another organization, an [implied
negative
permission](36d9f5ddb3/coderd/rbac/policy.rego (L172-L192))
is used. So the truth table looks like: _how to read this table
[here](36d9f5ddb3/coderd/rbac/README.md (roles))_

| Role (example)  | Site | Org  | User | Result |
|-----------------|------|------|------|--------|
| non-org-member  | \_   | N    | YN\_ | N      |
| user            | \_   | \_   | Y    | Y      |
| WorkspaceBan    | \_   | N    | Y    | Y      |
| unauthenticated | \_   | \_   | \_   | N      |


This new role, `WorkspaceCreationBan` is the same truth table condition
as if the user was not a member of the organization (when doing a
workspace create/delete). So this behavior **is not entirely new**.

<details>

<summary>How to do it without a negative permission</summary>

The alternate approach would be to remove the implied permission, and
grant it via and organization role. However this would add new behavior
that an organizational role has the ability to grant a user permissions
on their own resources?

It does not make sense for an org role to prevent user from changing
their profile information for example. So the only option is to create a
new truth table column for resources that are owned by both an
organization and a user.

| Role (example)  | Site | Org  |User+Org| User | Result |
|-----------------|------|------|--------|------|--------|
| non-org-member  | \_   | N    |  \_    | \_   | N      |
| user            | \_   | \_   |  \_    | \_   | N      |
| WorkspaceAllow  | \_   | \_   |   Y    | \_   | Y      |
| unauthenticated | \_   | \_   |  \_    | \_   | N      |

Now a user has no opinion on if they can create a workspace, which feels
a little wrong. A user should have the authority over what is theres.

There is fundamental _philosophical_ question of "Who does a workspace
belong to?". The user has some set of autonomy, yet it is the
organization that controls it's existence. A head scratcher 🤔

</details>

## Will we need more negative built in roles?

There are few resources that have shared ownership. Only
`ResourceOrganizationMember` and `ResourceGroupMember`. Since negative
permissions is intended to revoke access to a shared resource, then
**no.** **This is the only one we need**.

Classic resources like `ResourceTemplate` are entirely controlled by the
Organization permissions. And resources entirely in the user control
(like user profile) are only controlled by `User` permissions.


![Uploading Screenshot 2025-02-26 at 22.26.52.png…]()

---------

Co-authored-by: Jaayden Halko <jaayden.halko@gmail.com>
Co-authored-by: ケイラ <mckayla@hey.com>
2025-02-27 06:23:18 -05:00
95363c9041 fix(enterprise/coderd): remove useless provisioner daemon id from request (#16723)
`ServeProvisionerDaemonRequest` has had an ID field for quite a while
now.
This field is only used for telemetry purposes; the actual daemon ID is
created upon insertion in the database. There's no reason to set it, and
it's confusing to do so. Deprecating the field and removing references
to it.
2025-02-27 09:08:08 +00:00
e005e4e51d chore: merge provisioner key and provisioner permissions (#16628)
Provisioner key permissions were never any different than provisioners.
Merging them for a cleaner permission story until they are required (if
ever) to be seperate.

This removed `ResourceProvisionerKey` from RBAC and just uses the
existing `ResourceProvisioner`.
2025-02-24 13:31:11 -06:00
546a549dcf feat: enable soft delete for organizations (#16584)
- Add deleted column to organizations table
- Add trigger to check for existing workspaces, templates, groups and
members in a org before allowing the soft delete

---------

Co-authored-by: Steven Masley <stevenmasley@gmail.com>
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
2025-02-24 12:59:41 -05:00
9469b78290 fix!: enforce regex for agent names (#16641)
Underscores and double hyphens are now blocked. The regex is almost the
exact same as the `coder_app` `slug` regex, but uppercase characters are
still permitted.
2025-02-20 05:09:26 +00:00
7f061b9faf fix(coderd): add stricter authorization for provisioners endpoint (#16587)
References #16558
2025-02-17 14:34:47 +02:00
77306f3de1 feat(coderd): add filters and fix template for provisioner daemons (#16558)
This change adds provisioner daemon ID filter to the provisioner daemons
endpoint, and also implements the limiting to 50 results.

Test coverage is greatly improved and template information for jobs
associated to the daemon was also fixed.

Updates #15084
Updates #15192
Related #16532
2025-02-14 17:26:46 +02:00
d0a534e30d chore: prevent authentication of non-unique oidc subjects (#16498)
Any IdP returning an empty field here breaks the assumption of a
unique subject id. This is defined in the OIDC spec.
2025-02-10 09:31:08 -06:00
0e2ae10b47 feat: add additional patch routes for group and role idp sync (#16351) 2025-01-31 12:14:24 -07:00
6ea5c6f0ef fix: show user-auth provisioners for all organizations (#16350) 2025-01-30 14:08:27 -07:00