This pull request implements RFC 8707, Resource Indicators for OAuth 2.0 (https://datatracker.ietf.org/doc/html/rfc8707), to enhance the security of our OAuth 2.0 provider.
This change enables proper audience validation and binds access tokens to their intended resource, which is crucial
for preventing token misuse in multi-tenant environments or deployments with multiple resource servers.
## Key Changes:
* Resource Parameter Support: Adds support for the resource parameter in both the authorization (`/oauth2/authorize`) and token (`/oauth2/token`) endpoints, allowing clients to specify the intended resource server.
* Audience Validation: Implements server-side validation to ensure that the resource parameter provided during the token exchange matches the one from the authorization request.
* API Middleware Enforcement: Introduces a new validation step in the API authentication middleware (`coderd/httpmw/apikey.go`) to verify that the audience of the access token matches the resource server being accessed.
* Database Schema Updates:
* Adds a `resource_uri` column to the `oauth2_provider_app_codes` table to store the resource requested during authorization.
* Adds an `audience` column to the `oauth2_provider_app_tokens` table to bind the issued token to a specific audience.
* Enhanced PKCE: Includes a minor enhancement to the PKCE implementation to protect against timing attacks.
* Comprehensive Testing: Adds extensive new tests to `coderd/oauth2_test.go` to cover various RFC 8707 scenarios, including valid flows, mismatched resources, and refresh token validation.
## How it Works:
1. An OAuth2 client specifies the target resource (e.g., https://coder.example.com) using the resource parameter in the authorization request.
2. The authorization server stores this resource URI with the authorization code.
3. During the token exchange, the server validates that the client provides the same resource parameter.
4. The server issues an access token with an audience claim set to the validated resource URI.
5. When the client uses the access token to call an API endpoint, the middleware verifies that the token's audience matches the URL of the Coder deployment, rejecting any tokens intended for a different resource.
This ensures that a token issued for one Coder deployment cannot be used to access another, significantly strengthening our authentication security.
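As an illustration of the middleware check in step 5, here is a minimal Go sketch; the function and parameter names are hypothetical, not the actual `coderd/httpmw` code:

```go
package httpmw

import (
	"net/http"
	"strings"
)

// validateTokenAudience is a hypothetical helper: it rejects any token whose
// audience claim does not match the URL of this deployment (the resource
// server being accessed).
func validateTokenAudience(w http.ResponseWriter, audience, deploymentURL string) bool {
	// Normalize trailing slashes so equivalent URLs compare equal.
	if strings.TrimSuffix(audience, "/") != strings.TrimSuffix(deploymentURL, "/") {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusUnauthorized)
		_, _ = w.Write([]byte(`{"error":"invalid_token","error_description":"token audience does not match this resource"}`))
		return false
	}
	return true
}
```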
---
Change-Id: I3924cb2139e837e3ac0b0bd40a5aeb59637ebc1b
Signed-off-by: Thomas Kosiewski <tk@coder.com>
This PR provides two commands:
* `coder prebuilds pause`
* `coder prebuilds resume`
These allow suspending all prebuild activity; they are intended for use
when prebuilds are misbehaving.
## Summary
This PR implements critical MCP OAuth2 compliance features for Coder's authorization server, adding PKCE support, resource parameter handling, and OAuth2 server metadata discovery. This brings Coder's OAuth2 implementation significantly closer to production readiness for MCP (Model Context Protocol)
integrations.
## What's Added
### OAuth2 Authorization Server Metadata (RFC 8414)
- Add `/.well-known/oauth-authorization-server` endpoint for automatic client discovery
- Returns standardized metadata including supported grant types, response types, and PKCE methods
- Essential for MCP client compatibility and OAuth2 standards compliance
### PKCE Support (RFC 7636)
- Implement Proof Key for Code Exchange with S256 challenge method
- Add `code_challenge` and `code_challenge_method` parameters to authorization flow
- Add `code_verifier` validation in token exchange
- Provides enhanced security for public clients (mobile apps, CLIs)
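For reference, the S256 method (standard RFC 7636 behavior, sketched here rather than taken from this PR) derives the challenge as the unpadded base64url encoding of the verifier's SHA-256 hash, and the token-endpoint comparison should be constant-time:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// s256Challenge derives the code_challenge from a code_verifier per RFC 7636:
// BASE64URL(SHA256(verifier)), without padding.
func s256Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// verifyPKCE checks the verifier presented at the token endpoint against the
// challenge stored with the authorization code, in constant time to avoid
// leaking information through timing.
func verifyPKCE(verifier, challenge string) bool {
	return subtle.ConstantTimeCompare([]byte(s256Challenge(verifier)), []byte(challenge)) == 1
}

func main() {
	challenge := s256Challenge("example-code-verifier")
	fmt.Println(verifyPKCE("example-code-verifier", challenge)) // true
}
```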
### Resource Parameter Support (RFC 8707)
- Add `resource` parameter to authorization and token endpoints
- Store resource URI and bind tokens to specific audiences
- Critical for MCP's resource-bound token model
### Enhanced OAuth2 Error Handling
- Add OAuth2-compliant error responses with proper error codes
- Use standard error format: `{"error": "code", "error_description": "details"}`
- Improve error consistency across OAuth2 endpoints
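A minimal sketch of emitting this format from a Go handler; the helper name and package are illustrative:

```go
package oauth2provider

import (
	"encoding/json"
	"net/http"
)

// oauth2Error mirrors the standard RFC 6749 error response format.
type oauth2Error struct {
	Error            string `json:"error"`
	ErrorDescription string `json:"error_description,omitempty"`
}

// writeOAuth2Error writes an OAuth2-compliant error response with the given
// status code, error code, and human-readable description.
func writeOAuth2Error(w http.ResponseWriter, status int, code, description string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(oauth2Error{Error: code, ErrorDescription: description})
}
```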
### Authorization UI Improvements
- Fix authorization flow to use POST-based consent instead of GET redirects
- Remove dependency on referer headers for security decisions
- Improve CSRF protection with proper state parameter validation
## Why This Matters
**For MCP Integration:** MCP requires OAuth2 authorization servers to support PKCE, resource parameters, and metadata discovery. Without these features, MCP clients cannot securely authenticate with Coder.
**For Security:** PKCE prevents authorization code interception attacks, especially critical for public clients. Resource binding ensures tokens are only valid for intended services.
**For Standards Compliance:** These are widely adopted OAuth2 extensions that improve interoperability with modern OAuth2 clients.
## Database Changes
- **Migration 000343:** Adds `code_challenge`, `code_challenge_method`, `resource_uri` to `oauth2_provider_app_codes`
- **Migration 000343:** Adds `audience` field to `oauth2_provider_app_tokens` for resource binding
- **Audit Updates:** New OAuth2 fields properly tracked in audit system
- **Backward Compatibility:** All changes maintain compatibility with existing OAuth2 flows
## Test Coverage
- Comprehensive PKCE test suite in `coderd/identityprovider/pkce_test.go`
- OAuth2 metadata endpoint tests in `coderd/oauth2_metadata_test.go`
- Integration tests covering PKCE + resource parameter combinations
- Negative tests for invalid PKCE verifiers and malformed requests
## Testing Instructions
```bash
# Run the comprehensive OAuth2 test suite
./scripts/oauth2/test-mcp-oauth2.sh
```

### Manual Testing with Interactive Server

```bash
# Start Coder in development mode
./scripts/develop.sh

# In another terminal, set up test app and run interactive flow
eval $(./scripts/oauth2/setup-test-app.sh)
./scripts/oauth2/test-manual-flow.sh
# Opens browser with OAuth2 flow, handles callback automatically

# Clean up when done
./scripts/oauth2/cleanup-test-app.sh
```

### Individual Component Testing

```bash
# Test metadata endpoint
curl -s http://localhost:3000/.well-known/oauth-authorization-server | jq .

# Test PKCE generation
./scripts/oauth2/generate-pkce.sh

# Run specific test suites
go test -v ./coderd/identityprovider -run TestVerifyPKCE
go test -v ./coderd -run TestOAuth2AuthorizationServerMetadata
```
## Breaking Changes
None. All changes maintain backward compatibility with existing OAuth2 flows.
---
Change-Id: Ifbd0d9a543d545f9f56ecaa77ff2238542ff954a
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Closes #17689
This PR optimizes audit logs query performance by extracting the
count operation into a separate query and replacing the OR-based
`workspace_builds` join with conditional joins.
## Query changes
* Extracted the count operation into a separate query
* Replaced the single `workspace_builds` join (which used OR conditions)
with two conditional joins:
  * `wb_build` for workspace_build audit logs (a direct lookup)
  * `wb_workspace` for workspace create audit logs (via the workspace)
Optimized AuditLogsOffset query:
https://explain.dalibo.com/plan/4g1hbedg4a564bg8
New CountAuditLogs query:
https://explain.dalibo.com/plan/ga2fbcecb9efbce3
Fixes https://github.com/coder/internal/issues/695
Retries the initial connection to Postgres in testing for up to 3 seconds if
we see "reset by peer", which probably means that some other test process
just started the container.
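A sketch of that retry loop under the stated assumptions; the names and driver choice are illustrative, not the actual test helper:

```go
package dbtestutil

import (
	"database/sql"
	"strings"
	"time"

	_ "github.com/lib/pq" // illustrative Postgres driver choice
)

// connectWithRetry retries the initial connection for up to 3 seconds when
// the failure looks like another test process just started the container.
func connectWithRetry(dbURL string) (*sql.DB, error) {
	deadline := time.Now().Add(3 * time.Second)
	for {
		db, err := sql.Open("postgres", dbURL)
		if err == nil {
			err = db.Ping()
			if err == nil {
				return db, nil
			}
			_ = db.Close()
		}
		// Only "reset by peer" suggests a concurrently starting container;
		// give up on any other error, or once the deadline has passed.
		if !strings.Contains(err.Error(), "reset by peer") || time.Now().After(deadline) {
			return nil, err
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```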
---------
Co-authored-by: Hugo Dutka <hugo@coder.com>
## Description
Follow-up from PR https://github.com/coder/coder/pull/18333
Related to:
https://github.com/coder/coder/pull/18333#discussion_r2159300881
This changes the authorization logic to first try the normal workspace
authorization check and, only if that fails and the resource is a prebuilt
workspace, fall back to the prebuilt workspace authorization check. Since
prebuilt workspaces are a subset of workspaces, the normal workspace check
is more likely to succeed. This is a small optimization to reduce
unnecessary prebuilt authorization calls.
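A minimal sketch of the resulting check order, with a hypothetical callback standing in for the real RBAC calls:

```go
package authz

// authorizeWorkspaceDelete sketches the ordering described above: try the
// normal workspace check first, and only fall back to the prebuilt check
// when that fails and the resource is a prebuilt workspace (owner_id ==
// PREBUILD_SYSTEM_USER).
func authorizeWorkspaceDelete(authorize func(perm string) error, isPrebuilt bool) error {
	err := authorize("workspace.delete")
	if err == nil || !isPrebuilt {
		return err
	}
	// Holders of prebuilt_workspace.delete may delete only prebuilt workspaces.
	return authorize("prebuilt_workspace.delete")
}
```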
`wsbuilder` hits the file cache when running validation. This solution is imperfect, but by first sorting workspaces by their template version id, the cache hit rate should improve.
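A sketch of that sorting step, using a stand-in type for the real workspace row:

```go
package wsbuilder

import "sort"

// workspace is a stand-in for the real workspace row.
type workspace struct {
	TemplateVersionID string
}

// sortByTemplateVersion groups workspaces that share a template version so
// consecutive validations reuse the same cached template files.
func sortByTemplateVersion(workspaces []workspace) {
	sort.Slice(workspaces, func(i, j int) bool {
		return workspaces[i].TemplateVersionID < workspaces[j].TemplateVersionID
	})
}
```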
Add an endpoint to fetch AI task prompts for multiple workspace builds
at the same time. A prompt is the value of the "AI Prompt" workspace
build parameter. On main, the only way our API allows fetching workspace
build parameters is by using the `/workspacebuilds/$build_id/parameters`
endpoint, requiring a separate API call for every build.
The Tasks dashboard fetches Task workspaces in order to show them in a
list, and then needs to fetch the value of the `AI Prompt` parameter for
every task workspace (using its latest build id), requiring an
additional API call for each list item. This endpoint will allow the
dashboard to make just 2 calls to render the list: one to fetch task
workspaces, the other to fetch prompts.
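For illustration only, a hypothetical response shape for the batch endpoint; the actual route and field names may differ:

```go
package codersdk // hypothetical placement

// AITaskPromptsResponse is an illustrative response shape for the batch
// endpoint: it maps each requested workspace build ID to the value of its
// "AI Prompt" build parameter, so the dashboard needs one call for all rows.
type AITaskPromptsResponse struct {
	Prompts map[string]string `json:"prompts"`
}
```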
<img width="1512" alt="Screenshot 2025-06-20 at 11 33 11"
src="https://github.com/user-attachments/assets/92899999-e922-44c5-8325-b4b23a0d2bff"
/>
Related to https://github.com/coder/internal/issues/660.
Use the richer `previewtypes.Parameter` for `wsbuilder`. This is a prerequisite for adding dynamic parameter validation.
The richer type contains more information than the `db` parameter, so the conversion is lossless.
Alternate fix for https://github.com/coder/coder/issues/18080
Modifies wsbuilder to complete the provisioner job and mark the
workspace as deleted if it is clear that no provisioner will be able to
pick up the delete build.
This has a significant advantage of not deviating too much from the
current semantics of `POST /api/v2/workspacebuilds`.
https://github.com/coder/coder/pull/18460 ends up returning a 204 on
orphan delete due to no build being created.
Downside is that we have to duplicate some responsibilities of
provisionerdserver in wsbuilder.
There is a slight gotcha to this approach though: if you stop a
provisioner and then immediately try to orphan-delete, the job will
still be created because of the provisioner heartbeat interval. However,
you can cancel it and try again.
"Idle" is more accurate than "complete" since:
1. AgentAPI only knows if the screen is active; it has no way of knowing
if the task is complete.
2. The LLM might be done with its current prompt, but that does not mean
the task is complete either (it likely needs refinement).
The "complete" state will be reserved for future definition.
Additionally, in the case where the screen goes idle but the LLM never
reported a status update, we can get an idle icon without a message, which
looks kinda janky in the UI, so if there is no message we display the state
text instead.
Closes https://github.com/coder/internal/issues/699
## Description
This PR adds support for deleting prebuilt workspaces via the
authorization layer. It introduces special-case handling to ensure that
`prebuilt_workspace` permissions are evaluated when attempting to delete
a prebuilt workspace, falling back to the standard `workspace` resource
as needed.
Prebuilt workspaces are a subset of workspaces, identified by having
`owner_id` set to `PREBUILD_SYSTEM_USER`.
This means:
* A user with `prebuilt_workspace.delete` permission is allowed to
**delete only prebuilt workspaces**.
* A user with `workspace.delete` permission can **delete both normal and
prebuilt workspaces**.
⚠️ This implementation is scoped to **deletion operations only**. No
other operations are currently supported for the `prebuilt_workspace`
resource.
To delete a workspace, users must have the following permissions:
* `workspace.read`: to read the current workspace state
* `update`: to modify workspace metadata and related resources during
deletion (e.g., updating the `deleted` field in the database)
* `delete`: to perform the actual deletion of the workspace
## Changes
* Introduced `authorizeWorkspace()` helper to handle prebuilt workspace
authorization logic.
* Ensured both `prebuilt_workspace` and `workspace` permissions are
checked.
* Added comments to clarify the current behavior and limitations.
* Moved the `SystemUserID` constant from the `prebuilds` package to the
`database` package as `PrebuildsSystemUserID` to resolve an import cycle
(commit f24e4ab4b6).
* Updated the `ExtractOrganizationMember` middleware to include system user
members.
Closes https://github.com/coder/internal/issues/312
Depends on https://github.com/coder/terraform-provider-coder/pull/408
This PR adds support for defining an **autoscaling block** for
prebuilds, allowing the number of desired instances to scale dynamically
based on a schedule.
Example usage:
```hcl
data "coder_workspace_preset" "us-nix" {
...
prebuilds = {
instances = 0 # default to 0 instances
scheduling = {
timezone = "UTC" # a single timezone is used for simplicity
# Scale to 3 instances during the work week
schedule {
cron = "* 8-18 * * 1-5" # from 8AM–6:59PM, Mon–Fri, UTC
instances = 3 # scale to 3 instances
}
# Scale to 1 instance on Saturdays for urgent support queries
schedule {
cron = "* 8-14 * * 6" # from 8AM–2:59PM, Sat, UTC
instances = 1 # scale to 1 instance
}
}
}
}
```
### Behavior
- Multiple `schedule` blocks per `prebuilds` block are supported.
- If the current time matches any defined autoscaling schedule, the
corresponding number of instances is used.
- If no schedule matches, the **default instance count**
(`prebuilds.instances`) is used as a fallback.
### Why
This feature allows prebuild instance capacity to adapt to predictable
usage patterns, such as:
- Scaling up during business hours or high-demand periods
- Reducing capacity during off-hours to save resources
### Cron specification
The cron specification is interpreted as a **continuous time range.**
For example, the expression:
```
* 9-18 * * 1-5
```
is intended to represent a continuous range from **09:00 to 18:59**,
Monday through Friday.
However, due to minor implementation imprecision, it is currently
interpreted as a range from **08:59:00 to 18:58:59**, Monday through
Friday.
This slight discrepancy arises because the evaluation is based on
whether a specific **point in time** falls within the range, using the
`github.com/coder/coder/v2/coderd/schedule/cron` library, which performs
per-minute matching rather than strict range evaluation.
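To make the distinction concrete, here is a simplified Go stand-in for that per-minute, point-in-time matching of `* 9-18 * * 1-5` (not the library's actual code):

```go
package main

import (
	"fmt"
	"time"
)

// matchesSchedule mimics per-minute matching of "* 9-18 * * 1-5": a point in
// time matches if its weekday and hour fall inside the cron fields, rather
// than being checked against a strict continuous range.
func matchesSchedule(t time.Time) bool {
	return t.Weekday() >= time.Monday && t.Weekday() <= time.Friday &&
		t.Hour() >= 9 && t.Hour() <= 18
}

func main() {
	// Wednesday 18:59 UTC still matches, because the hour field is 18.
	fmt.Println(matchesSchedule(time.Date(2025, 6, 18, 18, 59, 0, 0, time.UTC)))
}
```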
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Deletion of data is uncommon in our database, so the introduction of sub agents
and their subsequent deletion surfaced issues with foreign key assumptions, as
can be seen in coder/internal#685. We could have addressed only the specific
case by allowing cascade deletion of stats as well as handling it in the stats
collector, but it's unclear how many more such edge cases we could run into.
In this change, we mark the rows as deleted via boolean instead, and filter them
out in all relevant queries.
Fixes coder/internal#685
The fields must be nullable because there’s a period of time between
inserting a row into the database and finishing the “plan” provisioner
job when the final value of the field is unknown.
The file cache was caching `Unauthorized` errors if a user without the
right perms opened the file first, so all future opens would fail. Now the
cache always opens with a subject that can read files, and authz is checked
on the `Acquire` per user.
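A minimal sketch of the corrected flow, with hypothetical types and locking omitted for brevity:

```go
package filecache

import "context"

// Cache is a hypothetical sketch of the fixed design; locking is omitted.
type Cache struct {
	// fetch runs as a privileged subject that can always read files, so a
	// caller's missing permissions can no longer poison the cache.
	fetch func(ctx context.Context, fileID string) ([]byte, error)
	// authorize performs the per-user authz check.
	authorize func(ctx context.Context, userID, fileID string) error
	files     map[string][]byte
}

func (c *Cache) Acquire(ctx context.Context, userID, fileID string) ([]byte, error) {
	// Authorization is evaluated for every caller, not only the first one,
	// and its failures are never stored in the cache.
	if err := c.authorize(ctx, userID, fileID); err != nil {
		return nil, err
	}
	data, ok := c.files[fileID]
	if !ok {
		var err error
		data, err = c.fetch(ctx, fileID)
		if err != nil {
			return nil, err
		}
		c.files[fileID] = data
	}
	return data, nil
}
```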
This PR implements protobuf streaming to handle large module files by:
1. **Streaming large payloads**: When module files exceed the 4MB limit,
they're streamed in chunks using a new UploadFile RPC method
2. **Database storage**: Streamed files are stored in the database and
referenced by hash for deduplication
3. **Backward compatibility**: Small module files continue using the
existing direct payload method
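A sketch of the client-side chunking under these assumptions; the stream interface stands in for the generated gRPC client, and the chunk size is illustrative:

```go
package provisionersdk

// chunkSize is illustrative; anything comfortably under the 4MB protobuf
// message limit works.
const chunkSize = 1 << 20 // 1 MiB

// uploadStream stands in for the generated client stream of the new
// UploadFile RPC.
type uploadStream interface {
	Send(chunk []byte) error
	Close() error
}

// uploadModuleFiles streams a large module file archive in chunks; the
// server reassembles them and stores the result by hash for deduplication.
func uploadModuleFiles(stream uploadStream, data []byte) error {
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		if err := stream.Send(data[off:end]); err != nil {
			return err
		}
	}
	return stream.Close()
}
```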
Adds database migrations required for the Tasks feature.
There's a slight difference between the migrations in this PR and the
RFC: this PR adds `NOT NULL` constraints to the `has_ai_task` columns.
It was an oversight on my part when I wrote the RFC - I assumed the
`DEFAULT FALSE` value would make the columns implicitly NOT NULL, but
that's not the case with Postgres. We have no use for the NULL value.
The `DEFAULT FALSE` statement ensures that the migration will pass even
when there are existing rows in the template version and workspace
builds tables, so there's no danger in adding the `NOT NULL`
constraints.
As part of an information architecture overhaul, this PR reorganizes the
About section and adds a Support section (but not content to it yet)
[preview](https://coder.com/docs/@docs-ia-about/about)
This PR is intentionally limited in scope so that we can ship meaningful
changes faster. Follow-up PRs should include:
- [ ] edit + overhaul the About page
- [ ] decide on the `start` directory
- [ ] ~screenshots page updates~ (this should happen July or later)
redirects PR: https://github.com/coder/coder.com/pull/944
---------
Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
Closes #17982.
The purpose of this PR is to expose network latency via the API used by Coder Desktop.
This PR has the tunnel ping all known agents every 5 seconds, in order to produce an instance of:
```proto
message LastPing {
// latency is the RTT of the ping to the agent.
google.protobuf.Duration latency = 1;
// did_p2p indicates whether the ping was sent P2P, or over DERP.
bool did_p2p = 2;
// preferred_derp is the human readable name of the preferred DERP region,
// or the region used for the last ping, if it was sent over DERP.
string preferred_derp = 3;
// preferred_derp_latency is the last known latency to the preferred DERP
// region. Unset if the region does not appear in the DERP map.
optional google.protobuf.Duration preferred_derp_latency = 4;
}
```
The contents of this message are stored and included in all subsequent upserts of the agent.
Note that we upsert existing agents every 5 seconds to update the `last_handshake` value.
On the desktop apps, this message will be used to produce a tooltip similar to that of the VS Code extension:
<img width="495" alt="image" src="https://github.com/user-attachments/assets/d8b65f3d-f536-4c64-9af9-35c1a42c92d2" />
(wording not final)
Unlike the VS Code extension, we omit:
- The Latency of *all* available DERP regions. It seems not ideal to send a copy of this entire map for every online agent, and it certainly doesn't make sense for it to be on the `Agent` or `LastPing` message.
If we do want to expose this info on Coder Desktop, we should consider how best to do so; maybe we want to include it on a more generic `Netcheck` message.
- The current throughput (Bytes up/down). This is out of scope of the linked issue, and is non-trivial to implement. I'm also not sure of the value given the frequency we're doing these status updates (every 5 seconds).
If we want to expose it, it'll be in a separate PR.
<img width="343" alt="image" src="https://github.com/user-attachments/assets/8447d03b-9721-4111-8ac1-332d70a1e8f1" />
Instead of using `ResourceSystem` as the resource for
`InsertWorkspaceApp`, we now use the associated workspace (if it
exists), with the action `ActionUpdate`.
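A minimal sketch of the resulting authorization call, with hypothetical names:

```go
package coderd

// authorizeInsertWorkspaceApp sketches the change: authorize against the
// app's associated workspace with ActionUpdate when one exists, and fall
// back to the system resource otherwise. Names are illustrative.
func authorizeInsertWorkspaceApp(authorize func(action, resource string) error, workspaceID string) error {
	if workspaceID != "" {
		return authorize("update", "workspace/"+workspaceID)
	}
	return authorize("update", "system")
}
```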