Commit Graph

217 Commits

Author SHA1 Message Date
f6bf6c6ec4 fix!: use names not IDs for agent SSH key seed (#17258)
Changes the SSH host key seeding to use the owner username, workspace name, and agent name. This prevents SSH from complaining about a mismatched host key if you use Coder Desktop to connect, and delete and recreate your workspace with the same name. Previously this would generate a different key because the workspace ID changed.

We also include the owner's username in anticipation of using Coder Desktop to access shared workspaces (or as a superuser) down the road, so that workspaces with the same name owned by different users will not have the same key.

This change is **BREAKING** in a limited sense that early access users of Coder Desktop will see their SSH clients complain about host keys changing the first time each workspace is rebuilt with this code. It can be resolved by clearing your `.ssh/known_hosts` file of the Coder workspaces you access this way.
2025-04-04 12:51:46 +04:00
b61f0ab958 fix(agent): ensure SSH server shutdown with process groups (#17227)
Fix hanging workspace shutdowns caused by orphaned SSH child processes.
Key changes:

- Create process groups for non-PTY SSH sessions
- Send SIGHUP to entire process group for proper termination
- Add 5-second timeout to prevent indefinite blocking

Fixes #17108
2025-04-03 16:01:43 +03:00
7d4b3c8634 feat(agent): add devcontainer autostart support (#17076)
This change adds support for devcontainer autostart in workspaces. The
preconditions for utilizing this feature are:

1. The `coder_devcontainer` resource must be defined in Terraform
2. By the time the startup scripts have completed,
	- The `@devcontainers/cli` tool must be installed
	- The given workspace folder must contain a devcontainer configuration

Example Terraform:

```tf
resource "coder_devcontainer" "coder" {
  agent_id         = coder_agent.main.id
  workspace_folder = "/home/coder/coder"
  config_path      = ".devcontainer/devcontainer.json" # (optional)
}
```

Closes #16423
2025-03-27 12:31:30 +02:00
1bbbae8d57 chore: migrate to github.com/coder/clistat (#17107)
Migrate from in-tree `clistat` package to
https://github.com/coder/clistat.
2025-03-26 10:36:53 +00:00
17ddee05e5 chore: update golang to 1.24.1 (#17035)
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
  - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <claude@anthropic.com>
2025-03-26 01:56:39 -05:00
765e7058e8 chore: use container memory if containerised for oom notifications (#17062)
Currently we query only the underlying host's memory usage for our
memory resource monitor. This PR changes that to check if the workspace
is in a container, and if so it queries the container's memory usage,
falling back to the host's memory usage if not.
2025-03-24 11:14:21 +00:00
dfcd93b26e feat: enable agent connection reports by default, remove flag (#16778)
This change enables agent connection reports by default and removes the
experimental flag for enabling them.

Updates #15139
2025-03-03 18:37:28 +02:00
04c33968cf refactor: replace golang.org/x/exp/slices with slices (#16772)
The experimental functions in `golang.org/x/exp/slices` are now
available in the standard library since Go 1.21.

Reference: https://go.dev/doc/go1.21#slices

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2025-03-04 00:46:49 +11:00
d0e2060692 feat(agent): add second SSH listener on port 22 (#16627)
Some checks are pending
ci / changes (push) Waiting to run
ci / lint (push) Blocked by required conditions
ci / gen (push) Waiting to run
ci / fmt (push) Blocked by required conditions
ci / test-go (macos-latest) (push) Blocked by required conditions
ci / test-go (ubuntu-latest) (push) Blocked by required conditions
ci / test-go (windows-2022) (push) Blocked by required conditions
ci / test-cli (macos-latest) (push) Blocked by required conditions
ci / test-cli (windows-2022) (push) Blocked by required conditions
ci / test-go-pg (ubuntu-latest) (push) Blocked by required conditions
ci / test-go-pg-16 (push) Blocked by required conditions
ci / test-go-race (push) Blocked by required conditions
ci / test-go-race-pg (push) Blocked by required conditions
ci / test-go-tailnet-integration (push) Blocked by required conditions
ci / test-js (push) Blocked by required conditions
ci / test-e2e (push) Blocked by required conditions
ci / test-e2e-premium (push) Blocked by required conditions
ci / chromatic (push) Blocked by required conditions
ci / offlinedocs (push) Blocked by required conditions
ci / required (push) Blocked by required conditions
ci / build-dylib (push) Blocked by required conditions
ci / build (push) Blocked by required conditions
ci / deploy (push) Blocked by required conditions
ci / deploy-wsproxies (push) Blocked by required conditions
ci / sqlc-vet (push) Blocked by required conditions
ci / notify-slack-on-failure (push) Blocked by required conditions
OpenSSF Scorecard / Scorecard analysis (push) Waiting to run
Fixes: https://github.com/coder/internal/issues/377

Added an additional SSH listener on port 22, so the agent now listens on both, port one and port 22.

---
Change-Id: Ifd986b260f8ac317e37d65111cd4e0bd1dc38af8
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-03-03 04:47:42 +01:00
ec44f06f5c feat(cli): allow SSH command to connect to running container (#16726)
Fixes https://github.com/coder/coder/issues/16709 and
https://github.com/coder/coder/issues/16420

Adds the capability to`coder ssh` into a running container if `CODER_AGENT_DEVCONTAINERS_ENABLE=true`.

Notes:
* SFTP is currently not supported
* Haven't tested X11 container forwarding
* Haven't tested agent forwarding
2025-02-28 09:38:45 +00:00
4ba5a8a2ba feat(agent): add connection reporting for SSH and reconnecting PTY (#16652)
Updates #15139
2025-02-27 10:45:45 +00:00
172e52317c feat(agent): wire up agentssh server to allow exec into container (#16638)
Builds on top of https://github.com/coder/coder/pull/16623/ and wires up
the ReconnectingPTY server. This does nothing to wire up the web
terminal yet but the added test demonstrates the functionality working.

Other changes:
* Refactors and moves the `SystemEnvInfo` interface to the
`agent/usershell` package to address follow-up from
https://github.com/coder/coder/pull/16623#discussion_r1967580249
* Marks `usershellinfo.Get` as deprecated. Consumers should use the
`EnvInfoer` interface instead.

---------

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Co-authored-by: Danny Kopping <danny@coder.com>
2025-02-26 09:03:27 +00:00
660746462e fix(agent/agentssh): use deterministic host key for SSH server (#16626)
Fixes: https://github.com/coder/coder/issues/16490

The Agent's SSH server now initially generates fixed host keys and, once it receives its manifest, generates and replaces that host key with the one derived from the workspace ID, ensuring consistency across agent restarts. This prevents SSH warnings and host key verification errors when connecting to workspaces through Coder Desktop.

While deterministic keys might seem insecure, the underlying Wireguard tunnel already provides encryption and anti-spoofing protection at the network layer, making this approach acceptable for our use case.

---
Change-Id: I8c7e3070324e5d558374fd6891eea9d48660e1e9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-02-21 14:58:41 +01:00
b07b33ec9d feat: add agentapi endpoint to report connections for audit (#16507)
This change adds a new `ReportConnection` endpoint to the `agentapi`.

The protocol version was bumped previously, so it has been omitted here.

This allows the agent to report connection events, for example when the
user connects to the workspace via SSH or VS Code.

Updates #15139
2025-02-20 14:52:01 +02:00
4edd77bc82 chore(agent/agentssh): extract CreateCommandDeps (#16603)
Extracts environment-level dependencies of
`agentssh.Server.CreateCommand()` to an interface to allow alternative
implementations to be passed in.
2025-02-19 09:03:59 +00:00
bc609d0056 feat: integrate agentAPI with resources monitoring logic (#16438)
As part of the new resources monitoring logic - more specifically for
OOM & OOD Notifications , we need to update the AgentAPI , and the
agents logic.

This PR aims to do it, and more specifically :  
We are updating the AgentAPI & TailnetAPI to version 24 to add two new
methods in the AgentAPI :
- One method to fetch the resources monitoring configuration
- One method to push the datapoints for the resources monitoring.

Also, this PR adds a new logic on the agent side, with a routine running
and ticking - fetching the resources usage each time , but also storing
it in a FIFO like queue.

Finally, this PR fixes a problem we had with RBAC logic on the resources
monitoring model, applying the same logic than we have for similar
entities.
2025-02-14 10:28:15 +01:00
31b1ff7d3b feat(agent): add container list handler (#16346)
Fixes https://github.com/coder/coder/issues/16268

- Adds `/api/v2/workspaceagents/:id/containers` coderd endpoint that allows listing containers
visible to the agent. Optional filtering by labels is supported.
- Adds go tools to the `coder-dylib` CI step so we can generate mocks if needed
2025-02-10 11:29:30 +00:00
ce573b9faa fix: add agent exec abstraction (#15717) 2024-12-04 23:30:25 +02:00
1f238fed59 feat: integrate new agentexec pkg (#15609)
- Integrates the `agentexec` pkg into the agent and removes the
legacy system of iterating over the process tree. It adds some linting
rules to hopefully catch future improper uses of `exec.Command` in the package.
2024-11-27 20:12:15 +02:00
103824f726 fix: fix panic while tearing down reconnecting PTY (#15615)
fixes https://github.com/coder/internal/issues/221

Fixes an issue where two goroutines were sharing the `err` variable, leading to a data race where we'd fail to process the error and then nil-pointer panic.

I ended up refactoring reconnecting PTY stuff into the `reconnectingpty` package, instead of having it on the agent.  That `createTailnet` routine had waaay too many deeply nested goroutines, which is I'm sure a big contributor to the bug appearing in the first place.
2024-11-22 09:46:25 +04:00
40802958e9 fix: use explicit api versions for agent and tailnet (#15508)
Bumps the Tailnet and Agent API version 2.3, and creates some extra controls and machinery around these versions.

What happened is that we accidentally shipped two new API features without bumping the version.  `ScriptCompleted` on the Agent API in Coder v2.16 and `RefreshResumeToken` on the Tailnet API in Coder v2.15.

Since we can't easily retroactively bump the versions, we'll roll these changes into API version 2.3 along with the new WorkspaceUpdates RPC, which hasn't been released yet.  That means there is some ambiguity in Coder v2.15-v2.17 about exactly what methods are supported on the Tailnet and Agent APIs.  This isn't great, but hasn't caused us major issues because 

1. RefreshResumeToken is considered optional, and clients just log and move on if the RPC isn't supported. 
2. Agents basically never get started talking to a Coderd that is older than they are, since the agent binary is normally downloaded from Coderd at workspace start.

Still it's good to get things squared away in terms of versions for SDK users and possible edge cases around client and server versions.

To mitigate against this thing happening again, this PR also:

1. adds a CODEOWNERS for the API proto packages, so I'll review changes
2. defines interface types for different API versions, and has the agent explicitly use a specific version.  That way, if you add a new method, and try to use it in the agent without thinking explicitly about versions, it won't compile.

With the protocol controllers stuff, we've sort of already abstracted the Tailnet API such that the interface type strategy won't work, but I'll work on getting the Controller to be version aware, such that it can check the API version it's getting against the controllers it has -- in a later PR.
2024-11-15 11:16:28 +04:00
886dcbec84 chore: refactor coordination (#15343)
Refactors the way clients of the Tailnet API (clients of the API, which include both workspace "agents" and "clients") interact with the API.  Introduces the idea of abstract "controllers" for each of the RPCs in the API, and implements a Coordination controller by refactoring from `workspacesdk`.

chore re: #14729
2024-11-05 13:50:10 +04:00
c5a4095610 fix: include custom agent headers in tailnet to support DERP connections (#15145)
Fixes #15131.
2024-10-21 20:59:21 +11:00
7da231bc92 fix: fix error handling to prevent spam in proc prio management (#15071) 2024-10-15 02:17:10 +00:00
8785a51b09 feat: include Coder service prefix on agents (#14944)
fixes #14715

Configures agents to use an address both in the Tailscale service prefix and the new Coder service prefix. Also modifies the Coordinator auth to allow the new prefix.

Updates `coder/tailscale` to include https://github.com/coder/tailscale/pull/62 which fixes a bug around forwarding TCP connections to localhost.  This functionality is tested in the modifications to `TestAgent_Dial`.
2024-10-04 10:16:33 +04:00
7d9f5ab81d chore: add Coder service prefix to tailnet (#14943)
re: #14715

This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).

Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
2024-10-04 10:04:10 +04:00
ae522c558d feat: add agent timings (#14713)
* feat: begin impl of agent script timings

* feat: add job_id and display_name to script timings

* fix: increment migration number

* fix: rename migrations from 251 to 254

* test: get tests compiling

* fix: appease the linter

* fix: get tests passing again

* fix: drop column from correct table

* test: add fixture for agent script timings

* fix: typo

* fix: use job id used in provisioner job timings

* fix: increment migration number

* test: behaviour of script runner

* test: rewrite test

* test: does exit 1 script break things?

* test: rewrite test again

* fix: revert change

Not sure how this came to be, I do not recall manually changing
these files.

* fix: let code breathe

* fix: wrap errors

* fix: justify nolint

* fix: swap require.Equal argument order

* fix: add mutex operations

* feat: add 'ran_on_start' and 'blocked_login' fields

* fix: update testdata fixture

* fix: refer to agent_id instead of job_id in timings

* fix: JobID -> AgentID in dbauthz_test

* fix: add 'id' to scripts, make timing refer to script id

* fix: fix broken tests and convert bug

* fix: update testdata fixtures

* fix: update testdata fixtures again

* feat: capture stage and if script timed out

* fix: update migration number

* test: add test for script api

* fix: fake db query

* fix: use UTC time

* fix: ensure r.scriptComplete is not nil

* fix: move err check to right after call

* fix: uppercase sql

* fix: use dbtime.Now()

* fix: debug log on r.scriptCompleted being nil

* fix: ensure correct rbac permissions

* chore: remove DisplayName

* fix: get tests passing

* fix: remove space in sql up

* docs: document ExecuteOption

* fix: drop 'RETURNING' from sql

* chore: remove 'display_name' from timing table

* fix: testdata fixture

* fix: put r.scriptCompleted call in goroutine

* fix: track goroutine for test + use separate context for reporting

* fix: appease linter, handle trackCommandGoroutine error

* fix: resolve race condition

* feat: replace timed_out column with status column

* test: update testdata fixture

* fix: apply suggestions from review

* revert: linter changes
2024-09-24 10:51:49 +01:00
2df9a3e554 fix: fix tailnet remoteCoordination to wait for server (#14666)
Fixes #12560

When gracefully disconnecting from the coordinator, we would send the Disconnect message and then close the dRPC stream.  However, closing the dRPC stream can cause the server not to process the Disconnect message, since we use the stream context in a `select` while sending it to the coordinator.

This is a product bug uncovered by the flake, and probably results in us failing graceful disconnect some minority of the time.

Instead, the `remoteCoordination` (and `inMemoryCoordination` for consistency) should send the Disconnect message and then wait for the coordinator to hang up (on some graceful disconnect timer, in the form of a context).
2024-09-16 09:24:30 +04:00
bfdc29f466 fix: suppress benign errors when listing processes (#14660) 2024-09-12 23:00:04 +01:00
9f4972901c fix: avoid logging no such process errors for process priority (#14655) 2024-09-12 18:00:42 +00:00
fb3523b37f chore: remove legacy AgentIP address (#14640)
Removes the support for the Agent's "legacy IP" which was a hardcoded IP address all agents used to use, before we introduced "single tailnet". Single tailnet went GA in 2.7.0.
2024-09-12 07:40:19 +04:00
c8580a415a feat: expose current agent connections by type via prometheus (#14612) 2024-09-11 14:13:30 +10:00
839918c5e7 chore(docs): document agent api debug endpoints (#14454)
* chore(docs): add agent api debug docs

* chore(docs): add sections to agent api readme

* chore(docs): link debug manifest to agentsdk.Manifest schema

* chore(docs): add high level overview of agent api debug docs

* chore(docs): link to agent api docs from reference

* chore(docs): fix invalid paths

* chore(docs): use env variable for coder agent debug address
2024-08-28 09:47:14 +01:00
8c0565177e chore(agent): remove err=<nil> log for batch update metadata complete (#14179)
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
2024-08-07 11:31:47 +00:00
e96652ebbc feat: block file transfers for security (#13501) 2024-06-10 12:12:23 +00:00
b248f125e1 chore: rename notification banners to announcement banners (#13419) 2024-05-31 10:59:28 -06:00
d8e0be6ee6 feat: add support for multiple banners (#13081) 2024-05-08 15:40:43 -06:00
d51c6912a7 fix: make handleManifest always signal dependents (#13141)
Fixes #13139

Using a bare channel to signal dependent goroutines means that we can only signal success, not failure, which leads to deadlock if we fail in a way that doesn't cause the whole `apiConnRoutineManager` to tear down routines.

Instead, we use a new object called a `checkpoint` that signals success or failure, so that dependent routines get unblocked if the routine they depend on fails.
2024-05-06 14:47:41 +04:00
2efb46a10e chore: remove superfluous context.Canceled handling (#13140)
Removes a check for `context.Canceled` inside the `handleManifest` routine.  This checking is handled in the `apiConnRoutineManager`, so checking inside the handler is redundant.
2024-05-06 14:33:16 +04:00
99dda4a43a fix(agent): keep track of lastReportIndex between invocations of reportLifecycle() (#13075) 2024-04-25 16:54:51 +01:00
426e9f2b96 feat: support adjusting child proc oom scores (#12655) 2024-04-03 09:42:03 -05:00
4d5a7b2d56 chore(codersdk): move all tailscale imports out of codersdk (#12735)
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
2024-03-26 12:44:31 -05:00
b0c4e7504c feat(support): add client magicsock and agent prometheus metrics to support bundle (#12604)
* feat(codersdk): add ability to fetch prometheus metrics directly from agent
* feat(support): add client magicsock and agent prometheus metrics to support bundle
* refactor(support): simplify AgentInfo control flow

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2024-03-15 15:33:49 +00:00
653ddccd8e fix(agent): remove unused token debug handler (#12602) 2024-03-15 09:43:36 +00:00
63696d762f feat(codersdk): add debug handlers for logs, manifest, and token to agent (#12593)
* feat(codersdk): add debug handlers for logs, manifest, and token to agent

* add more logging

* use io.LimitReader instead of seeking
2024-03-14 15:36:12 +00:00
3b406878e0 feat(agent): expose HTTP debug server over tailnet API (#12582) 2024-03-14 10:02:01 +00:00
b0afffbafb feat: use v2 API for agent metadata updates (#12281)
Switches the agent to report metadata over the v2 API.

Fixes #10534
2024-02-26 09:50:19 +04:00
aa7a9f5cc4 feat: use v2 API for agent lifecycle updates (#12278)
Agent uses the v2 API to post lifecycle updates.

Part of #10534
2024-02-23 15:24:28 +04:00
4cc132cea0 feat: switch agent to use v2 API for sending logs (#12068)
Changes the agent to use the new v2 API for sending logs, via the logSender component.

We keep the PatchLogs function around, but deprecate it so that we can test the v1 endpoint.
2024-02-23 11:27:15 +04:00
af3fdc68c3 chore: refactor agent routines that use the v2 API (#12223)
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.

This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
2024-02-23 11:04:23 +04:00