Fixes https://github.com/coder/coder/issues/16268
- Adds `/api/v2/workspaceagents/:id/containers` coderd endpoint that allows listing containers
visible to the agent. Optional filtering by labels is supported.
- Adds go tools to the `coder-dylib` CI step so we can generate mocks if needed
Migrates us to `coder/websocket` v1.8.12 rather than `nhooyr/websocket` on an older version.
Works around https://github.com/coder/websocket/issues/504 by adding an explicit test for `xerrors.Is(err, io.EOF)` where we were previously getting `io.EOF` from the netConn.
fixes#14881
Our handlers for streaming logs don't read from the websocket. We don't allow the client to send us any data, but the websocket library we use requires reading from the websocket to properly handle pings and closing. Not doing so can [can cause the websocket to hang on write](https://github.com/coder/websocket/issues/405), leaking go routines which were noticed in #14881.
This fixes the issue, and in process refactors our log streaming to a encoder/decoder package which provides generic types for sending JSON over websocket.
I'd also like for us to upgrade to the latest https://github.com/coder/websocket but we should also upgrade our tailscale fork before doing so to avoid including two copies of the websocket library.
* feat: begin impl of agent script timings
* feat: add job_id and display_name to script timings
* fix: increment migration number
* fix: rename migrations from 251 to 254
* test: get tests compiling
* fix: appease the linter
* fix: get tests passing again
* fix: drop column from correct table
* test: add fixture for agent script timings
* fix: typo
* fix: use job id used in provisioner job timings
* fix: increment migration number
* test: behaviour of script runner
* test: rewrite test
* test: does exit 1 script break things?
* test: rewrite test again
* fix: revert change
Not sure how this came to be, I do not recall manually changing
these files.
* fix: let code breathe
* fix: wrap errors
* fix: justify nolint
* fix: swap require.Equal argument order
* fix: add mutex operations
* feat: add 'ran_on_start' and 'blocked_login' fields
* fix: update testdata fixture
* fix: refer to agent_id instead of job_id in timings
* fix: JobID -> AgentID in dbauthz_test
* fix: add 'id' to scripts, make timing refer to script id
* fix: fix broken tests and convert bug
* fix: update testdata fixtures
* fix: update testdata fixtures again
* feat: capture stage and if script timed out
* fix: update migration number
* test: add test for script api
* fix: fake db query
* fix: use UTC time
* fix: ensure r.scriptComplete is not nil
* fix: move err check to right after call
* fix: uppercase sql
* fix: use dbtime.Now()
* fix: debug log on r.scriptCompleted being nil
* fix: ensure correct rbac permissions
* chore: remove DisplayName
* fix: get tests passing
* fix: remove space in sql up
* docs: document ExecuteOption
* fix: drop 'RETURNING' from sql
* chore: remove 'display_name' from timing table
* fix: testdata fixture
* fix: put r.scriptCompleted call in goroutine
* fix: track goroutine for test + use separate context for reporting
* fix: appease linter, handle trackCommandGoroutine error
* fix: resolve race condition
* feat: replace timed_out column with status column
* test: update testdata fixture
* fix: apply suggestions from review
* revert: linter changes
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
In anticipation of needing the `LogSender` to run on a context that doesn't get immediately canceled when you `Close()` the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.
This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
I noticed in testing that the CLI wasn't correctly sending the disconnect message when it shuts down, and thus agents are seeing this as a "lost" peer, rather than a "disconnected" one.
What was happening is that we just used a single context for everything from the netconn to the RPCs, and when the context was canceled we failed to send the disconnect message due to canceled context.
So, this PR splits things into two contexts, with a graceful one set to last up to 1 second longer than the main one.
Adds logging to yamux when used for tailnet client connections, e.g. CLI and wsproxy. This could be useful for debugging connection issues with tailnet v2 API.
Fixes#8218
Removes `wsconncache` and related "is legacy?" functions and API calls that were used by it.
The only leftover is that Agents still use the legacy IP, so that back level clients or workspace proxies can dial them correctly.
We should eventually remove this: #11819
fixes#10533
refactors `codersdk` workspace agent dialer to use a single websocket connection to the tailnet v2 API for both coordination and DERPMap updates, rather than separate websockets (and the v1 API for DERPMaps).
fixes#10531
Adds a check for `version` on connection to the Agent API websocket endpoint. This is primarily for future-proofing, so that up-level agents get a sensible error if they connect to a back-level Coderd.
It also refactors the location of the `CurrentVersion` variables, to be part of the `proto` packages, since the versions refer to the APIs defined therein.
This one is huge, and I'm sorry.
The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.
There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
* feat: allow external services to be authable
* Refactor external auth config structure for defaults
* Add support for new config properties
* Change the name of external auth
* Move externalauth -> external-auth
* Run gen
* Fix tests
* Fix MW tests
* Fix git auth redirect
* Fix lint
* Fix name
* Allow any ID
* Fix invalid type test
* Fix e2e tests
* Fix comments
* Fix colors
* Allow accepting any type as string
* Run gen
* Fix href
* chore: rename `git_auth` to `external_auth` in our schema
We're changing Git auth to be external auth. It will support
any OAuth2 or OIDC provider.
To split up the larger change I want to contribute the schema
changes first, and I'll add the feature itself in another PR.
* Fix names
* Fix outdated view
* Rename some additional places
* Fix sort order
* Fix template versions auth route
* Fix types
* Fix dbauthz
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
* chore: rename startup logs to agent logs
This also adds a `source` property to every agent log. It
should allow us to group logs and display them nicer in
the UI as they stream in.
* Fix migration order
* Fix naming
* Rename the frontend
* Fix tests
* Fix down migration
* Match enums for workspace agent logs
* Fix inserting log source
* Fix migration order
* Fix logs tests
* Fix psql insert
This commit reverts some of the changes in #8029 and implements an
alternative method of keeping track of when the startup script has ended
and there will be no more logs.
This is achieved by adding new agent fields for tracking when the agent
enters the "starting" and "ready"/"start_error" lifecycle states. The
timestamps simplify logic since we don't need understand if the current
state is before or after the state we're interested in. They can also be
used to show data like how long the startup script took to execute. This
also allowed us to remove the EOF field from the logs as the
implementation was problematic when we returned the EOF log entry in the
response since requesting _after_ that ID would give no logs and the API
would thus lose track of EOF.
* feat(coderd,agent): send startup log eof at the end
* fix(coderd): fix edge case in startup log pubsub
* fix(coderd): ensure startup logs are closed on lifecycle state change (fallback)
* fix(codersdk): fix startup log channel shared memory bug
* fix(site): remove the EOF log line
* fix(codersdk): wait for subscription in WatchWorkspaceAgentMetadata
* fix(coderd): subscribe before sending initial metadata event
* test(coderd): add retries to TestWorkspaceAgent_Metadata to avoid flake
* feat: allow DERP headers to be set
* chore: remove custom flag
* Clone DERP header on client create
* Adjust to use interface to cast headers
---------
Co-authored-by: Kyle Carberry <kyle@carberry.com>