- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
- Integrates the `agentexec` pkg into the agent and removes the
legacy system of iterating over the process tree. It adds some linting
rules to hopefully catch future improper uses of `exec.Command` in the package.
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests. This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.
Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
Closes#14729
Expands the Coordination controller used by the CLI client to allow multiple tunnel destinations (agents). Our current client uses just one, but this unifies the logic so that when we add Coder VPN, 1 is just a special case of "many."
chore of #14729
Refactors the `ServerTailnet` to use `tailnet.Controller` so that we reuse logic around reconnection and handling control messages, instead of reimplementing. This unifies our "client" use of the tailscale API across CLI, coderd, and wsproxy.
Refactors the way clients of the Tailnet API (clients of the API, which include both workspace "agents" and "clients") interact with the API. Introduces the idea of abstract "controllers" for each of the RPCs in the API, and implements a Coordination controller by refactoring from `workspacesdk`.
chore re: #14729
fixes#14715
Configures agents to use an address both in the Tailscale service prefix and the new Coder service prefix. Also modifies the Coordinator auth to allow the new prefix.
Updates `coder/tailscale` to include https://github.com/coder/tailscale/pull/62 which fixes a bug around forwarding TCP connections to localhost. This functionality is tested in the modifications to `TestAgent_Dial`.
re: #14715
This PR introduces the Coder service prefix: `fd60:627a:a42b::/48` and refactors our existing code as calling the Tailscale service prefix explicitly (rather than implicitly).
Removes the unused `Addresses` agent option. All clients today assume they can compute the Agent's IP address based on its UUID, so an agent started with a custom address would break things.
* feat: begin impl of agent script timings
* feat: add job_id and display_name to script timings
* fix: increment migration number
* fix: rename migrations from 251 to 254
* test: get tests compiling
* fix: appease the linter
* fix: get tests passing again
* fix: drop column from correct table
* test: add fixture for agent script timings
* fix: typo
* fix: use job id used in provisioner job timings
* fix: increment migration number
* test: behaviour of script runner
* test: rewrite test
* test: does exit 1 script break things?
* test: rewrite test again
* fix: revert change
Not sure how this came to be, I do not recall manually changing
these files.
* fix: let code breathe
* fix: wrap errors
* fix: justify nolint
* fix: swap require.Equal argument order
* fix: add mutex operations
* feat: add 'ran_on_start' and 'blocked_login' fields
* fix: update testdata fixture
* fix: refer to agent_id instead of job_id in timings
* fix: JobID -> AgentID in dbauthz_test
* fix: add 'id' to scripts, make timing refer to script id
* fix: fix broken tests and convert bug
* fix: update testdata fixtures
* fix: update testdata fixtures again
* feat: capture stage and if script timed out
* fix: update migration number
* test: add test for script api
* fix: fake db query
* fix: use UTC time
* fix: ensure r.scriptComplete is not nil
* fix: move err check to right after call
* fix: uppercase sql
* fix: use dbtime.Now()
* fix: debug log on r.scriptCompleted being nil
* fix: ensure correct rbac permissions
* chore: remove DisplayName
* fix: get tests passing
* fix: remove space in sql up
* docs: document ExecuteOption
* fix: drop 'RETURNING' from sql
* chore: remove 'display_name' from timing table
* fix: testdata fixture
* fix: put r.scriptCompleted call in goroutine
* fix: track goroutine for test + use separate context for reporting
* fix: appease linter, handle trackCommandGoroutine error
* fix: resolve race condition
* feat: replace timed_out column with status column
* test: update testdata fixture
* fix: apply suggestions from review
* revert: linter changes
Fixes#12560
When gracefully disconnecting from the coordinator, we would send the Disconnect message and then close the dRPC stream. However, closing the dRPC stream can cause the server not to process the Disconnect message, since we use the stream context in a `select` while sending it to the coordinator.
This is a product bug uncovered by the flake, and probably results in us failing graceful disconnect some minority of the time.
Instead, the `remoteCoordination` (and `inMemoryCoordination` for consistency) should send the Disconnect message and then wait for the coordinator to hang up (on some graceful disconnect timer, in the form of a context).
Currently, importing `codersdk` just to interact with the API requires
importing tailscale, which causes builds to fail unless manually using
our fork.
Changes the agent to use the new v2 API for sending logs, via the logSender component.
We keep the PatchLogs function around, but deprecate it so that we can test the v1 endpoint.
The agent is extended with a `--script-data-dir` flag, defaulting to the
OS temp dir. This dir is used for storing `coder-script-data/bin` and
`coder-script/[script uuid]`. The former is a place for all scripts to
place executable binaries that will be available by other scripts, SSH
sessions, etc. The latter is a place for the script to store files.
Since we default to OS temp dir, files are ephemeral by default. In the
future, we may consider adding new env vars or changing the default
storage location. Workspace startup speed could potentially benefit from
scripts being able to skip steps that require downloading software. We
may also extend this with more env variables (e.g. persistent storage in
HOME).
Fixes#11131
This commit refactors where custom environment variables are set in the
workspace and decouples agent specific configs from the `agentssh.Server`.
To reproduce all functionality, `agentssh.Config` is introduced.
The custom environment variables are now configured in `agent/agent.go`
and the agent retains control of the final state. This will allow for
easier extension in the future and keep other modules decoupled.
We're failing tests on error logs like this: https://github.com/coder/coder/actions/runs/7706053882/job/21000984583
Unfortunately, the error we hit, when the underlying connection is closed, is unexported, so we can't specifically ignore it.
Part of the issue is that agent.Close() doesn't wait for these goroutines to complete before returning, so the test harness proceeds to close the connection. This looks to our product code like the network connection failing. It would be possible to fix this, but just doesn't seem worth it for the extra insurance of catching other error logs in these tests.
This one is huge, and I'm sorry.
The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.
There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998
I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.
The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.
I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.
This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.
fixes: #11294
Fixes#10799
The flake happens when we try to remote forward, but the port we've chosen is not free. In the flaked example, it's actually the SSH listener that occupies the port we try to remote forward, leading to confusing reads (c.f. the linked issue).
This fix simplies the tests considerably by using the Go ssh client, rather than shelling out to OpenSSH. This avoids using a pseudoterminal, avoids the need for starting any local OS listeners to communicate the forwarding (go SSH just returns in-process listeners), and avoids an OS listener to wire OpenSSH up to the agentConn.
With the simplied logic, we can immediately tell if a remote forward on a random port fails, so we can do this in a loop until success or timeout.
I've also simplified and fixed up the other forwarding tests. Since we set up forwarding in-process with Go ssh, we can remove a lot of the `require.Eventually` logic.