Fixes https://github.com/coder/coder/issues/16124
If a workspace agent crashes, it is possible for any startup scripts to
be run again. This PR makes it so that the
`GetWorkspaceAgentScriptTimingsByBuildID` query only returns the first
timing recorded per script.
* feat: begin impl of agent script timings
* feat: add job_id and display_name to script timings
* fix: increment migration number
* fix: rename migrations from 251 to 254
* test: get tests compiling
* fix: appease the linter
* fix: get tests passing again
* fix: drop column from correct table
* test: add fixture for agent script timings
* fix: typo
* fix: use job id used in provisioner job timings
* fix: increment migration number
* test: behaviour of script runner
* test: rewrite test
* test: does exit 1 script break things?
* test: rewrite test again
* fix: revert change
Not sure how this came to be, I do not recall manually changing
these files.
* fix: let code breathe
* fix: wrap errors
* fix: justify nolint
* fix: swap require.Equal argument order
* fix: add mutex operations
* feat: add 'ran_on_start' and 'blocked_login' fields
* fix: update testdata fixture
* fix: refer to agent_id instead of job_id in timings
* fix: JobID -> AgentID in dbauthz_test
* fix: add 'id' to scripts, make timing refer to script id
* fix: fix broken tests and convert bug
* fix: update testdata fixtures
* fix: update testdata fixtures again
* feat: capture stage and if script timed out
* fix: update migration number
* test: add test for script api
* fix: fake db query
* fix: use UTC time
* fix: ensure r.scriptComplete is not nil
* fix: move err check to right after call
* fix: uppercase sql
* fix: use dbtime.Now()
* fix: debug log on r.scriptCompleted being nil
* fix: ensure correct rbac permissions
* chore: remove DisplayName
* fix: get tests passing
* fix: remove space in sql up
* docs: document ExecuteOption
* fix: drop 'RETURNING' from sql
* chore: remove 'display_name' from timing table
* fix: testdata fixture
* fix: put r.scriptCompleted call in goroutine
* fix: track goroutine for test + use separate context for reporting
* fix: appease linter, handle trackCommandGoroutine error
* fix: resolve race condition
* feat: replace timed_out column with status column
* test: update testdata fixture
* fix: apply suggestions from review
* revert: linter changes
Updates the `DeleteOldWorkspaceAgentLogs` query to:
- Retain logs for the most recent build regardless of age,
- Delete logs for agents that never connected and were created before
the cutoff for deleting logs, while still retaining the logs of the most recent build.
Related to #10576
This PR introduces quartz to coderd/database/dbpurge and updates the following unit tests to make use of Quartz's functionality:
- TestPurge
- TestDeleteOldWorkspaceAgentLogs
Additionally, updates DeleteOldWorkspaceAgentLogs to replace the hard-coded interval with a parameter passed into the query. This aids in testing and brings us a step towards allowing operators to configure the cutoff interval for workspace agent logs.
* Refactors the existing httpmw tests to use dbtestutil so that we can test them against a real database if desired,
* Modifies GetWorkspaceAgentByAuthToken to return the owner and associated roles, removing the need for additional queries
* chore: rename startup logs to agent logs
This also adds a `source` property to every agent log. It
should allow us to group logs and display them more nicely in
the UI as they stream in.
* Fix migration order
* Fix naming
* Rename the frontend
* Fix tests
* Fix down migration
* Match enums for workspace agent logs
* Fix inserting log source
* Fix migration order
* Fix logs tests
* Fix psql insert
This commit reverts some of the changes in #8029 and implements an
alternative method of keeping track of when the startup script has ended
and there will be no more logs.
This is achieved by adding new agent fields for tracking when the agent
enters the "starting" and "ready"/"start_error" lifecycle states. The
timestamps simplify logic since we don't need to understand if the current
state is before or after the state we're interested in. They can also be
used to show data like how long the startup script took to execute. This
also allowed us to remove the EOF field from the logs. That
implementation was problematic: when we returned the EOF log entry in the
response, requesting logs _after_ that ID would give no logs, and the API
would thus lose track of EOF.
* feat(coderd,agent): send startup log eof at the end
* fix(coderd): fix edge case in startup log pubsub
* fix(coderd): ensure startup logs are closed on lifecycle state change (fallback)
* fix(codersdk): fix startup log channel shared memory bug
* fix(site): remove the EOF log line
* feat(api): Add agent shutdown lifecycle states
* feat(agent): Add shutdown_script support
* feat(agent): Add shutdown_script timeout
* feat(site): Support new agent lifecycle states
---
Co-authored-by: Marcin Tojek <marcin@coder.com>
* feat: Add connection_timeout and troubleshooting_url to agent
This commit adds the connection timeout and troubleshooting url fields
to coder agents.
If an initial connection cannot be established within connection timeout
seconds, then the agent status will be marked as `"timeout"`.
If a troubleshooting URL is configured in the Terraform template, it can
be presented to the user when the agent state is either
`"timeout"` or `"disconnected"`.
Fixes #4678
* fix: Add coder user to docker group on installation
This makes for a simpler setup, and reduces the likelihood
a user runs into a strange issue.
* Add wgnet
* Add ping
* Add listening
* Finish refactor to make this work
* Add interface for swapping
* Fix conncache with interface
* chore: update gvisor
* fix tailscale types
* linting
* more linting
* Add coordinator
* Add coordinator tests
* Fix coordination
* It compiles!
* Move all connection negotiation in-memory
* Migrate coordinator to use net.conn
* Add closed func
* Fix close listener func
* Make reconnecting PTY work
* Fix reconnecting PTY
* Update CI to Go 1.19
* Add CLI flags for DERP mapping
* Fix Tailnet test
* Rename ConnCoordinator to TailnetCoordinator
* Remove print statement from workspace agent test
* Refactor wsconncache to use tailnet
* Remove STUN from unit tests
* Add migrate back to dump
* chore: Upgrade to Go 1.19
This is required as part of #3505.
* Fix reconnecting PTY tests
* fix: update wireguard-go to fix devtunnel
* fix migration numbers
* linting
* Return early for status if endpoints are empty
* Update cli/server.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* Update cli/server.go
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* Fix frontend entities
* Fix agent bicopy
* Fix race condition for the last node
* Fix down migration
* Fix connection RBAC
* Fix migration numbers
* Fix forwarding TCP to a local port
* Implement ping for tailnet
* Rename to ForceHTTP
* Add external derpmapping
* Expose DERP region names to the API
* Add global option to enable Tailscale networking for web
* Mark DERP flags hidden while testing
* Update DERP map on reconnect
* Add close func to workspace agents
* Fix race condition in upstream dependency
* Fix feature columns race condition
Co-authored-by: Colin Adler <colin1adler@gmail.com>
* feat: Add anonymized telemetry to report product usage
This adds a background service to report telemetry to a Coder
server for usage data. Realtime event data will be sent
in the future, but for now usage is reported on a cron schedule.
* Fix flake and requested changes
* Add reporting options for setup
* Add reporting for workspaces
* Add resources as they are reported
* Track API key usage
* Ensure telemetry is tracked prior to exit
This enables a "kubernetes_pod" to attach multiple agents that
could be for multiple services. Each agent is required to have
a unique name, so SSH syntax is:
`coder ssh <workspace>.<agent>`
A resource can also have zero agents; they aren't required.
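For illustration, parsing the `<workspace>.<agent>` form might look like the sketch below. `ParseSSHTarget` is a hypothetical helper, not the CLI's actual code, and treating a missing agent part as "agent unspecified" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// ParseSSHTarget splits "<workspace>.<agent>" into its two parts.
// If no "." is present, agent is returned empty, which this sketch treats
// as "agent not specified".
func ParseSSHTarget(target string) (workspace, agent string, err error) {
	workspace, agent, found := strings.Cut(target, ".")
	if workspace == "" {
		return "", "", fmt.Errorf("invalid target %q: workspace name is required", target)
	}
	if found && agent == "" {
		return "", "", fmt.Errorf("invalid target %q: agent name is empty", target)
	}
	return workspace, agent, nil
}

func main() {
	for _, target := range []string{"dev.main", "dev", "dev."} {
		ws, agent, err := ParseSSHTarget(target)
		fmt.Printf("%q -> workspace=%q agent=%q err=%v\n", target, ws, agent, err)
	}
}
```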