Commit Graph

22 Commits

Author SHA1 Message Date
98a5ae7f48 feat: add provisioner job hang detector (#7927) 2023-06-25 13:17:00 +00:00
8bd9f9c351 feat: unified tracing between coderd<->provisionerd (#7370) 2023-05-03 23:02:35 +00:00
6651c1632d fix: avoid terraform state concurrent access, remove global mutex (#5273) 2022-12-06 17:05:14 +00:00
45f81a7cd5 fix: prevent terraform init races (#4985) 2022-11-09 19:40:52 +00:00
795ed3dc97 provisioner: fix multi-dir installs (#4690)
In the previous implementation, tests would occasionally fail since the original install directory was deleted.
2022-10-22 20:44:05 +00:00
9bc0d06aa0 fix: Install Terraform once and only log >=500 (#4339)
Fixes #4302.
2022-10-03 15:19:02 -05:00
112eaf80d1 fix: Add logging to Terraform install (#4191)
Fixes #4129.
2022-09-24 14:55:17 -05:00
1c85799be5 fix: Update Terraform to v1.3.0 (#4180)
Contributes to #3202.
2022-09-23 15:31:26 -05:00
f1423450bd fix: Allow terraform provisions to be gracefully cancelled (#3526)
* fix: Allow terraform provisions to be gracefully cancelled

This change allows terraform commands to be gracefully cancelled on
Unix-like platforms by signaling interrupt on provision cancellation.

One implementation detail to note is that we do not necessarily kill a
running terraform command immediately even if the stream is closed. The
reason for this is to allow for graceful cancellation even in such an
event. Currently the timeout is set to 5 minutes by default.

Related: #2683

The above issue may be partially or fully fixed by this change.

* fix: Remove incorrect minimumTerraformVersion variable

* Allow init to return provision complete response
2022-08-18 17:03:55 +03:00
fb9fca8bc9 fix: Ensure terraform tests have a cache path and logger (#3161)
* fix: Ensure terraform tests have a cache path and logger

* fix: Protect against concurrent `terraform init`
2022-08-04 20:37:07 +03:00
6377f17fda chore: update terraform to 1.2.1 (#3243)
* chore: update terraform to 1.2.1

* allow terraform version equal to max
2022-07-27 17:11:38 +01:00
c6b1daabc5 feat: Download default terraform version when minor version mismatches (#1775) 2022-06-22 23:11:52 +00:00
552dad6919 Remove tfexec, allow TF_ environment vars and log them (#2264)
* Remove tfexec, allow TF_ environment vars and log them

Signed-off-by: Spike Curtis <spike@coder.com>

* fixup: commented code, long lines

Signed-off-by: Spike Curtis <spike@coder.com>

* rename executor methods to remove get

Signed-off-by: Spike Curtis <spike@coder.com>

* don't log terraform environment variables we don't know are safe

Signed-off-by: Spike Curtis <spike@coder.com>

* Disable linting of fake secret

Signed-off-by: Spike Curtis <spike@coder.com>

* drop parse support and move logger into terraform package

Signed-off-by: Spike Curtis <spike@coder.com>

* disable testpackage linter on internal package test

Signed-off-by: Spike Curtis <spike@coder.com>
2022-06-16 17:50:39 +00:00
c78f947e09 feat: Upgrade terraform version to 1.1.9 (#1745)
* upgrade terraform version to 1.1.9

* Fix docs typo

Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
2022-05-25 19:35:41 -04:00
dc46ff407b fix: ensure websocket close messages are truncated to 123 bytes (#779)
It's possible for websocket close messages to be too long, which cause
them to silently fail without a proper close message. See error below:

```
2022-03-31 17:08:34.862 [INFO]	(stdlib)	<close_notjs.go:72>	"2022/03/31 17:08:34 websocket: failed to marshal close frame: reason string max is 123 but got \"insert provisioner daemon:Cannot encode []database.ProvisionerType into oid 19098 - []database.ProvisionerType must implement Encoder or be converted to a string\" with length 161"
```
2022-04-01 18:17:45 +00:00
6612e3c9c7 feat: Add config-ssh command (#735)
* feat: Add config-ssh command

Closes #254 and #499.

* Fix Windows support
2022-03-30 17:59:54 -05:00
13cef7d07c feat: Support caching provisioner assets (#574)
* feat: Add AWS instance identity authentication

This allows zero-trust authentication for all AWS instances.

Prior to this, AWS instances could be used by passing `CODER_TOKEN`
as an environment variable to the startup script. AWS explicitly
states that secrets should not be passed in startup scripts because
it's user-readable.

* feat: Support caching provisioner assets

This caches the Terraform binary, and Terraform plugins.
Eventually, it could cache other temporary files.

* chore: fix linter

Co-authored-by: Garrett <garrett@coder.com>
2022-03-28 14:57:19 -05:00
c451f4e685 feat: Add templates to create working release (#422)
* Add templates

* Move API structs to codersdk

* Back to green tests!

* It all works, but now with tea! 🧋

* It works!

* Add cancellation to provisionerd

* Tests pass!

* Add deletion of workspaces and projects

* Fix agent lock

* Add clog

* Fix linting errors

* Remove unused CLI tests

* Rename daemon to start

* Fix leaking command

* Fix promptui test

* Update agent connection frequency

* Skip login tests on Windows

* Increase tunnel connect timeout

* Fix templater

* Lower test requirements

* Fix embed

* Disable promptui tests for Windows

* Fix write newline

* Fix PTY write newline

* Fix CloseReader

* Fix compilation on Windows

* Fix linting error

* Remove bubbletea

* Cleanup readwriter

* Use embedded templates instead of serving over API

* Move templates to examples

* Improve workspace create flow

* Fix Windows build

* Fix tests

* Fix linting errors

* Fix untar with extracting max size

* Fix newline char
2022-03-22 13:17:50 -06:00
9d2803e07a feat: Add graceful exits to provisionerd (#372)
* ci: Update DataDog GitHub branch to fallback to GITHUB_REF

This was detecting branches, but not our "main" branch before.
Hopefully this fixes it!

* Add basic Terraform Provider

* Rename post files to upload

* Add tests for resources

* Skip instance identity test

* Add tests for ensuring agent get's passed through properly

* Fix linting errors

* Add echo path

* Fix agent authentication

* fix: Convert all jobs to use a common resource and agent type

This enables a consistent API for project import and provisioned resources.

* Add "coder_workspace" data source

* feat: Remove magical parameters from being injected

This is a much cleaner abstraction. Explicitly declaring the user
parameters for each provisioner makes for significantly simpler
testing.

* feat: Add graceful exits to provisionerd

Terraform (or other provisioners) may need to cleanup state, or
cancel actions before exit. This adds the ability to gracefully
exit provisionerd.

* Fix cancel error check
2022-02-28 18:40:49 +00:00
e75bde4e31 feat: Add provisionerdaemon to coderd (#141)
* feat: Add history middleware parameters

These will be used for streaming logs, checking status,
and other operations related to workspace and project
history.

* refactor: Move all HTTP routes to top-level struct

Nesting all structs behind their respective structures
is leaky, and promotes naming conflicts between handlers.

Our HTTP routes cannot have conflicts, so neither should
function naming.

* Add provisioner daemon routes

* Add periodic updates

* Skip pubsub if short

* Return jobs with WorkspaceHistory

* Add endpoints for extracting singular history

* The full end-to-end operation works

* fix: Disable compression for websocket dRPC transport (#145)

There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing.

This is just tracking some experimentation to fix that race condition

## Run results: ##
- Run 1: peer test failure
- Run 2: peer test failure
- Run 3: `TestWorkspaceHistory/CreateHistory`  - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45
```
status code 412: The provided project history is running. Wait for it to complete importing!`
```
- Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176
```
    workspacehistory_test.go:122: 
        	Error Trace:	workspacehistory_test.go:122
        	Error:      	Condition never satisfied
        	Test:       	TestWorkspaceHistory/CreateHistory
```
- Run 5: peer failure
- Run 6: Pass  
- Run 7: Peer failure

## Open Questions: ##

### Is `dRPC` or `websocket` at fault for the data race?

It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](f6e369438f/drpcwire/error.go (L15)) - so `dRPC` has created this buffer and owns it.

From `dRPC`'s perspective, the callstack looks like this:
- [`sendPacket`](f6e369438f/drpcstream/stream.go (L253))
  - [`writeFrame`](f6e369438f/drpcwire/writer.go (L65))
    - [`AppendFrame`](f6e369438f/drpcwire/packet.go (L128))
      - with finally the data race happening here:
```go
// AppendFrame appends a marshaled form of the frame to the provided buffer.
func AppendFrame(buf []byte, fr Frame) []byte {
...
	out := buf
	out = append(out, control).   // <---------
```

This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame.

Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: f6e369438f/drpcwire/writer.go (L73)

However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](8dee580a7f/write.go (L180)), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](a1a9cfc821/flate/stateless.go (L94)), which is where get our race.

In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly).

### Why does cloning on `Read` fail?

Get a bunch of errors like:
```
2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
```

# UPDATE:

We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now:

- Run 1:  
- Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338
- Run 3:  
- Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168
- Run 5: 

* fix: Remove race condition with acquiredJobDone channel (#148)

Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83

__Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs.

__Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up.

* fix: Bump up workspace history timeout (#149)

This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32

Looking at the timing of the test:
```
    t.go:56: 2022-02-02 21:33:21.964 [DEBUG]	(terraform-provisioner)	<provision.go:139>	ran apply
    t.go:56: 2022-02-02 21:33:21.991 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.050 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.090 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.140 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.195 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    t.go:56: 2022-02-02 21:33:22.240 [DEBUG]	(provisionerd)	<provisionerd.go:162>	skipping acquire; job is already running
    workspacehistory_test.go:122: 
        	Error Trace:	workspacehistory_test.go:122
        	Error:      	Condition never satisfied
        	Test:       	TestWorkspaceHistory/CreateHistory
```

It  appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout.

Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here.

In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that.

Co-authored-by: Bryan <bryan@coder.com>
2022-02-03 20:34:50 +00:00
6a919aea79 feat: Add authentication and personal user endpoint (#29)
* feat: Add authentication and personal user endpoint

This contribution adds a lot of scaffolding for the database fake
and testability of coderd.

A new endpoint "/user" is added to return the currently authenticated
user to the requester.

* Use TestMain to catch leak instead

* Add userpassword package

* Add WIP

* Add user auth

* Fix test

* Add comments

* Fix login response

* Fix order

* Fix generated code

* Update httpapi/httpapi.go

Co-authored-by: Bryan <bryan@coder.com>

Co-authored-by: Bryan <bryan@coder.com>
2022-01-20 13:46:51 +00:00
7c260f88d1 feat: Create provisioner abstraction (#12)
* feat: Create provisioner abstraction

Creates a provisioner abstraction that takes prior art from the Terraform plugin system. It's safe to assume this code will change a lot when it becomes integrated with provisionerd.

Closes #10.

* Ignore generated files in diff view

* Check for unstaged file changes

* Install protoc-gen-go

* Use proper drpc plugin version

* Fix serve closed pipe

* Install sqlc with curl for speed

* Fix install command

* Format CI action

* Add linguist-generated and closed pipe test

* Cleanup code from comments

* Add dRPC comment

* Add Terraform installer for cross-platform

* Build provisioner tests on Linux only
2022-01-08 11:24:02 -06:00