Commit Graph

72 Commits

Author SHA1 Message Date
d2b035312e chore: fix parse typo for network telemetry (#13971) 2024-07-22 17:14:37 +00:00
e8db21c89e chore: add additional network telemetry stats & events (#13800) 2024-07-10 14:14:35 +10:00
a110d18275 chore: add DRPC tailnet & cli network telemetry (#13687) 2024-07-03 15:23:46 +10:00
ed0ca76b0b chore: do network integration tests in isolated net ns (#13117) 2024-05-03 05:42:13 +00:00
3de737fdc8 fix: start packet capture immediately on speedtest (#13128)
I initially made this change when hacking wgengine to also capture wireguard packets going into the magicsock, so that we could capture the initial wireguard handshake. 

I don't think we should ship that additional capture logic, but... it seems generally useful to capture packets from the get go on speedtest, so that you can see disco and pings before the TCP speedtest session starts.
2024-05-02 19:44:32 +04:00
e801e878ba feat: add agent acks to in-memory coordinator (#12786)
When an agent receives a node, it responds with an ACK which is relayed
to the client. After the client receives the ACK, it's allowed to begin
pinging.
2024-04-10 17:15:33 -05:00
66154f937e fix(coderd): pass block endpoints into servertailnet (#12149) 2024-03-08 05:29:54 +00:00
4e7beee102 feat: show tailnet peer diagnostics after coder ping (#12314)
Beginnings of a solution to #12297 

Doesn't cover disco or definitively display whether we successfully connected to DERP, but shows some checklist diagnostics for connecting to an agent.

For this first PR, I just added it to `coder ping` to see how we like it, but could be incorporated into `coder ssh` _et al._ after a timeout.

```
$ coder ping dogfood2
p2p connection established in 147ms
pong from dogfood2 p2p via  95.217.xxx.yyy:42631  in 147ms
pong from dogfood2 p2p via  95.217.xxx.yyy:42631  in 140ms
pong from dogfood2 p2p via  95.217.xxx.yyy:42631  in 140ms
✔ preferred DERP region 999 (Council Bluffs, Iowa)
✔ sent local data to Coder networking coodinator
✔ received remote agent data from Coder networking coordinator
    preferred DERP 10013 (Europe Fly.io (Paris))
    endpoints: 95.217.xxx.yyy:42631, 95.217.xxx.yyy:37576, 172.17.0.1:37576, 172.20.0.10:37576
✔ Wireguard handshake 11s ago
```
2024-02-27 22:04:46 +04:00
9861830e87 fix: never send local endpoints if disabled (#12138) 2024-02-20 15:51:25 +10:00
2d0b9106c0 fix: change servertailnet to register the DERP dialer before setting DERP map (#12137)
I noticed a possible race where tailnet.Conn can try to dial the embedded region before we've set our custom dialer that send the DERP in-memory.  This closes that race and adds a test case for servertailnet with no STUN and an embedded relay
2024-02-15 10:51:12 +04:00
bc14e926d8 feat: add option to speedtest to dump a pcap of network traffic (#11848) 2024-01-29 09:57:31 -06:00
5cbb76b47a fix: stop spamming DERP map updates for equivalent maps (#11792)
Fixes 2 related issues:

1. wsconncache had incorrect logic to test whether to send DERPMap updates, sending if the maps were equivalent, instead of if they were _not equivalent_.
2. configmaps used a bugged check to test equality between DERPMaps, since it contains a map and the map entries are serialized in random order. Instead, we avoid comparing the protobufs and instead depend on the existing function that compares `tailcfg.DERPMap`. This also has the effect of reducing the number of times we convert to and from protobuf.
2024-01-24 16:27:15 +04:00
5388a1b6d7 fix: use TSMP ping for reachability, not latency (#11749)
Use TSMP ping for reachability, but leave Disco ping for when we call Ping() since we often use that to determine whether we have a direct connection.

Also adds unit tests to make sure Ping() returns direct connection vs DERP correctly.
2024-01-22 17:37:15 +04:00
7ffd99cfe2 fix: use DiscoPing (partially reverts #11306) (#11744) 2024-01-22 12:40:21 +00:00
3d85cdfa11 feat: set peers lost when disconnected from coordinator (#11681)
Adds support to Coordination to call SetAllPeersLost() when it is closed. This ensure that when we disconnect from a Coordinator, we set all peers lost.

This covers CoderSDK (CLI client) and Agent.  Next PR will cover MultiAgent (notably, `wsproxy`).
2024-01-22 15:26:20 +04:00
f01cab9894 feat: use tailnet v2 API for coordination (#11638)
This one is huge, and I'm sorry.

The problem is that once I change `tailnet.Conn` to start doing v2 behavior, I kind of have to change it everywhere, including in CoderSDK (CLI), the agent, wsproxy, and ServerTailnet.

There is still a bit more cleanup to do, and I need to add code so that when we lose connection to the Coordinator, we mark all peers as LOST, but that will be in a separate PR since this is big enough!
2024-01-22 11:07:50 +04:00
58873fa7e2 chore: remove unused context/cancel in tailnet Conn (#11399)
Spotted during code read; unused fields
2024-01-05 08:15:42 +04:00
520c3a8ff7 fix: use TSMP for pings and checking reachability (#11306)
We're seeing some flaky tests related to agent connectivity - https://github.com/coder/coder/actions/runs/7286675441/job/19856270998

I'm pretty sure what happened in this one is that the client opened a connection while the wgengine was in the process of reconfiguring the wireguard device, so the fact that the peer became "active" as a result of traffic being sent was not noticed.

The test calls `AwaitReachable()` but this only tests the disco layer, so it doesn't wait for wireguard to come up.

I think we should be using TSMP for pinging and reachability, since this operates at the IP layer, and therefore requires that wireguard comes up before being successful.

This should also help with the problems we have seen where a TCP connection starts before wireguard is up and the initial round trip has to wait for the 5 second wireguard handshake retry.

fixes: #11294
2024-01-02 15:53:52 +04:00
f400d8a0c5 fix: handle SIGHUP from OpenSSH (#10638)
Fixes an issue where remote forwards are not correctly torn down when using OpenSSH with `coder ssh --stdio`.  OpenSSH sends a disconnect signal, but then also sends SIGHUP to `coder`.  Previously, we just exited when we got SIGHUP, and this raced against properly disconnecting.

Fixes https://github.com/coder/customers/issues/327
2023-11-13 15:14:42 +04:00
94eb9b8db1 fix: disable t.Parallel on TestPortForward (#10449)
I've said it before, I'll say it again: you can't create a timed context before calling `t.Parallel()` and then use it after.

Fixes flakes like https://github.com/coder/coder/actions/runs/6716682414/job/18253279157

I've chosen just to drop `t.Parallel()` entirely rather than create a second context after the parallel call, since the vast majority of the test time happens before where the parallel call was.  It does all the tailnet setup before `t.Parallel()`.
Leaving a call to `t.Parallel()` is a bug risk for future maintainers to come in and use the wrong context in the latter part of the test by accident.
2023-11-01 13:45:13 +04:00
236e84c4d6 feat: add logging for forwarded TCP connections
part of #7963

log TCP connections as they are forwarded by gVisor
2023-10-09 19:41:26 +04:00
03a7d2f70b chore: fix servertailnet test flake (#10110)
https://github.com/coder/coder/actions/runs/6424100765/job/17444018788?pr=10083#step:5:771
2023-10-06 11:31:53 -05:00
19d7da3d24 refactor(coderd/database): split Time and Now into dbtime package (#9482)
Ref: #9380
2023-09-01 16:50:12 +00:00
64ef867b4f fix(tailnet): re-add keepalives (#9410) 2023-08-29 15:21:30 -05:00
64df076328 feat: add server flag to force DERP to use always websockets (#9238) 2023-08-24 17:22:31 +00:00
22e781eced chore: add /v2 to import module path (#9072)
* chore: add /v2 to import module path

go mod requires semantic versioning with versions greater than 1.x

This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```

Migrate generated files to import /v2

* Fix gen
2023-08-18 18:55:43 +00:00
5b2ea2e94f fix(tailnet): disable wireguard trimming (#9098)
Co-authored-by: Spike Curtis <spike@coder.com>
2023-08-15 14:26:56 -05:00
344d32b2f1 feat(coderd): expire agents from server tailnet (#9092) 2023-08-14 20:38:37 -05:00
bc862fa493 chore: upgrade tailscale to v1.46.1 (#8913) 2023-08-09 19:50:26 +00:00
3c52b01850 chore: add tailscale magicsock debug logging controls (#8982) 2023-08-08 17:56:08 +00:00
25e30c6f41 feat(cli): support fine-grained server log filtering (#8748) 2023-07-26 16:46:22 -05:00
c47b78c44b chore: replace wsconncache with a single tailnet (#8176) 2023-07-12 17:37:31 -05:00
f40865bc2f chore: use mutex around blockEndpoints (#8209)
https://github.com/coder/coder/actions/runs/5378950122/jobs/9759972142
2023-06-26 10:01:50 -05:00
a28d422c35 feat: add flag to disable all direct connections (#7936) 2023-06-21 22:02:05 +00:00
b1d1b63113 chore: ensure logs consistency across Coder (#8083) 2023-06-20 12:30:45 +02:00
247f8a973f feat: replace ssh maxTimeout with keep-alive mechanism (#8062)
* Bump up coder/ssh

* feat: Set default agent timeout to ~72h

* Address PR comments

* Fix
2023-06-16 15:22:18 +02:00
b3689c8f64 Only send tailnet nodes updates with preferred DERP (#7387)
Signed-off-by: Spike Curtis <spike@coder.com>
2023-05-04 14:30:57 +04:00
3eb7f06bf1 feat(agent): add http debug routes for magicsock (#7287) 2023-04-26 13:01:49 -05:00
745868fd8a revert: chore: upgrade tailscale (#7236) 2023-04-20 17:58:22 -05:00
a86830a283 chore: upgrade tailscale (#7207) 2023-04-20 13:29:56 -05:00
fbf329fbb7 fix(tailnet): set TCP keepalive idle to 72 hours for SSH conns (#7196) 2023-04-18 17:53:11 -05:00
bc18f6c113 fix: add CODER_AGENT_TAILNET_LISTEN_PORT for specifying a static tailnet port (#6980)
Fixes #5175.
2023-04-03 16:20:19 +00:00
97f77c4507 feat: allow DERP headers to be set (#6572)
* feat: allow DERP headers to be set

* chore: remove custom flag

* Clone DERP header on client create

* Adjust to use interface to cast headers

---------

Co-authored-by: Kyle Carberry <kyle@carberry.com>
2023-03-21 18:43:20 +00:00
17bc5794d4 fix: direct embedded derp traffic directly to the server (#6595)
Prior to this change, DERP traffic would route from `coderd` to the
`CODER_ACCESS_URL` to reach the internal DERP server, which may have
resulted in slower connections due to proxying, or the failure of
web traffic entirely.

If your Coder deployment has a proxy in front of it, your traffic through
web terminals, apps, and port-forwarding is about to get a lot faster!
2023-03-14 14:46:47 +00:00
7a8ccda40e chore: copy forced derp websockets to fix flake (#6475)
See: https://github.com/coder/coder/actions/runs/4350034299/jobs/7600478389
2023-03-06 21:29:41 -06:00
2ff1c6d613 feat: add agent stats for different connection types (#6412)
This allows us to track when our extensions are used, when the
web terminal is used, and average connection latency to the agent.
2023-03-02 08:06:00 -06:00
1724cbf872 feat: automatically use websockets if DERP upgrade is unavailable (#6381)
* feat: automatically use websockets if DERP upgrade is unavailable

This might be our biggest hangup for deployments at the moment...

Load balancers by default do not support the DERP protocol, so many
of our prospects and customers run into failing workspace connections.
This automatically swaps to use WebSockets, and reports the reason to
coderd.

In a future contribution, a warning will appear by the agent if it was
forced to use WebSockets instead of DERP.

* Fix nil pointer type in Tailscale dep

* Fix requested changes
2023-03-01 22:18:14 +00:00
cae8b88f60 fix(tailnet): Avoid logging netmap (#6342) 2023-02-25 08:06:38 +00:00
677721e4a1 fix(tailnet): Skip nodes without DERP, avoid use of RemoveAllPeers (#6320)
* fix(tailnet): Skip nodes without DERP, avoid use of RemoveAllPeers
2023-02-24 18:16:29 +02:00
a414de9e81 fix(tailnet): Improve tailnet setup and agentconn stability (#6292)
* fix(tailnet): Improve start and close to detect connection races

* fix: Prevent agentConn use before ready via AwaitReachable

* fix(tailnet): Ensure connstats are closed on conn close

* fix(codersdk): Use AwaitReachable in DialWorkspaceAgent

* fix(tailnet): Improve logging via slog.Helper()
2023-02-24 13:11:28 +02:00