# Scale Tests and Utilities

We scale-test Coder with [a built-in utility](#scale-testing-utility) that can
be used in your environment for insights into how Coder scales with your
infrastructure. For scale-testing Kubernetes clusters, we recommend installing
and using the dedicated Coder template,
[scaletest-runner](https://github.com/coder/coder/tree/main/scaletest/templates/scaletest-runner).

Learn more about [Coder’s architecture](../../architecture/architecture.md) and
our [scale-testing methodology](scale-testing.md).

## Recent scale tests

> Note: the information below is for reference purposes only, and is not
> intended to be used as guidelines for infrastructure sizing. Review the
> [Reference Architectures](../../architecture/validated-arch.md#node-sizing)
> for hardware sizing recommendations.

| Environment      | Coder CPU | Coder RAM | Coder Replicas | Database          | Users | Concurrent builds | Concurrent connections (Terminal/SSH) | Coder Version | Last tested  |
| ---------------- | --------- | --------- | -------------- | ----------------- | ----- | ----------------- | ------------------------------------- | ------------- | ------------ |
| Kubernetes (GKE) | 3 cores   | 12 GB     | 1              | db-f1-micro       | 200   | 3                 | 200 simulated                         | `v0.24.1`     | Jun 26, 2023 |
| Kubernetes (GKE) | 4 cores   | 8 GB      | 1              | db-custom-1-3840  | 1500  | 20                | 1,500 simulated                       | `v0.24.1`     | Jun 27, 2023 |
| Kubernetes (GKE) | 2 cores   | 4 GB      | 1              | db-custom-1-3840  | 500   | 20                | 500 simulated                         | `v0.27.2`     | Jul 27, 2023 |
| Kubernetes (GKE) | 2 cores   | 8 GB      | 2              | db-custom-2-7680  | 1000  | 20                | 1,000 simulated                       | `v2.2.1`      | Oct 9, 2023  |
| Kubernetes (GKE) | 4 cores   | 16 GB     | 2              | db-custom-8-30720 | 2000  | 50                | 2,000 simulated                       | `v2.8.4`      | Feb 28, 2024 |
| Kubernetes (GKE) | 2 cores   | 4 GB      | 2              | db-custom-2-7680  | 1000  | 50                | 1,000 simulated                       | `v2.10.2`     | Apr 26, 2024 |

> Note: a simulated connection reads and writes random data at 40 KB/s per
> connection.
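
As a rough back-of-the-envelope check, the per-connection rate above implies the
aggregate load each test sustained. For example, for the 2,000-connection runs
(figures derived from the table above, illustrative only):

```shell
# 2000 simulated connections * 40 KB/s each ≈ 80,000 KB/s (~80 MB/s)
# of simulated terminal traffic, on top of ordinary control-plane load.
echo "$((2000 * 40)) KB/s aggregate"
```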

## Scale testing utility

Since Coder's performance is highly dependent on the templates and workflows you
support, you may wish to use our internal scale testing utility against your own
environments.

> Note: This utility is experimental. It is not subject to any compatibility
> guarantees, and may cause interruptions for your users. To avoid potential
> outages and orphaned resources, we recommend running scale tests on a
> secondary "staging" environment or a dedicated
> [Kubernetes playground cluster](https://github.com/coder/coder/tree/main/scaletest/terraform).
> Run it against a production environment at your own risk.

### Create workspaces

The following command will provision a number of Coder workspaces using the
specified template and extra parameters.

```shell
coder exp scaletest create-workspaces \
  --retry 5 \
  --count "${SCALETEST_PARAM_NUM_WORKSPACES}" \
  --template "${SCALETEST_PARAM_TEMPLATE}" \
  --concurrency "${SCALETEST_PARAM_CREATE_CONCURRENCY}" \
  --timeout 5h \
  --job-timeout 5h \
  --no-cleanup \
  --output json:"${SCALETEST_RESULTS_DIR}/create-workspaces.json"

# Run `coder exp scaletest create-workspaces --help` for all usage
```

The command does the following:

1. Create `${SCALETEST_PARAM_NUM_WORKSPACES}` workspaces concurrently
   (concurrency level: `${SCALETEST_PARAM_CREATE_CONCURRENCY}`) using the
   template `${SCALETEST_PARAM_TEMPLATE}` (example values for these variables
   are shown below).
1. Leave workspaces running to use in the next steps (`--no-cleanup` option).
1. Store provisioning results in JSON format.
1. If you don't want the creation process to be interrupted by any errors, use
   the `--retry 5` flag.
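
The `SCALETEST_PARAM_*` and `SCALETEST_RESULTS_DIR` variables are placeholders;
when running the command by hand, export them first. A minimal sketch with
purely illustrative values:

```shell
# Illustrative values only; adjust to your environment and template name.
export SCALETEST_PARAM_NUM_WORKSPACES=100
export SCALETEST_PARAM_TEMPLATE="kubernetes"
export SCALETEST_PARAM_CREATE_CONCURRENCY=10
export SCALETEST_RESULTS_DIR="${HOME}/scaletest-results"
mkdir -p "${SCALETEST_RESULTS_DIR}"
```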

### Traffic generation

Given an existing set of workspaces created previously with `create-workspaces`,
the following command will generate traffic similar to that of Coder's Web
Terminal against those workspaces.

```shell
# Produce load at roughly 625 MB/s (25 MB every 40 ms).
coder exp scaletest workspace-traffic \
  --template "${SCALETEST_PARAM_GREEDY_AGENT_TEMPLATE}" \
  --bytes-per-tick $((1024 * 1024 * 25)) \
  --tick-interval 40ms \
  --timeout "$((delay))s" \
  --job-timeout "$((delay))s" \
  --scaletest-prometheus-address 0.0.0.0:21113 \
  --target-workspaces "0:100" \
  --trace=false \
  --output json:"${SCALETEST_RESULTS_DIR}/traffic-${type}-greedy-agent.json"
```
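
Note that `delay` and `type` are shell variables expected to be set by the
calling script or exported beforehand. The target rate is simply
`--bytes-per-tick` divided by `--tick-interval`; a quick sanity check of the
numbers used above:

```shell
# 25 MiB per tick, one tick every 40 ms => 25 ticks/s.
# (1024 * 1024 * 25) bytes * 25 ticks/s = 655,360,000 bytes/s ≈ 625 MiB/s.
echo "$(( (1024 * 1024 * 25) * 1000 / 40 )) bytes/s"
```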

Traffic generation can be parametrized:

1. Send `--bytes-per-tick` bytes every `--tick-interval`.
1. Enable tracing for performance debugging.
1. Target a range of workspaces with `--target-workspaces 0:100`.
1. For dashboard traffic: Target a range of users with `--target-users 0:100`.
1. Store traffic generation results in JSON format.
1. Expose a dedicated Prometheus address (`--scaletest-prometheus-address`) for
   scaletest-specific metrics (see the example after this list).
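
While a test is running, you can spot-check those scaletest-specific metrics by
scraping the address passed to `--scaletest-prometheus-address`. This sketch
assumes the standard `/metrics` path and that you run it on the same host:

```shell
# Peek at the first few scaletest metrics exposed during the run.
curl -s http://localhost:21113/metrics | head -n 20
```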

The `workspace-traffic` command also supports other modes, such as SSH traffic
and workspace apps:

1. For SSH traffic: Use the `--ssh` flag to generate SSH traffic instead of Web
   Terminal traffic (see the sketch after this list).
1. For workspace app traffic: Use the `--app [wsdi|wsec|wsra]` flag to select
   the app behavior (modes: _WebSocket discard_, _WebSocket echo_, _WebSocket
   read_).
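
For example, an SSH-mode run against the same set of workspaces might look like
the following; the rate and timeout values here are illustrative only:

```shell
coder exp scaletest workspace-traffic \
  --template "${SCALETEST_PARAM_TEMPLATE}" \
  --ssh \
  --bytes-per-tick 1024 \
  --tick-interval 100ms \
  --timeout 10m \
  --job-timeout 10m \
  --output json:"${SCALETEST_RESULTS_DIR}/traffic-ssh.json"
```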

### Cleanup

The scaletest utility will attempt to clean up all workspaces it creates. If you
wish to clean up all scaletest workspaces yourself, you can run the following
command:

```shell
coder exp scaletest cleanup \
  --cleanup-job-timeout 2h \
  --cleanup-timeout 15min
```

This will delete all workspaces and users with the prefix `scaletest-`.
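
To confirm nothing was left behind, you can list remaining workspaces and filter
by the prefix (a simple sketch; requires permission to view other users'
workspaces):

```shell
coder list --all | grep scaletest- || echo "No scaletest workspaces remaining."
```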

## Scale testing template

Consider using a dedicated
[scaletest-runner](https://github.com/coder/coder/tree/main/scaletest/templates/scaletest-runner)
template alongside the CLI utility for testing large-scale Kubernetes clusters.

The template deploys a main workspace with scripts used to orchestrate Coder:
creating workspaces, generating workspace traffic, and load-testing workspace
apps.

### Parameters

The _scaletest-runner_ offers the following configuration options:

- Workspace size selection: minimal/small/medium/large (_default_: minimal,
  which contains just enough resources for a Coder agent to run without
  additional workloads)
- Number of workspaces
- Wait duration between scenarios, or a staggered approach

The template exposes parameters to control the traffic dimensions for SSH
connections, workspace apps, and dashboard tests:

- Traffic duration of the load test scenario
- Traffic percentage of targeted workspaces
- Bytes per tick and tick interval
- _For workspace apps_: modes (echo, read random data, or write and discard)

Scale testing concurrency can be controlled with the following parameters:

- Enable parallel scenarios - interleave different traffic patterns (SSH,
  workspace apps, dashboard traffic, etc.)
- Workspace creation concurrency level (_default_: 10)
- Job concurrency level - generate workspace traffic using multiple jobs
  (_default_: 0)
- Cleanup concurrency level

### Kubernetes cluster

It is recommended to learn how to operate the _scaletest-runner_ before running
it against a staging cluster (or production at your own risk). Coder provides
different
[workspace configurations](https://github.com/coder/coder/tree/main/scaletest/templates)
that operators can deploy depending on the traffic projections.

There are a few workspace size options available:

| Workspace size | vCPU | Memory | Persisted storage | Details                                               |
| -------------- | ---- | ------ | ----------------- | ----------------------------------------------------- |
| minimal        | 1    | 2 Gi   | None              |                                                       |
| small          | 1    | 1 Gi   | None              |                                                       |
| medium         | 2    | 2 Gi   | None              | Medium-sized cluster offers the greedy agent variant. |
| large          | 4    | 4 Gi   | None              |                                                       |

Note: Review the selected cluster template and edit the node affinity to match
your setup.

#### Greedy agent

The greedy agent variant is a template modification that makes the Coder agent
transmit large metadata (size: 4K) while reporting stats. The transmission of
large chunks puts extra overhead on coderd instances and agents when handling
and storing the data.

Use this template variant to verify the limits of cluster performance.

### Observability

During scale tests, operators can monitor progress using a Grafana dashboard.
Coder offers a comprehensive overview
[dashboard](https://github.com/coder/coder/blob/main/scaletest/scaletest_dashboard.json)
that can be seamlessly integrated into your internal Grafana deployment.

This dashboard provides insights into various aspects, including:

- Utilization of resources within the Coder control plane (CPU, memory, pods)
- Database performance metrics (CPU, memory, I/O, connections, queries)
- Coderd API performance (requests, latency, error rate)
- Resource consumption within Coder workspaces (CPU, memory, network usage)
- Internal metrics related to provisioner jobs

Note: Database metrics are disabled by default and can be enabled by setting the
environment variable `CODER_PROMETHEUS_COLLECT_DB_METRICS` to `true`.
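
For example, if coderd is managed directly from a shell or a systemd unit, you
would set the variable in the server's environment. A minimal sketch, assuming
Prometheus metrics are already enabled for your deployment:

```shell
export CODER_PROMETHEUS_COLLECT_DB_METRICS=true
coder server
```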

It is highly recommended to deploy a solution for centralized log collection and
aggregation. The presence of error logs may indicate an underscaled deployment
of Coder, necessitating action from operators.

## Autoscaling

We generally do not recommend using an autoscaler that modifies the number of
coderd replicas. In particular, scale-down events can cause interruptions for a
large number of users.

Coderd is different from a simple request-response HTTP service in that it
services long-lived connections whenever it proxies HTTP applications like IDEs
or terminals that rely on websockets, or when it relays tunneled connections to
workspaces. Loss of a coderd replica will drop these long-lived connections and
interrupt users. For example, if you have 4 coderd replicas behind a load
balancer, and an autoscaler decides to reduce it to 3, roughly 25% of the
connections will drop. An even larger proportion of users could be affected if
they use applications that use more than one websocket.

The severity of the interruption varies by application. Coder's web terminal,
for example, will reconnect to the same session and continue. So, this should
not be interpreted as saying that coderd replicas should never be taken down
for any reason.

We recommend you plan to run enough coderd replicas to comfortably meet your
weekly high-water-mark load, and monitor coderd peak CPU & memory utilization
over the long term, reevaluating periodically. When scaling down (or performing
upgrades), schedule these outside normal working hours to minimize user
interruptions.

### A note for Kubernetes users

When running on Kubernetes on cloud infrastructure (i.e. not bare metal), many
operators choose to employ a _cluster_ autoscaler that adds and removes
Kubernetes _nodes_ according to load. Coder can coexist with such cluster
autoscalers, but we recommend you take steps to prevent the autoscaler from
evicting coderd pods, as an eviction will cause the same interruptions as
described above. For example, if you are using the
[Kubernetes cluster autoscaler](https://kubernetes.io/docs/reference/labels-annotations-taints/#cluster-autoscaler-kubernetes-io-safe-to-evict),
you may wish to set `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` as
an annotation on the coderd deployment.
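
The annotation must end up on the coderd _pods_ for the cluster autoscaler to
honor it, so apply it to the deployment's pod template. A minimal sketch using
`kubectl patch`, assuming the deployment is named `coder` in the `coder`
namespace (adjust for your install, or set the equivalent pod annotation in
your Helm values):

```shell
kubectl --namespace coder patch deployment coder --type merge --patch \
  '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}}}}}'
```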

## Troubleshooting

If a load test fails or if you are experiencing performance issues during
day-to-day use, you can leverage Coder's [Prometheus metrics](../prometheus.md)
to identify bottlenecks during scale tests. Additionally, you can use your
existing cloud monitoring stack to measure load, view server logs, etc.
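
On Kubernetes, a quick first pass over server logs during or after a failed run
might look like the following (assumes the deployment is named `coder` in the
`coder` namespace):

```shell
# Scan the last hour of coderd logs for errors.
kubectl --namespace coder logs deployment/coder --since=1h | grep -i error
```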