Mirror of https://github.com/coder/coder.git, synced 2025-07-13 21:36:50 +00:00

docs: add new scaling doc to best practices section (#15904)

[preview](https://coder.com/docs/@bp-scaling-coder/tutorials/best-practices/scale-coder)

Co-authored-by: Spike Curtis <spike@coder.com>
@@ -5,7 +5,7 @@ without compromising service. This process encompasses infrastructure setup,
 traffic projections, and aggressive testing to identify and mitigate potential
 bottlenecks.
 
-A dedicated Kubernetes cluster for Coder is recommended to configure, host and
+A dedicated Kubernetes cluster for Coder is recommended to configure, host, and
 manage Coder workloads. Kubernetes provides container orchestration
 capabilities, allowing Coder to efficiently deploy, scale, and manage workspaces
 across a distributed infrastructure. This ensures high availability, fault
@@ -13,27 +13,29 @@ tolerance, and scalability for Coder deployments. Coder is deployed on this
 cluster using the
 [Helm chart](../../install/kubernetes.md#4-install-coder-with-helm).
 
+For more information about scaling, see our [Coder scaling best practices](../../tutorials/best-practices/scale-coder.md).
+
 ## Methodology
 
 Our scale tests include the following stages:
 
 1. Prepare environment: create expected users and provision workspaces.
 
-2. SSH connections: establish user connections with agents, verifying their
+1. SSH connections: establish user connections with agents, verifying their
    ability to echo back received content.
 
-3. Web Terminal: verify the PTY connection used for communication with Web
+1. Web Terminal: verify the PTY connection used for communication with Web
    Terminal.
 
-4. Workspace application traffic: assess the handling of user connections with
+1. Workspace application traffic: assess the handling of user connections with
    specific workspace apps, confirming their capability to echo back received
    content effectively.
 
-5. Dashboard evaluation: verify the responsiveness and stability of Coder
+1. Dashboard evaluation: verify the responsiveness and stability of Coder
    dashboards under varying load conditions. This is achieved by simulating user
    interactions using instances of headless Chromium browsers.
 
-6. Cleanup: delete workspaces and users created in step 1.
+1. Cleanup: delete workspaces and users created in step 1.
 
 ## Infrastructure and setup requirements
 
@@ -54,13 +56,16 @@ channel for IDEs with VS Code and JetBrains plugins.
 The basic setup of scale tests environment involves:
 
 1. Scale tests runner (32 vCPU, 128 GB RAM)
-2. Coder: 2 replicas (4 vCPU, 16 GB RAM)
-3. Database: 1 instance (2 vCPU, 32 GB RAM)
-4. Provisioner: 50 instances (0.5 vCPU, 512 MB RAM)
+1. Coder: 2 replicas (4 vCPU, 16 GB RAM)
+1. Database: 1 instance (2 vCPU, 32 GB RAM)
+1. Provisioner: 50 instances (0.5 vCPU, 512 MB RAM)
 
-The test is deemed successful if users did not experience interruptions in their
-workflows, `coderd` did not crash or require restarts, and no other internal
-errors were observed.
+The test is deemed successful if:
+
+- Users did not experience interruptions in their
+  workflows,
+- `coderd` did not crash or require restarts, and
+- No other internal errors were observed.
 
 ## Traffic Projections
 
@@ -90,11 +95,11 @@ Database:
 
 ## Available reference architectures
 
-[Up to 1,000 users](./validated-architectures/1k-users.md)
+- [Up to 1,000 users](./validated-architectures/1k-users.md)
 
-[Up to 2,000 users](./validated-architectures/2k-users.md)
+- [Up to 2,000 users](./validated-architectures/2k-users.md)
 
-[Up to 3,000 users](./validated-architectures/3k-users.md)
+- [Up to 3,000 users](./validated-architectures/3k-users.md)
 
 ## Hardware recommendation
 
@@ -107,7 +112,7 @@ guidance on optimal configurations. A reasonable approach involves using scaling
 formulas based on factors like CPU, memory, and the number of users.
 
 While the minimum requirements specify 1 CPU core and 2 GB of memory per
-`coderd` replica, it is recommended to allocate additional resources depending
+`coderd` replica, we recommend that you allocate additional resources depending
 on the workload size to ensure deployment stability.
 
 #### CPU and memory usage
@@ -1,20 +1,23 @@
 # Scale Tests and Utilities
 
-We scale-test Coder with [a built-in utility](#scale-testing-utility) that can
+We scale-test Coder with a built-in utility that can
 be used in your environment for insights into how Coder scales with your
-infrastructure. For scale-testing Kubernetes clusters we recommend to install
+infrastructure. For scale-testing Kubernetes clusters we recommend that you install
 and use the dedicated Coder template,
 [scaletest-runner](https://github.com/coder/coder/tree/main/scaletest/templates/scaletest-runner).
 
 Learn more about [Coder’s architecture](./architecture.md) and our
 [scale-testing methodology](./scale-testing.md).
 
+For more information about scaling, see our [Coder scaling best practices](../../tutorials/best-practices/scale-coder.md).
+
 ## Recent scale tests
 
-> Note: the below information is for reference purposes only, and are not
-> intended to be used as guidelines for infrastructure sizing. Review the
-> [Reference Architectures](./validated-architectures/index.md#node-sizing) for
-> hardware sizing recommendations.
+The information in this doc is for reference purposes only, and is not intended
+to be used as guidelines for infrastructure sizing.
+
+Review the [Reference Architectures](./validated-architectures/index.md#node-sizing) for
+hardware sizing recommendations.
 
 | Environment | Coder CPU | Coder RAM | Coder Replicas | Database | Users | Concurrent builds | Concurrent connections (Terminal/SSH) | Coder Version | Last tested |
 |------------------|-----------|-----------|----------------|-------------------|-------|-------------------|---------------------------------------|---------------|--------------|
@@ -25,8 +28,7 @@ Learn more about [Coder’s architecture](./architecture.md) and our
 | Kubernetes (GKE) | 4 cores | 16 GB | 2 | db-custom-8-30720 | 2000 | 50 | 2000 simulated | `v2.8.4` | Feb 28, 2024 |
 | Kubernetes (GKE) | 2 cores | 4 GB | 2 | db-custom-2-7680 | 1000 | 50 | 1000 simulated | `v2.10.2` | Apr 26, 2024 |
 
-> Note: a simulated connection reads and writes random data at 40KB/s per
-> connection.
+> Note: A simulated connection reads and writes random data at 40KB/s per connection.
 
 ## Scale testing utility
 
@@ -34,17 +36,24 @@ Since Coder's performance is highly dependent on the templates and workflows you
 support, you may wish to use our internal scale testing utility against your own
 environments.
 
-> Note: This utility is experimental. It is not subject to any compatibility
-> guarantees, and may cause interruptions for your users. To avoid potential
-> outages and orphaned resources, we recommend running scale tests on a
-> secondary "staging" environment or a dedicated
-> [Kubernetes playground cluster](https://github.com/coder/coder/tree/main/scaletest/terraform).
-> Run it against a production environment at your own risk.
+<blockquote class="admonition important">
+
+This utility is experimental.
+
+It is not subject to any compatibility guarantees and may cause interruptions
+for your users.
+
+To avoid potential outages and orphaned resources, we recommend that you run
+scale tests on a secondary "staging" environment or a dedicated
+[Kubernetes playground cluster](https://github.com/coder/coder/tree/main/scaletest/terraform).
+
+Run it against a production environment at your own risk.
+
+</blockquote>
 
 ### Create workspaces
 
 The following command will provision a number of Coder workspaces using the
-specified template and extra parameters.
+specified template and extra parameters:
 
 ```shell
 coder exp scaletest create-workspaces \
@@ -56,8 +65,6 @@ coder exp scaletest create-workspaces \
   --job-timeout 5h \
   --no-cleanup \
   --output json:"${SCALETEST_RESULTS_DIR}/create-workspaces.json"
-
-# Run `coder exp scaletest create-workspaces --help` for all usage
 ```
 
 The command does the following:
@@ -70,6 +77,12 @@ The command does the following:
 1. If you don't want the creation process to be interrupted by any errors, use
    the `--retry 5` flag.
 
+For more built-in `scaletest` options, use the `--help` flag:
+
+```shell
+coder exp scaletest create-workspaces --help
+```
+
 ### Traffic Generation
 
 Given an existing set of workspaces created previously with `create-workspaces`,
@@ -105,7 +118,11 @@ The `workspace-traffic` supports also other modes - SSH traffic, workspace app:
 1. For SSH traffic: Use `--ssh` flag to generate SSH traffic instead of Web
    Terminal.
 1. For workspace app traffic: Use `--app [wsdi|wsec|wsra]` flag to select app
-   behavior. (modes: _WebSocket discard_, _WebSocket echo_, _WebSocket read_).
+   behavior.
+
+   - `wsdi`: WebSocket discard
+   - `wsec`: WebSocket echo
+   - `wsra`: WebSocket read
 
 ### Cleanup
 
@@ -243,6 +243,11 @@
       "title": "Scaling Utilities",
       "description": "Tools to help you scale your deployment",
       "path": "./admin/infrastructure/scale-utility.md"
+    },
+    {
+      "title": "Scaling best practices",
+      "description": "How to prepare a Coder deployment for scale",
+      "path": "./tutorials/best-practices/scale-coder.md"
     }
   ]
 },
@@ -761,16 +766,21 @@
   "description": "Guides to help you make the most of your Coder experience",
   "path": "./tutorials/best-practices/index.md",
   "children": [
-    {
-      "title": "Security - best practices",
-      "description": "Make your Coder deployment more secure",
-      "path": "./tutorials/best-practices/security-best-practices.md"
-    },
     {
       "title": "Organizations - best practices",
       "description": "How to make the best use of Coder Organizations",
      "path": "./tutorials/best-practices/organizations.md"
     },
+    {
+      "title": "Scale Coder",
+      "description": "How to prepare a Coder deployment for scale",
+      "path": "./tutorials/best-practices/scale-coder.md"
+    },
+    {
+      "title": "Security - best practices",
+      "description": "Make your Coder deployment more secure",
+      "path": "./tutorials/best-practices/security-best-practices.md"
+    },
     {
       "title": "Speed up your workspaces",
       "description": "Speed up your Coder templates and workspaces",
docs/tutorials/best-practices/scale-coder.md (new file, 322 lines):

# Scale Coder

This best practice guide helps you prepare a Coder deployment that you can
scale up as use grows, and keep it operating smoothly with a high number of
active users and workspaces.

## Observability

Observability is one of the most important aspects of a scalable Coder deployment.
When you have visibility into performance and usage metrics, you can make informed
decisions about what changes you should make.

[Monitor your Coder deployment](../../admin/monitoring/index.md) with log output
and metrics to identify potential bottlenecks before they negatively affect the
end-user experience, and to measure the effects of modifications you make to your
deployment.

- Log output
  - Capture log output from Coder Server instances and external provisioner daemons
    and store it in a searchable log store like Loki, CloudWatch Logs, or other tools.
  - Retain logs for a minimum of thirty days, ideally ninety days.
    This allows you to investigate when anomalous behaviors began.

- Metrics
  - Capture infrastructure metrics like CPU, memory, open files, and network I/O for all
    Coder Server, external provisioner daemon, workspace proxy, and PostgreSQL instances.
  - Capture Coder Server and external provisioner daemon metrics
    [via Prometheus](#how-to-capture-coder-server-metrics-with-prometheus).

Retain metric time series for at least six months. This allows you to see
performance trends relative to user growth.

For a more comprehensive overview, integrate metrics with an observability
dashboard like [Grafana](../../admin/monitoring/index.md).

### Observability key metrics

Configure alerting based on these metrics to ensure you surface problems before
they affect the end-user experience.

- CPU and memory utilization
  - Monitor utilization as a fraction of the available resources on the instance.

    Utilization will vary with use throughout the course of a day, week, and longer
    timelines. Monitor trends and pay special attention to the daily and weekly
    peak utilization. Use long-term trends to plan infrastructure upgrades.

- Tail latency of Coder Server API requests
  - High tail latency can indicate that Coder Server or the PostgreSQL database is
    underprovisioned for the load.
  - Use the `coderd_api_request_latencies_seconds` metric.

- Tail latency of database queries
  - High tail latency can indicate that the PostgreSQL database is low on resources.
  - Use the `coderd_db_query_latencies_seconds` metric.
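As one sketch of such alerting, a Prometheus rule on API tail latency might look like the fragment below. It assumes `coderd_api_request_latencies_seconds` is exposed as a histogram (with `_bucket` series) and that a 2-second p99 threshold suits your deployment; tune both against what your Coder version actually exports.

```yaml
groups:
  - name: coder
    rules:
      - alert: CoderAPIHighTailLatency
        # p99 API request latency over the last 5 minutes, across all endpoints.
        expr: histogram_quantile(0.99, sum(rate(coderd_api_request_latencies_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warning
```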
### How to capture Coder server metrics with Prometheus

Edit your Helm `values.yaml` to capture metrics from Coder Server and external
provisioner daemons with [Prometheus](../../admin/integrations/prometheus.md):

1. Enable Prometheus metrics:

   ```yaml
   CODER_PROMETHEUS_ENABLE=true
   ```

1. Enable database metrics:

   ```yaml
   CODER_PROMETHEUS_COLLECT_DB_METRICS=true
   ```

1. For a high-scale deployment, configure agent stats to avoid large cardinality,
   or disable them:

   - Configure agent stats:

     ```yaml
     CODER_PROMETHEUS_AGGREGATE_AGENT_STATS_BY=agent_name
     ```

   - Disable agent stats:

     ```yaml
     CODER_PROMETHEUS_COLLECT_AGENT_STATS=false
     ```
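With metrics enabled, a minimal Prometheus scrape job might look like this sketch. The target address is an assumption: `2112` is the default port in `CODER_PROMETHEUS_ADDRESS`, and `coder.coder.svc.cluster.local` is a typical in-cluster service name; adjust both to your deployment.

```yaml
scrape_configs:
  - job_name: coder
    scrape_interval: 30s
    static_configs:
      # Assumed in-cluster address of the Coder metrics endpoint.
      - targets: ["coder.coder.svc.cluster.local:2112"]
```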
## Coder Server

### Locality

If increased availability of the Coder API is a concern, deploy at least three
instances of Coder Server. Spread the instances across nodes with anti-affinity
rules in Kubernetes or in different availability zones of the same geographic
region.

Do not deploy in different geographic regions.

Coder Servers need to be able to communicate with one another directly with low
latency, under 10ms. Note that this is for the availability of the Coder API.
Workspaces are not fault tolerant unless they are explicitly built that way at
the template level.

Deploy Coder Server instances as geographically close to PostgreSQL as possible.
Low-latency communication (under 10ms) with Postgres is essential for Coder
Server's performance.
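As a sketch of those anti-affinity rules, a Kubernetes `podAntiAffinity` block that prefers spreading Coder Server replicas across nodes might look like this. The `app.kubernetes.io/name: coder` label is an assumption; verify the labels your Helm chart actually applies.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          # Prefer placing each replica on a distinct node.
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: coder
```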
### Scaling

Coder Server can be scaled both vertically for bigger instances and horizontally
for more instances.

Aim to keep the number of Coder Server instances relatively small, preferably
under ten instances, and opt for vertical scale over horizontal scale after
meeting availability requirements.

Coder's
[validated architectures](../../admin/infrastructure/validated-architectures/index.md)
give specific sizing recommendations for various user scales. These are a useful
starting point, but very few deployments will remain stable at a predetermined
user level over the long term. We recommend monitoring and adjusting resources as
needed.

We don't recommend that you autoscale the Coder Servers. Instead, scale the
deployment for peak weekly usage.

Although Coder Server persists no internal state, it operates as a proxy for end
users to their workspaces in two capacities:

1. As an HTTP proxy when they access workspace applications in their browser via
   the Coder Dashboard.

1. As a DERP proxy when establishing tunneled connections with CLI tools like
   `coder ssh`, `coder port-forward`, and others, and with desktop IDEs.

Stopping a Coder Server instance will (momentarily) disconnect any users
currently connecting through that instance. Adding a new instance is not
disruptive, but you should remove instances and perform upgrades during a
maintenance window to minimize disruption.

## Provisioner daemons

### Locality

We recommend that you run one or more
[provisioner daemon deployments external to Coder Server](../../admin/provisioners.md)
and disable provisioner daemons within your Coder Server.
This allows you to scale them independently of the Coder Server:

```yaml
CODER_PROVISIONER_DAEMONS=0
```

We recommend deploying provisioner daemons within the same cluster that hosts the
workspaces they will provision.

- This gives them a low-latency connection to the APIs they will use to
  provision workspaces and can speed up builds.

- It allows provisioner daemons to use in-cluster mechanisms (for example,
  Kubernetes service account tokens, AWS IAM roles, and others) to authenticate
  with the infrastructure APIs.

- If you deploy workspaces in multiple clusters, run multiple provisioner daemon
  deployments and use template tags to select the correct set of provisioner
  daemons.

- Provisioner daemons need to be able to connect to Coder Server, but this does
  not need to be a low-latency connection.

Provisioner daemons make no direct connections to the PostgreSQL database, so
there's no need for locality to the Postgres database.

### Scaling

Each provisioner daemon instance can handle a single workspace build job at a
time. Therefore, the maximum number of simultaneous builds your Coder deployment
can handle is equal to the number of provisioner daemon instances within a tagged
deployment.

If users experience unacceptably long queues for workspace builds to start,
consider increasing the number of provisioner daemon instances in the affected
cluster.

You might need to automatically scale the number of provisioner daemon instances
throughout the day to meet demand.

If you stop instances with `SIGHUP`, they will complete their current build job
and exit. `SIGINT` will cancel the current job, which will result in a failed
build. Ensure your autoscaler waits long enough for your build jobs to complete
before it kills the provisioner daemon process.
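On Kubernetes, one hedged way to give in-flight builds that time is a generous termination grace period on the provisioner daemon pods. This sketch assumes your daemon exits cleanly on the pod's termination signal and that one hour comfortably exceeds your longest build; size it to your own templates.

```yaml
spec:
  # Allow an in-flight build up to one hour to finish before the pod is killed.
  terminationGracePeriodSeconds: 3600
```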
If you deploy in Kubernetes, we recommend a single provisioner daemon per pod.
On a virtual machine (VM), you can deploy multiple provisioner daemons, ensuring
each has a unique `CODER_CACHE_DIRECTORY` value.

Coder's
[validated architectures](../../admin/infrastructure/validated-architectures/index.md)
give specific sizing recommendations for various user scales. Since the
complexity of builds varies significantly depending on the workspace template,
consider this a starting point. Monitor queue times and build times and adjust
the number and size of your provisioner daemon instances.

## PostgreSQL

PostgreSQL is the primary persistence layer for all of Coder's deployment data.
We also use `LISTEN` and `NOTIFY` to coordinate between different instances of
Coder Server.

### Locality

Coder Server instances must have low-latency connections (under 10ms) to
PostgreSQL. If you use multiple PostgreSQL replicas in a clustered config, these
must also be low-latency with respect to one another.

### Scaling

Prefer scaling PostgreSQL vertically rather than horizontally for best
performance. Coder's
[validated architectures](../../admin/infrastructure/validated-architectures/index.md)
give specific sizing recommendations for various user scales.

## Workspace proxies

Workspace proxies proxy HTTP traffic from end users to workspaces for Coder apps
defined in the templates, and HTTP ports opened by the workspace. By default,
they also include a DERP proxy.

### Locality

We recommend that each geographic cluster of workspaces has an associated
deployment of workspace proxies. This ensures that users always have a
near-optimal proxy path.

### Scaling

Workspace proxy load is determined by the amount of traffic they proxy.

Monitor CPU, memory, and network I/O utilization to decide when to resize
the number of proxy instances.

Scale for peak demand and scale down or upgrade during a maintenance window.

We do not recommend autoscaling the workspace proxies because many applications
use long-lived connections such as websockets, which would be disrupted by
stopping the proxy.

## Workspaces

Workspaces represent the vast majority of resources in most Coder deployments.
Because they are defined by templates, there is no one-size-fits-all advice for
scaling workspaces.

### Hard and soft cluster limits

All Infrastructure as a Service (IaaS) clusters have limits to what can be
simultaneously provisioned. These could be hard limits, based on the physical
size of the cluster, especially in the case of a private cloud, or soft limits,
based on configured limits in your public cloud account.

It is important to be aware of these limits and monitor Coder workspace resource
utilization against them, so that a new influx of users doesn't encounter failed
builds. Monitoring these is outside the scope of Coder, but we recommend that
you set up dashboards and alerts for each kind of limited resource.

As you approach soft limits, you can request limit increases to keep growing.

As you approach hard limits, consider deploying to additional cluster(s).

### Workspaces per node

Many development workloads are "spiky" in their CPU and memory requirements: for
example, they peak during build/test and then drop while code is being edited.
This creates an opportunity to use compute resources efficiently by packing
multiple workspaces onto a single node. This can lead to a better experience
(more CPU and memory available during brief bursts) and lower cost.

There are a number of things you should consider before you decide how many
workspaces you should allow per node:

- "Noisy neighbor" issues: Users share the node's CPU and memory resources and
  might be susceptible to a user or process consuming shared resources.

- If the shared nodes are a provisioned resource, for example, Kubernetes nodes
  running on VMs in a public cloud, then it can sometimes be a challenge to
  effectively autoscale down.

  - For example, if half the workspaces are stopped overnight, and there are ten
    workspaces per node, it's unlikely that all ten workspaces on a given node
    are among the stopped ones.

  - You can mitigate this by lowering the number of workspaces per node, or by
    using autostop policies to stop more workspaces during off-peak hours.

- If you do overprovision workspaces onto nodes, keep them in a separate node
  pool and schedule Coder control plane components (Coder Server, PostgreSQL,
  workspace proxies) on a different node pool to avoid resource spikes affecting
  them.

Coder customers have had success with both:

- One workspace per AWS VM
- Lots of workspaces on Kubernetes nodes for efficiency

### Cost control

- Use quotas to discourage users from creating many workspaces they don't need
  simultaneously.

- Label workspace cloud resources by user, team, organization, or your own
  labelling conventions to track usage at different granularities.

- Use autostop requirements to bring off-peak utilization down.

## Networking

Set up your network so that most users can get direct, peer-to-peer connections
to their workspaces. This drastically reduces the load on Coder Server and
workspace proxy instances.
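To check whether a given user's connection is actually direct rather than relayed, the `coder ping` CLI command reports the connection path; the workspace name below is a placeholder, and the output details vary by Coder version.

```shell
# Pings the agent in "my-workspace" and reports whether the connection
# is peer-to-peer (direct) or relayed through a DERP server.
coder ping my-workspace
```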
## Next steps

- [Scale Tests and Utilities](../../admin/infrastructure/scale-utility.md)
- [Scale Testing](../../admin/infrastructure/scale-testing.md)