Remove TempoRequestLatency alert and associated runbook section (#4768)

@@ -1,16 +1,6 @@
"groups":
- "name": "tempo_alerts"
  "rules":
  - "alert": "TempoRequestLatency"
    "annotations":
      "message": |
        {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency"
    "expr": |
      cluster_namespace_job_route:tempo_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|debug_pprof"} > 3
    "for": "15m"
    "labels":
      "severity": "critical"
  - "alert": "TempoCompactorUnhealthy"
    "annotations":
      "message": "There are {{ printf \"%f\" $value }} unhealthy compactor(s)."

@@ -4,22 +4,6 @@
      {
        name: 'tempo_alerts',
        rules: [
          {
            alert: 'TempoRequestLatency',
            expr: |||
              %s_route:tempo_request_duration_seconds:99quantile{route!~"%s"} > %s
            ||| % [$._config.group_prefix_jobs, $._config.alerts.p99_request_exclude_regex, $._config.alerts.p99_request_threshold_seconds],
            'for': '15m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: |||
                {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
              |||,
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency',
            },
          },
          {
            alert: 'TempoCompactorUnhealthy',
            expr: |||

@@ -2,52 +2,6 @@

This document should help with remediation of operational issues in Tempo.

## TempoRequestLatency

Aside from obvious errors in the logs, the only real lever you can pull here is scaling. Use the Reads or Writes dashboard
to identify the component that is struggling and scale it up.

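As a minimal sketch of what "scale it up" can mean in a Kubernetes install, the fragment below raises the replica count of a struggling component. The Deployment name `querier` and namespace `tempo` are illustrative assumptions, not taken from this runbook; you could also use `kubectl scale` instead of editing the manifest.

```
# illustrative fragment of a Deployment manifest; names depend on your install
apiVersion: apps/v1
kind: Deployment
metadata:
  name: querier
  namespace: tempo
spec:
  replicas: 5   # scale up the component the dashboards show as struggling
```
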
The Query path is instrumented with tracing (!), which can be used to diagnose issues with higher latency. View the logs of
the Query Frontend, where you can find an info-level message for every request. Filter for requests with high latency and view their traces.

The Query Frontend allows for scaling the query path by sharding queries. There are a few knobs that can be tuned for optimum
parallelism:
- Number of shards each query is split into, configured via
```
query_frontend:
  trace_by_id:
    query_shards: 10
```
- Number of Queriers (each of these processes the sharded queries in parallel). This can be changed by modifying the size of the
  Querier deployment. More Queriers -> faster processing of shards in parallel -> lower request latency.

- Querier parallelism, which is a combination of a few settings:

```
querier:
  max_concurrent_queries: 10
  frontend_worker:
    match_max_concurrent: true  # true by default
    parallelism: 5              # parallelism per query-frontend; ignored if match_max_concurrent is set to true

storage:
  trace:
    pool:
      max_workers: 100
```

MaxConcurrentQueries defines the total number of shards each Querier processes at a given time. By default, this number will
be split between the query frontends, so if there are N query frontends, the Querier will process (Max Concurrent Queries / N)
queries per query frontend.

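As a worked example (the numbers below are illustrative, not a recommendation): with `max_concurrent_queries: 10` and 2 query-frontend replicas, each Querier processes 10 / 2 = 5 concurrent queries per query frontend.

```
querier:
  max_concurrent_queries: 10   # total concurrent shards per Querier
# with 2 query frontends: 10 / 2 = 5 concurrent queries processed per query frontend
```
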
Another way to increase parallelism is by increasing the size of the worker pool that queries the cache & backend blocks.

A theoretically ideal value for this config to avoid _any_ queueing would be (Size of blocklist / Max Concurrent Queries).
But also factor in the resources provided to the querier.

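As an illustrative calculation (hypothetical numbers, not from this runbook): with a blocklist of roughly 2,000 blocks and `max_concurrent_queries: 20`, the no-queueing value works out to 2000 / 20 = 100 workers.

```
# hypothetical sizing: ~2000 blocks in the blocklist / 20 concurrent queries = 100 workers
storage:
  trace:
    pool:
      max_workers: 100
```
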
Our [documentation](https://grafana.com/docs/tempo/latest/operations/backend_search/#query-frontend)
includes [a solid guide](https://grafana.com/docs/tempo/latest/operations/backend_search/#guidelines-on-key-configuration-parameters) on the various parameters with suggestions.

### Trace Lookup Failures

If trace lookups fail with the error: `error querying store in Querier.FindTraceByID: queue doesn't have room for <xyz> jobs`, this