Remove TempoRequestLatency alert and associated runbook section (#4768)

Xoan Gonzalez
2025-02-28 14:17:06 +01:00
committed by GitHub
parent 5d98dcd245
commit 291eb0a783
3 changed files with 0 additions and 72 deletions


@@ -1,16 +1,6 @@
"groups":
- "name": "tempo_alerts"
"rules":
- "alert": "TempoRequestLatency"
"annotations":
"message": |
{{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
"runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency"
"expr": |
cluster_namespace_job_route:tempo_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|debug_pprof"} > 3
"for": "15m"
"labels":
"severity": "critical"
- "alert": "TempoCompactorUnhealthy"
"annotations":
"message": "There are {{ printf \"%f\" $value }} unhealthy compactor(s)."


@@ -4,22 +4,6 @@
      {
        name: 'tempo_alerts',
        rules: [
          {
            alert: 'TempoRequestLatency',
            expr: |||
              %s_route:tempo_request_duration_seconds:99quantile{route!~"%s"} > %s
            ||| % [$._config.group_prefix_jobs, $._config.alerts.p99_request_exclude_regex, $._config.alerts.p99_request_threshold_seconds],
            'for': '15m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: |||
                {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
              |||,
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency',
            },
          },
          {
            alert: 'TempoCompactorUnhealthy',
            expr: |||


@@ -2,52 +2,6 @@
This document should help with remediation of operational issues in Tempo.

## TempoRequestLatency

Aside from obvious errors in the logs, the only real lever you can pull here is scaling. Use the Reads or Writes dashboard
to identify the component that is struggling and scale it up.

The query path is instrumented with tracing (!), and this can be used to diagnose issues with higher latency. View the logs of
the Query Frontend, where you can find an info-level message for every request. Filter for requests with high latency and view their traces.

The Query Frontend allows the query path to be scaled by sharding queries. There are a few knobs that can be tuned for optimum
parallelism:

- Number of shards each query is split into, configured via
  ```
  query_frontend:
    trace_by_id:
      query_shards: 10
  ```
- Number of Queriers (each of these processes the sharded queries in parallel). This can be changed by modifying the size of the
  Querier deployment. More Queriers -> faster processing of shards in parallel -> lower request latency.
- Querier parallelism, which is a combination of a few settings:
  ```
  querier:
    max_concurrent_queries: 10
    frontend_worker:
      match_max_concurrent: true  # true by default
      parallelism: 5              # parallelism per query-frontend; ignored if match_max_concurrent is set to true
  storage:
    trace:
      pool:
        max_workers: 100
  ```

MaxConcurrentQueries defines the total number of shards each Querier processes at a given time. By default, this number will
be split between the query frontends, so if there are N query frontends, the Querier will process (Max Concurrent Queries / N)
queries per query frontend.
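
To make the split concrete, here is a minimal sketch in Go (the function name and figures are illustrative, not part of Tempo):

```go
package main

import "fmt"

// perFrontendConcurrency sketches the split described above: each Querier
// divides its max_concurrent_queries budget across the query frontends it
// connects to when match_max_concurrent is enabled.
func perFrontendConcurrency(maxConcurrentQueries, numFrontends int) int {
	return maxConcurrentQueries / numFrontends
}

func main() {
	// e.g. max_concurrent_queries: 10 with 2 query frontends gives each
	// frontend roughly 10 / 2 = 5 concurrent shards per Querier.
	fmt.Println(perFrontendConcurrency(10, 2)) // 5
}
```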

Another way to increase parallelism is to increase the size of the worker pool that queries the cache and backend blocks.
A theoretically ideal value for this setting, to avoid _any_ queueing, would be (Size of blocklist / Max Concurrent Queries),
but also factor in the resources provided to the Querier.
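
As a rough worked example of that sizing rule (the numbers are hypothetical, not recommendations):

```go
package main

import "fmt"

// idealMaxWorkers sketches the rule of thumb above: size the worker pool
// (storage.trace.pool.max_workers) so that a full fan-out of concurrent
// queries over the blocklist avoids queueing.
func idealMaxWorkers(blocklistSize, maxConcurrentQueries int) int {
	return blocklistSize / maxConcurrentQueries
}

func main() {
	// e.g. a blocklist of 1000 blocks with max_concurrent_queries: 10
	// suggests about 100 workers, matching max_workers: 100 above.
	fmt.Println(idealMaxWorkers(1000, 10)) // 100
}
```
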
Our [documentation](https://grafana.com/docs/tempo/latest/operations/backend_search/#query-frontend)
includes [a solid guide](https://grafana.com/docs/tempo/latest/operations/backend_search/#guidelines-on-key-configuration-parameters) on the various parameters with suggestions.

### Trace Lookup Failures

If trace lookups fail with the error `error querying store in Querier.FindTraceByID: queue doesn't have room for <xyz> jobs`, this