Remove TempoRequestLatency alert and associated runbook section (#4768)

@@ -1,16 +1,6 @@
"groups":
- "name": "tempo_alerts"
  "rules":
  - "alert": "TempoRequestLatency"
    "annotations":
      "message": |
        {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
      "runbook_url": "https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency"
    "expr": |
      cluster_namespace_job_route:tempo_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|debug_pprof"} > 3
    "for": "15m"
    "labels":
      "severity": "critical"
  - "alert": "TempoCompactorUnhealthy"
    "annotations":
      "message": "There are {{ printf \"%f\" $value }} unhealthy compactor(s)."

@@ -4,22 +4,6 @@
      {
        name: 'tempo_alerts',
        rules: [
          {
            alert: 'TempoRequestLatency',
            expr: |||
              %s_route:tempo_request_duration_seconds:99quantile{route!~"%s"} > %s
            ||| % [$._config.group_prefix_jobs, $._config.alerts.p99_request_exclude_regex, $._config.alerts.p99_request_threshold_seconds],
            'for': '15m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: |||
                {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
              |||,
              runbook_url: 'https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency',
            },
          },
          {
            alert: 'TempoCompactorUnhealthy',
            expr: |||

@@ -2,52 +2,6 @@

This document should help with remediation of operational issues in Tempo.

## TempoRequestLatency

Aside from obvious errors in the logs, the only real lever you can pull here is scaling. Use the Reads or Writes dashboard
to identify the component that is struggling and scale it up.

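As a minimal sketch of what "scale it up" can mean in a Kubernetes install, the fragment below raises the replica count of a struggling component. The Deployment name `querier` and namespace `tempo` are illustrative assumptions, not taken from this runbook; you could also use `kubectl scale` instead of editing the manifest.

```
# illustrative fragment of a Deployment manifest; names depend on your install
apiVersion: apps/v1
kind: Deployment
metadata:
  name: querier
  namespace: tempo
spec:
  replicas: 5   # scale up the component the dashboards show as struggling
```
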
The Query path is instrumented with tracing (!), which can be used to diagnose issues with higher latency. View the logs of
the Query Frontend, where you can find an info-level message for every request. Filter for requests with high latency and view their traces.

The Query Frontend allows for scaling the query path by sharding queries. There are a few knobs that can be tuned for optimum
parallelism:
- Number of shards each query is split into, configured via
```
query_frontend:
  trace_by_id:
    query_shards: 10
```
- Number of Queriers (each of these processes the sharded queries in parallel). This can be changed by modifying the size of the
  Querier deployment. More Queriers -> faster processing of shards in parallel -> lower request latency.

- Querier parallelism, which is a combination of a few settings:

```
querier:
  max_concurrent_queries: 10
  frontend_worker:
    match_max_concurrent: true  # true by default
    parallelism: 5              # parallelism per query-frontend; ignored if match_max_concurrent is set to true

storage:
  trace:
    pool:
      max_workers: 100
```

MaxConcurrentQueries defines the total number of shards each Querier processes at a given time. By default, this number will
be split between the query frontends, so if there are N query frontends, the Querier will process (Max Concurrent Queries / N)
queries per query frontend.

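As a worked example (the numbers below are illustrative, not a recommendation): with `max_concurrent_queries: 10` and 2 query-frontend replicas, each Querier processes 10 / 2 = 5 concurrent queries per query frontend.

```
querier:
  max_concurrent_queries: 10   # total concurrent shards per Querier
# with 2 query frontends: 10 / 2 = 5 concurrent queries processed per query frontend
```
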
Another way to increase parallelism is by increasing the size of the worker pool that queries the cache & backend blocks.

A theoretically ideal value for this config to avoid _any_ queueing would be (Size of blocklist / Max Concurrent Queries).
But also factor in the resources provided to the querier.

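As an illustrative calculation (hypothetical numbers, not from this runbook): with a blocklist of roughly 2,000 blocks and `max_concurrent_queries: 20`, the no-queueing value works out to 2000 / 20 = 100 workers.

```
# hypothetical sizing: ~2000 blocks in the blocklist / 20 concurrent queries = 100 workers
storage:
  trace:
    pool:
      max_workers: 100
```
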
Our [documentation](https://grafana.com/docs/tempo/latest/operations/backend_search/#query-frontend)
includes [a solid guide](https://grafana.com/docs/tempo/latest/operations/backend_search/#guidelines-on-key-configuration-parameters) on the various parameters with suggestions.

### Trace Lookup Failures

If trace lookups fail with the error: `error querying store in Querier.FindTraceByID: queue doesn't have room for <xyz> jobs`, this