Path: blob/main/operations/observability/mixins/workspace/rules/central/nodes.yaml
# Copyright (c) 2022 Gitpod GmbH. All rights reserved.
# Licensed under the GNU Affero General Public License (AGPL).
# See License.AGPL.txt in the project root for license information.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: workspace-nodes-monitoring-rules
spec:
  groups:
  - name: workspace-nodes-rules
    rules:
    # Per-node 1-minute load average divided by the node's CPU core count,
    # with the nodepool label joined in from kube_node_labels.
    - record: nodepool:node_load1:normalized
      expr: |
        node_load1 * on(node) group_left(nodepool) kube_node_labels
        /
        count without (cpu) (
          count without (mode) (
            node_cpu_seconds_total * on(node) group_left(nodepool) kube_node_labels
          )
        )
  - name: workspace-nodes-alerts
    rules:
    - alert: GitpodWorkspaceNodeHighNormalizedLoadAverage
      labels:
        severity: warning
        team: engine
      for: 60m
      annotations:
        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/NodePoolLoad.md
        summary: Workspace node's normalized load average is higher than 10 for more than 60 minutes. Check for abuse.
        description: Node {{ $labels.node }} in {{ $labels.cluster }} is reporting {{ printf "%.2f" $value }}% normalized load average. Normalized load average is current load average divided by number of CPU cores of the node.
      expr: nodepool:node_load1:normalized{nodepool=~".*workspace.*", cluster!~"ephemeral.*"} > 10

    - alert: GitpodHeadlessNodeHighNormalizedLoadAverage
      labels:
        severity: warning
        team: engine
      for: 60m
      annotations:
        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/NodePoolLoad.md
        summary: Headless node's normalized load average is higher than 10 for more than 60 minutes. Check for abuse.
        description: Node {{ $labels.node }} in {{ $labels.cluster }} is reporting {{ printf "%.2f" $value }}% normalized load average. Normalized load average is current load average divided by number of CPU cores of the node.
      expr: nodepool:node_load1:normalized{nodepool=~".*headless.*", cluster!~"ephemeral.*"} > 10

    - alert: AutoscalerAddsNodesTooFast
      labels:
        severity: critical
      annotations:
        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/AutoscalerAddsNodesTooFast.md
        summary: Autoscaler is adding new nodes rapidly
        description: Autoscaler in cluster {{ $labels.cluster }} is rapidly adding new nodes.
      expr: ((sum(kube_node_labels{nodepool=~"workspace-.*", cluster!~"ephemeral.*"}) by (cluster)) - (sum(kube_node_labels{nodepool=~"workspace-.*", cluster!~"ephemeral.*"} offset 10m) by (cluster))) > 15

    - alert: AutoscaleFailure
      labels:
        severity: warning
        team: engine
      annotations:
        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/AutoscaleFailure.md
        summary: Automatic scale-up failed for some reason.
        description: Automatic scale-up in cluster {{ $labels.cluster }} failed due to {{ $labels.reason }}.
      expr: |
        increase(cluster_autoscaler_failed_scale_ups_total{cluster!~"ephemeral.*"}[1m]) != 0

    - alert: NodePoolLoad
      labels:
        severity: critical
        team: engine
      for: 60m
      annotations:
        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/NodePoolLoad.md
        summary: Node pool load has been high for too long for 4 or more nodes
        description: Node pool {{ $labels.nodepool }} in cluster {{ $labels.cluster }} has high, sustained load
      expr: |
        sum by(nodepool, cluster) (count by(node, nodepool, cluster) (sum by(node, nodepool, cluster) (nodepool:node_load1:normalized{nodepool=~".*workspace.*", cluster!~"ephemeral.*"}) >= 1)) >= 4
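
# Worked example for the nodepool:node_load1:normalized recording rule in this
# file (a sketch for illustration, not part of the deployed rules).
# The rule divides node_load1 by the node's CPU core count, so a 2-core node
# with node_load1 = 8 records 8 / 2 = 4, and the same node would need
# node_load1 > 20 to cross the "> 10" alert threshold.
#
# The same case could be written as a promtool unit test along these lines.
# Assumptions: nodes-rules.yaml is a hypothetical file holding the "groups:"
# list from under "spec:" as a plain Prometheus rule file (promtool does not
# read the PrometheusRule wrapper); node "n1" and nodepool "workspace-a" are
# made-up values. Run with: promtool test rules <this-test-file>
#
# rule_files:
#   - nodes-rules.yaml
# evaluation_interval: 1m
# tests:
#   - interval: 1m
#     input_series:
#       # 1-minute load average of 8 on node n1
#       - series: 'node_load1{node="n1"}'
#         values: '8x10'
#       # label series that supplies the nodepool for node n1
#       - series: 'kube_node_labels{node="n1", nodepool="workspace-a"}'
#         values: '1x10'
#       # two CPUs, two modes each; only the number of distinct cpu labels matters
#       - series: 'node_cpu_seconds_total{node="n1", cpu="0", mode="idle"}'
#         values: '1+1x10'
#       - series: 'node_cpu_seconds_total{node="n1", cpu="0", mode="user"}'
#         values: '1+1x10'
#       - series: 'node_cpu_seconds_total{node="n1", cpu="1", mode="idle"}'
#         values: '1+1x10'
#       - series: 'node_cpu_seconds_total{node="n1", cpu="1", mode="user"}'
#         values: '1+1x10'
#     promql_expr_test:
#       - expr: nodepool:node_load1:normalized
#         eval_time: 5m
#         exp_samples:
#           # load 8 on 2 cores
#           - labels: 'nodepool:node_load1:normalized{node="n1", nodepool="workspace-a"}'
#             value: 4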