Data Science & AI Workbench uses Prometheus to monitor the operation and performance of the application. Prometheus exposes system metrics such as CPU usage, memory consumption, and network traffic to help you understand and maintain the overall health of your infrastructure. Regularly analyzing these metrics can help you establish a baseline for normal operations, identify potential issues, and troubleshoot active problems by helping you determine the root cause.

Prometheus includes a built-in alert manager that you can configure to notify you when certain conditions are met or a specified threshold is exceeded. Both Prometheus and the alert manager are installed with their default settings in the Helm values.yaml file during the initial installation of Workbench, but they can be updated at any time.

Follow the steps for Setting platform configurations using the Helm chart to modify the default configuration and enable additional alerts.
To configure additional custom alerts, define the alert's key elements (its name, expression, duration, labels, and annotations) and place the alert in the correct area of the Helm chart (opsMetrics.prometheus.server.alertingRules).

Here is an example alert that you might implement in your system:
```yaml
- alert: PodsBlockedInTerminatingState
  expr: count(kube_pod_deletion_timestamp) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 0) by (namespace, pod) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    # Use outer single quotes '' in annotations if they contain Prometheus labels
    # Use {{"{{}}"}} for Prometheus labels to allow Helm to pass the template to Prometheus properly
    summary: 'Pod {{"{{$labels.namespace}}"}}/{{"{{$labels.pod}}"}} blocked in Terminating state.'
```
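For context, the sketch below shows how such a rule might be nested within values.yaml. Only the opsMetrics.prometheus.server.alertingRules path is taken from this guide; the surrounding layout is a minimal assumption, and you should merge the rule into your existing values rather than replacing them.

```yaml
# Minimal sketch: placing the custom rule under the alertingRules key in values.yaml.
# Only the opsMetrics.prometheus.server.alertingRules path is assumed here;
# keep any other keys already present in your values file.
opsMetrics:
  prometheus:
    server:
      alertingRules:
        - alert: PodsBlockedInTerminatingState
          expr: count(kube_pod_deletion_timestamp) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 0) by (namespace, pod) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'Pod {{"{{$labels.namespace}}"}}/{{"{{$labels.pod}}"}} blocked in Terminating state.'
```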