Kubernetes Monitoring and Observability
Learn how to set up metrics, alerting, log aggregation, and tracing in a Kubernetes cluster so you can spot performance problems early and troubleshoot them quickly.
What We'll Cover
- Setting up Prometheus and Grafana
- Custom Metrics and Service Monitors
- Alert Management
- Log Aggregation
- Distributed Tracing
Prerequisites
- Working Kubernetes cluster
- Helm installed
- Basic understanding of monitoring concepts
Installing Prometheus Operator
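First, let's install the kube-prometheus-stack chart with Helm. It bundles the Prometheus Operator together with Prometheus, Alertmanager, and Grafana: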
# Add Prometheus community charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus Stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Custom ServiceMonitor Configuration
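apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
  labels:
    # By default kube-prometheus-stack only picks up ServiceMonitors labelled
    # with its Helm release name; "monitoring" matches the install above.
    release: monitoring
spec:
  # Select Services labelled app=my-app
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics   # name of the Service port, not a port number
      interval: 15s
  # Look for those Services in the default namespace
  namespaceSelector:
    matchNames:
      - default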
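A ServiceMonitor that scrapes nothing is usually a label mismatch, so it helps to be clear about how `matchLabels` behaves. The sketch below is not Kubernetes code, just a minimal illustration of the rule: every key in the selector must be present on the Service with exactly the same value, while extra labels on the Service are ignored. The function name `matchesSelector` and the `tier` label are made up for this example.

```go
package main

import "fmt"

// matchesSelector reports whether labels satisfy a matchLabels selector:
// every selector key must exist in labels with an identical value.
func matchesSelector(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app": "my-app"}

	// Extra labels on the Service do not prevent a match.
	fmt.Println(matchesSelector(selector, map[string]string{"app": "my-app", "tier": "web"})) // true
	// A different value or a missing key means the Service is skipped.
	fmt.Println(matchesSelector(selector, map[string]string{"app": "other"})) // false
	fmt.Println(matchesSelector(selector, map[string]string{}))               // false
}
```

If your Service is not being scraped, comparing its labels against the selector this way (for example with `kubectl get svc --show-labels`) is a quick first check.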
Creating Custom Metrics
Example of a custom metrics endpoint in Go:
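package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequestsTotal counts handled requests, partitioned by method and endpoint.
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests",
	},
	[]string{"method", "endpoint"},
)

func init() {
	// Register with the default registry, which promhttp.Handler() serves.
	prometheus.MustRegister(httpRequestsTotal)
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		httpRequestsTotal.WithLabelValues(r.Method, "/").Inc()
		w.Write([]byte("ok\n"))
	})
	// Expose the endpoint Prometheus scrapes; port 8080 is an arbitrary choice.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}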
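The client library takes care of rendering metrics, which can hide what Prometheus actually receives on a scrape. Here is a stdlib-only sketch that produces the Prometheus text exposition format by hand for a counter like `http_requests_total`. The `counterVec` type and its methods are hypothetical stand-ins written for this illustration, not part of `client_golang`, and label escaping is only handled for plain ASCII values. In a real service you would keep using the library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// counterVec is a hypothetical, stdlib-only stand-in for a labelled counter.
type counterVec struct {
	mu     sync.Mutex
	name   string
	help   string
	values map[string]float64 // key: rendered label set, e.g. method="GET"
}

func newCounterVec(name, help string) *counterVec {
	return &counterVec{name: name, help: help, values: map[string]float64{}}
}

// inc adds 1 to the series identified by the given labels.
func (c *counterVec) inc(labels map[string]string) {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order so equal label sets share one series
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[strings.Join(pairs, ",")]++
}

// render returns the counter in the Prometheus text exposition format.
func (c *counterVec) render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	series := make([]string, 0, len(c.values))
	for s := range c.values {
		series = append(series, s)
	}
	sort.Strings(series)
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", c.name, c.help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, s := range series {
		fmt.Fprintf(&b, "%s{%s} %g\n", c.name, s, c.values[s])
	}
	return b.String()
}

func main() {
	c := newCounterVec("http_requests_total", "Total number of HTTP requests")
	c.inc(map[string]string{"method": "GET", "endpoint": "/"})
	c.inc(map[string]string{"method": "GET", "endpoint": "/"})
	c.inc(map[string]string{"method": "POST", "endpoint": "/orders"})
	fmt.Print(c.render())
}
```

Each distinct combination of label values becomes its own line, and therefore its own time series in Prometheus. That is worth keeping in mind when you pick labels.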
Grafana Dashboard Configuration
Example dashboard JSON:
{
  "dashboard": {
    "id": null,
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      }
    ]
  }
}
Alert Configuration
PrometheusRule example:
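apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
  labels:
    # Needed for the default kube-prometheus-stack rule selector.
    release: monitoring
spec:
  groups:
    - name: application
      rules:
        - alert: HighErrorRate
          # Requires a "status" label on http_requests_total. Summing both
          # sides gives one overall ratio; without sum(), each 5xx series is
          # divided by itself and the result is always 1.
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m])) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            description: Error rate is above 10% for 5 minutes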
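The alert condition is simple arithmetic once you strip away the PromQL: take the per-second rate of 5xx requests, divide it by the per-second rate of all requests, and compare the result with 0.1. Here is a rough Go sketch of that calculation. It assumes the counter carries a `status` label, uses made-up sample values, and simplifies `rate()` to "increase divided by elapsed time" (Prometheus additionally handles counter resets and extrapolation). The names `sample`, `perSecondRate`, and `errorRatio` are invented for this example.

```go
package main

import "fmt"

// sample is one scrape of a counter: a cumulative value at a point in time.
type sample struct {
	unixSeconds float64
	value       float64
}

// perSecondRate approximates rate(): counter increase divided by elapsed time.
func perSecondRate(first, last sample) float64 {
	elapsed := last.unixSeconds - first.unixSeconds
	if elapsed <= 0 {
		return 0
	}
	return (last.value - first.value) / elapsed
}

// errorRatio divides the summed 5xx rate by the summed rate of all requests.
// It returns 0 when there is no traffic (PromQL would produce no usable value).
func errorRatio(errorRates, allRates []float64) float64 {
	var errSum, allSum float64
	for _, r := range errorRates {
		errSum += r
	}
	for _, r := range allRates {
		allSum += r
	}
	if allSum == 0 {
		return 0
	}
	return errSum / allSum
}

func main() {
	// Hypothetical counter readings five minutes (300 s) apart.
	ok := perSecondRate(sample{0, 1000}, sample{300, 3700}) // 9 req/s with status 200
	failed := perSecondRate(sample{0, 50}, sample{300, 500}) // 1.5 req/s with status 500

	ratio := errorRatio([]float64{failed}, []float64{ok, failed})
	fmt.Printf("error ratio %.3f, above threshold: %v\n", ratio, ratio > 0.1)
}
```

Note that the denominator includes the failed requests too. The `for: 5m` clause then requires the condition to stay true for five minutes before the alert fires, which filters out short spikes.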
Log Aggregation with Loki
Installing Loki stack:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false
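Loki is much more useful when your application logs are structured. The loki-stack chart ships Promtail, which collects whatever your containers write to stdout, so one approach is to emit one JSON object per line. Below is a minimal sketch using Go's standard `log/slog` package (Go 1.21 or newer). The `service` field and the helper names `newLogger` and `renderLine` are assumptions made for this example, not something Loki requires.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a JSON logger that tags every line with the service name.
// The "service" field name is an assumption made for this sketch.
func newLogger(w io.Writer, service string) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", service)
}

// renderLine logs one INFO event into a buffer and returns the JSON line,
// which makes the output easy to inspect.
func renderLine(service, msg string, args ...any) string {
	var buf bytes.Buffer
	newLogger(&buf, service).Info(msg, args...)
	return buf.String()
}

func main() {
	// Containers log to stdout; the node-level agent ships the lines.
	logger := newLogger(os.Stdout, "my-app")
	logger.Info("request handled", "method", "GET", "endpoint", "/", "status", 200)
	logger.Error("upstream call failed", "endpoint", "/orders", "status", 502)

	fmt.Print(renderLine("my-app", "cache warmed", "entries", 128))
}
```

With JSON lines in place, you can parse fields at query time in Grafana instead of writing regular expressions against free-form text.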
Distributed Tracing with Jaeger
Jaeger deployment:
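apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          # all-in-one keeps traces in memory, so use it for testing only;
          # pin a specific version instead of "latest" for repeatable deploys.
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686   # query UI
            - containerPort: 14268   # collector HTTP endpoint (jaeger.thrift)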
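Jaeger only stores and displays spans. For a trace to span several services, each service has to pass trace context along with its outgoing requests. The post doesn't say which propagation format to use, so the sketch below assumes the W3C Trace Context `traceparent` header, which has the form `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`. It is a simplified illustration with made-up function names: a real service would normally rely on a tracing SDK, and the full specification has extra rules (lowercase hex, non-zero IDs) that are skipped here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// newTraceparent builds a traceparent header value with random IDs.
func newTraceparent(sampled bool) (string, error) {
	ids := make([]byte, 24) // 16-byte trace ID followed by 8-byte span ID
	if _, err := rand.Read(ids); err != nil {
		return "", err
	}
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(ids[:16]), hex.EncodeToString(ids[16:]), flags), nil
}

// parseTraceparent splits a header value into trace ID, span ID, and the
// sampled flag. Only version "00" is accepted in this sketch.
func parseTraceparent(header string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", false, errors.New("malformed traceparent header")
	}
	for _, p := range parts[1:] {
		if _, decodeErr := hex.DecodeString(p); decodeErr != nil {
			return "", "", false, fmt.Errorf("non-hex field %q: %w", p, decodeErr)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return parts[1], parts[2], flags[0]&0x01 == 0x01, nil
}

func main() {
	header, err := newTraceparent(true)
	if err != nil {
		panic(err)
	}
	fmt.Println("outgoing header:", header)

	// A downstream service would read the header and continue the same trace.
	traceID, spanID, sampled, err := parseTraceparent(header)
	fmt.Println(traceID, spanID, sampled, err)
}
```

The important part is that the trace ID stays the same across every hop, while each service creates a new span ID for its own work.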
Best Practices
- Metric Collection:
  - Use meaningful labels
  - Follow naming conventions
  - Keep cardinality under control
- Alerting:
  - Define clear severity levels
  - Avoid alert fatigue
  - Include runbooks
- Dashboard Design:
  - Start with overview
  - Use consistent layouts
  - Include documentation
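"Keep cardinality under control" is easier to act on with numbers. In the worst case, one metric produces as many time series as the product of the distinct values of its labels. Here is a small sketch of that estimate. The label names and value counts are invented for illustration, and the function name `seriesUpperBound` is hypothetical.

```go
package main

import "fmt"

// seriesUpperBound returns the worst-case number of time series for one
// metric: the product of the distinct values each label can take.
func seriesUpperBound(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	// Bounded labels: 5 methods x 20 endpoints x 10 status codes.
	bounded := map[string]int{"method": 5, "endpoint": 20, "status": 10}
	fmt.Println(seriesUpperBound(bounded)) // 1000

	// Adding an unbounded label such as a user ID multiplies everything.
	withUserID := map[string]int{"method": 5, "endpoint": 20, "status": 10, "user_id": 50000}
	fmt.Println(seriesUpperBound(withUserID)) // 50000000
}
```

This is why identifiers like user IDs, request IDs, or raw URLs are better kept in logs or traces than in metric labels.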
Video Resources
Monitoring Fundamentals
- Kubernetes Monitoring with Prometheus by TechWorld with Nana
- Grafana Dashboards Tutorial by The Digital Life
Advanced Monitoring
- PromQL Deep Dive by Julius Volz
- Kubernetes Monitoring Architecture by CNCF
Observability Practices
- Distributed Tracing with Jaeger by Juraci Paixão
- Logging Best Practices by Cloud Native Skunkworks
Written on August 6, 2025