SRE Configuration Generator#
DevOps-OS generates production-grade SRE configuration files for any service with a single command — covering alerting, dashboards, SLOs, and routing.
Quick Start#
# Generate all SRE configs
python -m cli.devopsos scaffold sre --name my-app --team platform
# Output: sre/alert-rules.yaml
# sre/grafana-dashboard.json
# sre/slo.yaml
# sre/alertmanager-config.yaml
# Availability-only SLO with 99.9% target
python -m cli.devopsos scaffold sre --name my-app --slo-type availability --slo-target 99.9
# Latency SLO with 200ms threshold
python -m cli.devopsos scaffold sre --name my-app --slo-type latency --latency-threshold 0.2
# Send critical alerts to PagerDuty
python -m cli.devopsos scaffold sre --name my-app \
--pagerduty-key YOUR_PD_KEY \
--slack-channel "#platform-alerts"Options#
| Option | Default | Description |
|---|---|---|
--name NAME | my-app | Application / service name |
--team TEAM | platform | Owning team (used in labels and alert routing) |
--namespace NS | default | Kubernetes namespace where the app runs |
--slo-type TYPE | all | availability | latency | error_rate | all |
--slo-target PCT | 99.9 | SLO target as a percentage (e.g. 99.5) |
--latency-threshold SEC | 0.5 | Latency SLI threshold in seconds |
--pagerduty-key KEY | (empty) | PagerDuty integration key; omit to skip PD routing |
--slack-channel CHANNEL | #alerts | Slack channel for alert routing |
--output-dir DIR | sre | Directory where all output files are written |
All options can be set via environment variables prefixed DEVOPS_OS_SRE_.
Generated Files#
sre/
├── alert-rules.yaml Prometheus PrometheusRule CR
├── grafana-dashboard.json Grafana dashboard (importable via API or UI)
├── slo.yaml Sloth-compatible SLO manifest
└── alertmanager-config.yaml Alertmanager routing config stubalert-rules.yaml#
A PrometheusRule Custom Resource compatible with kube-prometheus-stack.
kubectl apply -f sre/alert-rules.yamlIncluded alert groups#
| Group | Alerts |
|---|---|
<name>.availability | HighErrorRate, SLOBurnRate |
<name>.latency | HighLatency, LatencyBudgetBurn |
<name>.infrastructure | PodRestartingFrequently, DeploymentReplicasMismatch |
SLO burn-rate alerts fire before you exhaust your error budget.
grafana-dashboard.json#
A ready-to-import Grafana dashboard with six panels:
| Panel | Metric |
|---|---|
| Request Rate (RPS) | http_requests_total |
| Error Rate | 5xx ratio |
| p99 Latency | histogram quantile |
| Pod Restarts | kube_pod_container_status_restarts_total |
| CPU Usage | container_cpu_usage_seconds_total |
| Memory Usage | container_memory_working_set_bytes |
Import the dashboard#
# Via Grafana HTTP API
curl -X POST http://localhost:3000/api/dashboards/import \
-H "Content-Type: application/json" \
-d "{\"dashboard\": $(cat sre/grafana-dashboard.json), \"overwrite\": true}"Or use Grafana → Dashboards → Import → Upload JSON file.
slo.yaml#
A Sloth-compatible SLO manifest that Sloth converts into multi-window, multi-burn-rate Prometheus recording rules and alerts:
sloth generate -i sre/slo.yaml -o sre/slo-rules.yaml
kubectl apply -f sre/slo-rules.yamlalertmanager-config.yaml#
An Alertmanager routing configuration that:
- Routes all alerts to your Slack channel
- Routes critical alerts to PagerDuty (if
--pagerduty-keyis set) - Inhibits duplicate
warningalerts when acriticalalert is firing for the same service
Apply to the cluster (kube-prometheus-stack)#
kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
--from-file=alertmanager.yaml=sre/alertmanager-config.yaml \
--namespace monitoring --dry-run=client -o yaml | kubectl apply -f -Prerequisites#
Your application must expose Prometheus-compatible metrics:
| Metric | Description |
|---|---|
http_requests_total{status} | Request count, labelled by HTTP status |
http_request_duration_seconds_bucket{le} | Latency histogram |
Auto-instrumentation libraries:
- Python:
prometheus-flask-exporter,starlette-prometheus - Java:
micrometer-registry-prometheus - Go:
prometheus/client_golang - Node.js:
prom-client