Keep Production Running at 99.99% — Systematically
Production reliability isn't about heroics — it's about systems. These 50 prompts give DevOps engineers and SREs the frameworks to build reliable, observable, and self-healing production infrastructure. From incident management to chaos engineering, every prompt encodes the practices that keep the world's most critical systems running.
Each prompt uses chain-of-thought for root cause analysis, tree-of-thought for architecture decisions, and the CRTSE framework for operational procedures. Variables like {{service_name}}, {{sla_target}}, {{infrastructure}}, and {{team_size}} let you adapt each prompt to your own services, targets, and team.
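As a rough illustration of how the {{variable}} placeholders could be filled in before a prompt is sent to a model, here is a minimal Python sketch. The template text, the `fill_prompt` helper, and the example values are all hypothetical and not taken from the pack; the only thing assumed from the description above is the double-brace placeholder syntax.

```python
import re

# Matches {{name}} placeholders, where name is letters, digits, or underscores.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace each {{name}} with values[name]; raise if any value is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in values})
    if missing:
        raise KeyError(f"missing values for: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


# Hypothetical prompt text, for illustration only.
template = "Design SLOs for {{service_name}} with a {{sla_target}} availability target."
filled = fill_prompt(template, {"service_name": "checkout-api", "sla_target": "99.9%"})
print(filled)
```

Failing loudly on a missing value is a deliberate choice in this sketch: a prompt that still contains a literal `{{sla_target}}` tends to produce generic output without any visible error.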
What's Inside — 50 Expert Prompts
- SLO/SLI Design Framework — Defines service level objectives for {{service}} with indicator selection, error budget calculation, alerting thresholds, and burn rate monitoring.
- Incident Management Process Builder — Creates an incident response process for {{organization}} with severity levels, roles, communication templates, and a post-incident review framework.
- Chaos Engineering Experiment Designer — Designs chaos experiments for {{system}} targeting failure modes: network partition, CPU spike, disk full, dependency failure, and cascading failures.
- Observability Strategy Architect — Designs an observability strategy across the three pillars (logs, metrics, traces) for {{service_count}} services with tool selection, instrumentation plan, and dashboard hierarchy.
- Capacity Planning Model — Projects infrastructure needs for {{service}} from {{current_load}} to {{target_load}} with headroom calculations, scaling triggers, and cost optimization.
- Deployment Strategy Selector — Evaluates deployment strategies (blue-green, canary, rolling, feature flag) for {{application}} with risk assessment and rollback procedures.
- Service Mesh Configuration Designer — Configures Istio/Linkerd for {{service_count}} services with traffic management, mTLS, circuit breaking, and observability integration.
- Runbook Automation Framework — Converts a manual runbook for {{procedure}} into an automated workflow with decision points, safety checks, and human approval gates.
- On-Call Rotation Designer — Creates a sustainable on-call system for a team of {{team_size}} with rotation schedule, escalation paths, compensation model, and burnout prevention.
- Post-Incident Review Template — Structures a blameless post-mortem for {{incident}} with timeline, contributing factors, remediation items, and systemic improvements.
- Infrastructure as Code Reviewer — Reviews {{iac_tool}} (Terraform, Pulumi, CDK) configurations for security, cost optimization, and reliability best practices.
- Container Orchestration Optimizer — Optimizes {{k8s_cluster}} with resource limits, HPA configuration, pod disruption budgets, and node pool strategy.
- Database Reliability Framework — Designs a reliability plan for {{database}} with backup strategy, failover testing, connection management, and performance monitoring.
- Network Reliability Designer — Creates a network architecture for {{application}} with redundancy, DDoS protection, DNS failover, and CDN strategy.
- Cost Optimization Analyzer — Analyzes {{cloud_provider}} spending for {{account}} with right-sizing, reserved instance strategy, and waste identification.
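To make the error budget and burn rate terms from the SLO/SLI item above concrete, here is a minimal Python sketch of the standard arithmetic. The function names, the 30-day window, and the 99.9% and 1.44% figures in the burn rate example are illustrative assumptions, not values taken from the prompts; the 99.99% target comes from the headline.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows.

    A value of 1.0 uses up the budget exactly at the end of the window;
    higher values exhaust it proportionally sooner.
    """
    return observed_error_ratio / (1.0 - slo)


# A 99.99% target over an assumed 30-day window.
budget = error_budget_minutes(0.9999, window_days=30)
print(round(budget, 2))  # 4.32 minutes of allowed downtime

# Assumed example: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(0.0144, 0.999)
print(round(rate, 1))  # 14.4
```

At a sustained burn rate of about 14.4, a 30-day budget would be gone in roughly two days, which is why alerting thresholds are often expressed as burn rates over a lookback window rather than as raw error percentages.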
Each Prompt Includes
- {{placeholder}} variables for service, infrastructure, team, and reliability targets
- Expected output: operational procedures, configuration files, architecture diagrams, or analysis reports
- Chain-of-thought root cause analysis and tree-of-thought for architecture decisions
- Anti-patterns: alert fatigue, toil accumulation, hero culture, and reliability theater
Who This Is For
- SRE teams building reliability practices from the ground up
- DevOps engineers designing deployment and monitoring infrastructure
- Platform engineers creating internal developer platforms
- Engineering managers establishing production readiness standards
What Makes This Different
- Based on Google SRE book principles — error budgets, SLOs, toil reduction, and blameless culture
- Covers the FULL reliability stack: prevention, detection, response, and continuous improvement
- Includes chaos engineering — proactive reliability testing, not just reactive incident management
Works With
ChatGPT (GPT-4+), Claude (3.5+), Gemini Pro. Best with Claude for detailed technical analysis.