How to Investigate Intermittent Outages
Capture Evidence Before It Disappears
Intermittent outages are painful because by the time an engineer opens dashboards, the symptom is gone. Without disciplined evidence capture, incidents remain unresolved and repeat.
This guide focuses on preserving transient evidence and reducing hypothesis churn.
Related reading: For cross-checks and deeper triage context, also review API Downtime Investigation Playbook and Website Down After Deploy: Recovery Checklist.
Quick Navigation
- Capture Evidence Before It Disappears
- How Intermittent Outages Hide
- First 15 Minutes of a Burst Failure
- Correlate Bursts With System Events
- Stabilize While Preserving Forensics
- Managing Uncertainty in Live Updates
- Turn Bursts Into Detectable Signals
- Case Walkthrough: 3-Minute Error Bursts Every Hour
- Copy/Paste Intermittent Incident Update
- Intermittent Outage FAQ
How Intermittent Outages Hide
Intermittent outages are hard because the system looks healthy between failures. Your process must capture high-fidelity snapshots at the exact failure moment.
- Short error bursts with long quiet periods.
- Customer reports with no obvious continuous metric spike.
- Incidents tied loosely to peak traffic windows.
- Regional or endpoint-specific micro-failures.
- On-call notes repeatedly mention "could not reproduce."
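To make that snapshot habit concrete, here is a minimal Python sketch (all names are illustrative, not tied to any specific framework): it keeps a ring buffer of recent requests and dumps it to disk the moment the error rate crosses a burst threshold, so the failure moment is preserved even though dashboards look healthy again minutes later.

```python
# Minimal sketch, assumed thresholds: keep a ring buffer of recent request
# records and dump it when the error rate crosses a burst threshold.
import collections
import json
import time

RECENT = collections.deque(maxlen=2000)   # last N requests, newest last
ERRORS = collections.deque(maxlen=2000)   # timestamps of recent errors

def record_request(request_id, route, status):
    now = time.time()
    RECENT.append({"ts": now, "request_id": request_id, "route": route, "status": status})
    if status >= 500:
        ERRORS.append(now)
        if burst_detected(now):
            dump_snapshot(now)

def burst_detected(now, window_s=60, threshold=20):
    # Burst = more than `threshold` errors in the last `window_s` seconds.
    return sum(1 for ts in ERRORS if now - ts <= window_s) > threshold

def dump_snapshot(now):
    # Preserve the exact failure moment before auto-healing replaces state.
    path = f"/tmp/burst-snapshot-{int(now)}.json"
    with open(path, "w") as fh:
        json.dump(list(RECENT), fh)
```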
First 15 Minutes of a Burst Failure
In the first 15 minutes, prioritize instrumentation over speculation. If you miss the next failure window, the investigation resets and confidence drops.
- Stamp exact failure window and timezone.
- Increase targeted tracing/logging temporarily.
- Capture failing request IDs and correlated dependencies.
- Segment incidents by route, region, and user cohort.
- Look for periodic triggers (cron, scale events, cache expiry).
- Preserve artifacts before auto-healing replaces state.
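As a concrete starting point for the capture steps above, the sketch below (field names and file paths are hypothetical) stamps the failure window in UTC and persists failing request IDs with the dependencies they touched, before auto-healing or log rotation erases them.

```python
# Minimal sketch, hypothetical field names: persist the failure window and
# failing request IDs as a small evidence file during the first 15 minutes.
import datetime
import json

def capture_failure_window(start_utc, end_utc, failing_requests, notes=""):
    evidence = {
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "window": {"start_utc": start_utc, "end_utc": end_utc},
        # Each entry: request ID, route, region, and the dependencies it touched.
        "failing_requests": failing_requests,
        "notes": notes,
    }
    path = f"evidence-{start_utc.replace(':', '')}.json"
    with open(path, "w") as fh:
        json.dump(evidence, fh, indent=2)
    return path

# Example usage:
capture_failure_window(
    start_utc="2024-05-01T14:03:00Z",
    end_utc="2024-05-01T14:06:00Z",
    failing_requests=[
        {"request_id": "req-8f31", "route": "/login", "region": "eu-west-1",
         "dependencies": ["token-cache", "auth-provider"]},
    ],
    notes="Burst repeats roughly hourly; suspect cache expiry or scale event.",
)
```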
Correlate Bursts With System Events
Correlate failure spikes with deploy events, autoscaling transitions, network jitter, and dependency retries. Intermittency often comes from timing interactions, not one broken component.
- Correlate bursts with deployment and config events.
- Inspect resource contention patterns (locks, pools, queue lag).
- Analyze dependency latency tails during failure windows.
- Compare successful vs failed request paths for divergence.
- Check control-plane events (autoscaling, certificate renewals, DNS updates).
- Use rolling windows that preserve short spikes, not just hourly averages.
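A first correlation pass can be as simple as flagging system events that landed within a few minutes of each burst. The sketch below uses made-up timestamps and event records; substitute your own deploy log, autoscaler events, and certificate/DNS change history.

```python
# Minimal sketch, illustrative data: flag system events that occurred within
# a tolerance window around each error burst.
from datetime import datetime, timedelta

bursts = [datetime(2024, 5, 1, 14, 3), datetime(2024, 5, 1, 15, 2)]
events = [
    {"ts": datetime(2024, 5, 1, 14, 1), "type": "autoscale", "detail": "scale-out +2"},
    {"ts": datetime(2024, 5, 1, 13, 20), "type": "deploy", "detail": "api v142"},
    {"ts": datetime(2024, 5, 1, 15, 0), "type": "autoscale", "detail": "scale-out +1"},
]

def correlate(bursts, events, tolerance=timedelta(minutes=5)):
    hits = []
    for burst in bursts:
        for event in events:
            if abs(event["ts"] - burst) <= tolerance:
                hits.append((burst.isoformat(), event["type"], event["detail"]))
    return hits

for burst_ts, etype, detail in correlate(bursts, events):
    print(f"burst {burst_ts} ~ {etype}: {detail}")
# In this sample, both bursts line up with scale-out events, not the deploy,
# which narrows the hypothesis without proving it.
```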
Stabilize While Preserving Forensics
Use stabilizing controls that reduce oscillation: retry budgets, queue backpressure, and conservative autoscaling thresholds.
- Apply low-risk guardrails: tighter timeouts and bounded retries.
- Throttle expensive non-critical paths under burst load.
- Enable graceful degradation for optional features.
- Route traffic away from suspected unstable regions.
- Avoid broad changes that erase evidence during investigation.
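As one example of a low-risk guardrail, the sketch below wraps a dependency call with a tight timeout and a bounded, jittered retry budget; `call_dependency` is a placeholder for your own client, not a real API.

```python
# Minimal sketch, assumed limits: tight timeout plus a bounded retry budget,
# so a flaky dependency does not amplify bursts while evidence is gathered.
import random
import time

def call_with_guardrails(call_dependency, max_attempts=2, timeout_s=1.0, base_backoff_s=0.2):
    last_error = None
    for attempt in range(max_attempts):
        try:
            # The callee is expected to honor the timeout rather than hang.
            return call_dependency(timeout=timeout_s)
        except Exception as exc:  # narrow this to your client's error types
            last_error = exc
            # Jittered backoff avoids synchronized retry waves across instances.
            time.sleep(base_backoff_s * (2 ** attempt) * random.uniform(0.5, 1.5))
    # Bounded retries: fail fast and surface the error instead of retrying forever.
    raise last_error
```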
Managing Uncertainty in Live Updates
With intermittent issues, honesty about uncertainty is important. Communicate the observed pattern and your active hypotheses, then commit to the time of the next update. That approach maintains trust without overpromising.
Intermittent incidents are mentally draining because progress feels invisible. Small wins matter: capture one reproducible pattern, one reliable trigger, one validated mitigation at a time.
Example update: "Intermittent burst confirmed on one endpoint family; tracing window expanded and mitigation in progress."
Turn Bursts Into Detectable Signals
After stabilization, add event-rich logging and synthetic burst tests that reproduce timing pressure. Intermittent issues become manageable when you can reproduce them.
- Promote temporary instrumentation into lightweight permanent telemetry.
- Add alert rules for burst patterns, not only sustained errors.
- Document trigger patterns discovered during the incident.
- Create a reproducibility checklist for future responders.
- Review autoscaling and background job timing interactions.
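For the synthetic burst tests mentioned above, a rough sketch like the following (the target URL, concurrency, and thresholds are assumptions) fires a short spike of concurrent requests and reports the error ratio inside that window.

```python
# Minimal sketch, hypothetical endpoint: a synthetic burst that mimics the
# timing pressure of real bursts and reports the in-window error ratio.
import concurrent.futures
import urllib.request

TARGET = "https://example.internal/health-critical-path"  # assumed endpoint

def one_request(timeout_s=2.0):
    try:
        with urllib.request.urlopen(TARGET, timeout=timeout_s) as resp:
            return resp.status
    except Exception:
        return 599  # treat any failure as an error for the burst report

def run_burst(concurrency=50, total=200):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(lambda _: one_request(), range(total)))
    errors = sum(1 for s in statuses if s >= 500)
    print(f"burst of {total} requests: {errors} errors ({errors / total:.1%})")

if __name__ == "__main__":
    run_burst()
```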
Case Walkthrough: 3-Minute Error Bursts Every Hour
One team chased random login errors for days until they captured synchronized traces during failures. The root cause was token cache eviction bursts during autoscaling, not auth provider instability.
When investigating intermittent outages, the highest-leverage habit is disciplined decision logging: what evidence changed, what action followed, and why that action was chosen. That record keeps parallel teams aligned, prevents contradictory fixes, and gives you a cleaner post-incident review with real lessons instead of hindsight noise.
Copy/Paste Intermittent Incident Update
Use this intermittent-incident capture template during active investigation:
[INCIDENT START] How to Investigate Intermittent Outages
Failure window: [start/end UTC + frequency]
Affected operations: [routes/actions]
Correlated system events: [deploy/scale/restart/network]
Telemetry captured at failure: [logs/traces/metrics]
Retry behavior observed: [client/server patterns]
Temporary stabilizers applied: [limits/backoff/buffers]
Reproduction hypothesis: [most likely trigger]
Next capture plan: [who, what, when]
For intermittent incidents, better capture quality is usually the fastest path to resolution.