DNS Outage Troubleshooting Guide for Real Incidents
Start With Authoritative Truth
DNS incidents are frustrating because two users can see opposite behavior at the same moment: one gets NXDOMAIN, another sees the new record, and a third still hits the old origin because a resolver is serving a cached answer.
The key is discipline: authoritative truth first, resolver behavior second, client caches last. Reversing that order creates noise and bad decisions.
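As a concrete starting point, here is a minimal comparison sketch, assuming the dig CLI is installed and using placeholder values for the zone, its authoritative nameserver, and the public resolver:

```python
# Minimal sketch: compare the authoritative answer with a recursive resolver's current view.
# DOMAIN, AUTH_NS, and RECURSIVE are placeholders; substitute your own values.
import subprocess

DOMAIN = "example.com"        # hypothetical zone under investigation
AUTH_NS = "ns1.example.com"   # one of the zone's authoritative nameservers
RECURSIVE = "8.8.8.8"         # any public recursive resolver

def dig_short(server, name, *extra):
    """Query a specific server with dig and return the short-form answer."""
    cmd = ["dig", f"@{server}", name, "A", "+short", *extra]
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

# Authoritative truth: ask the zone's own nameserver with recursion disabled.
authoritative = dig_short(AUTH_NS, DOMAIN, "+norecurse")
# Resolver behavior: ask a recursive resolver, which may still serve a cached answer.
recursive = dig_short(RECURSIVE, DOMAIN)

print("authoritative:", authoritative or "(no answer)")
print("recursive:    ", recursive or "(no answer)")
print("match" if authoritative == recursive else "mismatch: likely cache lag, not a broken zone")
```

A mismatch here usually means cache lag rather than record loss, which is exactly the distinction the rest of this guide leans on.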
Related reading: For cross-checks and deeper triage context, also review HTTP Status Codes Guide: 200, 300, 400, 500 Explained and CDN Outages and Regional Failures: A Practical Diagnostic Framework.
Quick Navigation
- Start With Authoritative Truth
- Resolver and Propagation Warning Signs
- First 15 Minutes for DNS Incidents
- Resolver-by-Resolver Investigation
- Stabilize Without Creating Cache Chaos
- How to Explain Propagation Clearly
- DNS Reliability Upgrades
- Case Walkthrough: NXDOMAIN After Zone Change
- Copy/Paste DNS Incident Template
- DNS Incident FAQ
Resolver and Propagation Warning Signs
DNS incidents often look contradictory because caches disagree by design. Your first objective is to compare authoritative truth with resolver behavior, not to chase individual client screenshots.
- NXDOMAIN or SERVFAIL reports from specific networks.
- Users in one country resolve correctly while others do not.
- New records resolve from the command line but not in browsers (see the DoH comparison sketch after this list).
- Traffic still reaches the old IP after the migration window.
- Intermittent failures after DNSSEC or delegation changes.
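One of these symptoms, "works from the command line but not in browsers," deserves a dedicated check: many browsers resolve over DNS-over-HTTPS and can disagree with the operating system's stub resolver. The sketch below compares the two, assuming the public Google DoH JSON endpoint and a placeholder hostname:

```python
# Minimal sketch: compare the system stub resolver with a DoH resolver, since DoH-enabled
# browsers may see different answers than CLI tools. HOSTNAME is a placeholder.
import json
import socket
import urllib.request

HOSTNAME = "www.example.com"

# What the operating system (and most CLI tools) resolve to.
try:
    system_answers = sorted({ai[4][0] for ai in socket.getaddrinfo(HOSTNAME, 443, proto=socket.IPPROTO_TCP)})
except socket.gaierror as exc:
    system_answers = [f"error: {exc}"]

# What a public DoH resolver returns, closer to what DoH-enabled browsers see.
with urllib.request.urlopen(f"https://dns.google/resolve?name={HOSTNAME}&type=A") as resp:
    doh = json.load(resp)
doh_answers = sorted(record["data"] for record in doh.get("Answer", []))

print("system resolver:", system_answers)
print("DoH resolver:   ", doh_answers)
```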
First 15 Minutes for DNS Incidents
The first 15 minutes should answer three questions: are authoritative records correct, is delegation intact, and which resolver groups are failing. That triad gives you a reliable incident boundary.
- Confirm authoritative records for A/AAAA/CNAME and expected TTL.
- Check NS delegation and glue records at parent zone.
- Test multiple public resolvers and one ISP resolver.
- Confirm DNSSEC chain and DS/RRSIG validity if enabled.
- Compare the failing hostname's behavior with that of a known-good subdomain.
- Publish user guidance that acknowledges propagation variance.
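The checklist above maps to a handful of commands. The sketch below is one minimal way to run them, assuming the dig CLI is available; the zone, hostname, nameserver, and resolver values are placeholders:

```python
# Minimal sketch of the first-15-minutes checks: authoritative records, parent delegation,
# resolver spread, and a DNSSEC hint. All names and addresses below are placeholders.
import subprocess

ZONE = "example.com"              # affected zone
HOSTNAME = "www.example.com"      # record users report as failing
AUTH_NS = "ns1.example.com"       # one authoritative nameserver for the zone
PARENT_NS = "a.gtld-servers.net"  # a .com parent server; use the right parent for other TLDs
RESOLVERS = ["8.8.8.8", "1.1.1.1"]

def dig(*args):
    return subprocess.run(["dig", *args], capture_output=True, text=True).stdout.strip()

# 1. Authoritative truth: record and TTL straight from the zone's own nameserver.
print(dig(f"@{AUTH_NS}", HOSTNAME, "A", "+norecurse", "+noall", "+answer"))

# 2. Delegation at the parent: the NS set (and glue) the rest of the internet follows.
print(dig(f"@{PARENT_NS}", ZONE, "NS", "+norecurse", "+noall", "+authority", "+additional"))

# 3. Resolver spread: what recursive resolvers actually hand out right now.
for resolver in RESOLVERS:
    print(resolver, dig(f"@{resolver}", HOSTNAME, "A", "+short"))

# 4. DNSSEC hint: an empty answer here that fills in with +cd (checking disabled)
#    points at a broken validation chain rather than a missing record.
print("validation on: ", dig("@8.8.8.8", HOSTNAME, "A", "+short"))
print("validation off:", dig("@8.8.8.8", HOSTNAME, "A", "+short", "+cd"))
```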
Resolver-by-Resolver Investigation
Move from authority to recursion to client cache. Testing in the reverse order wastes time and creates false confidence when one resolver happens to be fresh.
- Investigate negative caching behavior after recent NXDOMAIN states (see the negative-TTL sketch after this list).
- Validate split-horizon or geo DNS policies if used.
- Review recent registrar, nameserver, and zone file changes.
- Inspect recursion timeout patterns for SERVFAIL clusters.
- Check CAA and certificate automation dependencies if TLS renewals fail.
- Capture resolver-by-resolver evidence before applying new changes.
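For the negative-caching item above, the bound that matters is the zone's negative TTL: per RFC 2308, resolvers cache NXDOMAIN for roughly the smaller of the SOA record's TTL and its minimum field. A minimal way to read it, assuming dig and placeholder names:

```python
# Minimal sketch: estimate how long a stale NXDOMAIN can survive in resolver caches.
# ZONE and AUTH_NS are placeholders for the affected zone and its authoritative server.
import subprocess

ZONE = "example.com"
AUTH_NS = "ns1.example.com"

fields = subprocess.run(
    ["dig", f"@{AUTH_NS}", ZONE, "SOA", "+noall", "+answer"],
    capture_output=True, text=True,
).stdout.split()

# Answer line layout: name  ttl  class  SOA  mname  rname  serial  refresh  retry  expire  minimum
if fields:
    soa_ttl = int(fields[1])
    soa_minimum = int(fields[-1])
    print("negative answers may persist up to", min(soa_ttl, soa_minimum), "seconds per resolver")
```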
Stabilize Without Creating Cache Chaos
Prefer stable, minimal edits during propagation windows. Repeated record changes under pressure often prolong incidents because different networks cache different versions.
- Revert to last-known-good authoritative records when safe.
- Avoid repeated live edits while propagation is still unfolding (see the arithmetic sketch after this list).
- Shorten TTLs for future flexibility once service is stable.
- Coordinate registrar/provider escalations with concrete resolver evidence.
- Provide temporary fallback hostnames for critical operations if needed.
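The arithmetic behind "avoid repeated live edits" is simple but worth writing down: every published change can keep being served from caches for up to the TTL that was in effect when a resolver last fetched it, and each new edit restarts that window somewhere. A small illustrative calculation with made-up values:

```python
# Minimal sketch of the propagation window after a record change. Values are illustrative.
from datetime import datetime, timedelta, timezone

old_ttl_seconds = 3600  # TTL the previous record was published with
change_time = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical change timestamp

# A resolver that cached the old answer just before the change may serve it this long.
worst_case_converged = change_time + timedelta(seconds=old_ttl_seconds)
print("old answers may be served until", worst_case_converged.isoformat())

# Each additional live edit restarts this window for whichever resolvers cached the
# intermediate value, which is why stacking edits under pressure prolongs the incident.
```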
How to Explain Propagation Clearly
For DNS incidents, clarity matters more than certainty. Say: "Some resolvers still serve old records; impact varies by network." That statement is honest, actionable, and reduces unnecessary panic.
Support teams need scripts during DNS events because user experience differs by network. Equip them with quick checks and non-technical explanations so they can help without over-escalating every ticket.
Example update: "Authoritative records corrected; propagation is in progress. Some resolvers are still serving previously cached answers."
DNS Reliability Upgrades
Build a pre-change DNS checklist with ownership and rollback guardrails. Most repeat DNS incidents are process failures, not protocol failures.
- Document DNS ownership and emergency change process.
- Add resolver-diverse monitoring to catch split behavior earlier (a minimal monitoring loop is sketched after this list).
- Create a pre-change DNS checklist for high-risk updates.
- Audit DNSSEC, delegation, and automation dependencies quarterly.
- Train support on propagation expectations and customer messaging.
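For the resolver-diverse monitoring item, the core loop is just "ask several resolvers, alert when they disagree." A minimal sketch, assuming dig is installed and using placeholder resolvers; the alerting hook is left to your own tooling:

```python
# Minimal sketch: poll several resolvers for one hostname and flag divergence.
# HOSTNAME, RESOLVERS, and the interval are placeholders.
import subprocess
import time

HOSTNAME = "www.example.com"
RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]
INTERVAL_SECONDS = 60

def answers(resolver):
    out = subprocess.run(
        ["dig", f"@{resolver}", HOSTNAME, "A", "+short"],
        capture_output=True, text=True,
    ).stdout.split()
    return frozenset(out)

while True:
    views = {resolver: answers(resolver) for resolver in RESOLVERS}
    if len(set(views.values())) > 1:
        # Resolvers disagree: surface this before users start filing tickets.
        print("DIVERGENCE:", {resolver: sorted(view) for resolver, view in views.items()})
    time.sleep(INTERVAL_SECONDS)
```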
Case Walkthrough: NXDOMAIN After Zone Change
A common real-world pattern is NXDOMAIN from one ISP while public resolvers are healthy. That usually points to stale negative caching or resolver-specific recursion issues, not universal record loss.
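A quick way to confirm that pattern is to compare the suspect resolver's response code with the authoritative answer. A minimal sketch, using placeholder names and a documentation-range IP standing in for the ISP resolver:

```python
# Minimal sketch: confirm the failure is resolver-specific rather than universal.
# HOSTNAME, AUTH_NS, and SUSPECT_RESOLVER are placeholders.
import re
import subprocess

HOSTNAME = "app.example.com"
AUTH_NS = "ns1.example.com"
SUSPECT_RESOLVER = "203.0.113.53"  # stand-in for the ISP resolver reported as failing

def status(server):
    out = subprocess.run(["dig", f"@{server}", HOSTNAME, "A"],
                         capture_output=True, text=True).stdout
    match = re.search(r"status: (\w+)", out)
    return match.group(1) if match else "NO RESPONSE"

auth_status, suspect_status = status(AUTH_NS), status(SUSPECT_RESOLVER)
print("authoritative:", auth_status, "| suspect resolver:", suspect_status)
if auth_status == "NOERROR" and suspect_status == "NXDOMAIN":
    print("pattern matches stale negative caching at the suspect resolver, not record loss")
```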
Across any DNS incident, the highest-leverage habit is disciplined decision logging: what evidence changed, what action followed, and why that action was chosen. That record keeps parallel teams aligned, prevents contradictory fixes, and gives you a cleaner post-incident review with real lessons instead of hindsight noise.
Copy/Paste DNS Incident Template
Use this DNS-focused escalation format when propagation behavior is inconsistent:
[INCIDENT START] DNS outage: [affected zone/domain]
Authoritative record state: [expected vs actual A/AAAA/CNAME]
Delegation check: [NS + glue status]
DNSSEC state: [valid / failing chain]
Failing resolver groups: [ISP/public resolver list]
TTL context: [current TTL + recent changes]
Customer impact scope: [regions/networks affected]
Mitigation step: [revert/fix/escalate to provider]
Next verification round: [time + resolver set]
Clear resolver segmentation helps users trust updates when their experience differs from other regions.