DNS Outage Troubleshooting Guide for Real Incidents

Start With Authoritative Truth

DNS incidents are frustrating because two users can see opposite behavior at the same moment. One gets NXDOMAIN, another sees the new record, and a third still reaches the old origin because of a cached answer.

The key is discipline: authoritative truth first, resolver behavior second, client caches last. Reversing that order creates noise and bad decisions.

Related reading: For cross-checks and deeper triage context, also review HTTP Status Codes Guide: 200, 300, 400, 500 Explained and CDN Outages and Regional Failures: A Practical Diagnostic Framework.

Resolver and Propagation Warning Signs

DNS incidents often look contradictory because caches disagree by design. Your first objective is to compare authoritative truth with resolver behavior, not to chase individual client screenshots.

First 15 Minutes for DNS Incidents

The first 15 minutes should answer three questions: are authoritative records correct, is delegation intact, and which resolver groups are failing. That triad gives you a reliable incident boundary.

  1. Confirm authoritative records for A/AAAA/CNAME and expected TTL.
  2. Check NS delegation and glue records at parent zone.
  3. Test multiple public resolvers and one ISP resolver.
  4. Confirm DNSSEC chain and DS/RRSIG validity if enabled.
  5. Compare failing hostname with known-good subdomain behavior.
  6. Publish user guidance that acknowledges propagation variance.
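The first four checks above can be run with `dig`. A minimal sketch, assuming `example.com` is your zone and `ns1.example.com` is a placeholder for one of its authoritative servers:

```shell
# 1. Authoritative answer, asked directly of a name server for the zone
#    (ns1.example.com is a placeholder for your authoritative server)
dig @ns1.example.com example.com A +noall +answer
dig @ns1.example.com www.example.com CNAME +noall +answer

# 2. Delegation and glue as published by the parent (.com) zone
dig @a.gtld-servers.net example.com NS +noall +authority +additional

# 3. The same question to several public resolvers and your ISP resolver
for r in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "== resolver $r =="
  dig @"$r" example.com A +noall +answer
done

# 4. DNSSEC: request RRSIGs and check for the 'ad' (authenticated data) flag
dig @8.8.8.8 example.com A +dnssec +noall +comments +answer
```

Diverging answers between steps 1 and 3 bound the incident to recursion or caching; a broken step 2 points at delegation rather than the zone contents.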

Resolver-by-Resolver Investigation

Move from authority to recursion to client cache. Testing in the reverse order wastes time and creates false confidence when one resolver happens to be fresh.
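One way to follow that order on the command line, with illustrative hostnames and resolver IPs:

```shell
# Layer 1: authority -- what the zone actually publishes right now
dig @ns1.example.com shop.example.com A +noall +answer

# Layer 2: recursion -- what a given resolver believes (note remaining TTL)
dig @1.1.1.1 shop.example.com A +noall +answer

# Layer 3: client cache -- what the local stub resolver has cached
# systemd-resolved (Linux):
resolvectl query shop.example.com
# macOS; only flush once layers 1 and 2 already agree:
# sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
```

If layer 1 is wrong, stop and fix the zone; flushing caches before the authority is correct only re-caches the bad answer.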

Stabilize Without Creating Cache Chaos

Prefer stable, minimal edits during propagation windows. Repeated record changes under pressure often prolong incidents because different networks cache different versions.

How to Explain Propagation Clearly

For DNS incidents, clarity matters more than certainty. Say: "Some resolvers still serve old records; impact varies by network." That statement is honest, actionable, and reduces unnecessary panic.

Support teams need scripts during DNS events because user experience differs by network. Equip them with quick checks and non-technical explanations so they can help without over-escalating every ticket.

Example update: "Authoritative records corrected; propagation is in progress. Some resolvers may still serve cached old answers."

DNS Reliability Upgrades

Build a pre-change DNS checklist with ownership and rollback guardrails. Most repeat DNS incidents are process failures, not protocol failures.

  1. Document DNS ownership and emergency change process.
  2. Add resolver-diverse monitoring to catch split behavior earlier.
  3. Create a pre-change DNS checklist for high-risk updates.
  4. Audit DNSSEC, delegation, and automation dependencies quarterly.
  5. Train support on propagation expectations and customer messaging.

Case Walkthrough: NXDOMAIN After Zone Change

A common real-world pattern is NXDOMAIN from one ISP while public resolvers are healthy. That usually points to stale negative caching or resolver-specific recursion issues, not universal record loss.
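Negative answers are cached too: per RFC 2308, a resolver may cache NXDOMAIN for the minimum of the SOA record's TTL and its MINIMUM field. A sketch for confirming that window (the ISP resolver IP is a placeholder from the documentation range):

```shell
# How long may a resolver legitimately keep serving the NXDOMAIN?
# Negative TTL = min(SOA TTL, SOA MINIMUM field) per RFC 2308.
dig @ns1.example.com example.com SOA +noall +answer

# Ask the suspect ISP resolver directly; the SOA in the authority
# section of an NXDOMAIN response carries the negative-caching TTL
dig @203.0.113.53 shop.example.com A +noall +answer +authority

# Cross-check a healthy public resolver for the same name
dig @8.8.8.8 shop.example.com A +noall +answer
```

If the ISP resolver stays NXDOMAIN well past that negative TTL, escalate to the resolver operator rather than editing records again.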

Across any DNS outage, the highest-leverage habit is disciplined decision logging: what evidence changed, what action followed, and why that action was chosen. That record keeps parallel teams aligned, prevents contradictory fixes, and gives you a cleaner post-incident review with real lessons instead of hindsight noise.

Copy/Paste DNS Incident Template

Use this DNS-focused escalation format when propagation behavior is inconsistent:

[INCIDENT START] DNS outage: [zone / affected hostname]
Authoritative record state: [expected vs actual A/AAAA/CNAME]
Delegation check: [NS + glue status]
DNSSEC state: [valid / failing chain]
Failing resolver groups: [ISP/public resolver list]
TTL context: [current TTL + recent changes]
Customer impact scope: [regions/networks affected]
Mitigation step: [revert/fix/escalate to provider]
Next verification round: [time + resolver set]

Clear resolver segmentation helps users trust updates when their experience differs from other regions.


FAQ

How long should I wait before assuming propagation is stuck?

Compare observed behavior against record TTL and resolver diversity. If several major resolvers remain stale well beyond TTL, treat it as a resolver or delegation incident, not normal propagation.
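A quick way to make that comparison concrete: the second field of a `dig` answer is the remaining TTL in that resolver's cache, so a healthy cache counts down and then refreshes to the full record TTL. A sketch against a public resolver:

```shell
# Remaining TTL is the second field of each answer line; on a healthy
# resolver it decreases between runs, then resets to the record TTL.
dig @8.8.8.8 www.example.com A +noall +answer
sleep 60
dig @8.8.8.8 www.example.com A +noall +answer
```

An old answer still appearing after the countdown should have expired is the signal to stop waiting and start treating it as a resolver or delegation incident.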

Is lowering TTL during an outage useful?

It helps future changes but does not invalidate existing cached entries. Lower TTL is strategic, not an immediate fix for already-cached bad data.

When should I suspect DNSSEC?

Suspect DNSSEC when SERVFAIL appears across validating resolvers and recent key/DS changes occurred. Validate chain integrity before making additional record edits.
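The classic DNSSEC signature is SERVFAIL from validating resolvers while the same query succeeds with validation disabled. A triage sketch (again with `ns1.example.com` as a placeholder authoritative server):

```shell
# SERVFAIL normally, success with checking disabled (+cd) means the
# records resolve but the validation chain is broken
dig @8.8.8.8 example.com A
dig @8.8.8.8 example.com A +cd

# Compare the DS record at the parent with the zone's published DNSKEYs;
# a mismatch after a key rollover is a common root cause
dig @a.gtld-servers.net example.com DS +noall +answer
dig @ns1.example.com example.com DNSKEY +noall +answer

# Full chain validation with delv (ships with BIND 9)
delv @8.8.8.8 example.com A
```

If `+cd` succeeds where the plain query fails, fix the DS/DNSKEY mismatch; editing the A/AAAA records will not help.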

Can a DNS issue affect only one user journey?

Yes. A single subdomain, CNAME chain, or geo policy can fail while the main site appears healthy. Always test every critical hostname, not just the homepage.
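Walking the critical hostnames can be one short loop; the hostname list below is illustrative. `dig +short` prints the full CNAME chain inline, which makes broken intermediate targets easy to spot:

```shell
# Check every critical hostname, not just the apex or homepage
for h in www api cdn login; do
  echo "== $h.example.com =="
  dig "$h.example.com" A +short   # shows CNAME chain, then A records
done
```

An empty answer for one hostname while the others resolve is exactly the single-journey failure described above.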