How to Check if a Website Is Down: A Practical Incident Checklist
Before You Declare an Outage
Most teams lose the first 10 minutes of an incident because everyone uses a different definition of "down". One person means they got a timeout from home Wi-Fi, another means checkout is returning 503 in one region, and support is already replying to tickets with incomplete information.
This guide gives you a repeatable sequence for the first report: scope, evidence, likely failure domain, and safe next action. The point is not to sound smart in Slack. The point is to help the right owner start in the right layer immediately.
Related reading: for cross-checks and deeper triage context, also review "HTTP Status Codes Guide: 200, 300, 400, 500 Explained" and "Incident Communication Template for Website Outages".
Quick Navigation
- Before You Declare an Outage
- Early Signals to Trust
- The First 15 Minutes of Triage
- Layer-by-Layer Verification
- Low-Risk Mitigations First
- What to Say While You Investigate
- Improvements After the Incident
- Case Walkthrough: Mixed Regional Failures
- Copy/Paste Triage Message
- Checklist FAQ
Early Signals to Trust
Treat the first report as a clue, not a conclusion. What you need immediately is a scope map: which hostname, which path, which region, and whether failures are consistent or random.
- Customers report mixed behavior: some can load, others cannot.
- Error pattern changes by location or ISP.
- One hostname variant (`www` vs root) behaves differently.
- TLS/browser errors appear even when health checks look green.
- Support tickets start before infra dashboards show obvious spikes.
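The scope map above can be sketched as a small aggregation over raw reports. This is a minimal illustration, not a real tool: the report fields (`host`, `region`, `error`) and the `scope_map` helper are assumptions.

```python
from collections import Counter

def scope_map(reports):
    """Aggregate raw user reports into a scope map: which hostnames,
    regions, and error types are involved, and how often each appears."""
    return {
        "hosts": Counter(r["host"] for r in reports),
        "regions": Counter(r["region"] for r in reports),
        "errors": Counter(r["error"] for r in reports),
    }

# Hypothetical early reports: mixed behavior across hostname variants.
reports = [
    {"host": "www.example.com", "region": "eu-west", "error": "timeout"},
    {"host": "www.example.com", "region": "us-east", "error": "none"},
    {"host": "example.com", "region": "eu-west", "error": "tls"},
]
summary = scope_map(reports)
```

Even this rough tally answers the first triage questions: is one hostname variant worse than the other, and is one region carrying the failures?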
The First 15 Minutes of Triage
In the first 15 minutes, speed matters less than clean evidence. If you log one reliable snapshot across regions and networks, escalation quality improves and rework drops.
- Normalize the URL and test the exact host users are reporting.
- Run checks from multiple regions before escalating as global.
- Capture DNS, TLS, HTTP status, and latency in one snapshot.
- Test from a second local network path (mobile data or VPN off).
- Compare affected path with a known healthy endpoint on the same domain.
- Publish a short triage summary with what is known and unknown.
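The "one snapshot" idea can be sketched as a formatter that puts DNS, TLS, HTTP status, and latency for each region on a single line. The field names and regions below are illustrative assumptions; the actual probing (curl, dig, a synthetic checker) is out of scope here.

```python
def format_snapshot(checks):
    """Render per-region results as one escalation-ready line per region,
    so DNS, TLS, HTTP status, and latency appear side by side."""
    lines = []
    for c in checks:
        lines.append(
            f"{c['region']}: dns={'ok' if c['dns'] else 'FAIL'} "
            f"tls={'ok' if c['tls'] else 'FAIL'} "
            f"http={c['status']} latency={c['latency_ms']}ms"
        )
    return "\n".join(lines)

# Hypothetical results from two vantage points.
checks = [
    {"region": "us-east", "dns": True, "tls": True, "status": 200, "latency_ms": 120},
    {"region": "eu-west", "dns": True, "tls": True, "status": 503, "latency_ms": 4100},
]
snapshot = format_snapshot(checks)
```

A snapshot in this shape pastes cleanly into an escalation message and makes regional asymmetry obvious at a glance.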
Layer-by-Layer Verification
After initial confirmation, validate each layer in order: DNS, TLS, edge, origin, and dependency health. This prevents teams from debating theories when the telemetry already points to one layer.
- Split failures by hostname, path, and protocol rather than one global status.
- Check recent deploys, DNS changes, certificate renewals, and CDN rules in the incident window.
- Correlate user reports with region-level failures and edge telemetry.
- Verify redirect chains so you do not miss a broken destination URL.
- Look for policy-based blocks (403/429) that are not true downtime.
- Build an evidence bundle (time, region, status pattern, recent changes).
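The layer order can be encoded as a first-match classifier: checking DNS before TLS before HTTP mirrors the verification sequence above, because an earlier-layer failure makes later layers unobservable. The evidence keys and labels are assumptions for illustration.

```python
def likely_layer(evidence):
    """Return the most likely failure layer for one check result.
    Order matters: a DNS failure hides everything behind it."""
    if not evidence.get("dns_resolved"):
        return "dns"
    if not evidence.get("tls_ok"):
        return "tls/certificate"
    status = evidence.get("status")
    if status in (403, 429):
        return "policy-block (not true downtime)"
    if status in (500, 502, 503, 504):
        return "origin-or-dependency"
    return "unclear: gather more evidence"
```

Note the 403/429 branch: it encodes the point above that policy-based blocks look like downtime to users but need a different owner.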
Low-Risk Mitigations First
Use the smallest reversible fix first. A narrow rollback, route shift, or temporary fallback page usually restores confidence faster than broad infrastructure changes.
- Route traffic away from the affected region if healthy alternatives exist.
- Roll back only the change tied to the failure domain you proved.
- Temporarily relax non-critical controls if they are blocking legitimate traffic.
- Add short-term cache or fallback pages for high-traffic read routes.
- Avoid broad restarts unless you have evidence they reduce user impact.
What to Say While You Investigate
Good communication here is simple: scope + impact + next update time. Avoid saying "investigating" with no context. Tell users whether the issue is global, regional, or limited to specific paths. That alone reduces duplicate tickets and keeps account teams aligned.
Inside the team, assign one incident lead and one scribe early. The lead keeps decisions moving; the scribe preserves an accurate timeline. Without that split, teams either over-talk or over-fix. Both slow recovery.
Example update: "6/8 regions failing on checkout with 503. DNS/TLS healthy. Rollback in progress. Next update in 20 minutes."
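The scope + impact + next-update shape can be produced consistently with a small formatter, so every update reads the same regardless of who sends it. The parameter names are illustrative; the output reproduces the example update above.

```python
def status_update(failing, total, path, layer_health, action, next_update_min):
    """Compose a user-facing update: scope, impact, current action,
    and a concrete time for the next update."""
    return (
        f"{failing}/{total} regions failing on {path}. "
        f"{layer_health}. {action}. "
        f"Next update in {next_update_min} minutes."
    )

update = status_update(6, 8, "checkout with 503", "DNS/TLS healthy",
                       "Rollback in progress", 20)
```

Forcing a `next_update_min` argument is the useful part: an update without a committed next check-in time invites follow-up pings.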
Improvements After the Incident
Turn this guide into an internal standard. Teams that institutionalize a shared outage checklist reduce mean time to clarity more than teams that only add new tools.
- Create a one-page triage checklist shared by support and engineering.
- Add region-aware synthetic checks for primary user journeys.
- Document hostname/redirect ownership to prevent blind spots.
- Set a standard incident summary format for first escalation messages.
- Review what evidence was missing and instrument it permanently.
Case Walkthrough: Mixed Regional Failures
A common pattern is mixed reports where mobile users can load the site but office users cannot. In that case, cross-region checks plus resolver comparison quickly separate local network bias from real service impact.
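The mobile-versus-office split can be reduced to a decision over three observations: does the office path fail, does mobile data fail, and do independent remote regions fail? The function and its labels are a sketch of that reasoning, not a real diagnostic tool.

```python
def classify_mixed_reports(office_fails, mobile_fails, remote_fails):
    """Separate local network bias from real service impact.
    Failures from independent remote regions outweigh local signals."""
    if remote_fails:
        return "real-service-impact"
    if office_fails and not mobile_fails:
        return "local-network-or-resolver-bias"
    if office_fails and mobile_fails:
        return "possible-isp-or-regional-issue"
    return "no-confirmed-impact"
```

The ordering encodes the case walkthrough: remote evidence dominates, and an office-only failure with healthy mobile data points at the local network or its resolver, not the service.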
For this checklist, the highest-leverage habit is disciplined decision logging: what evidence changed, what action followed, and why that action was chosen. That record keeps parallel teams aligned, prevents contradictory fixes, and gives you a cleaner post-incident review with real lessons instead of hindsight noise.
Copy/Paste Triage Message
Use this checklist format when you need a fast, high-quality escalation in Slack or your incident tool:
[INCIDENT START] [affected service or exact URL]
Detection source: [user report / monitor / synthetic check]
Affected URL and hostname: [exact target]
Observed pattern: [timeouts / 5xx / TLS / DNS]
Scope estimate: [global / regional / single ISP]
Checks completed: [regions, networks, browser + curl]
Most likely layer: [DNS / edge / origin / dependency]
Mitigation selected: [action + owner]
Next evidence review: [time in UTC]
This forces precision early and keeps support, engineering, and leadership aligned on the same facts.
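The checklist can also be filled programmatically, so every escalation carries the same fields in the same order. The field names follow the template above; the helper itself is a hypothetical convenience, and a raised `KeyError` simply flags a missing field before the message is sent.

```python
TRIAGE_TEMPLATE = """[INCIDENT START] {target}
Detection source: {source}
Affected URL and hostname: {target}
Observed pattern: {pattern}
Scope estimate: {scope}
Checks completed: {checks}
Most likely layer: {layer}
Mitigation selected: {mitigation}
Next evidence review: {next_review}"""

def triage_message(**fields):
    """Fill the triage template; missing fields fail loudly
    before the message reaches the incident channel."""
    return TRIAGE_TEMPLATE.format(**fields)

# Hypothetical escalation for a regional checkout failure.
msg = triage_message(
    target="https://www.example.com/checkout",
    source="synthetic check",
    pattern="5xx",
    scope="regional",
    checks="3 regions, mobile + office, browser + curl",
    layer="origin",
    mitigation="roll back last deploy (owner: on-call)",
    next_review="14:30 UTC",
)
```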