Designing for Failure When LLM APIs Go Down
Recent outages show that relying on a single AI provider is a single point of failure for production applications.
On June 23, 2026, the status page for Claude lit up with a familiar warning: "Elevated error rate across multiple models." The incident disrupted claude.ai, the Claude Console, the Claude API, Claude Code, and Claude Cowork. For developers building on top of Anthropic APIs, it was another reminder of a hard truth: LLMs are not magic; they are software running on physical, constrained infrastructure.
This outage was not a one-off. It followed a string of disruptions throughout mid-June 2026, where models like Opus 4.8 and Haiku 4.5 repeatedly threw errors. Anthropic acknowledged that demand has grown faster than its infrastructure can support, even as it works to expand compute capacity through partnerships with Amazon and Google.
When your application relies on an external API for core functionality, a platform-wide outage is a critical event. To build resilient AI-native applications, developers must move past naive API calls and implement defensive engineering patterns that assume the model provider will fail.
The Dual Failure Modes of LLM Infrastructure
To build resilient systems, we have to understand why these APIs fail. LLM outages generally fall into two categories: physical capacity exhaustion and orchestration failures.
Anthropic's mid-2026 struggles represent the capacity exhaustion category. When user growth outpaces GPU availability, providers experience localized or global overloads. During the June 16, 2026 incident, Sonnet and Opus models experienced a sustained 10% error rate because the physical hardware simply could not keep up with peak-hour demand.
In contrast, OpenAI experienced a massive orchestration failure on November 25, 2024. In that incident, a global change to Kubernetes namespace labels triggered a metadata recomputation in the networking layer. This change overwhelmed the control plane in three of OpenAI's largest GPU clusters, causing cascading failures, high latency, and elevated error rates across GPT-4 class models and ChatGPT.
Figure: Peak error rates during major LLM outages (peak error rate, %). Claude (June 16, 2026): 10. ChatGPT Paid (Nov 25, 2024): 13. ChatGPT Enterprise (Nov 25, 2024): 23.
Whether the root cause is a lack of physical chips or a misconfigured Kubernetes cluster, the result for your application is the same: failed requests and broken user experiences.
The 60-Second Triage and Error Code Anatomy
When your application starts throwing errors, the first step is identifying the source of the failure. Many developers waste hours debugging their own code when the provider is down, or conversely, blaming the provider for a local configuration error.
The key lies in the HTTP status codes. For instance, Anthropic's API distinguishes between client-side and server-side limits:
- 429 (rate_limit_error): Your account has hit a rate limit. The fix is on your side: slow down requests, batch more efficiently, or move to a higher usage tier.
- 529 (overloaded_error): The provider's API is temporarily overloaded. This is a server-side issue. The hardware is maxed out, and no amount of account upgrading will fix it.
- 500 (api_error): An unexpected error inside the provider's systems, often seen during broader platform outages.
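The triage rules in the list above can be sketched as a small classifier. This is a minimal illustration, not part of any SDK: the `FailureSource` enum and the `triage` function are hypothetical names, and the catch-all 4xx/5xx branches are an assumption layered on top of the three codes discussed here.

```python
from enum import Enum


class FailureSource(Enum):
    """Hypothetical labels for where a failed request most likely went wrong."""
    CLIENT_RATE_LIMIT = "client rate limit: slow down or upgrade tier"
    PROVIDER_OVERLOADED = "provider overloaded: back off and retry"
    PROVIDER_ERROR = "provider internal error: check the status page"
    CLIENT_REQUEST = "client request error: fix the request or credentials"
    UNKNOWN = "unknown"


def triage(status_code: int) -> FailureSource:
    """Map an HTTP status code to the likely source of the failure."""
    if status_code == 429:
        return FailureSource.CLIENT_RATE_LIMIT
    if status_code == 529:
        return FailureSource.PROVIDER_OVERLOADED
    if 500 <= status_code < 600:
        # Assumption: any other 5xx is treated as provider-side instability.
        return FailureSource.PROVIDER_ERROR
    if 400 <= status_code < 500:
        # Assumption: any other 4xx points at the request itself.
        return FailureSource.CLIENT_REQUEST
    return FailureSource.UNKNOWN


for code in (429, 529, 500, 401):
    print(code, "->", triage(code).value)
```

Logging the triage result next to each failed request makes it easier to see at a glance whether an incident is local or provider-wide.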
During the June 23, 2026 outage, developers using Claude Code faced a specific configuration trap. If the ANTHROPIC_API_KEY environment variable is set in the shell, Claude Code uses that API key instead of the user's Pro or Max subscription credentials. Requests billed through pay-as-you-go API tiers were severely throttled during the high-load event, so some developers blamed their subscription plans when the real cause was a stray environment variable routing traffic through the wrong billing path.
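One way to catch this trap early is a startup check that flags the variable before any requests are sent. The sketch below is an assumption-laden illustration: the function name `api_key_override_warning` is hypothetical, and it only inspects the environment; it does not query Claude Code itself.

```python
import os
from typing import Mapping, Optional


def api_key_override_warning(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a warning if ANTHROPIC_API_KEY is set and non-empty, else None.

    Pass a mapping for testing; by default the real process environment is used.
    """
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return (
            "ANTHROPIC_API_KEY is set: traffic may be billed and throttled "
            "through the API key rather than a subscription."
        )
    return None


warning = api_key_override_warning()
if warning:
    print(warning)
```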
Implementing Jittered Exponential Backoff
When facing a 529 overloaded error, naive retry loops make the problem worse. If hundreds of clients hit a server error and all retry exactly one second later, they create a synchronized traffic spike. This is known as the thundering herd problem.
To prevent this, retry logic must incorporate exponential backoff combined with random jitter. Jitter spreads the retry requests over time, allowing the provider's load balancers to recover.
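To make the effect concrete, the following toy simulation counts how many retries land in the busiest 100 ms window, with and without jitter. It is a sketch under simplifying assumptions: every client fails at the same instant, each retries once, and the function name, bucket size, and client count are illustrative choices.

```python
import random
from collections import Counter


def peak_retries_per_bucket(n_clients, base_delay, jitter, bucket=0.1, seed=0):
    """Return the largest number of retries that fall into one time bucket.

    Each client retries at base_delay plus a uniform random offset in
    [0, jitter] seconds. Buckets are `bucket` seconds wide.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    buckets = Counter()
    for _ in range(n_clients):
        retry_time = base_delay + rng.uniform(0, jitter)
        buckets[int(retry_time / bucket)] += 1
    return max(buckets.values())


no_jitter = peak_retries_per_bucket(1000, base_delay=1.0, jitter=0.0)
with_jitter = peak_retries_per_bucket(1000, base_delay=1.0, jitter=1.0)
print("peak without jitter:", no_jitter)
print("peak with jitter:   ", with_jitter)
```

Without jitter, all 1,000 retries arrive in the same window. With one second of jitter they are spread over roughly ten windows, so the peak is far lower.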
Here is a Python implementation of a resilient request handler using the official SDK:
```python
import random
import time

import anthropic

# The SDK retries some failures on its own by default. Disable that so the
# loop below is the only retry layer and 429s surface immediately.
client = anthropic.Anthropic(max_retries=0)


def call_with_backoff(prompt, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
        except anthropic.APIStatusError as e:
            if e.status_code == 529:
                if attempt == max_attempts - 1:
                    break  # Out of attempts; skip the pointless final sleep.
                # Server is overloaded. Apply exponential backoff with random jitter.
                delay = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(delay)
                continue
            elif e.status_code == 429:
                # Client-side rate limit. Do not retry immediately; escalate or throttle.
                raise
            else:
                # Other API errors (e.g., 400, 401, 403) should not be retried.
                raise
    raise RuntimeError("Request failed after maximum retry attempts.")
```
This pattern isolates the 529 errors for retries while immediately raising 429 errors, preventing useless loops that waste execution time and resources.
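The handler above adds up to one second of jitter on top of an uncapped `2 ** attempt` delay. A common variant, often called "full jitter", instead draws the whole delay at random below a capped exponential ceiling. The sketch below illustrates that variant; the function name `backoff_delay` and the default `base` and `cap` values are assumptions for illustration, not values taken from any provider's guidance.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds.

    `rng` can be the random module or a random.Random instance (for tests).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Show how the ceiling grows and then flattens at the cap.
for attempt in range(7):
    ceiling = min(30.0, 1.0 * (2 ** attempt))
    print(f"attempt {attempt}: delay drawn from [0, {ceiling:.0f}] s")
```

The cap keeps a long retry sequence from sleeping for minutes, and drawing from the full range spreads clients more widely than a fixed base plus a small offset.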
Designing a Multi-Provider Fallback Architecture
When a platform-wide outage occurs, even the best retry logic will eventually fail. To maintain high availability, applications must implement a fallback strategy.
However, many developers design weak fallback chains that switch to another model within the same provider's ecosystem (for example, falling back from Opus 4.8 to Sonnet 4.6). During a major infrastructure event, this strategy fails because both models share the same underlying API gateway, authentication servers, and network routing layers.
A true high-availability architecture must cross provider boundaries.
| Fallback Strategy | Survives Single-Model Outage? | Survives Platform-Wide Outage? | Trade-offs |
|---|---|---|---|
| Intra-Provider (Opus -> Sonnet) | Yes | No | Low integration effort; identical SDK and formatting. |
| Cross-Provider (Claude -> GPT / Gemini) | Yes | Yes | High integration effort; requires prompt translation and output schema validation. |
Implementing a cross-provider fallback requires abstracting the LLM client behind a unified interface. The application calls that interface, and each provider adapter catches its own provider-specific exceptions so the router can send the payload to an alternative such as OpenAI or Google Gemini. This adds complexity in prompt engineering and output parsing, but of the two strategies above it is the only one that keeps serving requests when a major provider goes dark.
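A minimal sketch of such an interface follows. Everything here is hypothetical scaffolding: `Provider`, `ProviderUnavailable`, and `complete_with_fallback` are illustrative names, and the two stub functions stand in for real adapters that would wrap each vendor's SDK and translate its overload or outage errors into `ProviderUnavailable`.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Raised by an adapter when its provider cannot serve the request."""


@dataclass
class Provider:
    name: str
    complete: Callable[[str], str]  # takes a prompt, returns generated text


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, text) from the first success."""
    errors: List[str] = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{provider.name}: {exc}")
    raise ProviderUnavailable("all providers failed: " + "; ".join(errors))


# Stub adapters for demonstration only; real ones would call vendor SDKs.
def primary_down(prompt: str) -> str:
    raise ProviderUnavailable("529 overloaded")


def secondary_ok(prompt: str) -> str:
    return f"[secondary] {prompt}"


chain = [Provider("primary", primary_down), Provider("secondary", secondary_ok)]
used, text = complete_with_fallback("ping", chain)
print(used, text)
```

Only `ProviderUnavailable` triggers the fallback. Request errors such as a malformed prompt should propagate, since a second provider is unlikely to fix them. In practice, each adapter would also own the prompt translation and output validation noted in the table.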
As AI becomes deeply integrated into production software, developers must treat LLM calls with the same skepticism they reserve for any unreliable third-party service. By implementing strict error triage, jittered backoff, and cross-provider fallbacks, you can build applications that remain online even when the underlying models fail.
Sources & further reading
- Elevated error rate across multiple models — status.claude.com
- Claude is down for many — Anthropic says it's 'investigating' the outage | TechRadar — techradar.com
- Claude Errors Across Many Models: What To Do Now | QWE AI Academy — qwe.edu.pl
- Claude down again? Users report errors as Anthropic confirms “elevated error rate” and investigates service disruption. Here's what users can do if Claude is not working - The Economic Times — economictimes.indiatimes.com
- Elevated Error Rate for ChatGPT and API - OpenAI Status — status.openai.com
Mariana covers the fast-moving world of machine learning and generative AI, with a particular focus on how these technologies are reshaping development workflows. When she isn't stress-testing the latest foundation models, she's usually at a local hackathon.