Claude API Rate Limit Errors During Streaming: How to Diagnose and Fix Them

The request fired. The stream opened. Then, roughly two seconds into a long response, the connection dropped with a 429 Too Many Requests and no visible explanation in the log beyond “rate limit exceeded.” Standard retry logic didn’t help — it just hit the same wall faster.

The assumption built into most retry implementations is that rate limits apply per-request: one call, one token cost, one entry against the limit. Streaming breaks that model entirely. A single streaming connection holds open while tokens generate, which means it keeps consuming your tokens-per-minute (TPM) budget continuously for the duration of the response — not just at the moment the request fires. By the time the 429 surfaces, the TPM ceiling was already hit several seconds earlier. The request-per-minute (RPM) counter looked fine. The TPM counter was already at zero.

That one wrong assumption cascades into every subsequent fix attempt: longer timeouts don’t help, higher concurrency makes it worse, and a blind retry that ignores the reset window just burns through whatever budget has replenished.

What the Rate Limit Headers Are Actually Telling You

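Before any retry logic makes sense, you need to read what Anthropic sends back on every response. The relevant headers are anthropic-ratelimit-requests-limit, anthropic-ratelimit-requests-remaining, anthropic-ratelimit-requests-reset, anthropic-ratelimit-tokens-limit, anthropic-ratelimit-tokens-remaining, and anthropic-ratelimit-tokens-reset. Separate anthropic-ratelimit-input-tokens-* and anthropic-ratelimit-output-tokens-* variants report the input and output budgets individually. The reset values are RFC 3339 timestamps, not durations. On a 429, Anthropic also returns a retry-after header with the number of seconds to wait before the next attempt.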

The critical operational detail: HTTP response headers arrive once, when the stream opens, not with every chunk. They are a snapshot of the budget at that moment, and the remaining-tokens counter keeps falling while this stream and any concurrent ones generate. Record the snapshot from every response and check the most recent one before opening the next stream. If you only react once the 429 fires mid-stream, you are left with partial output, broken state, and a retry that starts from scratch.

Here is a working pattern for parsing rate-limit headers before and during a streaming request:

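import anthropic
import time
from datetime import datetime, timezone

client = anthropic.Anthropic(api_key="your_api_key_here")

def get_rate_limit_state(headers) -> dict:
    """Snapshot Anthropic's rate-limit headers from one HTTP response."""
    # A missing counter defaults to a large value so that "header absent"
    # is not mistaken for "budget exhausted".
    unknown = 10**9
    return {
        "remaining_requests": int(headers.get("anthropic-ratelimit-requests-remaining", unknown)),
        "remaining_tokens": int(headers.get("anthropic-ratelimit-tokens-remaining", unknown)),
        "reset_requests_in": headers.get("anthropic-ratelimit-requests-reset", "0s"),
        "reset_tokens_in": headers.get("anthropic-ratelimit-tokens-reset", "0s"),
        "retry_after": float(headers.get("retry-after", 0)),
    }

def parse_reset_seconds(reset_value: str) -> float:
    """Return the number of seconds until the reset.

    Accepts an RFC 3339 timestamp (the format used by the reset headers)
    or a plain duration such as '1.5s' or '60'.
    """
    try:
        return float(reset_value.rstrip("s"))
    except ValueError:
        # Not a duration: treat it as a timestamp and measure from now (UTC)
        reset_at = datetime.fromisoformat(reset_value.replace("Z", "+00:00"))
        return max(0.0, (reset_at - datetime.now(timezone.utc)).total_seconds())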

# Most recent rate-limit snapshot seen on any response; None until the first call
_last_state = None

def stream_with_header_monitoring(prompt: str, max_tokens: int = 1024):
    global _last_state
    attempt = 0
    base_delay = 1.0
    max_attempts = 5

    while attempt < max_attempts:
        # Pre-flight check: headers only exist once a response has arrived, so
        # compare against the snapshot recorded from the previous response.
        if _last_state is not None and _last_state["remaining_tokens"] < max_tokens:
            wait_seconds = parse_reset_seconds(_last_state["reset_tokens_in"])
            print(f"TPM low ({_last_state['remaining_tokens']} remaining). "
                  f"Waiting {wait_seconds:.1f}s before streaming.")
            time.sleep(wait_seconds + 0.5)
            _last_state = None  # stale after the wait; refreshed by the next response

        try:
            with client.messages.stream(
                model="claude-opus-4-5",
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": prompt}],
            ) as stream:
                # Headers arrive once, when the stream opens; keep them for the next call
                _last_state = get_rate_limit_state(stream.response.headers)

                # Stream the response
                full_text = ""
                for text_chunk in stream.text_stream:
                    full_text += text_chunk
                    print(text_chunk, end="", flush=True)

                return full_text

        except anthropic.RateLimitError as e:
            # Parse retry-after from the error response headers if available.
            # float() rather than int() so a fractional value does not raise.
            retry_after = 0.0
            if getattr(e, "response", None) is not None:
                retry_after = float(e.response.headers.get("retry-after", 0))

            if retry_after > 0:
                wait = retry_after
            else:
                wait = base_delay * (2 ** attempt)  # exponential backoff

            print(f"\n429 on attempt {attempt + 1}. Waiting {wait:.1f}s.")
            time.sleep(wait)
            attempt += 1

        except anthropic.APIStatusError as e:
            # Also retry on 500, 502, 503, 529
            if e.status_code in {500, 502, 503, 529}:
                wait = base_delay * (2 ** attempt)
                print(f"\nServer error {e.status_code}. Waiting {wait:.1f}s.")
                time.sleep(wait)
                attempt += 1
            else:
                raise

    raise RuntimeError(f"Failed after {max_attempts} attempts.")

The pre-flight check (the comparison of remaining_tokens against max_tokens) is the piece most implementations skip. If remaining_tokens is already below the max_tokens value you’re about to request, the stream can open and then fail partway through rather than immediately. Waiting for the token reset before opening the stream avoids most mid-stream 429s.

The Rate Limit Tiers and Why They Matter for Streaming

Anthropic structures rate limits across usage tiers. The numbers below are directional — always verify current values at anthropic.com/pricing and in the API documentation, as tier thresholds change with plan updates.

As of mid-2025, the tier structure roughly follows this pattern across Claude models: Free / Build tier accounts start with conservative RPM limits (roughly 5 RPM on some models) and tight TPM ceilings, making sustained streaming particularly fragile at any meaningful volume. Tier 1 and Tier 2 accounts (reached by spending thresholds) unlock higher RPM and TPM budgets, with Tier 2 offering roughly 10x the token throughput of the entry level. Tier 3 and Tier 4 accounts support production workloads with substantially higher limits, and enterprise agreements can negotiate custom rate limits above published tiers.

The streaming-specific implication: a 2,000-token streaming response at Tier 1 might consume a significant fraction of the per-minute TPM budget in a single call. At high concurrency — even 3 or 4 simultaneous streams — TPM exhaustion happens within seconds, not minutes. RPM looks healthy the entire time because you fired only a few requests. This is why monitoring RPM alone gives a false sense of available capacity during streaming workloads.
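To make that arithmetic concrete, here is a minimal sketch. The function name tpm_headroom and every number in it are hypothetical, chosen only for illustration; they are not published tier limits.

```python
def tpm_headroom(tpm_limit: int, prompt_tokens: int,
                 completion_tokens: int, concurrent_streams: int) -> dict:
    """Estimate how much of a per-minute token budget a burst of streams uses."""
    per_stream = prompt_tokens + completion_tokens
    burst = per_stream * concurrent_streams
    return {
        "tokens_per_stream": per_stream,
        "burst_tokens": burst,
        "fraction_of_budget": burst / tpm_limit,
        "max_streams_per_minute": tpm_limit // per_stream,
    }

# Hypothetical figures: a 20,000 TPM budget and four simultaneous streams,
# each with a 500-token prompt and a 2,000-token response.
estimate = tpm_headroom(tpm_limit=20_000, prompt_tokens=500,
                        completion_tokens=2_000, concurrent_streams=4)
print(estimate)
```

With these made-up numbers, four requests use half the minute's token budget while the request counter has barely moved.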

If you are consistently hitting TPM limits before RPM limits, that is a signal that either the tier needs upgrading or the request batching strategy needs adjustment — not that the retry logic is broken.

The Cascade That Actually Breaks Production

Here is the failure chain as it typically runs in a high-volume streaming setup:

1. Three concurrent streaming requests open simultaneously. Each consumes tokens continuously as the model generates output, drawing down the shared TPM bucket in real time.
2. The TPM ceiling is hit mid-stream on the longest request. Anthropic returns a 429 with a retry-after of roughly 30 seconds.
3. The retry logic, written for standard requests, fires immediately on a fixed 1-second delay, ignoring retry-after. It opens a new stream before the token budget resets.
4. That stream also hits 429 within seconds. Now there are two failed partial responses and a reset timer that has been restarted.
5. The application layer, not seeing a complete response, marks the job as failed and queues a retry. The queue fires all three jobs again simultaneously.

The original 429 was recoverable in about 30 seconds. The chain reaction stretched it into more than a minute of failed jobs, plus a backlog that still had to drain afterward.

The fix is not a smarter retry. It is preventing concurrent streams from sharing a TPM budget blindly. Adding a pre-request token budget check and a concurrency gate — limiting simultaneous open streams to a number that fits within the remaining TPM window — stops the cascade before it starts. For a Tier 2 account processing long-form responses, dropping from 5 concurrent streams to 2 with a 500ms stagger between openings typically eliminates mid-stream 429s without any change to the retry logic itself.
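One way to build such a gate is sketched below using only the standard library. The StreamGate class and the fake_stream stand-in are hypothetical names introduced for illustration; in real use, the callable passed to run() would be your own streaming function.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class StreamGate:
    """Limit simultaneously open streams and space out their start times."""

    def __init__(self, max_concurrent: int = 2, stagger_seconds: float = 0.5):
        self._slots = threading.Semaphore(max_concurrent)
        self._stagger = stagger_seconds
        self._start_lock = threading.Lock()
        self._next_start = 0.0

    def run(self, fn, *args):
        # Block until one of the max_concurrent slots is free
        with self._slots:
            # Reserve a start time at least `stagger_seconds` after the previous one
            with self._start_lock:
                start_at = max(time.monotonic(), self._next_start)
                self._next_start = start_at + self._stagger
            time.sleep(max(0.0, start_at - time.monotonic()))
            return fn(*args)

def fake_stream(prompt: str) -> str:
    """Hypothetical stand-in for a real streaming call."""
    time.sleep(0.05)
    return prompt.upper()

gate = StreamGate(max_concurrent=2, stagger_seconds=0.1)
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lambda p: gate.run(fake_stream, p), ["a", "b", "c"]))
print(results)
```

The worker pool can stay larger than the gate. Extra workers simply wait on the semaphore, so the queue keeps its shape while at most two streams draw on the token budget at once.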

Before and After: What the Retry Implementation Looks Like

Before: Fixed 1-second retry delay, no retry-after parsing, no pre-flight header check, all concurrent streams open simultaneously. Result: mid-stream 429s cascade into queued retries that exhaust the token budget for the full reset window, producing roughly 60–90 seconds of total downtime per incident instead of the 20–30 seconds the reset timer actually required.

After: retry-after header parsed on every 429, pre-flight token check before stream open, exponential backoff starting at 1 second with a ceiling at the reset window, concurrent streams gated to stay within TPM budget. Result: 429s resolve at the minimum reset interval with no cascade, partial responses are discarded cleanly, and queue depth stays flat during rate limit events.
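The wait calculation in the "after" version can be sketched as a single function. The name compute_wait, the 60-second default reset window, and the optional jitter parameter are assumptions added for illustration; jitter is not part of the description above, but it helps keep queued jobs from retrying in lockstep.

```python
import random

def compute_wait(attempt: int, retry_after: float = 0.0,
                 reset_window: float = 60.0, base_delay: float = 1.0,
                 jitter: float = 0.0) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    # Exponential backoff starting at base_delay, capped at the reset window
    backoff = min(base_delay * (2 ** attempt), reset_window)
    # retry-after is the floor: never retry sooner than the server asked
    wait = max(backoff, retry_after)
    # Optional random spread so concurrent jobs do not all wake together
    return wait + random.uniform(0, jitter)

print(compute_wait(0))                  # first retry, no retry-after header
print(compute_wait(0, retry_after=30))  # server asked for 30 seconds
print(compute_wait(10))                 # backoff capped at the reset window
```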

The real ROI here is not faster recovery — it is that the same failure no longer requires manual intervention to drain the retry queue, which was quietly consuming engineering time on every high-volume run.

Troubleshooting Path: From First 429 to Stable Streaming

When a 429 appears during streaming, work through this sequence in order rather than jumping to tier upgrades or architecture changes:

1. Check whether the 429 is RPM or TPM. Log both anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining at the moment of the error. If remaining requests is above zero but remaining tokens is at or near zero, the problem is TPM exhaustion from streaming duration, not request frequency. The fix is batching and pre-flight checks, not reducing request count.
2. Parse retry-after before any retry fires. On a 429, Anthropic returns the number of seconds to wait. If your retry logic uses a fixed delay or standard exponential backoff without checking this header first, it will retry before the budget resets and produce another 429. The retry-after value is the minimum safe wait; use it as the floor, not a suggestion.
3. Add a pre-flight token check before opening each stream. Read anthropic-ratelimit-tokens-remaining and anthropic-ratelimit-tokens-reset from the most recent response before calling client.messages.stream(). If remaining tokens are below your expected max_tokens value, wait for the reset window before opening the stream. This prevents mid-stream failures for the most common TPM exhaustion pattern.
4. Reduce concurrent stream count and add inter-stream stagger. If you are opening multiple streams simultaneously, each one draws from the same TPM bucket. Gate concurrent streams using a semaphore or queue, and add a 300–500ms delay between stream openings. For most Tier 1–2 workloads, 2 concurrent streams with stagger eliminates TPM collisions without requiring a tier upgrade.
5. Evaluate whether a tier upgrade is actually needed. If the pre-flight checks and batching are in place and you are still hitting TPM limits consistently at low concurrency, the workload has outgrown the tier. Check the current tier thresholds at docs.anthropic.com/en/api/rate-limits and compare against your actual TPM consumption per minute. A tier upgrade is the right answer when batching is already optimized, not before.
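The first step of that sequence can be sketched as a small classifier over the headers of the failed response. The function name classify_429 and its return labels are hypothetical; the header names assume Anthropic's documented anthropic-ratelimit-* naming.

```python
def classify_429(headers: dict, low_token_threshold: int = 0) -> str:
    """Label a 429 as token-bound, request-bound, or unknown."""
    def remaining(name: str):
        value = headers.get(name)
        return int(value) if value is not None else None

    requests_left = remaining("anthropic-ratelimit-requests-remaining")
    tokens_left = remaining("anthropic-ratelimit-tokens-remaining")

    # Token exhaustion is checked first: it is the streaming-specific failure
    if tokens_left is not None and tokens_left <= low_token_threshold:
        return "tokens"
    if requests_left is not None and requests_left <= 0:
        return "requests"
    return "unknown"

# Made-up header values for illustration
print(classify_429({
    "anthropic-ratelimit-requests-remaining": "42",
    "anthropic-ratelimit-tokens-remaining": "0",
}))
```

Logging this label next to each 429 makes it obvious over a few incidents whether the workload is bound by tokens or by request count.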

What This Does Not Solve

The header-monitoring and retry pattern above handles the most common streaming 429 scenario — TPM exhaustion during sustained concurrent streaming. It does not address every failure mode.

If you are hitting 529 overloaded errors, the problem is not your rate limit budget. Anthropic returns 529 when the API is under system-wide load, and no amount of header inspection changes that. The correct response is the same exponential backoff, but the cause is external and unpredictable. Pre-flight checks will not help here because the error does not relate to your account’s token budget.

The pattern also does not protect against partial response corruption when a stream drops mid-generation due to a network interruption rather than a rate limit. A mid-stream 429 has a clear retry path: wait, then restart. A dropped TCP connection mid-stream may return a partial response with no error code at all, which means the application layer needs its own completeness check on streamed output — something the retry logic here does not provide.
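A completeness check of that kind could look like the sketch below. It assumes you collect the server-sent event types as the stream runs and read the stop reason from the final message; the helper name is_complete and the choice of which stop reasons count as complete are illustrative policy decisions, not SDK behavior.

```python
from typing import List, Optional

# Assumption: a response cut off by the max_tokens cap is treated as incomplete
COMPLETE_STOP_REASONS = {"end_turn", "stop_sequence", "tool_use"}

def is_complete(event_types: List[str], stop_reason: Optional[str]) -> bool:
    """True only if the stream ended with message_stop and an accepted stop reason."""
    saw_stop_event = bool(event_types) and event_types[-1] == "message_stop"
    return saw_stop_event and stop_reason in COMPLETE_STOP_REASONS

# A stream that finished normally
print(is_complete(["message_start", "content_block_delta", "message_stop"], "end_turn"))
# A connection that dropped before the final events arrived
print(is_complete(["message_start", "content_block_delta"], None))
```

Anything that fails the check can be discarded and requeued instead of being passed downstream as if it were a full response.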

Finally, this approach assumes you control the retry layer. If you are using a third-party wrapper or SDK abstraction that handles retries internally, that layer may be ignoring retry-after or using its own backoff schedule. Check the wrapper’s retry configuration before assuming the code above is the active retry path.

Batching Requests When Streaming Volume Is High

For workloads that need to process many streaming requests in sequence — document processing, bulk generation, pipeline stages — the fastest stable throughput comes from treating the TPM budget as a shared resource to schedule against, not a ceiling to bump into.

A practical batching pattern: before each streaming request, read the anthropic-ratelimit-tokens-remaining value from the most recent response and compare it against the expected token cost of the next request (prompt tokens + expected completion tokens). If remaining tokens are below 1.2x the expected cost, wait for the reset. This 20% buffer absorbs variance in actual token counts without requiring precise estimation.

import anthropic
import time
from datetime import datetime, timezone

client = anthropic.Anthropic(api_key="your_api_key_here")

def estimate_prompt_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def seconds_until(reset_value: str) -> float:
    """Seconds until an RFC 3339 reset timestamp; also accepts '30s' or '30'."""
    try:
        return float(reset_value.rstrip("s"))
    except ValueError:
        reset_at = datetime.fromisoformat(reset_value.replace("Z", "+00:00"))
        return max(0.0, (reset_at - datetime.now(timezone.utc)).total_seconds())

def batch_stream_with_budget_check(
    prompts: list[str],
    max_tokens_per_request: int = 1024,
    safety_buffer: float = 1.2
):
    results = []
    last_headers = {}

    for i, prompt in enumerate(prompts):
        estimated_prompt_tokens = estimate_prompt_tokens(prompt)
        total_estimated = estimated_prompt_tokens + max_tokens_per_request

        # Check budget from last known headers
        if last_headers:
            remaining_tokens = int(last_headers.get("anthropic-ratelimit-tokens-remaining", 99999))
            reset_tokens_in = last_headers.get("anthropic-ratelimit-tokens-reset", "0s")

            if remaining_tokens < (total_estimated * safety_buffer):
                wait_seconds = seconds_until(reset_tokens_in) + 1.0
                print(f"[Request {i+1}] TPM budget low. Waiting {wait_seconds:.1f}s.")
                time.sleep(wait_seconds)

        attempt = 0
        while attempt < 5:
            try:
                with client.messages.stream(
                    model="claude-opus-4-5",
                    max_tokens=max_tokens_per_request,
                    messages=[{"role": "user", "content": prompt}],
                ) as stream:
                    last_headers = dict(stream.response.headers)
                    full_text = stream.get_final_text()
                    results.append(full_text)
                    break

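            except anthropic.RateLimitError as e:
                # Keep the 429's headers so the next budget check sees them
                last_headers = dict(e.response.headers)
                retry_after = float(last_headers.get("retry-after", 0))

                # retry-after is the floor; exponential backoff applies above it
                wait = max(retry_after, 2 ** attempt)
                print(f"[Request {i+1}] 429. Waiting {wait:.1f}s (attempt {attempt+1}).")
                time.sleep(wait)
                attempt += 1
        else:
            # Every attempt hit a 429: fail loudly instead of silently skipping the prompt
            raise RuntimeError(f"Request {i+1} failed after 5 attempts.")

    return results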

The key detail in this pattern is that last_headers carries forward between requests, so the budget check uses the most recent known state rather than making a separate API call. For a batch of 20 requests at Tier 1 TPM limits, this approach typically reduces total wall-clock time compared to blind sequential firing with reactive retries — because it avoids the multi-retry cycles that add up when the budget check is missing.

If the 429s are still appearing after the pre-flight check and retry-after parsing are in place, the next diagnostic step is straightforward: log anthropic-ratelimit-tokens-remaining on every successful stream response and plot it over a one-minute window. If it hits zero before the reset interval on any run with reasonable concurrency, the tier ceiling is the constraint, and that is a different conversation than a code fix.
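Before reaching for a plotting tool, the same question can be answered from the logged values directly. The sketch below assumes you have stored (timestamp, remaining tokens) pairs from each response; summarize_budget is a hypothetical helper and the sample values are made up.

```python
def summarize_budget(samples, window_seconds: float = 60.0) -> dict:
    """Summarize (timestamp_seconds, remaining_tokens) pairs over the latest window."""
    if not samples:
        return {"min_remaining": None, "hit_zero": False, "samples": 0}
    latest = max(t for t, _ in samples)
    # Keep only samples that fall inside the most recent window
    window = [r for t, r in samples if latest - t <= window_seconds]
    return {
        "min_remaining": min(window),
        "hit_zero": min(window) <= 0,
        "samples": len(window),
    }

# Made-up log: the first sample is older than one minute and is ignored
log = [(-100.0, 0), (0.0, 18000), (12.0, 9500), (25.0, 2100), (31.0, 0), (58.0, 4000)]
summary = summarize_budget(log)
print(summary)
```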


The pre-flight token check is the one change that eliminates the mid-stream failure cascade. Everything else — backoff curves, concurrency gates, tier evaluation — is downstream of knowing your remaining budget before the stream opens, not after it fails.