observability · June 11, 2026 · 6 min read

We Found a Runaway LLM Loop in Production — Here's What the Trace Showed

A single Extremis trace showed the same LLM span repeating in a loop. Here's what Peekr's trace view revealed and how to catch unbounded agent loops before your users do.

LLM observability · AI agent loops · trace debugging · Python · Extremis

122 Identical Spans. One Trace. A Loop Nobody Meant to Write.

In 1,000 traces collected from Extremis over 30 days, one stood out immediately: a single trace containing 124 spans, 122 of which were openai.chat.completions, the same span firing sequentially, over and over. At an average duration of 1,085ms per call and a p95 of 2,172ms, this single runaway trace burned through what should have been dozens of separate user sessions' worth of LLM budget.

Here's what it looked like.

The Trace Waterfall

trace_id: a3f9c2d1-884e-4b02-b71a-cc3391fde208
total_spans: 124
duration: 138.2s

[0000ms] root                          : memory.retrieval_pipeline      238ms
[0238ms]   ├─ openai.chat.completions  : classify_intent                1041ms
[1279ms]   ├─ openai.chat.completions  : classify_intent                1193ms  ← loop begins
[2472ms]   ├─ openai.chat.completions  : classify_intent                 998ms
[3470ms]   ├─ openai.chat.completions  : classify_intent                2172ms  ← p95 spike
[5642ms]   ├─ openai.chat.completions  : classify_intent                1102ms
[6744ms]   ├─ openai.chat.completions  : classify_intent                1087ms
[7831ms]   ├─ openai.chat.completions  : classify_intent                 944ms
...        │  (repeats 114 more times)
[136.8s]   ├─ openai.chat.completions  : classify_intent                1055ms
[137.9s]   └─ memory.store             : write_result                    287ms

The memory.store span at the end tells you the agent did eventually complete — it wasn't an infinite loop in the strict sense. It was a bounded-but-completely-unguarded retry or re-evaluation loop that ran to exhaustion before finally writing a result. The pipeline technically "succeeded." The overall error rate for the dataset was just 0.5% — so this trace didn't even raise an alert on its own.
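As a rough sanity check on those figures, the sketch below multiplies the 122 calls by the reported 1,085ms average and compares the result with the 138.2s trace duration. It assumes the reported average applies to the calls in this particular trace, which the numbers above suggest but don't strictly confirm.

```python
# Figures quoted in this post. Assumption: the 1,085ms average applies
# to the LLM calls in this specific trace.
LLM_CALLS = 122
AVG_CALL_MS = 1_085
TRACE_DURATION_S = 138.2


def llm_time_share(calls: int, avg_ms: float, trace_s: float) -> tuple[float, float]:
    """Return (seconds spent in LLM calls, fraction of the trace duration)."""
    llm_s = calls * avg_ms / 1000
    return llm_s, llm_s / trace_s


llm_seconds, share = llm_time_share(LLM_CALLS, AVG_CALL_MS, TRACE_DURATION_S)
print(f"{llm_seconds:.1f}s in LLM calls, {share:.0%} of the trace")
```

Under that assumption, roughly 132 of the 138 seconds went to the repeated classify_intent call, which is why the retrieval and store spans barely register in the waterfall.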

That's the danger. It's not a crash. It's quiet, expensive waste.

Why This Happens in Agentic Systems

Agentic pipelines break the clean request-response model. When you give an LLM tools to call, memory to read, and a goal to pursue, you get control flow that looks less like a function and more like a while loop with an LLM as the loop condition.

The three failure modes that produce this pattern:

1. Tool-call loops with no termination signal

while True:
    response = llm.chat(messages=messages, tools=tools)
    if response.tool_calls:
        results = execute_tools(response.tool_calls)
        messages.append(results)
    else:
        break  # ← only exits if the LLM returns NO tool call

If the model keeps deciding it needs to call a tool — maybe because the tool result is ambiguous, or the prompt doesn't clearly signal "you have enough information now" — this loop never terminates naturally. It will hit a token limit, a timeout, or your wallet ceiling.
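One way to bound this pattern is a hard step budget that sits outside the model's control. The sketch below is a minimal illustration, not Extremis code: `run_tool_loop`, `call_model`, and `execute_tools` are hypothetical names, and the dict-shaped response is an assumption standing in for whatever your LLM client returns.

```python
from typing import Callable


def run_tool_loop(
    call_model: Callable[[list], dict],
    execute_tools: Callable[[list], list],
    messages: list,
    max_steps: int = 8,
) -> tuple[dict, bool]:
    """Run a tool-calling loop with a hard step budget.

    Returns (last_response, hit_ceiling). `call_model` and `execute_tools`
    are hypothetical stand-ins for your own LLM client and tool runner.
    """
    response: dict = {}
    for _ in range(max_steps):
        response = call_model(messages)
        tool_calls = response.get("tool_calls") or []
        if not tool_calls:
            return response, False  # model signalled it is done
        messages.extend(execute_tools(tool_calls))
    return response, True  # budget exhausted; the caller decides what to do


# Usage with a fake model that never stops asking for tools.
def greedy_model(messages: list) -> dict:
    return {"tool_calls": [{"name": "search", "args": {"q": "x"}}]}


def fake_tools(tool_calls: list) -> list:
    return [{"role": "tool", "content": "ambiguous"} for _ in tool_calls]


history: list = []
_, hit_ceiling = run_tool_loop(greedy_model, fake_tools, history, max_steps=3)
print(hit_ceiling, len(history))
```

Returning the `hit_ceiling` flag instead of raising keeps the decision (fail, fall back, or accept a partial answer) with the caller.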

2. Re-classification on uncertain output

confidence = None
while confidence is None or confidence < THRESHOLD:
    result = classify_intent(user_input)
    confidence = result.get("confidence")
    # THRESHOLD is 0.95; model keeps returning 0.87–0.91

A confidence threshold that the model can never reliably cross. The developer tested this with GPT-4 in a notebook; the deployed path hits a different model that scores lower. The loop retries 122 times before an outer timeout fires.

3. Missing break conditions in memory consolidation

Extremis specifically manages long-term memory — which means it runs consolidation passes to merge, deduplicate, and re-rank stored memories. If a consolidation step is checking whether memories are "sufficiently resolved" by asking the LLM, and the LLM's resolution criteria aren't deterministic, you get oscillation. Memory A gets merged into B; next pass, B looks like it should be split back into A and something else. Loop.
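A common guard against this kind of oscillation is to fingerprint each memory state and stop as soon as a pass produces a state you've already seen. The following is a sketch under assumptions: memories are modelled as plain JSON-serialisable dicts, and `consolidate_once` is a hypothetical callable standing in for one LLM-driven consolidation pass. Neither reflects Extremis's actual internals.

```python
import hashlib
import json
from typing import Callable


def fingerprint(memories: list[dict]) -> str:
    """Stable hash of a memory set, independent of list order."""
    canonical = sorted(json.dumps(m, sort_keys=True) for m in memories)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()


def consolidate_until_stable(
    memories: list[dict],
    consolidate_once: Callable[[list[dict]], list[dict]],
    max_passes: int = 5,
) -> tuple[list[dict], str]:
    """Repeat consolidation until a fixed point, a revisited state, or the pass limit.

    Returns (memories, reason); reason is "stable", "oscillation", or "max_passes".
    """
    seen = {fingerprint(memories)}
    for _ in range(max_passes):
        updated = consolidate_once(memories)
        fp = fingerprint(updated)
        if fp == fingerprint(memories):
            return updated, "stable"  # nothing changed: fixed point reached
        if fp in seen:
            return memories, "oscillation"  # state seen before: stop here
        seen.add(fp)
        memories = updated
    return memories, "max_passes"


# Usage: a consolidator that merges A into B, then splits B back into A.
A = [{"id": "a", "text": "likes tea"}, {"id": "b", "text": "likes green tea"}]
B = [{"id": "ab", "text": "likes tea, especially green tea"}]


def flip_flop(mems: list[dict]) -> list[dict]:
    return B if fingerprint(mems) == fingerprint(A) else A


final, reason = consolidate_until_stable(A, flip_flop)
print(reason)
```

The pass limit stays in place as a backstop, since a consolidator can also drift through new states without ever repeating one.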

What Peekr Caught — And How to Do It Yourself

The trace above was flagged by a cascade detection check in Peekr that counts repeated span names within a single trace. Here's the exact logic:

import peekr
from collections import Counter

sdk = peekr.Client(api_key="...")

# Pull all spans for a given trace
trace = sdk.traces.get("a3f9c2d1-884e-4b02-b71a-cc3391fde208")
spans = trace.spans

# Count repetitions by span name
span_name_counts = Counter(span.name for span in spans)

# Flag any span that repeats beyond a threshold
LOOP_THRESHOLD = 10

for span_name, count in span_name_counts.items():
    if count > LOOP_THRESHOLD:
        print(f"CASCADE DETECTED: '{span_name}' repeated {count}× in trace {trace.id}")
        print(f"  Total spans in trace: {len(spans)}")
        # estimated_cost_per_call() is your own pricing helper (not defined in this snippet)
        print(f"  Approximate wasted cost: {count * estimated_cost_per_call(span_name):.4f} USD")

Output for this trace:

CASCADE DETECTED: 'openai.chat.completions' repeated 122× in trace a3f9c2d1...
  Total spans in trace: 124
  Approximate wasted cost: 0.4392 USD

You can run this as a post-trace hook or as a scheduled audit query across recent traces. For production use, attach it to Peekr's trace export stream so it fires in near-real-time:

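@peekr.on_trace_complete
def audit_for_loops(trace):
    counts = Counter(s.name for s in trace.spans)
    if not counts:
        return  # trace has no spans; nothing to audit
    name, count = counts.most_common(1)[0]
    if count > LOOP_THRESHOLD:
        # alert() is your own notification hook (pager, Slack, etc.)
        alert(f"Loop detected: {name} × {count} in {trace.id}")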
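A total count can also flag traces that legitimately call the same span many times, for example a fan-out over many documents. A stricter signal is the longest run of consecutive identical span names. The sketch below works on a plain list of names and makes no Peekr calls. It assumes you've already extracted the names in start-time order, and `longest_run` is a hypothetical helper name.

```python
from itertools import groupby


def longest_run(span_names: list[str]) -> tuple[str, int]:
    """Return (name, length) of the longest run of consecutive identical span names."""
    best_name, best_len = "", 0
    for name, group in groupby(span_names):
        run = sum(1 for _ in group)
        if run > best_len:
            best_name, best_len = name, run
    return best_name, best_len


# Span names in start-time order, shaped like the trace in this post.
names = ["root"] + ["openai.chat.completions"] * 122 + ["memory.store"]
name, run = longest_run(names)
print(name, run)
```

A run-length threshold can be set lower than a total-count threshold, since a handful of back-to-back identical LLM calls is already unusual for most pipelines.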

The Fix: Guard Every Loop at the Call Site

Detecting is step one. The actual fix lives in the agent code. Whatever loop pattern caused this, the structural remedy is the same: make the exit condition independent of the LLM's output.

import logging
import time

logger = logging.getLogger(__name__)

MAX_ITERATIONS = 5  # non-negotiable ceiling
best = None  # highest-confidence response seen so far

for iteration in range(1, MAX_ITERATIONS + 1):
    response = classify_intent(user_input)
    if best is None or response.confidence > best.confidence:
        best = response

    if best.confidence >= THRESHOLD:
        break

    # Optional: back off between retries (no sleep after the final attempt)
    if iteration < MAX_ITERATIONS:
        time.sleep(0.5 * iteration)
else:
    # Ceiling hit without crossing THRESHOLD: accept the best result, log the shortfall
    logger.warning(
        "classify_intent hit iteration ceiling",
        extra={"final_confidence": best.confidence, "trace_id": current_trace_id()}
    )

result = best

Three principles embedded here:

  • Hard ceiling on iterations — never trust the LLM to decide when to stop.
  • Accept degraded output rather than looping to perfection. Log the shortfall.
  • Emit a warning tagged with the trace ID so the ceiling hit can be tied back to its trace in Peekr as a named event, not just a silent exit.

What the 30-Day Numbers Tell You

Across the 30-day dataset, the overall error rate was 0.5%. Clean, right? But error rate counts failures: spans that threw exceptions or returned error codes. A loop that completes is not an error by that definition. It's a success that cost 50× what it should have.

This is why error rate alone is an insufficient health metric for agentic systems. The questions you actually need to answer:

  • What is the span repetition distribution per trace?
  • What percentage of your LLM cost comes from traces in the top 1% by span count?
  • How often does a trace exceed 2× the median span count for its pipeline type?
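These questions can be answered from exported trace summaries with a few lines of standard-library Python. The sketch below is illustrative only: the `span_count` and `llm_cost_usd` keys are hypothetical field names, and the data is synthetic, built to resemble one runaway trace among 99 ordinary ones.

```python
import math
from statistics import median


def span_count_report(traces: list[dict]) -> dict:
    """Summarise span-count skew across traces.

    Each trace is a dict with hypothetical keys "span_count" and "llm_cost_usd".
    """
    if not traces:
        return {"median_spans": 0, "over_2x_median": 0.0, "top_1pct_cost_share": 0.0}

    counts = [t["span_count"] for t in traces]
    med = median(counts)
    # Fraction of traces whose span count exceeds twice the median.
    over = sum(1 for c in counts if c > 2 * med) / len(traces)

    # Top 1% of traces by span count (always at least one trace).
    k = max(1, math.ceil(len(traces) / 100))
    top = sorted(traces, key=lambda t: t["span_count"], reverse=True)[:k]
    total_cost = sum(t["llm_cost_usd"] for t in traces)
    share = sum(t["llm_cost_usd"] for t in top) / total_cost if total_cost else 0.0

    return {"median_spans": med, "over_2x_median": over, "top_1pct_cost_share": share}


# Synthetic data: 99 ordinary traces plus one shaped like the runaway trace.
traces = [{"span_count": 4, "llm_cost_usd": 0.0072} for _ in range(99)]
traces.append({"span_count": 124, "llm_cost_usd": 0.4392})
report = span_count_report(traces)
print(report)
```

In this synthetic set, a single trace out of 100 accounts for roughly 38% of total LLM cost, which is the kind of skew an error-rate dashboard never shows.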

The Extremis trace would have been invisible on a standard error dashboard. It only surfaced because Peekr's trace structure — capturing parent-child relationships, span names, and sequential ordering — made the repetition pattern machine-readable.

Catching It Before It Ships

The best time to catch this is in staging, using real traces. Add a loop-detection assertion to your integration tests:

from collections import Counter

def test_classify_intent_does_not_loop(peekr_test_client):
    with peekr_test_client.trace("test_classify_intent") as trace:
        result = memory_pipeline.run(sample_input)
    
    span_counts = Counter(s.name for s in trace.spans)
    assert span_counts["openai.chat.completions"] <= 3, (
        f"classify_intent looped {span_counts['openai.chat.completions']}× — "
        f"add a MAX_ITERATIONS guard"
    )

This won't catch every case — staging inputs are limited — but it will catch the structural patterns: any code path where the LLM is the loop condition with no hard ceiling.


122 LLM calls where 1 should have sufficed. No exception raised, no alert fired, no user complaint. The only thing that found it was counting span names in a trace.

See what your own traces look like at the Peekr demo.
