observability · July 1, 2026 · 6 min read

We Found a Runaway LLM Loop in Production — Here's What the Trace Showed

A single Extremis trace showed the same LLM span repeating in a loop. Here's what Peekr's trace view revealed and how to catch unbounded agent loops before your users do.

LLM observability · AI agent loops · trace debugging · Python · Extremis

122 Identical Spans. One Trace. A Classic Runaway Loop.

A single trace from Extremis's production environment contained 124 spans. 122 of them were openai.chat.completions. Same span name, sequential execution, no exit — a textbook unbounded LLM loop running undetected until Peekr flagged it.

Here's exactly what the trace waterfall looked like.

The Trace Waterfall

trace_id: ext-7f3a9c2b-1d4e-4f8a-b3c1-9e2d0a5f7b8c
total_spans: 124
duration: 113,420ms (~1m 53s)

SPAN TREE
├── [0000ms] memory.retrieve                          42ms
├── [0044ms] openai.chat.completions  (1/122)        887ms  ← tool_call: search_memory
├── [0933ms] openai.chat.completions  (2/122)        912ms  ← tool_call: search_memory
├── [1847ms] openai.chat.completions  (3/122)        934ms  ← tool_call: search_memory
├── [2783ms] openai.chat.completions  (4/122)        901ms  ← tool_call: search_memory
│   ...
├── [98201ms] openai.chat.completions (121/122)      878ms  ← tool_call: search_memory
├── [99081ms] openai.chat.completions (122/122)      923ms  ← tool_call: search_memory  ✗ ERROR: max_tokens
└── [100004ms] memory.write                          18ms

avg span duration : 905ms
p95 span duration : 1,459ms
total LLM cost    : ~122 API calls billed
exit condition    : hard token error (not a graceful break)

Every single openai.chat.completions call issued the same search_memory tool call. The model never received a signal to stop. The loop exited only because the final call hit a max_tokens error — not because the application logic caught it.

Why This Happens in Agentic Systems

Agentic frameworks — including memory platforms like Extremis — typically run a think → act → observe loop. The model decides what tool to call, the framework executes it, and the result feeds back as context. In the happy path, the model eventually decides it has enough information and returns a final answer.

The unhappy path looks like this:

# Classic vulnerable agent loop: no guard rails
from openai import OpenAI

client = OpenAI()  # TOOLS and execute_tool() are defined elsewhere in the app

def run_agent(user_query: str):
    messages = [{"role": "user", "content": user_query}]

    while True:                                        # ← no iteration cap
        response = client.chat.completions.create(
            model="claude-3-5-sonnet-20241022",
            messages=messages,
            tools=TOOLS,
        )
        choice = response.choices[0]

        if choice.finish_reason == "stop":
            return choice.message.content              # ← only exit

        # append the assistant's tool call, then the tool's result
        messages.append(choice.message)
        tool_result = execute_tool(response)
        messages.append(tool_result)
        # loop continues: model keeps calling search_memory

Three conditions combine to produce the pattern we saw in Extremis:

  1. The tool returns an inconclusive result. search_memory was returning partial matches. The model interpreted "no definitive answer" as a reason to search again with a slightly rephrased query.
  2. No iteration ceiling. The while True loop has a single exit: finish_reason == "stop". If the model keeps issuing tool calls, there's no second exit.
  3. No idempotency check. The same tool was being called with semantically equivalent queries. Without deduplication, the loop had no natural terminator.

At 905ms average per call, 122 iterations burned roughly 110 seconds of wall time and 122 API calls — all billed, none useful.
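The wall-time figure follows directly from the numbers in the trace summary. A quick back-of-the-envelope check, using only values reported above (the variable names are ours):

```python
# Values copied from the trace summary above.
LLM_SPANS = 122              # openai.chat.completions spans in the trace
AVG_SPAN_MS = 905            # reported average span duration
TRACE_DURATION_MS = 113_420  # reported total trace duration

llm_ms = LLM_SPANS * AVG_SPAN_MS          # time spent inside LLM calls
llm_share = llm_ms / TRACE_DURATION_MS    # fraction of the whole trace

print(f"time in LLM calls: {llm_ms / 1000:.1f}s")
print(f"share of trace   : {llm_share:.1%}")
```

By this estimate, nearly all of the trace's duration was spent repeating the same call.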

The Numbers That Made This Findable

Across 1,000 spans analysed over 30 days, the overall error rate in Extremis was 0.4%. That's low enough that most observability tools would show a green dashboard. The runaway loop produced no error for the first 121 iterations. It looked healthy.

This is why aggregate error rates are the wrong signal for loop detection. The trace contained one logical failure expressed as 122 sequential successes followed by a single hard error. Standard alerting would have fired exactly once, if at all.

What makes it visible is span repetition within a single trace — a signal that's only accessible if you're looking at trace-level structure, not aggregated metrics.
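To make the contrast concrete, here is a minimal sketch that rebuilds the shape of the trace above from plain dictionaries and computes both signals side by side. The dictionary layout is a hypothetical record format chosen for illustration, not Peekr's data model.

```python
from collections import Counter

# Rebuild the shape of the trace described above: 124 spans, one error.
# The dict layout is an illustrative assumption, not a real export format.
spans = [{"name": "memory.retrieve", "error": False}]
spans += [{"name": "openai.chat.completions", "error": False} for _ in range(121)]
spans += [{"name": "openai.chat.completions", "error": True}]  # final max_tokens failure
spans += [{"name": "memory.write", "error": False}]

# Aggregate view: fraction of spans that errored.
error_rate = sum(s["error"] for s in spans) / len(spans)

# Structural view: how often the most common span name repeats in this trace.
top_name, top_count = Counter(s["name"] for s in spans).most_common(1)[0]

print(f"error rate     : {error_rate:.1%}")
print(f"max repetition : {top_count}x {top_name}")
```

Even inside the failing trace the error rate stays under 1%, while the repetition count is hard to miss.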

Catching It with Peekr

Peekr instruments your LLM calls automatically. Once your app is wrapped, detecting loop patterns is a span-counting query against trace structure. Here's a minimal detector you can drop into a CI check or a monitoring script:

import peekr

client = peekr.instrument(your_openai_client)

# --- post-hoc detection against stored traces ---
def find_runaway_loops(project_id: str, threshold: int = 10):
    traces = peekr.traces.list(project_id=project_id, limit=500)

    flagged = []
    for trace in traces:
        span_counts = {}
        for span in trace.spans:
            span_counts[span.name] = span_counts.get(span.name, 0) + 1

        for span_name, count in span_counts.items():
            if count >= threshold:
                flagged.append({
                    "trace_id":   trace.id,
                    "span_name":  span_name,
                    "repetitions": count,
                    "total_spans": len(trace.spans),
                    "duration_ms": trace.duration_ms,
                })

    return flagged

results = find_runaway_loops("extremis-prod", threshold=10)
for r in results:
    print(
        f"CASCADE DETECTED: span '{r['span_name']}' repeated "
        f"{r['repetitions']}× in trace {r['trace_id']} "
        f"({r['total_spans']} total spans, {r['duration_ms']}ms)"
    )
# Output from the Extremis trace corpus:
# CASCADE DETECTED: span 'openai.chat.completions' repeated 122×
#   in trace ext-7f3a9c2b (124 total spans, 113420ms)
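Counting by name alone can also flag traces that legitimately make many LLM calls spread across different steps. One possible refinement, sketched here on plain lists of span names rather than Peekr objects, is to measure the longest run of consecutive identical spans. The function name, the input format, and the tool.execute span name are assumptions for illustration.

```python
from itertools import groupby

def longest_run(span_names):
    """Return (name, length) of the longest run of consecutive identical names."""
    best_name, best_len = None, 0
    for name, group in groupby(span_names):
        run_len = sum(1 for _ in group)
        if run_len > best_len:
            best_name, best_len = name, run_len
    return best_name, best_len

# Span order from the Extremis trace: retrieve, 122 LLM calls, write.
runaway = ["memory.retrieve"] + ["openai.chat.completions"] * 122 + ["memory.write"]

# A hypothetical healthy trace that alternates LLM calls with tool execution.
healthy = ["openai.chat.completions", "tool.execute"] * 6

print(longest_run(runaway))
print(longest_run(healthy))
```

A threshold on run length separates the two cases here, whereas a threshold on the raw count would need to sit above the healthy trace's six LLM calls.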

You can also add a runtime guard so the loop breaks before it costs you 122 API calls:

import peekr

MAX_ITERATIONS = 15

def run_agent_guarded(user_query: str):
    messages = [{"role": "user", "content": user_query}]
    iteration = 0

    with peekr.trace("agent.run") as root_span:
        while iteration < MAX_ITERATIONS:
            iteration += 1

            with peekr.span("openai.chat.completions") as span:
                span.set_attribute("iteration", iteration)

                response = client.chat.completions.create(
                    model="claude-3-5-sonnet-20241022",
                    messages=messages,
                    tools=TOOLS,
                )
                span.set_attribute("finish_reason", response.choices[0].finish_reason)

            choice = response.choices[0]
            if choice.finish_reason == "stop":
                return choice.message.content

            # record the assistant's tool call before the tool result
            messages.append(choice.message)
            tool_result = execute_tool(response)
            messages.append(tool_result)

        # explicit cap hit — surface it
        root_span.set_attribute("loop.capped", True)
        root_span.set_attribute("loop.iterations", iteration)
        raise RuntimeError(f"Agent loop exceeded {MAX_ITERATIONS} iterations")

The loop.capped attribute becomes queryable in Peekr. You can filter traces where that attribute is True and build a dashboard panel showing how often your agents are hitting the ceiling — which is a separate, useful signal from the runaway case.

The Fix in Extremis

The Extremis team applied two changes:

  1. Iteration cap at 15. Anything beyond that raises a structured exception, logs the trace ID, and returns a degraded response to the user rather than timing out.
  2. Tool call deduplication. Before executing search_memory, the agent now checks whether the same query (normalised) was issued in the last three turns. If it was, it injects a system message: "You have already searched for this. Summarise what you found and respond." This nudges the model toward finish_reason: stop without requiring an external guard.
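A minimal sketch of the deduplication check in the second change. The normalisation rule (lowercase, strip punctuation, collapse whitespace), the helper names, and the way the three-turn window is stored are all assumptions; the post only states that queries are normalised and compared against the last three turns.

```python
import re
from collections import deque

NUDGE = ("You have already searched for this. "
         "Summarise what you found and respond.")

def normalise(query: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rule)."""
    cleaned = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(cleaned.split())

class RecentQueries:
    """Remembers the normalised queries from the last `window` tool calls."""

    def __init__(self, window: int = 3):
        self._recent = deque(maxlen=window)

    def is_duplicate(self, query: str) -> bool:
        key = normalise(query)
        seen = key in self._recent
        self._recent.append(key)
        return seen

recent = RecentQueries(window=3)
executed, messages = [], []

for query in ["User's favourite editor?", "users favourite editor", "preferred IDE"]:
    if recent.is_duplicate(query):
        # Skip the tool call and nudge the model toward a final answer.
        messages.append({"role": "system", "content": NUDGE})
    else:
        executed.append(query)  # stand-in for actually calling search_memory

print(executed)
print(len(messages))
```

Exact matching after normalisation only catches near-identical wording. Genuine rephrasings would slip through, which is one reason the iteration cap remains necessary as a backstop.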

Neither fix is exotic. Both were straightforward to implement once the trace made the problem legible. The cascade was invisible in logs, invisible in aggregate metrics, and completely obvious in the trace waterfall.

What to Watch For

A few patterns that predict this failure mode:

  • Agentic while True loops with a single exit condition tied to model output
  • Memory or retrieval tools that can return partial or empty results (the model treats ambiguity as a prompt to retry)
  • No tool-call idempotency — the same tool callable unlimited times with no deduplication
  • p95 trace durations that are an order of magnitude above p50 (in Extremis, the loop trace was ~113s against a median closer to 2–3s)
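The last signal in the list is easy to check from a list of trace durations. The sketch below uses Python's statistics module on synthetic data; the function name, the factor of 10, and every duration except the 113,420ms loop trace are illustrative assumptions.

```python
from statistics import quantiles

def tail_ratio(durations_ms):
    """Return (p50, p95, p95 / p50) for a list of trace durations."""
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]   # 99 cut points: index i is percentile i + 1
    return p50, p95, p95 / p50

def tail_is_suspicious(durations_ms, factor=10):
    """True when p95 is at least `factor` times the median."""
    return tail_ratio(durations_ms)[2] >= factor

# Synthetic sample: 20 healthy traces between 2.0s and 2.95s.
healthy = [2000 + 50 * i for i in range(20)]

# Same sample with two slow traces swapped in (one value taken from the trace above).
with_loops = healthy[:18] + [98_000, 113_420]

print(tail_is_suspicious(healthy))
print(tail_is_suspicious(with_loops))
```

Note that a single slow trace in a small sample may not move p95 far enough to cross the threshold, so this check complements per-trace repetition counting rather than replacing it.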

The 0.4% overall error rate looked fine. The 905ms average span duration looked fine. It was the 122× repetition of a single span within one trace that told the real story — and that's a query you can only run if you have structured trace data.


See the full Extremis trace replay and run the loop-detection query yourself at peekr.starkspherelabs.com/demo.
