The best first user for an observability tool is your own product. So we instrumented Extremis — our agent memory system — with Peekr, let it run, and went looking for the ugliest trace we could find.
We found one fast. A single rag.answer call had produced a trace with 124 spans, 122 of them the same LLM operation. One question from one user had fanned out into 122 openai.chat.completions requests. In the worst trace, it was 142.
Nothing was on fire. No errors. The answer came back. Cost for the trace was a few cents. Every dashboard we had was green.
What the trace showed
Trace — rag.answer · 124 spans
openai.chat.completions 612ms
openai.chat.completions 588ms
openai.chat.completions 631ms
openai.chat.completions 604ms
openai.chat.completions 599ms
...
(the same call, 122 times, for one answer)
There's no staircase to spot, no red span to click. Just the same call, over and over, until the request finally returned. If you were scrolling the waterfall you'd give up before you reached the bottom.
Why nothing flagged it
This is the part that bothered us. We had detectors. None of them fired:
- Retry-storm? No. Retry detection keys on failures. All 122 calls returned 200 OK.
- N+1 / duplicate calls? No. That looks for the same prompt sent repeatedly. These prompts were all different (one per retrieved chunk).
- Sequential / "parallelize this"? It nibbled at the edges, but its advice — run your 122 calls concurrently — is the wrong fix. Making 122 calls finish faster is not the same as not making 122 calls.
Each individual call was reasonable. The defect lived in the shape of the call graph, not in any single span — and shape is exactly what error rates, latency percentiles, and cost-per-call aggregates throw away. A fan-out of 122 distinct, successful, differently-prompted calls is a blind spot for every metric we'd normally trust.
To make it worse, these spans arrived flat — no parent links — so the trace wasn't even a tree you could collapse. It was 122 siblings in a heap.
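To make the blind spot concrete, here is a toy version of the first two checks run against a trace shaped like ours. The `Span` fields, function names, and thresholds are hypothetical simplifications for illustration, not Peekr's actual detectors.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    status: int  # HTTP status of the call
    prompt: str


def looks_like_retry_storm(spans, min_failures=3):
    """Toy retry check: it only counts calls that failed."""
    failures = [s for s in spans if s.status >= 400]
    return len(failures) >= min_failures


def looks_like_duplicate_calls(spans, min_repeats=3):
    """Toy N+1 check: it only counts identical prompts."""
    seen = {}
    for s in spans:
        seen[s.prompt] = seen.get(s.prompt, 0) + 1
    return any(n >= min_repeats for n in seen.values())


# 122 successful calls, one distinct prompt per retrieved chunk
spans = [
    Span("openai.chat.completions", 200, f"rank chunk {i}") for i in range(122)
]

retry_flag = looks_like_retry_storm(spans)
duplicate_flag = looks_like_duplicate_calls(spans)
print(retry_flag, duplicate_flag)  # False False
```

Neither check ever looks at how many calls there were in total, which is the only thing wrong with this trace.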
What Peekr shows now
We built a detector for exactly this class of bug — one that reads the topology of a trace, not its timing. It groups a trace's calls by operation and flags when one launcher floods a single call type, flat spans or not:
⚠ CASCADE · critical
"rag.answer" fans out into 122 "openai.chat.completions" calls per trace
Seen in 12 traces in 24h · up to 142 in one trace
Fix: batch the per-item calls, or cap the loop. 122 completions for a
single answer is a retrieval/ranking problem, not a model problem.
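A minimal sketch of the idea behind that alert, assuming a trace is available as a root operation name plus a flat list of span names. The function name, the threshold of 20, and the `vector.search` span are assumptions made for the example; this is not Peekr's implementation.

```python
from collections import Counter


def detect_fanout(root, span_names, threshold=20):
    """Count calls per operation in one trace and flag any that flood it.

    Works on flat spans: it needs only the names, not parent links.
    """
    counts = Counter(name for name in span_names if name != root)
    return [
        {"launcher": root, "operation": op, "calls": n}
        for op, n in counts.most_common()
        if n >= threshold
    ]


# A trace shaped like ours: 124 spans, 122 of them the same operation.
# "vector.search" is a hypothetical stand-in for the remaining span.
trace = ["rag.answer", "vector.search"] + ["openai.chat.completions"] * 122

findings = detect_fanout("rag.answer", trace)
print(findings)
```

Because it counts by operation name rather than by prompt or status, 122 distinct, successful calls still add up to one finding.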
The same detector catches the other shape of this bug — an operation that appears inside itself (a re-entrant cascade, e.g. a consolidation step that writes memories which trigger more consolidation). Both are the same failure mode: a call graph eating its own tail.
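The re-entrant shape can be sketched the same way. Unlike the fan-out check, this version assumes spans carry parent links (which our flat trace did not), and the span names and dict layout are hypothetical.

```python
def find_reentrant(spans):
    """Return the names of operations that appear among their own ancestors."""
    by_id = {s["id"]: s for s in spans}
    reentrant = set()
    for span in spans:
        visited = set()  # guards against malformed, cyclic parent links
        parent_id = span.get("parent_id")
        while parent_id in by_id and parent_id not in visited:
            visited.add(parent_id)
            parent = by_id[parent_id]
            if parent["name"] == span["name"]:
                reentrant.add(span["name"])
                break
            parent_id = parent.get("parent_id")
    return reentrant


# Consolidation writes a memory, and the write triggers consolidation again
spans = [
    {"id": 1, "parent_id": None, "name": "memory.consolidate"},
    {"id": 2, "parent_id": 1, "name": "memory.write"},
    {"id": 3, "parent_id": 2, "name": "memory.consolidate"},
]

reentrant = find_reentrant(spans)
print(reentrant)  # {'memory.consolidate'}
```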
The actual bug
rag.answer was scoring relevance with one LLM call per retrieved chunk:
# Before: one completion per chunk, unbounded by how much we retrieved
scored = []
for chunk in retrieved:  # 122 chunks → 122 calls
    score = client.chat.completions.create(
        messages=[{"role": "user", "content": rank_prompt(chunk, query)}],
    )
    scored.append((chunk, parse_score(score)))
Clean in review. Works in tests with a handful of chunks. Then retrieval gets generous, len(retrieved) climbs to 122, and one answer quietly becomes 122 round-trips to OpenAI. At any real traffic, that's the line that rate-limits your whole app.
The fix
Batch what can be batched, cap what can't:
# After: one ranking call over the top-K, with a hard cap
TOP_K = 12
top = retrieved[:TOP_K]  # don't rank what you won't use
ranked = client.chat.completions.create(
    messages=[{"role": "user", "content": rank_many(top, query)}],
)  # one call ranks all K
scored = parse_scores(ranked, top)
If each item genuinely needs its own completion, the real question isn't "how do I make 122 calls faster" — it's "why am I ranking 122 chunks to answer one question?" The cap is usually the fix.
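The rank_many and parse_scores helpers aren't shown in the snippet. One way they might look, assuming chunks are plain strings and the model is asked to reply with a JSON array of scores; the prompt wording and the length check are illustrative choices, and the fake response only mimics the `choices[0].message.content` shape of a chat completion so the sketch runs offline.

```python
import json
from types import SimpleNamespace


def rank_many(chunks, query):
    """Build one prompt that asks for a score per numbered chunk."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        f"Question: {query}\n\n"
        f"Chunks:\n{numbered}\n\n"
        "Return only a JSON array of relevance scores from 0 to 1, "
        "one per chunk, in the same order."
    )


def parse_scores(response, chunks):
    """Pair each chunk with its score; reject a reply of the wrong length."""
    scores = json.loads(response.choices[0].message.content)
    if len(scores) != len(chunks):
        raise ValueError(f"expected {len(chunks)} scores, got {len(scores)}")
    return list(zip(chunks, (float(s) for s in scores)))


# Stand-in for a chat completion response
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="[0.9, 0.2]"))]
)
top = ["Peekr groups calls by operation.", "Unrelated chunk."]

prompt = rank_many(top, "How does Peekr find cascades?")
scored = parse_scores(fake, top)
print(scored)
```

The length check matters more in the batched version than it did per chunk: one malformed reply now affects every score, so it is safer to fail loudly than to zip mismatched lists.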
The impact
| Metric | Before | After |
|---|---|---|
| LLM calls per answer | 122 | 1 |
| Model time per answer | ~75s of round-trips | ~0.6s |
| Rate-limit headroom | one user can 429 you | comfortable |
| Cost per answer | 122× | 1× |
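
The "Before" model time can be sanity-checked from the trace itself, assuming the five spans shown earlier are representative and the 122 calls ran one after another:

```python
durations_ms = [612, 588, 631, 604, 599]  # the five spans shown in the trace
mean_ms = sum(durations_ms) / len(durations_ms)

# Sequential round-trips: total time is simply calls × mean latency
sequential_s = 122 * mean_ms / 1000

print(f"{mean_ms:.0f} ms per call, {sequential_s:.0f} s for 122 calls")
# 607 ms per call, 74 s for 122 calls
```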
The lesson
Our 279-second bug last month was a timing problem — a sequential loop you could see as a staircase. This one is different: every call was fast, cheap, and successful. The bug was that there were 122 of them, and nothing in a normal observability stack counts that.
Agent and memory systems fail in the topology of their call graph — fan-outs, re-entrant loops, traversal explosions — not in any single call. Those failures are invisible to error rates and cost alerts by construction, because each call passes every check. You only catch them by looking at the shape.
That's the detector we shipped, and the reason we pointed it at ourselves first. If you're running an AI agent that retrieves, ranks, or remembers anything, go count the calls in one trace. The number is usually higher than you think.
Peekr traces every LLM call and surfaces the cascades hiding in your call graph — in-process, two lines of Python: peekr.starkspherelabs.com/observability