observability · June 11, 2026 · 5 min read

Peekr Was Tracing Its Own Evaluation Calls

A single Extremis trace showed 38 identical LLM calls, 99 seconds, and $0.55 spent producing empty results. Peekr was evaluating its own judge. Root cause and fix in peekr 0.9.3.

LLM observability · eval debugging · AI agent tracing · Python · self-instrumentation

38 openai.chat.completions spans. 99.2 seconds. $0.55. Every single output: {"claims": []}.

This is trace 47c55e733e8e48268c0c8c4df80959f4 from Extremis, an AI memory platform built on Claude. The trace was marked Success. No errors. No alerts. Just 38 sequential LLM calls that produced zero signal — and a bill for the privilege.

Here's what happened, why it was invisible, and the fix that shipped in peekr 0.9.3.

What the Trace Showed

The spans arrived flat — no parent_id, all attributed to the trace's root operation. Every call was identical: the same RAGAS Faithfulness prompt, the same gpt-4o-mini model, roughly the same token count (~441 tokens each), and the same output:

Span: openai.chat.completions
Model: gpt-4o-mini
Tokens: 441 (input) / ~10 (output)
Duration: 600ms–1.4s
Status: ok

Output: {"claims": []}

38 times. In a row. In one trace.

The total: 16,700 tokens, $0.55, 99.2 seconds — all to produce 38 empty arrays.

Why It Was Silent

No exception was raised. The judge returned valid JSON each time. {"claims": []} is a legal response to the RAGAS Faithfulness prompt — it means "I found no factual claims in this output." The status was ok. Every individual span looked healthy.

On a standard error dashboard, this trace would be invisible. Error rate: 0%. Latency: high, but not an outlier if you're not normalizing by output value. Cost: $0.55 on a single trace — painful at scale, invisible on a Tuesday.

The only thing that surfaced it was counting repeated span names within a single trace — the same cascade detection logic that caught the classify_intent loop in our previous post.
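That counting step can be sketched in a few lines. This is an illustration, not peekr's detection code: the span records are plain dicts with hypothetical `trace_id`, `name`, and `output` keys, and the threshold of 10 is an assumed default. The sketch also groups by output, which is slightly stricter than counting span names alone.

```python
from collections import Counter


def find_repeated_spans(spans, min_repeats=10):
    """Return (trace_id, span_name, output, count) for suspicious repeats.

    `spans` is a list of plain dicts (a stand-in for real span objects).
    A group is flagged when the same span name produced the same output
    at least `min_repeats` times inside one trace.
    """
    counts = Counter((s["trace_id"], s["name"], s["output"]) for s in spans)
    return [
        (trace_id, name, output, n)
        for (trace_id, name, output), n in counts.items()
        if n >= min_repeats
    ]


# Synthetic data shaped like the trace described above.
spans = [
    {"trace_id": "t1", "name": "openai.chat.completions", "output": '{"claims": []}'}
    for _ in range(38)
]
spans.append({"trace_id": "t1", "name": "recall_memory", "output": "ok"})

flagged = find_repeated_spans(spans)
```

Grouping by output as well as name keeps legitimately chatty agents (many calls, varied outputs) from being flagged.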

The Root Cause

Extremis instruments its LLM calls with peekr like this:

import peekr

peekr.init(
    api_key="...",
    evaluators=[peekr.eval.Hallucination(detailed=True)],
)

Hallucination(detailed=True) runs RAGAS-style claim decomposition on every LLM span after it completes. It calls gpt-4o-mini to split the output into atomic claims and verify each one against the context.

The problem: those judge calls go through the standard openai.chat.completions.create — which peekr has patched. So peekr was tracing its own evaluation calls.

The patch had a partial guard. In eval/__init__.py, a ContextVar called _in_eval is set to True while evaluators run:

from contextvars import ContextVar

_in_eval: ContextVar[bool] = ContextVar("_in_eval", default=False)

# In EvalExporter.export():
if _in_eval.get():
    return  # stops re-evaluation of spans created while evaluators run

This correctly prevented infinite recursion in the evaluator. When peekr's judge call triggered a new span export, _in_eval was True, so EvalExporter.export() returned early — the judge spans were never re-evaluated.

But they were still created and stored. openai_patch.py called _mark_eval_span(), which tagged the span with peekr.internal = True, but only the EvalExporter checked that tag; the storage exporter did not. The spans flowed straight into peekr-cloud's Supabase backend and appeared in the traces dashboard as legitimate application spans.

To make it weirder: by the time the spans were stored, their output field contained the judge's response, either {"claims": []} or a float like 0.0. Those outputs look like JSON blobs, which peekr's looks_like_tool_call() heuristic would normally catch and skip. But the re-evaluation guard fired before that output check ever ran, so the judge spans were skipped for evaluation for the right reason, and nothing in that path prevented them from being stored.

The result: 38 ghost spans per trace, all correctly skipping re-evaluation, all incorrectly being stored and billed.

The Fix

Seven lines in openai_patch.py and the same in anthropic_patch.py. When _in_eval is True, call the original function directly — no span, no storage, no eval:

def _make_chat_patch(original):
    def patched(self_or_first, *args, **kwargs):
        # Skip tracing peekr's own judge calls entirely.
        try:
            from ..eval import _in_eval as _peekr_eval_guard
            if _peekr_eval_guard.get():
                return original(self_or_first, *args, **kwargs)
        except Exception:
            pass

        # Normal tracing path follows...
        span, token = start_span("openai.chat.completions")

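The existing _in_eval ContextVar already propagated correctly through the thread pool, so no new primitives were needed. One caveat on the mechanism: ThreadPoolExecutor.submit() does not copy the caller's context by itself, so propagation depends on each task being run inside a copied context, for example with contextvars.copy_context().run() or asyncio.to_thread(). The guard just needed to be at the right layer: before span creation, not after.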

52 existing eval tests pass unchanged. The fix ships in peekr 0.9.3.

To update:

pip install --upgrade peekr

The Deeper Signal

Even after the fix, there's a question worth asking: why were all 38 judge calls returning {"claims": []}?

The answer is that Extremis's LLM outputs are memory-system responses — things like "I've stored that for you" or structured JSON results from recall operations. These genuinely have no atomic factual claims in the RAGAS sense. The judge was correct. The problem was that the eval was running on outputs it was never designed for.

An all-empty claims result — especially when it happens consistently across many spans — is itself a diagnostic signal: your eval is running on the wrong output type.

RAGAS Faithfulness was designed for RAG pipelines: a user asks a question, the system retrieves context, the model generates an answer, and you verify the answer against the context. Memory agent outputs break two of those assumptions: the "output" is a procedural confirmation, not a factual answer, and the "context" is the conversation history, not retrieved documents.

If you're running Hallucination on a memory agent, filter it down to the spans that actually produce evaluable output:

def is_substantive_response(span) -> bool:
    output = span.attributes.get("output", "")
    if not isinstance(output, str):
        return False
    # Skip JSON blobs, short confirmations, tool outputs
    if output.strip().startswith(("{", "[")):
        return False
    if len(output.split()) < 15:
        return False
    return True

peekr.init(
    api_key="...",
    evaluators=[peekr.eval.Hallucination(detailed=True)],
    span_filter=is_substantive_response,
)

This stops the eval from running on "Memory stored." and {"recalled": [...]} — the cases where {"claims": []} is the correct output but also a useless one.
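A quick sanity check of that filter on a few representative outputs. The filter body is repeated so the snippet runs on its own; `fake_span` is a hypothetical helper that mimics only the `attributes` mapping the filter reads, not a real peekr span.

```python
from types import SimpleNamespace


def is_substantive_response(span) -> bool:
    output = span.attributes.get("output", "")
    if not isinstance(output, str):
        return False
    # Skip JSON blobs, short confirmations, tool outputs
    if output.strip().startswith(("{", "[")):
        return False
    if len(output.split()) < 15:
        return False
    return True


def fake_span(output):
    # Minimal stand-in exposing only `.attributes`.
    return SimpleNamespace(attributes={"output": output})


confirmation = fake_span("Memory stored.")
json_blob = fake_span('{"recalled": ["likes tea"]}')
judge_score = fake_span(0.0)
answer = fake_span(
    "The user moved to Berlin in March 2024 and now works as a data "
    "engineer at a logistics startup."
)

results = {
    "confirmation": is_substantive_response(confirmation),
    "json_blob": is_substantive_response(json_blob),
    "judge_score": is_substantive_response(judge_score),
    "answer": is_substantive_response(answer),
}
```

Only the full-sentence answer passes. The 15-word cutoff is a heuristic, so it is worth tuning against real outputs from your own agent.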

What to Watch For

If you see this pattern in your own traces:

  1. High span count, low output diversity: 10+ spans with the same name, all producing the same structured output
  2. All-empty eval results: hallucination_details.total = 0 consistently across many spans
  3. Unexplained token usage: tokens accumulating on spans you don't recognize as your own application calls

The first is the cascade detection pattern. The second and third are the eval-on-wrong-output-type pattern. Both are silent — they won't surface in error rate, they won't trigger a timeout, and they won't show up in p95 latency if the individual spans are fast.
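The second signal can be turned into a simple rate check. This is a sketch under assumptions: spans are plain dicts with an `attributes` mapping, the attribute key is taken from the list above, and the 0.9 alert threshold is an arbitrary illustration.

```python
def empty_eval_rate(spans, key="hallucination_details.total"):
    """Fraction of evaluated spans whose eval found zero claims.

    Spans without the attribute are treated as not evaluated and ignored.
    Returns 0.0 when nothing was evaluated.
    """
    evaluated = [s for s in spans if key in s["attributes"]]
    if not evaluated:
        return 0.0
    empty = sum(1 for s in evaluated if s["attributes"][key] == 0)
    return empty / len(evaluated)


# Synthetic data: 38 empty results, one real result, one unevaluated span.
spans = [{"attributes": {"hallucination_details.total": 0}} for _ in range(38)]
spans.append({"attributes": {"hallucination_details.total": 4}})
spans.append({"attributes": {}})

rate = empty_eval_rate(spans)
looks_misapplied = rate > 0.9  # assumed threshold, tune per workload
```

A rate near 1.0 over a meaningful sample suggests the evaluator is running on output it was not designed for, which is cheaper to catch here than on the invoice.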

Peekr's trace structure — parent-child relationships, span names, token counts — makes both patterns machine-readable. The cascade in the Extremis classify_intent loop took 138 seconds. This one took 99 seconds. Neither raised an exception.


Both of these traces are now part of the peekr demo dataset. See what your own traces look like at peekr.starkspherelabs.com/demo.
