observability · June 3, 2026 · 4 min read

Zero-Latency LLM Evaluation: Run Hallucination Scoring Without Blocking Your Users

Running RAGAS hallucination scoring after every LLM call was adding 5–10s to our response times. Here's the async pattern that eliminates that latency entirely — and how Peekr now detects it automatically.

LLM evaluation · hallucination detection · async Python · LLM latency · AI agent performance · RAGAS

We noticed something in one of our traces last week. A call to /v1/answer was taking 16.5 seconds. The user was waiting 16 seconds for an answer to a simple question.

The LLM actually answered in 4.8 seconds. The remaining time, roughly 11.7 seconds, went to Peekr's own hallucination evaluator: the judge LLM running RAGAS faithfulness scoring.

We were slowing down user responses by 2× to measure response quality. That's backwards.

What was happening

RAGAS faithfulness scoring works by sending the LLM's answer to a second judge call, which checks each claim against the retrieved sources. It's valuable — but it makes its own LLM call, which takes 4–10 seconds.
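
To make the claim-checking step concrete, here is a minimal sketch. It assumes the usual definition of faithfulness as supported claims divided by total claims. The `toy_judge` function is a hypothetical stand-in (a substring match) for the judge LLM, and the claims and sources are made-up examples; none of this is the RAGAS implementation.

```python
from typing import Callable, List


def faithfulness(claims: List[str], sources: List[str],
                 is_supported: Callable[[str, List[str]], bool]) -> float:
    """Fraction of claims the judge considers supported by the sources."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, sources))
    return supported / len(claims)


def toy_judge(claim: str, sources: List[str]) -> bool:
    # Hypothetical stand-in for the judge LLM: case-insensitive substring match.
    return any(claim.lower() in source.lower() for source in sources)


sources = ["Paris is the capital of France. It lies on the Seine."]
claims = ["Paris is the capital of France", "Paris has 20 million residents"]
score = faithfulness(claims, sources, toy_judge)
print(score)  # 0.5
```

In a real evaluator both the claim extraction and the support check are LLM calls, which is where the multi-second cost comes from.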

The problem is where that call happened: synchronously, before the response returned to the user.

Timeline:
  0ms     LLM call starts
  4,800ms LLM call finishes → answer ready
  4,800ms RAGAS eval starts  ← user waiting
  9,500ms RAGAS eval finishes
  9,500ms Response returns to user ← 4.7s of pointless waiting

In this timeline, the user waits 9.5 seconds for an answer that was ready at 4.8 seconds. The eval results don't benefit the user in real time; they only matter for our observability dashboard.
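
The arithmetic behind that timeline can be written down as a tiny model. The function name `user_facing_latency_ms` is made up for illustration; the numbers are the ones from the timeline.

```python
def user_facing_latency_ms(llm_ms: int, eval_ms: int, eval_blocks: bool) -> int:
    """Time until the response is returned to the user, in milliseconds."""
    # A blocking eval sits on the critical path, so its duration is added.
    return llm_ms + eval_ms if eval_blocks else llm_ms


LLM_MS = 4_800            # LLM call finishes at 4,800ms
EVAL_MS = 9_500 - 4_800   # RAGAS eval runs from 4,800ms to 9,500ms

blocking = user_facing_latency_ms(LLM_MS, EVAL_MS, eval_blocks=True)
background = user_facing_latency_ms(LLM_MS, EVAL_MS, eval_blocks=False)
print(blocking, background, blocking - background)  # 9500 4800 4700
```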

The fix: run eval asynchronously

The right architecture: return the response immediately, run the evaluation in the background, then update the stored span with scores when it's done.

# Before: synchronous eval blocks the response
class EvalExporter:
    def export(self, span):
        score = evaluator.evaluate(span)   # ← blocks on the judge LLM call
        span.attributes["eval_scores"] = score
        storage_exporter.export(span)      # ← user waits for both

# After: async eval, no added latency for the user
from concurrent.futures import ThreadPoolExecutor
from copy import copy

class EvalExporter:
    def __init__(self, evaluators, async_eval=True):
        self.evaluators = evaluators
        self._pool = ThreadPoolExecutor(max_workers=4)
        self.async_eval = async_eval

    def export(self, span):
        if self.async_eval:
            # Fire-and-forget: the span is already stored without scores
            self._pool.submit(self._eval_and_patch, copy(span))
        else:
            # Sync mode: score and export on the calling thread
            self._eval_and_patch(span)

    def _eval_and_patch(self, span):
        # Runs on a background thread in async mode
        score = evaluator.evaluate(span)
        span.attributes["eval_scores"] = score
        # Re-export to Peekr Cloud, which upserts on span_id
        storage_exporter.export(span)

The span gets stored immediately (without scores). The background thread runs the judge call, then patches the span in Peekr Cloud when it finishes. The user never waits.
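
The store-then-patch flow can be tried end to end with stand-ins. In this sketch `Span`, `InMemoryStore`, and `slow_judge` are hypothetical placeholders for the real span type, Peekr Cloud, and the judge LLM, and the score is a dummy value. It only illustrates the pattern; it is not the SDK's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Span:
    span_id: str
    attributes: dict = field(default_factory=dict)


class InMemoryStore:
    """Stand-in for the storage exporter: upserts on span_id."""

    def __init__(self):
        self.spans = {}

    def export(self, span):
        self.spans[span.span_id] = deepcopy(span)


def slow_judge(span):
    time.sleep(0.2)               # stand-in for the judge LLM call
    return {"faithfulness": 1.0}  # placeholder score, not a real result


store = InMemoryStore()
pool = ThreadPoolExecutor(max_workers=4)


def eval_and_patch(span):
    span.attributes["eval_scores"] = slow_judge(span)
    store.export(span)  # second export overwrites the unscored copy


def export(span):
    store.export(span)  # stored immediately, without scores
    # Score a private copy on a background thread
    return pool.submit(eval_and_patch, deepcopy(span))


start = time.perf_counter()
future = export(Span("span-1", {"answer": "42"}))
export_seconds = time.perf_counter() - start

scored_before = "eval_scores" in store.spans["span-1"].attributes
future.result(timeout=5)  # wait only so the demo can inspect the final state
scored_after = "eval_scores" in store.spans["span-1"].attributes
pool.shutdown()
print(f"export returned in {export_seconds:.3f}s; scored: {scored_before} -> {scored_after}")
```

The sketch uses `deepcopy` so the background thread never mutates the attributes dict the caller still holds; a shallow copy would share it.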

The new timeline

Timeline (after fix):
  0ms     LLM call starts
  4,800ms LLM call finishes → response returns to user immediately
  
  (background thread)
  4,800ms RAGAS eval starts
  9,500ms RAGAS eval finishes → span updated in Peekr Cloud

User gets their answer in 4.8 seconds. Eval results appear in the dashboard ~5 seconds later. Same information, zero user-facing cost.

When to use sync vs async

Async eval (the default as of Peekr 0.9.2) is the right choice for:

  • Web services where response latency matters
  • Long-running agents where the user is waiting
  • Any production deployment

Synchronous eval is still useful for:

  • Scripts that exit immediately after the LLM call (async work would be killed)
  • Tests that need to assert on eval scores in the same call
  • Debug sessions where you want the scores before the next line runs

# Default: async (zero latency)
peekr.instrument(
    evaluators=[peekr.eval.Hallucination(detailed=True)],
)

# Sync for scripts/tests
peekr.instrument(
    evaluators=[peekr.eval.Hallucination(detailed=True)],
    # Pass to EvalExporter directly:
    # evaluate_async=False
)
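
For scripts and tests, an alternative to switching modes is to keep the futures and wait on them before exiting or asserting. The sketch below uses only the standard library; `submit_eval` and `flush` are hypothetical helper names, not Peekr APIs, and `judge` is a stand-in for the real evaluator.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=4)
pending = []


def judge(answer):
    time.sleep(0.1)  # stand-in for the judge LLM call
    return {"answer": answer, "faithfulness": 1.0}  # placeholder score


def submit_eval(answer):
    future = pool.submit(judge, answer)
    pending.append(future)
    return future


def flush(timeout=30.0):
    """Block until every submitted eval has finished, or the timeout passes."""
    done, not_done = wait(pending, timeout=timeout)
    return len(done), len(not_done)


submit_eval("first answer")
submit_eval("second answer")
finished, unfinished = flush()
scores = [f.result() for f in pending]
pool.shutdown()
print(finished, unfinished)  # 2 0
```

A test can call `flush()` and then assert on the scores, which keeps the production path asynchronous while still making the results deterministic where they are needed.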

How Peekr detected this automatically

The 16.5-second trace appeared in our Peekr dashboard. The timeline made the problem obvious: two openai.chat.completions spans that both took ~5 seconds, one after the other. The LLM call and the judge call, sequential.

This is the same pattern as the sequential execution bug we wrote about recently — two calls that could run in parallel but don't. Peekr's sequential execution detector now flags this automatically:

Root cause: sequential execution of "chat.completions"
2 spans ran one after another. Total: 10.5s, slowest single: 5.7s. Running them concurrently would save 4.8s.

That recommendation shows up in the Insights tab alongside a code fix snippet, and on the trace page itself. You don't need to read a 279-second trace waterfall to spot it — Peekr reads it for you and tells you what to change.
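
The numbers in that message follow from simple interval arithmetic. Here is a rough sketch of how such a check could work, assuming spans carry start and end times; it is not Peekr's detector, and `Span` and `sequential_savings` are illustrative names. The durations (4.8s and 5.7s) are the ones implied by the message above.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def sequential_savings(spans):
    """Return (total, slowest, savings) if no two spans overlap, else None."""
    if len(spans) < 2:
        return None
    ordered = sorted(spans, key=lambda s: s.start_s)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start_s < prev.end_s:  # overlap: these already run concurrently
            return None
    total = sum(s.duration_s for s in ordered)
    slowest = max(s.duration_s for s in ordered)
    return total, slowest, total - slowest


spans = [
    Span("chat.completions", 0.0, 4.8),   # LLM call
    Span("chat.completions", 4.8, 10.5),  # judge call
]
total, slowest, savings = sequential_savings(spans)
print(f"Total: {total:.1f}s, slowest single: {slowest:.1f}s, would save {savings:.1f}s")
# Total: 10.5s, slowest single: 5.7s, would save 4.8s
```

The savings figure is the total minus the slowest span, which is the best case if every span could start at the same time.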

The broader pattern

Evaluation is not the only thing that shouldn't block user responses. Any work that's useful for observability but not needed by the caller should follow this pattern:

  • Hallucination scoring — async ✓ (this fix)
  • Span export — already async in Peekr (background HTTP export thread)
  • Entity extraction — should be async if not needed for the response
  • Vector indexing — should be async if the caller doesn't need to query immediately

The rule: if the user doesn't need it to answer the question, it shouldn't be on the critical path.
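
In an asyncio-based service the same rule can be sketched with `asyncio.create_task`. Everything here is illustrative: `handle_request` and `index_document` are hypothetical names, and the sleep stands in for entity extraction or vector indexing.

```python
import asyncio

background_tasks = set()  # strong references so pending tasks are not garbage-collected
index = []


async def index_document(doc: str) -> None:
    await asyncio.sleep(0.05)  # stand-in for entity extraction or vector indexing
    index.append(doc)


async def handle_request(question: str) -> str:
    answer = f"answer to {question}"  # stand-in for the LLM call
    task = asyncio.create_task(index_document(answer))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return answer  # returned before indexing finishes


async def main():
    answer = await handle_request("q1")
    indexed_at_return = len(index)
    await asyncio.gather(*background_tasks)  # drain pending work before shutdown
    return answer, indexed_at_return, len(index)


result = asyncio.run(main())
print(result)  # ('answer to q1', 0, 1)
```

Holding the task in a set matters because the event loop keeps only weak references to tasks; draining the set at shutdown avoids losing work that is still in flight.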


Async eval is live in Peekr SDK 0.9.2. Update with pip install --upgrade peekr and evaluation no longer adds latency to user responses. The sequential execution detector ships in Peekr Cloud automatically, with no SDK update needed.
