We noticed something in one of our traces last week. A call to /v1/answer was taking 16.5 seconds. The user was waiting 16 seconds for an answer to a simple question.
The LLM actually answered in 4.8 seconds. The other 11.7 seconds were Peekr's own hallucination evaluator: the judge LLM running RAGAS faithfulness scoring.
We were making user responses more than 3× slower in order to measure response quality. That's backwards.
What was happening
RAGAS faithfulness scoring works by sending the LLM's answer to a second judge call, which checks each claim against the retrieved sources. It's valuable — but it makes its own LLM call, which takes 4–10 seconds.
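To make the scoring step concrete, here is a minimal sketch of a faithfulness-style score: the fraction of claims in an answer that the retrieved sources support. It is an illustration, not RAGAS or Peekr code. `claim_is_supported` is a hypothetical stand-in that uses a naive substring check; in a real evaluator that check is the judge LLM call, which is where the seconds go.

```python
from typing import Callable, List


def claim_is_supported(claim: str, sources: List[str]) -> bool:
    """Hypothetical stand-in for the judge LLM: naive substring match."""
    return any(claim.lower() in source.lower() for source in sources)


def faithfulness(
    claims: List[str],
    sources: List[str],
    judge: Callable[[str, List[str]], bool] = claim_is_supported,
) -> float:
    """Fraction of claims supported by the sources (1.0 = fully grounded)."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores zero
    supported = sum(1 for claim in claims if judge(claim, sources))
    return supported / len(claims)


sources = ["Paris is the capital of France. It sits on the Seine."]
claims = ["Paris is the capital of France", "Paris has 40 million residents"]
score = faithfulness(claims, sources)  # one of the two claims is supported
```

Every claim costs judge work, so the evaluation time grows with the length of the answer rather than staying fixed.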
The problem is where that call happened: synchronously, before the response returned to the user.
Timeline:
0ms LLM call starts
4,800ms LLM call finishes → answer ready
4,800ms RAGAS eval starts ← user waiting
9,500ms RAGAS eval finishes
9,500ms Response returns to user ← 5s of pointless waiting
The user waited 9.5 seconds for an answer that was ready at 4.8 seconds. The eval results don't benefit the user in real time; they only matter for our observability dashboard.
The fix: run eval asynchronously
The right architecture: return the response immediately, run the evaluation in the background, then update the stored span with scores when it's done.
# Before: synchronous eval blocks the response
class EvalExporter:
    def export(self, span):
        score = evaluator.evaluate(span)  # ← blocks
        span.attributes["eval_scores"] = score
        storage_exporter.export(span)  # ← user waits for both
# After: async eval, no added latency for the user
from concurrent.futures import ThreadPoolExecutor
from copy import copy

class EvalExporter:
    def __init__(self, evaluators, async_eval=True):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self.async_eval = async_eval

    def export(self, span):
        if self.async_eval:
            # Fire-and-forget: the span is already stored without scores
            self._pool.submit(self._eval_and_patch, copy(span))
        else:
            self._eval_sync(span)  # blocking path, not shown here

    def _eval_and_patch(self, span):
        # Runs on a background thread
        score = evaluator.evaluate(span)
        span.attributes["eval_scores"] = score
        # Re-export to Peekr Cloud, which upserts on span_id
        storage_exporter.export(span)
The span gets stored immediately (without scores). The background thread runs the judge call, then patches the span in Peekr Cloud when it finishes. The user never waits.
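The store-then-patch flow can be tried end to end with stand-ins. The sketch below is not the Peekr implementation: `Span`, `InMemoryStore`, `fake_judge`, and `AsyncEvalExporter` are all hypothetical names, the store is a dict keyed on `span_id`, and the exporter writes the unscored span itself (in the snippet above that first write happens elsewhere).

```python
from concurrent.futures import Future, ThreadPoolExecutor
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Span:
    """Hypothetical span type; a real span carries many more fields."""
    span_id: str
    attributes: Dict[str, object] = field(default_factory=dict)


class InMemoryStore:
    """Hypothetical storage backend that upserts on span_id."""

    def __init__(self) -> None:
        self.spans: Dict[str, Span] = {}

    def export(self, span: Span) -> None:
        self.spans[span.span_id] = span  # last write wins


def fake_judge(span: Span) -> float:
    """Stand-in for the slow judge LLM call."""
    return 0.9


class AsyncEvalExporter:
    def __init__(self, store: InMemoryStore, judge=fake_judge) -> None:
        self._store = store
        self._judge = judge
        self._pool = ThreadPoolExecutor(max_workers=4)

    def export(self, span: Span) -> Future:
        # Store the unscored span first so it is visible immediately.
        self._store.export(deepcopy(span))
        # deepcopy so the background thread never mutates the caller's span.
        return self._pool.submit(self._eval_and_patch, deepcopy(span))

    def _eval_and_patch(self, span: Span) -> None:
        span.attributes["eval_scores"] = self._judge(span)
        self._store.export(span)  # same span_id, so this overwrites


store = InMemoryStore()
exporter = AsyncEvalExporter(store)
original = Span(span_id="abc123", attributes={"model": "example-model"})
future = exporter.export(original)
future.result()  # only for the demo; a request handler would not wait here
```

One design note: for a typical Python object, a shallow `copy` shares the `attributes` dict with the original span, so writing scores on the copy also changes the original. The sketch uses `deepcopy` to keep the two independent.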
The new timeline
Timeline (after fix):
0ms LLM call starts
4,800ms LLM call finishes → response returns to user immediately
(background thread)
4,800ms RAGAS eval starts
9,500ms RAGAS eval finishes → span updated in Peekr Cloud
User gets their answer in 4.8 seconds. Eval results appear in the dashboard ~5 seconds later. Same information, zero user-facing cost.
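The two timelines can be reproduced at 1/100 scale with sleeps standing in for the LLM and judge calls. This is a rough simulation under that assumption, with hypothetical function names, and it measures only what the caller waits for.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LLM_SECONDS = 0.048   # stands in for the 4.8 s LLM call
EVAL_SECONDS = 0.047  # stands in for the 4.7 s judge call


def handle_sync() -> float:
    """Return the caller-visible latency when eval blocks the response."""
    start = time.perf_counter()
    time.sleep(LLM_SECONDS)   # LLM call
    time.sleep(EVAL_SECONDS)  # eval runs before the response returns
    return time.perf_counter() - start


def handle_async(pool: ThreadPoolExecutor):
    """Return the caller-visible latency when eval runs in the background."""
    start = time.perf_counter()
    time.sleep(LLM_SECONDS)                          # LLM call
    pending = pool.submit(time.sleep, EVAL_SECONDS)  # eval off the request path
    return time.perf_counter() - start, pending


with ThreadPoolExecutor(max_workers=1) as pool:
    sync_latency = handle_sync()
    async_latency, pending = handle_async(pool)
    pending.result()  # let the background eval finish before the pool closes
```

The total work is the same in both cases; only the part the caller waits for changes.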
When to use sync vs async
Async eval (the default as of Peekr 0.9.2) is the right choice for:
- Web services where response latency matters
- Long-running agents where the user is waiting
- Any production deployment
Synchronous eval is still useful for:
- Scripts that exit immediately after the LLM call (async work would be killed)
- Tests that need to assert on eval scores in the same call
- Debug sessions where you want the scores before the next line runs
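For the short-script case, an alternative to switching modes is to drain pending evaluations before the process exits. The sketch below shows the idea with a plain `ThreadPoolExecutor`. `submit_eval` and `flush` are hypothetical helpers, not Peekr functions, and whether the SDK offers an equivalent is not something this post covers.

```python
import atexit
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import List

_pool = ThreadPoolExecutor(max_workers=4)
_pending: List[Future] = []


def submit_eval(fn, *args) -> Future:
    """Queue background work and remember it so it can be flushed later."""
    future = _pool.submit(fn, *args)
    _pending.append(future)
    return future


def flush(timeout: float = 30.0) -> int:
    """Block until queued work finishes; return how many tasks are still running."""
    done, not_done = wait(_pending, timeout=timeout)
    _pending[:] = list(not_done)
    return len(not_done)


atexit.register(flush)  # drain the queue when the script exits normally

results = []
submit_eval(results.append, "scored")  # stand-in for an eval task
remaining = flush()
```

A test can call `flush()` before asserting on scores, which keeps the production default while still making the assertion deterministic.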
# Default: async (zero added latency)
peekr.instrument(
    evaluators=[peekr.eval.Hallucination(detailed=True)],
)
# Sync for scripts/tests
peekr.instrument(
    evaluators=[peekr.eval.Hallucination(detailed=True)],
    # Pass to EvalExporter directly:
    # async_eval=False
)
How Peekr detected this automatically
The 16.5-second trace appeared in our Peekr dashboard. The timeline made the problem obvious: two openai.chat.completions spans that both took ~5 seconds, one after the other. The LLM call and the judge call, sequential.
This is the same pattern as the sequential execution bug we wrote about recently: two calls that run one after the other on the request path when they don't have to. Peekr's sequential execution detector now flags this automatically:
Root cause: sequential execution of "chat.completions"
2 spans ran one after another. Total: 10.5s, slowest single: 5.7s. Running them concurrently would save 4.8s.
That recommendation shows up in the Insights tab alongside a code fix snippet, and on the trace page itself. You don't need to read a 279-second trace waterfall to spot it — Peekr reads it for you and tells you what to change.
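The arithmetic behind that message is simple enough to sketch. The function below is an illustration of the idea, not the detector's actual logic: it treats spans as sequential when none overlap, then reports the total, the slowest span, and the difference. The span boundaries are assumed values chosen to match the numbers in the message above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanTiming:
    name: str
    start: float  # seconds since trace start
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def sequential_savings(spans: List[SpanTiming]) -> Optional[dict]:
    """Return timing stats if the spans ran strictly one after another."""
    ordered = sorted(spans, key=lambda s: s.start)
    if len(ordered) < 2:
        return None
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start < prev.end:  # overlap: already running concurrently
            return None
    total = sum(s.duration for s in ordered)
    slowest = max(s.duration for s in ordered)
    return {
        "total": round(total, 1),
        "slowest": round(slowest, 1),
        "savings": round(total - slowest, 1),
    }


spans = [
    SpanTiming("chat.completions", start=0.0, end=4.8),
    SpanTiming("chat.completions", start=4.8, end=10.5),
]
report = sequential_savings(spans)
```

Timing alone cannot tell whether one span depends on the other's output. In the eval case the judge needs the answer, so the saving comes from moving the judge call off the request path rather than starting both calls at once.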
The broader pattern
Evaluation is not the only thing that shouldn't block user responses. Any work that's useful for observability but not needed by the caller should follow this pattern:
- Hallucination scoring — async ✓ (this fix)
- Span export — already async in Peekr (background HTTP export thread)
- Entity extraction — should be async if not needed for the response
- Vector indexing — should be async if the caller doesn't need to query immediately
The rule: if the user doesn't need it to answer the question, it shouldn't be on the critical path.
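One way to apply the rule generally is a small helper that pushes non-essential work onto a background pool. The sketch below assumes a thread pool is acceptable for the workload; `defer` and `handle_request` are hypothetical names. It also logs failures, because an exception raised in a fire-and-forget task is otherwise easy to miss.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

logger = logging.getLogger("background")
_pool = ThreadPoolExecutor(max_workers=4)


def _log_failure(future: Future) -> None:
    """Done-callback: surface exceptions that nobody is waiting to collect."""
    exc = future.exception()
    if exc is not None:
        logger.error("background task failed: %r", exc)


def defer(fn, *args, **kwargs) -> Future:
    """Run fn off the critical path and log any error it raises."""
    future = _pool.submit(fn, *args, **kwargs)
    future.add_done_callback(_log_failure)
    return future


def handle_request(question: str) -> str:
    answer = f"answer to {question}"  # stand-in for the LLM call
    defer(len, answer)                # stand-in for scoring, indexing, etc.
    return answer                     # returned without waiting on deferred work
```

The caller gets its answer as soon as the essential work is done, and anything that fails in the background still leaves a trace in the logs.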
Async eval is live in Peekr SDK 0.9.2. Update with pip install --upgrade peekr and the user-facing cost of eval drops to zero. The sequential execution detector ships in Peekr Cloud automatically, with no SDK update needed.