You ran your RAG pipeline through RAGAS on your eval set. Faithfulness score: 0.87. You shipped it.
Three weeks later a user screenshots your agent confidently stating that a law was passed in 2019 when it was actually 2021. The retrieved document had the right date. The LLM ignored it.
The benchmark told you what your model can do. You need to know what it's actually doing, on real traffic, every day.
The Gap Between Evals and Production
Evaluation datasets are carefully curated. They cover the cases you thought to test. Real user queries are messier, more varied, and often outside the distribution of your eval set.
The failure modes that benchmarks miss:
- Distribution shift — users ask questions your eval set didn't anticipate
- Context window effects — retrieved documents buried in the middle of a long context get ignored
- Proper noun substitution — model substitutes a wrong but plausible name ("Frank Lloyd Wright" instead of "Gustave Eiffel")
- Numeric invention — dates, percentages, and statistics are fabricated with high confidence
- Cascade errors — a wrong answer in turn 3 of a conversation corrupts turns 4, 5, 6
None of these show up cleanly in aggregate faithfulness scores until they've already damaged user trust.
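Some of these failure modes can be caught with cheap checks before any judge model runs. As a rough sketch for numeric invention only, the heuristic below flags numbers in an answer that never appear in the retrieved context. It is a plain regex pre-check, not how RAGAS or Peekr score responses, and the function name `ungrounded_numbers` is made up for illustration.

```python
import re

# Matches integers, decimals, and percentages such as 2021, 3.5, or 40%.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")

def ungrounded_numbers(answer: str, context: str) -> list[str]:
    """Return numbers that appear in the answer but nowhere in the context."""
    context_numbers = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(answer) if n not in context_numbers]

context = "The act was signed into law in 2021."
answer = "The law was passed in 2019."
print(ungrounded_numbers(answer, context))
```

The check is deliberately narrow: it misses a wrong number that happens to appear elsewhere in the context, and it flags harmless reformatting such as "1,000" versus "1000". Treat it as a tripwire, not as a replacement for claim-level verification.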
Sentence-Level Claim Verification
The key insight behind RAGAS faithfulness (and Peekr's hallucination evaluator) is that a response is only as grounded as its individual claims: the score is the fraction of claims in the response that the retrieved context supports.
A response like "The Eiffel Tower was built in 1923 by Gustave Eiffel for the Paris World's Fair" contains three claims:
- Built in 1923 — contradicted (it was 1889)
- By Gustave Eiffel — supported
- For the Paris World's Fair — supported
The sentence-level breakdown tells you which claim failed. An aggregate score of 0.67 doesn't. When you're debugging a production complaint, the difference between "something was wrong" and "the date was wrong" is the difference between a 2-hour investigation and a 10-minute fix.
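The Eiffel Tower example can be reproduced in a few lines. This is a minimal sketch that assumes the claims have already been extracted and judged; the `Claim` type and both helper functions are illustrative names, not part of RAGAS or Peekr.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    verdict: str  # "supported", "contradicted", or "unsupported"

def faithfulness(claims: list[Claim]) -> float:
    """Fraction of claims that the context supports."""
    if not claims:
        return 1.0  # assumption: a response with no claims has nothing to contradict
    supported = sum(1 for c in claims if c.verdict == "supported")
    return supported / len(claims)

def failed_claims(claims: list[Claim]) -> list[Claim]:
    """Claims that were contradicted or could not be supported."""
    return [c for c in claims if c.verdict != "supported"]

claims = [
    Claim("built in 1923", "contradicted"),
    Claim("by Gustave Eiffel", "supported"),
    Claim("for the Paris World's Fair", "supported"),
]

print(round(faithfulness(claims), 2))          # aggregate: 0.67
print([c.text for c in failed_claims(claims)])  # detail: ['built in 1923']
```

The aggregate and the detail come from the same verdicts. The only difference is whether the per-claim list is kept around for whoever ends up debugging the complaint.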
Setting Up Production Hallucination Monitoring
import peekr

peekr.instrument(
    exporter=peekr.HTTPExporter(
        endpoint="https://peekr.starkspherelabs.com",
        api_key="pk_live_…",
    ),
    evaluators=[
        peekr.eval.Hallucination(
            detailed=True,  # sentence-level claim verdicts
            context_extractor=lambda span: span.attributes.get("retrieved_docs", ""),
        )
    ],
)
The context_extractor is critical for RAG. Without it, the evaluator compares the output against the prompt — which usually contains the right answer, making everything look grounded. You want to compare against the retrieved documents, not the full prompt.
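If your pipeline stores retrieved documents as a list, not a single string, a named function may be easier to maintain than a lambda. The sketch below assumes only what the snippet above shows, namely that a span exposes an `attributes` mapping with a `retrieved_docs` key. The list handling and the `SimpleNamespace` stand-in for a span are assumptions made for illustration.

```python
from types import SimpleNamespace

def extract_context(span) -> str:
    """Return the retrieved documents as one string, or "" if none were recorded."""
    docs = span.attributes.get("retrieved_docs", "")
    if isinstance(docs, (list, tuple)):
        # Assumption: each list entry is one retrieved chunk.
        return "\n\n".join(str(d) for d in docs)
    return str(docs)

# Stand-in span object, used here only to exercise the function.
fake_span = SimpleNamespace(
    attributes={"retrieved_docs": ["Doc A text.", "Doc B text."]}
)
print(extract_context(fake_span))
```

A function like this could be passed wherever the lambda appears, provided the evaluator accepts any callable that takes a span.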
What You See in the Dashboard
After each LLM call, Peekr logs:
{
  "eval_scores": { "Hallucination": 0.67 },
  "hallucination_details": {
    "score": 0.67,
    "total": 3,
    "supported": 2,
    "contradicted": 1,
    "unsupported": 0,
    "claims": [
      { "text": "built in 1889", "verdict": "supported" },
      { "text": "in Paris", "verdict": "supported" },
      { "text": "by Frank Lloyd Wright", "verdict": "contradicted" }
    ]
  }
}
The Quality dashboard shows:
- Mean faithfulness score over time — is the trend improving or degrading?
- Critical count (score < 0.5) and warning count (score < 0.8)
- Worst offenders — the 10 traces with the lowest scores, with the exact claims that failed
This tells you where to look. The waterfall view for any trace shows the retrieved context alongside the model's answer, so you can see exactly what the model had available and what it chose to say instead.
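The same triage can be run offline on exported scores. This sketch applies the critical and warning cutoffs listed above and pulls out the lowest-scoring traces; the record shape (`trace_id`, `score`) is a hypothetical export format, not Peekr's actual schema.

```python
import heapq
from collections import Counter

def bucket(score: float) -> str:
    """Classify a score using the dashboard cutoffs (< 0.5 critical, < 0.8 warning)."""
    if score < 0.5:
        return "critical"
    if score < 0.8:
        return "warning"
    return "ok"

def worst_offenders(records: list[dict], n: int = 10) -> list[dict]:
    """Return the n records with the lowest scores, lowest first."""
    return heapq.nsmallest(n, records, key=lambda r: r["score"])

records = [
    {"trace_id": "t1", "score": 0.33},
    {"trace_id": "t2", "score": 0.75},
    {"trace_id": "t3", "score": 1.0},
    {"trace_id": "t4", "score": 0.45},
]

counts = Counter(bucket(r["score"]) for r in records)
print(dict(counts))
print([r["trace_id"] for r in worst_offenders(records, n=2)])
```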
Automatically Blocking Low-Quality Responses
For applications where hallucinations carry real risk (healthcare, legal, financial), you can block responses that fall below a faithfulness threshold before they reach users:
peekr.instrument(
    evaluators=[peekr.eval.Hallucination(detailed=True)],
    guardrails=[
        peekr.guard.HallucinationBlock(threshold=0.6)
    ],
)
HallucinationBlock reuses the score from the evaluator — the judge LLM only runs once. When the response scores below 0.6, GuardrailError is raised. The violation span is stored (so you have an audit trail) and the response never reaches the user.
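Your application still has to decide what the user sees when a response is blocked. The sketch below shows the general pattern with a locally defined `GuardrailError` stand-in, so it runs without Peekr installed. The real exception's import path and attributes are not shown in this article, so treat the names here as placeholders.

```python
class GuardrailError(Exception):
    """Local stand-in for the exception described above; not imported from Peekr."""

FALLBACK = "I couldn't verify that answer against the source documents."

def guard(response: str, score: float, threshold: float = 0.6) -> str:
    """Pass the response through only if its faithfulness score meets the threshold."""
    if score < threshold:
        raise GuardrailError(f"faithfulness {score:.2f} is below {threshold:.2f}")
    return response

def answer_or_fallback(response: str, score: float) -> str:
    """Return the model's answer, or a safe fallback if the guardrail trips."""
    try:
        return guard(response, score)
    except GuardrailError:
        return FALLBACK

print(answer_or_fallback("The tower opened in 1889.", 0.95))
print(answer_or_fallback("The tower opened in 1923.", 0.33))
```

A fallback that admits uncertainty is usually better than an empty reply or a raw error, because the user still learns that the question was received.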
Practical Thresholds
What faithfulness threshold is "good enough" depends on the risk profile:
| Application | Recommended threshold | Rationale |
|---|---|---|
| Internal search / knowledge base | 0.7 | Wrong answers are inconvenient |
| Customer-facing support bot | 0.8 | Wrong answers damage trust |
| Healthcare or legal assistant | 0.9 | Wrong answers carry liability |
| Financial advice | 0.9+ | Wrong answers may violate Reg BI |
Start with warn mode (action="warn") to understand your baseline before switching to raise. A week of production data will show you the score distribution for your specific queries and content.
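Once you have baseline scores, you can estimate how many responses each candidate threshold would have blocked before you enforce it. A minimal sketch, using made-up scores in place of a week of real data:

```python
def block_rate(scores: list[float], threshold: float) -> float:
    """Fraction of responses that would fall below the threshold."""
    if not scores:
        return 0.0
    return sum(s < threshold for s in scores) / len(scores)

# Illustrative scores only; substitute the scores collected in warn mode.
baseline = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.5, 1.0, 1.0]

for threshold in (0.7, 0.8, 0.9):
    print(f"threshold {threshold}: would block {block_rate(baseline, threshold):.0%}")
```

If a threshold would block a large share of traffic, that usually points to a retrieval or prompt problem to fix before enforcing it, since users would otherwise see fallbacks most of the time.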
The Operational Loop
Hallucination monitoring isn't a one-time setup — it's an operational loop:
- Establish baseline — run for a week to understand your score distribution
- Set alert thresholds — alert when daily mean score drops below baseline
- Investigate worst offenders — these are your highest-risk outputs
- Fix retrieval first — most hallucinations are caused by the wrong document being retrieved, not the model being confused
- Update your eval set — add the production failure cases to your evals
- Repeat
The eval set and the production monitor should inform each other. Cases that fail in production become new eval cases. Eval improvements should show up in production scores.
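The baseline and alerting steps of the loop can be sketched with the standard library. Alerting on any dip below the baseline mean would fire on ordinary noise, so this version adds a margin of `k` standard deviations. That margin is an assumption for illustration, not a documented Peekr setting.

```python
from statistics import mean, stdev

def baseline_stats(daily_means: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of the baseline week's daily scores."""
    return mean(daily_means), stdev(daily_means)

def should_alert(today: float, base_mean: float, base_std: float, k: float = 2.0) -> bool:
    """Alert when today's mean falls more than k standard deviations below baseline."""
    return today < base_mean - k * base_std

# Illustrative daily mean scores for one baseline week.
week = [0.86, 0.88, 0.87, 0.85, 0.89, 0.87, 0.86]
base_mean, base_std = baseline_stats(week)

print(should_alert(0.86, base_mean, base_std))  # normal day
print(should_alert(0.80, base_mean, base_std))  # clear drop
```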
Getting Started
Peekr's hallucination evaluator is included in all plans. The detailed=True mode (sentence-level RAGAS decomposition) costs one extra LLM judge call per evaluated span — if cost is a concern, use detailed=False for monitoring and detailed=True only when investigating specific traces.
Free tier: 10k spans/month. A team sending 100 queries/day generates roughly 3,000 queries a month, so it stays within the free tier as long as each query produces no more than about three spans.