See exactly what your AI agent is doing.
Auto-instruments OpenAI, Anthropic, Gemini, and Bedrock. Every span, token count, latency, and hallucination verdict — captured with two lines of code, no proxy, no lock-in.
Three things every AI team needs
Trace. Cost. Score.
The same three primitives that run in the OSS library — now with a shared dashboard, longer retention, and hosted evaluators.
Trace
Full call tree — every LLM call, tool use, and agent step as a span.
- Auto-patch at class level — all OpenAI() instances captured
- Async + streaming — AsyncOpenAI, streamed responses rolled up
- Waterfall view with input/output and human-readable labels
- Cost in cents per span, per trace, per tenant
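To illustrate what class-level patching means, here is a minimal sketch built on a stand-in `FakeClient` class rather than a real SDK. The names `FakeClient`, `patch_class`, and `SPANS` are hypothetical, and this is not Peekr's actual implementation. It only shows why replacing a method on the class covers every instance, including ones created before the patch was applied.

```python
import functools
import time

SPANS = []  # hypothetical in-memory span sink


class FakeClient:
    """Stand-in for an SDK client class; not a real library."""

    def complete(self, prompt):
        return f"echo: {prompt}"


def patch_class(cls, method_name):
    """Replace cls.method_name with a wrapper that records a span."""
    original = getattr(cls, method_name)
    if getattr(original, "_patched", False):
        return  # already patched; calling this twice must not double-wrap

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            SPANS.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    wrapper._patched = True
    setattr(cls, method_name, wrapper)


early = FakeClient()                 # created before patching
patch_class(FakeClient, "complete")
late = FakeClient()                  # created after patching

early.complete("a")
late.complete("b")
print(len(SPANS))  # 2
```

Because method lookup goes through the class, `early` picks up the wrapper even though it was constructed first. Patching individual instances would miss any client created elsewhere in the codebase.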
Cost
Know what each query costs before it shows up on your bill.
- Cost per query trend — 30-day chart, unit economics over time
- By operation — which endpoint or workflow costs the most
- By tenant — per-customer cost breakdown at a glance
- Savings recommendations — model-swap, output-cap, caching
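As a rough sketch of how the per-operation and per-tenant breakdowns could be derived from span token counts: the span fields, the `PRICES` table, and the model name `model-a` below are all assumptions made for illustration, not Peekr's schema or real provider prices.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; real prices vary by model.
PRICES = {"model-a": {"input": 1.00, "output": 2.00}}


def span_cost_usd(span):
    """Cost of one LLM span from its input and output token counts."""
    price = PRICES[span["model"]]
    return (span["input_tokens"] * price["input"]
            + span["output_tokens"] * price["output"]) / 1_000_000


def rollup(spans, key):
    """Sum span costs grouped by the given field, e.g. tenant or operation."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[key]] += span_cost_usd(span)
    return dict(totals)


# Made-up spans for illustration only.
spans = [
    {"tenant": "acme", "operation": "chat_summary", "model": "model-a",
     "input_tokens": 18_000, "output_tokens": 400},
    {"tenant": "acme", "operation": "search", "model": "model-a",
     "input_tokens": 2_000, "output_tokens": 100},
    {"tenant": "globex", "operation": "chat_summary", "model": "model-a",
     "input_tokens": 5_000, "output_tokens": 500},
]

by_tenant = rollup(spans, "tenant")
by_operation = rollup(spans, "operation")
top_operation = max(by_operation, key=by_operation.get)
print(top_operation)  # chat_summary
```

The same `rollup` call answers both questions in the list above, which is why each span needs to carry its tenant and operation as attributes.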
Score
See which sentence was wrong. Not just a 0–1 score.
- Claim decomposition — supported / contradicted / unsupported per sentence
- Context-aware — evaluator reads your retrieved docs, not the full prompt
- Worst-offenders list — lowest faithfulness traces ranked for triage
- Quality trend over 30 days — know if your agent is regressing
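The sketch below shows one way per-claim verdicts could be rolled up into a score and a worst-offenders ranking. It assumes the verdicts have already been produced by an evaluator and that the score is simply the share of supported claims. That scoring rule, the `Claim` type, and the trace data are illustrative assumptions, not a description of how Peekr's hosted evaluators work.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    verdict: str  # "supported", "contradicted", or "unsupported"


def faithfulness(claims):
    """Share of claims marked supported (assumed scoring rule)."""
    if not claims:
        return 1.0  # assumption: no claims means nothing to contradict
    supported = sum(1 for c in claims if c.verdict == "supported")
    return supported / len(claims)


def worst_offenders(traces, limit=10):
    """Return trace ids ordered from lowest to highest faithfulness."""
    ranked = sorted(traces.items(), key=lambda item: faithfulness(item[1]))
    return [trace_id for trace_id, _ in ranked[:limit]]


# Made-up verdicts for illustration only.
traces = {
    "trace-1": [Claim("1923", "contradicted"),
                Claim("Frank Lloyd Wright", "contradicted"),
                Claim("World's Fair", "unsupported")],
    "trace-2": [Claim("Opens at 9am", "supported"),
                Claim("Closed on Mondays", "supported")],
    "trace-3": [Claim("Tickets cost $10", "supported"),
                Claim("Parking is free", "unsupported")],
}

print(worst_offenders(traces))  # ['trace-1', 'trace-3', 'trace-2']
```

Keeping the individual `Claim` records alongside the score is what lets a reviewer jump from a low-scoring trace straight to the sentence that failed.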
Observability scores hallucinations. It tells you which sentence was wrong and why, so you can fix your prompts and retrieval. If you need to block bad outputs before users see them, that's Guardrails.
See Guardrails →
Four walls every AI team hits
The same four problems — made obvious.
The complaint
“My agent gave the wrong answer.”
See exactly what the LLM received — not what you think you sent.
Malformed tool output is the silent killer. The trace shows the full call tree: tool inputs, outputs, and exactly what the LLM got. You find the mismatch in seconds, not hours.
agent.run 2100ms
└─ tool.fetch_user 12ms
   out: null ← returned null
└─ openai.chat 2088ms
   in: "User profile: null…" ← LLM received garbage
The complaint
“My agent is hallucinating.”
Know which exact claim was wrong — not just a 0–1 score.
Peekr breaks every LLM response into sentences and labels each one as supported, contradicted, or unsupported. You see exactly which claim failed and why.
eval_scores: { Hallucination: 0.00 }
✗ contradicted "1923"
✗ contradicted "Frank Lloyd Wright"
~ unsupported "World's Fair"
The complaint
“My agent is too slow.”
The LLM is almost never the bottleneck.
Most teams swap models first. The waterfall shows wall-clock time split across every tool and LLM call. 80% of the time the fix is adding a cache or speeding up a slow tool, not switching to a different model.
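The arithmetic behind the waterfall is simple enough to sketch. The helper `time_shares` is a hypothetical name, the durations are taken from the example trace on this page, and the sketch assumes the child spans run one after another rather than in parallel.

```python
def time_shares(root_ms, children):
    """Return each child span's share of the root span's wall-clock time."""
    return {name: ms / root_ms for name, ms in children.items()}


children = {"tool.search_web": 3800, "tool.rerank": 18, "openai.chat": 490}
shares = time_shares(4300, children)

slowest = max(shares, key=shares.get)
print(slowest, f"{shares[slowest]:.0%}")  # tool.search_web 88%
```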
agent.run 4300ms
└─ tool.search_web 3800ms ← 88% of time. Cache this.
└─ tool.rerank 18ms
└─ openai.chat 490ms ← not your problem
The complaint
“My API bill keeps climbing.”
Cost-per-query growing faster than traffic means prompt bloat.
Peekr shows cost-per-query trend over 30 days broken down by operation. If it's growing faster than traffic, something is wrong — usually unbounded conversation history or a disproportionately expensive feature.
Trace 1: 18,432 tokens · $0.018
Trace 2: 21,104 tokens · $0.021
Trace 3: 24,891 tokens · $0.025 ← unbounded growth
Cost by operation: chat_summary 67% of spend
Two lines. That's it.
Instrument before your first import. Everything else is automatic.
Peekr patches at the class level — every OpenAI(), AsyncOpenAI(), and anthropic.Anthropic() instance is covered. No wrappers. No proxy. No framework-specific configuration.
- Zero latency overhead — spans export on a background thread
- Idempotent upsert — retries never duplicate spans
- Works alongside JSONL / SQLite for local dev
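For readers curious how background export and idempotent upserts fit together, here is a minimal sketch using only the standard library. `BackgroundExporter`, the `span_id` field, and the dict standing in for the backend are assumptions for illustration; this is not Peekr's exporter. The caller only enqueues, so network I/O stays off the request path, and writes are keyed by span id, so a retried span overwrites itself instead of creating a duplicate.

```python
import queue
import threading


class BackgroundExporter:
    """Hypothetical exporter: the caller enqueues, a worker thread delivers."""

    def __init__(self, store):
        self._queue = queue.Queue()
        self._store = store  # dict standing in for the backend, keyed by span id
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def export(self, span):
        self._queue.put(span)  # returns immediately; no I/O on the caller

    def _run(self):
        while True:
            span = self._queue.get()
            try:
                # Upsert by span id: delivering the same span twice overwrites
                # the existing record instead of adding a second one.
                self._store[span["span_id"]] = span
            finally:
                self._queue.task_done()

    def flush(self):
        self._queue.join()  # block until every queued span has been handled


store = {}
exporter = BackgroundExporter(store)
span = {"span_id": "abc123", "name": "openai.chat", "duration_ms": 490}
exporter.export(span)
exporter.export(span)  # simulated retry of the same span
exporter.flush()
print(len(store))  # 1
```

A real exporter would also batch spans and bound the queue size, but the two properties from the list above already show up in this small version.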
# 1. Call this before any other imports
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://peekr.starkspherelabs.com",
        api_key="pk_live_…",
    ),
    evaluators=[peekr.eval.Hallucination(detailed=True)],
)

# 2. Your code is unchanged: everything below is traced
from openai import OpenAI, AsyncOpenAI

client = OpenAI()  # ← patched automatically
Pricing
Free to start. Scales with your spans.
Start observing in two lines.
Free up to 10k spans per month. No credit card required. Remove the import if you change your mind — your agent code stays untouched.
Building a multi-step agent? See AI Agent Observability →
Also need guardrails? See Peekr Guardrails →