LLM Observability · Peekr Cloud

See exactly what your
AI agent is doing.

Auto-instruments OpenAI, Anthropic, Gemini, and Bedrock. Every span, token count, latency, and hallucination verdict — captured with two lines of code, no proxy, no lock-in.

Start free — 10k spans/mo See live demo

OpenAI · Anthropic · Gemini · Bedrock · LangChain · CrewAI · LlamaIndex

Three things every AI team needs

Trace. Cost. Score.

The same three primitives that run in the OSS library — now with a shared dashboard, longer retention, and hosted evaluators.

Trace

Full call tree — every LLM call, tool use, and agent step as a span.

Auto-patch at class level — all OpenAI() instances captured
Async + streaming — AsyncOpenAI, streamed responses rolled up
Waterfall view with input/output and human-readable labels
Cost in cents per span, per trace, per tenant

Cost

Know what each query costs before it shows up on your bill.

Cost per query trend — 30-day chart, unit economics over time
By operation — which endpoint or workflow costs the most
By tenant — per-customer cost breakdown at a glance
Savings recommendations — model-swap, output-cap, caching

Score

See which sentence was wrong. Not just a 0–1 score.

Claim decomposition — supported / contradicted / unsupported per sentence
Context-aware — evaluator reads your retrieved docs, not the full prompt
Worst-offenders list — lowest faithfulness traces ranked for triage
Quality trend over 30 days — know if your agent is regressing

Observability scores hallucinations. It tells you which sentence was wrong and why, so you can fix your prompts and retrieval. If you need to block bad outputs before users see them, that's Guardrails.

See Guardrails →

Four walls every AI team hits

The same four problems — made obvious.

The complaint

“My agent gave the wrong answer.”

See exactly what the LLM received — not what you think you sent.

Malformed tool output is the silent killer. The trace shows the full call tree: tool inputs, outputs, and exactly what the LLM got. You find the mismatch in seconds, not hours.

agent.run  2100ms
  └─ tool.fetch_user  12ms
       out: null          ← returned null
  └─ openai.chat      2088ms
       in: "User profile: null…"  ← LLM received garbage
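
A minimal sketch of how a trace like this can be checked in code. The `Span` record and its field names are hypothetical, chosen for illustration, and are not Peekr's schema. It walks the call tree and flags tool spans that returned nothing.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    name: str
    duration_ms: int
    input: Optional[Any] = None
    output: Optional[Any] = None
    children: List["Span"] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def null_tool_outputs(root: Span) -> List[str]:
    """Return the names of tool spans whose output is missing."""
    return [
        s.name
        for s in walk(root)
        if s.name.startswith("tool.") and s.output is None
    ]


# The trace from the excerpt above, rebuilt by hand.
trace = Span(
    name="agent.run",
    duration_ms=2100,
    children=[
        Span(name="tool.fetch_user", duration_ms=12, output=None),
        Span(name="openai.chat", duration_ms=2088, input="User profile: null…"),
    ],
)

print(null_tool_outputs(trace))
```

Run against the excerpt, the check singles out `tool.fetch_user`, the span that handed the LLM a null profile.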

The complaint

“My agent is hallucinating.”

Know which exact claim was wrong — not just a 0–1 score.

Peekr breaks every LLM response into sentences and assigns each one a verdict: supported, contradicted, or unsupported. You see exactly which claim failed and why.

eval_scores: { Hallucination: 0.00 }

  ✗ contradicted  "1923"
  ✗ contradicted  "Frank Lloyd Wright"
  ~ unsupported   "World's Fair"
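
A sketch of how per-claim verdicts can roll up into one number. The scoring rule here (fraction of claims marked supported) is an assumption that happens to reproduce the 0.00 above; the page does not state the evaluator's actual formula. The `Claim` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    """Hypothetical per-claim result; not Peekr's actual output type."""
    text: str
    verdict: str  # "supported", "contradicted", or "unsupported"


def supported_fraction(claims: List[Claim]) -> float:
    """Assumed scoring rule: share of claims with a 'supported' verdict."""
    if not claims:
        return 1.0  # design choice: nothing claimed, nothing to contradict
    supported = sum(1 for c in claims if c.verdict == "supported")
    return supported / len(claims)


def failed_claims(claims: List[Claim]) -> List[Claim]:
    """Claims worth triaging: anything not backed by the context."""
    return [c for c in claims if c.verdict != "supported"]


claims = [
    Claim("1923", "contradicted"),
    Claim("Frank Lloyd Wright", "contradicted"),
    Claim("World's Fair", "unsupported"),
]

print(f"{supported_fraction(claims):.2f}")
for claim in failed_claims(claims):
    print(claim.verdict, repr(claim.text))
```

The single score tells you something is wrong. The list of failed claims tells you what to fix.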

The complaint

“My agent is too slow.”

The LLM is almost never the bottleneck.

Most teams swap models first. The waterfall shows wall-clock time split across every tool and LLM call. 80% of the time the fix is caching or speeding up a slow tool, not switching to a different model.

agent.run  4300ms
  └─ tool.search_web  3800ms  ← 88% of time. Cache this.
  └─ tool.rerank         18ms
  └─ openai.chat        490ms  ← not your problem
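
The percentages in a waterfall are plain arithmetic. Here is a small sketch using the durations from the trace above; the function name is hypothetical. Child spans can overlap, so the shares need not add up to exactly 100%.

```python
from typing import Dict, List, Tuple


def time_shares(total_ms: int, spans: Dict[str, int]) -> List[Tuple[str, float]]:
    """Return (span name, percent of total wall-clock time), slowest first."""
    shares = [(name, 100.0 * ms / total_ms) for name, ms in spans.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Durations copied from the trace above; agent.run took 4300 ms in total.
children = {"tool.search_web": 3800, "tool.rerank": 18, "openai.chat": 490}

for name, pct in time_shares(4300, children):
    print(f"{name:<16} {pct:5.1f}%")
```

Sorting by share puts the slow tool at the top, which is where the fix belongs.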

The complaint

“My API bill keeps climbing.”

Spend growing faster than traffic means cost per query is rising, usually from prompt bloat.

Peekr shows the cost-per-query trend over 30 days, broken down by operation. If the trend is rising, something is wrong: usually unbounded conversation history or a disproportionately expensive feature.

Trace 1:  18,432 tokens  · $0.018
Trace 2:  21,104 tokens  · $0.021
Trace 3:  24,891 tokens  · $0.025  ← unbounded growth

Cost by operation: chat_summary 67% of spend
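
A sketch of the check behind that arrow. The price of $1.00 per million tokens is an assumption inferred from the sample figures above, and the 10% growth threshold is an arbitrary illustration; neither is a documented Peekr setting.

```python
from typing import List

# Assumption: inferred from the sample figures, not a published price.
PRICE_PER_MILLION_TOKENS_USD = 1.00


def cost_usd(tokens: int) -> float:
    """Cost of one trace at the assumed flat per-token price."""
    return tokens * PRICE_PER_MILLION_TOKENS_USD / 1_000_000


def grows_every_trace(token_counts: List[int], min_growth: float = 0.10) -> bool:
    """True when each trace uses at least min_growth more tokens than the last."""
    if len(token_counts) < 2:
        return False
    pairs = zip(token_counts, token_counts[1:])
    return all(curr >= prev * (1 + min_growth) for prev, curr in pairs)


tokens = [18_432, 21_104, 24_891]

for count in tokens:
    print(f"{count:,} tokens  ${cost_usd(count):.3f}")
print("unbounded growth suspected:", grows_every_trace(tokens))
```

Each trace here is roughly 14% to 18% larger than the one before it, the pattern you would expect when conversation history is appended without a cap.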

Two lines. That's it.

Instrument before your first import. Everything else is automatic.

Peekr patches at the class level — every OpenAI(), AsyncOpenAI(), and anthropic.Anthropic() instance is covered. No wrappers. No proxy. No framework-specific configuration.
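
To see why class-level patching covers every instance, here is a minimal sketch of the general technique. It is not Peekr's implementation: `FakeClient`, `patch_method`, and `RECORDED_SPANS` are hypothetical names, and real SDKs route calls through nested resource classes, so the actual patch targets differ.

```python
import functools
import time
from typing import Any, Callable, Dict, List

RECORDED_SPANS: List[Dict[str, Any]] = []


class FakeClient:
    """Stand-in for an SDK client class; not a real library class."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def patch_method(cls: type, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records one span per call."""
    original: Callable[..., Any] = getattr(cls, method_name)
    if getattr(original, "_is_traced", False):
        return  # already patched; calling this twice must not double-wrap

    @functools.wraps(original)
    def traced(self: Any, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            RECORDED_SPANS.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    traced._is_traced = True
    setattr(cls, method_name, traced)


early = FakeClient()                  # created before the patch
patch_method(FakeClient, "complete")
late = FakeClient()                   # created after the patch

early.complete("a")
late.complete("b")
print(len(RECORDED_SPANS))
```

Python looks methods up on the class at call time, so the instance created before the patch is traced just like the one created after it. No wrapper objects are needed.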

Zero latency overhead — spans export on a background thread
Idempotent upsert — retries never duplicate spans
Works alongside JSONL / SQLite for local dev
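
A sketch of how background export and idempotent upsert fit together, under stated assumptions: `InMemoryStore` and `BackgroundExporter` are hypothetical classes, and keying on `span_id` is an assumed design, not Peekr's documented behavior. The request path only pays for an enqueue; the write happens on a worker thread.

```python
import queue
import threading
from typing import Dict, Optional


class InMemoryStore:
    """Hypothetical backend keyed by span_id, so a repeated write overwrites."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def upsert(self, span: dict) -> None:
        with self._lock:
            self.rows[span["span_id"]] = span


class BackgroundExporter:
    """Queues spans on the caller's thread and writes them on a worker thread."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store
        self._queue: "queue.Queue[Optional[dict]]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def export(self, span: dict) -> None:
        """Called on the request path: enqueue only, no I/O."""
        self._queue.put(span)

    def _run(self) -> None:
        while True:
            span = self._queue.get()
            if span is None:  # sentinel: stop the worker
                break
            self._store.upsert(span)

    def shutdown(self) -> None:
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


store = InMemoryStore()
exporter = BackgroundExporter(store)
span = {"span_id": "s-1", "name": "openai.chat", "duration_ms": 490}
exporter.export(span)
exporter.export(span)  # simulated retry of the same span
exporter.shutdown()
print(len(store.rows))
```

Because the store is keyed by span ID, the retried span overwrites itself instead of creating a duplicate.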

agent.py

# 1. Import peekr and call instrument() before any other imports
import peekr

peekr.instrument(
  tenant_id="acme",
  exporter=peekr.HTTPExporter(
    endpoint="https://peekr.starkspherelabs.com",
    api_key="pk_live_…",
  ),
  evaluators=[peekr.eval.Hallucination(detailed=True)],
)

# 2. Your code is unchanged — everything below is traced
from openai import OpenAI, AsyncOpenAI
client = OpenAI()  # ← patched automatically