LLM Observability · Peekr Cloud

See exactly what your AI agent is doing.

Auto-instruments OpenAI, Anthropic, Gemini, and Bedrock. Every span, token count, latency, and hallucination verdict — captured with two lines of code, no proxy, no lock-in.

OpenAI Anthropic Gemini Bedrock LangChain CrewAI LlamaIndex

Three things every AI team needs

Trace. Cost. Score.

The same three primitives that run in the OSS library — now with a shared dashboard, longer retention, and hosted evaluators.

Trace

Full call tree — every LLM call, tool use, and agent step as a span.

  • Auto-patch at class level — all OpenAI() instances captured
  • Async + streaming — AsyncOpenAI, streamed responses rolled up
  • Waterfall view with input/output and human-readable labels
  • Cost in cents per span, per trace, per tenant
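
As a rough illustration of the streaming bullet above, a streamed response can be rolled up into a single span by passing chunks through to the caller while collecting them on the side. This is a minimal sketch, not Peekr's implementation: `rolled_up` and the span dictionary shape are hypothetical names chosen for the example.

```python
def rolled_up(chunks, on_complete):
    """Yield each chunk unchanged, then report one combined record.

    `chunks` is any iterable of text pieces; `on_complete` is a callback
    that receives a single dict once the stream ends (hypothetical shape).
    """
    parts = []
    try:
        for chunk in chunks:
            parts.append(chunk)
            yield chunk  # the caller still sees the stream as it arrives
    finally:
        # Runs when the stream is exhausted or the generator is closed early.
        on_complete({"output": "".join(parts), "chunk_count": len(parts)})


spans = []
streamed = list(rolled_up(iter(["Hel", "lo", "!"]), spans.append))
print(streamed)
print(spans)
```

The caller's code path is unchanged: it iterates the stream as before, and the combined output is recorded once at the end.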

Cost

Know what each query costs before it shows up on your bill.

  • Cost per query trend — 30-day chart, unit economics over time
  • By operation — which endpoint or workflow costs the most
  • By tenant — per-customer cost breakdown at a glance
  • Savings recommendations — model-swap, output-cap, caching

Score

See which sentence was wrong. Not just a 0–1 score.

  • Claim decomposition — supported / contradicted / unsupported per sentence
  • Context-aware — evaluator reads your retrieved docs, not the full prompt
  • Worst-offenders list — lowest faithfulness traces ranked for triage
  • Quality trend over 30 days — know if your agent is regressing

Observability scores hallucinations. It tells you which sentence was wrong and why, so you can fix your prompts and retrieval. If you need to block bad outputs before users see them, that's Guardrails.

See Guardrails →

Four walls every AI team hits

The same four problems — made obvious.

The complaint

My agent gave the wrong answer.

See exactly what the LLM received — not what you think you sent.

Malformed tool output is the silent killer. The trace shows the full call tree: tool inputs, outputs, and exactly what the LLM got. You find the mismatch in seconds, not hours.

agent.run  2100ms
  └─ tool.fetch_user  12ms
       out: null          ← returned null
  └─ openai.chat      2088ms
       in: "User profile: null…"  ← LLM received garbage
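
To make the call tree above concrete, here is a small sketch of a span tree and a text renderer, plus a helper that flags steps whose output was `null`. The `Span` class, `render`, and `null_outputs` are hypothetical names for illustration and do not describe Peekr's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    duration_ms: int
    input: Optional[str] = None   # what the step received, if recorded
    output: Optional[str] = None  # what the step returned, if recorded
    children: List["Span"] = field(default_factory=list)


def render(span, depth=0):
    """Return the call tree as a list of indented text lines."""
    indent = "  " * depth
    marker = "└─ " if depth else ""
    lines = [f"{indent}{marker}{span.name}  {span.duration_ms}ms"]
    if span.input is not None:
        lines.append(f"{indent}     in: {span.input}")
    if span.output is not None:
        lines.append(f"{indent}     out: {span.output}")
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


def null_outputs(span):
    """Collect names of spans whose recorded output is the literal 'null'."""
    found = [span.name] if span.output == "null" else []
    for child in span.children:
        found.extend(null_outputs(child))
    return found


trace = Span("agent.run", 2100, children=[
    Span("tool.fetch_user", 12, output="null"),
    Span("openai.chat", 2088, input="User profile: null…"),
])

print("\n".join(render(trace)))
print(null_outputs(trace))
```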

The complaint

My agent is hallucinating.

Know which exact claim was wrong — not just a 0–1 score.

Peekr breaks every LLM response into sentences and assigns each one a verdict: supported, contradicted, or unsupported. You see exactly which claim failed and why.

eval_scores: { Hallucination: 0.00 }

  ✗ contradicted  "1923"
  ✗ contradicted  "Frank Lloyd Wright"
  ~ unsupported   "World's Fair"
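
One way per-sentence verdicts like these could roll up into a trace-level score and a worst-offenders ranking is sketched below. It assumes the score is the fraction of sentences marked supported, which is consistent with the 0.00 above but is not confirmed as Peekr's formula. `Claim`, `score`, `worst_offenders`, and the `trace-2` data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Claim:
    text: str
    verdict: str  # "supported", "contradicted", or "unsupported"


def score(claims: List[Claim]) -> float:
    """Fraction of claims marked supported (assumed formula); 1.0 if empty."""
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if claim.verdict == "supported")
    return supported / len(claims)


def worst_offenders(traces: Dict[str, List[Claim]], limit: int = 10) -> List[Tuple[str, float]]:
    """Return (trace_id, score) pairs, lowest score first."""
    ranked = sorted(((tid, score(cs)) for tid, cs in traces.items()),
                    key=lambda pair: pair[1])
    return ranked[:limit]


traces = {
    "trace-1": [
        Claim("1923", "contradicted"),
        Claim("Frank Lloyd Wright", "contradicted"),
        Claim("World's Fair", "unsupported"),
    ],
    # Made-up second trace so the ranking has something to compare against.
    "trace-2": [
        Claim("hypothetical claim A", "supported"),
        Claim("hypothetical claim B", "unsupported"),
    ],
}

print(worst_offenders(traces))
```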

The complaint

My agent is too slow.

The LLM is almost never the bottleneck.

Most teams swap models first. The waterfall shows how wall-clock time splits across every tool and LLM call. 80% of the time, the fix is a cache or a slow tool, not a different model.

agent.run  4300ms
  └─ tool.search_web  3800ms  ← 88% of time. Cache this.
  └─ tool.rerank         18ms
  └─ openai.chat        490ms  ← not your problem
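
The percentage in the waterfall above is simple arithmetic: each child span's duration divided by the parent's wall-clock time. A minimal sketch using the figures from the example (`time_shares` is a hypothetical helper name):

```python
def time_shares(total_ms, children):
    """Map each child span name to its percent of the parent's wall-clock time."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    return {name: 100 * ms / total_ms for name, ms in children.items()}


# Durations taken from the waterfall example above.
children = {"tool.search_web": 3800, "tool.rerank": 18, "openai.chat": 490}
shares = time_shares(4300, children)

slowest = max(shares, key=shares.get)
print(slowest, round(shares[slowest]))
```

Sorting by share rather than by absolute duration makes the bottleneck comparable across traces of different lengths.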

The complaint

My API bill keeps climbing.

Cost-per-query growing faster than traffic means prompt bloat.

Peekr shows cost-per-query trend over 30 days broken down by operation. If it's growing faster than traffic, something is wrong — usually unbounded conversation history or a disproportionately expensive feature.

Trace 1:  18,432 tokens  · $0.018
Trace 2:  21,104 tokens  · $0.021
Trace 3:  24,891 tokens  · $0.025  ← unbounded growth

Cost by operation: chat_summary 67% of spend
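
The numbers above can be reproduced with a short sketch. It assumes a flat rate of $1.00 per million tokens, inferred from the sample figures rather than taken from any real price list, and the per-operation records are made up to illustrate the breakdown. All function names are hypothetical.

```python
from collections import defaultdict

# Assumed blended rate in USD, inferred from the sample traces above.
PRICE_PER_MILLION_TOKENS = 1.00


def cost_usd(tokens):
    """Cost of a trace at the assumed flat token rate."""
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000


def is_growing(values):
    """True when every value is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


def share_by_operation(records):
    """Map operation -> percent of total spend for (operation, tokens) records."""
    totals = defaultdict(float)
    for operation, tokens in records:
        totals[operation] += cost_usd(tokens)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {op: 100 * cost / grand_total for op, cost in totals.items()}


trace_tokens = [18_432, 21_104, 24_891]
costs = [round(cost_usd(t), 3) for t in trace_tokens]
print(costs, is_growing(trace_tokens))

# Hypothetical records chosen so chat_summary accounts for about 67% of spend.
records = [("chat_summary", 67_000), ("search", 33_000)]
print({op: round(pct) for op, pct in share_by_operation(records).items()})
```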

Two lines. That's it.

Instrument before your first import. Everything else is automatic.

Peekr patches at the class level — every OpenAI(), AsyncOpenAI(), and anthropic.Anthropic() instance is covered. No wrappers. No proxy. No framework-specific configuration.
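
To show why class-level patching covers every instance, here is a minimal sketch that wraps a method on the class itself. `FakeClient` is a stand-in, not the OpenAI or Anthropic SDK, and `patch_class` is a hypothetical helper; Peekr's own patching logic may differ.

```python
import functools
import time


class FakeClient:
    """Stand-in for an SDK client class (not a real SDK)."""

    def create(self, prompt):
        return f"echo: {prompt}"


captured_spans = []  # in-memory sink for this sketch


def patch_class(cls, method_name):
    """Replace cls.<method_name> with a wrapper that records one span per call."""
    original = getattr(cls, method_name)
    if getattr(original, "_is_traced", False):
        return  # already patched, so patching twice does not double-wrap

    @functools.wraps(original)
    def traced(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            captured_spans.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    traced._is_traced = True
    setattr(cls, method_name, traced)


early = FakeClient()               # created before patching
patch_class(FakeClient, "create")
late = FakeClient()                # created after patching

early.create("a")
late.create("b")
print(len(captured_spans))  # both instances were captured
```

Because method lookup goes through the class, instances created before and after the patch both hit the wrapper, which is why no per-instance wrapper is needed.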

  • Zero latency overhead — spans export on a background thread
  • Idempotent upsert — retries never duplicate spans
  • Works alongside JSONL / SQLite for local dev

agent.py
# 1. Call this before any other imports
import peekr

peekr.instrument(
  tenant_id="acme",
  exporter=peekr.HTTPExporter(
    endpoint="https://peekr.starkspherelabs.com",
    api_key="pk_live_…",
  ),
  evaluators=[peekr.eval.Hallucination(detailed=True)],
)

# 2. Your code is unchanged — everything below is traced
from openai import OpenAI, AsyncOpenAI
client = OpenAI()  # ← patched automatically
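
The background export and idempotent upsert described above could work roughly as follows. This is a sketch under stated assumptions, not Peekr's exporter: `BackgroundExporter` and `InMemoryStore` are hypothetical classes, and the store stands in for the hosted backend.

```python
import queue
import threading


class InMemoryStore:
    """Stand-in for the backend; spans are upserted by span_id."""

    def __init__(self):
        self.spans = {}
        self.writes = 0

    def upsert(self, span):
        self.writes += 1
        self.spans[span["span_id"]] = span  # same id overwrites, never duplicates


class BackgroundExporter:
    """Hands spans to a worker thread so the caller only pays for an enqueue."""

    def __init__(self, store):
        self._store = store
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def export(self, span):
        """Called on the request path: enqueue and return immediately."""
        self._queue.put(span)

    def flush(self):
        """Block until every queued span has been written."""
        self._queue.join()

    def _run(self):
        while True:
            span = self._queue.get()
            try:
                self._store.upsert(span)
            finally:
                self._queue.task_done()


store = InMemoryStore()
exporter = BackgroundExporter(store)

span = {"span_id": "abc123", "name": "openai.chat", "duration_ms": 490}
exporter.export(span)
exporter.export(span)  # simulated retry of the same span
exporter.flush()

print(store.writes, len(store.spans))
```

Keying writes on a stable span id is what makes a retry safe: the second write replaces the first instead of adding a row.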

Pricing

Free to start. Scales with your spans.

Free
$0/mo

10k spans/mo

  • All traces
  • 7-day retention
  • All evaluators

Get started

Starter
$29/mo

500k spans/mo

  • 30-day retention
  • Alerts
  • Email support

Get started

Pro (Popular)
$99/mo

5M spans/mo

  • 90-day retention
  • Fine-tune export
  • Priority support

Get started

Scale
$399/mo

50M spans/mo

  • Custom retention
  • SLA
  • Dedicated support

Get started

Start observing in two lines.

Free up to 10k spans per month. No credit card required. Remove the import and the peekr.instrument() call if you change your mind; your agent code stays untouched.

Building a multi-step agent? See AI Agent Observability →

Also need guardrails? See Peekr Guardrails →