Observability

Call peekr.instrument() once and Peekr records a span for every LLM call — model, tokens, latency, status, inputs, and outputs. No wrappers and no proxy.

What gets captured

Instrument before you import any LLM SDK. Peekr patches the client classes for OpenAI, Anthropic, AWS Bedrock, and Google Gemini, so every call after that is traced automatically.

agent.py
import peekr
peekr.instrument()        # patches OpenAI, Anthropic, Bedrock, Gemini at the class level

from openai import OpenAI  # import AFTER instrument()
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this contract..."}],
)
# Span written: model, tokens_input, tokens_output, tokens_total, input, output, duration_ms, status

Warning

Call order matters. peekr.instrument() must run before you import or instantiate any LLM SDK. The instrumentation patches at the class level, so a client created before the call is never traced.

Every span carries the following fields.

Top-level: trace_id, span_id, parent_id, name, start_time, end_time, duration_ms, status (ok | error), tenant_id, retention_class.

Under attributes: model, tokens_input, tokens_output, tokens_total, input, output, error, session_id, user_id, eval_scores, experiment_variant, feature, endpoint.
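To make the span layout concrete, here is a minimal sketch of it as a Python dataclass. It is not Peekr's own class: the field types, the use of Unix seconds for timestamps, and deriving duration_ms from the two timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Span:
    """Illustrative span record; field names follow the list above."""
    trace_id: str
    span_id: str
    name: str
    start_time: float                 # assumed: Unix seconds
    end_time: float                   # assumed: Unix seconds
    status: str = "ok"                # "ok" or "error"
    parent_id: Optional[str] = None   # None for the root span of a trace
    tenant_id: Optional[str] = None
    retention_class: str = "default"
    # Holds model, tokens_*, input, output, session_id, user_id, ...
    attributes: Dict[str, Any] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        # Derived here for brevity; a stored column would work just as well.
        return (self.end_time - self.start_time) * 1000.0


span = Span(
    trace_id="t1",
    span_id="s1",
    name="openai.chat",
    start_time=10.0,
    end_time=10.5,
    attributes={"model": "gpt-4o", "tokens_input": 120,
                "tokens_output": 30, "tokens_total": 150},
)
print(span.duration_ms, span.status, span.attributes["model"])
```

Keeping the identifiers and timing at the top level and everything else under attributes matches the json_extract(attributes, ...) style of query used for local SQLite storage.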

Cost & latency

Token counts (tokens_input / tokens_output / tokens_total) let you attribute spend per call, and duration_ms on each span shows where wall-clock time actually goes. In multi-step agents that is often a tool or retrieval step rather than the model call itself, so check the numbers before optimising.

sql
-- Average tokens and duration per model for OpenAI chat spans, from local SQLite (storage="sqlite")
SELECT
    json_extract(attributes, '$.model')             AS model,
    AVG(json_extract(attributes, '$.tokens_total')) AS avg_tokens,
    AVG(duration_ms)                                AS avg_ms,
    COUNT(*)                                        AS calls
FROM spans
WHERE name LIKE 'openai.chat%'   -- drop this filter to include other providers
GROUP BY model
ORDER BY avg_tokens DESC;
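If you would rather run that query from Python, the sketch below does so against an in-memory database and then turns token counts into an estimated spend. The table layout (only the columns the query touches), the span name, the sample rows, and above all the prices are assumptions; substitute your provider's real rates.

```python
import json
import sqlite3

# Assumed schema: just the columns the query above reads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (name TEXT, duration_ms REAL, attributes TEXT)")

sample = [  # made-up spans
    ("openai.chat.completions", 800.0,
     {"model": "gpt-4o", "tokens_input": 900, "tokens_output": 100, "tokens_total": 1000}),
    ("openai.chat.completions", 1200.0,
     {"model": "gpt-4o", "tokens_input": 1500, "tokens_output": 500, "tokens_total": 2000}),
    ("openai.chat.completions", 300.0,
     {"model": "gpt-4o-mini", "tokens_input": 400, "tokens_output": 100, "tokens_total": 500}),
]
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [(name, ms, json.dumps(attrs)) for name, ms, attrs in sample],
)

QUERY = """
SELECT
    json_extract(attributes, '$.model')             AS model,
    AVG(json_extract(attributes, '$.tokens_total')) AS avg_tokens,
    AVG(duration_ms)                                AS avg_ms,
    COUNT(*)                                        AS calls
FROM spans
WHERE name LIKE 'openai.chat%'
GROUP BY model
ORDER BY avg_tokens DESC;
"""
per_model = conn.execute(QUERY).fetchall()

# Hypothetical USD prices per 1M tokens as (input, output). NOT real rates.
PRICE_PER_MTOK = {"gpt-4o": (1.00, 4.00), "gpt-4o-mini": (0.10, 0.40)}

spend = {}
for (attrs_json,) in conn.execute("SELECT attributes FROM spans"):
    attrs = json.loads(attrs_json)
    price_in, price_out = PRICE_PER_MTOK[attrs["model"]]
    cost = (attrs["tokens_input"] * price_in
            + attrs["tokens_output"] * price_out) / 1_000_000
    spend[attrs["model"]] = spend.get(attrs["model"], 0.0) + cost

for row in per_model:
    print(row)
print(spend)
```

Input and output tokens are priced separately here because providers typically charge different rates for each, which tokens_total alone cannot capture.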

Sessions & multi-tenant

Wrap a multi-turn conversation in peekr.session(...) to group its spans and tag them with a user, tenant, and retention class. Everything inside the block shares those values.

python
# Assumes peekr.instrument() has already run and client is the OpenAI client from agent.py
with peekr.session(
    user_id="user-abc",
    tenant_id="acme",
    session_id="user-abc-turn-3",
    retention_class="short",      # default | short | long | pii
):
    resp1 = client.chat.completions.create(...)
    resp2 = client.chat.completions.create(...)
    # Both spans share session_id, user_id, tenant_id, retention_class
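One common way to get this "everything inside the block shares the values" behaviour in Python is a context manager backed by contextvars. The sketch below shows that pattern with stand-in functions. It is not Peekr's implementation, and session and record_span here are hypothetical names defined only for this example.

```python
import contextvars
from contextlib import contextmanager

# Tags of the innermost active session; empty outside any session.
_session_tags = contextvars.ContextVar("session_tags", default={})


@contextmanager
def session(**tags):
    """Hypothetical stand-in for a session block: tags apply inside it only."""
    token = _session_tags.set({**_session_tags.get(), **tags})
    try:
        yield
    finally:
        _session_tags.reset(token)  # restore whatever was active before


def record_span(name):
    """Hypothetical span writer: copies the active tags onto the new span."""
    return {"name": name, **_session_tags.get()}


with session(user_id="user-abc", tenant_id="acme", session_id="user-abc-turn-3"):
    first = record_span("call-1")
    second = record_span("call-2")
after = record_span("call-3")  # outside the block, so untagged

print(first)
print(after)
```

A context variable, unlike a plain global, keeps concurrent asyncio tasks from seeing each other's session tags.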

Note

Resolution order: values from session() win over those set in instrument(), which in turn win over the environment variables PEEKR_TENANT_ID and PEEKR_RETENTION_CLASS.
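The precedence rule can be sketched as a small lookup that tries each source in order. The resolve function and its arguments are hypothetical, written only to illustrate the order described in the note; the two environment variable names are the ones given above.

```python
import os

ENV_VARS = {
    "tenant_id": "PEEKR_TENANT_ID",
    "retention_class": "PEEKR_RETENTION_CLASS",
}


def resolve(field, session_values, instrument_values, environ=None):
    """Return the first value found: session(), then instrument(), then env."""
    environ = os.environ if environ is None else environ
    for source in (session_values, instrument_values):
        if source.get(field) is not None:
            return source[field]
    return environ.get(ENV_VARS[field])


env = {"PEEKR_TENANT_ID": "env-co"}
print(resolve("tenant_id", {"tenant_id": "acme"}, {"tenant_id": "globex"}, env))
print(resolve("tenant_id", {}, {"tenant_id": "globex"}, env))
print(resolve("tenant_id", {}, {}, env))
```

In practice this means the environment variables work as deployment-wide defaults that a single request can still override through session().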

Claim-level hallucination detection

peekr.eval.Hallucination() scores how grounded each response is — 1.0 means every claim is supported, 0.0 means fully hallucinated. The score lands in span.attributes["eval_scores"].

python
import peekr

peekr.instrument(
    evaluators=[
        peekr.eval.Hallucination(
            # Pull the grounding context out of each span (e.g. your RAG documents):
            context_extractor=lambda span: span.attributes.get("input"),
            model="gpt-4o-mini",   # judge model
            detailed=True,         # store a verdict for every claim
        ),
    ],
)

With detailed=True, Peekr decomposes the response into atomic claims and writes per-claim verdicts to span.attributes["hallucination_details"]. Each verdict is one of supported, contradicted, or unsupported, so you can see exactly which claim went wrong, not just an aggregate number.
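To show how per-claim verdicts relate to a single score, the sketch below computes the fraction of supported claims and lists the ones that failed. Both the record shape (a list of dicts with claim and verdict keys) and the aggregation rule are assumptions for illustration; Peekr's stored format and scoring formula may differ.

```python
from collections import Counter

# Hypothetical shape for hallucination_details: one dict per atomic claim.
details = [
    {"claim": "The contract runs for 24 months.", "verdict": "supported"},
    {"claim": "Either party may terminate with 30 days notice.", "verdict": "supported"},
    {"claim": "Damages are capped at $1M.", "verdict": "contradicted"},
    {"claim": "The contract was signed in Berlin.", "verdict": "unsupported"},
]


def grounding_score(details):
    """Fraction of claims marked supported (an assumed aggregation rule)."""
    if not details:
        return None  # no claims extracted, so there is nothing to score
    supported = sum(1 for d in details if d["verdict"] == "supported")
    return supported / len(details)


def flagged_claims(details):
    """Claims that the context contradicts or does not support."""
    return [d["claim"] for d in details if d["verdict"] != "supported"]


score = grounding_score(details)
flagged = flagged_claims(details)
counts = Counter(d["verdict"] for d in details)

print(score)
for claim in flagged:
    print("needs review:", claim)
```

Under this rule a response with every claim supported scores 1.0 and one with none supported scores 0.0, which matches the endpoints described above.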

Viewing traces

Choose where spans land with storage in instrument(), then inspect them from the CLI.

1. Open a trace file or database

Add --io to include the captured inputs and outputs.

terminal
peekr view traces.jsonl --io
peekr view traces.db --io

2. Generate a self-contained HTML report

Builds a shareable dashboard from a SQLite database.

terminal
peekr dashboard traces.db -o report.html

3. Replay a single trace

Step through one trace by id to see its full waterfall.

terminal
peekr replay <trace_id> --db traces.db
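If you want to post-process a trace file yourself, a waterfall can be rebuilt from parent_id links. The sketch below assumes the .jsonl file holds one span object per line with the top-level field names listed earlier; the three sample spans are made up, and the exact on-disk layout is an assumption.

```python
import json
from collections import defaultdict

# Made-up spans in JSON Lines form (assumed layout: one span object per line).
JSONL = """\
{"trace_id": "t1", "span_id": "a", "parent_id": null, "name": "agent.run", "duration_ms": 950}
{"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "retrieval", "duration_ms": 600}
{"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "openai.chat", "duration_ms": 300}
"""


def load_spans(text):
    """Parse JSON Lines text into a list of span dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def waterfall(spans, trace_id):
    """Return indented lines for one trace, children nested under parents."""
    children = defaultdict(list)
    for span in spans:
        if span["trace_id"] == trace_id:
            children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


lines = waterfall(load_spans(JSONL), "t1")
print("\n".join(lines))
```

To use it on a real file, replace JSONL with the file's contents, for example open("traces.jsonl").read().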

Peekr Cloud

Point an HTTPExporter at Peekr Cloud to get the trace waterfall, cost and token trends, hallucination distributions, and a worst-offender list with no SQL needed.

python
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://peekr.starkspherelabs.com",
        api_key="pk_live_…",
    ),
)

typescript
import { instrument } from "@peekr/sdk";

instrument({
  exporter: {
    type: "http",
    endpoint: "https://peekr.starkspherelabs.com",
    apiKey: "pk_live_…",
  },
});

Feedback & fine-tuning data

Attach a human rating to any trace with peekr.feedback(...), then export the good ones as a ready-to-train dataset with peekr.export_feedback(...).

python
# Rate a trace
peekr.feedback(trace_id="...", rating="good", note="grounded, correct tone")

# Export the good traces as an OpenAI fine-tuning file
peekr.export_feedback(
    db_path="traces.db",
    filter="good",
    output="training.jsonl",
    format="openai-ft",   # or "raw"
)
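For orientation, OpenAI's chat fine-tuning files are JSON Lines in which each line is an object with a messages list. The sketch below builds such a record from a span's input and output. It assumes input is stored as a list of chat messages and output as the assistant's text; to_chat_example is a hypothetical helper, not part of Peekr, and the exact output of export_feedback may differ.

```python
import json


def to_chat_example(span_attributes):
    """Build one chat-format training record from a span's input and output.

    Assumes input is a list of chat messages and output is assistant text.
    """
    messages = list(span_attributes["input"])  # copy, so the span is untouched
    messages.append({"role": "assistant", "content": span_attributes["output"]})
    return {"messages": messages}


# Made-up attributes from a trace rated "good".
rated_good = [
    {
        "input": [{"role": "user", "content": "What is the notice period?"}],
        "output": "Thirty days.",
    },
]

jsonl = "\n".join(json.dumps(to_chat_example(attrs)) for attrs in rated_good)
print(jsonl)
```

Each line of the result is one training example, so a file of rated traces becomes a file with one line per trace.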