Peekr Cloud · Demo · Acme Agents

Insights

Save money. Fix regressions.

Computed live from your spans.

Available savings

$333 / mo

9 actionable recommendations

Current monthly spend

$565

Projected from the last 24h × 30 days

Active anomalies

Cost / latency / quality drift

What to fix this week

Cascade · rag.answer · medium effort

"rag.answer" fans out into 45 "openai.chat.completions.create" calls per trace

1 trace in 24h · up to 45 "openai.chat.completions.create" calls in one trace (avg 45) · ~$33/mo on these calls

Fan-out

45×

1 trace

Now: 45× openai.chat.completions.create → After fix: ~1 batched call

Seen in 1 trace in the last 24h

One "rag.answer" produces ~45 "openai.chat.completions.create" calls — almost always a per-retrieved-item LLM call. Batch them into one request, or cap the loop. 45 completions for a single answer is a retrieval/ranking bug, not a model problem.

fix.py

# One operation made N LLM calls in a single trace.
# Batch the per-item calls into one request:
inputs = [c.text for c in retrieved[:TOP_K]]   # e.g. TOP_K = 8
client.embeddings.create(model="text-embedding-3-small", input=inputs)

# Or collapse a per-chunk loop into a single completion:
context = "\n\n".join(c.text for c in retrieved[:TOP_K])
client.chat.completions.create(
    model=MODEL,   # required argument; use the model rag.answer already calls
    messages=[{"role": "user", "content": context}],
)
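
A minimal sketch of the single-completion variant, assuming the per-item calls were processing retrieved chunks one at a time. `build_batched_messages`, `TOP_K`, and the prompt wording are hypothetical and not part of Peekr or the OpenAI SDK. The function only builds the request payload, so it runs without an API key.

```python
TOP_K = 8  # assumed cap on retrieved chunks; tune for your retriever


def build_batched_messages(question, chunks, top_k=TOP_K):
    """Build one chat request covering every retrieved chunk."""
    kept = chunks[:top_k]  # cap the loop: chunks past top_k are dropped
    # Number each chunk so the model can cite sources by index.
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(kept, start=1))
    return [
        {"role": "system", "content": "Answer using only the numbered context."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ]


messages = build_batched_messages(
    "What is the refund window?", [f"chunk {n}" for n in range(45)]
)
```

The resulting list can be passed as `messages=` to one `client.chat.completions.create` call in place of the 45 separate calls.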


Model swap · support_bot · low effort

Route short support_bot queries to gpt-4o-mini

110 of 124 support_bot calls in 24h were under 600 input tokens on claude-opus-4-7.

Save / mo

$74

99% on feature

Now: claude-opus-4-7 · 110 calls/24h → gpt-4o-mini

Short prompts don't need a frontier model. Add a length check at the dispatcher: if tokens_input < 600, use gpt-4o-mini; otherwise fall back. Quality drop is typically negligible at this length.

fix.py

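# Route short prompts to cheaper model
# tokens: input-token count of this request
# "gpt-4o" stands in for the fallback; support_bot currently runs on claude-opus-4-7
model = "gpt-4o-mini" if tokens < 600 else "gpt-4o"
client.chat.completions.create(model=model, ...)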
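
Because the current model (claude-opus-4-7) and the suggested one (gpt-4o-mini) come from different providers, the dispatcher likely needs to pick a provider as well as a model. The sketch below is one way to do that. The routing table, `estimate_tokens`, and `pick_model` are hypothetical names, and the 4-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
SHORT_PROMPT_TOKENS = 600  # threshold from the recommendation above

# Hypothetical routing table: (provider, model) per tier.
CHEAP = ("openai", "gpt-4o-mini")
FALLBACK = ("anthropic", "claude-opus-4-7")


def estimate_tokens(text):
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def pick_model(prompt, threshold=SHORT_PROMPT_TOKENS):
    """Return the (provider, model) pair for this prompt."""
    return CHEAP if estimate_tokens(prompt) < threshold else FALLBACK
```

A prompt of exactly 600 estimated tokens goes to the fallback, matching the strict `tokens_input < 600` check.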


Fine-tune · support_bot · high effort

Fine-tune for support_bot (high-volume on premium model)

124 support_bot calls in 24h, 100% on premium models.

Save / mo

$74

75% on feature

At this volume a fine-tuned smaller model typically reaches ≥95% of frontier quality on a constrained task. Sample 5k spans, fine-tune gpt-4o-mini, A/B against current. Training cost recovers in ~5 days at current spend.

fix.py

# Export training data from Peekr spans
# then fine-tune gpt-4o-mini on your task
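
The export step could look roughly like this. The span shape (a dict with `input` and `output` keys) is an assumption, since this page does not show Peekr's export format. The output is the chat-style JSONL that OpenAI fine-tuning accepts, one `{"messages": [...]}` object per line.

```python
import json


def spans_to_jsonl(spans):
    """Convert exported spans into chat-format fine-tuning lines."""
    lines = []
    for span in spans:
        # Skip spans missing either side of the exchange.
        if not span.get("input") or not span.get("output"):
            continue
        record = {
            "messages": [
                {"role": "user", "content": span["input"]},
                {"role": "assistant", "content": span["output"]},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Hypothetical spans: the second is dropped because its input is empty.
sample = [
    {"input": "Where is my order?", "output": "You can track it under Orders."},
    {"input": "", "output": "dropped: empty input"},
]
jsonl = spans_to_jsonl(sample)
```

From there, the file can be uploaded with `client.files.create(..., purpose="fine-tune")` and a job started with `client.fine_tuning.jobs.create(...)`. Filtering out low-faithfulness spans before export is probably worthwhile, so the smaller model is not trained on answers the eval already flagged.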


Prompt caching · chat_summary · low effort

Enable prompt caching for chat_summary

84 calls share a 2.4k-token system prompt on claude-opus-4-7.

Save / mo

$73

81% on feature

Now: claude-opus-4-7 · 84 calls/24h

Anthropic prompt caching cuts repeated system-prompt cost by ~90% after the first hit. Set cache_control: {"type":"ephemeral"} on the system block — no code path change required on Peekr's side.

fix.py

# Anthropic — cache the system prompt
client.messages.create(
    system=[{"type": "text", "text": prompt,
             "cache_control": {"type": "ephemeral"}}],
    ...
)
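
To see where the ~90% figure comes from, here is a rough best-case calculation. It assumes Anthropic's published multipliers for the 5-minute cache (writes billed at 1.25× the base input price, reads at 0.1×) and that every call after the first hits the cache. Over 24 hours, gaps longer than the cache lifetime cause re-writes, so real savings will be lower. `cached_cost_ratio` is a hypothetical helper.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # first call writes the cache (5-minute tier)
CACHE_READ_MULTIPLIER = 0.10   # later calls read it


def cached_cost_ratio(n_calls):
    """System-prompt cost with caching, as a fraction of the uncached cost."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    cached = CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (n_calls - 1)
    return cached / n_calls


ratio = cached_cost_ratio(84)  # the 84 chat_summary calls seen in 24h
```

For 84 calls the ratio is about 0.11, so the cached system prompt costs roughly 11% of the uncached one in the best case. A single call costs more with caching (ratio 1.25), which is why the saving only appears on repeated prompts.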


Hallucination $ · chat_summary · medium effort

chat_summary: $27/mo on answers below 0.5 faithfulness

9 of 84 chat_summary calls scored below 0.5 faithfulness (mean 0.38) — $27/mo paid for output your eval flagged as unsupported.

Save / mo

$27

11% on feature

A low-faithfulness answer costs the same to generate as a correct one — then costs you again in retries, support, and lost trust. Tighten grounding (better retrieval, require citations), or block-and-retry: Peekr's HallucinationBlock guard can withhold an answer scoring below your floor before it reaches the user.

fix.py

# Stop serving (and paying for) unsupported answers
peekr.instrument(
    evaluators=[peekr.eval.Hallucination()],
    guards=[peekr.guard.HallucinationBlock(min_score=0.6)],  # withhold + retry
)
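
The withhold-and-retry pattern itself can be sketched without any Peekr API. The code below is not Peekr's implementation of `HallucinationBlock`. `answer_with_floor`, `generate`, and `score` are hypothetical stand-ins for a model call and a faithfulness evaluator.

```python
def answer_with_floor(generate, score, min_score=0.6, max_attempts=3):
    """Return the first answer scoring at or above min_score, else None."""
    for _ in range(max_attempts):
        answer = generate()
        if score(answer) >= min_score:
            return answer
    return None  # withhold: the caller shows a fallback message instead


# Stubs standing in for a model call and a faithfulness evaluator.
drafts = iter(["unsupported claim", "grounded answer"])
scores = {"unsupported claim": 0.38, "grounded answer": 0.82}
result = answer_with_floor(lambda: next(drafts), scores.get)
```

Each retry is another paid generation, so `max_attempts` caps how much extra spend a stubbornly low-scoring input can trigger.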


Fine-tune · search_qa · high effort

Fine-tune for search_qa (high-volume on premium model)

135 search_qa calls in 24h, 100% on premium models.

Save / mo

$24

75% on feature

fix.py

# Export training data from Peekr spans
# then fine-tune gpt-4o-mini on your task


Wasted spend · triage · low effort

~$16/mo spent on inputs triaged as low-attention

5 traces were scored low-attention (e.g. "minimal") but still ran LLM calls — model spend on inputs your own router decided weren't worth the work.

Save / mo

$16

3% on feature

Gate the expensive call on the triage score: if the router scores an input "minimal" or "ignore", short-circuit before the model call (a templated reply, a cached answer, or nothing). You're paying frontier-model rates to think about messages you already decided to skim.

fix.py

# Gate the expensive call on the triage score
def handle(message):
    level = router.score_attention(message)        # "full" | "minimal" | "ignore"
    if level in ("minimal", "ignore"):
        return cheap_reply(message)                # template / cache, no model call
    return client.chat.completions.create(...)     # only when it's worth it


Hallucination $ · code_assist · medium effort

code_assist: $13/mo on answers below 0.5 faithfulness

4 of 57 code_assist calls scored below 0.5 faithfulness (mean 0.28) — $13/mo paid for output your eval flagged as unsupported.

Save / mo

$13

10% on feature

fix.py

# Stop serving (and paying for) unsupported answers
peekr.instrument(
    evaluators=[peekr.eval.Hallucination()],
    guards=[peekr.guard.HallucinationBlock(min_score=0.6)],  # withhold + retry
)


Parallelism · openai.chat.completions.create · low effort

Parallelise "chat.completions.create" — 35.8× speedup available

45 "openai.chat.completions.create" calls ran sequentially in 1 trace. Total: 42.3s, slowest single call: 1.2s.

Save / batch

41.1s

97% faster

Sequential (now): 42.3s → Parallel (after fix): ~1.2s

45 calls × avg 940ms · saves 41.1s per batch

Each call waited for the previous one to finish. Running them concurrently (ThreadPoolExecutor in Python, Promise.all in JS) reduces wall-clock time from 42.3s to ~1.2s — a 35.8× speedup with no change to cost or output quality.

fix.py

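# Python: run LLM calls in parallel
from concurrent.futures import ThreadPoolExecutor

def process_all(items):
    # max_workers caps concurrency: with 8 workers, 45 calls run in ~6 waves.
    # Reaching the ~1.2s floor needs max_workers >= len(items), rate limits permitting.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(process_one, items))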
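
How close the fix gets to ~1.2s depends on the worker count. The sketch below adds a rough estimate alongside a self-contained version of the pool. `estimated_wall_clock` is a hypothetical helper that assumes every call takes the average time, and the lambda stands in for a real LLM call.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def estimated_wall_clock(n_calls, avg_seconds, max_workers):
    """Rough estimate: calls run in ceil(n / workers) waves of avg_seconds."""
    return math.ceil(n_calls / max_workers) * avg_seconds


def process_all(items, process_one, max_workers=8):
    # pool.map returns results in input order, whatever order calls finish in.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, items))


doubled = process_all(range(45), lambda n: n * 2)
```

With the figures above, `estimated_wall_clock(45, 0.94, 8)` comes to about 5.6s, while 45 workers brings the estimate down to a single wave, bounded in practice by the slowest call (1.2s) and the provider's rate limits.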


When things changed without you noticing

cost · chat_summary · 2026-05-18 14:00

chat_summary cost +38% vs 7-day baseline

Triggered when chat_summary defaulted back to claude-opus-4-7 on 2026-05-18. Volume held flat — the spike is purely model-mix.


↑ 38%

latency · 2026-05-19 13:18

tool.web_fetch p95 latency doubled

p95 jumped from 480ms to 980ms after the 13:00 deploy. Hit rate on the downstream proxy dropped — likely cache invalidation.

↑ 104%

quality · data_extraction · 2026-05-19 09:42

data_extraction hallucination rate up 11pp

Switched from claude-opus-4-7 to claude-sonnet-4-6 on the structured extraction prompt. Quality regressed; estimated $/correct-answer is actually higher.

↑ 11pp

Top spenders

User	Share	Calls	Top feature	Models used	Spend (24h)	Projected /mo
u_demo	4.3%	57	search_qa	2	$0.808	$24.25
u_heavy_19	3.5%	40	data_extraction	3	$0.652	$19.57
u_heavy_27	3.1%	15	chat_summary	1	$0.592	$17.77
u_heavy_37	2.6%	25	moderation	3	$0.493	$14.78
u_heavy_30	2.3%	14	support_bot	1	$0.430	$12.90
u_heavy_46	2.2%	9	code_assist	1	$0.421	$12.64
u_heavy_18	2.1%	16	search_qa	2	$0.399	$11.98
u_heavy_39	2.0%	19	code_assist	3	$0.369	$11.07
u_heavy_23	1.8%	14	search_qa	2	$0.342	$10.25
u_532	1.8%	7	chat_summary	1	$0.334	$10.02
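
The monthly column appears to follow the "last 24h × 30 days" projection stated at the top of the page. A minimal check, where `project_monthly` is a hypothetical helper and the one-to-two-cent gap against the table is assumed to come from the 24h figures being rounded for display:

```python
def project_monthly(spend_24h, days=30):
    """Project monthly spend from the last 24 hours of spend."""
    return spend_24h * days


top_user = project_monthly(0.808)  # u_demo's 24h spend from the table
```

This gives roughly $24.24 for u_demo against the $24.25 shown. A 24-hour window is a noisy basis for a monthly figure, so a single unusual day moves the projection a lot.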
