Last week our document processing agent started taking 279 seconds to ingest a batch of 15 files. No errors, no crashes — it just ran slowly and silently.
We opened Peekr, looked at one trace, and had the root cause in 30 seconds. Here's what we saw, why it was wrong, and how we fixed it.
What the trace showed
The waterfall view told the whole story immediately:
Trace — 279.4s total · 48.1k tokens · $0.012
openai.embeddings 2.87s ←── chunk 1 embedded
openai.chat.completions 17.92s ←── chunk 1 entities extracted
openai.embeddings 1.24s ←── chunk 2 embedded
openai.chat.completions 37.57s ←── chunk 2 entities extracted
openai.embeddings 790ms ←── chunk 3 embedded
openai.chat.completions 20.83s ←── chunk 3 entities extracted
...
(repeats 15 times)
The pattern was obvious once you saw it: embed → extract → embed → extract, over and over. Each pair waited for the previous one to finish before starting.
This is what sequential execution looks like in a trace — a staircase instead of a fan.
What it should have looked like
The embedding calls were fast (200ms–3s). The entity extraction LLM calls were slow (2–37s). With 15 documents, the pipeline was paying the full cost of every slow call back-to-back.
In a correctly parallelized pipeline, the trace should look like a fan — all the entity extraction calls starting at roughly the same time, finishing whenever they finish, with total wall-clock time equal to the slowest single call:
Theoretical parallel trace — ~37s total
openai.embeddings (all at once)
├── openai.chat.completions (doc 1) 17.92s
├── openai.chat.completions (doc 2) 37.57s ←── slowest
├── openai.chat.completions (doc 3) 20.83s
... (all overlapping)
279 seconds → 37 seconds. The 242 seconds of idle waiting time was wasted.
The actual bug
The extraction function had a for loop with no parallelism:
# Before — fully sequential
def extract_for_memories(tenant_id, memories):
for mem in memories:
extracted = extract_entities(mem["content"]) # LLM call — blocks
upsert_entities_and_links(tenant_id, mem["id"], extracted)
Every document waited for the previous one to finish. With doc 2 taking 37 seconds, docs 3–15 all sat idle.
This is one of the most common performance bugs in AI pipelines. The pattern looks fine in a code review — it's clean, it's readable, it works. You just can't see the cost until you have a trace in front of you.
The fix
Swap the sequential loop for a thread pool:
# After — parallel with bounded concurrency
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
def extract_for_memories(tenant_id, memories, max_workers=8):
results = {"entities": 0, "links": 0, "failed": 0}
lock = threading.Lock()
def _process_one(mem):
try:
extracted = extract_entities(mem["content"]) # LLM call
n_ent, n_link, _ = upsert_entities_and_links(
tenant_id, mem["id"], extracted
)
with lock:
results["entities"] += n_ent
results["links"] += n_link
except Exception:
with lock:
results["failed"] += 1
workers = min(max_workers, len(memories))
with ThreadPoolExecutor(max_workers=workers) as pool:
futures = [pool.submit(_process_one, mem) for mem in memories]
for f in as_completed(futures):
f.result()
return results
max_workers=8 means up to 8 LLM calls in flight at once. For a batch of 15 docs, all 15 start nearly simultaneously. The lock is there because Python's GIL doesn't protect dict mutations across threads.
Why not asyncio.gather?
The underlying extract_entities function makes a synchronous OpenAI client call, not an async one. You can't await a blocking call — you need to either rewrite it to use AsyncOpenAI or run it in a thread pool. The thread pool is the pragmatic choice when you don't want to touch the inner implementation.
If you're using AsyncOpenAI already, asyncio.gather with a semaphore is the cleaner option:
import asyncio
async def extract_for_memories_async(tenant_id, memories, concurrency=8):
sem = asyncio.Semaphore(concurrency)
async def _process_one(mem):
async with sem:
extracted = await extract_entities_async(mem["content"])
await upsert_entities_and_links_async(tenant_id, mem["id"], extracted)
await asyncio.gather(*[_process_one(mem) for mem in memories])
Same result — bounded concurrency, all calls overlapping.
The impact
| Metric | Before | After |
|---|---|---|
| 15-doc batch | 279.4s | ~37s (est.) |
| Speedup | 1× | 7.5× |
| Cost | $0.012 | unchanged |
| Output quality | same | same |
The cost didn't change — you're still making the same LLM calls, just at the same time instead of one after another. The output is identical. The only thing that changed is the user doesn't wait 4+ minutes for their documents to be processed.
How we spotted it
We didn't have an alert set up for "ingest takes >60 seconds." We didn't have a metric for "entity extraction latency per batch." We just looked at a trace.
Peekr's waterfall view showed us the staircase pattern in seconds — every extraction call starting only after the previous one ended. The diagnosis was visual, not numerical. Once you see a staircase in a trace that should be a fan, you know exactly what to do.
This is why trace-level visibility matters for AI agents. Cost metrics and error rates won't catch this — the pipeline was succeeding, on budget, with 1.00 faithfulness scores. The only signal was latency, and you can't diagnose latency from aggregates. You need to see the actual execution timeline.
If you're building an AI agent that processes batches of anything — documents, queries, rows — check your traces. The sequential loop hiding inside your parallel-looking architecture is probably there.
Peekr captures every LLM call with full trace context. Two lines of Python: peekr.starkspherelabs.com/observability