Last week our document processing agent started taking 279 seconds to ingest a batch of 15 files. No errors, no crashes — it just ran slowly and silently.

We opened Peekr, looked at one trace, and had the root cause in 30 seconds. Here's what we saw, why it was wrong, and how we fixed it.

What the trace showed

The waterfall view told the whole story immediately:

Trace — 279.4s total · 48.1k tokens · $0.012

  openai.embeddings      2.87s   ←── chunk 1 embedded
  openai.chat.completions  17.92s  ←── chunk 1 entities extracted
  openai.embeddings      1.24s   ←── chunk 2 embedded
  openai.chat.completions  37.57s  ←── chunk 2 entities extracted
  openai.embeddings      790ms   ←── chunk 3 embedded
  openai.chat.completions  20.83s  ←── chunk 3 entities extracted
  ...
  (repeats 15 times)

The pattern was obvious once you saw it: embed → extract → embed → extract, over and over. Each pair waited for the previous one to finish before starting.

This is what sequential execution looks like in a trace — a staircase instead of a fan.

What it should have looked like

The embedding calls were fast (200ms–3s). The entity extraction LLM calls were slow (2–37s). With 15 documents, the pipeline was paying the full cost of every slow call back-to-back.

In a correctly parallelized pipeline, the trace should look like a fan — all the entity extraction calls starting at roughly the same time, finishing whenever they finish, with total wall-clock time equal to the slowest single call:

Theoretical parallel trace — ~37s total

  openai.embeddings   (all at once)
  ├── openai.chat.completions  (doc 1)   17.92s
  ├── openai.chat.completions  (doc 2)   37.57s  ←── slowest
  ├── openai.chat.completions  (doc 3)   20.83s
  ...  (all overlapping)

279 seconds → 37 seconds. The 242 seconds of idle waiting time was wasted.

The actual bug

The extraction function had a for loop with no parallelism:

# Before — fully sequential
def extract_for_memories(tenant_id, memories):
    for mem in memories:
        extracted = extract_entities(mem["content"])  # LLM call — blocks
        upsert_entities_and_links(tenant_id, mem["id"], extracted)

Every document waited for the previous one to finish. With doc 2 taking 37 seconds, docs 3–15 all sat idle.

This is one of the most common performance bugs in AI pipelines. The pattern looks fine in a code review — it's clean, it's readable, it works. You just can't see the cost until you have a trace in front of you.

The fix

Swap the sequential loop for a thread pool:

# After — parallel with bounded concurrency
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

def extract_for_memories(tenant_id, memories, max_workers=8):
    results = {"entities": 0, "links": 0, "failed": 0}
    lock = threading.Lock()

    def _process_one(mem):
        try:
            extracted = extract_entities(mem["content"])  # LLM call
            n_ent, n_link, _ = upsert_entities_and_links(
                tenant_id, mem["id"], extracted
            )
            with lock:
                results["entities"] += n_ent
                results["links"] += n_link
        except Exception:
            with lock:
                results["failed"] += 1

    workers = min(max_workers, len(memories))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(_process_one, mem) for mem in memories]
        for f in as_completed(futures):
            f.result()

    return results

max_workers=8 means up to 8 LLM calls in flight at once. For a batch of 15 docs, all 15 start nearly simultaneously. The lock is there because Python's GIL doesn't protect dict mutations across threads.

Why not `asyncio.gather`?

The underlying extract_entities function makes a synchronous OpenAI client call, not an async one. You can't await a blocking call — you need to either rewrite it to use AsyncOpenAI or run it in a thread pool. The thread pool is the pragmatic choice when you don't want to touch the inner implementation.

If you're using AsyncOpenAI already, asyncio.gather with a semaphore is the cleaner option:

import asyncio

async def extract_for_memories_async(tenant_id, memories, concurrency=8):
    sem = asyncio.Semaphore(concurrency)

    async def _process_one(mem):
        async with sem:
            extracted = await extract_entities_async(mem["content"])
            await upsert_entities_and_links_async(tenant_id, mem["id"], extracted)

    await asyncio.gather(*[_process_one(mem) for mem in memories])

Same result — bounded concurrency, all calls overlapping.

The impact

Metric	Before	After
15-doc batch	279.4s	~37s (est.)
Speedup	1×	7.5×
Cost	$0.012	unchanged
Output quality	same	same

The cost didn't change — you're still making the same LLM calls, just at the same time instead of one after another. The output is identical. The only thing that changed is the user doesn't wait 4+ minutes for their documents to be processed.

How we spotted it

We didn't have an alert set up for "ingest takes >60 seconds." We didn't have a metric for "entity extraction latency per batch." We just looked at a trace.

Peekr's waterfall view showed us the staircase pattern in seconds — every extraction call starting only after the previous one ended. The diagnosis was visual, not numerical. Once you see a staircase in a trace that should be a fan, you know exactly what to do.

This is why trace-level visibility matters for AI agents. Cost metrics and error rates won't catch this — the pipeline was succeeding, on budget, with 1.00 faithfulness scores. The only signal was latency, and you can't diagnose latency from aggregates. You need to see the actual execution timeline.

If you're building an AI agent that processes batches of anything — documents, queries, rows — check your traces. The sequential loop hiding inside your parallel-looking architecture is probably there.

Peekr captures every LLM call with full trace context. Two lines of Python: peekr.starkspherelabs.com/observability

We Found a 279-Second Bug by Reading a Trace

What the trace showed

What it should have looked like

The actual bug

The fix

Why not `asyncio.gather`?

The impact

How we spotted it

Start observing your AI agents in two lines of code.

One Answer Made 122 LLM Calls. Here's How We Caught It.

Zero-Latency LLM Evaluation: Run Hallucination Scoring Without Blocking Your Users

We Found a 279-Second Bug by Reading a Trace

What the trace showed

What it should have looked like

The actual bug

The fix

Why not asyncio.gather?

The impact

How we spotted it

Start observing your AI agents in two lines of code.

One Answer Made 122 LLM Calls. Here's How We Caught It.

Zero-Latency LLM Evaluation: Run Hallucination Scoring Without Blocking Your Users

Why not `asyncio.gather`?