When AI agents feel slow, developers instinctively blame the LLM call. In many production systems, though, the model itself accounts for a minority of end-to-end latency, often less than 30%. The real culprits are tool execution, retrieval pipelines, sequential chains that could run in parallel, and the overhead of passing enormous context windows between steps.
This post walks through how to actually measure where your agent's time goes, what patterns reliably cause slowdowns, and how to fix them.
Why Developers Blame the LLM (and Why That's Wrong)
The LLM call is the most visible part of an agent. It's the part you consciously wait for, the part with spinner animations, and the part that costs money per token. So it feels like the bottleneck.
But a typical ReAct-style agent doing something like "research a company and write a summary" might break down like this:
| Step | Time |
|---|---|
| Initial LLM call (reasoning) | 1.2s |
| Web search tool (3 searches) | 4.1s |
| Embedding + vector retrieval | 1.8s |
| LLM call (synthesis) | 1.4s |
| Total | 8.5s |
The two LLM calls together: 2.6 seconds. Tools and retrieval: 5.9 seconds. If you optimize your prompt to cut LLM time by 40%, you save about one second. If you parallelize the web searches, you might save three.
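The arithmetic behind that comparison can be checked in a few lines. This is a sketch using the illustrative numbers from the table. The assumption that three concurrent searches cost about as much as one search is mine, and it is optimistic, since in practice the slowest search sets the floor.

```python
# Step timings from the example table, in seconds.
steps = {
    "llm_reasoning": 1.2,
    "web_search_x3": 4.1,
    "retrieval": 1.8,
    "llm_synthesis": 1.4,
}

total = sum(steps.values())
llm_time = steps["llm_reasoning"] + steps["llm_synthesis"]

# Option A: a prompt optimization that cuts LLM time by 40%.
saved_by_prompt_work = 0.40 * llm_time

# Option B: run the three searches concurrently. Assumption: the searches
# take roughly equal time, so the parallel batch costs about one search.
saved_by_parallel_search = steps["web_search_x3"] * (1 - 1 / 3)

print(f"total: {total:.1f}s, LLM share: {llm_time / total:.0%}")
print(f"prompt optimization saves ~{saved_by_prompt_work:.1f}s")
print(f"parallel searches save ~{saved_by_parallel_search:.1f}s")
```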
Step 1: Actually Measure What's Happening
You cannot debug latency you haven't measured. The first thing to do is instrument your agent so every step is timed independently.
Here's a minimal manual approach using a context manager:
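```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped step raises, so failed calls are timed too.
        elapsed = time.perf_counter() - start
        print(f"[TIMING] {label}: {elapsed:.3f}s")


def _do_search(query: str) -> str:
    # Placeholder: swap in your real search implementation.
    return f"results for {query}"


# Usage in a tool function
def search_web(query: str) -> str:
    with timed(f"web_search:{query[:30]}"):
        return _do_search(query)
```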
This is fine for local debugging but falls apart in production — you lose cross-step context, there's no aggregation, and you can't see how timings vary across different runs or users.
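If you stay with manual timing for a while, a small step up is to collect durations per label instead of printing them, so repeated steps can be summarized. The sketch below is one way to do that. `TimingRecorder` is a hypothetical helper name, the `time.sleep` calls stand in for real work, and it still only aggregates within a single process.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class TimingRecorder:
    """Collects durations per label so repeated steps can be summarized."""

    def __init__(self):
        self.durations = defaultdict(list)

    @contextmanager
    def timed(self, label: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even when the wrapped step raises.
            self.durations[label].append(time.perf_counter() - start)

    def summary(self) -> dict:
        # Per-label call count, total seconds, and mean seconds.
        return {
            label: {"count": len(d), "total": sum(d), "mean": mean(d)}
            for label, d in self.durations.items()
        }


recorder = TimingRecorder()
for _ in range(3):
    with recorder.timed("web_search"):
        time.sleep(0.01)  # stand-in for a real tool call
with recorder.timed("llm_call"):
    time.sleep(0.02)  # stand-in for a real model call

for label, stats in recorder.summary().items():
    print(f"{label}: n={stats['count']} total={stats['total']:.3f}s mean={stats['mean']:.3f}s")
```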
The better approach is to use Peekr, which auto-instruments every LLM call, tool invocation, and chain step without you having to add timing code everywhere:
```python
import peekr
import openai

peekr.init(api_key="your-peekr-key")  # That's it

client = openai.OpenAI()

# Every call is now traced with latency breakdowns
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Research Stripe and summarize their products"}],
)
```
After adding those two lines, Peekr captures wall-clock time for each LLM call, token counts, and — when you're using LangChain, LlamaIndex, or CrewAI — the full span tree of every tool call and retrieval step. You can immediately see which spans are eating your budget.
Step 2: Identify the Common Culprits
Once you have real data, you'll almost always find one of these four patterns.
Sequential Tool Calls That Could Be Parallel
Agents built with simple loops will call tools one at a time, even when there's no dependency between them. A ReAct agent searching three different databases will do it like this:
search_db_a → wait → search_db_b → wait → search_db_c → wait
When the searches are independent, this is pure waste. Fix it with asyncio.gather:
```python
import asyncio
import openai


async def search_async(client, query: str) -> str:
    # Your async search implementation
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Search for: {query}"}],
    )
    return response.choices[0].message.content


async def parallel_research(queries: list[str]) -> list[str]:
    client = openai.AsyncOpenAI()
    # All three searches fire simultaneously
    results = await asyncio.gather(
        *[search_async(client, q) for q in queries]
    )
    return list(results)


# Before: ~6s sequential. After: ~2s parallel.
results = asyncio.run(parallel_research([
    "Stripe payment products",
    "Stripe competitors",
    "Stripe recent news",
]))
```
This pattern alone — parallelizing independent tool calls — is the highest-leverage optimization in most agent architectures.
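For codebases that are still synchronous, the same idea can be sketched with a thread pool from the standard library. Threads work here because I/O-bound tools spend most of their time waiting on the network. `search_sync` is a hypothetical blocking tool, and the `time.sleep` call simulates its latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def search_sync(query: str) -> str:
    # Hypothetical blocking tool; sleep simulates network I/O.
    time.sleep(0.1)
    return f"results for {query}"


def parallel_research_sync(queries: list[str]) -> list[str]:
    # One worker per query; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        return list(pool.map(search_sync, queries))


queries = ["Stripe payment products", "Stripe competitors", "Stripe recent news"]
start = time.perf_counter()
results = parallel_research_sync(queries)
elapsed = time.perf_counter() - start
print(f"{len(results)} searches in {elapsed:.2f}s")
```

Run sequentially, the three simulated calls would take about three times as long as a single call. With the pool, the batch takes roughly as long as the slowest one.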
Retrieval Pipelines Doing Too Much Work
Vector search is fast. What's slow is everything around it: chunking documents at query time, re-embedding content that hasn't changed, fetching entire documents when you only need a paragraph, or running retrieval with top_k=20 when the LLM can only usefully consume five chunks anyway.
Check your retrieval span times in your traces. If they're consistently above 500ms, look at:
- Whether you're re-embedding documents at query time (you should embed and index them offline)
- Whether your top_k is higher than necessary
- Whether you're fetching full documents instead of chunk-level results
- Whether your vector store is colocated with your compute (network latency compounds fast)
Bloated Context Windows Slowing Everything Down
A larger context means slower prefill, more expensive inference, and slower output generation. Agents that accumulate full conversation history or stuff entire retrieved documents into every prompt suffer from this.
A quick diagnostic: log prompt_tokens for every call and graph it over a multi-step agent run. If you see a linear increase in tokens per step, your context management is probably naive. Solutions include summarizing earlier turns, using a sliding window of recent messages, and only passing relevant retrieved chunks rather than full documents.
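A sliding window over recent messages might look like the sketch below. `trim_history` is a hypothetical helper that assumes chat-style message dicts with a `role` key and strictly alternating user and assistant turns. Histories that contain tool-call messages would need extra care so that calls and their results stay paired.

```python
def trim_history(messages: list[dict], max_exchanges: int = 4) -> list[dict]:
    """Keep system messages plus the last `max_exchanges` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if max_exchanges <= 0:
        return system
    # Two messages per exchange: one user turn and one assistant turn.
    return system + dialogue[-2 * max_exchanges:]


history = [{"role": "system", "content": "You are a research agent."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_exchanges=4)
print(len(history), "->", len(trimmed))
```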
Cold Starts on Tool Infrastructure
If a tool calls a Lambda function, a containerized microservice, or a database with connection pooling limits, the first call in a session can be dramatically slower than subsequent calls. This is easy to misdiagnose as "the LLM is slow" because it happens early in the agent run. Compare your first-tool-call timing against subsequent calls in your traces: a 3-5x difference points to a cold start problem, not a model problem.
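That comparison is easy to automate once you have per-call timings. The sketch below assumes a list of durations for one tool within one session. The numbers are made up for illustration, and the threshold of 3 simply mirrors the rule of thumb above.

```python
from statistics import median


def cold_start_ratio(durations: list[float]) -> float:
    """Ratio of the first call's duration to the median of the later calls."""
    if len(durations) < 2:
        raise ValueError("need at least two calls to compare")
    return durations[0] / median(durations[1:])


# Hypothetical per-call timings (seconds) for one tool within a session.
tool_timings = [1.90, 0.42, 0.38, 0.45, 0.40]
ratio = cold_start_ratio(tool_timings)
print(f"first call is {ratio:.1f}x the warm median")
if ratio >= 3:
    print("worth investigating as a cold start")
```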
Step 3: Use Model Selection Strategically
Once you've addressed tool and retrieval latency, then model selection becomes worth optimizing. This doesn't mean "use a smaller model everywhere" — it means using the right model for each step.
Most agents have a mix of:
- Reasoning steps: require a capable model (GPT-4o, Claude Sonnet)
- Parsing/extraction steps: work fine with a fast small model (GPT-4o-mini, Haiku)
- Classification/routing steps: can often use a tiny model or even regex
An agent that uses GPT-4o for every step including "extract the JSON from this text" is leaving significant latency on the table. Map your steps, figure out which ones are computationally simple, and route those to faster/cheaper models.
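A routing table can be as small as a dictionary lookup. The sketch below is illustrative only: the step names and the model assigned to each one are assumptions, and the fallback deliberately picks the more capable model so that an unmapped step degrades in speed rather than in quality.

```python
# Hypothetical mapping from step type to model; adjust to your own providers.
MODEL_FOR_STEP = {
    "reasoning": "gpt-4o",
    "extraction": "gpt-4o-mini",
    "classification": "gpt-4o-mini",
}
DEFAULT_MODEL = "gpt-4o"


def pick_model(step_type: str) -> str:
    # Unknown step types fall back to the capable model instead of failing.
    return MODEL_FOR_STEP.get(step_type, DEFAULT_MODEL)


for step in ["reasoning", "extraction", "summarize"]:
    print(f"{step}: {pick_model(step)}")
```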
Quick Wins: What to Do Right Now
If you want to cut your agent's latency today without a major architectural overhaul:
1. Add instrumentation first. You cannot prioritize fixes without data. Add Peekr or at minimum wrap your key functions with timing logs. Do this before changing anything else.
2. Find your longest sequential tool chain and parallelize it. Use asyncio.gather for independent calls. Even in synchronous code, you can use ThreadPoolExecutor for I/O-bound tools. This is typically your biggest single win.
3. Audit your top_k in retrieval. Drop it from 20 to 5-8 and measure. In most cases, precision matters more than recall for LLM context, and you'll see a meaningful latency drop.
4. Cap your context window growth. If you're running a multi-step agent, implement a simple message summarization step every N turns or limit history to the last 4-6 exchanges. Log prompt_tokens per step to confirm you've fixed the growth curve.
5. Profile one real production trace end-to-end. Not a synthetic test — a real user request. Look at wall-clock time per span. The bottleneck will be obvious once you see it laid out in a flame graph.
6. Move embedding to offline pipelines. If any of your retrieval involves embedding documents or chunks at query time, shift that work to ingestion time. The query itself still has to be embedded per request, but re-embedding stored content at query time is almost always unnecessary and adds 50-200ms per chunk.
The Mental Model Shift
The LLM is a fast, stateless function. It takes tokens in, produces tokens out, and the providers have optimized inference heavily. What's slow is everything you've built around it: the tools that touch external systems, the retrieval pipelines moving data around, the sequential loops that don't take advantage of parallelism, and the context management that grows unbounded.
When your agent feels slow, the debugging order should be: measure spans → identify the longest non-LLM span → fix that → repeat. Most teams that go through this process end up cutting latency by 50-70% before they ever need to think about switching LLM providers or paying for faster inference tiers.
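The "identify the longest non-LLM span" step can be sketched as a small helper over exported span data. The tuple layout and the span names here are assumptions for illustration, not the format of any particular tracing tool.

```python
from collections import defaultdict

# Hypothetical flat span records: (name, kind, seconds).
spans = [
    ("plan", "llm", 1.2),
    ("web_search", "tool", 4.1),
    ("vector_retrieval", "retrieval", 1.8),
    ("synthesize", "llm", 1.4),
]


def slowest_non_llm(spans):
    """Return (name, total_seconds) for the largest non-LLM contributor, or None."""
    totals = defaultdict(float)
    for name, kind, seconds in spans:
        if kind != "llm":
            totals[name] += seconds
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


print(slowest_non_llm(spans))
```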
The LLM is rarely your problem. The code you wrote around it usually is.
If you want the span visualization and latency breakdown without wiring up your own tracing infrastructure, the Peekr docs show how to get a full agent trace running in under five minutes.