When your AI agent won't shut up
Two days ago I wrote about making a small LLM sell — getting a 30B model to 87–90% on 101 sales eval scenarios through progressive disclosure, signal-based skill selection, and embedded selling intelligence. The architecture works. The correctness is there.
Then I watched someone use it.
The customer walked up to the kiosk. They’d already completed a guided discovery flow — told us they’re switching from a competitor, watch local dramas and kids’ shows, want to save money. They tapped the chat and typed “Hi.”
The agent responded with five scrollable screens of text.
What happened
Here’s what the agent actually produced for a single “Hi”:
Hi there! 👋 Welcome — I’m your AI concierge. I see you’re looking to switch and want to save on your TV bill while still watching your favourite local dramas and kids’ shows. Let me show you the best fit for your family.
I see you’re looking to switch and want to save on your TV bill while still watching your favourite local dramas and kids’ shows. Based on what you told me, I’d recommend our mid-tier pack — it’s perfect for families who love local content…
It repeated the same sentence twice. Then it called show_product and rendered a beautiful product card — and narrated every feature on the card as bullet-point text below it. The card already shows the pricing. The card already lists the channels. The agent regurgitated all of it in prose.
Then, without the customer saying anything, it triggered a live demo, pulled up a second product, ran a comparison between the two products, and finally asked if the customer had any questions.
Six tool calls. One customer message. Zero pauses.
Here’s what should have happened:
Hey! Switching over — let me show you the best fit for your family 😊
Two tool calls (catalog lookup, product card), quick reply chips (“Tell me more” / “What about kids shows?” / “Compare plans”), done. Fifteen words of text. Let the card do the selling. Let the customer drive.
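To make the target concrete, here is roughly the shape of turn I want the runtime to produce. The structure and names (AgentTurn, quickReplies, catalog_lookup) are illustrative, not the kiosk’s real message schema:

```ts
// Illustrative target shape for the first turn. Field and tool names are
// hypothetical; show_product is the real card tool, catalog_lookup is assumed.
interface AgentTurn {
  text: string;             // the typed reply, kept short
  toolCalls: string[];      // tools invoked this turn
  quickReplies: string[];   // chips the customer can tap to drive the next turn
}

const idealFirstTurn: AgentTurn = {
  text: "Hey! Switching over — let me show you the best fit for your family 😊",
  toolCalls: ["catalog_lookup", "show_product"],
  quickReplies: ["Tell me more", "What about kids shows?", "Compare plans"],
};
```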
Correct but unusable
I’d built 101 eval scenarios that test whether the agent calls the right tools, knows the products, handles objections, reads customer archetypes, and stays in scope. It passes 91.5% of those. The architecture is sound.
But I’d never tested whether the agent shuts up.
So I built 15 new scenarios that test conversational quality: response length, tool-call pacing, question-ending rate, and self-restraint. Things like “when a customer says ‘Hi’ with discovery context, respond in under 80 words” and “don’t trigger a demo unless the customer asks for one.”
The agent passes 10% of them.
91.5% correctness. 10% quality. The agent knows the right answer every time — it just buries it in a wall of text, five unrequested actions, and a second mention of everything it already said.
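For context, a quality scenario is little more than a handful of assertions over a single turn. A minimal sketch of the shape, with hypothetical field and tool names rather than the eval runner’s real schema:

```ts
// Hypothetical shape of a conversational-quality scenario. Field and tool
// names are illustrative; the real eval runner's schema may differ.
interface QualityScenario {
  name: string;
  userMessage: string;
  context?: string;                      // e.g. discovery answers already collected
  maxWords?: number;                     // brevity check
  maxToolCalls?: number;                 // pacing check
  forbiddenTools?: string[];             // self-restraint check
  mustEndWithQuestionOrChips?: boolean;  // hand the turn back to the customer
}

const greetingWithContext: QualityScenario = {
  name: "greeting-with-discovery-context",
  userMessage: "Hi",
  context: "switching from competitor; local dramas; kids shows; wants to save",
  maxWords: 80,
  maxToolCalls: 2,
  forbiddenTools: ["start_demo", "compare_products"],
  mustEndWithQuestionOrChips: true,
};
```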
The root cause: instruction saturation
The system prompt is 54,000 characters across 15 skill files — about 9,700 tokens. The model has an 18,432-token context window. Half the context is just instructions.
And this is a Mixture of Experts model with only 3B active parameters per token. It has the knowledge of a 30B model but the attention capacity of a 3B one. When you load it with 30 simultaneous constraints — be warm, be brief, sell outcomes, handle objections, read archetypes, use quick replies, never trigger demos unprompted, always end with a question — it tries to demonstrate all of them at once. The instruction that says “be brief” competes for attention with 29 other instructions. Brevity loses.
I have prior evidence for this. In an earlier tuning pass, I tried making the prompt more specific — adding explicit trigger word lists for scope boundaries. Pass rate dropped from 87.1% to 80.1%. Over-specifying makes things worse. The model tries to process the specificity and loses the forest for the trees.
The hypothesis: fewer, stronger instructions produce better responses than comprehensive detailed rules.
Measuring the problem
Before fixing anything, I need to understand the shape of the problem. What I’m running is essentially a strip study — removing prompt layers one at a time and measuring what each layer contributes.
I designed 10 prompt levels, from bare minimum to full prompt:
| Level | What’s loaded | Prompt size |
|---|---|---|
| 0 | Just the preamble and tool definitions | 10.7K chars |
| 1 | + Identity and personality | 11.6K chars |
| 2 | + Safety guardrails | 14.7K chars |
| 3 | + Conversation structure | 21.8K chars |
| 4 | + Customer reading | 27.6K chars |
| 5 | + Product knowledge | 41.4K chars |
| 6 | + Selling skills | 45.7K chars |
| 7 | + Interaction widgets | 50.5K chars |
| 8 | + Advanced selling | 54.2K chars |
| 9 | Full prompt (all skills) | 54.2K chars |
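The levels themselves are just cumulative concatenation: each level is the previous one plus the next group of skill files. A rough sketch of how I assemble them, with illustrative file names standing in for the real 15 skill files:

```ts
import { readFileSync } from "node:fs";

// Cumulative prompt levels: level N = preamble + tool defs + skill groups 0..N-1.
// Group names are illustrative; they mirror the table above, not real file paths.
const skillGroups = [
  ["identity.md", "personality.md"],   // Level 1
  ["safety.md"],                       // Level 2
  ["conversation-structure.md"],       // Level 3
  ["customer-reading.md"],             // Level 4
  ["product-knowledge.md"],            // Level 5
  ["selling-skills.md"],               // Level 6
  ["widgets.md"],                      // Level 7
  ["advanced-selling.md"],             // Level 8
];

function buildPrompt(level: number, preamble: string): string {
  const parts = [preamble];
  for (const group of skillGroups.slice(0, level)) {
    for (const file of group) parts.push(readFileSync(`skills/${file}`, "utf8"));
  }
  return parts.join("\n\n");
}
```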
Each level runs through all 144 eval scenarios twice. The questions I need answered:
- At what level does quality start degrading?
- At what level does correctness reach “good enough”?
- Which specific skill additions cause the biggest verbosity increase?
- Is there a level where quality is better than the full prompt?
- Does the model produce short, paced responses naturally when not overloaded?
If the model is naturally brief at Level 3 but verbose at Level 5, then the product knowledge skills are the problem — not the model. If it’s verbose even at Level 0, the model itself tends toward verbosity and the fix is different (physical token limits, not prompt changes).
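The structural metrics behind those questions are deliberately simple: word count, tool calls per turn, and whether the turn hands control back to the customer. Roughly this, assuming a hypothetical transcript shape:

```ts
// Hypothetical transcript turn produced by the eval runner.
interface Turn {
  text: string;
  toolCalls: string[];
  quickReplies: string[];
}

// Structural quality metrics for a single agent turn.
function measureTurn(turn: Turn) {
  const words = turn.text.trim().split(/\s+/).filter(Boolean).length;
  return {
    words,
    toolCalls: turn.toolCalls.length,
    endsWithQuestion: /\?\s*$/.test(turn.text.trim()),
    offersQuickReplies: turn.quickReplies.length > 0,
  };
}
```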
Beyond the prompt: other levers
The prompt is the obvious variable but not the only one. Response quality is multi-dimensional:
Physical token cap. The agent’s maxTokens is 1024. If I drop it to 256 (~200 words), the model physically cannot produce a wall of text. The question is whether it degrades gracefully (same quality, just shorter) or breaks (truncated mid-sentence, incomplete tool calls).
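The cap itself is a one-line config change; the interesting part is detecting whether a capped response ended cleanly. A sketch of that check, assuming an OpenAI-style "length" finish reason rather than any specific SDK:

```ts
// Hypothetical generation config, following the agent's existing maxTokens naming.
const generationConfig = { maxTokens: 256 };

// Heuristic graceful-degradation check: did the model stop on its own, or did
// the cap cut it off mid-sentence?
function degradedGracefully(finishReason: string, text: string): boolean {
  const hitCap = finishReason === "length";       // common SDK convention, assumed here
  const endsCleanly = /[.!?)"']\s*$/.test(text.trim());
  return !hitCap || endsCleanly;
}
```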
Tool response content. When the agent retrieves a product, the response includes not just specs but selling intelligence — value story, comparison math, cross-sell suggestions, common questions. The model reads all of this and tries to convey everything at once. If I strip the selling intelligence from the initial response and put it behind a follow-up tool, the model has less to regurgitate.
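A sketch of that split, with hypothetical payload and tool names: the initial show_product response keeps only what the card renders, and the selling intelligence moves behind a follow-up tool the model has to ask for:

```ts
// Hypothetical payload shapes. The idea: the first tool response carries only
// what the card renders; the selling intelligence sits behind a second tool.
interface ProductCard {
  id: string;
  name: string;
  price: string;
  channels: string[];
}

interface SellingIntelligence {
  valueStory: string;
  comparisonMath: string;
  crossSell: string[];
  commonQuestions: string[];
}

function slimToolResponse(full: ProductCard & SellingIntelligence): ProductCard {
  const { id, name, price, channels } = full;
  return { id, name, price, channels };
  // valueStory, comparisonMath, crossSell and commonQuestions now live behind a
  // separate follow-up tool (e.g. a hypothetical get_selling_points), called
  // only when the conversation actually needs them.
}
```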
UI-driven first turn. The guided discovery flow already maps to a specific product recommendation. What if the kiosk shows the product card before the chat opens? Then the agent’s first message is just “What do you think? 😊” instead of doing the full catalog lookup → product card → narrate everything dance.
Each of these is a different kind of fix. The token cap is a config change. The tool response redesign is code. The UI change is a product decision. But they all address the same underlying problem: the agent has too much to say and no reason to stop.
The measurement gap
Word counts and tool-call counts catch structural problems. But they miss what actually matters: does this response feel right?
A 40-word response that’s cold and robotic passes the brevity check but fails as a sales interaction. A 60-word response that’s warm and perfectly pitched might “fail” a strict word limit but be exactly what the customer needs.
I’m adding an LLM-as-judge layer — a frontier model scoring each response on five dimensions: brevity, pacing, warmth, sales effectiveness, and tool usage. It’s too slow to run on every iteration (~24 minutes for 144 scenarios), but as a periodic checkpoint it catches the things no regex can measure. Did the agent sound like a friend who works at the store, or a chatbot reading from a script? Did it create desire without pressure? Did it let the product card do the heavy lifting instead of narrating it?
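A minimal sketch of the judge call, where callJudgeModel stands in for whatever frontier-model client ends up in the pipeline and the rubric wording is illustrative:

```ts
// Judge scores on the five quality dimensions, 1–5 each.
interface JudgeScores {
  brevity: number;
  pacing: number;
  warmth: number;
  salesEffectiveness: number;
  toolUsage: number;
}

async function judgeResponse(
  customerMessage: string,
  agentResponse: string,
  callJudgeModel: (prompt: string) => Promise<string>, // assumed frontier-model client
): Promise<JudgeScores> {
  const prompt = [
    "You are scoring a retail kiosk sales agent's reply.",
    "Score 1-5 on: brevity, pacing, warmth, salesEffectiveness, toolUsage.",
    "A great reply sounds like a friend who works at the store, lets the",
    "product card do the selling, and hands the turn back to the customer.",
    `Customer: ${customerMessage}`,
    `Agent: ${agentResponse}`,
    "Reply with JSON only.",
  ].join("\n");
  return JSON.parse(await callJudgeModel(prompt)) as JudgeScores;
}
```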
The judge establishes a quality bar that heuristic metrics can’t.
The research plan
This isn’t a prompt tweak. It’s a structured research project with adaptive phases:
Phase 1 (done): Build measurement infrastructure. Quality eval scenarios, aggregate metrics, strip study tooling, baseline measurement.
Phase 2 (in progress): LLM-as-judge infrastructure. JSON export from eval runner, judge rubric design, scoring pipeline, baseline judge scores.
Phase 3: Run the strip study across all prompt levels and token cap experiments. Measure everything. Find the inflection points. Which variables have the biggest effect?
Phase 4: Reconstruct. The data from Phase 3 tells me what the new prompt should look like. Maybe it’s 5 skills instead of 15. Maybe it’s 5 skills plus maxTokens=256. Maybe it’s a completely different prompt structure — examples instead of rules, personality-driven instead of constraint-driven.
Phase 5: Polish via autonomous optimization loop, using the reconstructed prompt as the new baseline.
Each phase’s findings rewrite the next phase. This is research, not a build — the plan adapts based on what the data says.
The uncomfortable truth
I spent weeks building a sophisticated knowledge architecture. Progressive disclosure. Signal-based skill selection. Embedded selling intelligence. It all works — the agent knows the right things at the right time.
But knowledge isn’t enough if the delivery is wrong.
A great salesperson in a store doesn’t walk up to a customer and recite everything they know about the product the customer might want. They say something warm, show the product, and wait. The waiting is the hard part. The restraint is the skill.
My agent has all the knowledge and none of the restraint. Teaching it to be brief might mean removing most of what I built. The strip study will tell me how much.
The bet is the same one I started with: a small model, with the right architecture, can be good at this. But “right architecture” might mean less than I thought — not progressive disclosure of more knowledge, but aggressive curation of less.
Sometimes the best prompt is a shorter one.