Making a small LLM sell
A month ago I wrote about the hard part of AI sales agents — that the plumbing works but the agent can’t sell. In that post I outlined an approach: structured research, playbooks, skills, evals.
I built it. Here’s what I learned.
The result
87–90% pass rate across 101 sales evaluation scenarios, running on a 30B parameter model on local hardware. Not a frontier model. Not a cloud API. A small model with a tight prompt budget, and it handles objections, reads customer types, knows when to shut up and escalate, and actually sells outcomes instead of reciting specs.
The remaining 10–13% are rotating LLM non-determinism — different scenarios fail each run, no systematic failures left. The architecture is done. The ceiling is the model.
What didn’t work: dumping knowledge into the prompt
My first instinct was to write a massive system prompt. Everything the agent needs to know about products, pricing, competitive positioning, objection handling, customer types — just put it all in there.
This fails for two reasons. First, a 30B model can’t absorb a 50-page brief and improvise. It loses the thread. Instructions at the top of the prompt get diluted by instructions at the bottom. Second, most knowledge is irrelevant most of the time. A customer asking about home security doesn’t need the agent loaded with TV channel comparisons.
The system prompt is precious real estate. Every line competes with every other line for the model’s attention. The question isn’t “what does the agent need to know?” — it’s “what does the agent need to know right now?”
What worked: three layers of progressive disclosure
The architecture that worked has three layers, each loaded at a different time:
Layer 1 — Instincts (always loaded, ~400 lines). How to behave, not what to know. Selling principles like “sell outcomes, not specs.” Conversation flow patterns. The four-beat objection handling rhythm: acknowledge, bridge, evidence, empower. Rules about when to escalate. This is the agent’s personality and judgment — it shapes every response.
Layer 2 — Knowledge (loaded on demand via tools). When the agent calls a tool to show a product, the response includes not just specs and pricing, but selling intelligence: a value story, comparison math against competitors, the ideal customer for this product, common questions, and cross-sell suggestions. The agent retrieves this only when it needs it — for the specific product the conversation has reached.
Layer 3 — Context (injected per session). Who is this customer? Returning visitor? What did they look at last time? What archetype are they? This layer tells the agent which angle to use. The same product gets pitched differently to a price-sensitive family versus a time-pressed professional.
The insight is that knowledge has a shelf life measured in conversational turns. Product specs are relevant for one turn — the turn where you’re discussing that product. Selling instincts are relevant for every turn. Customer context is relevant for the whole session but different per session. The architecture matches knowledge to its natural lifespan.
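Here’s a rough sketch of what per-turn assembly from the three layers could look like in code. The names and structure are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class SessionContext:
    """Layer 3: injected once per session."""
    archetype: str                                   # e.g. "price-sensitive family"
    returning_visitor: bool
    last_viewed: list[str] = field(default_factory=list)


def build_prompt(instincts: str, session: SessionContext, knowledge: str | None) -> str:
    """Assemble the prompt for one turn.

    instincts -- Layer 1, always present (~400 lines of behavioural rules)
    session   -- Layer 3, constant for the session, different per session
    knowledge -- Layer 2, present only when a tool call has fetched data
                 for the product the conversation has reached
    """
    parts = [instincts]
    parts.append(
        f"Customer: {session.archetype}, "
        f"{'returning visitor' if session.returning_visitor else 'first visit'}, "
        f"last viewed: {', '.join(session.last_viewed) or 'nothing'}"
    )
    if knowledge:                  # shelf life of one turn
        parts.append(knowledge)
    return "\n\n".join(parts)
```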
Signal-based skill selection
The system prompt isn’t static. It’s assembled per turn from 13 modular skill files, and a signal detector picks 5–10 of them based on what’s happening in the conversation.
Seven signal types drive selection: product interest, objection detected, comparison shopping, closing opportunity, small talk, complaint, and scope boundary. Each signal maps to specific skills. If the customer is comparing competitors, load the competitive response skill. If they’re raising an objection, load the objection handling skill. If they’re angry about a service issue, load only the escalation skill — don’t load any selling skills.
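A minimal sketch of that mapping, with made-up patterns and skill names standing in for the real ones (the actual detector can be fancier than regex; the routing logic is what matters):

```python
import re

# Each signal maps to the skill files loaded when it fires.
SIGNAL_SKILLS = {
    "product_interest": ["product_pitch", "value_story"],
    "objection":        ["objection_handling"],
    "comparison":       ["competitive_response", "comparison_math"],
    "closing":          ["closing", "cross_sell"],
    "small_talk":       ["rapport"],
    "complaint":        ["escalation"],        # no selling skills at all
    "scope_boundary":   ["escalation"],
}

# Example patterns only; the real system has one detector per signal type.
SIGNAL_PATTERNS = {
    "objection":  re.compile(r"\b(too expensive|not sure|why would i)\b", re.I),
    "comparison": re.compile(r"\b(versus|compared to|other provider)\b", re.I),
    "complaint":  re.compile(r"\b(terrible|still broken|fed up)\b", re.I),
}


def select_skills(message: str) -> list[str]:
    """Return the skill files to load for this turn."""
    fired = [s for s, pat in SIGNAL_PATTERNS.items() if pat.search(message)]
    if "complaint" in fired:                   # complaints override everything else
        return SIGNAL_SKILLS["complaint"]
    skills: list[str] = []
    for signal in fired:
        for skill in SIGNAL_SKILLS[signal]:
            if skill not in skills:
                skills.append(skill)
    return skills or ["discovery"]             # lightweight default for early turns
```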
This means the model reads a different prompt depending on the conversational moment. Early discovery turns get a lightweight prompt (253 lines, 37% smaller). Deep product conversations get a focused one (305 lines, 23% smaller). The model never sees the full 396-line prompt — it sees a curated subset tuned to what’s happening right now.
The effect on a small model is dramatic. Fewer instructions mean each instruction gets more attention. The model stops confusing “handle an objection about price” with “mention the promotional deadline” because it only sees whichever one is relevant at that moment.
Archetypes as compression
After 2–3 exchanges, the agent classifies the customer into one of 9 archetypes across 4 priority tiers: service recovery (never sell), high-intent (they know what they want), medium-intent (needs guidance), and low-intent (just browsing).
The archetype classification replaces a 34-line behavioral specification with a 4-line directive. Instead of listing every possible customer behavior and how to respond, the agent gets: “This is a price-sensitive family decision-maker. Be consultative. Do the comparison math. Expect a two-visit sale.”
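Mechanically, the whole thing can be a lookup table. The archetype names and directives below are illustrative, not the real ones:

```python
# Archetypes as compression: one short directive per archetype replaces
# the long behavioural spec. Names, tiers, and wording are placeholders.
ARCHETYPE_DIRECTIVES = {
    "service_recovery": (        # tier 1: never sell
        "Do not sell. Acknowledge the issue, apologise, escalate to staff."
    ),
    "high_intent_switcher": (    # tier 2: they know what they want
        "Confirm the choice, do the price math once, close today."
    ),
    "price_sensitive_family": (  # tier 3: needs guidance
        "Be consultative. Do the comparison math. Expect a two-visit sale."
    ),
    "casual_browser": (          # tier 4: just browsing
        "Stay light. Offer one anchor product. Do not push for a close."
    ),
}


def archetype_directive(archetype: str) -> str:
    """Return the short directive injected in place of the full behaviour spec."""
    return ARCHETYPE_DIRECTIVES.get(archetype, ARCHETYPE_DIRECTIVES["casual_browser"])
```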
This is compression in the information-theoretic sense. The archetype is the compressed representation of dozens of behavioral rules. And compression matters enormously when your model has a limited attention budget.
Deterministic scope guards
Here’s a lesson I learned the hard way: don’t trust the LLM to know its own boundaries.
The agent is a product specialist. It shouldn’t handle account changes, service complaints, contract negotiations, or billing disputes. Early on, I put this in the system prompt: “escalate these topics to staff.” The model followed this instruction… about 40–60% of the time.
The fix was a deterministic scope guard — a regex-based filter that runs before the LLM sees the message. If the customer says “cancel my account” or “my internet is down” or “I want to speak to a manager,” the system auto-escalates without consulting the model at all. No prompt engineering. No hoping the model reads the instruction correctly. Pattern match → escalate.
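The guard itself is boring code, and that’s the point. Something along these lines, with example patterns rather than the real list and stubbed handlers:

```python
import re

# Example escalation triggers; the real list is domain-specific.
ESCALATION_PATTERNS = [
    r"\bcancel (my|the) (account|service|contract)\b",
    r"\b(internet|connection|service) (is|keeps going) down\b",
    r"\bspeak to a (manager|human|real person)\b",
    r"\b(overcharged|billing dispute|bill is wrong)\b",
]
_ESCALATE = re.compile("|".join(ESCALATION_PATTERNS), re.IGNORECASE)


def escalate_to_staff(message: str) -> str:
    # Hand the conversation to a human; stubbed for this sketch.
    return "I'm connecting you with a member of our team now."


def run_sales_agent(message: str) -> str:
    ...  # normal LLM path: signal detection, skill selection, generation


def handle_message(message: str) -> str:
    """Deterministic guard: out-of-scope messages never reach the model."""
    if _ESCALATE.search(message):
        return escalate_to_staff(message)    # pattern match -> escalate
    return run_sales_agent(message)
```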
This took the handoff eval suite from 4–6/10 to 10/10. Instantly. The lesson: use the LLM for things that require judgment. Use deterministic code for things that require reliability.
The eval-driven loop
101 evaluation scenarios across 13 suites: product knowledge, objection handling, competitor comparisons, archetype adaptation, scope boundaries, cross-selling, closing behavior, and more.
The scenarios don’t just test mechanical correctness (“did it call the right tool?”). They test sales quality: Did it reframe the objection? Did it do the comparison math instead of making a vague value claim? Did it detect the customer’s real concern behind the stated question? Did it know when to stop selling and just listen?
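For a sense of shape, here’s what one scenario definition might look like. The schema and field names are assumptions, not the actual eval format:

```python
# Hypothetical scenario definition: mechanical checks plus sales-quality checks.
objection_scenario = {
    "id": "objection_price_03",
    "suite": "objection_handling",
    "setup": {
        "archetype": "price_sensitive_family",
        "history": ["agent pitched the mid-tier plan"],
    },
    "customer_message": "That's more than I pay now. I don't see the point.",
    "checks": [
        # mechanical correctness
        "called the product tool for the plan under discussion",
        # sales quality
        "acknowledged the concern before responding",
        "did the comparison math rather than a vague value claim",
        "did not recite the full spec list",
        "did not mention unrelated promotions",
    ],
}
```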
Building these evals was harder than building the agent. Each scenario encodes a judgment call about what “good selling” looks like. But once they exist, they’re a ratchet — every change to the system can be measured against 101 definitions of “good.”
I used an automated optimization loop for final tuning: mutate a skill file, run evals, keep improvements, revert regressions. But the loop can only optimize within the design space the evals define. If the evals don’t test for empathy in objection handling, the optimizer won’t discover empathy. The human work — research, playbooks, eval design — defines the design space. Automation explores it.
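The loop itself is simple. A sketch, with the mutation and eval-runner supplied as callables since those are the domain-specific parts (names are mine, not the real tooling):

```python
import shutil
from pathlib import Path
from typing import Callable


def optimize(
    skill_file: Path,
    mutate: Callable[[Path], None],   # rewrites one passage of the skill file
    run_evals: Callable[[], float],   # returns pass rate over the eval scenarios
    iterations: int = 20,
) -> float:
    """Mutate a skill file, keep improvements, revert regressions."""
    backup = skill_file.with_suffix(".bak")
    best = run_evals()                       # baseline before any mutation
    for _ in range(iterations):
        shutil.copy(skill_file, backup)      # snapshot so regressions can be reverted
        mutate(skill_file)
        score = run_evals()
        if score > best:
            best = score                     # keep the improvement
        else:
            shutil.copy(backup, skill_file)  # revert the regression
    backup.unlink(missing_ok=True)
    return best
```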
Embedding selling intelligence in data
The least obvious architectural decision: the product data files contain selling intelligence, not just specs.
When the agent retrieves a product, it doesn’t get “Plan X: 80 channels, $Y/month.” It gets:
- Value story: “This is the only way to get [premium content] in the market right now. At $Y, it’s cheaper than subscribing to the standalone streaming service.”
- Best for: “Customers switching from a competitor who watch drama and movies.”
- Comparison math: “Standalone streaming = $A. This plan = $B, with live channels on top. Saves $C/month.”
- Cross-sell: “Pairs well with the home security bundle for a complete household package.”
The agent doesn’t need to figure out the selling angle. The selling angle is in the data. The agent’s job is to weave it into natural conversation based on what the customer cares about. This is a much easier task for a small model than deriving the angle from raw specs.
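A sketch of what such a product record might look like, keeping the placeholder prices from above; the field names are illustrative:

```python
# Product data that carries selling intelligence alongside the specs.
product = {
    "id": "plan_x",
    "specs": {"channels": 80, "price_per_month": "$Y"},
    "selling": {
        "value_story": (
            "The only way to get the premium content in the market right now; "
            "at $Y it undercuts the standalone streaming subscription."
        ),
        "best_for": "Switchers from a competitor who watch drama and movies.",
        "comparison_math": {
            "standalone_streaming": "$A",
            "this_plan": "$B",
            "monthly_saving": "$C",
            "note": "live channels included on top",
        },
        "common_questions": [
            "Can I keep my current box?",
            "Is there a contract?",
        ],
        "cross_sell": ["home_security_bundle"],
    },
}
```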
What the ceiling tells you
At 87–90%, every remaining failure is the model making a suboptimal choice on a specific run — calling a tool from memory instead of looking it up, skipping an interactive widget in favor of a text list, not mentioning a time-limited promotion when it would be natural to. These failures rotate. Nothing is systematically broken.
This tells me the architecture is right and the model is the bottleneck. A larger model (70B+) would likely push this to 93–97% without changing a single line of prompt or code. The progressive disclosure, signal-based selection, scope guards, and embedded selling intelligence would all transfer directly.
And that’s the point. The architecture is model-agnostic. It’s a knowledge delivery system that matches information to conversational context. The model is a replaceable component within it.
The meta-lesson
Making a small LLM sell well is not a prompting problem. It’s a knowledge architecture problem.
The research took longer than the code. The playbooks took longer than the prompts. The evals took longer than the tuning. The unglamorous work — talking to domain experts, mapping customer archetypes, writing 101 scenario definitions of “good” — is the actual work. The model just executes within the space that work defines.
If your agent is mechanically correct but can’t sell, the fix isn’t a better model or a longer prompt. It’s a better architecture for getting the right knowledge to the model at the right moment. Progressive disclosure. Signal-based selection. Deterministic guards for reliability. Embedded intelligence in data.
The small model doesn’t need to be smart. It needs to be well-informed, at the right time, about the right things.