The hard part of AI sales agents isn't the plumbing
I’m experimenting with an AI-powered customer journey for a telco retail store. The idea: proximity notification triggers facial recognition, which launches an agentic AI sales conversation on a kiosk tablet, with a real-time staff co-pilot dashboard. The prototype has the full stack — TypeScript backend, React frontends, SSE streaming, SQLite sessions, face matching, behavioural skills, tools, tests.
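To make the plumbing claim concrete, the streaming layer is roughly this shape. This is a minimal sketch, not the prototype's actual code: `sseEvent` and `streamReply` are illustrative names, and the token source is faked.

```typescript
// SSE wire format: each event is "data: <payload>" followed by a blank line.
function sseEvent(data: unknown): string {
  return `data: ${JSON.stringify(data)}\n\n`;
}

// Stream a sequence of agent tokens to any sink — in the prototype,
// this would be an http.ServerResponse with a text/event-stream header.
function streamReply(
  tokens: Iterable<string>,
  write: (chunk: string) => void,
): void {
  for (const token of tokens) {
    write(sseEvent({ token }));
  }
  write(sseEvent({ done: true })); // tell the frontend the turn is over
}

// Collect what would go over the wire for a short reply.
const chunks: string[] = [];
streamReply(["Hi", " there"], (c) => chunks.push(c));
```

The kiosk and co-pilot frontends can then consume this with a plain `EventSource`, which is what makes SSE a good fit for one-way token streams.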
And then I talked to the agent. It was… fine. Mechanically correct. It called the right tools. It remembered your name. It stayed in guardrails.
But it couldn’t sell.
Specs don’t sell
The agent knows that the mid-tier TV pack has 80+ channels and costs a certain amount per month. It can recite this accurately. But a customer who walked in because their kids fight over the TV every evening doesn’t care about channel counts. They care that the plan lets everyone stream on 3 devices simultaneously — no more fights.
A customer already paying separately for a standalone streaming subscription doesn’t need to hear about “premium streaming apps included.” They need to hear: “You’re already spending more on streaming alone. This bundle includes all those apps and 100 live channels for less than you’d pay separately. You’d save money and get more.”
The difference isn’t information. It’s framing. And framing requires understanding what the customer actually wants — not what they said.
The four forces of every purchase
Jobs to Be Done theory has this concept called the Four Forces. Every time someone switches from one solution to another, four forces are at play:
```
       PUSH                        PULL
 (pain with current)        (attraction of new)
         ↓                           ↓
   ┌─────────────────────────────────┐
   │       DECISION TO SWITCH        │
   └─────────────────────────────────┘
         ↑                           ↑
      ANXIETY                      HABIT
  (fear of new)           (comfort with current)
```
Push and Pull drive the customer forward. Anxiety and Habit hold them back.
A great salesperson instinctively navigates these forces. They surface the push (“So your current setup isn’t really working for the family?”), amplify the pull (“Imagine everyone watching what they want, no arguments”), reduce the anxiety (“No lock-in contract, and we set it up for you”), and dissolve the habit (“You won’t even notice the switch — same apps, just more stuff around it”).
My agent does none of this. It waits for a question, recites a spec, and asks if you want a demo. That’s a brochure, not a salesperson.
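One way to move from brochure to salesperson is to make the Four Forces operational: keep a running estimate per force and pick the next conversational move from it. The scoring scale and move names below are assumptions for illustration, not part of JTBD theory or the prototype.

```typescript
// Running estimates of the four forces for the current conversation.
type Forces = { push: number; pull: number; anxiety: number; habit: number };

type Move =
  | "surface_push"    // "So your current setup isn't really working?"
  | "amplify_pull"    // "Imagine everyone watching what they want."
  | "reduce_anxiety"  // "No lock-in contract, and we set it up for you."
  | "dissolve_habit"; // "You won't even notice the switch."

// Heuristic: once forward momentum exists, work on the strongest
// blocking force; otherwise keep building push and pull.
function nextMove(f: Forces): Move {
  const forward = f.push + f.pull;
  const backward = f.anxiety + f.habit;
  if (forward <= backward) {
    return f.push <= f.pull ? "surface_push" : "amplify_pull";
  }
  return f.anxiety >= f.habit ? "reduce_anxiety" : "dissolve_habit";
}
```

Even this crude version gives the agent something a brochure lacks: a reason to say one thing rather than another at a given moment.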
The constraint: a small local model
Here’s the thing — I’m running this on a local LLM. A 30B parameter model on edge hardware, not GPT-4 in the cloud. This is deliberate (edge-first, data sovereignty, no network round-trip latency), but it means I can’t brute-force intelligence with a massive prompt and a frontier model.
Every token in the system prompt must earn its place. The model can’t absorb a 50-page sales playbook and improvise. It needs tight, pattern-based instructions: “when X, do Y.” And those patterns need to be the right patterns — derived from real customer behaviour, not my imagination.
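What "tight, pattern-based instructions" can look like as data: each skill is a when/do pair that compiles into a single prompt line. The shape and the compile format here are assumptions for illustration, not the project's actual skill-file format.

```typescript
// A skill rule is one "when X, do Y" pattern.
interface SkillRule {
  when: string; // trigger: a detected customer signal
  do: string;   // the behaviour the model should follow
}

// Compile rules into compact prompt lines — every token earns its place.
function compileSkills(rules: SkillRule[]): string {
  return rules.map((r) => `- When ${r.when}, ${r.do}.`).join("\n");
}

// Hypothetical example rules, derived from the streaming-bundle reframe above.
const objectionSkill: SkillRule[] = [
  {
    when: "the customer mentions an existing streaming subscription",
    do: "reframe the bundle as a net saving, not an add-on",
  },
  {
    when: "the customer worries about contracts",
    do: "lead with the no-lock-in terms before price",
  },
];
```

Keeping skills as structured data rather than freeform prose also makes them mutable units for the automated tuning described below.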
Approaching this systematically
Instead of tweaking prompts ad-hoc and vibes-checking the output, I’m building a structured pipeline:
1. Research (human work)
- Real product catalog with selling angles, not just specs
- Customer archetypes from staff interviews — who actually walks in and why
- Competitive intelligence — what customers compare us to and how staff respond
- Real objections — the actual reasons people don’t buy today
- JTBD Four Forces maps for every major product switch
2. Playbooks (human + AI)
- Value stories: per-product outcome-based reframes
- Objection handlers: pattern → response pairs from what actually works
- Archetype strategies: how to adapt tone, pace, and product focus per customer type
3. Skills & Evals (AI + human review)
- New agent skill files encoding the playbooks into tight, pattern-based instructions
- Sales quality evaluation scenarios — not “did it call the right tool?” but “did it find the upsell opportunity?” and “did it handle the competitor objection?”
4. Automated tuning
- Using pi-autoresearch to run an autonomous optimisation loop: tweak a skill file → run evals → keep what improves scores → revert what doesn’t → repeat
The key insight: autoresearch can only optimise within a design space. It can’t invent the design space. If the evals don’t test for sales quality, optimising for eval pass rate won’t make the agent sell better. The research and playbooks define the design space. The evals measure it. Only then does automated tuning make sense.
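The tweak → eval → keep/revert loop has a simple shape, sketched generically below. This is the shape of the loop, not pi-autoresearch's actual API: `mutate` stands in for the skill-file edit and `runEvals` for the eval harness.

```typescript
// Greedy hill-climbing over a skill file: keep a candidate only if it
// improves the eval score, otherwise revert to the previous best.
function tuningLoop(
  initial: string,
  mutate: (skill: string) => string,
  runEvals: (skill: string) => number,
  iterations: number,
): { skill: string; score: number } {
  let best = initial;
  let bestScore = runEvals(initial);
  for (let i = 0; i < iterations; i++) {
    const candidate = mutate(best);
    const score = runEvals(candidate);
    if (score > bestScore) {
      best = candidate; // keep what improves scores
      bestScore = score;
    } // else: revert — `best` stays unchanged
  }
  return { skill: best, score: bestScore };
}
```

The loop makes the design-space point tangible: `runEvals` is the ceiling. If it doesn't measure sales quality, the loop will happily climb toward something else.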
What makes this hard
Three things:
Small model, big ask. A 30B model has to juggle product knowledge, conversation flow, customer reading, objection handling, and business objectives — all from a system prompt under 400 lines. Every new skill I add risks crowding out an existing one.
Knowledge is layered. Product specs go in tool responses (retrieved on demand). Selling strategies go in the system prompt (must shape every response). Customer context gets injected dynamically per session. Getting the architecture of knowledge right matters more than getting any single piece of knowledge right.
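The three layers above imply a prompt-assembly step along these lines. A minimal sketch; the function and field names are illustrative, and the real prototype surely carries more context.

```typescript
// Layer 3: per-session customer context, injected dynamically.
interface SessionContext {
  customerName?: string;
  archetype?: string; // e.g. a staff-interview-derived customer type
}

// Layer 2: selling strategies live in the system prompt because they
// must shape every response. Layer 1 (product specs) is deliberately
// absent here — the agent retrieves specs on demand via tool calls.
function buildSystemPrompt(
  sellingStrategies: string,
  session: SessionContext,
): string {
  const context = session.archetype
    ? `Customer archetype: ${session.archetype}.`
    : "No customer context yet.";
  return [sellingStrategies, context].join("\n\n");
}
```

The payoff of this split is budget control: the always-present layer stays small and stable, while the bulky, volatile product data never competes for system-prompt tokens.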
The measurement problem. How do you eval “good selling”? A mechanically correct response and a persuasive response can both call the right tool and contain the right facts. The difference is subtle — it’s in the framing, the empathy, the timing of the offer. I need evals that capture this, and that’s genuinely hard to automate.
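One pragmatic way to partially automate this: score each transcript against a weighted rubric of sales behaviours, with an LLM judge (or a human reviewer) answering each yes/no question. The rubric items below are examples of the kind of question involved, not the project's actual eval set.

```typescript
// A rubric item is a yes/no question about the transcript, with a weight.
interface RubricItem {
  question: string;
  weight: number;
}

const salesRubric: RubricItem[] = [
  { question: "Did the agent surface the customer's underlying job, not just answer the literal question?", weight: 3 },
  { question: "Did the agent reframe a spec as an outcome?", weight: 2 },
  { question: "Did the agent address the strongest objection before closing?", weight: 2 },
];

// Normalise judged answers into a 0..1 score for the tuning loop.
function scoreTranscript(rubric: RubricItem[], answers: boolean[]): number {
  const total = rubric.reduce((sum, r) => sum + r.weight, 0);
  const earned = rubric.reduce(
    (sum, r, i) => sum + (answers[i] ? r.weight : 0),
    0,
  );
  return earned / total;
}
```

This doesn't fully solve the measurement problem — the judge can be wrong — but it turns "did it sell well?" into something a tuning loop can climb.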
Where this stands
The plumbing works. The research briefs are written — each one a clear task spec for what to gather and how to deliver it. The next step is the unglamorous one: go talk to store staff, observe real customers, gather the raw material, and build the intelligence layer that turns a chatbot into a salesperson.
The interesting bet is whether a 30B parameter model, with the right knowledge architecture and enough eval-driven tuning, can actually be good at this. Not frontier-model good. Good enough that a customer walking into a retail store has a better experience than browsing a brochure rack.
I think it can. But I’ll find out.