Eval Framework in Practice: An AI Sales Assistant Example
In the last post, we went over the new operating system for managing AI-native products: the Eval-Driven Development (EDD) framework. To make this concrete, let's walk through the practical steps of building an evaluation framework for a hypothetical product: an "AI Sales Assistant."
The assistant's primary function is to draft personalized follow-up emails for sales representatives based on data from the company's CRM and transcripts of recent sales calls. The overarching business goal is to increase sales rep efficiency while improving the quality and consistency of customer communications.
Applying the five-step EDD playbook to this product would look like this:
Objective: The product objective is not merely to "build an email drafter." A measurable objective would be: "Generate a factually accurate, relevant, and on-brand follow-up email draft in under 30 seconds that a sales rep can send with minimal editing."
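To make an objective like this operational, it helps to pin the success criteria down as explicit, machine-checkable thresholds. Below is a minimal sketch in Python; the metric names and threshold values are illustrative assumptions, not part of the original spec.

```python
# Illustrative success criteria for the AI Sales Assistant (names and values are assumptions).
# Each threshold maps back to the stated objective: accurate, relevant, on-brand drafts
# in under 30 seconds that a rep can send with minimal editing.
SUCCESS_CRITERIA = {
    "max_draft_latency_seconds": 30,   # "in under 30 seconds"
    "min_factual_accuracy": 0.98,      # share of CRM facts reproduced correctly in the draft
    "min_relevance_score": 4.0,        # 1-5 judge scale: references the right deal context
    "min_tone_alignment_score": 4.0,   # 1-5 judge scale: matches the company's brand voice
    "max_edit_ratio": 0.15,            # "minimal editing": <=15% of the draft changed before send
}
```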
Dataset: To create the "golden set," the team would gather 200 real (and fully anonymized) CRM opportunity records, each with corresponding call notes or transcripts. Then, the company's top three performing sales reps would be tasked with hand-writing the "perfect" follow-up email for each of these 200 scenarios. This collection of expert-written emails becomes the "ground truth" against which the AI's outputs will be measured.
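One natural way to represent each golden-set example is as a simple record that pairs the inputs (CRM data, call transcript) with the expert-written reference email. Here is a minimal sketch as a Python dataclass; the field names are hypothetical and would be adapted to the actual CRM schema.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One anonymized scenario from the golden set (field names are illustrative)."""
    opportunity_id: str    # anonymized CRM opportunity reference
    crm_record: dict       # structured deal data pulled from the CRM
    call_transcript: str   # transcript or notes from the most recent call
    reference_email: str   # the "perfect" follow-up hand-written by a top rep
    author: str = "top_rep"                    # which expert authored the reference
    tags: list = field(default_factory=list)   # e.g. deal stage, industry, objection type

# The golden set is then simply a list of 200 such examples, versioned alongside the evals.
```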
Metrics & Framework: The core of the EDD process is the multi-dimensional evaluation framework. This is not a simple checklist but a detailed scorecard that balances different aspects of performance. The table below illustrates what such a framework could look like.
This framework translates abstract goals into a tangible, measurable system. The dimensions chosen—Functional Correctness, Response Quality, User Trust & Safety, and Business Impact—directly map to the key challenges of building a successful AI product.
Each metric is concrete and tied to a specific measurement method, demonstrating the practical application of the hybrid evaluation approach. For instance, "Tool Call Accuracy" is a deterministic check perfect for a fast, code-based eval. "Relevance" and "Coherence" are subjective and well-suited for a scalable LLM-as-judge. "Tone Alignment" is highly nuanced and best calibrated with expert human evaluation before being automated. Finally, "User Adoption" and "Edit Distance" are business-level metrics tracked via behavioral analytics that measure whether the product is actually delivering value to users. This structured table is more than an example; it is a template for translating the abstract theory of EDD into a practical tool that any product team can adapt and implement.
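As a rough illustration of how these measurement methods differ in code, the sketch below implements three of them: a deterministic tool-call check, an LLM-as-judge score behind a generic callable, and an edit-distance proxy built on Python's standard library. The function names, judge prompt, and scoring scale are assumptions; the actual judge model and rubric would be product-specific.

```python
import difflib
import json

def tool_call_accuracy(expected_calls: list, actual_calls: list) -> float:
    """Deterministic, code-based check: fraction of expected CRM/tool calls
    that the assistant actually made (order-insensitive)."""
    expected = {json.dumps(c, sort_keys=True) for c in expected_calls}
    actual = {json.dumps(c, sort_keys=True) for c in actual_calls}
    return len(expected & actual) / len(expected) if expected else 1.0

def judge_relevance(draft: str, reference_email: str, call_transcript: str, judge) -> float:
    """LLM-as-judge check: `judge` is any callable that takes a prompt string and
    returns a 1-5 score (the model choice and rubric are assumptions)."""
    prompt = (
        "Score 1-5 how well the DRAFT addresses the specific next steps and concerns "
        "raised in the TRANSCRIPT, using the REFERENCE as a gold standard.\n"
        f"TRANSCRIPT:\n{call_transcript}\n\nREFERENCE:\n{reference_email}\n\nDRAFT:\n{draft}\n"
        "Respond with a single number."
    )
    return float(judge(prompt))

def edit_ratio(draft: str, sent_email: str) -> float:
    """Behavioral proxy for 'minimal editing': share of the draft the rep changed
    before sending, approximated with difflib similarity."""
    return 1.0 - difflib.SequenceMatcher(None, draft, sent_email).ratio()
```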
The Real Moat: Your Evals Are Your Most Defensible Asset
In the rapidly commoditizing landscape of AI, traditional competitive moats are eroding. If the underlying technology and the surface-level features are no longer defensible, where does a sustainable competitive advantage come from?
Not from the model, and not from the features. It comes from an AI company's proprietary, ever-improving eval framework and the unique "golden dataset" that powers it. This is a fundamental shift in strategic thinking.
Evaluation frameworks are highly defensible for several key reasons:
They Encode Unique Domain Expertise: A well-constructed eval is the codified expression of a team's deep, nuanced understanding of its specific customers and their definition of "quality."
An evaluation framework built to assess the outputs of a legal tech product is useless for a company building an AI for medical diagnostics. This domain-specific judgment, embedded in the evals, cannot be easily copied or reverse-engineered.
They Create a Powerful Data Flywheel: A product designed with a probabilistic mindset actively captures user feedback and corrections.
This feedback is used to continuously refine and expand the golden dataset. A better, more diverse dataset leads to more robust evals. More robust evals enable the team to build a better, more reliable model. A better model delivers a superior product experience, which in turn attracts more engaged users, who generate more high-quality feedback data.
This creates a virtuous, self-reinforcing cycle: a data network effect powered and accelerated by the evaluation engine. (A minimal sketch of the capture step appears after the last of these reasons.)
They Drive Superior Iteration Speed: A team with a robust, automated evaluation system can test hypotheses, measure improvements, and ship better products far faster than a competitor relying on manual checks, subjective opinions, and gut feel. In the fast-moving world of AI, the speed and quality of iteration are a decisive competitive advantage.
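To make the flywheel's capture step concrete, the fragment below sketches one way to log a rep's corrected email as a new candidate golden example. It is a minimal illustration that reuses the hypothetical GoldenExample record and edit_ratio helper from the earlier sketches; the threshold and field names are assumptions, not a prescribed implementation.

```python
def capture_correction(golden_set: list, example_inputs: dict, ai_draft: str, sent_email: str,
                       min_edit_ratio: float = 0.05) -> None:
    """Flywheel capture step: if the rep meaningfully edited the AI draft, the
    corrected email becomes a new candidate golden example for expert review."""
    if edit_ratio(ai_draft, sent_email) >= min_edit_ratio:  # only keep real corrections
        golden_set.append(GoldenExample(
            opportunity_id=example_inputs["opportunity_id"],
            crm_record=example_inputs["crm_record"],
            call_transcript=example_inputs["call_transcript"],
            reference_email=sent_email,   # the rep's corrected version is the new ground truth
            author="field_correction",    # flagged for expert review before promotion
            tags=["flywheel_candidate"],
        ))
```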
The fundamental mindset shift for product leaders, therefore, goes beyond mere process change. It is a strategic reorientation. The job is no longer to manage a backlog of features to be shipped. The new mandate is to act as the chief architect of an evaluation engine. That engine—the system that embodies the company's unique judgment and powers its learning loop—is what will separate the winners from the losers.
The game is no longer about having the best algorithm today; it is about building the best system for improvement tomorrow. This is the core of the probabilistic mindset, and it is the key to building not just a successful AI product, but a lasting, defensible business.


