From Feature-Driven to Eval-Driven: The New OS for AI Product Development
The Mindset That Got You Here Won't Get You There
When I was at Google Deepmind, I led a team of product managers building the first generation of Gemini experience. I was shocked to find that our decades of experience that made us top-tier product leaders seemed to become our biggest vulnerability.
For years, a seasoned product manager would be rewarded for building solid, thoughtful and predictable product experiences. This deterministic mindset, honed over countless successful launches, is now a liability.
This isn't an exaggeration. The failure rate for enterprise AI projects is shockingly high—some estimates say over 80% of them fail. This isn't a technology problem indeed. It’s a mindset problem, a direct result of applying an old playbook to a new game.
Modern AI is inherently probabilistic. With a given input, you wouldn’t know the exact output. This isn't a bug; it's a core feature. But it breaks every rule in the traditional PM playbook. Leaders demand certainty from systems designed to operate on probability, and products collapse under the weight of this flawed expectation.
The solution is not a new tool. It’s a fundamental upgrade to your own mental operating system. It requires a "Probabilistic Mindset Shift"—the move from thinking in binary right/wrong answers to thinking on a different set of quality measures, likelihoods, confidence scores, and acceptable ranges.
This shift changes the questions you ask. Instead of "Does the feature work?", you must now ask:
What are the different dimensions and measurements of quality?
How do you evaluate the non-deterministic outcomes?
What’re the non-acceptable outputs and how can you prevent them?"
How do we improve without a single "right" answer?"
Since you can't guarantee perfect correctness, you must design for trust. This isn't a vague goal; it's a concrete design principle with clear strategies like providing explainability, showing confidence scores, and giving users control to override the AI. User trust becomes a core metric you can and should measure.
This new mindset leads to a new operating system called Eval-Driven Development.
This is the playbook for that OS shift, drawn from my in-the-trenches experience leading products at Google DeepMind and coaching the next generation of AI-native product leaders and entrepreneurs.
From Feature-Driven to Eval-Driven
The traditional operating system for product development is Feature-Driven Development (FDD). It operates on a straightforward contract: the Product Requirements Document (PRD) specifies, in detail, the features to be built. The team then executes against those specifications. Success is measured by shipping the defined features with high accuracy, on time and within budget. This model is perfectly suited for the predictable, deterministic world of traditional software.
However, when applied to AI development, the FDD model breaks down completely:
Performance Cannot Be "Specced": A PM cannot write a PRD that states, "The AI will generate 'good' marketing copy." The definition of "good" is subjective, contextual, and cannot be captured in a static requirements document. The quality of an AI's output exists on a spectrum, not as a binary pass/fail state.
Outcomes Are Inherently Unpredictable: A team does not know how well a model will perform a novel task until they actually build and test it. And if you test it with the traditional QA methods, you don’t get what you actually look for. The AI development lifecycle is not a linear march toward a known endpoint; it is an uncertain and highly iterative process of research and discovery.
The Focus Is Wrong: FDD concentrates on what to build—the feature. Effective AI development must focus on how well the system performs a task—the outcome.
The failure of the old model necessitates a new operating system: Eval-Driven Development (EDD). In this paradigm, the central governing artifact of the product process is no longer the PRD. It is the evaluation framework, or "eval" for short. The core idea of EDD is that the product requirement is not a list of features to be built. Instead, the requirement is a set of clearly defined, measurable criteria that constitute a "good" or successful output for a given task.
The development cycle is transformed from a linear Spec -> Build -> Test -> Ship process into a continuous, data-driven loop:
Define Eval Criteria -> Test Against Criteria -> Analyze Failures & Gaps -> Improve -> Repeat
The product manager's role undergoes a transformation accordingly. The primary job is no longer writing feature specs. It is architecting the evaluation framework and system itself. The PM's product judgment, deep understanding of user needs, and definition of quality are encoded directly into the evaluation framework, which then guides the engineering and data science teams' iterative work.
The adoption of EDD represents more than just a change in process; it triggers a fundamental power shift within a product organization. It moves the source of truth and authority from static, narrative-based documents (PRDs) to dynamic, data-driven systems (eval frameworks). In the FDD world, power is concentrated in the "what we should build" decision, and the product manager, as the author of the PRD, is the primary gatekeeper of that decision.
In an EDD world, the eval framework becomes the source of truth. Its creation is an inherently collaborative act, PMs need to build deep partnership with data scientists to define statistically sound metrics (like precision, recall, or F1-score), with engineers to build the automated testing infrastructure, and with domain experts to provide the nuanced, qualitative judgment of what "good" actually looks like in a specific context.
Consequently, the product manager's role evolves to defining the problem space through the architecture of the evals and then managing a portfolio of experimental bets to improve performance within that space. This requires a different set of skills: less feature specification and more statistical literacy, experimental design, and deep technical collaboration.
The EDD Playbook: A Step-by-Step Guide
Moving from theory to practice requires a concrete, actionable playbook. The following five-step process outlines how to implement Eval-Driven Development, transforming it from an abstract concept into a day-to-day operational reality for product teams.
Step 1: Define Your Objective (Not Your Feature)
The process begins by anchoring on the core user problem and the desired business outcome, not on a proposed solution. In traditional development, a team might receive a request to "Build a feature to summarize call transcripts." In EDD, this is reframed as an objective: "Reduce the time our sales reps spend on post-call administrative work by 50% by providing them with accurate and relevant summaries." This objective-first orientation is a cornerstone of modern product management, but it becomes non-negotiable in the world of AI, where the path to the solution is not dictated in advance.
Step 2: Collect Your Dataset (The "Golden Set")
An evaluation framework is useless without high-quality data to test the system against. The next critical step is to assemble a "golden set" of representative examples that will serve as the benchmark for performance. This dataset is thoughtfully curated to reflect the full spectrum of real-world usage. A robust golden set should include:
Happy Paths: Typical, common scenarios that represent the core use case.
Edge Cases: Uncommon but plausible inputs that test the boundaries of the system's capabilities.
Adversarial Cases: Inputs deliberately designed to trick, confuse, or "jailbreak" the system, testing its safety and robustness.
The data for this set can be sourced from various places, including anonymized production logs, examples manually curated by domain experts, or even synthetically generated data designed to cover specific scenarios. The guiding principles for this dataset are diversity and realism.
Step 3: Define Your Eval Metrics (The Multi-Dimensional Scorecard)
This step is the heart of the EDD process. A single, high-level metric like "accuracy" is almost always insufficient and can be misleading. Instead, a multi-dimensional scorecard is needed to capture a holistic view of the AI's performance. This scorecard is typically built using a hybrid of three distinct evaluation methods:
Code-based Eval: This method uses automated code to perform objective, deterministic checks. It is fast, cheap, and ideal for verifying things like output formatting ("Does the output contain valid JSON?"), length constraints ("Is the summary under 200 words?"), or the presence of specific keywords.
Auto Eval (LLM-as-Judge): This innovative technique uses another powerful LLM (like GPT-4o or Claude 3.5 Sonnet) as an automated "judge" to score more subjective qualities. By giving the judge model a clear rubric, it can assess dimensions like "relevance," "clarity," or "coherence" on a numeric scale. This approach offers a scalable way to evaluate subjective quality but requires careful prompt engineering for the judge model to ensure consistency.
Human Eval: This remains the gold standard for evaluating the most nuanced and subjective qualities, such as adherence to a specific brand voice, creativity, factuality, or overall user trust. While it is the most time-consuming and expensive method, it is essential for establishing a "ground truth" and for calibrating the automated LLM-as-judge evals. Given its heavy lifting, plan ahead and manage this process well to drive efficiency and quality.
A comprehensive scorecard will include metrics across several dimensions, such as Factual Accuracy, Relevance, Coherence, Tone, Safety, and Lack of Bias.
Step 4: Run and Compare Evals (The Iteration Engine)
With the objective, dataset, and metrics in place, the team can begin the core iteration loop. This is a scientific process of experimentation, not a gut-feel-driven one. The loop proceeds as follows:
Establish a Baseline: Run the initial version of the AI system (the "V0") against the full evaluation suite to establish a baseline score for every metric on the scorecard.
Formulate a Hypothesis: Propose a specific, testable change. For example: "I hypothesize that adding a chain-of-thought reasoning step to the prompt will improve the 'Factual Accuracy' score by 10% without significantly degrading the 'Latency' score."
Implement and Re-run: Make the single change to the system and re-run the entire evaluation suite.
Analyze the Results: Compare the new scorecard to the baseline. Did the 'Factual Accuracy' score improve as predicted? Did the change cause an unexpected regression in another area, like the 'Tone Alignment' score?
This data-driven feedback loop replaces subjective debates with objective evidence, allowing the team to iterate rapidly and make demonstrable progress.
Step 5: Continuously Evaluate (Closing the Loop)
EDD does not end when the product is launched. It is a continuous process.
The production environment must be instrumented to constantly gather feedback and surface new failure modes. User feedback—both explicit (ratings, correction submissions) and implicit (tracking when a user ignores a suggestion)—should be collected and analyzed. The most interesting examples, especially failures, should be fed back into the "golden set," making it a living, growing asset.
This continuous loop of evaluation and refinement is the only way to prevent model performance from drifting or degrading over time and to ensure the product becomes more valuable, not less, after it ships.
In the next post, we are going to dive deeper into a case study to see how this Eval-Driven Development framework manifests itself in practice. We are also going over how you would turn your eval framework to be the most defensible moat for your AI product. Stay tuned.


