In my first stint as a machine learning (ML) product manager, a simple question sparked passionate debates across functions and leaders: How do we know whether this product is actually working? The product I managed served both internal and external customers. The model enabled internal teams to identify the top issues their customers faced, so they could prioritize the right set of experiences to fix those issues. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like landing a plane without any instructions from air traffic control. There is absolutely no way to make informed decisions for your customers without knowing what is going right or wrong. Additionally, if you do not actively define the metrics, your team will come up with their own back-up metrics. The risk of having multiple flavors of "precision" or "quality" metrics is that everyone develops their own version, which leads to a scenario where you are not all working toward the same outcome.
For example, when I reviewed the annual goals and their underlying metrics with the engineering team, the immediate feedback was: "This is a business metric; we already track precision and recall."
First, identify what you want to know about AI products
Once you've taken on the task of defining metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers translates into complexity in defining its metrics as well. What do I use to measure whether the model is working well? Measuring whether internal teams prioritized launches based on the model's recommendations is not fast enough; measuring whether customers adopted a solution the model recommended risks drawing conclusions from a very broad adoption metric (what if the customer did not adopt the solution because they just wanted to reach a support agent?).
Fast forward to the era of large language models (LLMs): we no longer have just a single output from an ML model; we also have text answers, images and music as outputs. The product dimensions that require metrics are expanding rapidly: formats, customers, types. The list goes on.
When I try to come up with metrics for any of my products, my first step is to distill what I want to know about the impact on my customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are some examples (a small computation sketch follows the list):
- Did the customer get an output? → Coverage metrics
- How long did it take for the product to provide an output? → Latency metrics
- Did the user like the output? → Customer feedback, customer adoption and retention metrics
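To make the first two questions concrete, here is a minimal sketch, assuming a hypothetical per-session event log with illustrative field names (`session_id`, `output_shown`, `response_ms`) rather than any real product's instrumentation:

```python
# Minimal sketch: computing coverage and latency from a hypothetical event log.
# Field names are illustrative assumptions, not a real schema.

sessions = [
    {"session_id": "s1", "output_shown": True,  "response_ms": 420},
    {"session_id": "s2", "output_shown": False, "response_ms": None},
    {"session_id": "s3", "output_shown": True,  "response_ms": 1280},
]

# Coverage: share of sessions where the customer actually received an output.
coverage = sum(s["output_shown"] for s in sessions) / len(sessions)

# Latency: how long the product took to provide the output, summarized here
# as a simple average over the sessions that produced one.
latencies = [s["response_ms"] for s in sessions if s["response_ms"] is not None]
avg_latency_ms = sum(latencies) / len(latencies)

print(f"coverage: {coverage:.0%}, avg latency: {avg_latency_ms:.0f} ms")
```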
Once you have identified your key questions, the next step is to identify a set of sub-questions for "input" and "output" signals. Output metrics are lagging indicators: they measure events that have already occurred. Input metrics are leading indicators that can be used to identify trends and predict outcomes. Below is how the appropriate sub-questions and leading/lagging indicators attach to the questions above; one way to record this mapping in code is sketched after the list. Not every question needs a leading/lagging indicator.
- Did the customer get an output? → Coverage
- How long did it take for the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
  - Did the user indicate that the output was right/wrong? (output)
  - Was the output good/fair? (input)
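One lightweight way to keep this question-to-metric mapping visible to the whole team, sketched below with assumed names rather than a prescribed format, is to record each question, its metric and whether it is a leading (input) or lagging (output) indicator in a small shared structure:

```python
from dataclasses import dataclass

@dataclass
class MetricSpec:
    question: str   # what you want to know about the product
    metric: str     # how you will measure it
    nature: str     # "output" (lagging) or "input" (leading)

metric_specs = [
    MetricSpec("Did the customer get an output?", "Coverage", "output"),
    MetricSpec("How long did it take to provide the output?", "Latency", "output"),
    MetricSpec("Did the user indicate the output was right/wrong?",
               "Customer feedback, adoption and retention", "output"),
    MetricSpec("Was the output good/fair?",
               "% of outputs rated good/fair per rubric", "input"),
]

# Example: list only the lagging (output) metrics for a dashboard review.
lagging_metrics = [m.metric for m in metric_specs if m.nature == "output"]
```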
The third and final step is to identify how to collect the metrics. Most metrics are collected at scale through new instrumentation via data engineering. In some cases, however, especially for ML-based products (such as question 3 above), you also have the option of manual or automated evaluations that assess the model's outputs. While it is always best to work toward automated evaluations, starting with a manual evaluation of "was the output good/fair" and creating a rubric that defines good, fair and bad lays the groundwork for a rigorous, tested automated evaluation process.
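As a starting point, a manual evaluation can be as simple as a shared rubric plus a script that aggregates rater labels into a "% good/fair" input metric; the rubric wording and label set below are illustrative assumptions, not a prescribed standard:

```python
from collections import Counter

# Illustrative rubric: what "good", "fair" and "bad" mean for a generated output.
RUBRIC = {
    "good": "Accurate, complete and directly usable by the customer",
    "fair": "Mostly accurate but needs minor edits or is missing detail",
    "bad":  "Inaccurate, irrelevant or unusable",
}

# Labels produced by human raters (or, later, by an automated evaluator
# validated against these manual labels).
labels = ["good", "fair", "bad", "good", "good", "fair"]

counts = Counter(labels)
pct_good_or_fair = (counts["good"] + counts["fair"]) / len(labels)
print(f"% of outputs rated good/fair: {pct_good_or_fair:.0%}")
```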
Example use cases: AI search, listing descriptions
You can apply the framework above to any ML-based product to identify the list of primary metrics to track. Let's take search as an example.
| Question | Metric | Nature of the metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of search sessions with search results shown to the customer | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to display search results to the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate that the output was right/wrong? (output) Was the output good/fair? (input) | % of search sessions with "thumbs up" feedback on search results, or % of search sessions with clicks from the customer. % of search results marked "good/fair" for each search term, per the quality rubric | Output; Input |
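For the third row, here is a minimal sketch of the output-side feedback metric, the share of search sessions with a thumbs up or a click on the results; the session fields (`thumbs_up`, `clicked_result`) are hypothetical instrumentation, not a real schema:

```python
# Illustrative search-session records; field names are assumptions.
search_sessions = [
    {"thumbs_up": True,  "clicked_result": False},
    {"thumbs_up": False, "clicked_result": True},
    {"thumbs_up": False, "clicked_result": False},
    {"thumbs_up": True,  "clicked_result": True},
]

# Output (lagging) signal: % of sessions where the customer gave a thumbs up
# or clicked a search result.
engaged = sum(1 for s in search_sessions if s["thumbs_up"] or s["clicked_result"])
pct_engaged = engaged / len(search_sessions)
print(f"% of sessions with thumbs up or click: {pct_engaged:.0%}")
```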
What about a product that generates listing descriptions (whether for a menu item on Doordash or a product listing on Amazon)?
| Question | Metric | Nature of the metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of listings with a generated description | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to generate a description for the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate that the output was right/wrong? (output) Was the output good/fair? (input) | % of listings with generated descriptions that required edits from the technical content team/seller/customer. % of listing descriptions marked "good/fair", per the quality rubric | Output; Input |
The approach above is scalable to multiple ML-based products. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.