In my first stint as a machine learning (ML) product manager, a simple question sparked passionate debates across functions and leaders: How do we know whether this product is actually working? The product I managed served both internal and external customers. The model enabled internal teams to identify the top issues their customers faced, so they could prioritize the right set of experiences to fix those issues. With such a complex web of interdependencies among internal and external customers, choosing the right metrics to capture the product's impact was critical to steering it toward success.
Not tracking whether your product is working well is like landing a plane without any instructions from air traffic control. There is absolutely no way to make informed decisions for your customers without knowing what is going right or wrong. Additionally, if you do not actively define the metrics, your team will come up with their own back-up metrics. The risk of having multiple flavors of "precision" or "quality" metrics is that everyone develops their own version, which leads to a scenario where you are not all working toward the same outcome.
For example, when I reviewed the annual goals and their underlying metrics with the engineering team, the immediate feedback was: "This is a business metric; we already track precision and recall."
First, identify what you want to know about AI products
Once you've taken on the task of defining metrics for your product, where do you start? In my experience, the complexity of operating an ML product with multiple customers translates into complexity in defining its metrics as well. What do I use to measure whether the model is working well? Measuring whether internal teams prioritized launches based on the model's recommendations is not fast enough; measuring whether customers adopted a solution the model recommended risks drawing conclusions from a very broad adoption metric (what if the customer did not adopt the solution because they just wanted to reach a support agent?).
Fast forward to the era of large language models (LLMs): we no longer have just a single output from an ML model; we also have text answers, images and music as outputs. The product dimensions that require metrics are expanding rapidly: formats, customers, types. The list goes on.
When I try to come up with metrics for any of my products, my first step is to distill what I want to know about the impact on my customers into a few key questions. Identifying the right set of questions makes it easier to identify the right set of metrics. Here are some examples (a small computation sketch follows the list):
- Did the customer get an output? → Coverage metrics
- How long did it take for the product to provide an output? → Latency metrics
- Did the user like the output? → Customer feedback, customer adoption and retention metrics
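To make the first two questions concrete, here is a minimal sketch, assuming a hypothetical per-session event log with illustrative field names (`session_id`, `output_shown`, `response_ms`) rather than any real product's instrumentation:

```python
# Minimal sketch: computing coverage and latency from a hypothetical event log.
# Field names are illustrative assumptions, not a real schema.

sessions = [
    {"session_id": "s1", "output_shown": True,  "response_ms": 420},
    {"session_id": "s2", "output_shown": False, "response_ms": None},
    {"session_id": "s3", "output_shown": True,  "response_ms": 1280},
]

# Coverage: share of sessions where the customer actually received an output.
coverage = sum(s["output_shown"] for s in sessions) / len(sessions)

# Latency: how long the product took to provide the output, summarized here
# as a simple average over the sessions that produced one.
latencies = [s["response_ms"] for s in sessions if s["response_ms"] is not None]
avg_latency_ms = sum(latencies) / len(latencies)

print(f"coverage: {coverage:.0%}, avg latency: {avg_latency_ms:.0f} ms")
```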
Once you have identified your key questions, the next step is to identify a set of sub-questions for "input" and "output" signals. Output metrics are lagging indicators: they measure events that have already occurred. Input metrics are leading indicators that can be used to identify trends and predict outcomes. Below is how the appropriate sub-questions and leading/lagging indicators attach to the questions above; one way to record this mapping in code is sketched after the list. Not every question needs a leading/lagging indicator.
- Did the customer get an output? → Coverage
- How long did it take for the product to provide an output? → Latency
- Did the user like the output? → Customer feedback, customer adoption and retention
  - Did the user indicate that the output was right/wrong? (output)
  - Was the output good/fair? (input)
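One lightweight way to keep this question-to-metric mapping visible to the whole team, sketched below with assumed names rather than a prescribed format, is to record each question, its metric and whether it is a leading (input) or lagging (output) indicator in a small shared structure:

```python
from dataclasses import dataclass

@dataclass
class MetricSpec:
    question: str   # what you want to know about the product
    metric: str     # how you will measure it
    nature: str     # "output" (lagging) or "input" (leading)

metric_specs = [
    MetricSpec("Did the customer get an output?", "Coverage", "output"),
    MetricSpec("How long did it take to provide the output?", "Latency", "output"),
    MetricSpec("Did the user indicate the output was right/wrong?",
               "Customer feedback, adoption and retention", "output"),
    MetricSpec("Was the output good/fair?",
               "% of outputs rated good/fair per rubric", "input"),
]

# Example: list only the lagging (output) metrics for a dashboard review.
lagging_metrics = [m.metric for m in metric_specs if m.nature == "output"]
```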
The third and final step is to identify how to collect the metrics. Most metrics are collected at scale through new instrumentation via data engineering. In some cases, however, especially for ML-based products (such as question 3 above), you also have the option of manual or automated evaluations that assess the model's outputs. While it is always best to work toward automated evaluations, starting with a manual evaluation of "was the output good/fair" and creating a rubric that defines good, fair and bad lays the groundwork for a rigorous, tested automated evaluation process.
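As a starting point, a manual evaluation can be as simple as a shared rubric plus a script that aggregates rater labels into a "% good/fair" input metric; the rubric wording and label set below are illustrative assumptions, not a prescribed standard:

```python
from collections import Counter

# Illustrative rubric: what "good", "fair" and "bad" mean for a generated output.
RUBRIC = {
    "good": "Accurate, complete and directly usable by the customer",
    "fair": "Mostly accurate but needs minor edits or is missing detail",
    "bad":  "Inaccurate, irrelevant or unusable",
}

# Labels produced by human raters (or, later, by an automated evaluator
# validated against these manual labels).
labels = ["good", "fair", "bad", "good", "good", "fair"]

counts = Counter(labels)
pct_good_or_fair = (counts["good"] + counts["fair"]) / len(labels)
print(f"% of outputs rated good/fair: {pct_good_or_fair:.0%}")
```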
Example use cases: AI search, listing descriptions
You can apply the framework above to any ML-based product to identify the list of primary metrics to track. Let's take search as an example.
| Question | Metric | Nature of the metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of search sessions with search results shown to the customer | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to display search results to the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate that the output was right/wrong? (output) Was the output good/fair? (input) | % of search sessions with "thumbs up" feedback on search results, or % of search sessions with clicks from the customer. % of search results marked "good/fair" for each search term, per the quality rubric | Output; Input |
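For the third row, here is a minimal sketch of the output-side feedback metric, the share of search sessions with a thumbs up or a click on the results; the session fields (`thumbs_up`, `clicked_result`) are hypothetical instrumentation, not a real schema:

```python
# Illustrative search-session records; field names are assumptions.
search_sessions = [
    {"thumbs_up": True,  "clicked_result": False},
    {"thumbs_up": False, "clicked_result": True},
    {"thumbs_up": False, "clicked_result": False},
    {"thumbs_up": True,  "clicked_result": True},
]

# Output (lagging) signal: % of sessions where the customer gave a thumbs up
# or clicked a search result.
engaged = sum(1 for s in search_sessions if s["thumbs_up"] or s["clicked_result"])
pct_engaged = engaged / len(search_sessions)
print(f"% of sessions with thumbs up or click: {pct_engaged:.0%}")
```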
What about a product that generates listing descriptions (whether for a menu item on Doordash or a product listing on Amazon)?
| Question | Metric | Nature of the metric |
|---|---|---|
| Did the customer get an output? → Coverage | % of listings with a generated description | Output |
| How long did it take for the product to provide an output? → Latency | Time taken to generate a description for the user | Output |
| Did the user like the output? → Customer feedback, customer adoption and retention. Did the user indicate that the output was right/wrong? (output) Was the output good/fair? (input) | % of listings with generated descriptions that required edits from the technical content team/seller/customer. % of listing descriptions marked "good/fair", per the quality rubric | Output; Input |
The approach above is scalable to multiple ML-based products. I hope this framework helps you define the right set of metrics for your ML model.
Sharanya Rao is a group product manager at Intuit.