The Power of Defining Success

What does success mean to you? Relax, this is not a motivational message. By the end of this post, however, you will have what you need to join the most successful 10%* of AI business transformations! (*made-up statistic)

Memory Lane: The AMI Corpus

I have been reflecting on the topic of measuring success since reading this article about Mistral’s release of Voxtral 2 last week. Hidden in their report is the fact that they measure speaker-turn (diarization) error rate on the AMI Corpus, among other benchmarks.

Twenty years ago, as a researcher at IDIAP, I worked on the AMI project to apply speech-to-text to business meetings. We quickly hit a wall: the technology at the time was great for a single person dictating a document, but it was woeful for natural conversation between a group of people.

In real meetings, people don't use perfect grammar; they grunt, interrupt, and talk over each other. We realised that to improve meeting transcription, we needed data from actual meetings, not people reading books. 

That led us to create the AMI Corpus, which provided the ground truth needed to measure and improve performance. Since then, the AMI benchmark has been used consistently in conversational speech recognition research, with error rates dropping from 30% to under 10%, approaching human accuracy.
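Error rates like those reported on AMI are typically measured as Word Error Rate (WER): the minimum number of word edits (substitutions, insertions, deletions) needed to turn the hypothesis transcript into the ground-truth reference, divided by the reference length. A minimal sketch of the standard dynamic-programming computation:

```python
# Word Error Rate: edit distance between reference and hypothesis
# transcripts, normalised by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via a dynamic-programming table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one insertion ("a") across 5 reference words:
print(wer("move the meeting to tuesday", "move meeting to a tuesday"))  # 0.4
```

The same ground-truth-plus-metric recipe applies to any task: the metric changes, but the need for labelled real-world data does not.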

The Point: Models Aren't Optimised for You

Here’s the key point: the whizzbang AI models you are adopting today have not been optimised with your specific definition of success in mind.

Model producers like OpenAI or Anthropic optimise for general benchmark tasks. They publish impressive bar charts that keep getting better and better, but you won’t find data on whether their new model improves your outbound sales conversions or your e-commerce customer satisfaction.

Why not? Because that isn’t their job; it’s yours.

The release of GPT-3 in 2020, and of its RLHF-tuned successors, had such an impact because the models responded like a human. This didn’t happen by chance: one of the key factors was that researchers turned human preference into an optimisable success measure via Reinforcement Learning from Human Feedback (RLHF), and then spent years optimising for it.

In AI adoption, there are no shortcuts. You must define your success measure, collect your data, and start benchmarking.
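In practice, a first benchmark can be surprisingly small. The sketch below scores a hypothetical `classify` function against a labelled dataset; in a real setup, `classify` would wrap your LLM or vendor API, and the dataset would be your own proprietary ground truth:

```python
# Minimal benchmark harness: score a model against labelled data using
# your own success metric (here, plain classification accuracy).
def classify(text: str) -> str:
    # Toy stand-in: a real version would call an LLM or vendor API.
    return "complaint" if "refund" in text.lower() else "enquiry"

# Your ground truth: (input, expected label) pairs from real traffic.
dataset = [
    ("I want a refund for my order", "complaint"),
    ("What time do you open?", "enquiry"),
    ("My parcel never arrived", "complaint"),
    ("Do you ship to Australia?", "enquiry"),
]

correct = sum(classify(text) == label for text, label in dataset)
accuracy = correct / len(dataset)
print(f"accuracy: {accuracy:.0%}")  # accuracy: 75%
```

The harness stays the same as models change underneath it, which is exactly what makes the comparisons in the next section possible.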

Why Internal Benchmarking is a Commercial Necessity

Even if you use off-the-shelf models, building internal benchmarks provides three massive strategic advantages:

  • Vendor Leverage: Vendors do not publish metrics specific to your use case. In building Dubber Insights, we benchmarked various transcription providers against our own proprietary dataset of business calls. This gave us an information advantage in negotiations and allowed us to mix vendors to deliver the best accuracy for every language, migrating when necessary to track the ever-evolving state of the art.

  • Model Agility: The AI world moves at a breathtaking pace. Since 2021, the three largest commercial LLM providers (OpenAI, Google, Anthropic) have released over 100 LLM variants, and half of those have already been retired. How do you keep up with all that change? Your own benchmark puts you back in control. You can objectively decide when to migrate to improve cost or accuracy, and you’ll know exactly which model to pick when your current one is deprecated.

  • Engineering Integrity (CI/CD for AI): Testing and QA processes are established pillars of modern software development. However, once you include an LLM in your stack, traditional code tests aren't enough to ensure quality. You need AI accuracy benchmarking integrated into your CI/CD pipelines to guard against model drift or the unintended side effects that code changes can have on AI prompts and outputs.
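On that last point, an accuracy benchmark can run as an ordinary test in the pipeline, so a deploy fails when quality regresses. A sketch, in which `run_benchmark` and the baseline figure are hypothetical placeholders for your own evaluation suite and agreed minimum:

```python
# CI/CD accuracy gate: the build fails if the model (plus current
# prompts) drops below a baseline agreed from previous releases.
BASELINE_ACCURACY = 0.90

def run_benchmark() -> float:
    # Placeholder: a real version would replay your benchmark dataset
    # through the deployed model/prompt combination and score it.
    return 0.93

def test_model_accuracy_regression():
    accuracy = run_benchmark()
    assert accuracy >= BASELINE_ACCURACY, (
        f"accuracy {accuracy:.2%} fell below baseline {BASELINE_ACCURACY:.2%}"
    )
```

Run under any test runner (e.g. pytest), this catches both model drift and the prompt a developer "harmlessly" reworded last sprint.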

As a final thought, keep in mind that benchmarking tells you how you expect a model to perform, but production metrics tell you how it is actually working. By adding model-performance statistics, such as classification rates, to your production observability stack, you can safely A/B test new models or prompts against live traffic. A dual approach of benchmarking together with production monitoring of model performance gives you the confidence to move fast.
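For the A/B test itself, a simple two-proportion z-test is often enough to tell whether the new model's success rate on live traffic is a genuine improvement or just noise. A stdlib-only sketch with purely illustrative counts:

```python
import math

# One-sided two-proportion z-test: is variant B's success rate on live
# traffic significantly higher than variant A's?
def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # One-sided p-value from the standard normal CDF (via erf).
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative counts: model A correct on 850/1000 calls, model B on 890/1000.
p = two_proportion_p_value(850, 1000, 890, 1000)
print(f"p-value: {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the improvement is real; otherwise, keep collecting traffic before migrating.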

The Bottom Line

At the heart of every great AI advancement is a clearly defined measure of success. Yet, this principle is often neglected by businesses rushing to incorporate LLMs.

As AI models become increasingly commoditised, having a clear definition of your task and the tools to measure it on real data puts you in the driver's seat.

To cut through the noise and ensure your AI transformation actually works: define what success means to you, and then relentlessly measure it.
