Rate Your AI


Today I’ll be talking about large language models (LLMs). This piece isn’t about artificial general intelligence (AGI) and robots replacing humans (I don’t believe that will happen anytime soon, and neither does Help Scout) but about what will be created and achieved in the future.

Remember how the AI and machine learning (ML) fields looked just a short time ago?

Swag from ChatbotConf 2016

It’s staggering to think of how far we’ve come and how fast we got here. LLMs took the world by storm just over two years ago when OpenAI released ChatGPT, and I, for one, have definitely been impressed by how far the technology Google helped bring to life has come. Fast forward to 2025, and ChatGPT isn’t just a toy or a fad but a useful tool that not only helps provide customer support but actually helps people in all walks of life.

Bringing AI to Help Scout 

Our Help Scout account has a ton of text (things like customer conversations, saved replies, Docs articles, etc.), and, as you might know, a ton of text and LLMs are a match made in heaven. With all of that data handy, we were excited to create new ways to help our lovely Cteam (what we call our support team) do their jobs faster and more efficiently.

We thought a great place to start was figuring out how to automatically answer some of our most common questions. That’s how two of our newest features, AI Drafts and AI Answers, were conceived.

They work a bit differently from each other, but the gist is the same: We use the retrieval augmented generation (RAG) pattern to build context for the LLM, which then answers the user’s question. (It’s not the goal of this post to explain what RAG is or how to tune retrieval, but please let me know if you’d like to read another post on how Help Scout does RAG!)
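For a rough mental model, here’s a minimal sketch of what a RAG flow can look like in Python. The `retriever` and `llm_client` objects and their methods are hypothetical placeholders, not Help Scout’s actual implementation:

```python
# Minimal RAG sketch (hypothetical helper names, not Help Scout's actual code):
# 1) retrieve relevant documents, 2) build a context-rich prompt, 3) ask the LLM.

def answer_question(question: str, retriever, llm_client) -> str:
    # Retrieve the top-k documents most similar to the question
    # (e.g., from a store of Docs articles and saved replies).
    documents = retriever.search(question, top_k=5)

    # Stitch the retrieved text into the prompt as context.
    context = "\n-----\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the customer's question using only the articles below.\n"
        f"Articles:\n{context}\n\n"
        f"Question: {question}"
    )

    # The LLM generates an answer grounded in the retrieved context.
    return llm_client.complete(prompt)
```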

Playing Whac-An-LLM

We started pretty much where everyone else did when building out these features: We made a prompt with instructions, threw it at OpenAI, got a response, and returned it to the user.
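For the curious, that early flow boils down to something like the snippet below, using the current OpenAI Python SDK. The system instructions here are illustrative, not our real prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_reply(question: str) -> str:
    # One prompt with instructions, one call, one response back to the user.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a friendly Help Scout support assistant."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```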

It worked surprisingly well for such a simple flow, but users were not always happy; there were bugs and edge cases. We started making prompt changes to fix those issues, but we soon found that while some of the fixes worked, others made things worse! As our prompts grew in length and complexity, the process came to resemble a game of Whac-A-Mole, or, in our case, Whac-An-LLM.

It was to be expected, of course. LLMs are not obedient robots that follow every instruction you give them, nor are they capable of reasoning. Instead, they’re probabilistic machines that hallucinate all the time, and it just so happens that we find most of their creations useful!

We weren’t the only ones working hard, either; OpenAI, Anthropic, Google, Meta, Cohere, and other companies were also hard at work training bigger and better (faster, cheaper, smarter; you name the adjective, they were working on it) LLMs. We started building for the GPT-4 OpenAI model, but soon the GPT-4-turbo and GPT-4o models were released. We then found out that just swapping one model for another doesn’t always work the way we’d expect it to. Even if the newest model shines in the benchmarks, that doesn’t mean it generalizes to our task or that users will be happy with the output (yes, even if it’s from the same provider!).

It became clear we needed a system to rate and evaluate our AI features. 

Rating our AI

The first thing that comes to mind when we talk about rating AI is those mildly annoying feedback modals that pop up once you’ve had your question answered and are ready to move on. If you’ve been using ChatGPT, chances are you’ve seen them: 

It’s interesting that the ChatGPT design team experiments a lot with that feedback flow. I can recall several different designs: they’ve varied the number and placement of feedback icons, they offer a side-by-side comparison view when preparing a new model release, and so on.

So we tried that as well. It was definitely an interesting experiment, but it didn’t quite go the way we expected. First of all, we have to accept the fact that users are not very keen on surveys, let alone ones served up by bots! Also, people tend to leave negative feedback more often than positive. But most of the time there was no feedback at all. 🤷

The second issue we discovered was that a lot of the negative feedback was a “false positive”: technically, the answer the AI provided was absolutely correct, but the customer just wasn’t happy with it. For example, requests for a discount or about features not available in Help Scout were often rated negatively just because people didn’t like the answer! The positive feedback we received was helpful, but there just wasn’t much of it.

So, you may be asking yourself whether you should use a customer survey pop-up. As is often the case, it depends. It didn’t work very well for us, but as one of many signals, I suppose it could still be useful.
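If you do try it, the plumbing can be as simple as storing each thumbs-up or thumbs-down next to the session it refers to, so it can be joined with other signals later. A rough sketch; the schema and the `store` object are hypothetical, not our production code:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AnswerFeedback:
    """One explicit feedback event on an AI-generated answer (hypothetical schema)."""
    session_id: str      # which LLM interaction the feedback refers to
    rating: str          # "positive" or "negative"
    comment: str | None  # optional free-text comment from the user
    created_at: datetime

def record_feedback(store, session_id: str, rating: str, comment: str | None = None) -> None:
    # Persist the signal so it can later be joined with the logged LLM session
    # and reviewed by a human, e.g., to spot "correct answer, unhappy customer" cases.
    store.save(AnswerFeedback(session_id, rating, comment, datetime.now(timezone.utc)))
```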

Introducing Freeplay

The next option we tried was internal human rating and evaluation: asking colleagues at Help Scout to help us build a better AI solution. To get humans to rate Drafts and Answers, the first step is to store LLM interactions so they’re ready for review, labeling, and rating. I’d say that at this point there’s no sense in building that system on your own, as there are quite a few LLM observability/LLMOps tools (LangSmith, Weave, Datadog, etc.) available. We’re using Freeplay, but I’m pretty sure the flow translates to other solutions as well.
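Under the hood, “storing LLM interactions” just means capturing the prompt, inputs, output, and metadata for every call. Here’s a rough idea of the shape of that data; this is a hand-rolled JSONL stand-in for illustration, not Freeplay’s actual SDK:

```python
import json
import time
import uuid

def log_llm_session(log_path: str, prompt_template: str, inputs: dict,
                    output: str, model: str, latency_ms: float) -> str:
    """Append one LLM interaction ("session") to a JSONL log so humans can
    review, label, and rate it later. A stand-in for what an LLMOps tool's
    SDK does for you."""
    session_id = str(uuid.uuid4())
    record = {
        "session_id": session_id,
        "timestamp": time.time(),
        "model": model,
        "prompt_template": prompt_template,
        "inputs": inputs,            # e.g., {"question": ..., "articles": ...}
        "output": output,
        "latency_ms": latency_ms,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return session_id
```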

What does Freeplay give us?

  • One of our Freeplay dashboards, showing different metrics (cost, latency) as charts.

  • The Freeplay prompt editor window, showing part of the “generate-draft” prompt with variables, prompt details, and LLM output.

  • The Freeplay “test runs” view, showing an evaluation of the AI Answers feature on 10 test examples.

The first step when evaluating an ML system is to understand what the target metrics are. Target metrics can be common and well known in the industry, like precision and recall, or business specific, like context awareness or tone of language. Since you’ll be optimizing the system around those metrics, it’s very important to come up with a good list! Of course, it’s an iterative process, and it’s absolutely necessary to learn and polish as you go.
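For the industry-standard metrics, the arithmetic is straightforward once you have human labels. A minimal sketch, assuming each example is labeled with a predicted and an actual boolean outcome (for instance, “the AI chose to answer” vs. “a human judged an answer was warranted”); this is illustrative, not Help Scout’s exact metric definition:

```python
def precision_recall(labels: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Compute precision and recall from (predicted, actual) boolean pairs."""
    tp = sum(1 for predicted, actual in labels if predicted and actual)
    fp = sum(1 for predicted, actual in labels if predicted and not actual)
    fn = sum(1 for predicted, actual in labels if not predicted and actual)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```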

Here are some of our evaluations:

  • Context awareness – the LLM understands the context and the question well and can provide a relevant response.

  • Language – the tone resembles Help Scout’s, and the answer is polite and concise, without controversial topics or biases.

Once you have an initial list of metrics set up and ready, you can start applying them and rating the AI. A naive approach would be to just observe the production traffic and evaluate it retroactively. This would allow you to understand how you’re doing, but it doesn’t help you win the Whac-An-LLM game. 

The better approach would be to create pre-defined datasets that capture a subset of production traffic so your evaluations become comparable to each other. In Freeplay terminology, each dataset evaluation is done via a test run that stores the results and allows you to compare different test runs to each other.

Of course, the reality is more complicated. You can see in the screenshot above that those datasets are really small and therefore unlikely to be representative of production traffic. So you have to balance the dataset size (a bigger dataset better approximates the distribution of questions in production) against the effort (runtime) required to evaluate it. We strike that balance by randomly sampling production traffic into a larger dataset.
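The sampling itself is simple; the key detail is fixing a seed so the dataset stays stable across test runs. A sketch only (Freeplay manages datasets and test runs for us):

```python
import random

def sample_eval_dataset(production_sessions: list[dict], size: int, seed: int = 42) -> list[dict]:
    """Randomly sample logged production sessions into a fixed eval dataset,
    so repeated test runs stay comparable while evaluation stays fast."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    return rng.sample(production_sessions, k=min(size, len(production_sessions)))
```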

Another complication is that the system design and behavior might change over time (especially during early development phases), so the dataset might require re-sampling from time to time. Last but not least, you have to decide how often to run those evaluations: on each prompt change, at regular intervals, or on some other schedule? The answers really depend on your team size, budget, risk tolerance, and so on.

We’re a small and lean team, so we also rely heavily on a Freeplay feature called model-graded evaluations. This just means using another (often more capable) LLM to automatically rate outputs from the LLM powering our application. The idea isn’t new, but it drastically reduces the amount of human time required to rate and review sessions.

For example, here’s a model-graded evaluation for “article completeness”:

Evaluation prompt template:

Determine if the provided articles contain all the information needed to fully answer the question.
Multiple supporting articles will be provided and separated by -----

Question: {{inputs.question}}
Articles: {{inputs.articles}}

Evaluation rubric:

No: The articles do not provide enough information to fully answer the question
Yes: The articles provide sufficient information to fully answer the question

The Freeplay evaluation metric editor, showing the “article completeness” eval.
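To make that concrete, here’s roughly what running such an eval could look like outside of Freeplay, using the OpenAI SDK. The grader model choice, the added “answer Yes or No” instruction, and the response parsing are my assumptions here; Freeplay wires the rubric up for you:

```python
from openai import OpenAI

client = OpenAI()

EVAL_TEMPLATE = """Determine if the provided articles contain all the information needed to fully answer the question.
Multiple supporting articles will be provided and separated by -----

Question: {question}
Articles: {articles}

Answer with exactly one word: Yes or No."""

def grade_article_completeness(question: str, articles: str, grader_model: str = "gpt-4o") -> str:
    # Ask a (typically more capable) grader model to apply the rubric.
    response = client.chat.completions.create(
        model=grader_model,
        messages=[{"role": "user", "content": EVAL_TEMPLATE.format(question=question, articles=articles)}],
        temperature=0,  # keep grading as deterministic as the model allows
    )
    return response.choices[0].message.content.strip()  # "Yes" or "No"
```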

Freeplay has more information on how to create those evals and align them with human judgement in this excellent blog post. From Help Scout’s experience, I can add that model-graded evaluations are indispensable for iterating quickly and running ratings often, even on smaller changes.

There are a few cons, of course: They cost money, are probabilistic in nature (like any LLM output), take time to run on bigger datasets, and don’t replace some of the more nuanced human evals. For example, ratings like “language” or “tone of voice” are pretty subjective and not captured very well with LLMs. 

The results

Freeplay and the techniques listed above allow us to iterate quickly, fix bugs, introduce improvements, and switch models with a high level of confidence. For example, we’ve been able to reduce the hallucination rate to under 5%, switch to cheaper OpenAI models, and introduce a more nuanced Answer flow while maintaining quality.

The next steps for us are to ramp up our dataset curation and maintenance efforts as well as scale up our human labeling. Automatic evaluations alone don’t cover 100% of our target metrics — at least not yet! 

I think this picture from Freeplay does a great job of summarizing what our day-to-day looks like:

  1. We’ve instrumented the application to record LLM calls and send them to Freeplay.

  2. We have humans reviewing, rating, and labeling “sessions” (recorded LLM interactions), curating datasets, and crafting evaluations in Freeplay.

  3. We launch experiments on the curated datasets and gather the results.

  4. Once we’re happy with the results, we deploy the changes to production and monitor them. (Then the loop repeats from #2 above.)

Do you rate your AI systems? Why or why not? I’d love to hear from you!
