Large language models (LLMs) have become a regular part of both our daily lives and professional work. They also play a central role in the AI products we deliver to our customers. While these models are typically evaluated against established benchmarks before release, much less is known about how they perform in real-world operational settings - for example, when applied to military intelligence data.
That’s why it’s essential not to rely on them blindly, but to carefully examine their behaviour when exposed to unseen or domain-specific information. And that’s exactly what we’re doing at Systematic. Since integrating LLMs into our products, we’ve been focused on developing tools and frameworks to assess model responses - both for general tasks like document summarization and translation, and for defence-specific tasks such as interpreting military documentation.
Our approach centers on evaluating models using data that reflects real-world scenarios as closely as possible, while also pushing their boundaries with challenging edge cases. This helps us clearly understand not only what the models do well, but also where they fall short - allowing us to communicate those limitations transparently.
By Márcia Vagos, Senior Data Scientist
The case for AI evaluation
The use of large language models (LLMs) has surged over the past three years, and they are now widely adopted in data science projects involving natural language processing.
At Systematic, we are using LLMs to deliver AI-powered solutions to our defence and healthcare customers and users. And as a software company with excellence in mind, we put quality assurance at the center of everything we do!
That is why in our Defence AI projects, several initiatives have emerged to establish guidelines, best practices, and frameworks for testing and evaluating the outputs of services using LLMs “out-of-the-box.” In the data science team in the Insight program specifically, we have been integrating LLM evaluation as a standard part of our AI services development cycle. This enables us to explore not only the “sunshine” use cases early on, but also the “rainy” ones, and to keep project stakeholders informed. The significance of this lies in ensuring that we develop products to high standards while building user trust - especially critical in defence, where AI may support mission-critical decisions on which lives on the battlefield depend.
By embedding evaluation steps early in the development process, AI application developers gain insights into how models respond to different prompts and can more easily identify risks and issues before they arise. Evaluation is therefore key to ensuring that AI products function according to requirements, produce safe and useful outputs, and ultimately meet user expectations.
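To make this concrete, here is a minimal sketch of what such an early evaluation step could look like in Python. The test cases, the expected-content checks, and the generate function are hypothetical placeholders chosen for illustration, not our actual test data or setup.

```python
# Minimal sketch of an early evaluation step, assuming a hypothetical
# `generate(prompt)` callable that wraps whichever LLM the service uses.
# The test cases and string checks below are illustrative placeholders only.

TEST_CASES = [
    {
        "prompt": "Summarise the following report in three sentences: ...",
        "must_contain": ["report"],        # properties we expect in the output
        "must_not_contain": ["I cannot"],  # e.g. unwanted refusals
    },
    {
        "prompt": "Translate the following sentence to English: ...",
        "must_contain": [],
        "must_not_contain": ["as an AI"],
    },
]


def run_checks(generate) -> list[dict]:
    """Run each test case through the model and record simple pass/fail signals."""
    results = []
    for case in TEST_CASES:
        output = generate(case["prompt"])
        passed = all(s.lower() in output.lower() for s in case["must_contain"]) and not any(
            s.lower() in output.lower() for s in case["must_not_contain"]
        )
        results.append({"prompt": case["prompt"], "output": output, "passed": passed})
    return results
```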
Making AI do the work for you
Lately, we’ve been experimenting with using other LLMs to assess the responses of the LLM under evaluation. This technique, known as “LLM-as-a-judge” (LLMaaJ), involves using a larger and more capable model to evaluate the outputs of our AI services. This allows us to fully automate the evaluation process without human intervention. We’re currently focusing on aspects such as factuality, fluency, coherence, and accuracy.
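As a rough illustration of the idea, the sketch below shows how a single LLM-as-a-judge call could be structured. The judge prompt, the 1-5 scale, and the call_judge_model callable are assumptions made for this example; they are not our production prompts or infrastructure.

```python
import json

# Criteria we ask the judge to score, mirroring the aspects mentioned above.
CRITERIA = ["factuality", "fluency", "coherence", "accuracy"]

JUDGE_PROMPT = """You are evaluating the answer of another AI system.

Question / task given to the system:
{task}

Answer produced by the system:
{answer}

Reference material (may be empty):
{reference}

For each of the criteria {criteria}, give an integer score from 1 (very poor)
to 5 (excellent). Respond with a JSON object mapping each criterion to its score.
"""


def judge_answer(call_judge_model, task: str, answer: str, reference: str = "") -> dict:
    """Ask a stronger 'judge' model to score an answer on the listed criteria.

    `call_judge_model` is a hypothetical callable that sends a prompt to the
    judge LLM and returns its raw text response.
    """
    prompt = JUDGE_PROMPT.format(
        task=task, answer=answer, reference=reference, criteria=", ".join(CRITERIA)
    )
    raw = call_judge_model(prompt)
    scores = json.loads(raw)  # assumes the judge was instructed to return valid JSON
    return {c: int(scores[c]) for c in CRITERIA}
```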
But of course it’s not all fun and games - most of the effort is still going into collecting test data and refining the prompts for the LLMaaJs, so there’s still a fair amount of manual work involved as we explore what works and what doesn’t. But once the prompts are polished, the entire evaluation pipeline can be run with a single command - how cool is that?
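To illustrate what that single command could look like, here is a sketch of a small command-line entry point that reads a JSONL test set, scores each answer with the hypothetical judge_answer helper from the previous sketch, and writes the results to disk. The module name, file format, and field names are assumptions for the example.

```python
import argparse
import json

# Builds on the previous sketch: `judge_answer` is assumed importable from a
# hypothetical local module containing that code.
from judge_sketch import judge_answer


def call_judge_model(prompt: str) -> str:
    """Placeholder for the call to the judge LLM; wire up to a real model client."""
    raise NotImplementedError("connect this to whichever judge model endpoint is used")


def main() -> None:
    parser = argparse.ArgumentParser(description="Run LLM-as-a-judge over a test set.")
    parser.add_argument("testset", help="JSONL file with 'task', 'answer' and optional 'reference' fields")
    parser.add_argument("--out", default="scores.jsonl", help="where to write the judge scores")
    args = parser.parse_args()

    with open(args.testset, encoding="utf-8") as f_in, open(args.out, "w", encoding="utf-8") as f_out:
        for line in f_in:
            case = json.loads(line)
            scores = judge_answer(
                call_judge_model,
                task=case["task"],
                answer=case["answer"],
                reference=case.get("reference", ""),
            )
            f_out.write(json.dumps({**case, "scores": scores}) + "\n")


if __name__ == "__main__":
    main()
```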
What’s next?
While we’re still in the early stages, our experience so far has been positive. We’ve found that in at least some cases, LLMaaJs are quite effective in automating the evaluation process - something that would otherwise be tedious and time-consuming for developers. We hope that as open-source datasets and higher-level frameworks continue to emerge, they’ll help us streamline these evaluation processes even further, eventually making them a natural part of how we work with LLMs at Systematic.