Large language models (LLMs) have become a regular part of both our daily lives and professional work. They also play a central role in the AI products we deliver to our customers. While these models are typically evaluated against established benchmarks before release, much less is known about how they perform in real-world operational settings - for example, when applied to military intelligence data.
That’s why it’s essential not to rely on them blindly, but to carefully examine their behaviour when exposed to unseen or domain-specific information. And that’s exactly what we’re doing at Systematic. Since integrating LLMs into our products, we’ve been focused on developing tools and frameworks to assess model responses - both for general tasks like document summarization and translation, and for defence-specific tasks such as interpreting military documentation.
Our approach centers on evaluating models using data that reflects real-world scenarios as closely as possible, while also pushing their boundaries with challenging edge cases. This helps us clearly understand not only what the models do well, but also where they fall short - allowing us to communicate those limitations transparently.
By Márcia Vagos, Senior Data Scientist
The case for AI evaluation
The use of large language models (LLMs) has surged over the past three years, and they are now widely adopted in data science projects involving natural language processing.
At Systematic, we are using LLMs to deliver AI-powered solutions to our defence and healthcare customers and users. And as a software company with excellence in mind, we put quality assurance at the center of everything we do!
That is why in our Defence AI projects, several initiatives have emerged to establish guidelines, best practices, and frameworks for testing and evaluating the outputs of services using LLMs “out-of-the-box.” In the data science team in the Insight program specifically, we have been integrating LLM evaluation as a standard part of our AI services development cycle. This enables us to explore not only the “sunshine” use cases early on, but also the “rainy” ones, and to keep project stakeholders informed. The significance of this lies in ensuring that we develop products to high standards while building user trust - especially critical in defence, where AI may support mission-critical decisions on which lives on the battlefield depend.
By embedding evaluation steps early in the development process, AI application developers gain insights into how models respond to different prompts and can more easily identify risks and issues before they arise. Evaluation is therefore key to ensuring that AI products function according to requirements, produce safe and useful outputs, and ultimately meet user expectations.
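To make this concrete, here is a minimal sketch of what such an early evaluation step could look like in Python. The test cases, the expected-content checks, and the generate function are hypothetical placeholders chosen for illustration, not our actual test data or setup.

```python
# Minimal sketch of an early evaluation step, assuming a hypothetical
# `generate(prompt)` callable that wraps whichever LLM the service uses.
# The test cases and string checks below are illustrative placeholders only.

TEST_CASES = [
    {
        "prompt": "Summarise the following report in three sentences: ...",
        "must_contain": ["report"],        # properties we expect in the output
        "must_not_contain": ["I cannot"],  # e.g. unwanted refusals
    },
    {
        "prompt": "Translate the following sentence to English: ...",
        "must_contain": [],
        "must_not_contain": ["as an AI"],
    },
]


def run_checks(generate) -> list[dict]:
    """Run each test case through the model and record simple pass/fail signals."""
    results = []
    for case in TEST_CASES:
        output = generate(case["prompt"])
        passed = all(s.lower() in output.lower() for s in case["must_contain"]) and not any(
            s.lower() in output.lower() for s in case["must_not_contain"]
        )
        results.append({"prompt": case["prompt"], "output": output, "passed": passed})
    return results
```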
Making AI do the work for you
Lately, we’ve been experimenting with using other LLMs to assess the responses of the LLM under evaluation. This technique, known as “LLM-as-a-judge” (LLMaaJ), involves using a larger and more capable model to evaluate the outputs of our AI services. This allows us to fully automate the evaluation process without human intervention. We’re currently focusing on aspects such as factuality, fluency, coherence, and accuracy.
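As a rough illustration of the idea, the sketch below shows how a single LLM-as-a-judge call could be structured. The judge prompt, the 1-5 scale, and the call_judge_model callable are assumptions made for this example; they are not our production prompts or infrastructure.

```python
import json

# Criteria we ask the judge to score, mirroring the aspects mentioned above.
CRITERIA = ["factuality", "fluency", "coherence", "accuracy"]

JUDGE_PROMPT = """You are evaluating the answer of another AI system.

Question / task given to the system:
{task}

Answer produced by the system:
{answer}

Reference material (may be empty):
{reference}

For each of the criteria {criteria}, give an integer score from 1 (very poor)
to 5 (excellent). Respond with a JSON object mapping each criterion to its score.
"""


def judge_answer(call_judge_model, task: str, answer: str, reference: str = "") -> dict:
    """Ask a stronger 'judge' model to score an answer on the listed criteria.

    `call_judge_model` is a hypothetical callable that sends a prompt to the
    judge LLM and returns its raw text response.
    """
    prompt = JUDGE_PROMPT.format(
        task=task, answer=answer, reference=reference, criteria=", ".join(CRITERIA)
    )
    raw = call_judge_model(prompt)
    scores = json.loads(raw)  # assumes the judge was instructed to return valid JSON
    return {c: int(scores[c]) for c in CRITERIA}
```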
But of course it’s not all fun and games - most of the effort is still going into collecting test data and refining the prompts for the LLMaaJs, so there’s still a fair amount of manual work involved as we explore what works and what doesn’t. But once the prompts are polished, the entire evaluation pipeline can be run with a single command - how cool is that?
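To illustrate what that single command could look like, here is a sketch of a small command-line entry point that reads a JSONL test set, scores each answer with the hypothetical judge_answer helper from the previous sketch, and writes the results to disk. The module name, file format, and field names are assumptions for the example.

```python
import argparse
import json

# Builds on the previous sketch: `judge_answer` is assumed importable from a
# hypothetical local module containing that code.
from judge_sketch import judge_answer


def call_judge_model(prompt: str) -> str:
    """Placeholder for the call to the judge LLM; wire up to a real model client."""
    raise NotImplementedError("connect this to whichever judge model endpoint is used")


def main() -> None:
    parser = argparse.ArgumentParser(description="Run LLM-as-a-judge over a test set.")
    parser.add_argument("testset", help="JSONL file with 'task', 'answer' and optional 'reference' fields")
    parser.add_argument("--out", default="scores.jsonl", help="where to write the judge scores")
    args = parser.parse_args()

    with open(args.testset, encoding="utf-8") as f_in, open(args.out, "w", encoding="utf-8") as f_out:
        for line in f_in:
            case = json.loads(line)
            scores = judge_answer(
                call_judge_model,
                task=case["task"],
                answer=case["answer"],
                reference=case.get("reference", ""),
            )
            f_out.write(json.dumps({**case, "scores": scores}) + "\n")


if __name__ == "__main__":
    main()
```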
What’s next?
While we’re still in the early stages, our experience so far has been positive. We’ve found that in at least some cases, LLMaaJs are quite effective in automating the evaluation process - something that would otherwise be tedious and time-consuming for developers. We hope that as open-source datasets and higher-level frameworks continue to emerge, they’ll help us streamline these evaluation processes even further, eventually making them a natural part of how we work with LLMs at Systematic.