A Promising Methodology for Testing GenAI Applications in Java
In the vast universe of programming, the era of generative artificial intelligence (GenAI) has marked a turning point, opening up a plethora of possibilities for developers.
Tools such as LangChain4j and Spring AI have democratized access to the creation of GenAI applications in Java, allowing Java developers to dive into this fascinating world. With LangChain4j, for instance, setting up and interacting with large language models (LLMs) has become exceptionally straightforward. Consider the following Java code snippet:
public static void main(String[] args) {
    var llm = OpenAiChatModel.builder()
            .apiKey("demo")
            .modelName("gpt-3.5-turbo")
            .build();

    System.out.println(llm.generate("Hello, how are you?"));
}
This example illustrates how a developer can quickly instantiate an LLM within a Java application. By simply configuring the model with an API key and specifying the model name, developers can begin generating text responses immediately. This accessibility is pivotal for fostering innovation and exploration within the Java community. Beyond that, there is a wide range of models that can be run locally, as well as various vector databases for storing embeddings and performing semantic searches, among other technological marvels.
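As an illustration of running a model locally, a LangChain4j Ollama model can be configured in much the same way. This is a minimal sketch, assuming a local Ollama server on its default port and the langchain4j-ollama dependency; the base URL and model name are examples, not values from this project:

import dev.langchain4j.model.ollama.OllamaChatModel;

public class LocalModelExample {
    public static void main(String[] args) {
        // Assumes an Ollama server running locally (default port 11434)
        // and a model already pulled, e.g. with `ollama pull llama3`
        var llm = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3")
                .build();

        System.out.println(llm.generate("Hello, how are you?"));
    }
}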
Despite this progress, however, we are faced with a persistent challenge: the difficulty of testing applications that incorporate artificial intelligence. This aspect seems to be a field where there is still much to explore and develop.
In this article, I will share a methodology that I find promising for testing GenAI applications.
Project overview
The example project focuses on an application that provides an API for interacting with two AI agents capable of answering questions.
An AI agent is a software entity designed to perform tasks autonomously, using artificial intelligence to simulate human-like interactions and responses.
In this project, one agent uses direct knowledge already contained within the LLM, while the other leverages internal documentation to enrich the LLM through retrieval-augmented generation (RAG). This approach allows the agents to provide precise and contextually relevant answers based on the input they receive.
I prefer to omit the technical details about RAG, as ample information is available elsewhere. I’ll simply note that this example employs a particular variant of RAG, which simplifies the traditional process of generating and storing embeddings for information retrieval.
Instead of splitting documents into chunks and embedding those chunks, this project uses an LLM to generate a summary of each document, and the embedding is generated from that summary.
When the user writes a question, an embedding of the question is generated and a semantic search is performed against the embeddings of the summaries. If a match is found, the user's message is augmented with the original document.
This way, there's no need to deal with chunk configuration, tune the number of chunks to retrieve, or second-guess whether the way the user's message is augmented makes sense. If there is a document that talks about what the user is asking, it will be included in the message sent to the LLM.
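To make that flow concrete, here is a minimal sketch of the ingestion and retrieval logic using LangChain4j's EmbeddingModel and EmbeddingStore abstractions. It is not the project's actual code: the class and method names are illustrative, and the real application wires these pieces through Spring.

import java.util.List;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;

class SummaryRag {
    private final ChatLanguageModel llm;
    private final EmbeddingModel embeddingModel;
    private final EmbeddingStore<TextSegment> embeddingStore;

    SummaryRag(ChatLanguageModel llm, EmbeddingModel embeddingModel,
               EmbeddingStore<TextSegment> embeddingStore) {
        this.llm = llm;
        this.embeddingModel = embeddingModel;
        this.embeddingStore = embeddingStore;
    }

    // Ingestion: summarize the document with the LLM, embed the summary,
    // and store the summary embedding together with the ORIGINAL document text.
    void ingest(String documentText) {
        String summary = llm.generate("Summarize the following document:\n" + documentText);
        Embedding summaryEmbedding = embeddingModel.embed(summary).content();
        embeddingStore.add(summaryEmbedding, TextSegment.from(documentText));
    }

    // Retrieval: embed the question, run a semantic search against the summary
    // embeddings and, if something matches, attach the original document.
    String augment(String question) {
        Embedding questionEmbedding = embeddingModel.embed(question).content();
        List<EmbeddingMatch<TextSegment>> matches = embeddingStore.findRelevant(questionEmbedding, 1);
        if (matches.isEmpty()) {
            return question;
        }
        return question + "\n\nUse the following document to answer:\n"
                + matches.get(0).embedded().text();
    }
}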
Technical stack
The project is developed in Java and utilizes a Spring Boot application with Testcontainers and LangChain4j.
For setting up the project, I followed the steps outlined in Local Development Environment with Testcontainers and Spring Boot Application Testing and Development with Testcontainers.
I also used Testcontainers Desktop to facilitate database access, verify the generated embeddings, and review the container logs.
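The setup described in those guides boils down to a @TestConfiguration that declares the containers the application needs at development and test time. The following is a sketch under the assumption that a PostgreSQL database with the pgvector image serves as the embedding store; the actual project may use a different store or image:

import org.springframework.boot.test.context.TestConfiguration;
import org.springframework.boot.testcontainers.service.connection.ServiceConnection;
import org.springframework.context.annotation.Bean;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.utility.DockerImageName;

@TestConfiguration(proxyBeanMethods = false)
class ContainersConfig {

    // Dev-time database; @ServiceConnection lets Spring Boot derive the
    // datasource properties from the running container automatically.
    @Bean
    @ServiceConnection
    PostgreSQLContainer<?> pgVector() {
        return new PostgreSQLContainer<>(
                DockerImageName.parse("pgvector/pgvector:pg16")
                        .asCompatibleSubstituteFor("postgres"));
    }
}

At development time, a configuration like this is typically attached with SpringApplication.from(Application::main).with(ContainersConfig.class), as described in the linked guides.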
The challenge of testing
The real challenge arises when trying to test the responses generated by language models. Traditionally, we could settle for verifying that the response includes certain keywords, which is insufficient and prone to errors.
static String question = "How I can install Testcontainers Desktop?";

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}",
            ChatController.ChatResponse.class, question).message();

    assertThat(answer).contains("https://testcontainers.com/desktop/");
}
This approach is not only fragile but also lacks the ability to assess the relevance or coherence of the response.
An alternative is to employ cosine similarity to compare the embeddings of a “reference” response and the actual response, providing a more semantic form of evaluation.
This method measures the similarity between two vectors/embeddings by calculating the cosine of the angle between them: the closer the two vectors point in the same direction (a cosine close to 1), the more semantically similar the “reference” response and the actual response are.
static String question = "How I can install Testcontainers Desktop?";

static String reference = """
        - Answer must indicate to download Testcontainers Desktop from https://testcontainers.com/desktop/
        - Answer must indicate to use brew to install Testcontainers Desktop in MacOS
        - Answer must be less than 5 sentences
        """;

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}",
            ChatController.ChatResponse.class, question).message();

    double cosineSimilarity = getCosineSimilarity(reference, answer);

    assertThat(cosineSimilarity).isGreaterThan(0.8);
}
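The getCosineSimilarity helper is not shown above; a minimal sketch, assuming the test has access to an EmbeddingModel, could look like this (the helper name matches the test, but its body here is illustrative):

// Illustrative helper: embeds both texts and computes the cosine of the angle
// between the two embedding vectors (dot product divided by the product of norms).
double getCosineSimilarity(String reference, String answer) {
    float[] a = embeddingModel.embed(reference).content().vector();
    float[] b = embeddingModel.embed(answer).content().vector();

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

LangChain4j also ships a CosineSimilarity utility that computes this value directly from two Embedding objects, which avoids writing the arithmetic by hand.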
However, this method introduces the problem of selecting an appropriate threshold to determine the acceptability of the response, in addition to the opacity of the evaluation process.
Toward a more effective method
The real problem here arises from the fact that answers provided by the LLM are in natural language and non-deterministic. Because of this, using current testing methods to verify them is difficult, as these methods are better suited to testing predictable values.
However, we already have a great tool for understanding non-deterministic answers in natural language: LLMs themselves. Thus, the key may lie in using one LLM to evaluate the adequacy of responses generated by another LLM.
This proposal involves defining detailed validation criteria and using an LLM as a “Validator Agent” to determine whether the responses meet the specified requirements. This approach can be applied to validate answers to specific questions, drawing on both general knowledge and specialized information.
By incorporating detailed instructions and examples, the Validator Agent can provide accurate and justified evaluations, offering clarity on why a response is considered correct or incorrect.
static String question = "How I can install Testcontainers Desktop?";

static String reference = """
        - Answer must indicate to download Testcontainers Desktop from https://testcontainers.com/desktop/
        - Answer must indicate to use brew to install Testcontainers Desktop in MacOS
        - Answer must be less than 5 sentences
        """;

@Test
void verifyStraightAgentFailsToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/straight?question={question}",
            ChatController.ChatResponse.class, question).message();

    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);

    assertThat(validate.response()).isEqualTo("no");
}

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}",
            ChatController.ChatResponse.class, question).message();

    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);

    assertThat(validate.response()).isEqualTo("yes");
}
We can even test more complex responses where the LLM should suggest a better alternative to the user’s question.
static String question = "How I can find the random port of a Testcontainer to connect to it?";

static String reference = """
        - Answer must not mention using getMappedPort() method to find the random port of a Testcontainer
        - Answer must mention that you don't need to find the random port of a Testcontainer to connect to it
        - Answer must indicate that you can use the Testcontainers Desktop app to configure fixed port
        - Answer must be less than 5 sentences
        """;

@Test
void verifyRaggedAgentSucceedToAnswerHowToDebugWithTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}",
            ChatController.ChatResponse.class, question).message();

    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);

    assertThat(validate.response()).isEqualTo("yes");
}
Validator Agent
The configuration for the Validator Agent doesn’t differ from that of other agents. It is built using the LangChain4j AI Service and a list of specific instructions:
public interface ValidatorAgent {

    @SystemMessage("""
            ### Instructions
            You are a strict validator.
            You will be provided with a question, an answer, and a reference.
            Your task is to validate whether the answer is correct for the given question, based on the reference.
            Follow these instructions:
            - Respond only 'yes', 'no' or 'unsure' and always include the reason for your response
            - Respond with 'yes' if the answer is correct
            - Respond with 'no' if the answer is incorrect
            - If you are unsure, simply respond with 'unsure'
            - Respond with 'no' if the answer is not clear or concise
            - Respond with 'no' if the answer is not based on the reference
            Your response must be a json object with the following structure:
            {
                "response": "yes",
                "reason": "The answer is correct because it is based on the reference provided."
            }

            ### Example
            Question: Is Madrid the capital of Spain?
            Answer: No, it's Barcelona.
            Reference: The capital of Spain is Madrid
            ###
            Response: {
                "response": "no",
                "reason": "The answer is incorrect because the reference states that the capital of Spain is Madrid."
            }
            """)
    @UserMessage("""
            ###
            Question: {{question}}
            ###
            Answer: {{answer}}
            ###
            Reference: {{reference}}
            ###
            """)
    ValidatorResponse validate(@V("question") String question, @V("answer") String answer, @V("reference") String reference);

    record ValidatorResponse(String response, String reason) {}
}
As you can see, I’m using Few-Shot Prompting to guide the LLM on the expected responses. I also request a JSON format for responses to facilitate parsing them into objects, and I specify that the reason for the answer must be included, to better understand the basis of its verdict.
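For context, an annotated interface like this is typically turned into a working agent through LangChain4j's AiServices. The following is a minimal sketch of such wiring, not the project's exact configuration; the bean name and the injected chat model are illustrative:

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.service.AiServices;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class ValidatorAgentConfig {

    // Builds a ValidatorAgent implementation backed by the given chat model;
    // LangChain4j generates the implementation from the annotated interface
    // and maps the JSON response onto the ValidatorResponse record.
    @Bean
    ValidatorAgent validatorAgent(ChatLanguageModel chatModel) {
        return AiServices.create(ValidatorAgent.class, chatModel);
    }
}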
Conclusion
The evolution of GenAI applications brings with it the challenge of developing testing methods that can effectively evaluate the complexity and subtlety of responses generated by advanced artificial intelligences.
The proposal to use an LLM as a Validator Agent represents a promising approach, paving the way towards a new era of software development and evaluation in the field of artificial intelligence. Over time, we hope to see more innovations that allow us to overcome the current challenges and maximize the potential of these transformative technologies.