Sereyreach HIM, NiyAI Data Co. Ltd,
[email protected]
In this article, I will briefly describe how we implemented a testbed
to evaluate the readability of English output from large language models.
I will also explain how we perform the evaluation using a framework
called Inspect AI.
What do we want to do?
In today’s world, learning a language is crucial for everyone. Large
language models (LLMs) like ChatGPT, Claude, and Gemini are useful tools
that students can use to learn English or other languages, but it can be
a challenge to get them to communicate at the correct ability level for a
learner. Because of this, we built a testbed where we can evaluate
providers, prompts, and other settings to decide which are most useful
for learners.
We started with simple open-ended questions like “Explain how a computer
works” or “What is a dog?”. Then, we adjusted LLM settings such as
System Message and Temperature to observe how they perform with
different configurations. Finally, we used a library called Inspect
AI to compare how each model performed in our testbed.
This article provides a guide to creating your own testbed. We’ll start
by explaining the AI configurations and finish with a brief overview of
the Inspect AI library.
What kind of inputs do we need?
To effectively evaluate the LLM’s performance, we need to consider
several things:
Questions
To assess whether LLMs work as expected, we need question data. The
questions should be open-ended, such as “Explain how a computer works”
or “What is a dog?”. We avoid yes/no questions, as they do not allow us
to properly evaluate the model’s responses.
Models
Having multiple models is essential for our testbed, since we want to
compare their performance. However, if you’re building a testbed for a
specific purpose, it’s important to carefully consider factors such as
model architecture and parameter count to ensure your evaluation is
accurate and meaningful.
System Message
The System Message is an instruction that we inject so that the LLM
replies in the way we need.
For example, if we set our system message to “You are a helpful
assistant.”, the LLM will respond differently than if we set it to “You
are a CEFR B1 English teacher. You should speak using simple language
such that a student of CEFR level B1 can understand.”
In our real testbed, we provide various system messages that are crafted
towards the same outcome, that of an English tutor in a lesson. The
challenge here is to vary the wording of the messages such that the
overall meaning of the responses remains the same, and only the
difficulty level changes.
Temperature
The LLM temperature serves as a critical parameter influencing the
balance between predictability and creativity in generated text.
In simpler terms, temperature adjusts how the LLM picks its next word.
A lower temperature makes the LLM choose the highest-probability next
word, while a higher temperature makes the LLM more creative, selecting
from a wider range of probabilities.
For example, if we say “The sky is”, the next word could be:
- Blue: 0.5 (50% chance)
- The: 0.2 (20% chance)
- An: 0.1 (10% chance)
With a low temperature, the LLM will choose the word with the
highest probability, so it will likely choose “Blue”.
With a high temperature, the model gives more weight to less
probable options, increasing the chances that it might choose “The”
or even “An”, making the output more unexpected.
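To make this concrete, here is a minimal sketch (Python with NumPy, not part of our testbed) of how temperature rescales the toy probabilities above before the next word is sampled:

import numpy as np

def apply_temperature(probs: list[float], temperature: float) -> np.ndarray:
    # Rescale a next-word probability distribution with a temperature value.
    # Dividing the log-probabilities by the temperature and re-applying softmax:
    # a low temperature sharpens the distribution, a high one flattens it.
    logits = np.log(np.array(probs))
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

words = ["Blue", "The", "An"]
probs = [0.5, 0.2, 0.1]
probs = [p / sum(probs) for p in probs]  # renormalize the illustrative values

for t in (0.2, 1.0, 1.8):
    rescaled = apply_temperature(probs, t)
    print(t, {w: round(float(p), 2) for w, p in zip(words, rescaled)})

At a temperature of 1.0 the distribution is unchanged; at 0.2 nearly all of the probability mass moves to “Blue”, while at 1.8 the less likely words gain noticeably more probability.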
Our hypothesis to be tested is that lower temperatures are likely to
correlate with a lower (easier) reading level, since they will
statistically prefer more commonly used words, as well as ones that are
most expected within the context of the sentence.
In our case, we define the settings we need for our tests in a .yaml
file like the one below:
test_configs:
  - models:
      - openai/gpt-4o-mini
      - anthropic/claude-3-haiku-20240307
      - ...
  - system_messages:
      - You are an assistant speaking to a second-language learner. You will
        use only simple language. You are a teacher of English as a second
        language, having a conversation with a student. You will use easy to
        understand language, with short sentences and simple words.
      - ...
  - temperatures:
      - 0.01
      - 0.1
      - 0.2
      - 0.4
      - 0.6
      - 0.8
      - 1.0
      - 1.4
      - 1.8
......
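As a rough illustration of how such a file could be read, here is a short Python sketch; the file name test_config.yaml is an assumption, and PyYAML is just one convenient parser rather than something our testbed requires:

import yaml  # PyYAML, assumed here purely for illustration

# Hypothetical file name for the config shown above.
with open("test_config.yaml") as f:
    raw = yaml.safe_load(f)

# test_configs is a list of single-key mappings; merge them into one dict
# so each setting can be looked up by name.
settings = {key: value for item in raw["test_configs"] for key, value in item.items()}

print(settings["models"])
print(settings["temperatures"])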
LLM evaluation tool
To run the testbed, we can either write our own code or use existing
libraries. In our case, I used the Inspect AI library, which includes
all the tools I needed. Below, I’ll briefly explain the tools it offers
for our testbed, but I recommend checking out the full documentation if
you want to learn more.
Scorers
Scorers evaluate the output of LLMs. For example, if we ask, “What is a
dog?” and the LLM responds, “A dog is a domesticated mammal from the
species Canis familiaris…”, scorers will assign a score to this
answer based on specific metrics. In our project, we use many types of
scorers.
Here are some examples of the scorers we used:
- CEFR Scorer: Categorizes words into levels such as A1, A2, B1, B2,
etc., based on a data set we assembled in-house.
- Flesch Reading Ease: Scores text readability between 1 and 100, with
100 being the easiest to read.
- Gunning Fog Index: Calculates readability by analyzing sentence
length and word complexity.
Here’s an example of how we create our custom scorer method:
from inspect_ai.scorer import Score, Scorer, Target, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[])
def readabilityScore(scorerMethod) -> Scorer:
    async def score(state: TaskState, target: Target) -> Score:
        # Compare the state / model output with the target:
        # pass the output text to our scorerMethod function and
        # wrap the value it returns in a Score object.
        scoreValue = scorerMethod(state.output.completion)
        return Score(value=scoreValue)

    return score

def scorerMethod(text) -> float:
    ...
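The scorerMethod function is where the actual readability metric lives. As one possible sketch (the textstat package is our illustration here, not something the testbed prescribes), metrics like Flesch Reading Ease or the Gunning Fog Index can simply wrap an off-the-shelf implementation, while the CEFR scorer would instead look each word up in our in-house level data set:

import textstat  # third-party readability library, assumed for this example

def fleschReadingEase(text: str) -> float:
    # Higher scores indicate easier-to-read text.
    return textstat.flesch_reading_ease(text)

def gunningFog(text: str) -> float:
    # Estimates the years of formal education needed to understand the text.
    return textstat.gunning_fog(text)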
Because we want the choice of scorers to be more flexible, we also list
them in the .yaml file:

scorers:
  - scorerMethod
  - ....
The metrics parameter in the @scorer decorator can also take built-in
metrics such as accuracy() or mean(). In our case, we analyze the scores
in Excel instead, so we leave it empty.
Models Interface
The Inspect AI library has built-in support for many models, including
local models (Ollama, Hugging Face) and remote models (ChatGPT, Claude,
…). This is easier than writing custom code to interface with each of
them. Adjusting LLM configurations, such as temperature, is also simple:
from inspect_ai.model import GenerateConfig

config = GenerateConfig()
config.temperature = 0.5       # sampling temperature for this run
config.max_connections = 20    # maximum concurrent connections to the provider
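For instance, the same configuration can be attached when requesting a model handle (a hypothetical sketch with an example model name):

from inspect_ai.model import GenerateConfig, get_model

config = GenerateConfig(temperature=0.5, max_connections=20)
# Example model name; the config travels with the returned model handle.
model = get_model("openai/gpt-4o-mini", config=config)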
Parallelism
The ability to evaluate multiple models in parallel, or to run several
tasks on one model simultaneously, reduces test time.
eval("test.py", model=[
"openai/gpt-4-turbo",
"anthropic/claude-3-opus-20240229",
"google/gemini-1.5-pro"
])
Now let’s dive deeper into how we implemented this.
from inspect_ai import Task, task
from inspect_ai.model import GenerateConfig
from inspect_ai.scorer import Scorer
from inspect_ai.solver import generate, system_message

@task
def inspect_cefr(
    input_messages: list[str],
    target_messages: list[str],
    system_message_str: str,
    temperature: float,
    scorers: list[Scorer],
):
    # Build an Inspect AI dataset from our question and target lists.
    dataset = create_inspect_ai_dataset(input_messages, target_messages)

    config = GenerateConfig()
    config.temperature = temperature
    config.max_connections = 20

    return Task(
        dataset=dataset,
        plan=[
            system_message(system_message_str),
            generate(),
        ],
        scorer=scorers,
        config=config,
        metrics=[],
    )
def main():
    ...
    eval(
        tasks,
        model=models,
        log_dir=f"logs/{current_time}",
    )
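The create_inspect_ai_dataset() helper above is our own code rather than part of the library; a hypothetical sketch of it, using Inspect AI’s dataset types, might look like this:

from inspect_ai.dataset import MemoryDataset, Sample

def create_inspect_ai_dataset(input_messages: list[str], target_messages: list[str]):
    # Pair each open-ended question with its (possibly empty) target text.
    samples = [
        Sample(input=question, target=target)
        for question, target in zip(input_messages, target_messages)
    ]
    return MemoryDataset(samples)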
In this implementation, we start by reading our test configuration from
a YAML file. Then, we pass the configurations into the eval() function
provided by the Inspect AI library. The tasks variable, which we pass
into eval(), is an array containing Task objects generated by the
inspect_cefr() function.
The inspect_cefr()
function creates a task for each configuration,
model, and input message set we want to evaluate. It handles configuring
the dataset, system messages, temperature settings, and scorers for each
task.
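As an illustration (the variable values here are placeholders, not our actual data or code), building the tasks array amounts to calling inspect_cefr() for every combination of settings taken from the config file:

import itertools

# Illustrative placeholder data; in the real testbed these come from the
# YAML config and the question data set.
questions = ["Explain how a computer works", "What is a dog?"]
targets = ["" for _ in questions]  # open-ended questions have no fixed target
system_messages = ["You are a CEFR B1 English teacher. Use simple language."]
temperatures = [0.2, 0.6, 1.0]

tasks = [
    inspect_cefr(
        input_messages=questions,
        target_messages=targets,
        system_message_str=sys_msg,
        temperature=temp,
        scorers=[readabilityScore(scorerMethod)],
    )
    for sys_msg, temp in itertools.product(system_messages, temperatures)
]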
Once everything is set, the results are generated and saved in the logs
directory. To view the results in a browser, you can simply run the
command $ inspect view, which launches a web interface for inspecting
the output.
Conclusion
In this article, we walked through the process of implementing a testbed
using the Inspect AI library. While we focused on the setup and
methodology, we didn’t dive into the data collection or analysis
processes, as this article serves more as an abstract guide to explain
how our testbed is structured. The aim is to provide a beginner-friendly
roadmap that you can follow to create a testbed for your own use cases.