One challenge we face at NiyAI is persuading our models to keep their language straight when giving users feedback. When outputting Khmer text, models have an unfortunate tendency to start emitting English or even Thai mid-sentence. Discussing text from one language in the medium of another sets up even more ways for an LLM to confuse itself: many of our first attempts resulted in the model attempting to translate the text instead of explaining it.

Marchisio et al. at Cohere researched language confusion in multilingual models, and the most useful lessons we learned from their results were:

Lower Temperature

“Temperature” is a sampling setting in LLMs which controls how much randomness goes into choosing the next token: the model’s raw scores (logits) are divided by the temperature before being converted into probabilities, so low temperatures concentrate probability on the top candidates while high temperatures spread it out.
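Here is a minimal sketch of that mechanism; the three-token vocabulary and the logit values are invented purely for illustration:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float) -> int:
    """Sample a token index from raw logits scaled by temperature."""
    scaled = logits / temperature  # T < 1 sharpens the distribution, T > 1 flattens it
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy next-token candidates: a Khmer token narrowly ahead of English and Thai ones.
logits = np.array([2.0, 1.5, 1.2])
for temperature in (0.1, 1.0, 2.0):
    picks = [sample_next_token(logits, temperature) for _ in range(10_000)]
    print(temperature, np.bincount(picks, minlength=3) / 10_000)
```

At a temperature of 0.1 the leading token wins essentially every draw; at 2.0 the two wrong-language tokens take a substantial share between them.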

Intuitively, a higher temperature increases the chance of the model going off-script by selecting a token in the wrong language. And because LLMs predict each token conditioned on everything generated so far, selecting the wrong language once has a cumulative knock-on effect on every subsequent token. It is therefore wise to keep the temperature setting to a minimum. By similar reasoning, judiciously lowering the repetition penalty may also improve coherence: a strong penalty discourages the model from reusing tokens, which can push it towards off-language alternatives.
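As a sketch of how this looks with the Hugging Face transformers library: the model id below is a placeholder, and the specific values are starting points to tune for your own setup rather than recommendations from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; substitute whichever multilingual model you are evaluating.
model_id = "your-org/your-multilingual-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("...", return_tensors="pt")  # your Khmer prompt here
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,          # low temperature to reduce off-language sampling
    repetition_penalty=1.05,  # close to the neutral 1.0; large values can force odd tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```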

Few-shot Prompting

The Cohere paper notes, “when prompted with an instruction in a non-English language, for instance, Command R Base often translates it instead of answering.” Thus, providing the prompt in the target language is not enough.

This can be improved by giving multiple examples (“shots”) of inputs and expected responses in the target language. 5-shot appears to be better than 1-shot, though this must be balanced against keeping prompts as concise as possible; a sketch of how the shots might be assembled follows.
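The placeholder strings below stand in for real Khmer examples, and the message format is the common OpenAI-style chat convention (field names vary by provider):

```python
# Five (input, expected response) pairs, both sides written in the target language.
few_shot_examples = [
    ("first example input in Khmer", "first example response in Khmer"),
    # ... four more pairs for a 5-shot prompt
]

messages = []
for user_text, assistant_text in few_shot_examples:
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})

# The real query goes last, after the demonstrations.
messages.append({"role": "user", "content": "the actual user input, in Khmer"})
```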

Location of Language Description

An instruction like “Reply in <language>” should appear in isolation at the end of the prompt. Folding it into a longer sentence (e.g. “Give instructions in <language> on how to build a house”) has a lower success rate in maintaining the correct language.

This is also likely true of any kind of important directive (e.g. requests for CSV format).
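For example, contrast the two placements below; the second, with the directive isolated on the final line, is the form that held the target language more reliably:

```python
# Weaker: the language instruction is folded into a longer sentence.
prompt_inline = "Give instructions in Khmer on how to build a house."

# Stronger: the task comes first and the language directive stands alone at the end.
prompt_isolated = (
    "Give instructions on how to build a house.\n"
    "\n"
    "Reply in Khmer."
)
```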

Choice of Model

Models that perform better on classic benchmarks don’t always perform better at maintaining another language in their output. For each of the Llama, Mixtral, Command and GPT families, there were languages where the older or smaller models outperformed their more advanced counterparts. The pattern was not universal, however: being older or smaller does not imply better coherence. It is nevertheless worth keeping in mind when choosing which models to benchmark for a less widely spoken language.

For all the advice given above, there may be models out there for which the opposite is true. The only way to know is to construct your own evaluation framework for your use case and experiment with different settings.
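As a starting point, here is a minimal sketch of such a harness. It assumes fastText’s freely downloadable lid.176 language-identification model (which covers Khmer as `km`), and `generate` stands in for whatever client you use to call the model under test:

```python
import fasttext

# Download lid.176.bin from https://fasttext.cc/docs/en/language-identification.html
lid = fasttext.load_model("lid.176.bin")

def in_target_language(text: str, target: str = "km", threshold: float = 0.5) -> bool:
    """Check that a generation is identified as the target language."""
    labels, probs = lid.predict(text.replace("\n", " "))  # predict() rejects newlines
    return labels[0] == f"__label__{target}" and probs[0] >= threshold

def language_pass_rate(prompts, generate, target: str = "km") -> float:
    """Fraction of generations that stay in the target language."""
    outputs = [generate(prompt) for prompt in prompts]
    return sum(in_target_language(output, target) for output in outputs) / len(outputs)
```

Rerun it as you vary temperature, shot count and instruction placement, and the numbers will tell you which of the lessons above hold for your model. Let us know how you get on!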