Earlier this year, Silo AI, in collaboration with TurkuNLP and HPLT, released Poro 34B, a best-in-class open foundation model for Finnish with strong capabilities in English and code. Poro 34B demonstrates a distinctive approach to building performant large language models (LLMs) for low-resource languages, and marks an important milestone on the journey towards providing models for all official EU languages. While base models are versatile, they require fine-tuning or special prompting before deployment in production. A further challenge for low-resource languages is that most publicly available instruction datasets are written in English. To chat-tune a model to follow instructions in a low-resource language, we put forward an approach that uses the base model's own translation capability to generate instruction-tuning data. The result is a model that produces high-quality responses in the same language in which the prompts are written.
Foundation models are a cornerstone of the AI infrastructure needed to build AI-powered products, services and businesses. Poro 34B is a best-in-class open foundation model for Finnish, performs on par with English LLMs such as Llama 33B, and outperforms Falcon 40B and MPT 30B. In English-Finnish translation, it outperforms not only dedicated open-source translation models but also Google Translate, and scores on par with GPT-4. Poro 34B and similar base models are versatile, but require fine-tuning or special prompting when deployed into production.
Poro 34B chat can answer questions, follow instructions written in plain language, and write code. As such, the model is an optimal choice for chat-based use cases, such as customer support, assistants, co-pilots and data processing tools. Poro 34B chat is bilingual and has exceptional translation capabilities, even outperforming state-of-the-art models dedicated to translation.
Instruction tuning models for low-resource languages
In contrast to most other available open-source models, Poro 34B chat has been instruction tuned in a low-resource language, enabling it to follow instructions in Finnish.
Models are only as good as their training data. A common challenge for models designed for low-resource languages is that most of the instruction datasets that are publicly available are written in English. With datasets in English only, a model learns to provide responses in English, even if prompts are written in a different language.
A performant chat-tuned model requires datasets that demonstrate instruction following. These instructions can be commands such as "summarize this document", mathematical problems, or anything else a model may be asked to do. A model trained on such datasets learns to generalize across a variety of tasks, including many that are not explicitly represented in the training data.
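For illustration, a single training example in such a dataset might look like the following sketch. The field names are hypothetical and are not the exact schema used for Poro 34B chat.

```python
# A hypothetical instruction-following training record. The field names
# ("instruction", "input", "output") are illustrative only; they are not
# the exact schema used for Poro 34B chat.
example = {
    "instruction": "Summarize this document in three sentences.",
    "input": "<the document to be summarized>",
    "output": "<a high-quality three-sentence summary>",
}
```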
To chat-tune a model to follow instructions in a low-resource language, we put forward an approach that uses the base model's translation capability to generate instruction-tuning data. In other words, the base model itself functions as a translator that turns English instructions into instructions in the low-resource language. For chat capability in both the high- and low-resource language, instruction tuning on a combined set of instructions in both languages yields a model that produces high-quality responses in the same language in which the prompts are written.
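As a minimal sketch of this idea, the snippet below uses the Poro 34B base model to translate an English instruction into Finnish with the transformers library. The zero-shot prompt template is our assumption, not the exact format used in practice; a production pipeline would likely use few-shot examples and quality filtering.

```python
# Sketch: using the Poro 34B base model as a translator to turn English
# instructions into Finnish ones. The prompt template is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B"  # base model repo on HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def translate_instruction(english_text: str) -> str:
    # Zero-shot translation prompt; real pipelines would add few-shot
    # examples and filter out low-quality translations.
    prompt = f"Translate into Finnish:\n{english_text}\nFinnish:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return completion.strip()

print(translate_instruction("Summarize this document in three sentences."))
```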
In line with this approach, we use the Poro 34B base model to generate Finnish instructions, so that Poro 34B chat learns to follow instructions in Finnish. For capability in both English and Finnish, a combined set of instructions in both languages is then used to fine-tune the model. This ensures that users get responses in the same language in which their prompts are written.
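A minimal sketch of the combination step, assuming the English and translated Finnish instructions are stored as JSONL files (the file names are placeholders):

```python
# Sketch: merging English and machine-translated Finnish instruction data
# into one fine-tuning set. File names are placeholders.
from datasets import load_dataset, concatenate_datasets

english = load_dataset("json", data_files="instructions_en.jsonl")["train"]
finnish = load_dataset("json", data_files="instructions_fi.jsonl")["train"]

# Interleave the two languages so every training batch sees both.
combined = concatenate_datasets([english, finnish]).shuffle(seed=42)
combined.to_json("instructions_combined.jsonl")
```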
Download on HuggingFace and share your results
Now that the Poro 34B chat model is publicly available for download on HuggingFace, we would love to hear your feedback and see how you put the model to use. We invite you to download the model and tag us on social media to share your results, or contact us at [email protected].
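For a quick start, here is a minimal usage sketch with the transformers library. We assume the repo id LumiOpen/Poro-34B-chat and that the tokenizer ships a chat template; check the model card on HuggingFace for the exact usage.

```python
# Minimal sketch: loading Poro 34B chat and asking a question in Finnish.
# The repo id and chat-template usage are assumptions; see the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Mikä on Suomen pääkaupunki?"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```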
For more on how to build specialized sovereign LLMs, we encourage you to have a look at our SiloGen Platform.
Help develop chat models for a low-resource language
To create the best chat models, fine-tuning on examples that reflect the target language's culture and customs is essential, as translations alone often retain the nuances of the source language. If you know Finnish, you can help improve Finnish chat models by joining the Avoin Avustaja crowdsourcing effort, a Finnish version of Open Assistant managed by TurkuNLP: https://avoin-avustaja.fi/