Europe's open language model Poro: A milestone for European AI and low-resource languages

Together with the University of Turku and HPLT, Silo AI, the largest private AI lab in Europe, has reached a significant milestone with the successful completion of training the Poro model. This marks an important step for SiloGen, the company's generative AI arm, and its efforts to strengthen European digital sovereignty and democratize access to large language models (LLMs) for all European languages. The model demonstrates the successful application of a novel method for training LLMs for low-resource languages.

Silo AI and TurkuNLP are building a family of multilingual open source LLMs, with the aim of strengthening European digital sovereignty and democratizing access to LLMs. The development of base models aligned with European values is crucial to this effort, ensuring they are built on data and information accurately representing the diverse languages, citizens, organizations and cultural landscape of the European Union. This approach not only aligns with European values, but also allows for sovereignty over how downstream applications are built and value is created.

Proven approach to build performant LLMs for low-resource languages

The completion of Poro's training serves as a proof point for an innovative approach to developing AI models for languages with scarce data resources. Poro outperforms all existing open language models in Finnish, including FinGPT, Mistral, Llama, and the 176-billion-parameter BLUUMI model, among others.

This success is attributed to pairing the low-resource Finnish language with high-resource languages. The team has worked on determining optimal data reuse frequencies for low-resource languages during training and incorporated translated paired texts between English and Finnish. This strategy relies on a cross-lingual signal to enhance the model's understanding of the connections between languages, proving crucial in achieving superior performance for low-resource languages, without compromising performance in English.
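The strategy described above can be sketched as a data-mixing step: reuse the scarce Finnish data up to a chosen cap and concatenate translated English-Finnish pairs into single training documents to supply the cross-lingual signal. Everything in this sketch (function name, repeat cap, pair formatting) is a hypothetical illustration, not the actual Poro training pipeline:

```python
import random

def build_epoch(english_docs, finnish_docs, translation_pairs, finnish_repeats=4):
    """Illustrative sketch of mixing high- and low-resource data.

    Hypothetical example only: the real recipe's repeat frequencies and
    pair formatting are described in the Poro model card, not here.
    """
    mixed = list(english_docs)
    # Reuse the scarce Finnish data several times per pass over the English
    # data, capped to limit overfitting on repeated text.
    mixed += list(finnish_docs) * finnish_repeats
    # Translated English-Finnish pairs provide a cross-lingual signal:
    # join each pair into one training document so the model sees both
    # languages expressing the same content.
    mixed += [f"{en}\n{fi}" for en, fi in translation_pairs]
    random.shuffle(mixed)
    return mixed
```

The repeat cap is the tunable knob the team refers to when determining optimal data reuse frequencies; too few repeats under-uses the Finnish data, too many degrades the model through memorization.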

The completion of Poro exemplifies Silo AI's commitment to advancing AI models for low-resource languages. Releasing Poro as an open-source model facilitates widespread access and collaborative improvement, particularly for underrepresented European languages. This approach enriches the AI community, offering a valuable resource for research and development, reflecting a deliberate effort to enhance linguistic diversity in AI applications.

The completion of Poro is the first step in SiloGen’s efforts to train state-of-the-art LLMs for all official EU languages.

Features of Poro 34B

Below is a summary of key features of Poro 34B. For transparency with respect to model architecture, data and other technical information, please refer to the official model card.

  • Poro Research Checkpoints: Checkpoints for the model are released throughout the training process, providing external researchers with unprecedented access to investigate the model training process.
  • Model architecture: Poro 34B has 34.2 billion parameters and uses a BLOOM architecture with ALiBi embeddings to allow for context window extrapolation. While the architecture of this initial model has been kept simple, future models in progress will support additional capabilities, such as flash attention, rotary embeddings and grouped-query attention.
  • Multilingual capabilities: Poro is designed to process English and Finnish and has proficiency with a variety of programming languages. Additionally, it can perform translation between English and Finnish.
  • Open source: Poro is freely available under the Apache 2.0 license, permitting both commercial and research use.
  • Dataset: The model is trained with a dataset of 1 trillion tokens, with English, Finnish and a variety of programming languages represented.
  • Training details: Poro is trained using 512 AMD MI250X GPUs on the LUMI supercomputer in Finland.

Poro 34B has state-of-the-art performance on the Finnish benchmark FIN-bench, top performance in its class on code benchmarks (MBPP and HumanEval, pass@10), and is competitive with other models in its class on common English language benchmarks (Huggingface-6; arc_challenge, hellaswag, mmlu, truthfulqa, winogrande, gsm8k). When combined, Poro's overall performance in Finnish, programming languages and English exceeds other comparable open source models.
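ALiBi, the position method mentioned in the features above, replaces learned position embeddings with a linear, distance-proportional penalty added to attention scores, which is what enables extrapolation beyond the training context window. A minimal NumPy sketch of the bias computation, using the head-slope formula from the ALiBi paper (assumes the head count is a power of two):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Geometric sequence of per-head slopes: start = ratio = 2^(-8/n_heads).
    # For 8 heads this gives 1/2, 1/4, ..., 1/256.
    start = 2 ** (-8 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    # Bias added to attention logits before softmax: zero on the diagonal,
    # increasingly negative as a key token lies further in the past.
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]      # (j - i): negative for past keys
    dist = np.tril(dist)                    # keep causal (lower-triangular) part
    slopes = alibi_slopes(n_heads)
    return slopes[:, None, None] * dist     # shape (n_heads, seq_len, seq_len)
```

Because the penalty is a fixed linear function of distance rather than a learned table, the same formula applies unchanged to sequence lengths never seen in training, which is the context-window extrapolation the bullet refers to.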

More information

A family of European open multilingual LLMs

  • Together with the University of Turku and HPLT, SiloGen launched an initiative to build a family of open multilingual LLMs with a world-class team, access to a record amount of compute and data, and a distinctive software layer to train LLMs.
  • In November, we published the first three checkpoints of Poro 34B, a multilingual, open European language model demonstrating strong performance on low-resource languages like Finnish without compromising performance in English.
  • Later, we released the next two checkpoint milestones, covering a total of 50% of the training of Poro 34B. Model evaluations confirm strong performance for low-resource languages, with best-in-class performance in Finnish.
  • Now, the model family is adding support for the Nordic languages, including Swedish, Norwegian, Danish and Icelandic, and has announced a partnership with LAION, adding vision capabilities and commencing the training of multimodal models.
  • Next, the expansion continues with the inclusion of all other official EU languages, broadening the linguistic scope and reinforcing its mission to democratize access to LLMs across the entire European Union.

Considerations for Use

The final model has been trained as a robust base model that can be fine-tuned for specific purposes. The intended audience for Poro Research Checkpoints is academic and industry research. The checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing.

Acknowledgments

We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.

About

Silo AI

Silo AI is Europe's largest private AI lab on a mission to ensure Europe has a flagship AI company. We're a trusted AI partner that brings competitive advantage to product R&D. We build AI-driven solutions and products to enable smart devices, autonomous vehicles, Industry 4.0, and smart cities. Silo AI provides its customers unique access to world-class AI models and expertise, as well as the Silo OS infrastructure to speed up AI development and deployment. With SiloGen, Silo AI is currently building market-leading open source LLMs, with the intent to ensure European digital sovereignty and democratize access to LLMs.
www.silo.ai

TurkuNLP

The TurkuNLP Group is a group of researchers at the University of Turku, with a research focus on various aspects of natural language processing, language technology and digital linguistics. TurkuNLP has contributed to a large number of open source NLP resources, such as FinBERT, WikiBERT, FinGPT, Turku Dependency Treebank, Universal Dependencies, Turku Neural Parsing Pipeline, Large internet corpora, Turku Paraphrase Corpus, Turku Sentiment Corpus, Wikidata normalization, TurkuONE etc. The University of Turku is an international academic community of 25,000 students and staff and was ranked among the 301–400 best universities in the 2023 Shanghai Ranking.

Want to discuss how Silo AI could help your organization?

Get in touch with our AI experts.
Peter Sarlin, PhD
CEO & Co-Founder
peter.sarlin@silo.ai
+358 40 572 7670
