SiloGen today announces the release of the first model checkpoints in a family of multilingual open source large language models (LLMs) covering all official European languages and code.
- At the end of August, SiloGen, together with the University of Turku and HPLT, launched an initiative to build open multilingual LLMs, with the aim of ensuring European digital sovereignty and democratizing access to LLMs.
- The unique open source initiative combines a world-class team, access to a record amount of compute on Europe’s most powerful supercomputer LUMI, a record amount of data, and a distinctive software layer to train LLMs.
- Two months after initiating the training efforts for a family of models, we are excited to release the first checkpoint milestones for Poro 34B.
Named ‘Poro’ after the Finnish word for reindeer, this new 34 billion parameter LLM for English, Finnish and code is an early look at what is in store from our multilingual model family. Future Poro releases will expand support to other European languages and add capabilities such as an updated model architecture, an expanded context window and additional modalities. As one of the first LLM projects to do so, we will also give external researchers unprecedented access to the model training. In a program called Poro Research Checkpoints, we will release a series of checkpoints for the model during the training process. Sharing these checkpoints will give researchers and practitioners who do not have the resources to train their own large models from scratch visibility into language model training.
Poro’s advanced capabilities with European languages like Finnish stem from how it addresses the core challenge for low-resource languages: training LLMs requires enormous amounts of data, but for low-resource languages like Finnish, sufficient data is simply not available. Poro addresses this by training low-resource languages together with high-resource languages. This takes advantage of a cross-lingual signal that lets the model achieve higher performance in the low-resource language than a monolingual model would, and has the further advantage of teaching the model basic translation capability.
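One common way to set up this kind of cross-lingual training is to draw each training document from a weighted mix of language sources. The sketch below illustrates that idea only. The function name `sample_sources` and the mixture weights are hypothetical placeholders, not Poro's actual data proportions or training code.

```python
import random

def sample_sources(weights, n, seed=0):
    """Pick the source corpus for each of n training documents.

    weights maps a source name to its relative sampling weight.
    The seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Hypothetical weights, chosen only for illustration (assumption).
mixture = {"english": 0.6, "finnish": 0.2, "code": 0.2}
batch_sources = sample_sources(mixture, n=10_000)
counts = {name: batch_sources.count(name) for name in mixture}
```

With a mix like this, every batch contains both high-resource and low-resource text, so the model sees the two side by side throughout training.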
After 30% of training, Poro already outperforms existing base models (e.g. FinGPT, Llama, Mistral) on the Finnish language benchmark FIN-bench, and in light of current experiments we expect similar results as we expand to other languages. This is achieved without compromising performance in English, where Poro is on course to match or exceed comparable open English-oriented models (e.g. Llama and Mistral).
Poro is the result of a collaboration between Silo AI’s generative AI arm SiloGen and the University of Turku’s TurkuNLP Group and HPLT project, bringing together cutting-edge research and industry expertise.
Features of Poro 34B
Below is a summary of key features of Poro 34B. When training completes, we will additionally release instruction-tuned and chat-tuned variants of the Poro 34B base model. For details on model architecture, data and other technical information, please refer to the official model card.
- Poro Research Checkpoints: Checkpoints for the model are released throughout the training process, providing external researchers with unprecedented access to investigate the model training process.
- Model architecture: Poro 34B has 34.2 billion parameters and uses a BLOOM architecture with ALiBi embeddings to allow for context window extrapolation. While the architecture of the initial model has been kept simple, future models in progress will support additional capabilities, such as flash attention, rotary embeddings and grouped query attention.
- Multilingual capabilities: Poro is designed to process English and Finnish, and has proficiency with a variety of programming languages. Additionally, it can perform basic translation between English and Finnish.
- Open source: Poro is freely available under the Apache 2.0 License, which permits both commercial and research use.
- Dataset: The model is trained with a dataset of 1 trillion tokens, with English, Finnish and a variety of programming languages represented.
- Training details: Poro is trained using 512 AMD MI250X GPUs on the LUMI supercomputer in Finland.
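For readers unfamiliar with ALiBi, mentioned under Model architecture above, the following is a minimal sketch of the general technique (Attention with Linear Biases), not Poro's training code. Instead of adding position embeddings to the input, ALiBi adds a penalty to each attention score that grows linearly with the distance between query and key, using a fixed slope per attention head. Because the penalty depends only on distance, it can be computed for sequences longer than those seen in training. The function names, the head count and the sequence length are illustrative assumptions, and the causal mask is folded into the bias for brevity.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    This sketch only handles power-of-two head counts.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("num_heads must be a power of two in this sketch")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], to be added to attention scores before softmax.

    For key j at or before query i the bias is -slope_h * (i - j).
    Future positions (j > i) get -inf, which acts as the causal mask.
    """
    neg_inf = float("-inf")
    return [
        [
            [-m * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]

# Illustrative sizes only (assumption, not taken from the model card).
slopes = alibi_slopes(8)
bias = alibi_bias(num_heads=8, seq_len=4)
```

Heads with steep slopes focus on nearby tokens, while heads with shallow slopes can attend further back.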
Considerations for Use
The intended audience for Poro Research Checkpoints is academic and industry research. These checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing.
We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.