Poro is a family of multilingual open source large language models (LLMs), with the aim of strengthening European digital sovereignty and democratizing access to LLMs. To ensure transparency and openness, and as part of the Poro Research Checkpoint program, we are today announcing new model checkpoints, as well as the next-generation models with additional languages and modalities.
- Together with the University of Turku and HPLT, SiloGen launched an initiative to build a family of open multilingual LLMs with a world-class team, access to a record amount of compute and data, and a distinctive software layer to train LLMs.
- Two months later, we are releasing the next two checkpoint milestones, covering a total of 50% of training for Poro 34B. Model evaluations show strong performance for low-resource languages, with best-in-class performance for Finnish.
- As a next step, the model family adds support for the Nordic languages, including Swedish, Norwegian, Danish and Icelandic, and we are announcing a partnership with LAION to add vision capabilities and commence the training of multimodal models.
In mid-November, we published the first three checkpoints of Poro 34B, a multilingual, open European language model with evidence of strong performance on low-resource languages like Finnish, without compromising performance in English. We’re now publishing the next two checkpoints for Poro 34B, bringing the total to 50% of training completed. After five checkpoints, the results show that Poro 34B already outperforms all existing open language models on Finnish, including FinGPT, Mistral, Llama and the 176-billion-parameter BLUUMI model, among others. (FinGPT is the first large generative Finnish language model; see Luukkonen et al., forthcoming, EMNLP.)
“I’m proud of the results we have already been able to achieve with the Poro models. Already at this stage, I believe it’s safe to say that Poro 34B is, to date, the best open Finnish language model available. It’s inspiring to see how we have been able to use some of the learnings from FinGPT and the BLUUMI 176 billion parameter model, improve on those, and now have an even better model. We expect to reach 100% of training Poro 34B in the coming weeks,” says Research Fellow Sampo Pyysalo from TurkuNLP.
Added languages and modalities
Building on the promising initial results of the Poro model family, we are excited to announce a set of new models with additional capabilities. We have commenced training a model family covering English, Finnish, Swedish, Norwegian, Danish, Icelandic and code. These models have an updated, more modern architecture and come in a variety of model sizes. This is an important step towards our aim of covering all European languages and our vision of European digital sovereignty, with AI infrastructure that European companies can benefit from.
Language models with vision
While extending support to additional European languages, we are also announcing that upcoming model generations will add vision to their capabilities. This is enabled through a partnership with LAION (Large-scale Artificial Intelligence Open Network) to build a set of multimodal models. LAION is a global non-profit organization that aims to make large-scale datasets, machine learning models and related code publicly available. It provides assets, such as the LAION-5B dataset and LAION-SAFETY, an open toolbox for NSFW and toxicity detection, for developing safe, trustworthy and reliable multimodal models. Its assets underpin, among others, the image generation tool Stable Diffusion. LAION and its collaborators have already made pivotal contributions to training, studying and open-sourcing multimodal foundation models and corresponding datasets with works like openCLIP, openFlamingo, CLAP and DataComp. This partnership will introduce vision capabilities to the Poro model family through a modular architecture that adds vision to existing models, and it opens up opportunities for additional multimodal architectures in the future.
“In line with the plan to cover all European languages, it’s a natural step to start with an extension to the Nordic languages. And it’s likewise natural to extend Poro with vision. Through a partnership with LAION, multimodal models help in expanding the potential use cases and possibilities for value creation. Models with vision capabilities will be able to interpret, summarize, and describe documents containing both text and images. Like textual data, we see an even larger potential for generative AI to consolidate large amounts of data of different modalities,” notes Peter Sarlin, Silo AI CEO and co-founder.
The collaboration with LAION brings together industry expertise and experience, strong and rigorous academic research, and an open source philosophy. This is a strong foundation for ensuring trustworthy, reliable and robust models. We hope the level of transparency enabled by our open source approach, in combination with the Poro Research Checkpoint program, will add to the trust we have been able to build with partners and clients alike.
Considerations for Use
The intended audience for Poro Research Checkpoints is academic and industry research. These checkpoints are not suitable for deployment in a production use case without further training, fine-tuning and testing.
We wish to thank the operators of the LUMI/EuroHPC supercomputer for computational resources and technical support, including AMD, HPE and CSC – the IT Center for Science, Finland. TurkuNLP researchers have received funding from the European Union’s Horizon Europe research and innovation programme High Performance Language Technologies (HPLT) under grant agreement No 101070350.