More than double the number of languages covered by previous open-source AI models to increase coverage for underrepresented communities

Today, the research team at Cohere For AI (C4AI), Cohere’s non-profit research lab, is excited to announce Aya, a new state-of-the-art, open-source, massively multilingual, generative large language model (LLM) for research. Aya covers 101 languages, more than double the number covered by existing open-source models. It helps researchers unlock the potential of LLMs for dozens of languages and cultures largely ignored by the most advanced models on the market today.

We are open-sourcing both the Aya model and the largest multilingual instruction fine-tuning dataset to date: 513 million prompts and completions covering 114 languages. This data collection includes rare annotations from native and fluent speakers all around the world, helping AI technology serve a broad global audience that has had limited access to date.

Closes the Gap in Languages and Cultural Relevance

Aya is part of a paradigm shift in how the ML community approaches massively multilingual AI research, representing not just technical progress, but also a change in how, where, and by whom research is done.

As LLMs, and AI generally, have changed the global technological landscape, many communities across the world have been left unsupported due to the language limitations of existing models. This gap limits the usefulness of generative AI for a global audience, and it could further widen disparities left by previous waves of technological development. Because most models are trained primarily on English and one or two dozen other languages, they tend to reflect inherent cultural bias.

We started the Aya project to address this gap, bringing together over 3,000 independent researchers from 119 countries.

Figure: Geographical distribution of Aya collaborators

Significantly Outperforms Existing Open-Source Multilingual Models

The research team behind Aya was able to substantially improve performance for underserved languages, demonstrating superior capabilities in complex tasks, such as natural language understanding, summarization, and translation, across a wide linguistic spectrum.

We benchmarked Aya against available open-source, massively multilingual models. It surpasses the best of them, such as mT0 and Bloomz, on benchmark tests by a wide margin. In head-to-head comparisons against other leading open-source models, Aya consistently scored 75% in human evaluations and 80-90% across the board in simulated win rates.
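For readers unfamiliar with the metric, a win rate can be read as the share of head-to-head comparisons in which one model's response is preferred. The sketch below is only an illustration of that arithmetic: the function name `win_rate`, the label strings, and the choice to exclude ties from the denominator are assumptions of this example, and the judgments are made-up toy values, not data from the Aya evaluation.

```python
from collections import Counter


def win_rate(judgments, model="aya"):
    """Return the fraction of decided comparisons in which `model` was preferred.

    Each judgment is the name of the preferred model, or "tie".
    Excluding ties from the denominator is an assumption of this sketch;
    other conventions (for example, counting a tie as half a win) also exist.
    """
    counts = Counter(judgments)
    decided = sum(n for name, n in counts.items() if name != "tie")
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[model] / decided


# Hypothetical toy judgments for illustration only.
toy_judgments = ["aya", "aya", "baseline", "aya", "tie", "aya"]
print(f"win rate: {win_rate(toy_judgments):.0%}")
```

With these toy labels, four of the five decided comparisons favor the first model, so the function returns 0.8.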

Aya also expands coverage to more than 50 previously unserved languages, including Somali, Uzbek, and more. While proprietary models do an excellent job serving a range of the most commonly spoken languages in the world, Aya helps to provide researchers with an unprecedented open-source model for dozens of underrepresented languages.

Figure: Head-to-head comparison of preferred model responses

Trained on the Most Extensive Multilingual Dataset to Date

We are releasing the Aya Collection, which consists of 513 million prompts and completions covering 114 languages. Fluent speakers around the world built this collection by writing templates for selected datasets and augmenting a carefully curated list of datasets. It also includes the Aya Dataset, which is the most extensive human-annotated, multilingual, instruction fine-tuning dataset to date. It contains approximately 204,000 rare human-curated annotations by fluent speakers in 67 languages, ensuring robust and diverse linguistic coverage. This offers a large-scale repository of high-quality language data for developers and researchers.

Many languages in this collection had no representation in instruction-style datasets before. The fully permissive and open-sourced dataset includes a wide spectrum of language examples, encompassing a variety of dialects and original contributions that authentically reflect organic, natural, and informal language use. This makes it an invaluable resource for multifaceted language research and linguistic preservation efforts.

How to Get Involved

We are releasing both the Aya model and the Aya datasets under a fully permissive Apache 2.0 license, with the goal of broadening access to multilingual progress. Under this license, academics, civil institutions, and small companies can use the Aya model and data for broader impact.

Aya will be a foundation for additional open science projects, and we expect to continue to improve Aya’s capabilities. To join this open science initiative and make sure your language is represented, go to the Aya Project website to sign up and get started. You can also try the Aya model in the Cohere Playground or download the model and dataset.

To learn more about the research and the people behind it, check out our documentary. We’ll also be hosting a virtual event on Friday, February 16 to share more about the new Aya model.