Context by Cohere
Introducing Aya: An Open Science Initiative to Accelerate Multilingual AI Progress

TL;DR:

Aya is an open science project that aims to build a state-of-the-art multilingual generative language model that harnesses the collective wisdom and contributions of people from all over the world.


Cohere For AI is a research lab that seeks to solve complex machine learning problems. We are honored to introduce Aya, an ongoing collaborative open science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people from all over the world. This yearlong open science initiative brings together AI experts from academia, industry, non-profits, and independent research to create a state-of-the-art multilingual model and foster open collaboration.

As natural language processing technologies advance, not all languages have been treated equally by developers and researchers. Much of the data used to train large language models comes from the internet, which continues to reflect the composition of the technology's early users: 5% of the world speaks English at home, yet 63.7% of internet communication is in English. There are around 7,000 languages spoken in the world, and around 400 of them have more than 1M speakers [1]. However, coverage of these languages in multilingual datasets is scarce [2, 3]. On top of this, the under-indexing of certain languages is also driven by limited access to compute resources. Mobile data, compute, and other computational resources are often expensive or unavailable in regions that are home to under-represented languages. Unless we address this disproportionate representation head-on, we risk perpetuating this divide and further widening the gap in access to new technologies across languages.

In the Aya Multilingual project, we want to improve available multilingual generative models and accelerate progress for languages across the world. The word Aya comes from the Twi language and translates to “fern”. Aya is a symbol of endurance and resourcefulness, which captures the spirit of our own commitment to accelerating multilingual AI progress. Contributing to Aya is open to anyone who is passionate about advancing the field of natural language processing and is committed to promoting open science. You don’t have to be an AI expert to be involved; we are looking for everyday citizens, teachers, linguists, and lifelong learners. By joining Aya, you become part of a global movement dedicated to democratizing access to language technology. We will be open sourcing all our models, training data, and the data collection tool as part of this project.

In our commitment to fostering collaboration, we are supporting a dedicated Discord server to connect with Aya contributors worldwide. Here we gather to coordinate our efforts and connect as a community of independent researchers passionate about ensuring our languages are included in the future of generative AI. By joining our community, you’ll have the opportunity to connect with like-minded individuals from your region and collectively make a significant impact on language representation.

The project is led by Cohere For AI, which supports it with compute and resources. However, it is a truly multi-institutional initiative, built with the help of a community of researchers, engineers, linguists, social scientists, and lifelong learners from over 100 countries around the world.

Join us on this remarkable journey as we collectively shape the future of multilingual language models. Let's unite, collaborate, and unleash the true potential of open science for the betterment of global communication. Get started today by contributing in your language.

Not sure where to start? Join our dedicated Discord Server for the Aya multilingual project, and you can meet people contributing in your language.


1. How many languages are there in the world? (2023). Retrieved 30 May 2023, from https://www.ethnologue.com/insights/how-many-languages/

2. From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers. (2023). Retrieved 30 May 2023, from https://aclanthology.org/2020.emnlp-main.363.pdf

3. NLLB Team, Costa-jussà, M., Cross, J., Çelebi, O., Elbayad, M., & Heafield, K. et al. (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. Retrieved 30 May 2023, from https://arxiv.org/abs/2207.04672

