Aya is an open science project that aims to build a state-of-the-art multilingual generative language model by harnessing the collective wisdom and contributions of people from all over the world.
Cohere For AI is a research lab that seeks to solve complex machine learning problems. We are honored to introduce Aya, an ongoing collaborative open science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people from all over the world. This yearlong open science initiative brings together AI experts from academia, industry, non-profits, and independent research to create a state-of-the-art multilingual model and foster open collaboration.
As natural language processing technologies advance, not all languages have been treated equally by developers and researchers. Much of the data used to train large language models comes from the internet, which continues to reflect the composition of early users of this technology: 5% of the world speaks English at home, yet 63.7% of internet communication is in English. There are around 7,000 languages spoken in the world, and around 400 languages have more than 1M speakers. [1] However, coverage in multilingual datasets remains scarce. [2, 3] On top of this, the under-indexing of certain languages is also driven by access to compute resources. Mobile data, compute, and other resources are often expensive or unavailable in regions that are home to under-represented languages. Unless we address this disproportionate representation head-on, we risk perpetuating this divide and further widening the gap in access to new technologies across languages.
In the Aya Multilingual project, we want to improve available multilingual generative models and accelerate progress for languages across the world. The word Aya is derived from the Twi language and translates to “fern”. Aya is a symbol of endurance and resourcefulness, which captures the spirit of our own commitment to accelerate multilingual AI progress. Contributing to Aya is open to anyone who is passionate about advancing the field of natural language processing and is committed to promoting open science. You don’t have to be an AI expert to be involved; we are looking for everyday citizens, teachers, linguists, and lifelong learners. By joining Aya, you become part of a global movement dedicated to democratizing access to language technology. We will be open sourcing all our models, training data, and the data collection tool as part of this project.
In our commitment to fostering collaboration, we are thrilled to announce two international sprints in a couple of months. These sprints will bring together individuals from diverse parts of the world. By accommodating different time zones and regional contexts, we hope to ensure that everyone has an equal opportunity to actively participate and contribute their expertise.
By participating in these dedicated sprints, you’ll have the opportunity to connect with like-minded individuals from your region and collectively make a significant impact on language representation. The project is led by Cohere For AI, which also supports it with compute and resources. However, it is a truly multi-institutional initiative, made possible by a community of researchers, engineers, linguists, social scientists, and lifelong learners from over 100 countries around the world.
Join us on this remarkable journey as we collectively shape the future of multilingual language models. Let's unite, collaborate, and unleash the true potential of open science for the betterment of global communication. Get started today by contributing in your language. Sign up to be part of one of the international sprints, scheduled as follows:
- Aya International Sprint 1 - August 12, sign up here
- Aya International Sprint 2 - August 26, sign up here
Not sure where to start? Join our dedicated Discord server for the Aya multilingual project, where you can meet people contributing in your language.
1. How many languages are there in the world? (2023). Retrieved 30 May 2023, from https://www.ethnologue.com/insights/how-many-languages/
2. Lauscher, A., Ravishankar, V., Vulić, I., & Glavaš, G. (2020). From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers. Retrieved 30 May 2023, from https://aclanthology.org/2020.emnlp-main.363.pdf
3. NLLB Team, Costa-jussà, M., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., et al. (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. Retrieved 30 May 2023, from https://arxiv.org/abs/2207.04672