There’s no denying that we’re in the midst of a revolutionary time for Language AI. Developers are waking up to the vast emerging capabilities of language understanding and generation models. The embeddings that power search systems are one of the key building blocks for this new generation of applications.
To aid developers in rapidly getting started with commonly used datasets, we are releasing a massive archive of embedding vectors that can be freely downloaded and used to power your applications.
Using Cohere’s Multilingual embedding model, we have embedded millions of Wikipedia articles in many languages. The articles are broken down into passages, and an embedding vector is calculated for each passage.
The archives are available for download on Hugging Face Datasets, and contain the text, the embedding vector, and additional metadata values for each passage.
from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
This downloads the entire dataset (Simple English Wikipedia in this instance). The schema looks like this:
The emb column contains the embedding of the passage text (with the title of the article prepended). This is an array of 768 floats (the embedding dimension of Cohere’s multilingual-22-12 embedding model).
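As a rough sketch of working with rows of this shape, the helper below collects a limited number of passages and checks that each vector has the expected 768 dimensions. The function name `take_passages` is hypothetical, and the `text` column name is an assumption (only `emb` is named above); the rows here are synthetic stand-ins, not real Wikipedia data.

```python
from itertools import islice

EMBEDDING_DIM = 768  # dimension stated above for multilingual-22-12


def take_passages(rows, limit):
    """Collect up to `limit` passages as parallel lists of texts and vectors."""
    texts, embeddings = [], []
    for row in islice(rows, limit):
        emb = row["emb"]
        if len(emb) != EMBEDDING_DIM:
            raise ValueError(f"expected {EMBEDDING_DIM} floats, got {len(emb)}")
        texts.append(row["text"])  # 'text' column name is an assumption
        embeddings.append(emb)
    return texts, embeddings


# Synthetic rows shaped like the dataset (placeholder values only)
fake_rows = [{"text": f"passage {i}", "emb": [0.0] * EMBEDDING_DIM} for i in range(5)]
texts, embeddings = take_passages(fake_rows, limit=3)
```

With the real archive, `rows` could be the `docs` object loaded earlier. If downloading everything up front is impractical, passing `streaming=True` to `load_dataset` yields rows lazily, and the same helper should work on that iterator.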
Read more about how this data was prepared and processed in the dataset card.
What Can You Build with This?
The sky's the limit for what you can build with this. A few common use cases include:
Neural Search Systems
Wikipedia is one of the world’s most valuable knowledge stores. This embedding archive can be used to build search systems that retrieve relevant knowledge based on a user query.
In this example, to conduct a search, the query is first embedded using co.embed(), and then the similarity between the query and each passage is calculated using the dot product.
import torch

# 'co' is an initialized cohere.Client
# 'doc_embeddings' is a 2-D float tensor holding the vectors in the archive,
# e.g. torch.tensor(docs['emb'])

# Get the query, then embed it (converted to a tensor so torch.mm accepts it)
query = 'Who founded youtube'
query_embedding = torch.tensor(co.embed(texts=[query], model='multilingual-22-12').embeddings)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)
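To make the scoring step concrete without an API key or the full archive, here is a minimal sketch of the same dot-product ranking in plain Python. It uses tiny made-up 3-dimensional vectors in place of real 768-dimensional embeddings; the function names and toy data are illustrative assumptions, not part of the archive or the Cohere SDK.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(query_embedding, doc_embeddings, k=3):
    """Return (score, index) pairs for the k highest-scoring documents."""
    scored = ((dot(query_embedding, emb), i) for i, emb in enumerate(doc_embeddings))
    return heapq.nlargest(k, scored)


# Toy vectors standing in for real embeddings (illustrative values only)
doc_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [0.9, 0.1, 0.0]

results = top_k_by_dot(query_embedding, doc_embeddings, k=2)
top_indices = [index for _, index in results]
```

The returned indices point back into the list of passages, so the matching text can be looked up by position. For millions of vectors, a matrix library or a vector database is the more practical choice.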
Weaviate: Neural Search with a Vector Database
Beyond a certain scale, it becomes useful to employ a vector database for more scalable and advanced retrieval functionality.
A subset of this embedding archive is hosted publicly by Weaviate. You can query it directly without having to download the dataset or process it in any way. It contains 10 million of these vectors, 1 million from each of ten languages. You can run a query like this:
query_result = semantic_serch("time travel plot twist")
And get the results:
You can also filter the results for a specific language, say Japanese:
query_result = semantic_serch("time travel plot twist", results_lang='ja')
And get results only in that one language.
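The hosted query helper is not shown here, so the following is only a sketch of what a language filter amounts to when applied locally: restrict the candidate passages to one language code, then rank what remains by dot product. The `search` function and the `lang`, `text`, and `emb` field names are assumptions for illustration, and this is not how the Weaviate-hosted index is implemented.

```python
def search(query_embedding, passages, k=3, results_lang=None):
    """Rank passages by dot product, optionally keeping one language only."""
    candidates = [
        p for p in passages
        if results_lang is None or p["lang"] == results_lang
    ]

    def score(passage):
        return sum(q * d for q, d in zip(query_embedding, passage["emb"]))

    return sorted(candidates, key=score, reverse=True)[:k]


# Toy passages with made-up 2-dimensional vectors (placeholder values only)
passages = [
    {"lang": "en", "text": "english passage", "emb": [0.9, 0.1]},
    {"lang": "ja", "text": "japanese passage A", "emb": [0.8, 0.2]},
    {"lang": "ja", "text": "japanese passage B", "emb": [0.1, 0.9]},
]
query_embedding = [1.0, 0.0]

all_languages = search(query_embedding, passages)
japanese_only = search(query_embedding, passages, results_lang="ja")
```

Filtering before scoring keeps the ranking cost proportional to the size of the selected language rather than the whole collection.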
Use More Than One Language
Because these archives were embedded with a model with cross-lingual properties, you can use multiple languages in your application and rely on the property that sentences that are similar in meaning will have similar embeddings, even if they are in different languages.
Search Specific Sections of Wikipedia
Beyond global Wikipedia exploration, a dataset like this opens the door to searching specific topics if you curate several pages on a relevant topic. Examples include:
- All the episode pages of Breaking Bad (Get the page titles from List of Breaking Bad Episodes) or other TV series.
- Use Wikipedia information boxes to collect the page titles for a specific topic, say Electronics (from the bottom of the Computers page).
Due to the size of the dataset, an interim step can be to import the text into a database like Postgres and use that to extract interesting subsets for each project you want to build.
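As a sketch of that interim step, the example below loads passages into a database table and pulls out the subset belonging to a curated list of page titles. It uses Python's built-in SQLite module as a stand-in for Postgres so that it runs without a server; the table layout, function names, and the `title` and `text` columns are assumptions, and the rows are placeholders rather than real Wikipedia content.

```python
import sqlite3


def build_passage_db(rows):
    """Load (title, text) pairs into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE passages (title TEXT, text TEXT)")
    conn.executemany("INSERT INTO passages (title, text) VALUES (?, ?)", rows)
    # An index on title keeps per-topic lookups fast as the table grows
    conn.execute("CREATE INDEX idx_passages_title ON passages (title)")
    conn.commit()
    return conn


def passages_for_titles(conn, titles):
    """Return (title, text) rows for every passage whose title is in `titles`."""
    placeholders = ", ".join("?" for _ in titles)
    query = (
        "SELECT title, text FROM passages "
        f"WHERE title IN ({placeholders}) ORDER BY rowid"
    )
    return conn.execute(query, list(titles)).fetchall()


# Placeholder rows standing in for passages from the archive
rows = [
    ("Episode page A", "placeholder passage 1"),
    ("Episode page B", "placeholder passage 2"),
    ("Unrelated page", "placeholder passage 3"),
]
conn = build_passage_db(rows)
subset = passages_for_titles(conn, ["Episode page A", "Episode page B"])
```

The curated title list would come from a source such as an episode list page; the embeddings for the selected passages can then be looked up from the archive by the same titles.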
We can’t wait to see what you build! Sign up for a free Cohere account to start building.