The Embedding Archives: Millions of Wikipedia Article Embeddings in Many Languages

There’s no denying that we’re in the midst of a revolutionary time for Language AI. Developers are waking up to the vast emerging capabilities of language understanding and generation models. Among the key building blocks for this new generation of applications are the embeddings that power search systems.

To aid developers in rapidly getting started with commonly used datasets, we are releasing a massive archive of embedding vectors that can be freely downloaded and used to power your applications.

Using Cohere’s Multilingual embedding model, we have embedded millions of Wikipedia articles in many languages. The articles are broken down into passages, and an embedding vector is calculated for each passage.

The archives are available for download on Hugging Face Datasets, and contain the text, the embedding vector, and additional metadata values.

from datasets import load_dataset

# Download the Simple English Wikipedia archive (text, embeddings, and metadata)
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

This downloads the entire dataset (Simple English Wikipedia in this instance). Each row holds a passage of text, its embedding vector, and additional metadata.

The emb column contains the embedding of that passage of text (with the title of the article prepended to the passage before embedding). It is an array of 768 floats, the embedding dimension of Cohere’s multilingual-22-12 embedding model.
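To check what your download contains, you can print the column names and inspect a row. A minimal sketch, assuming the emb field described above and a text field for the passage (other column names may differ; the dataset card lists them all):

# Inspect the available columns and one example row
print(docs.column_names)       # includes 'emb' and the passage text column
print(docs[0]["text"][:200])   # first characters of the first passage
print(len(docs[0]["emb"]))     # 768-dimensional embedding vector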

Wikipedia               Number of vectors / embedded passages
English                 35 million
German                  15 million
French                  13 million
Spanish                 10 million
Italian                 8 million
Japanese                5 million
Arabic                  3 million
Chinese (Simplified)    2 million
Korean                  1 million
Simple English          486 thousand
Hindi                   432 thousand
Total                   94 million

Read more about how this data was prepared and processed in the dataset card.

What Can You Build with This?

The sky's the limit to what you can build with this. A few common use cases include:

Neural Search Systems

Wikipedia is one of the world’s most valuable knowledge stores. This embedding archive can be used to build search systems that retrieve relevant knowledge based on a user query.

In this example, to conduct a search, the query is first embedded using co.embed(), and then similarity to each passage is computed as the dot product between the query embedding and the document embeddings.

import torch
import cohere

co = cohere.Client("YOUR_API_KEY")

# Get the query, then embed it with the same model used for the archive
query = 'Who founded youtube'
query_embedding = torch.tensor(co.embed(texts=[query], model='multilingual-22-12').embeddings)

# Compute dot scores between the query embedding and the document embeddings
# 'doc_embeddings' is a tensor holding the archive's vectors
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

Now, top_k contains the scores and indices of the most relevant results. See the full code example here [Colab/notebook].
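For completeness, here is a minimal sketch of how doc_embeddings can be built from the archive and how the matching passages can be printed, assuming the emb, title, and text fields from the dataset card (the linked notebook has the full version):

from datasets import load_dataset
import torch

# Load the Simple English archive and stack its vectors into one tensor
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
doc_embeddings = torch.tensor(docs["emb"])

# ... compute top_k from the query embedding as shown above ...

# Print the best-matching passages
for score, idx in zip(top_k.values[0].tolist(), top_k.indices[0].tolist()):
    print(f"{score:.2f}  {docs[idx]['title']}: {docs[idx]['text'][:150]}")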

Weaviate: Neural Search with a Vector Database

Beyond a certain scale, it becomes useful to employ a vector database for more scalable and advanced retrieval functionality.

A subset of this embedding archive is hosted publicly by Weaviate. You can query it directly, without downloading the dataset or processing it in any way. It contains 10 million of these vectors: 1 million each from ten languages (en, de, fr, es, it, ja, ar, zh, ko, hi).

You can find the full code in this colab/notebook. You can query the dataset with:

query_result = semantic_search("time travel plot twist")

And get back the most relevant passages.

You can also filter the results for a specific language, say Japanese:

query_result = semantic_search("time travel plot twist", results_lang='ja')

And get results only in that one language.
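The semantic_search helper itself is defined in that notebook. As a rough sketch of what such a helper can look like with the Weaviate Python client (the endpoint URL, the Articles class, and the property names below are assumptions rather than the notebook's exact values):

import weaviate

# Hypothetical endpoint and headers; the linked notebook has the real values
client = weaviate.Client(
    url="https://cohere-demo.weaviate.network",
    additional_headers={"X-Cohere-Api-Key": "YOUR_COHERE_API_KEY"},
)

def semantic_search(query, results_lang=None):
    # Weaviate's Cohere module embeds the query and retrieves nearby vectors
    builder = (
        client.query
        .get("Articles", ["title", "text", "lang"])
        .with_near_text({"concepts": [query]})
        .with_limit(5)
    )
    # Optionally restrict results to a single language
    if results_lang is not None:
        builder = builder.with_where({
            "path": ["lang"],
            "operator": "Equal",
            "valueString": results_lang,
        })
    return builder.do()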

Use More Than One Language

Because these archives were embedded with a model that has cross-lingual properties, you can mix languages in your application: sentences that are similar in meaning have similar embeddings, even when they are written in different languages.
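For example, using the search helper sketched above, a query in one language can retrieve passages in another; the German query below (roughly "time travel plot twist") is only an illustration:

query_result = semantic_search("Zeitreise mit überraschender Wendung", results_lang='en')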

Search specific sections of Wikipedia

Beyond global Wikipedia exploration, a dataset like this opens the door to searching specific topics if you curate several pages on a relevant topic. Examples include:

  • All the episode pages of Breaking Bad (Get the page titles from List of Breaking Bad Episodes) or other TV series.
  • Utilize Wikipedia information boxes to collect the titles of a specific topic, say Electronics (from the bottom of the Computers page)

Due to the size of the dataset, an interim step can be to import the text into a database like Postgres and use that to extract interesting subsets for each project you want to build.
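If you prefer to stay in Python, a lightweight alternative is to filter the Hugging Face dataset directly. A minimal sketch, assuming a title column and a hypothetical curated_titles set built from one of the pages above:

from datasets import load_dataset

docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

# Hypothetical set of article titles curated from a "List of ..." page
curated_titles = {"Breaking Bad", "Better Call Saul"}

# Keep only the passages that belong to the curated articles
subset = docs.filter(lambda row: row["title"] in curated_titles)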

Let's build!

Drop by the Embedding Archives: Wikipedia thread on the Cohere Discord (join here) if you have any questions or ideas, or if you want to share something cool you've built with this.


We can’t wait to see what you build! Sign up for a free Cohere account to start building.
