Context by Cohere
How to Boost Wikipedia Search with Rerank

Improving the Wikipedia API's search results using the Rerank endpoint as a reranker.

Traditional search algorithms often struggle to rank results effectively, a major hindrance when a search returns a large number of results. Reranking helps ensure that the most relevant responses rise to the top. With reranking techniques, search engines can go beyond basic relevance signals and instead find results that are semantically similar to a user’s query. This method improves the quality of the search, increases user satisfaction, and reveals valuable information that might otherwise be overlooked.

In this article, we’ll see how Cohere’s reranking capabilities can dramatically improve search results. We’ll start by looking at the results from a traditional Wikipedia search using the Wikipedia API. Then, we’ll reorder those results using Cohere’s Rerank endpoint. Finally, we’ll look at a user’s reranked search results for the query “Where are Monet’s water lilies?” to illustrate how Rerank improves the user experience.

We’ll go through the following steps:

  • Step 1: Get Search Results from the Wikipedia API
  • Step 2: Rerank Results
  • Step 3: Display Results
  • Step 4: Put It Together

You can find the source code used in this example here.

Step 1: Get Search Results from the Wikipedia API

First, we’ll use the search_wikipedia function to perform the search on Wikipedia through its API. It constructs the search URL using the provided query, makes a GET request to the API, and retrieves the initial search results.

Using the list operator (list=search in the request URL), the Wikipedia API returns the search results in their proper ranking. (Note: Do not use the generator operator, because its results will not be ranked correctly.) Then, a second call gets all the information we need (leading passage, URL, and image) for each page.
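
To make the request parameters easier to read, here is a minimal sketch that builds the same kind of search URL from a dictionary instead of a hand-written string. The helper name build_search_url and the WIKI_API constant are illustrative assumptions, not part of the original source; the parameter names are the ones that appear in the search URL used in this article.

```python
from urllib.parse import urlencode

# Assumed constant for illustration; the endpoint matches the URL used in this article.
WIKI_API = "https://en.wikipedia.org/w/api.php"


def build_search_url(query: str, limit: int = 20) -> str:
    """Build a list=search request URL (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",   # the list operator, which keeps Wikipedia's ranking
        "srlimit": limit,   # maximum number of results to return
        "srsearch": query,  # the user's search text
    }
    # urlencode escapes the query the same way quote_plus does
    return f"{WIKI_API}?{urlencode(params)}"


url = build_search_url("Where are Monet's water lilies?")
print(url)
```

Keeping the parameters in a dictionary makes it harder to mistype a separator and easier to see at a glance which options a request uses.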

import urllib.parse

import requests

def search_wikipedia(query):
    # Build the search request; list=search keeps Wikipedia's own ranking
    wiki_search_string = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=search&srlimit=20&srsearch="
    wiki_search_string += urllib.parse.quote_plus(query)

    initial_req = requests.get(wiki_search_string).json()

The function retrieves additional information for the returned page IDs, including titles, URLs, text excerpts, and images.

    # This is the correct Wikipedia ranking (same as the wikipedia.org website)
    page_ids = []
    for page in initial_req["query"]["search"]:
        page_ids.append(str(page["pageid"]))
    if len(page_ids) == 0:
        return [], []
    
    # Get extra info for the returned page ids
    wiki_data_string = "https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info%7Cextracts%7Cpageimages&formatversion=2&inprop=url&exchars=1200&exlimit=20&exintro=1&explaintext=1&exsectionformat=plain&piprop=thumbnail%7Cname&pithumbsize=100&pilimit=50&pilicense=any"
    wiki_data_string += "&pageids="+"|".join(page_ids) 

    res = requests.get(wiki_data_string).json()

After some formatting, the information obtained is stored in the initial_structured and initial_passages variables. We will use this formatted information both as the input to the reranker and to show the original results of a search that has not been reranked.

    initial_passages = [None] * len(page_ids)
    initial_structured = [None] * len(page_ids)

    # Retrieve and format additional information including titles, URLs, text excerpts and images
    for page in res["query"]["pages"]:
        # Use the initial list operator ranking, because the second request gives a different result ordering, 
        # which is not relevance based (!)
        actual_ranking_idx = page_ids.index(str(page['pageid']))
        initial_structured[actual_ranking_idx] = {
            "title": page["title"],
            "url": page["fullurl"],
            "text": page["extract"],
            "img": "" if "thumbnail" not in page else page["thumbnail"]["source"],
        }
        initial_passages[actual_ranking_idx] = page["title"] + " " + page["extract"]

    return initial_structured, initial_passages
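
The reordering step at the end of search_wikipedia is easy to miss, so here is a small standalone sketch of the same idea. The function name restore_ranking and the sample data are made up for illustration; only the technique (placing each page at the position its ID holds in the ranked ID list) comes from the code above.

```python
def restore_ranking(page_ids, pages):
    """Place each page at the position its ID has in the ranked ID list."""
    # Map each ID to its rank once, instead of calling list.index() per page
    position = {pid: i for i, pid in enumerate(page_ids)}
    ordered = [None] * len(page_ids)
    for page in pages:
        ordered[position[str(page["pageid"])]] = page
    return ordered


# Made-up data: ranked IDs from the first call, pages returned in another order
ranked_ids = ["30", "10", "20"]
unordered_pages = [
    {"pageid": 10, "title": "B"},
    {"pageid": 20, "title": "C"},
    {"pageid": 30, "title": "A"},
]

titles = [page["title"] for page in restore_ranking(ranked_ids, unordered_pages)]
print(titles)  # ['A', 'B', 'C']
```

The dictionary lookup is only a convenience here; with 20 results, the page_ids.index() call used in the original function is just as practical.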

Step 2: Rerank Results

Next, we’ll rerank the search results using Cohere’s Rerank endpoint.

The re_rank() function reranks the initial search results based on the user query and a specified reranking model (rerank-english-v2.0). Cohere’s reranking algorithm receives the user query (data.query) and the initial search results (data.passages) as parameters, then compares the semantic information in the query and the initial search results.

The reranking model then assigns a relevance score to each document in the initial search results—the higher the score, the more relevant the document is to the query.

After sorting the initial search results in descending order of their relevance scores, the reranked results, including the document indices and relevance scores, are returned as the output of the re_rank() function.

async def re_rank(data: ReRankInput):
    if len(data.passages) == 0:
        return {"results": []}
    rerank_time = default_timer()
    re_ranked_result = co.rerank(
        model="rerank-english-v2.0",
        query=data.query, 
        documents=data.passages)
    serializable = []
    for res in re_ranked_result:
        serializable.append(
            {
                "index": int(res.index),
                "relevance_score": round(float(res.relevance_score), 3),
            }
        )
    re_ranked_result = {"results": serializable}
    
    rerank_time = round((default_timer() - rerank_time) * 1000, 1)
    
    _id = str(uuid.uuid4())
    re_ranked_result["id"] = _id
    return re_ranked_result
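
Because re_rank() returns only indices and scores, the caller still has to map them back onto the structured results before displaying them. The following is a minimal sketch of that mapping, assuming the response shape shown above ({"results": [{"index": ..., "relevance_score": ...}]}). The function name apply_rerank, the original_rank field, and all sample titles and scores are hypothetical and made up for illustration.

```python
def apply_rerank(initial_structured, rerank_response):
    """Reorder structured results using the indices returned by re_rank()."""
    reranked = []
    for item in rerank_response["results"]:
        # Copy the entry so the original list stays untouched
        entry = dict(initial_structured[item["index"]])
        entry["relevance_score"] = item["relevance_score"]
        entry["original_rank"] = item["index"] + 1  # 1-based position before reranking
        reranked.append(entry)
    return reranked


# Made-up example data, not real search output
initial_structured = [
    {"title": "Page one"},
    {"title": "Page two"},
    {"title": "Page three"},
]
rerank_response = {
    "results": [
        {"index": 2, "relevance_score": 0.9},
        {"index": 0, "relevance_score": 0.4},
        {"index": 1, "relevance_score": 0.1},
    ]
}

reranked = apply_rerank(initial_structured, rerank_response)
for rank, entry in enumerate(reranked, start=1):
    print(rank, entry["title"], entry["relevance_score"], entry["original_rank"])
```

Keeping the original rank alongside each entry makes it simple to show how far a result moved after reranking.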

Step 3: Display Results

Let’s now compare the original Wikipedia search and the reranked results for the query “Where are Monet’s water lilies?”. Below are the results from the initial Wikipedia search.

The original Wikipedia search results for “Where are Monet’s water lilies?”

These results are acceptable but not perfect. When we click on the first result, we get some of the information we’re looking for, but it’s not at the top of the page and isn’t detailed enough.

The first listed result from the original Wikipedia search results

Let’s compare Wikipedia’s results to the reranked results using Rerank.

The reranked search results by the Rerank endpoint

Now, instead of the first result being the Wikipedia entry for Claude Monet, we get the Fondation Monet in Giverny. The name of the town is right there in the title.

The first listed result from the reranked search results

The article mentions the famous water lilies in the first paragraph. We get information about the site and even a map showing where the gardens are located in France. From this map, a user would know, for example, to fly to Paris rather than Marseille.

Let’s try one more search example: “David Bowie hits.” Here are the results of the traditional search:

The original Wikipedia search results for “David Bowie hits”

While an article on David Bowie’s discography would contain all the hits, there’s a lot of information for a user to sift through. And while the David Bowie article probably mentions several of his notable songs, it may not list all of his hits — and they’d be listed among extraneous information about his life.

Let’s contrast these results with the reranked results using Rerank.

The reranked search results by the Rerank endpoint

The first result is a greatest hits album, which is much more relevant to this user’s search. The reranked search also returned some of his biggest hits as individual results. Notice that the most relevant result, listed first, is ranked only 10th in Wikipedia’s original results.

Step 4: Put It Together

Now that we’ve established that the reranked results are better, how do we get them into our JavaScript-backed UI?

To do so, we create an API using FastAPI, an easy-to-use web framework that Python developers can use to create RESTful APIs quickly. The following code connects the Python code above to the JavaScript code that creates the demo’s front end.

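from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

# API routes live under /api; the static front end in ui/ is served from the root
api_app = FastAPI(title="api app")
app = FastAPI(title="main app")
app.mount("/api", api_app)
app.mount("/", StaticFiles(directory="ui", html=True), name="ui")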

Conclusion

Cohere’s Rerank endpoint can significantly improve the relevance of responses to search queries. Cohere’s reranking algorithm leverages semantic similarity to deliver more accurate results beyond basic relevance signals. As shown in the queries reviewed here, the reranked results had more relevant information, highlighted key details, and improved the overall search experience.

By integrating Cohere’s reranking functionality into user interfaces, developers can enhance search capabilities and deliver more meaningful results to their users.
To get started building your own version, create a free Cohere account.