Embeddings can capture the meaning of a piece of text beyond keyword-matching. In this article, we will build a simple news article recommender system that computes the embeddings of all available articles and recommend the most relevant articles based on embeddings similarity.

We will also make the recommendation tighter by using text classification to recommend only articles within the same category. We will then extract a list of tags from each recommended article, which can further help readers discover new articles.

All this will be done via three Cohere API endpoints stacked together: Embed, Classify, and Chat.

We will implement the following steps:

Find the most similar articles to the one currently reading using embeddings.
Keep only articles of the same category using text classification.
Extract tags from these articles.
Show the top 5 recommended articles.

View the complete notebook here.

1. Find the most similar articles to the one currently reading using embeddings

Find the most similar articles using embeddings

Throughout this article, we'll use the BBC news article dataset as an example [Source]. This dataset consists of articles from a few categories: business, politics, tech, entertainment, and sport. Here are some example articles:

1.2 Turn articles into embeddings

The first thing we need to do is to turn each article's text into embeddings. An embedding is a list of numbers that our models use to represent a piece of text, capturing its context and meaning. We do this by calling Cohere’s Embed endpoint, which takes in texts as input and returns embeddings as output.

articles = df_inputs['Text'].tolist()

output = co.embed(
            model ='embed-english-v3.0',
            input_type='search_document',
            texts = articles)
embeds = output.embeddings

1.3 Pick one article and find the most similar articles

Next, we pick any one article to be the one the reader is currently reading (let's call this the target) and find other articles with the most similar embeddings (let's call these candidates) using cosine similarity.

Cosine similarity is a metric that measures how similar two sequences of numbers are (embeddings in our case), and we compute it for each target-candidate pair.

from sklearn.metrics.pairwise import cosine_similarity
 
def get_similarity(target,candidates):
  # Calculate cosine similarity
  similarity_scores = cosine_similarity(target,candidates)
 
  # Sort by descending order in similarity
  similarity_scores = list(enumerate(similarity_scores))
  similarity_scores = sorted(similarity_scores, key=lambda x:x[1], reverse=True)
 
  # Return similarity scores
  return similarity_scores

Using Article ID 70 as an example target article, here’s what we get:

Target:

[ID 70] aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanis ...

Candidates:

[ID 23] ferguson urges henry punishment sir alex ferguson has called on the football association to punish a ...
[ID 51] mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and  ...
[ID 73] balco case trial date pushed back the trial date for the bay area laboratory cooperative (balco) ste ...
[ID 41] mcleish ready for criticism rangers manager alex mcleish accepts he is going to be criticised after  ...
[ID 42] premier league planning cole date the premier league is attempting to find a mutually convenient dat ...

2. Keep only articles of the same category using text classification

Keep only articles of the same category using text classification

Two articles may be similar, but they may not necessarily belong to the same category. For example, an article about a sports team manager facing a fine may be similar to another about a business entity facing a fine, but they are not of the same category.

Perhaps we can make the system better by only recommending articles of the same category. For this, let's build a news category classifier.

2.1 Build a classifier

We use Cohere’s Classify endpoint to build a news category classifier, classifying articles into five classes: Business, Politics, Tech, Entertainment, and Sport.

A typical text classification model requires hundreds/thousands of data points to train, but with this endpoint, we can build a classifier with as few as five examples per class.

To build the classifier, we need a set of examples consisting of text (news text) and labels (news category). The BBC News dataset happens to have both (columns 'Text' and 'Category'), so this time we’ll use the categories for building our examples.

To build the classifier, we will use the Text and Category columns

The Classify endpoint needs a minimum of 5 examples for each category, which we will sample randomly from the dataset. Since we have 5 categories, we will have a total of 25 examples.

def classify_text(texts, examples):
    classifications = co.classify(
        inputs=texts,
        examples=examples
    )

    return [c.prediction for c in classifications.classifications]

2.2 Measure its performance

Before actually using the classifier, let's first test its performance. Here we take another 100 data points as the test dataset and the classifier will predict the classes i.e. news category.

# Predicted classes
predictions = []
BATCH_SIZE = 90 # The API accepts a maximum of 96 inputs
for i in range(0, len(df_test['Text']), BATCH_SIZE):
    batch_texts = df_test['Text'][i:i+BATCH_SIZE].tolist()
    predictions.extend(classify_text(batch_texts, examples))
    
    
# Actual classes
actual = df_test['Category'].tolist()
 
# Compute metrics on the test dataset
accuracy = accuracy_score(actual, predictions)

We get a good accuracy score of 89% (more details in the notebook), so the classifier is ready to be implemented in our recommender system.

3. Extract tags from these articles

We now proceed to the tags extraction step. Compared to the previous two steps, this step is not about sorting or filtering articles, but rather enriching them with more information.

We do this by prompting Cohere’s Chat endpoint with a few examples of text and its tags. We then feed the articles from the classifier step and the endpoint will generate the corresponding tags.

There is more than one way to construct the prompt, depending on what you'd like to extract. In my case, the tags I'd like to extract are primarily the names of a person, company, or organization, and perhaps also some generic keywords. That was the idea behind the example tags I put in the prompt, which you can see below.

def extract_tags(article):
  prompt = f"""Given an article, extract a list of tags containing keywords of that article.

# Examples
Article: japanese banking battle at an end japan s sumitomo mitsui \
financial has withdrawn its takeover offer for rival bank ufj holdings enabling the \
latter to merge with mitsubishi tokyo.  sumitomo bosses told counterparts at ufj of its \
decision on friday  clearing the way for it to conclude a 3 trillion

Tags: sumitomo mitsui financial, ufj holdings, mitsubishi tokyo, japanese banking

Article:france starts digital terrestrial france has become the last big european country to \
launch a digital terrestrial tv (dtt) service.  initially  more than a third of the \
population will be able to receive 14 free-to-air channels. despite the long wait for a \
french dtt roll-out  the new platform s bac

Tags: france, digital terrestrial

Article: apple laptop is  greatest gadget  the apple powerbook 100 has been chosen as the greatest \
gadget of all time  by us magazine mobile pc.  the 1991 laptop was chosen because it was \
one of the first  lightweight  portable computers and helped define the layout of all future \
notebook pcs. the magazine h

Tags: apple, apple powerbook 100, laptop


Article:{article}

Tags:"""
  
  
  response = co.chat(
    model='command-r',
    message=prompt,
    preamble="")

  return response.text

4. Show the Top 5 recommended articles

Putting everything together to recommend the Top 5 articles

Let's now put everything together for our article recommender system.

First, we select the target article and compute the similarity scores against the candidate articles. Next, we filter the articles via classification. Finally, we extract the keywords from each article and show the recommendations.

Keeping to Article ID 70 as an example target article, here’s what we get:

You are reading...

[ID 70] Article: aragones angered by racism fine spain coach luis aragones is furious after being fined by the spanish football federation for his comments about thierry henry.  the 66-year-old criticised his 3000 euros (£2 060) punishment even though it was far below the maximum penalty.  i am not guilty  nor do i ...

You might also like...

[ID 23] Article: ferguson urges henry punishment sir alex ferguson has called on the football association to punish arsenal s thierry henry for an incident involving gabriel heinze.  ferguson believes henry deliberately caught heinze on the head with his knee during united s controversial win. the united boss said i...
Tags: football, sir alex ferguson, thierry henry, arsenal, manchester united

[ID 51] Article: mourinho defiant on chelsea form chelsea boss jose mourinho has insisted that sir alex ferguson and arsene wenger would swap places with him.  mourinho s side were knocked out of the fa cup by newcastle last sunday before seeing barcelona secure a 2-1 champions league first-leg lead in the nou camp....
Tags: chelsea, jose mourinho, sir alex ferguson, arsene wenger, fa cup, newcastle, barcelona, champions league

[ID 41] Article: mcleish ready for criticism rangers manager alex mcleish accepts he is going to be criticised after their disastrous uefa cup exit at the hands of auxerre at ibrox on wednesday.  mcleish told bbc radio five live:  we were in pole position to get through to the next stage but we blew it  we absolutel...
Tags: rangers, alex mcleish, auxerre, uefa cup, ibrox

[ID 42] Article: premier league planning cole date the premier league is attempting to find a mutually convenient date to investigate allegations chelsea made an illegal approach for ashley cole.  both chelsea and arsenal will be asked to give evidence to a premier league commission  but no deadline has been put on ...
Tags: premier league, chelsea, arsenal, ashley cole

[ID 14] Article: ireland 21-19 argentina an injury-time dropped goal by ronan o gara stole victory for ireland from underneath the noses of argentina at lansdowne road on saturday.  o gara kicked all of ireland s points  with two dropped goals and five penalties  to give the home side a 100% record in their autumn i...
Tags: rugby, ireland, argentina, ronan o gara

Let’s try a couple of other articles in business and tech and see the output.

Business (returning recommendations around German economy and global economic growth/slump):

You are reading...

[ID 1] Article: german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy.  munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january  its first decline in three months. the stu...

You might also like...

[ID 56] Article: borussia dortmund near bust german football club and former european champion borussia dortmund has warned it will go bankrupt if rescue talks with creditors fail.  the company s shares tumbled after it said it has  entered a life-threatening profitability and financial situation . borussia dortmund...
Tags: borussia dortmund, german football, bankruptcy

[ID 2] Article: bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening.  most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook  a majority in 14 countries...
Tags: bbc, economy, financial outlook

[ID 8] Article: car giant hit by mercedes slump a slump in profitability at luxury car maker mercedes has prompted a big drop in profits at parent daimlerchrysler.  the german-us carmaker saw fourth quarter operating profits fall to 785m euros ($1bn) from 2.4bn euros in 2003. mercedes-benz s woes - its profits slid...
Tags: daimlerchrysler, mercedes, luxury car, profitability

[ID 32] Article: china continues rapid growth china s economy has expanded by a breakneck 9.5% during 2004  faster than predicted and well above 2003 s 9.1%.  the news may mean more limits on investment and lending as beijing tries to take the economy off the boil. china has sucked in raw materials and energy to fee...
Tags: china, economy, beijing

[ID 96] Article: bmw to recall faulty diesel cars bmw is to recall all cars equipped with a faulty diesel fuel-injection pump supplied by parts maker robert bosch.  the faulty part does not represent a safety risk and the recall only affects pumps made in december and january. bmw said that it was too early to say h...
Tags: bmw, diesel cars, robert bosch, fuel injection pump

Tech (returning recommendations around consumer devices):

You are reading...

[ID 71] Article: camera phones are  must-haves  four times more mobiles with cameras in them will be sold in europe by the end of 2004 than last year  says a report from analysts gartner.  globally  the number sold will reach 159 million  an increase of 104%. the report predicts that nearly 70% of all mobile phones ...

You might also like...


[ID 3] Article: lifestyle  governs mobile choice  faster  better or funkier hardware alone is not going to help phone firms sell more handsets  research suggests.  instead  phone firms keen to get more out of their customers should not just be pushing the technology for its own sake. consumers are far more interest...
Tags: mobile, lifestyle, phone firms, handsets

[ID 69] Article: gates opens biggest gadget fair bill gates has opened the consumer electronics show (ces) in las vegas  saying that gadgets are working together more to help people manage multimedia content around the home and on the move.  mr gates made no announcement about the next generation xbox games console ...
Tags: bill gates, consumer electronics show, gadgets, xbox

[ID 46] Article: china  ripe  for media explosion asia is set to drive global media growth to 2008 and beyond  with china and india filling the two top spots  analysts have predicted.  japan  south korea and singapore will also be strong players  but china s demographics give it the edge  a media conference in londo...
Tags: china, india, japan, south korea, singapore, global media growth

[ID 19] Article: moving mobile improves golf swing a mobile phone that recognises and responds to movements has been launched in japan.  the motion-sensitive phone - officially titled the v603sh - was developed by sharp and launched by vodafone s japanese division. devised mainly for mobile gaming  users can also ac...
Tags: mobile phone, japan, sharp, vodafone, golf swing

[ID 63] Article: what high-definition will do to dvds first it was the humble home video  then it was the dvd  and now hollywood is preparing for the next revolution in home entertainment - high-definition.  high-definition gives incredible  3d-like pictures and surround sound. the dvd disks and the gear to play the...
Tags: high-definition, dvd, hollywood, home entertainment

In conclusion, this demonstrates an example of how we can stack multiple NLP endpoints together to get an output much closer to our desired outcome.

In practice, hosting and maintaining multiple models can turn quickly into a complex activity. But by leveraging Cohere endpoints, this task is reduced to a simple API call.