Text classification plays a pivotal role in digitizing a wide variety of modern industries. Also sometimes referred to as text tagging or text categorization, text classification describes the process of arranging text into specific, organized groups by assigning text a label or class. Text classification helps automate many business processes, such as customer support, survey analysis, sentiment analysis, and document summarization.

Text classification has drastically evolved over time, shifting away from traditional machine learning (ML) models that need large amounts of data to large language models (LLMs) that require only a handful of examples for model training.

The most significant advantage of text classification via natural language processing (NLP) is its ability to scale and accurately extract specific information from large volumes of textual data. Users can save hundreds of hours by using quality classifiers and encoders, making the process fast and cost-effective. Once deployed, a well-trained text classification model can label new text consistently and with high accuracy. Companies can automate multiple business processes and discover actionable insights that lead to better decision-making.

From boosting and bagging approaches, decision trees, regression models, and neural networks to vectorization and now deep learning-based models, text classification has grown spectacularly in recent years. Close to 250 well-known models are already in production with nearly 400 available datasets, each bringing a different style, architecture, and set of model characteristics. Adapting to these datasets has led to continuous developments in NLP and text classification models.

If you work in this field, keeping up to date with all the novel innovations is essential. So, let’s look at 10 must-read articles and research papers on text classification.

1. “The impact of preprocessing on text classification”

Authors: Alper Kursat Uysal and Serkan Gunal

Published: January 2014

The authors of this 2014 paper posit that choosing a suitable combination of preprocessing tasks, instead of enabling or disabling them all, significantly improves classification accuracy. Their study used widely known preprocessing tasks, such as stemming, eliminating stop words, tokenization, and lower case conversion on two domains (emails and news), in two different languages (English and Turkish).

While running classification using SVM, the authors used various feature sizes like 10, 20, 50, 100, 200, 500, 1,000, and 2,000 over all the combinations. Their study achieved a spectacular 98.8 percent accuracy in the English email domain, with a feature size of 500. They subsequently tested all the other varieties and noted that on the smallest feature size of 10, the Turkish news domain had an accuracy of 97.3 percent.

The authors conclude the paper by emphasizing the importance of checking all the possible combinations of preprocessing tasks — regardless of the domain or language involved — to improve the results.
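To make the idea of checking every combination concrete, here is a minimal Python sketch. It is not the authors' code: the stop-word list, the `toy_stem` suffix stripper, and the function names are all hypothetical stand-ins, and tokenization is assumed to be a plain whitespace split that is always on. It enumerates each on/off setting of three optional preprocessing steps so that a classifier could be evaluated once per setting.

```python
from itertools import product

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (hypothetical)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem would remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, lowercase, remove_stop_words, stem):
    """Split on whitespace, then apply only the enabled steps."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        # Compare case-insensitively so this step works with or without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def all_combinations():
    """Yield every on/off setting of the three optional steps (2**3 = 8)."""
    names = ("lowercase", "remove_stop_words", "stem")
    for flags in product([False, True], repeat=len(names)):
        yield dict(zip(names, flags))


# Show how the same sentence looks under each configuration.
sample = "The Meeting is Rescheduled to Friday"
for config in all_combinations():
    print(config, preprocess(sample, **config))
```

In a real experiment, each configuration would feed a train-and-evaluate step, and the best-scoring combination would be kept for that domain and language.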

Why it’s a must-read: Conducting experiments with two different languages and on two different domains, the authors of this article demonstrate that preprocessing influences classification accuracy. Though the article has aged, it’s still a must-read for understanding how big of an impact your preprocessing tasks and decisions have on the precision of your classifications, and where that impact will show.

2. “Universal language model fine-tuning for text classification”

Authors: Jeremy Howard and Sebastian Ruder

Published: January 2018

Published back in 2018, this study introduces readers to transfer learning for NLP tasks. The authors propose a universal language model fine-tuning method and report that it outperformed the existing state of the art on six text classification data sets. They also emphasize the impact of fine-tuning and directionality on the behavior of a classifier; for example, combining forward and backward language models improved results by roughly 0.5 to 0.7 percentage points.

Readers can also grasp the novel fine-tuning techniques implemented by the authors in this study. This piece truly opens the grounds for more work in text classification using transfer learning.

Why it’s a must-read: This article suggests that using Universal Language Model Fine-tuning (ULMFiT) as a transfer learning method for your NLP tasks increases their effectiveness. More importantly, it provides different techniques for fine-tuning your language model. By reading this paper, you can easily learn foundational strategies for making your classifications tighter, making this paper a must-read.

3. “Feature selection for text classification: A review”

Authors: Xuelian Deng, Yuqing Li, Jian Weng, and Jilian Zhang

Published: May 2018

As the name suggests, this 2018 study emphasizes the feature selection process to classify text, in which the selected features are directly proportional to the heterogeneity in data. This study also highlights document representation schemes, such as bag-of-words and local and global dictionaries, as well as similarity measures used in text classification, such as Euclidean distance, Jaccard coefficient, Pearson correlation, and cosine similarity.
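For readers who want to see these measures side by side, the following is a small Python sketch using only the standard library. The function names and the two example sentences are illustrative assumptions rather than anything taken from the paper. Documents are turned into bag-of-words count vectors over a shared vocabulary, and the Jaccard coefficient is computed on the sets of words.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def euclidean_distance(u, v):
    """Straight-line distance between two count vectors (0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_coefficient(text_a, text_b):
    """Shared distinct words divided by all distinct words."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pearson_correlation(u, v):
    """Linear correlation of two vectors; 0.0 if either has no variance."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


doc_a = "cheap flights to paris"
doc_b = "cheap hotels in paris"
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))
vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)

print("euclidean:", euclidean_distance(vec_a, vec_b))
print("cosine:", cosine_similarity(vec_a, vec_b))
print("jaccard:", jaccard_coefficient(doc_a, doc_b))
print("pearson:", pearson_correlation(vec_a, vec_b))
```

Note that Euclidean distance is a dissimilarity (smaller means closer), while the other three grow as documents become more alike.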

This study paved the way for ongoing research in feature selection, such as multi-label feature selection, streaming feature selection, online feature selection, filter-based locality preserving feature selection, and similarity preserving feature selection. It offers a detailed review of different techniques and considerations for text classification.

Why it’s a must-read: This article provides you with a foundational understanding of how to perform text classification with big multimedia data — something that’s become even more important in the four years since this paper’s publication. It’s a must-read because it highlights state-of-the-art feature selection techniques like the filter, wrapper, embedded, and hybrid models that you should use to facilitate multimedia text classification and data processing.

4. “A recent overview of the state-of-the-art elements of text classification”

Authors: Marcin Michał Mirończuk and Jarosław Protasiewicz

Published: September 2018

This 2018 paper describes six baseline elements for text classification, helping readers understand their importance and associated techniques. The authors showcase these elements in order of adoption and their category of text classification.

The authors conducted a qualitative analysis to systematically identify both the older and new techniques in all stages of text classification. They also explored the research trends with the help of quantitative field analysis. The paper concludes by opening up future directions, such as multilingual classifications and how complex embedding of features can create better semantic libraries for language learning.

Why it’s a must-read: This article gives an excellent overview of the essential phases involved in text classification and various concepts related to modern text classification. It also identifies trends that are emerging in contemporary text classification practices. This article is a must-read if you’re new to text classification or looking to sharpen your understanding of how text classification works.

5. “Easy data augmentation techniques for boosting performance on text classification tasks”

Authors: Jason Wei and Kai Zou

Published: January 2019

A well-encoded classifier is not enough to boost a model's performance and get better results. This study, published in 2019, shows that data augmentation and preprocessing techniques — like synonym replacement, random insertion, random swap, and random deletion — can improve performance on smaller data sets.
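The four operations named above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' released code: the `SYNONYMS` dictionary is a hypothetical mini-thesaurus (a fuller implementation would draw on a real lexical resource), and the function names and parameters are assumptions chosen for readability.

```python
import random

# Hypothetical mini-thesaurus used only for illustration.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "movie": ["film"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    new_words = list(words)
    candidates = [i for i, w in enumerate(new_words) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        new_words[i] = rng.choice(SYNONYMS[new_words[i]])
    return new_words


def random_insertion(words, n, rng):
    """Insert a synonym of a random known word at a random position, n times."""
    new_words = list(words)
    known = [w for w in words if w in SYNONYMS]
    for _ in range(n):
        if not known:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(known)])
        # randint is inclusive, so the synonym may also land at the very end.
        new_words.insert(rng.randint(0, len(new_words)), synonym)
    return new_words


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    new_words = list(words)
    if len(new_words) < 2:
        return new_words
    for _ in range(n):
        i, j = rng.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "the quick movie made me happy".split()
print(" ".join(synonym_replacement(sentence, 2, rng)))
print(" ".join(random_insertion(sentence, 2, rng)))
print(" ".join(random_swap(sentence, 1, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
```

Each augmented sentence keeps the label of the sentence it came from, so the outputs can be appended to the training set as extra examples.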

The authors used convolutional neural network (CNN) and recurrent neural network (RNN) models on five classification tasks: Stanford sentiment treebank, customer reviews, subjectivity and objectivity, question type, and pro-con dataset. They recorded an evident boost in performance when the textual data was augmented in different quantities.

A key takeaway is that easy data augmentation (EDA) conserves data labels even though the sentences change during augmentation. The authors also recommend how much augmentation is optimal for performance gain.

Why it’s a must-read: This article introduces EDA as a way to boost accuracy and performance on your text classification tasks. It provides you with four specific and easy-to-perform EDA tasks that you can implement to improve your text classification. This article is a must-read because, in addition to providing you with these techniques, it also gives you parameters to implement, making it easier to get started with text classification.

6. “Benchmarking zero-shot text classification: Datasets, evaluation, and entailment approach”

Authors: Wenpeng Yin, Jamaal Hay, and Dan Roth

Published: August 2019

Zero-shot text classification is a natural language understanding (NLU) problem in which an appropriate label is assigned to a piece of text regardless of the text's domain or the aspect the label describes. This 2019 study explains that earlier work on the problem was limited in several ways: it focused mostly on topic categorization, assumed that some recognized labels were available beforehand, and gave close to no consideration to other aspects when labeling.

The authors work through an entailment model, which frames classification the way a human might approach it: each candidate label is turned into a hypothesis, and the model judges whether the text entails that hypothesis. The best-supported label is then assigned to the text. The labeled data is compared to the ground truth benchmark labels for the same data, and the accuracy is determined.
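The shape of an entailment-based classifier can be sketched without a trained model. In the Python example below, everything is an illustrative assumption: the hypothesis template, the `HINTS` keyword table, and `toy_entail_score`, which is a deliberately naive stand-in for a real entailment model and only counts hint words. The part worth noticing is the structure: one hypothesis per candidate label, one score per hypothesis, and an accuracy check against gold labels.

```python
import re

# Hypothetical keyword hints so the toy scorer can run without a trained model.
HINTS = {
    "sports": {"team", "match", "goal"},
    "politics": {"election", "senate", "vote"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_entail_score(premise, hypothesis):
    """Stand-in for an entailment model: counts hint words for the label
    mentioned in the hypothesis that also appear in the premise."""
    premise_words = tokenize(premise)
    hypothesis_words = tokenize(hypothesis)
    score = 0
    for label, hints in HINTS.items():
        if label in hypothesis_words:
            score += len(premise_words & hints)
    return score


def zero_shot_classify(text, labels, entail_score=toy_entail_score,
                       template="This text is about {}."):
    """Build one hypothesis per label and return the best-scoring label."""
    return max(labels, key=lambda label: entail_score(text, template.format(label)))


def accuracy(predictions, gold):
    """Fraction of predictions that match the ground truth labels."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


texts = [
    "The team scored a late goal in the match",
    "The senate will vote after the election",
]
gold_labels = ["sports", "politics"]
predicted = [zero_shot_classify(t, ["sports", "politics"]) for t in texts]
print(predicted, accuracy(predicted, gold_labels))
```

Swapping `toy_entail_score` for a function backed by a trained entailment model would leave the rest of the sketch unchanged, which is the main appeal of this formulation.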

Why it’s a must-read: This article discusses zero-shot text classification and how well it works. Since the research is scarce on this topic, there’s limited comparison between methods adopted to solve the zero-shot problem, making this piece a must-read for understanding zero-shot text classification.

7. “Deep learning based text classification: A comprehensive review”

Authors: Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao

Published: January 2022

This research paper, first released in 2020, summarizes how deep learning (DL) techniques have outperformed traditional machine learning approaches on everyday tasks, such as sentiment analysis and news categorization. The authors thoroughly reviewed more than 150 modern DL models and looked at how their contributions have significantly influenced the applications mentioned above.

They also categorized and briefly described these models based on the neural network architectures and transformers involved. Their description of a unique hybrid model that combines long short-term memory (LSTM) and convolutional neural network (CNN) architecture is particularly noteworthy.

Why it’s a must-read: This research paper provides a solid overall look at the state of classification algorithms. It explores how you can leverage the power of deep learning to improve the machine learning-driven classification strategies that you already have in place.

8. “A comparative analysis of logistic regression, random forest and KNN models for the text classification”

Authors: Kanish Shah, Henil Patel, Devanshi Sanghvi, and Manan Shah

Published: March 2020

This 2020 research paper will help you understand the practical workings of prominent text classification algorithms. It’s a comparative analysis of the logistic regression, random forest, and KNN models for text classification.

The authors also present an extensive literature review of prior work on all three algorithms by different researchers, their pros and cons, and how those approaches differ from the authors’ own. They achieved close to 100 percent accuracy using their logistic regression classifier in one of the cases, outperforming the other two methods by a considerable margin.

Why it’s a must-read: This article is an in-depth comparison of some of the top algorithms used for text classification. It reflects on their efficacy by engaging with previous scholarship on the different text classification algorithms discussed. It’s a must-read article if you’re looking to deepen your understanding of text classification and can also be useful when determining which text classification algorithms you should implement.

9. “Text classification using machine learning and deep learning models”

Authors: Johnson Kolluri, Shaik Razia, and Soumya Ranjan Nayak

Published: June 2020

Published in 2020, this paper explores how maintaining irregular data is a big challenge for organizations, which has increased the demand for text classification tools. The three text classification methods mentioned are:

Supervised
Unsupervised
Semi-supervised

The authors also elaborate on unique approaches like graph-based methods, transductive SVM, self-training, and co-training. The paper concludes by explaining why it’s essential to categorize the text using mining for a semi-supervised learning approach to boost accuracy.

Why it’s a must-read: This paper highlights a radical new approach to text classification using BERT. It’s a must-read if you want to understand the workings of individual text classification methods using algorithms, such as hierarchical and K-means clustering, logistic regression, Naive Bayes, SVM, decision trees, K-nearest neighbors (KNN), neural networks, and more.

10. “Efficient English text classification using selected machine learning techniques”

Author: Xiaoyu Luo

Published: June 2021

This recent paper, published in 2021, details the implementation of the support vector machine (SVM) method and other ML techniques for classifying English text and documents. The author employed the following methods for classifying texts using three different data sets:

Naive Bayes algorithm
SVM method
Logistic regression
Logistic regression cross-validation (LRCV)

The results were pretty solid, with SVM scoring a precision of around 90 percent on one of the data sets and the highest precision on all three data sets when run on the Weka platform. Interestingly, the Naive Bayes algorithm had the lowest precision, at 12 percent on one of the data sets.

The author also presents a straightforward approach for categorizing the data using text mining, attribute abstraction, stop words removal, stemming, and vector space documents.
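As a rough illustration of one of the methods listed above, here is a compact multinomial Naive Bayes classifier with stop-word removal and a precision calculation, written in plain Python. It does not reproduce the paper's Weka setup: the class name, the stop-word list, and the tiny spam/ham training set are all hypothetical and exist only to show the mechanics.

```python
import math
from collections import Counter, defaultdict

# Small illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "the", "for", "with", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_log_prob = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            log_prob = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                log_prob += math.log((count + 1) / (total_words + vocab_size))
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label


def precision(predictions, gold, positive):
    """Of the items predicted as `positive`, the fraction that truly are."""
    hits = [g for p, g in zip(predictions, gold) if p == positive]
    if not hits:
        return 0.0
    return sum(1 for g in hits if g == positive) / len(hits)


train_texts = [
    "win a free prize now",
    "free money offer",
    "meeting agenda for monday",
    "lunch with the team on monday",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_texts, train_labels)

test_texts = ["free prize offer", "monday team meeting"]
test_labels = ["spam", "ham"]
predictions = [model.predict(t) for t in test_texts]
print(predictions, precision(predictions, test_labels, "spam"))
```

With a data set this small the numbers mean little; the point is only to show where tokenization, stop-word removal, and the precision metric fit into the workflow.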

Why it’s a must-read: This paper highlights how SVM and other ML techniques can be used to classify English text and documents, with a relatively strong precision rating. This is a must-read article if your work involves English text classification, as it provides strategies you can implement (and platforms you can use) to perform this classification more efficiently. It provides a good starting point for experimenting with and refining your own classification strategy.

Conclusion

With the radical advancements in machine learning and NLP, text classification techniques are evolving so rapidly that it's hard to keep track. Text alone is so information-rich that the greater the scale and variety, the brighter the future of text classification and analytics.

It’s important to stay up-to-date with recent advancements to ensure that you derive accurate insights from your textual data. However, keeping up with new advancements is a time-consuming, constant task. Instead of taking on this work yourself, you can offload this work to the pros at Cohere.

With Cohere, you’ll always stay on top of the market's state-of-the-art techniques for accurate and reliable results. Cohere’s Classify endpoint removes the need for expert MLEs, doesn’t require large amounts of training data, and is pre-trained with a massive corpus, making it an ideal, easy-to-use text classification tool.

Learn more about text classification on the Cohere platform and check out our Classify endpoint.