A pipeline and comparative study of 12 machine learning models for text classification

04/04/2022
by   Annalisa Occhipinti, et al.

Text-based communication is highly favoured, especially in business environments. As a result, it is often abused to send malicious messages, e.g., spam emails, that deceive users into relaying personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right trade-off in their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve good accuracy in spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and to interpret the classification outcomes of the 12 machine learning models, also identifying the words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model to classify the Enron dataset with an F-score of 94
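The kind of pipeline the abstract describes (preprocessing, a capped feature size, a hyperparameter search, and evaluation by precision, recall, F-score and run time) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the abstract does not name the 12 models, so a multinomial Naive Bayes classifier stands in for them; the toy messages are hypothetical and are not taken from the Enron corpus; and the function names (`tokenize`, `build_vocab`, `grid_search`, and so on) are invented for this example rather than taken from the paper.

```python
import math
import re
import time
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for fuller NLP preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(docs, max_features):
    """Keep the max_features most frequent tokens seen in the training documents."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return {tok for tok, _ in counts.most_common(max_features)}


def train_nb(docs, labels, vocab, alpha=1.0):
    """Fit multinomial Naive Bayes with additive smoothing alpha (illustrative model choice)."""
    priors, likelihoods = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for d in class_docs for t in tokenize(d) if t in vocab)
        total = sum(counts.values()) + alpha * len(vocab)
        likelihoods[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return priors, likelihoods


def predict(model, doc):
    """Return the class with the highest log-posterior; out-of-vocabulary tokens are ignored."""
    priors, likelihoods = model
    scores = {
        c: priors[c] + sum(likelihoods[c][t] for t in tokenize(doc) if t in likelihoods[c])
        for c in priors
    }
    return max(scores, key=scores.get)


def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def grid_search(train, val, feature_sizes, alphas):
    """Evaluate every (feature size, alpha) pair and record metrics plus run time."""
    (train_docs, train_labels), (val_docs, val_labels) = train, val
    results = []
    for max_features in feature_sizes:
        vocab = build_vocab(train_docs, max_features)
        for alpha in alphas:
            start = time.perf_counter()
            model = train_nb(train_docs, train_labels, vocab, alpha)
            preds = [predict(model, d) for d in val_docs]
            elapsed = time.perf_counter() - start
            p, r, f1 = precision_recall_f1(val_labels, preds)
            results.append({"max_features": max_features, "alpha": alpha,
                            "precision": p, "recall": r, "f1": f1, "seconds": elapsed})
    return results


# Hypothetical toy data, NOT the Enron corpus.
train_docs = ["win free money now", "free prize claim now",
              "meeting agenda for monday", "please review the quarterly report"]
train_labels = ["spam", "spam", "ham", "ham"]
val_docs = ["claim free money", "monday meeting report"]
val_labels = ["spam", "ham"]

results = grid_search((train_docs, train_labels), (val_docs, val_labels),
                      feature_sizes=[2, 15], alphas=[0.5, 1.0])
best = max(results, key=lambda row: row["f1"])
print(best["max_features"], best["alpha"], best["f1"])
```

On this toy data the two-token vocabulary cannot separate the classes, while the full vocabulary can, which is the kind of feature-size effect the study sets out to measure. A real experiment would replace the toy lists with a held-out split of the corpus and would repeat the search for each classifier under comparison.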

