Estimating the likelihood of rare events poses a significant challenge for classification in machine learning. The challenge arises when one target class holds the large majority of instances while the other, the minority class, holds very few; this is known as the class imbalance problem. It degrades the accuracy and overall performance of machine learning methods because results tend to be skewed towards the majority class, giving a false sense of accuracy while under-estimating the class of greatest interest.
Many events of interest are rare, with far fewer instances than the majority class. For example, there are more survivors than deaths, more retained customers than lapsed ones, fewer defective items than normal ones in manufacturing, and fewer fraud cases than legitimate transactions. This study reviews the class imbalance problem in machine learning. Specifically, it aims to:
o compare the performance of various machine learning techniques on data before and after applying techniques for handling the class imbalance problem; and
o compare various approaches to handling this problem, using a number of evaluation metrics.
Logistic regression (LR), support vector machine (SVM), multilayer perceptron (MLP), k-nearest neighbour (KNN) and random forest (RF) are applied to the accidental-death data to predict the likelihood of dying in an accident. The models are trained on 75% of the data and tested on the remaining 25%.
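The train/test protocol above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the accidental-death data is not reproduced here, so a small synthetic dataset stands in for it, and only two of the five classifiers are shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the accidental-death dataset (assumption,
# not the study's data): 400 instances, 5 features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 75% training data / 25% testing data, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
# Fit each model on the training split and score on the held-out split.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Stratifying the split keeps the class ratio the same in both partitions, which matters when the positive class is rare.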
Techniques employed to handle the class imbalance problem include:
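One widely used family of such techniques is resampling. As an illustration (an assumption on my part, not necessarily one of the study's listed techniques), random oversampling duplicates minority-class instances, sampled with replacement, until both classes are the same size. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data (assumption): 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 3))

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Draw extra minority indices with replacement until the classes balance.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

X_bal, y_bal = X[idx], y[idx]  # balanced training set: 90 of each class
```

The mirror-image approach, random undersampling, instead discards majority-class instances; dedicated libraries such as imbalanced-learn also provide synthetic oversamplers like SMOTE.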
We consider a number of evaluation metrics: accuracy, precision, recall, F1 score, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Accuracy is the proportion of instances a classifier labels correctly. Recall is the classifier's ability to identify the positive class: the proportion of actual positives that are predicted positive. Precision is the proportion of instances predicted positive that are actually positive. The F1 score is the harmonic mean of precision and recall. The ROC curve plots the true-positive rate against the false-positive rate across classification thresholds, and the AUC, the area under this curve, equals the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one; values closer to 1 indicate better models.
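The threshold-based metrics above all derive from the four cells of the confusion matrix, which a short self-contained sketch makes concrete (the example labels are illustrative, not the study's results):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)          # all correct / all instances
    precision = tp / (tp + fp) if tp + fp else 0.0  # predicted positives that are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # real positives that are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean
    return accuracy, precision, recall, f1

# Example: 4 actual positives, 4 actual negatives.
acc, prec, rec, f1 = binary_metrics(
    [1, 1, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 1],
)
# acc = 0.75, prec = 0.75, rec = 0.75, f1 = 0.75
```

Under heavy imbalance, accuracy alone can look high even when recall on the minority class is poor, which is why precision, recall, F1 and AUC are reported alongside it.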