
Deep Learning for Outlier Detection

Updated: May 13

The purpose of this blog post is to illustrate how effective deep neural networks, and more specifically deep autoencoders, are at identifying outliers, which in this scenario are frauds. The dataset of this use case can be found here.





What is an Auto-encoder and why do we need it?


An autoencoder is a type of artificial neural network used in unsupervised learning, which means it doesn't need labeled data to train. At its core, an autoencoder is designed to learn a compressed, efficient representation of input data. It consists of two main parts: an encoder and a decoder.


The encoder takes the input data and converts it into a compressed representation, often referred to as a "code" or "latent space". This compressed representation captures the most important features of the input data. The decoder then takes this compressed representation and reconstructs the original input data as closely as possible.


The key idea behind autoencoders is to force the network to learn a compressed representation of the input data in such a way that it can accurately reconstruct the original data from this compressed representation. By doing so, the autoencoder learns to capture the essential features of the input data while discarding unnecessary details. This compressed representation can then be used for various tasks such as data compression, denoising, dimensionality reduction, or even generation of new data samples similar to the training data.


Let's have a look at what an autoencoder neural network looks like.


Autoencoder architecture

On the left-hand side are the input layer and the encoder; on the right-hand side are the decoder and the output layer. This structure forces the model to learn good representations in the latent space by progressively reducing the number of units per layer in the encoder, down to the bottleneck. The decoder then tries to reconstruct the original input from the compressed representation it receives from the latent space.
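To make the encoder/bottleneck/decoder structure concrete, here is a minimal sketch in plain NumPy. It is not the deep model used later in this post: it assumes a single hidden layer with a tanh encoder and a linear decoder, and the names `train_autoencoder` and `reconstruct`, the toy data, and all hyperparameters are illustrative choices only.

```python
import numpy as np


def train_autoencoder(X, n_latent=2, lr=0.1, n_epochs=300, seed=0):
    """Fit a one-hidden-layer autoencoder with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # The bottleneck (n_latent) is narrower than the input (n_features).
    params = {
        "W_enc": rng.normal(0.0, 0.1, size=(n_features, n_latent)),
        "b_enc": np.zeros(n_latent),
        "W_dec": rng.normal(0.0, 0.1, size=(n_latent, n_features)),
        "b_dec": np.zeros(n_features),
    }
    losses = []
    for _ in range(n_epochs):
        # Forward pass: input -> latent code -> reconstruction.
        Z = np.tanh(X @ params["W_enc"] + params["b_enc"])
        X_hat = Z @ params["W_dec"] + params["b_dec"]
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ params["W_dec"].T) * (1.0 - Z ** 2)  # tanh derivative
        grads = {
            "W_dec": Z.T @ g_out,
            "b_dec": g_out.sum(axis=0),
            "W_enc": X.T @ g_pre,
            "b_enc": g_pre.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params, losses


def reconstruct(X, params):
    """Encode X into the latent space, then decode it back."""
    Z = np.tanh(X @ params["W_enc"] + params["b_enc"])
    return Z @ params["W_dec"] + params["b_dec"]


# Toy data (an assumption for illustration): 8 features driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

params, losses = train_autoencoder(X, n_latent=2)
X_hat = reconstruct(X, params)
print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

A real deep autoencoder would stack several such layers and would normally be built with a deep learning framework; the point here is only that the 8 input features are squeezed through 2 latent units and then expanded back.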


Autoencoders have numerous applications across various domains. In image processing, they can be used for tasks like image denoising (denoising autoencoders), image compression, or even generating new images (variational autoencoders). In natural language processing, autoencoders can learn efficient representations of text data for tasks like text summarization or language translation (transformer encoders).


Additionally, autoencoders have found applications in anomaly detection, where they can reconstruct normal data accurately but fail to reconstruct anomalies, thus flagging them as outliers. Such a scenario will be investigated using the fraud detection dataset below.


Overall, autoencoders are powerful tools for learning compact representations of complex data, enabling a wide range of applications in machine learning and artificial intelligence.


Usecase: Develop an AI Fraud Detection System with Limited Fraud-Labeled Data


Fraud detection in AI entails the utilization of algorithms to identify deceptive or malicious activities within datasets, especially prevalent in financial transactions or cybersecurity. This task is inherently challenging due to the evolving nature of fraudulent tactics, ranging from sophisticated schemes to subtle anomalies that evade traditional detection methods.


Moreover, distinguishing fraudulent behavior from legitimate activities often requires navigating complex patterns and correlations within vast amounts of data. Outlier detection serves as a crucial component in this domain, as it targets instances deviating significantly from the norm, potentially indicating fraudulent behavior. By leveraging outlier detection techniques alongside advanced AI models, such as machine learning and deep learning algorithms, fraud detection systems can effectively adapt to emerging threats and enhance their accuracy in identifying suspicious activities.


For this usecase, we will use the Bank Account Fraud Dataset Suite (NeurIPS 2022). The Bank Account Fraud (BAF) suite of datasets, unveiled at NeurIPS 2022, consists of six synthetic tabular datasets representing various types of bank account fraud. BAF stands out as an authentic, comprehensive, and resilient platform for testing both new and existing methods in machine learning (ML) and fair ML, marking a pioneering initiative in the field. These datasets are designed to be realistic, drawn from present-day real-world data, and intentionally biased, each featuring controlled types of bias. They are also characterized by imbalanced settings, with a notably low prevalence of positive class instances.


BAF dataset characteristics

#Instances: 1,000,000
#Features: 32
#Categorical features: 4
Class imbalance ratio: 1:90 (fraud:non-fraud)
0
#Empty values: 0

Looking at these characteristics, one can see why fraud detection, like outlier detection in general, is difficult: very few labeled examples of the class of interest are available (about 11K in this scenario).
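As a quick sanity check on the 11K figure, the arithmetic can be sketched as follows. It assumes the ratio is exactly 1:90 over 1,000,000 rows, so the result is only an approximation of the actual count in the dataset.

```python
n_rows = 1_000_000
fraud_parts, legit_parts = 1, 90  # assumed exact 1:90 ratio

# One fraud for every 90 legitimate entries means 1 out of every 91 rows.
n_fraud = n_rows * fraud_parts / (fraud_parts + legit_parts)
fraud_rate = n_fraud / n_rows

print(round(n_fraud), f"{fraud_rate:.2%}")  # prints: 10989 1.10%
```

In other words, only about 1.1% of the entries carry the label we care about.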


Let's first tackle this as a supervised learning problem: we fit a state-of-the-art classifier, XGBoost (version 2.0, the latest at the time of writing), and evaluate it with Precision-Recall (PRC) curves rather than Receiver Operating Characteristic (ROC) curves.


Note that: ROC curves plot the true positive rate (TPR) against the false positive rate (FPR), showing the trade-off between sensitivity and specificity at various threshold settings. PRC curves, on the other hand, plot precision against recall, focusing on the trade-off between positive predictive value and sensitivity. While ROC curves are suitable for balanced class distributions and emphasize the true negative rate, PRC curves are more informative for imbalanced datasets and prioritize the positive class's performance.
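The difference between the two metrics can be sketched on synthetic scores. The snippet below assumes a 1:90 class ratio and Gaussian scores, both made up for illustration; `roc_auc` and `average_precision` are small hand-written helpers (average precision being one common estimate of the area under the PR curve), not the evaluation code behind the figures in this post.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive example."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order] == 1
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


# Synthetic, imbalanced scores (assumption): positives score higher on average.
rng = np.random.default_rng(0)
n_neg, n_pos = 9000, 100
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])

auc = roc_auc(y, scores)
ap = average_precision(y, scores)
print(f"ROC AUC = {auc:.3f}, average precision = {ap:.3f}")
```

On data like this the ROC AUC tends to look comfortable while the average precision stays much lower, because even a small false positive rate translates into many false alarms when negatives outnumber positives 90 to 1.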


Before using the dataset for evaluation, the categorical columns are converted to one-hot encoded representations. Since the dataset is imbalanced, we report results from stratified 5-fold cross-validation. To avoid data leakage (a major mistake in many performance evaluations), normalization is performed within each fold: the scalers are fitted on the training portion only and then applied to the test portion before the prediction phase. For XGBoost, the number of estimators is set to 100.
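The leakage-free evaluation loop described above can be sketched in plain NumPy. The helper names `stratified_folds` and `scale_within_fold` and the toy data are assumptions for illustration; the classifier itself is left out.

```python
import numpy as np


def stratified_folds(y, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve the class ratio."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        # Shuffle each class separately, then deal its rows out round-robin.
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for k in range(n_splits):
        test_mask = fold_of == k
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


def scale_within_fold(X_train, X_test):
    """Standardize both parts using statistics from the training part only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std


# Toy imbalanced data (assumption): 10 positives among 910 rows.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(910, 3))
y = np.zeros(910, dtype=int)
y[:10] = 1

fold_summaries = []
for train_idx, test_idx in stratified_folds(y, n_splits=5):
    X_train, X_test = scale_within_fold(X[train_idx], X[test_idx])
    # A classifier would be fitted on X_train and evaluated on X_test here.
    fold_summaries.append((len(test_idx), int(y[test_idx].sum())))

print(fold_summaries)
```

Every test fold receives the same share of positives, and the test rows never influence the scaling statistics. In practice, scikit-learn's `StratifiedKFold` and `StandardScaler` cover the same two steps.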


Let's examine the PRC curves.

Model's performance in terms of AUPRC

The area under the PRC curve is 16%. That is not good; as a matter of fact, it would be catastrophic for any business that depends on detecting fraudulent behavior. This is also why ROC was not used here: it is a less informative metric in the presence of class imbalance.


Instead of applying a classifier directly, let's try a more sophisticated approach. As we will see, it can operate with limited labeled data because it relies on unsupervised methods to distinguish outliers (frauds). We will begin with a simple architecture and add complexity while observing the effect on the AUC-PRC.


Since a picture is worth a thousand words, let's have a look at the designed architecture.


System's Architecture

The system consists of multiple autoencoders whose job is to learn very good representations of non-fraudulent entries; therefore, they are trained only on non-fraudulent data. Each autoencoder is trained on a separate subset of the features, similar to a bagging technique, which helps make the ensemble robust. After all autoencoders are trained, we have an ensemble of N models, which we use to compute the reconstruction error of incoming entries.

The idea is that, having learned to reconstruct non-fraudulent data with minimal reconstruction error, an autoencoder will produce a higher reconstruction error when applied to a fraudulent entry. For each entry, the ensemble generates a vector of N reconstruction errors, each corresponding to an autoencoder that was trained on a separate set of features. This vector can be used to train a classifier on the exact same classification problem as before, but with a completely different set of features (or, as in this case, in addition to the original feature space).
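The mechanics of turning an ensemble into a vector of N reconstruction errors can be sketched as follows. To keep the example short and dependency-free, each ensemble member is a linear autoencoder fitted in closed form with an SVD (equivalent to PCA), which is a stand-in for the deep autoencoders of this post. The function names, the subset size, and the synthetic data are all assumptions.

```python
import numpy as np


def fit_linear_ae(X, n_latent):
    """Closed-form linear autoencoder: keep the top principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_latent]


def reconstruction_error(X, model):
    """Per-row mean squared error after encoding and decoding."""
    mean, components = model
    centered = X - mean
    X_hat = centered @ components.T @ components
    return np.mean((centered - X_hat) ** 2, axis=1)


def fit_ensemble(X_legit, n_models=10, subset_size=4, n_latent=2, seed=0):
    """Train each member on a random feature subset of non-fraudulent rows."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_models):
        cols = rng.choice(X_legit.shape[1], size=subset_size, replace=False)
        ensemble.append((cols, fit_linear_ae(X_legit[:, cols], n_latent)))
    return ensemble


def error_features(X, ensemble):
    """Stack one reconstruction-error column per ensemble member."""
    return np.column_stack(
        [reconstruction_error(X[:, cols], model) for cols, model in ensemble]
    )


# Synthetic data (assumption): legitimate rows follow a 2-factor structure,
# "fraud" rows do not.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 8))
X_legit = rng.normal(size=(1000, 2)) @ mixing + 0.1 * rng.normal(size=(1000, 8))
X_fraud = rng.normal(size=(20, 8))

ensemble = fit_ensemble(X_legit, n_models=10)
E_legit = error_features(X_legit, ensemble)
E_fraud = error_features(X_fraud, ensemble)
print(E_legit.shape, E_legit.mean(), E_fraud.mean())
```

Each row of the resulting matrix is the N-dimensional error vector for one entry. That matrix, alone or concatenated with the original features, is what would be passed to the downstream classifier.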


The question that pops up is: how many autoencoders do we actually need? To answer that we will have to experiment a bit to see the performance with respect to N (number of autoencoders). Below we report on the performance in terms of AUC-PRC in a sample set using 5-fold stratified sampling. The employed model for training on the updated feature space is again an XGBoost classifier of 100 trees.


Ensemble's (10 AEs) performance in terms of AUPRC on test set

For N equal to 10, we already see a tremendous improvement (around 50%) over the previous standalone XGBoost model; however, we can do better.


Ensemble's (100 AEs) performance in terms of AUPRC on test set

With N equal to 100, it starts looking much better, around 21% better than before! Now we may get excited and want to scale up to 1,000 autoencoders, but at this point it becomes a matter of resources. I would recommend it only if you are willing to spend a couple of hundred dollars to discover where the improvement plateaus.



Conclusion


In conclusion, the autoencoder ensemble has proven to be a formidable tool for detecting fraudulent entries, substantially improving the AUPRC over the standalone classifier. By utilizing the power of unsupervised learning, the ensemble captures intricate patterns and anomalies within the data, enabling proactive identification of fraudulent activities. Another direction in this area is class-imbalance learning, which aims to boost the minority class in order to reduce the balanced error rate. Such cases will be examined in future posts.



** If you are interested in other ML use-cases, please contact me using the form (and also include a publicly available dataset for this case, I'm always curious to explore new problems).
