Group 6: Bank Fraud Detection

COMP 542 Machine Learning Fall 2021

12/21/2021

Members:

Fraudster

Objective Goal

Fraud, including identity theft, telemarketing fraud, and bank fraud, is a serious problem for the telecommunications and banking industries. Such fraud costs millions in lost revenue each year, and data mining can be used to improve fraud detection. By applying data mining techniques, we can reduce fraudulent transactions by building a profile of each customer’s calling behavior or account activity. By comparing new activity against that profile, we can look for outliers in the data and block them to prevent fraud. Our business will be successful once we are able to detect outliers in our consumers' activities and label them as fraudulent. If a consumer makes a purchase that is an outlier, our system will flag that activity. If a consumer's identity is used outside of their normal scope of activity, our system will flag it as fraudulent. Our criterion for a successful outcome will be met when we correctly detect outliers in consumers' purchases more than 80% of the time.

Background Information

A dataset containing synthetic bank payment data was acquired from Kaggle.com. The dataset was in the form of a .csv file of size 48MB, bs140513_032310.csv. The data contained were generated by BankSim, an agent-based simulator developed by Edgar Rojas and Stefan Axelsson. BankSim was developed using a sample subset of real transactional data aggregated from a larger population provided by a bank in Spain. The dataset was readily accessed from Kaggle.com by signing up for an account.

The BankSim synthetic data is structured, tabular data composed of both categorical and quantitative data elements. Nominal and ordinal data types are both present in the categorical fields. The data is also sequential: the transactions made by an account at different points in time heavily influence whether a transaction is flagged as fraudulent or not.

The generated datasets contain no PI (personal information) or disclosure of legal and private customer transactions, making this data suitable for research purposes. The timeframe of this data covers 6 months, from November 2012 to April 2013 and is restricted to zipcodes of Madrid and Barcelona. There are 15 merchant categories that differentiate between payments made and all prices are given in euro. Zip Code One (ZC1) refers to one of the biggest zip codes by payment volume and is the only zip code available in our dataset.

Novelty detection algorithms have previously been used on synthetic data to demonstrate the performance of outlier detection. Fraudsters and threat agents adapt their behavior to avoid the account controls set by financial institutions and legislation, e.g. by making smaller transactions that fall just below an alert threshold. There is a lack of data available for research in fields such as money laundering, financial fraud, and illegal payments, which leads to in-house solutions that are not shared with the public. Real data also tends to contain too little fraudulent activity, with too little variety, to build a machine learning model from.

During a normal step of the simulation, a customer that enters the simulation can decide to purchase an item or service from one of the offered categories. Once the category has been selected, the customer senses nearby merchants that offer that category and listens to their offers. If an offer is accepted, the transaction takes place and the merchant registers the payment. Each step in the simulation represents a day of commercial activity. Currently, the data set does not differentiate between the different days of the week to feed the consumption pattern; all days of the week are treated the same.

BankSim ran for 180 steps several times and parameters were calibrated to obtain a distribution reliable for testing. Thieves were injected with the aim of stealing an average of 3 cards per step and performing 2 fraudulent transactions per day. As a result, 594643 records were produced, of which 7200 were fraudulent transactions. That is a fraud rate of about 1.2% (7200 / 594643), which is very high for payment data; the simulation was intentionally programmed to produce an aggressive fraud behavior rate while honoring the distributions found in the original data.

Design Principle

Our project follows the design principle of the CRISP-DM design model. This model can be outlined as follows: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Determine Business Objectives

Background

The goal of this project is for us to learn about the concepts, principles, techniques, and applications of machine learning. Secondary goals include learning Python.

Business Objectives

Fraud, including identity theft, telemarketing fraud, and bank fraud, is a serious problem for the telecommunications and banking industries. Such fraud costs millions in lost revenue each year, and data mining can be used to improve fraud detection. By applying data mining techniques, we can reduce fraudulent transactions by building a profile of each customer’s calling behavior or account activity. By comparing new activity against that profile, we can look for outliers in the data and block them to prevent fraud.

Business Success Criteria

Our business will be successful once we are able to detect outliers in our consumers' activities and label them as fraudulent. If a consumer makes a purchase that is an outlier, our system will flag that activity. If a consumer's identity is used outside of their normal scope of activity, our system will flag it as fraudulent.

Terminology

Costs and Benefits

There would be some cost involved for the business, but the reward outweighs the cost. If the project is successful, the business would benefit by being able to protect its customers from fraudulent activity on their accounts. More users would want to use the business because of the added security.

ML Goals

The first goal is to increase the user count of participating banks by protecting their users from fraud. The second goal is to gather each user's information and determine whether there are any fraudulent transactions based on the location, amount, and frequency of the user's purchases. We will use outlier detection to find the fraudulent transactions.

ML Success Criteria

Our criterion for a successful outcome will be met when we correctly detect outliers in consumers' purchases more than 80% of the time.

Project Plan

Phase                   Time     Resources     Risks
Business understanding  1 week   All analysts  Data problems, technology problems
Data understanding      3 weeks  All analysts  Data problems, technology problems
Data preparation        5 weeks  All analysts  Data problems, technology problems
Modeling                2 weeks  All analysts  Coding problems, technology problems
Evaluation              1 week   All analysts  Poor model results, inability to produce findings
Deployment              1 week   All analysts  Inability to implement models, no deployment launch

Initial Assessment of Tools and Techniques

We rely on Python's open source libraries and associated packages:

We also rely on open source datasets available on the internet to acquire bank transaction data. In our particular case, the data was synthetic. We then prepared our data to create predictive classification models that determine whether new data is valid or represents a potential fraudulent transaction based on its outlier profile.

Data Exploration / Processing

The data for our Bank Fraud Checker system consists of synthetic, i.e. generated, data based on real but limited bank transaction data from an unidentified financial institution in Barcelona, Spain.

The data for our model is downloaded in the form of a csv file and imported into a pandas dataframe to facilitate data analysis.

As you can see below, the data consists of over 500,000 records with fully populated cells across 10 fields. Due to the lack of diversity in zip code data, those columns were eliminated from our analysis.
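The loading step described above can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the column names `zipcodeOri` and `zipMerchant` are assumptions about how the zip code fields are named in the CSV, and the two inline rows are made-up stand-ins for the real file.

```python
import io

import pandas as pd

# Two made-up rows standing in for bs140513_032310.csv. The column names
# (in particular zipcodeOri and zipMerchant) are assumed, not confirmed.
SAMPLE_CSV = io.StringIO(
    "step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud\n"
    "0,C1,4,M,28007,M1,28007,es_transportation,4.55,0\n"
    "0,C2,2,F,28007,M2,28007,es_food,39.68,0\n"
)


def load_transactions(source):
    """Read the transactions and drop the zip code columns, which hold a single value."""
    df = pd.read_csv(source)
    return df.drop(columns=["zipcodeOri", "zipMerchant"])


transactions = load_transactions(SAMPLE_CSV)
print(transactions.shape)               # (rows, remaining columns)
print(transactions.isna().sum().sum())  # number of empty cells
```

With the real data, `load_transactions` would be called with the path to the downloaded CSV file instead of `SAMPLE_CSV`.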

Data Visualizations

Class Imbalance

As seen below in the histogram breakdown of our label feature, fraud, the stark imbalance between the number of fraudulent transactions and the number of valid transactions poses serious issues.

An imbalanced dataset is a dataset where classes are distributed unequally, as with our fraud field.

The fraudulent transactions constitute only about 1% of our entire dataset. The ratio of fraudulent to valid transactions is roughly 1:80.

This imbalanced data can potentially create problems with our classification task. The model we create will be biased toward valid transactions.
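The size of the imbalance can be worked out directly from the record counts reported in the Background Information section. The short calculation below uses only those two numbers; the variable names are ours.

```python
# Counts reported for the BankSim run described in this report.
TOTAL_RECORDS = 594643
FRAUD_RECORDS = 7200

valid_records = TOTAL_RECORDS - FRAUD_RECORDS
fraud_share = FRAUD_RECORDS / TOTAL_RECORDS      # fraction of all records that are fraud
valid_per_fraud = valid_records / FRAUD_RECORDS  # valid transactions per fraudulent one

print(f"fraud share: {fraud_share:.2%}")
print(f"valid transactions per fraud: {valid_per_fraud:.1f}")
```

One consequence of this ratio is that a model labelling every transaction as valid would still be right for nearly 99% of the records, which is why the bias toward valid transactions is a concern.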

Distribution of Numeric Data

The only continuous numeric field is amount, which contains the value (in Euros) of the purchases made by customers from merchants. When plotted, the distribution of this data shows it is highly skewed to the right (positive skew).

Bank Data Transaction Frequency

Looking at the frequency of transactions by day, or “step”, broken down by “fraud”, we see a spike in valid transactions every 30 days or so, but otherwise no telling trend in how many fraudulent transactions occur on any given day. We are not provided with the time of day, day of the week, or month of the year to help us derive any patterns in the occurrence of fraudulent transactions.

Typically, this time/date data would be very helpful because seasonality plays a significant role in when a fraudster engages in fraudulent activity, choosing to act when both the likelihood of getting caught and the likelihood of financial transactions being closely scrutinized are low.

What we can tell from this chart is that valid and fraudulent transactions are occurring daily with the same proportionality (roughly 1:83) kept consistent throughout the data generation period (180 steps).

Bank Data Transaction Average Over Time

Taking another view of the daily transactions, we see in the graphs below the “amount” of each transaction plotted against the step. The fraudulent transactions are by and large much higher in “amount” than the valid transactions. While we do see some valid transactions reaching up to the €2000 mark, most valid transactions remain below the €200 average line we see in the bottom graph. The average “amount” of the fraudulent transactions fluctuates wildly around the €600 mark as the steps progress.

Bank Data Transaction Scatterplot Over Time

From a visual inspection, then, it is clear that the majority of fraudulent transactions are the higher “amount” purchases. The heavy skewness observed earlier in the “amount” histogram was driven by fraudulent transactions, which therefore represent true outliers. If a rudimentary fraud alert system were based on a simple “amount” threshold, it would work fairly well for our given dataset.

Apart from the “amount” field, our dataset primarily contained categorical data. A look at “age” showed the age group where the most fraudulent transactions occurred was group 2, or people aged between 26 and 35 years old, followed by group 3 (36-45 years) then 4 (46-55 years). All age groups contained fraudulent transactions.

Transaction Frequency and Average By Age

The average fraudulent “amount” spent across all age groups floats around the €500 mark. However, given the very low counts for both valid and fraudulent transactions for age groups 0 (18 years and under) and U (unknown), the average fraud “amount” should immediately raise questions. Notwithstanding the stereotype of youth being impetuous high spenders, their lack of access to large amounts of money makes the high average “amount” of transactions under their category stand out especially given the small number of transactions within that category.

Transaction Frequency and Average by Gender

The same anomalous behavior can also be seen with the “Enterprise” category in the “gender” field. There are only 7 counts of fraud in that category, but the average fraud “amount” is greater than for the “male” and “female” categories, which have thousands of records. While it may be true that company expenditures are naturally much larger than those of individuals, the types of purchases that we will examine below don’t account for it. As with some purchase types, there are no fraudulent transactions for the “Unknown” gender type.

Transaction Frequency and Average by Purchase Category

There are 15 purchase types identified from the “category” field. Of those, “es_transportation”, “es_contents”, and “es_food” do not contain any fraudulent transactions. The fact that the “es_transportation” category has over 500,000 records is testament to the class imbalance problem mentioned earlier in this paper. Of the remaining categories, the “travel” category takes the lion's share of fraudulent transactions. The average “amount” for this purchase category is over €2500.

Binned Scatter Plot

Even though the number of fraudulent transactions in the travel category was small, it had the fraudulent transactions with the highest expenditures across age and gender.

Alluvial Chart

We created an alluvial chart which shows the dataset color coded by fraud (orange for fraud and blue for valid). Each transaction is represented by a stream flowing from the gender type to the associated age of the customer. From there the transaction stream flows to the type of purchase that was made. Because the data streams are very fine and intersect multiple times, we do not see any trend in this view. This speaks to the high entropy in our dataset and to why a Decision Tree would be better suited than traditional data mining to derive insights and patterns from the bank transaction data.

Analyses

Average Frequency of Transactions per Customer and per Merchant

The average number of customer transactions over the 180 day period was 143 compared to the 11,749 transactions per merchant. The average occurrence of fraud per customer over the 180 day period was 2 while the average occurrence of fraud per merchant was 144. The average customer expenditure during this period was €35 while the average merchant revenue was €130. The average amount stolen per customer during this period was €198 while the average amount defrauded from merchants was €357.

Customer and Merchant Hash Tables Containing Individual Transaction Data

A hash table was created to store how many fraudulent and valid transactions occurred per customer along with the purchase and stolen amounts; the same was done for merchants. From these tables, we could derive the average purchase and the average theft per customer and per merchant. Of the 4112 individual customers, 1483 of them (36%) experienced theft by a fraudster. Of the 50 individual merchants, 30 of them (60%) experienced fraud.
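One way such a table could be built is sketched below with a Python dictionary keyed by customer or merchant ID. The field names and the four sample transactions are hypothetical and only illustrate the bookkeeping; this is not the project's actual code.

```python
from collections import defaultdict


def new_entry():
    """Empty per-customer (or per-merchant) record."""
    return {"valid_count": 0, "fraud_count": 0, "valid_amount": 0.0, "fraud_amount": 0.0}


def build_table(transactions, key):
    """Group transactions by `key` and accumulate counts and amounts by label."""
    table = defaultdict(new_entry)
    for tx in transactions:
        entry = table[tx[key]]
        kind = "fraud" if tx["fraud"] else "valid"
        entry[kind + "_count"] += 1
        entry[kind + "_amount"] += tx["amount"]
    return table


def average(entry, kind):
    """Average amount of the valid or fraudulent transactions in one entry."""
    count = entry[kind + "_count"]
    return entry[kind + "_amount"] / count if count else 0.0


# Made-up transactions for illustration only.
sample = [
    {"customer": "C1", "merchant": "M1", "amount": 20.0, "fraud": 0},
    {"customer": "C1", "merchant": "M2", "amount": 40.0, "fraud": 0},
    {"customer": "C1", "merchant": "M2", "amount": 300.0, "fraud": 1},
    {"customer": "C2", "merchant": "M1", "amount": 10.0, "fraud": 0},
]

by_customer = build_table(sample, "customer")
by_merchant = build_table(sample, "merchant")

# Number of customers with at least one fraudulent transaction.
victims = sum(1 for entry in by_customer.values() if entry["fraud_count"] > 0)
print(victims, average(by_customer["C1"], "valid"), average(by_customer["C1"], "fraud"))
```

The same `build_table` function serves both tables; only the grouping key changes.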

Source of Transactions

In our entire set of half a million data points, the data is made up of transactions between 4112 customers and 50 merchants.

Looking through the merchant hash table, there were merchants that experienced more fraudulent transactions than valid transactions. Cross-referencing the merchant ID with the purchase category, we were able to determine that fraud is more prevalent among merchants associated with leisure, travel, hotel services, and sports and toys.

Average Transactions and Expenditures by Customer

The average number of valid transactions per customer was 143, while the average number of frauds within the same time period was 2 fraudulent transactions. The average amount of valid transactions was only 35 euro, while the average fraudulent purchase amount was nearly 200 euro.

Merchant Sale and Fraudulent Transactions

We created a table containing each merchant ID with features including the list of each valid transaction, the list of fraudulent transactions, the number of those transactions and the average amount of those transactions. We notice a pattern wherein valid transactions are much higher in volume but represent much lower purchase amounts on average, whereas the opposite trend is observed for fraudulent transactions per merchant.

Average Transactions and Expenditures by Merchant

The average number of valid transactions per merchant was 11750, while the average number of frauds within the same time period was 144 fraudulent transactions. The average amount of valid transactions was only 130 euro, while the average fraudulent purchase amount was nearly 360 euro.

Most Exploited Merchant

The most exploited merchants, defined as those having more fraudulent transactions than valid ones in our data set, are represented by the following categories:

Statistics

Amount Distribution Shape

When analyzing the skewness and kurtosis of the full “amount” data, we find we have an excess kurtosis of 1425.31 and a skewness of 32.37 (for perspective, both values are 0 for a normal distribution).
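As an illustration of how these two shape statistics can be obtained, the sketch below uses the pandas `Series.skew` and `Series.kurt` methods, the latter of which reports excess kurtosis. The ten amounts are made up, with one very large value to mimic the long right tail, so the printed values will not match the figures quoted above.

```python
import pandas as pd

# Made-up amounts in euro: mostly small purchases plus one very large one.
amounts = pd.Series([10, 12, 15, 20, 25, 30, 35, 40, 45, 8000], dtype=float)

skewness = amounts.skew()         # > 0 indicates a long right tail
excess_kurtosis = amounts.kurt()  # 0 for a normal distribution

print(f"skewness: {skewness:.2f}")
print(f"excess kurtosis: {excess_kurtosis:.2f}")
```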

Amount Descriptive Statistics

Looking at the statistical descriptors of the amount field below, we see that of the 594643 records, the mean transaction “amount” is roughly €38. The minimum purchase is €0 and the maximum is €8330. The skewness is also apparent when noting the 50th percentile of all those records is only €27. Given the maximum value and the number of transactions, the presence of a positive skew is obvious. The kurtosis is also apparent when noting the 75th percentile of all those records is only €43. Again, considering the maximum value and the number of transactions, the presence of an extreme kurtosis is evident.

Density Plot

A density plot visualizes the distribution of data over a continuous interval or time period. It is similar to a histogram but uses kernel smoothing to plot values, which smooths out the noise and produces a smoother distribution. Density plots have a few advantages over histograms; for example, they are better at showing the distribution shape because they are not affected by the number of bins used. The vast majority of the transactions fall within the 10 to 100 euro range, consistent with our extreme skewness and kurtosis.

Outlier Detection

Using IQR Method

We wanted to see if we could accurately identify fraudulent transactions by looking at outliers. We first tried calculating outliers using the IQR method. The IQR is about 30 euro on average. With this method any transaction above 85 euro is considered an outlier, which gives us too many false positives.
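A minimal sketch of the IQR rule follows. It assumes the conventional multiplier of 1.5 on the IQR, which the report does not state explicitly, and it runs on ten made-up amounts rather than the BankSim data.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed convention."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return mask, lower, upper


# Made-up amounts in euro.
amounts = [5, 10, 15, 20, 25, 30, 35, 40, 45, 500]
mask, lower, upper = iqr_outliers(amounts)
print(lower, upper, mask.tolist())
```

Under the same assumption, the report's own figures give an upper fence of about 43 + 1.5 × 30 = 88 euro (using the 75th percentile of €43), which is in line with the cutoff of roughly 85 euro quoted above.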

Using Z-Score Method

Since the IQR method did not work, we tried using the Z-score method with a 1.5 standard deviation threshold. This gave us better results with only 141 false positives.
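The Z-score rule can be sketched in a few lines. The 1.5 standard deviation threshold comes from the text above; flagging on the absolute value of the score and using the population standard deviation are our assumptions, and the amounts are made up.

```python
import numpy as np


def zscore_outliers(values, threshold=1.5):
    """Flag values whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # population standard deviation
    return np.abs(z) > threshold, z


# Made-up amounts in euro.
amounts = [5, 10, 15, 20, 25, 30, 35, 40, 45, 500]
mask, z = zscore_outliers(amounts)
print(mask.tolist())
print(np.round(z, 2).tolist())
```

Because the mean and standard deviation are themselves pulled upward by the extreme values, this rule tends to flag only the far tail, which may explain why it produced fewer false positives than the IQR fence.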

Outlier Method Comparison

In determining outliers, the z-score method outperformed the IQR method by orders of magnitude.

Label Encoded Legends

We had to encode our categorical features for correlation analysis and data preparation for our modeling. Below are the code descriptions for our categorical feature values.
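One common way to produce such an encoding and its legend is scikit-learn's `LabelEncoder`, sketched here on a handful of gender values. The single-letter codes `F`, `M`, `E`, and `U` are assumed stand-ins for the female, male, enterprise, and unknown categories, and we do not claim this is the exact encoder the project used.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed raw gender values: female, male, enterprise, unknown.
genders = ["F", "M", "E", "U", "F", "M"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# Legend mapping each original label to its integer code (labels are sorted).
legend = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(codes.tolist())
print(legend)
```

`LabelEncoder` assigns codes in sorted label order, so the legend should be kept alongside the encoded data to interpret the model inputs later.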

Pairplots showing Correlations Between Numeric Features

Correlation Heat-Map

A correlation matrix represents the correlations between pairs of variables in the given data. The correlation coefficient is the number that denotes the strength of the relationship between the two variables. The plot shown above is a cool warm heat map that shows the strength of the relationship with colors, blue for weak correlation and red for strong correlation. We can see that when the plot compares the same two variables the correlation coefficient is 1. The only variable pair that has a significant correlation is fraud and the amount, in which the correlation coefficient is 0.49. This tells us that whether a transaction is fraudulent or not is related to the amount of the transaction.
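For reference, a correlation matrix of this kind can be computed with `DataFrame.corr`, which uses the Pearson coefficient by default. The six rows below are made up, so the resulting coefficients are not the ones reported above.

```python
import pandas as pd

# Made-up, already-encoded rows for illustration only.
df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 900.0, 25.0, 1200.0],
    "age":    [1, 2, 3, 2, 4, 3],
    "fraud":  [0, 0, 0, 1, 0, 1],
})

corr = df.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it is the table that a heat map colors cell by cell.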

Chi-Squared Test

A chi-square test is used in statistics to test the independence of two variables. Chi-square measures how far the expected counts and the observed counts deviate from each other. We use this test to determine the relationship between each independent categorical feature (our feature set) and the dependent categorical feature (our label). In feature selection, we aim to select the features which are highly dependent on the label. We determined that fraud was dependent upon each feature.
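A test of this kind can be sketched with `pandas.crosstab` and `scipy.stats.chi2_contingency`. The two-category data below is fabricated purely to show the mechanics, and the 0.05 significance level is an assumed convention rather than a value taken from the project.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Fabricated data: 100 "travel" purchases (80 fraudulent) and 100 "food" purchases (none fraudulent).
df = pd.DataFrame({
    "category": ["travel"] * 100 + ["food"] * 100,
    "fraud": [1] * 80 + [0] * 20 + [0] * 100,
})

observed = pd.crosstab(df["category"], df["fraud"])  # observed counts per cell
chi2, p_value, dof, expected = chi2_contingency(observed)

ALPHA = 0.05  # assumed significance level
print(observed)
print(f"chi2={chi2:.1f}, dof={dof}, dependent={p_value < ALPHA}")
```

A p-value below the chosen level rejects the hypothesis that the feature and the label are independent, which is the sense in which a feature is "dependent" on fraud.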

Modelling

Because calculating outliers alone wasn’t the best approach to determining whether a data point was fraudulent, we decided to use machine learning algorithms for our binary classification problem. This left us free to implement and test any classification algorithm we wanted. We chose logistic regression, linear support vector machine, k-nearest neighbors, decision tree, random forest, and multi-layer perceptron classifiers. These classifiers are known to work well on binary classification problems such as ours. With only 4 to 6 features, depending on how many we include in our feature set, and only about half a million rows, these classifiers have no problem handling our data. The longest it took to fit any of these classifiers to our training data was 2 minutes.

Apart from the parameters we had to set explicitly, such as k, the number of neighbors in our KNN, or which solver to use for logistic regression, we kept the default parameter settings provided by the library. We made this decision after reading the documentation for each classifier to learn which parameters were available and how to determine their values. We used only the parameters that were relevant to our models given, for example, the type of solver chosen, and we found that the default settings worked well for our simple dataset. For the decision tree, random forest, and multi-layer perceptron classifiers, we chose to set the random_state parameter. By passing any integer, we fix the random number generator seed and so get the same results across different calls.

Feature Set and Label Preparation

We are preparing our feature set and label. We determined that the only features relevant to our class prediction are a customer's age, gender, the purchase category and purchase amount.

Training and Testing Split

We are splitting our data with a traditional 70/30 training:testing split.
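A 70/30 split of this kind can be obtained with scikit-learn's `train_test_split`. The sketch below uses placeholder arrays in place of the real feature set and label, and the `random_state` value is an assumption made only so the example is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with 4 features, and a label with 10 "fraud" rows.
X = np.arange(400).reshape(100, 4)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1  # 30% held out for testing
)
print(X_train.shape, X_test.shape)
```

With a label as imbalanced as fraud, passing `stratify=y` is an option worth considering so that both parts keep the same fraud proportion; the report does not say whether this was done.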

Selected Models

Classification Models

Logistic Regression

Logistic Regression is a type of Generalized Linear Model (GLM) that uses a logistic function to model a binary variable based on any kind of independent variables. For scikit-learn's LogisticRegression, the ‘liblinear’ solver is a good choice for small datasets. For the multi_class parameter, if the option chosen is ‘ovr’, then a binary problem is fit for each label (‘multinomial’ is unavailable when solver=’liblinear’).

Support Vector Machines

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. LinearSVC is a class of SVMs capable of performing binary and multi-class classification on a dataset. LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel.

K-Nearest Neighbor

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Some advantages of decision trees are they are simple to understand and to interpret, they can be visualised, and they require little data preparation.

Random Forest

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. A diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

Neural Networks

Neural networks are machine learning models built from layers of units (neurons), including one or more hidden layers, in which each unit applies an activation function to a weighted sum of its inputs. The Multi-layer Perceptron classifier optimizes the log-loss function using stochastic gradient descent. The default activation function for the hidden layer is 'relu', the rectified linear unit function, which returns f(x) = max(0, x).

Parameter Tuning

For our LogisticRegression model, we set solver = liblinear because the liblinear algorithm works well on smaller datasets. Next, we set multi_class = ovr to fit a binary problem for each label. For LinearSVC, we set dual = False because our number of samples is higher than our number of features, and we set max_iter = 2000 to cap the workload of our LinearSVC model. Our KNeighborsClassifier model uses the default parameters. In our DecisionTreeClassifier, RandomForestClassifier, and MLPClassifier, we set random_state = 1 to get reproducible results across multiple function calls.
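The settings described above translate into the following scikit-learn constructors. This is a sketch of the configuration only, with the dictionary name `models` being ours. The `multi_class="ovr"` argument is left out because, to our understanding, recent scikit-learn releases deprecate it, and for a binary label the liblinear solver fits a single binary problem either way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# The six classifiers with the non-default settings named in the text.
models = {
    "LogisticRegression": LogisticRegression(solver="liblinear"),
    "LinearSVC": LinearSVC(dual=False, max_iter=2000),
    "KNeighborsClassifier": KNeighborsClassifier(),  # library defaults
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=1),
    "RandomForestClassifier": RandomForestClassifier(random_state=1),
    "MLPClassifier": MLPClassifier(random_state=1),
}

for name, model in models.items():
    print(name, model.get_params().get("random_state"))
```

Keeping the estimators in one dictionary makes it straightforward to fit and score them in a single loop.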

Cross Validation

To compare the accuracy between models, we decided to go with 10-fold cross validation to produce 10 accuracy scores per model. These accuracy scores were then compared to each other, and it was determined that the MLP classifier outperformed the other classifiers, as seen in the boxplot below.
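The comparison can be sketched with scikit-learn's `cross_val_score`. To keep the example quick and self-contained it uses a synthetic dataset from `make_classification` and only two of the six classifiers; the data, the choice of two models, and `scoring="accuracy"` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature set and label.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

candidates = {
    "LogisticRegression": LogisticRegression(solver="liblinear"),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=1),
}

# Ten accuracy scores per model, one for each held-out fold.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy")
    for name, model in candidates.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

The ten scores per model are what a boxplot would summarize, one box per classifier.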

Model Performance

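For our project, we are using six different models: logistic regression, linear support vector machine, k-nearest neighbors, decision tree, random forest, and multi-layer perceptron classifiers.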

Using these different models we achieved F1 scores of 99.2% to 99.4%.

We were trying to achieve scores above 80%, which we did. We attribute the high F1 scores to limiting the data we trained on: when we cleaned up our data, we removed features that could cause our models to overfit and kept only age, gender, category, and amount.

We hypothesized that the MLP classifier, which is a neural network classifier, would perform the best, given that neural networks perform well on a wide range of data sets and data science problems. Neural networks consist of connected neurons, or nodes, in a layered structure with an input layer, hidden layers, and an output layer. With our data, the nodes of our input layer correspond to our data features: age, gender, category, and amount. Because we used the MLP classifier's default parameter settings, we have only 1 hidden layer with 100 nodes, or units. The MLP classifier optimizes the log-loss function using stochastic gradient descent, and its hidden units use an activation function, in this case the rectified linear unit function, or relu.

We fit 6 machine learning classifiers to our training data and tested them with the 30% of the data we reserved for validation, so we needed a way to compare the accuracy of each classifier’s predictions against the actual testing data. We did so by implementing ten-fold cross validation.

Assessment of Results

  1. Our MLP Classifier ranked number 1 with F1 score of 0.994.
  2. Our LogisticRegression with F1 score of 0.993 ranked second.
  3. Our KNeighborsClassifier with F1 score of 0.993 ranked third.
  4. Our LinearSVC with F1 score of 0.992 ranked fourth.
  5. Our DecisionTreeClassifier with F1 score of 0.992 ranked fifth.
  6. Our RandomForestClassifier with F1 score of 0.992 ranked last.

Demo

Using our MLP classifier, in the first block of code we are looking at a female between the ages of 19 and 25 purchasing something that falls under the sports and toys category. She is spending 5000 euro, which our machine learning algorithm detects as fraud. In the second block of code we are looking at a female in the same age range buying something in the sports and toys category. She is spending only 50 euro, and our machine learning algorithm does not flag the transaction as fraud.
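The shape of such a demo can be sketched as follows. Everything here is illustrative: the integer codes for age group, gender, and category are hypothetical, and the classifier is trained on a small toy dataset (low amounts labelled valid, high amounts labelled fraud) rather than on BankSim, so its predictions should not be read as the project's results.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical label codes; the real ones come from the encoded legends.
AGE_19_TO_25 = 1
GENDER_FEMALE = 1
CATEGORY_SPORTS_AND_TOYS = 12


def make_row(age, gender, category, amount):
    """Order the features as age, gender, category, amount."""
    return [age, gender, category, amount]


# Toy training data (not BankSim): small amounts are valid, large ones fraud.
rng = np.random.default_rng(1)
toy_amounts = np.concatenate([rng.uniform(1, 200, 100), rng.uniform(1000, 6000, 100)])
X_toy = np.array(
    [make_row(AGE_19_TO_25, GENDER_FEMALE, CATEGORY_SPORTS_AND_TOYS, a) for a in toy_amounts]
)
y_toy = np.array([0] * 100 + [1] * 100)

model = MLPClassifier(random_state=1).fit(X_toy, y_toy)

# The two demo transactions: 5000 euro and 50 euro, same customer profile.
queries = np.array([
    make_row(AGE_19_TO_25, GENDER_FEMALE, CATEGORY_SPORTS_AND_TOYS, 5000.0),
    make_row(AGE_19_TO_25, GENDER_FEMALE, CATEGORY_SPORTS_AND_TOYS, 50.0),
])
predictions = model.predict(queries)
print(predictions)  # 1 means flagged as fraud, 0 means valid
```

In the project itself, `model` would be the MLP classifier already fitted on the training split, and only the two query rows would need to be built.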

ML Success Criteria Evaluation

Our project's ML success criterion is met when we correctly detect outliers in consumers' purchases more than 80% of the time. All of our models have exceeded this criterion. Our project's business success criteria include being able to detect and flag fraudulent transactions. All models presented meet the business objectives and none were deficient. The MLP classifier achieved the highest F1 score, while the other models are relatively close. We believe the MLP classifier worked best with our data because it works well with different data types and with non-uniformly distributed data.

Most if not all banks will never have a uniformly distributed data set. If we wanted to deploy our project, we would show banks how we were able to accurately detect outliers with our limited data set. Banks would only be able to provide a limited data set, to make sure they do not violate client privacy. We are confident that we could take their banking data and plug in our machine learning algorithm to detect fraud for their bank. If they were to supply more data variables, we are confident that MLP would still return accurate results because MLP, being a neural network, is capable of handling more data types.

Future Work

Future work entails performing statistical feature selection, statistical tests of significance on model performance metrics, ANOVA analysis between model results, plotting ROC curves, and applying new datasets to our fitted model to further evaluate model prediction. The more data we feed into our classifier model, the more we are able to saturate the classifier's learning and improve upon its predictive power.