Fraud is a serious problem for industries such as telecommunications and banking, taking forms such as identity theft, telemarketing fraud, and bank fraud. Fraud costs these industries millions in revenue each year, and data mining can be used to improve fraud detection. By applying data mining techniques, we can reduce fraudulent transactions by building a profile of each customer's calling behavior or account activity. Comparing new activity against a customer's profile, we can then look for outliers in the data and flag them to prevent fraud. Our business will be successful once we are able to detect outliers in our consumers' activities and label them as fraudulent. If a consumer makes a purchase that is an outlier, our system will flag that activity. If a consumer's identity is used outside of their normal scope of activity, our system will flag it as fraudulent. Our criterion for a successful outcome will be met when we can detect outliers in consumers' purchases more than 80% of the time.
A dataset containing synthetic bank payment data was acquired from Kaggle.com in the form of a 48 MB .csv file, bs140513_032310.csv. The data were generated by BankSim, an agent-based simulator developed by Edgar Lopez-Rojas and Stefan Axelsson. BankSim was calibrated using a sample subset of real transactional data, aggregated from a larger population, provided by a bank in Spain. The dataset is readily accessible on Kaggle.com after signing up for an account.
The BankSim synthetic data is structured, tabular data composed of both categorical and quantitative elements. Nominal and ordinal data types are both present in the categorical fields. The data is also sequential: the transactions made by an account at different points in time heavily influence whether a transaction is flagged as fraudulent or not.
The generated dataset contains no personal information (PI) or disclosure of legal and private customer transactions, making it suitable for research purposes. The data covers a six-month timeframe, from November 2012 to April 2013, and is restricted to zip codes in Madrid and Barcelona. There are 15 merchant categories that differentiate the payments made, and all prices are given in euros. Zip Code One (ZC1), one of the biggest zip codes by payment volume, is the only zip code available in our dataset.
Novelty detection algorithms have previously been applied to synthetic data to demonstrate the performance of outlier detection. Fraudsters and other threat agents adapt their behavior to avoid the account controls set by financial institutions and legislation, e.g., making smaller transactions that fall just below an alert threshold. There is a lack of data available for research in fields such as money laundering, financial fraud, and illegal payments, which leads to in-house solutions that are not shared with the public. Real data also has the shortcoming of not containing enough, or sufficiently diverse, fraudulent activity to build a machine learning model from.
Fraud scenarios implemented in BankSim were based on selected cases from the Grant Thornton report, Member and Council (2009). The focus was on card-related fraud, which refers to cases where the important data on the card is compromised: account name, credit card number, expiration date, and verification code.
Theft scenarios include cases where the customer loses physical possession of their card; flagging this scenario could entail seeing a high number of unusual, high-value transactions in a short period of time.
Cloned card/skimming scenarios include cases where a clone of the card is created, without knowledge of the card owner. Flagging this behavior could entail seeing a high number of unusual transactions with high value in a short period of time. Other flags include seeing simultaneous payments in different physical locations, or seeing the card used far from previously known locations.
Internet purchase scenarios include cases where fraudsters use "carding" websites, which check the validity of a card instantly, to purchase immaterial goods on the internet. This tells the fraudster that the card is still valid before they use it in person. Flagging this behavior could entail blacklisting carding websites and cross-referencing them with current user activity to detect any unusual purchases after the carding was executed.
During a normal step of the simulation, a customer that enters the simulation can decide to purchase an item or service from one of the offered categories. Once the category has been selected, the customer senses nearby merchants that offer that category and listens to their offers. If an offer is accepted, the transaction takes place and the merchant registers the payment. Each step in the simulation represents a day of commercial activity. Currently, the dataset does not differentiate between days of the week when feeding the consumption pattern; all days of the week are treated the same.
BankSim was run for 180 steps several times, and its parameters were calibrated to obtain a distribution reliable enough for testing. Thieves were injected with the aim of stealing an average of 3 cards per step and performing 2 fraudulent transactions per day. As a result, 594,643 records were produced, of which 7,200 were fraudulent transactions. That is still a very high fraud rate (roughly 1.2% of transactions); the simulation was intentionally programmed to produce aggressive fraud behavior while honoring the distributions found in the original data.
Our project follows the CRISP-DM process model, which can be outlined as follows: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
The goal of this project is to learn about the concepts, principles, techniques, and applications of machine learning. Secondary goals include learning Python.
Fraud is a serious problem for industries such as telecommunications and banking, taking forms such as identity theft, telemarketing fraud, and bank fraud. Fraud costs these industries millions in revenue each year, and data mining can be used to improve fraud detection. By applying data mining techniques, we can reduce fraudulent transactions by building a profile of each customer's calling behavior or account activity. Comparing new activity against a customer's profile, we can then look for outliers in the data and flag them to prevent fraud.
Our business will be successful once we are able to detect outliers in our consumers' activities and label them as fraudulent. If a consumer makes a purchase that is an outlier, our system will flag that activity. If a consumer's identity is used outside of their normal scope of activity, our system will flag it as fraudulent.
There would be some cost involved for the business, but the reward outweighs the cost. If the project is successful, the business would benefit by being able to protect its customers from fraudulent activity on their accounts, and more users would want to use the business because of the added security.
Our goals are to increase the user count of partner banks by protecting their users from fraud, and to gather user information and determine whether there are any fraudulent transactions based on the location, amount, and frequency of a user's purchases. We will be using outlier detection to find the fraudulent transactions.
Our criterion for a successful outcome will be met when we can successfully detect outliers in consumers' purchases more than 80% of the time.
Phase | Time | Resources | Risks |
---|---|---|---|
Business understanding | 1 week | All analysts | Data problems, technology problems |
Data understanding | 3 weeks | All analysts | Data problems, technology problems |
Data preparation | 5 weeks | All analysts | Data problems, technology problems |
Modeling | 2 weeks | All analysts | Coding problems, technology problems |
Evaluation | 1 week | All analysts | Poor model results, inability to produce findings |
Deployment | 1 week | All analysts | Inability to implement models, no deployment launch |
We rely on Python's open-source libraries and associated packages: pandas, NumPy, scikit-learn, SciPy, Matplotlib, seaborn, missingno, and pyalluvial.
We also rely on open-source datasets available on the internet to acquire bank transaction data; in our particular case, the data is synthetic. We then prepared our data to create predictive classification models that determine whether new data is valid or represents a potentially fraudulent transaction based on its outlier profile.
The data for our Bank Fraud Checker system consists of synthetic, i.e. generated, data based on real but limited bank transaction data from an unidentified financial institution in Barcelona, Spain.
The data for our model is downloaded in the form of a csv file and imported into a pandas dataframe to facilitate data analysis.
As you can see below, the data consists of over 500,000 records with fully populated cells across 10 fields. Due to the lack of diversity in zip code data, those columns were eliminated from our analysis.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # show every expression's output in a cell
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
# load the cleaned BankSim data into a dataframe
df = pd.read_csv("data/bs_clean.csv")
df.head()
 | step | customer | age | gender | merchant | category | amount | fraud
---|---|---|---|---|---|---|---|---
0 | 0 | C1093826151 | 4 | M | M348934600 | transportation | 4.55 | 0 |
1 | 0 | C352968107 | 2 | M | M348934600 | transportation | 39.68 | 0 |
2 | 0 | C2054744914 | 4 | F | M1823072687 | transportation | 26.89 | 0 |
3 | 0 | C1760612790 | 3 | M | M348934600 | transportation | 17.25 | 0 |
4 | 0 | C757503768 | 5 | M | M348934600 | transportation | 35.72 | 0 |
import missingno as msno
# bar chart of non-null counts per column confirms the dataset has no missing values
msno.bar(df)
<AxesSubplot:>
As seen below in the histogram of our label, fraud, the stark imbalance between the number of fraudulent and valid transactions poses serious issues.
An imbalanced dataset is one in which the classes are distributed unequally, as with our fraud field.
Fraudulent transactions constitute only about 1.2% of the entire dataset, roughly a 1:82 ratio of fraudulent to valid transactions.
This imbalance can create problems for our classification task: a model trained on this data will be biased toward predicting valid transactions.
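As a quick numeric check on that imbalance (a small sketch, not part of the original notebook), the label distribution can be tabulated directly from the dataframe loaded above:
# counts and proportions of valid (0) vs. fraudulent (1) transactions
df["fraud"].value_counts()
df["fraud"].value_counts(normalize=True)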
sns.histplot(data=df, x="fraud")
plt.xticks(np.arange(0, 2, 1))
<AxesSubplot:xlabel='fraud', ylabel='Count'>
The only continuous numeric field is amount, which contains the value (in Euros) of the purchases made by customers from merchants. When plotted, the distribution of this data shows it is highly skewed to the right (positive skew).
sns.set(rc = {'figure.figsize':(15,8)})
sns.histplot(data=df, x="amount", hue="fraud", multiple="stack")
<AxesSubplot:xlabel='amount', ylabel='Count'>
Looking at the frequency of transactions by day, or “step”, broken down by “fraud”, other than a spike in valid transactions every 30 days or so, there is no telling trend as to how many fraudulent transactions occur on any given day. We are not provided with the time of day, day of the week, or month of the year to help us derive any patterns in the occurrence of fraudulent transactions.
Typically, this time/date data would be very helpful because seasonality plays a significant role in when a fraudster engages in fraudulent activity, choosing to act when both the likelihood of getting caught and the likelihood of financial transactions being closely scrutinized are low.
What we can tell from this chart is that valid and fraudulent transactions occur daily in roughly the same proportion (about 1:82), and that proportion is kept consistent throughout the data generation period (180 steps).
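As a rough check on that consistency (a sketch we did not include in the original analysis), the per-step fraud proportion can be summarized directly:
# fraction of fraudulent transactions on each day (step)
df.groupby("step")["fraud"].mean().describe()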
# frequency count of field by fraud
sns.histplot(data=df, x="step", hue="fraud", multiple="stack")
<AxesSubplot:xlabel='step', ylabel='Count'>
Taking another view of the daily transactions, we see in the graphs below the “amount” of each transaction plotted against the step. The fraudulent transactions are by and large much higher in “amount” than the valid transactions. While we do see some valid transactions reaching up to the €2,000 mark, most valid transactions remain below the €200 average line we see in the bottom graph. The average “amount” of the fraudulent transactions fluctuates wildly around the €600 mark as the steps progress.
# mean transaction amount by field by fraud
# the lines are error bars representing the uncertainty around the mean estimate
sns.lineplot(data=df, x="step", y="amount", hue="fraud")
<AxesSubplot:xlabel='step', ylabel='amount'>
From a visual inspection, then, it is clear that the majority of fraudulent transactions represent the higher-“amount” purchases. The heavy skewness observed earlier in the “amount” histogram is largely an artifact of the fraudulent transactions, which therefore represent true outliers. If a rudimentary fraud alert system were based on a simple “amount” threshold, it would work fairly well for our given dataset.
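To illustrate the point, such a rudimentary threshold rule can be scored directly against the fraud label; the €500 cutoff below is purely illustrative and not part of the original notebook:
# hypothetical single-threshold alert: flag any transaction above 500 euro
flagged = (df["amount"] > 500).astype(int)
# how often the flag agrees with the actual fraud label
pd.crosstab(df["fraud"], flagged, rownames=["fraud"], colnames=["flagged"])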
Apart from the “amount” field, our dataset primarily contained categorical data. A look at “age” showed the age group where the most fraudulent transactions occurred was group 2, or people aged between 26 and 35 years old, followed by group 3 (36-45 years) then 4 (46-55 years). All age groups contained fraudulent transactions.
sns.stripplot(data=df, x="step", y="amount", hue="fraud")
plt.xticks(np.arange(0, 180, 18))
<AxesSubplot:xlabel='step', ylabel='amount'>
The average fraudulent “amount” across all age groups floats around the €500 mark. However, given the very low counts of both valid and fraudulent transactions for age groups 0 (18 years and under) and U (unknown), the average fraud “amount” for those groups should immediately raise questions. Notwithstanding the stereotype of youth as impetuous high spenders, their lack of access to large amounts of money makes the high average “amount” in their category stand out, especially given the small number of transactions within it.
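The counts behind this observation can be tabulated directly (a sketch; this aggregation is not shown elsewhere in the notebook):
# number of fraudulent transactions and their average amount per age group
df[df["fraud"] == 1].groupby("age")["amount"].agg(["count", "mean"])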
sns.histplot(data=df, x="age", hue="fraud", multiple="stack")
<AxesSubplot:xlabel='age', ylabel='Count'>
sns.barplot(data=df, x="age", y="amount", hue="fraud")
<AxesSubplot:xlabel='age', ylabel='amount'>
The same anomalous behavior can also be seen with the “Enterprise” category in the “gender” field. There are only 7 counts of fraud in that category, but the average fraud “amount” is greater than for the “male” and “female” categories, which have thousands of records. While it may be true that company expenditures are naturally much larger than those of individuals, the types of purchases we examine below do not account for it. As with some purchase types, there are no fraudulent transactions for the “Unknown” gender type.
sns.histplot(data=df, x="gender", hue="fraud", multiple="stack")
<AxesSubplot:xlabel='gender', ylabel='Count'>
sns.barplot(data=df, x="gender", y="amount", hue="fraud")
<AxesSubplot:xlabel='gender', ylabel='amount'>
There are 15 purchase types identified in the “category” field. Of those, “transportation”, “content”, and “food” do not contain any fraudulent transactions. The fact that the “transportation” category alone has over 500,000 records is testament to the class imbalance problem mentioned earlier in this paper. Of the remaining categories, “travel” accounts for the largest fraudulent purchase amounts; the average “amount” for this purchase category is over €2,500.
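These claims can be checked with a per-category aggregation (again a sketch, not part of the original notebook):
# per-category totals: number of frauds (sum), number of transactions (count), and fraud rate (mean)
df.groupby("category")["fraud"].agg(["sum", "count", "mean"])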
sns.histplot(data=df, y="category", hue="fraud", multiple="stack")
<AxesSubplot:xlabel='Count', ylabel='category'>
sns.barplot(data=df, y="category", x="amount", hue="fraud")
<AxesSubplot:xlabel='amount', ylabel='category'>
Even though the number of fraudulent transactions in the travel category was small, it had the fraudulent transactions with the highest expenditures across age and gender.
# bubble plot: purchase category vs. age, sized by amount and colored by gender
fig, ax = plt.subplots(figsize=(6, 6))
sns.scatterplot(data=df, x='age', y='category', size='amount', hue='gender', alpha=0.5, sizes=(10, 1000), ax=ax)
plt.legend(bbox_to_anchor=(1, 1), loc='upper left', fontsize=9)
<AxesSubplot:xlabel='age', ylabel='category'>
We created an alluvial chart that shows the dataset color-coded by fraud (orange for fraud and blue for valid). Each transaction is represented by a stream flowing from the gender type to the associated age of the customer, and from there to the type of purchase that was made. Because the data streams are very fine and intersect multiple times, we do not see any clear trend this way. This speaks to the high entropy in our dataset and suggests why a decision tree would be better suited to deriving insights and patterns from the bank transaction data than traditional data mining.
import pyalluvial.alluvial as alluvial
# pre-aggregated counts of transactions by gender, age, and category, split by fraud
freq_df = pd.read_csv("alluvial.csv")
fig = alluvial.plot(df=freq_df, xaxis_names=['gender', 'age', 'category'], y_name='count', alluvium='fraud', ignore_continuity=False, figsize=(20, 100))
plt.ylabel('Transactions')
The average number of transactions per customer over the 180-day period was 143, compared to 11,749 transactions per merchant. The average occurrence of fraud per customer over the 180-day period was about 2, while the average occurrence of fraud per merchant was 144. The average customer purchase amount during this period was about €35, while the average merchant sale amount was about €130. The average amount stolen per customer during this period was €198, while the average amount defrauded from merchants was €357.
# avg customer purchases per day
(df.groupby(["step"])["customer"].count()/df.groupby(["step"])["customer"].nunique()).mean()
# avg merchant sales per day
(df.groupby(["step"])["merchant"].count()/df.groupby(["step"])["merchant"].nunique()).mean()
1.0384574855236957
85.38324425832103
A hash table was created to store how many fraudulent and valid transactions occurred per customer, along with the purchase and stolen amounts; the same was done for merchants. From these tables, we could derive the average purchase and the average theft per customer and per merchant. Of the 4,112 individual customers, 1,483 (36%) experienced theft by a fraudster. Of the 50 individual merchants, 30 (60%) experienced fraud.
customer = {}
merchant = {}
for index, row in df.iterrows():
    # populating the customer table
    if row.customer not in customer:
        customer[row.customer] = {}
        customer[row.customer]["purchases"] = []
        customer[row.customer]["stolen"] = []
        if row.fraud == 0:
            customer[row.customer]["transactions"] = 1
            customer[row.customer]["frauds"] = 0
            customer[row.customer]["purchases"].append(row.amount)
        elif row.fraud == 1:
            customer[row.customer]["transactions"] = 0
            customer[row.customer]["frauds"] = 1
            customer[row.customer]["stolen"].append(row.amount)
    else:
        if row.fraud == 0:
            customer[row.customer]["transactions"] += 1
            customer[row.customer]["purchases"].append(row.amount)
        elif row.fraud == 1:
            customer[row.customer]["frauds"] += 1
            customer[row.customer]["stolen"].append(row.amount)
    # populating the merchant table
    if row.merchant not in merchant:
        merchant[row.merchant] = {}
        merchant[row.merchant]["sales"] = []
        merchant[row.merchant]["stolen"] = []
        if row.fraud == 0:
            merchant[row.merchant]["transactions"] = 1
            merchant[row.merchant]["frauds"] = 0
            merchant[row.merchant]["sales"].append(row.amount)
        elif row.fraud == 1:
            merchant[row.merchant]["transactions"] = 0
            merchant[row.merchant]["frauds"] = 1
            merchant[row.merchant]["stolen"].append(row.amount)
    else:
        if row.fraud == 0:
            merchant[row.merchant]["transactions"] += 1
            merchant[row.merchant]["sales"].append(row.amount)
        elif row.fraud == 1:
            merchant[row.merchant]["frauds"] += 1
            merchant[row.merchant]["stolen"].append(row.amount)
In our entire set of half a million data points, the data is made up of transactions between 4112 customers and 50 merchants.
# how many individual customers and merchants
len(customer)
len(merchant)
4112
50
Looking through the merchant hash table, there were merchants that experienced more fraudulent transactions than valid transactions. Cross-referencing the merchant IDs with the purchase categories, we determined that fraud occurs most prevalently against merchants associated with leisure, travel, hotel services, sports and toys, and home purchases.
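The customer and merchant hash tables built above are converted into dataframes (the c_df and m_df used below) before the per-customer and per-merchant averages are computed. That conversion cell is not reproduced in this report, but a minimal version would look like:
# one row per customer / merchant ID, with the lists and counters as columns
c_df = pd.DataFrame.from_dict(customer, orient="index")
m_df = pd.DataFrame.from_dict(merchant, orient="index")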
# average purchase amount per customer
Pavg = []
for index, row in c_df.iterrows():
    Pavg.append(sum(row["purchases"]) / row["transactions"])
c_df["avg_purchase"] = Pavg
# average theft amount per customer (0 if the customer was never defrauded)
Tavg = []
for index, row in c_df.iterrows():
    if row.frauds > 0:
        Tavg.append(sum(row["stolen"]) / row["frauds"])
    else:
        Tavg.append(0)
c_df["avg_theft"] = Tavg
c_df[c_df.avg_theft > 0].head()
 | purchases | stolen | transactions | frauds | avg_purchase | avg_theft
---|---|---|---|---|---|---
C765155274 | [9.1, 14.39, 18.96, 36.39, 23.22, 8.41, 10.7, ... | [752.23] | 175 | 1 | 39.201143 | 752.230 |
C623601481 | [68.79, 58.38, 78.92, 2.78, 14.92, 31.77, 45.1... | [431.88, 2372.22, 521.63, 1888.43, 541.61] | 89 | 5 | 28.985955 | 1151.154 |
C194016923 | [30.19, 31.45, 8.54, 5.15, 24.48, 32.2, 10.8, ... | [164.04, 1142.23] | 158 | 2 | 29.775127 | 653.135 |
C834963773 | [40.69, 4.93, 20.86, 37.22, 20.19, 11.84, 51.6... | [747.24, 667.76, 437.47, 96.59, 244.63] | 178 | 5 | 33.194944 | 438.738 |
C124539163 | [10.09, 21.87, 20.25, 14.75, 29.59, 17.67, 55.... | [4574.72, 85.87] | 75 | 2 | 32.948533 | 2330.295 |
The average number of valid transactions per customer was 143, while the average number of frauds within the same time period was roughly 2 fraudulent transactions. The average valid purchase amount was only about €35, while the average fraudulent purchase amount was nearly €200.
# avg customer transactions over 180 days
c_df["transactions"].mean()
# avg customer fraud over 180 days
c_df["frauds"].mean()
# avg customer expenditure during this period
c_df["avg_purchase"].mean()
# avg customer fraud during this period
c_df["avg_theft"].mean()
142.86065175097275
1.7509727626459144
34.34324428801419
197.37380636624428
We created a table containing each merchant ID with features including the list of valid transactions, the list of fraudulent transactions, the number of each kind of transaction, and their average amounts. We notice a pattern wherein valid transactions are much higher in volume but represent much lower purchase amounts on average, whereas the opposite holds for the fraudulent transactions per merchant.
# average sale amount per merchant
Savg = []
for index, row in m_df.iterrows():
    Savg.append(sum(row["sales"]) / row["transactions"])
m_df["avg_sale"] = Savg
# average fraudulent amount per merchant (0 if the merchant saw no fraud)
Favg = []
for index, row in m_df.iterrows():
    if row.frauds > 0:
        Favg.append(sum(row["stolen"]) / row["frauds"])
    else:
        Favg.append(0)
m_df["avg_theft"] = Favg
m_df[m_df.avg_theft > 0].head()
 | sales | stolen | transactions | frauds | avg_sale | avg_theft
---|---|---|---|---|---|---
M50039827 | [68.79, 59.51, 98.24, 163.03, 115.87, 20.7, 10... | [1025.56, 295.57, 493.79, 520.11, 130.56, 590.... | 870 | 46 | 105.229092 | 409.394130 |
M1888755466 | [87.67, 25.0, 84.39, 24.29, 19.25, 116.01, 96.... | [66.6, 189.22, 41.48, 572.01, 386.21, 226.78, ... | 684 | 228 | 75.685497 | 316.469605 |
M480139044 | [266.59, 44.14, 248.42, 55.82, 50.88, 83.93, 2... | [44.26, 324.5, 667.09, 520.5, 289.21, 560.9, 9... | 1874 | 1634 | 103.299803 | 406.857032 |
M692898500 | [171.07, 109.26, 187.62, 237.48, 195.44, 27.84... | [112.55, 830.57, 143.09, 607.85, 904.51, 411.0... | 884 | 16 | 105.148835 | 418.039375 |
M348875670 | [114.54, 127.84, 199.95, 35.57, 134.89, 154.92... | [112.44, 321.46, 145.84, 0.8, 141.22, 420.81, ... | 97 | 10 | 111.385361 | 211.485000 |
The average number of valid transactions per merchant was 11,749, while the average number of frauds within the same time period was 144 fraudulent transactions. The average valid sale amount was only about €130, while the average fraudulent purchase amount was nearly €360.
# avg merchant transactions over 180 days
m_df["transactions"].mean()
# avg merchant fraud over 180 days
m_df["frauds"].mean()
# avg merchant revenue during this period
m_df["avg_sale"].mean()
# avg merchant fraud during this period
m_df["avg_theft"].mean()
11748.86
144.0
129.176631312623
356.63137368984593
The most exploited merchants, defined as those having more fraudulent transactions than valid ones in our dataset, are associated with the following categories:
exploited = df[["merchant", "category"]][df.merchant.isin(m_df[m_df.frauds > m_df.transactions].index)]
exploited.groupby(["merchant"])["category"].unique()
merchant
M1294758098    [leisure]
M1353266412    [hotelservices]
M17379832      [sportsandtoy]
M1873032707    [hotelservices]
M2011752106    [hotelservices]
M2080407379    [travel]
M2122776122    [home]
M3697346       [leisure]
M732195782     [travel]
M857378720     [hotelservices]
M980657600     [sportsandtoy]
Name: category, dtype: object
When analyzing the skewness and kurtosis of the full “amount” data, we find we have an excess kurtosis of 1425.31 and a skewness of 32.37 (for perspective, both values are 0 for a normal distribution).
from scipy.stats import kurtosis, skew
print('excess kurtosis of amount (0 for a normal distribution): {}'.format(kurtosis(df.amount)))
print('skewness of amount (0 for a normal distribution): {}'.format(skew(df.amount)))
excess kurtosis of amount (0 for a normal distribution): 1425.3116885527731
skewness of amount (0 for a normal distribution): 32.36575650728976
Looking at the statistical descriptors of the amount field below, we see that across the 594,643 records the mean transaction “amount” is roughly €38. The minimum purchase is €0 and the maximum is €8,330. The skewness is apparent when noting that the 50th percentile of all those records is only €27; given the maximum value and the number of transactions, the positive skew is obvious. The kurtosis is likewise apparent when noting that the 75th percentile is only €43: again, considering the maximum value and the number of transactions, an extreme kurtosis is evident.
df.amount.describe()
count    594643.000000
mean         37.890135
std         111.402831
min           0.000000
25%          13.740000
50%          26.900000
75%          42.540000
max        8329.960000
Name: amount, dtype: float64
A density plot visualizes the distribution of data over a continuous interval or time period. It is similar to a histogram, but uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. Density plots have a few advantages over histograms, such as being better at conveying the shape of the distribution because they are not affected by the number of bins used. The vast majority of the transactions fall within the €10 to €100 range, consistent with the extreme skewness and kurtosis we measured.
df['amount'].plot.density(logx=True)
<AxesSubplot:ylabel='Density'>
We wanted to see if we could accurately identify fraudulent transactions by looking at outliers. We first tried calculating outliers using the IQR method. The interquartile range for amount is about €29, so under this method any transaction above €85.74 is considered an outlier, which gives us far too many false positives.
# IQR
Q1 = df.amount.quantile(0.25)
Q3 = df.amount.quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range for amount: " )
IQR
# Outliers
Maximum = Q3 + (1.5 * IQR)
print("Maximum outliers for amount: ")
Maximum
Interquartile Range for amount:
28.799999999999997
Maximum outliers for amount:
85.74
Q1 = df.amount.quantile(0.25)
Q3 = df.amount.quantile(0.75)
IQR = Q3 - Q1
# flag any amount above the upper IQR fence as an outlier
conditions = [(df.amount > (Q3 + 1.5 * IQR)), (df.amount < (Q3 + 1.5 * IQR))]
values = [1, 0]
df["IQR_outlier"] = np.select(conditions, values)
df.head()
 | step | customer | age | gender | merchant | category | amount | fraud | age_code | gender_code | category_code | IQR_outlier
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | C1093826151 | 4 | M | M348934600 | transportation | 4.55 | 0 | 4 | 2 | 12 | 0 |
1 | 0 | C352968107 | 2 | M | M348934600 | transportation | 39.68 | 0 | 2 | 2 | 12 | 0 |
2 | 0 | C2054744914 | 4 | F | M1823072687 | transportation | 26.89 | 0 | 4 | 1 | 12 | 0 |
3 | 0 | C1760612790 | 3 | M | M348934600 | transportation | 17.25 | 0 | 3 | 2 | 12 | 0 |
4 | 0 | C757503768 | 5 | M | M348934600 | transportation | 35.72 | 0 | 5 | 2 | 12 | 0 |
lowerBound = Q1 - 1.5*IQR
upperBound = Q3 + 1.5*IQR
print("The lower outlier bound for amount is: ", lowerBound)  # non-existent, since amounts are never negative
print("The upper outlier bound for amount is: ", upperBound)
The lower outlier bound for amount is:  -29.459999999999994
The upper outlier bound for amount is:  85.74
Since the IQR method flagged far too many transactions, we tried the Z-score method with a 1.5 standard deviation threshold. This gave us much better results: 7,341 flagged transactions, only 141 more than the 7,200 actual frauds.
mean = df["amount"].mean()
std = np.std(df["amount"])
print('mean of the dataset is', mean)
print('std. deviation is', std)
threshold = 1.5
outlier = []
for i in df["amount"]:
z = (i-mean)/std
if abs(z) > threshold:
outlier.append(i)
print('The number of outliers in the dataset is', len(outlier))
mean of the dataset is 37.89013530807561
std. deviation is 111.40273725877348
The number of outliers in the dataset is 7341
df["z_score"] = (df.amount - mean) / std
# flag any transaction more than 1.5 standard deviations from the mean
conditions = [(df.z_score > 1.5) | (df.z_score < -1.5)]
values = [1]
df["Z_outlier"] = np.select(conditions, values)
df.head()
 | step | customer | age | gender | merchant | category | amount | fraud | age_code | gender_code | category_code | IQR_outlier | z_score | Z_outlier
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | C1093826151 | 4 | M | M348934600 | transportation | 4.55 | 0 | 4 | 2 | 12 | 0 | -0.299276 | 0 |
1 | 0 | C352968107 | 2 | M | M348934600 | transportation | 39.68 | 0 | 2 | 2 | 12 | 0 | 0.016067 | 0 |
2 | 0 | C2054744914 | 4 | F | M1823072687 | transportation | 26.89 | 0 | 4 | 1 | 12 | 0 | -0.098742 | 0 |
3 | 0 | C1760612790 | 3 | M | M348934600 | transportation | 17.25 | 0 | 3 | 2 | 12 | 0 | -0.185275 | 0 |
4 | 0 | C757503768 | 5 | M | M348934600 | transportation | 35.72 | 0 | 5 | 2 | 12 | 0 | -0.019480 | 0 |
For identifying outliers that correspond to fraud, the Z-score method clearly outperformed the IQR method: it flagged 7,341 transactions versus 25,798 for the IQR method, against 7,200 actual frauds.
df.fraud.value_counts()
df.IQR_outlier.value_counts()
df.Z_outlier.value_counts()
0    587443
1      7200
Name: fraud, dtype: int64
0    568845
1     25798
Name: IQR_outlier, dtype: int64
0    587302
1      7341
Name: Z_outlier, dtype: int64
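Cross-tabulating each flag against the actual label shows the false positives and false negatives behind those counts (a sketch, not part of the original notebook):
# confusion-style comparison of each outlier flag against the fraud label
pd.crosstab(df["fraud"], df["IQR_outlier"])
pd.crosstab(df["fraud"], df["Z_outlier"])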
We had to encode our categorical features for correlation analysis and data preparation for our modeling. Below are the code descriptions for our categorical feature values.
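The encoding step itself is not reproduced here, since the age_code, gender_code, and category_code columns were already present in the cleaned file we loaded; one way to generate integer codes matching the mappings shown below would be pandas categorical codes (a sketch, assuming the raw columns hold the string values):
# map each categorical value to an integer code (codes follow sorted order)
for col in ["age", "gender", "category"]:
    df[col + "_code"] = df[col].astype("category").cat.codes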
df["category_code"].groupby(df["category"]).unique()
category
barsandrestaurants    [0]
content               [1]
fashion               [2]
food                  [3]
health                [4]
home                  [5]
hotelservices         [6]
hyper                 [7]
leisure               [8]
otherservices         [9]
sportsandtoy          [10]
tech                  [11]
transportation        [12]
travel                [13]
wellnessandbeauty     [14]
Name: category_code, dtype: object
df["gender_code"].groupby(df["gender"]).unique()
gender
E    [0]
F    [1]
M    [2]
U    [3]
Name: gender_code, dtype: object
df["age_code"].groupby(df["age"]).unique()
age
0    [0]
1    [1]
2    [2]
3    [3]
4    [4]
5    [5]
6    [6]
U    [7]
Name: age_code, dtype: object
# pairwise relationships on a 10,000-row sample, colored by fraud
sns.pairplot(df.sample(10000), hue="fraud", diag_kind='kde')
A correlation matrix represents the correlations between pairs of variables in the given data; the correlation coefficient denotes the strength of the relationship between two variables. The plot below is a coolwarm heat map that shows the strength of each relationship with color, blue for weak correlation and red for strong correlation. When the matrix compares a variable with itself, the correlation coefficient is 1. Apart from the derived outlier columns, the only variable pair with a significant correlation is fraud and amount, with a correlation coefficient of 0.49. This tells us that whether a transaction is fraudulent is related to the amount of the transaction.
# correlation heatmap
df.corr().style.background_gradient(cmap='coolwarm')
 | step | amount | fraud | age_code | gender_code | category_code | IQR_outlier | z_score | Z_outlier
---|---|---|---|---|---|---|---|---|---
step | 1.000000 | -0.007961 | -0.011898 | 0.001169 | -0.001107 | -0.017269 | -0.004574 | -0.007961 | -0.007755 |
amount | -0.007961 | 1.000000 | 0.489967 | -0.003930 | -0.012888 | -0.098738 | 0.416670 | 1.000000 | 0.546959 |
fraud | -0.011898 | 0.489967 | 1.000000 | -0.004315 | -0.025047 | -0.114272 | 0.444686 | 0.489967 | 0.669257 |
age_code | 0.001169 | -0.003930 | -0.004315 | 1.000000 | 0.005020 | 0.004816 | -0.002577 | -0.003930 | -0.002632 |
gender_code | -0.001107 | -0.012888 | -0.025047 | 0.005020 | 1.000000 | 0.007700 | -0.016979 | -0.012888 | -0.018109 |
category_code | -0.017269 | -0.098738 | -0.114272 | 0.004816 | 0.007700 | 1.000000 | -0.304427 | -0.098738 | -0.147449 |
IQR_outlier | -0.004574 | 0.416670 | 0.444686 | -0.002577 | -0.016979 | -0.304427 | 1.000000 | 0.416670 | 0.524990 |
z_score | -0.007961 | 1.000000 | 0.489967 | -0.003930 | -0.012888 | -0.098738 | 0.416670 | 1.000000 | 0.546959 |
Z_outlier | -0.007755 | 0.546959 | 0.669257 | -0.002632 | -0.018109 | -0.147449 | 0.524990 | 0.546959 | 1.000000 |
A chi-square test is used in statistics to test the independence of two categorical variables; it measures how far the observed counts deviate from the expected counts. We use this test to determine the relationship between each independent categorical feature (our feature set) and the dependent categorical feature (our label). In feature selection, we aim to select the features that are highly dependent on the label. We determined that each feature is dependent on fraud.
# Pearson chi-square test of independence between each categorical feature and fraud
from scipy.stats import chi2_contingency
from scipy.stats import chi2
columns = ['age', 'gender', 'category']
stats = []
for col in columns:
    result = {}
    myCrosstable = pd.crosstab(df[col], df['fraud'])
    chiVal, pVal, dof, exp = chi2_contingency(myCrosstable)
    # interpret the test statistic
    # Test Statistic >= Critical Value: reject null hypothesis, dependent (Ha)
    # Test Statistic < Critical Value: fail to reject null hypothesis, independent (H0)
    # chi2.ppf(q, df, loc=0, scale=1) is the inverse CDF
    prob = 0.95  # significance level = 1 - 0.95 = 0.05
    critical = chi2.ppf(prob, dof)
    result['column'] = col
    result['critical'] = round(critical, 2)
    result['chiVal'] = round(chiVal, 2)
    if chiVal >= critical:
        result['H0'] = 'reject/dependent'
    else:
        result['H0'] = 'fail to reject/independent'
    # interpret the p-value
    # p-value <= alpha: reject null hypothesis, dependent (Ha)
    # p-value > alpha: fail to reject null hypothesis, independent (H0)
    alpha = 0.05
    result['significance'] = round(alpha, 2)
    result['p'] = round(pVal, 2)
    if pVal <= alpha:
        result['dependent'] = 'Dependent (reject H0)'
    else:
        result['independent'] = 'Independent (fail to reject H0)'
    stats.append(result)
stats
[{'column': 'age', 'critical': 14.07, 'chiVal': 44.15, 'H0': 'reject/dependent', 'significance': 0.05, 'p': 0.0, 'dependent': 'Dependent (reject H0)'},
 {'column': 'gender', 'critical': 7.81, 'chiVal': 393.43, 'H0': 'reject/dependent', 'significance': 0.05, 'p': 0.0, 'dependent': 'Dependent (reject H0)'},
 {'column': 'category', 'critical': 23.68, 'chiVal': 193862.64, 'H0': 'reject/dependent', 'significance': 0.05, 'p': 0.0, 'dependent': 'Dependent (reject H0)'}]
Since calculating outliers was not the best approach to determining whether a data point was fraudulent, we decided to use machine learning algorithms for our binary classification problem. This made it easy to choose, implement, and test any classification algorithm we wanted. We chose logistic regression, linear support vector machines, k-nearest neighbors, decision tree, random forest, and multi-layer perceptron classifiers. These classifiers are known to work well on binary classification problems such as ours, and given that we have only 4-6 features (depending on how many we include in the feature set) and about half a million rows, they have no problem handling our data; the longest any of them took to fit on our training data was about 2 minutes. Apart from the parameters we had to set explicitly, such as k, the number of neighbors in our KNN, or which solver to use for logistic regression, we kept the default parameter values provided by the library. This decision was made after reading the documentation for each classifier to understand which parameters were relevant and how their values should be chosen, and we found that the defaults worked well for our simple dataset. For the decision tree, random forest, and multi-layer perceptron classifiers, we set the random_state parameter; providing an integer seed ensures the same results across different calls by controlling the random number generator.
Next we prepare our feature set and label. We determined that the only features relevant to our class prediction are a customer's age, gender, purchase category, and purchase amount.
# creating feature set
X = df[["age_code", "gender_code", "category_code", "amount"]]
X.head()
 | age_code | gender_code | category_code | amount
---|---|---|---|---
0 | 4 | 2 | 12 | 4.55 |
1 | 2 | 2 | 12 | 39.68 |
2 | 4 | 1 | 12 | 26.89 |
3 | 3 | 2 | 12 | 17.25 |
4 | 5 | 2 | 12 | 35.72 |
y = df["fraud"]
y.head()
0 0 1 0 2 0 3 0 4 0 Name: fraud, dtype: int64
We are splitting our data with a traditional 70/30 training:testing split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(416250, 4) (178393, 4) (416250,) (178393,)
Logistic Regression is a type of Generalized Linear Model (GLM) that uses a logistic function to model a binary variable based on any kind of independent variables. For sklearn's LogisticRegression, the 'liblinear' solver is a good choice for small datasets. For the multi_class parameter, if the option chosen is 'ovr', then a binary problem is fit for each label ('multinomial' is unavailable when solver='liblinear').
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. LinearSVC is a class of SVMs capable of performing binary and multi-class classification on a dataset. LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel.
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Some advantages of decision trees are they are simple to understand and to interpret, they can be visualised, and they require little data preparation.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. A diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
Neural networks are a machine learning algorithm that involves fitting many hidden layers of neurons connected by activation functions. The multi-layer perceptron classifier optimizes the log-loss function using stochastic gradient descent. The default activation function for the hidden layer is 'relu', the rectified linear unit function, which returns f(x) = max(0, x).
For our LogisticRegression model, we set solver='liblinear', since the liblinear algorithm is a good choice for smaller datasets, and multi_class='ovr' to fit a binary problem for each label. For LinearSVC, we set dual=False because our number of samples is much higher than our number of features, and raised max_iter to 2000 so the solver has enough iterations to converge. For our KNeighborsClassifier model, we set the number of neighbors to 3 and otherwise kept the defaults. In our DecisionTreeClassifier, RandomForestClassifier, and MLPClassifier we set random_state=1 to get reproducible results across multiple function calls.
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
classifiers = [
LogisticRegression(solver='liblinear', multi_class='ovr'),
LinearSVC(dual=False, max_iter=2000),
KNeighborsClassifier(3),
DecisionTreeClassifier(random_state=1),
RandomForestClassifier(random_state=1),
MLPClassifier(random_state=1)
]
from sklearn import model_selection
results = []
names = []
for classifier in classifiers:
    kfold = model_selection.KFold(n_splits=10)
    cv_results = model_selection.cross_val_score(classifier, X, y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(classifier)
    msg = "%s: %f (%f)" % (classifier, cv_results.mean(), cv_results.std())
    print(msg)
LogisticRegression(multi_class='ovr', solver='liblinear'): 0.993425 (0.000609)
LinearSVC(dual=False, max_iter=2000): 0.993105 (0.000792)
KNeighborsClassifier(n_neighbors=3): 0.993682 (0.000804)
DecisionTreeClassifier(random_state=1): 0.991607 (0.000997)
RandomForestClassifier(random_state=1): 0.992483 (0.000882)
MLPClassifier(random_state=1): 0.994198 (0.000795)
To compare accuracy between models, we used 10-fold cross-validation to produce 10 accuracy scores per model. These scores were then compared to each other, and it was determined that the MLP classifier outperformed the other classifiers, as seen in the boxplot below.
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names, rotation = 90)
plt.show()
For our project, we are using six different models: logistic regression, linear SVC, k-nearest neighbors, decision tree, random forest, and a multi-layer perceptron.
Using these models, we achieved cross-validation accuracy scores between roughly 99.2% and 99.4%.
We were aiming for scores above 80%, which we exceeded comfortably. One reason our models score so highly is that we limited the data we fed them: when we cleaned the data, we kept only the features age, gender, category, and amount, removing fields that added noise or risked overfitting the model.
We hypothesized that the MLP classifier, a neural network classifier, would perform the best, given that neural networks tend to perform well on a wide range of datasets and data science problems. A neural network consists of connected neurons, or nodes, arranged in layers: an input layer, hidden layers, and an output layer. With our data, the nodes of the input layer correspond to our features: age, gender, category, and amount. Because we used the MLP classifier's default parameter settings, we have one hidden layer with 100 nodes. The MLP classifier optimizes the log-loss function using stochastic gradient descent, and the neurons are connected through activation functions, in this case the rectified linear unit (relu). Having fit six machine learning classifiers to our training data and tested them against the 30% of the data reserved for validation, we needed a way to compare the accuracy of the classifiers' predictions, which we did by implementing ten-fold cross-validation.
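Although f1_score was imported above, the per-model F1 computation is not shown in this report; below is a sketch of how F1 could be computed on the held-out 30% split (note that reporting the binary F1 for the fraud class versus a weighted average will change the numbers considerably):
# fit each classifier on the training split and report F1 on the held-out test split
for clf in classifiers:
    fitted = clf.fit(X_train, y_train)
    y_pred = fitted.predict(X_test)
    print(type(clf).__name__, "F1:", round(f1_score(y_test, y_pred), 4))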
Using our MLP classifier, in the first block of code below we look at a female between 19 and 25 years old purchasing something in the sports and toys category. She is spending €5,000, which our machine learning algorithm flags as fraud. In the second block of code we look at a female in the same age range buying something in the sports and toys category for only €50, which our algorithm does not flag as fraud.
classifier = MLPClassifier(random_state=1)
cls = classifier.fit(X_train, y_train)
# female aged 19-25 spending 5000 euro on sports and toys
y_output = cls.predict([[1, 1, 10, 5000]])
print(y_output)
[1]
# female aged 19-25 spending 50 euro on sports and toys
y_output = cls.predict([[1, 1, 10, 50]])
print(y_output)
[0]
Our project's ML success criteria are met when we can successfully detect outliers in consumers' purchases more than 80% of the time, and all of our models exceeded that threshold. Our project's business success criteria include being able to detect and flag fraudulent transactions; all the models presented meet the business objectives and none were deficient. The MLP classifier produced the best score, with the other models relatively close behind. We believe the MLP classifier worked best with our data because it handles many different data types and non-uniformly distributed data well.
Most, if not all, banks will never have a uniformly distributed dataset. If we wanted to deploy this project, we would show banks how we were able to accurately detect outliers with our limited dataset; banks would only be able to provide a limited dataset so as not to violate client privacy. We are confident that we could connect our machine learning pipeline to a bank's data and detect fraud for that bank, and that if the bank supplied more data variables, the MLP, being a neural network, would still return accurate results and be more than capable of handling the additional data types.
Future work entails performing statistical feature selection, statistical tests of significance on model performance metrics, ANOVA analysis between model results, plotting ROC curves, and applying new datasets to our fitted model to further evaluate its predictions. The more data we feed into our classifier, the more we are able to saturate its learning and improve its predictive power.
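As one concrete example of that future evaluation work, the ROC curve and AUC for the fitted MLP classifier (cls, from the prediction example above) could be produced along these lines (a sketch using the held-out test split):
# ROC curve and AUC for the fitted MLP on the held-out test data
from sklearn.metrics import roc_curve, roc_auc_score
y_score = cls.predict_proba(X_test)[:, 1]  # predicted probability of the fraud class
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", roc_auc_score(y_test, y_score))
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()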