Bank Churn Modeling

Jun 28

Project Description

As always, we will start by reviewing the machine learning task at hand. Before we do this, however, I want to apologize for not having written an article recently—AP testing was quite extensive and took up a large portion of my time in May. I plan to write as much as possible over the summer to make up for this deficit.

Let’s start by going over some terminology, as the title of this article was probably a mouthful.

Neural Networks are complex deep learning algorithms that are at the forefront of many types of machine learning tasks today, such as computer vision and time series analysis. In this project, we will design a neural network to classify bank customers into one of two categories.
Churn is defined as “a measure of the number of individuals or items moving out of a collective group over a specific period.” In this project, we will be modeling bank churn. In other words, our model must be able to classify a customer based on their likelihood to leave the bank.

This means we have a binary classification problem for which we must create a neural network. But as always, we must analyze our dataset to determine which columns might have an effect on churn.

Let’s get started!

Dataset Analysis

Let’s look at a sample of our dataset:

[Screenshot: sample rows of the churn dataset]

We can usually recognize a dependent variable immediately because most datasets have it in the rightmost column. In this case, “Exited” would be our dependent variable.

0 signifies that a customer did not leave the bank
1 signifies that a customer did leave the bank

All the columns to the left of “Exited” are independent variables, and thus can be fed into our model. But just because a column is an independent variable does not mean it should be used as data. Sometimes, there is no plausible relationship between a predictor and the outcome. In this case, we can easily eliminate the first three columns because row number, customer ID, and surname do not play a role in churn rates.

We may argue that gender also does not play a role in churn rates, but this idea is not completely certain. Because of this, we can input the column into the classifier and remove it if it proves to be detrimental.

Now that we’ve quickly looked over our data, let’s begin coding!

Library Installation

Of course, we must first install all the libraries necessary so that we don’t run into errors later on. These are the libraries we need:

TensorFlow is a library that will be used to create deep learning models such as neural networks.
Pandas is a Python library that helps in data manipulation and analysis, and it offers data structures that are needed in machine learning.
Numpy is another library that makes it easy to work with arrays. It provides several unique functions that will help in data preprocessing.
Scikit-Learn is a library that provides many machine learning algorithms, but we will be using it for data preprocessing.

Fortunately, these libraries can be quickly installed by using Pip, Python’s default package management system. All we have to do is enter the following lines of code into the terminal:

pip3 install scikit-learn
pip3 install tensorflow
pip3 install pandas
pip3 install numpy

Now, in our text editor (or notebook) of choice, we must import these libraries as follows:

import numpy as np 
import pandas as pd
import tensorflow as tf
import sklearn

Just on a side note: all the code can be found on my GitHub repository for this project.

Data Preprocessing

Dataset Creation and Splitting

Of course, we cannot begin coding our classifier without preprocessing our data. There are several useful predictors that must still be encoded before being fed into our neural network. First, however, let’s create two Numpy arrays: one for our independent variables and one for our dependent variable.

If you do not understand some of the preprocessing functions/methods being used, please take a look at the documentation of Numpy and Pandas. It is important to understand how the data is being transformed so that we can debug any errors that arise with greater ease.

dataset = pd.read_csv('Churn_Modeling.csv')
X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values

On line 1, we create a Pandas Dataframe, dataset, by using the read_csv function provided by Pandas. On the second and third lines, we divide dataset into two Numpy arrays: X and y.

X is formed by taking all the data from the fourth column (index 3, since indexing starts at 0) to the second-to-last column.
y is formed by taking all the data from the last column, “Exited”.
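To make the slicing concrete, here is a minimal sketch on a made-up three-row table. The values and the shortened column list are invented for illustration and are not taken from the real dataset. It shows that 3:-1 starts at the column with index 3 and stops just before the last column.

```python
import pandas as pd

# Hypothetical miniature of the dataset: three identifier columns,
# two predictors, and the "Exited" label in the last column.
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "Surname": ["Lee", "Roy", "Kim"],
    "CreditScore": [619, 608, 502],
    "Age": [42, 41, 42],
    "Exited": [1, 0, 1],
})

# Same slicing as above: skip the first three columns and the last one
X_toy = toy.iloc[:, 3:-1].values
# The label is the last column
y_toy = toy.iloc[:, -1].values

print(X_toy.shape)  # (3, 2): only CreditScore and Age remain
print(y_toy)        # [1 0 1]
```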

One of the first steps when beginning data preprocessing is to split our data into training and testing sets. We do this before encoding and scaling our data so that we can avoid leaking information about our testing set into our training set. Avoiding this will ensure that the model’s score on the testing dataset is comparable to its performance in the real world.

We can split our data randomly by using Scikit-Learn’s train_test_split() function. It requires that we pass in X, y, and the sample size that will be used for the test dataset. We will also input a random_state parameter so that the dataset is split with the same random seed each time we run the program.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now, we have four Numpy arrays that contain all of our data: X_train, X_test, y_train, y_test. The testing data will only be used after our model is created and trained, though.
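If you want to see what test_size and random_state do in isolation, the sketch below splits ten made-up samples. The arrays X_demo and y_demo are hypothetical stand-ins, not the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus made-up labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# test_size=0.2 holds out 20 percent of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Running the split again with the same seed selects the same rows
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```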

Scaling the Data

Whenever one works with large deep learning models such as artificial neural networks, it is important to scale the data. This will help processes such as backpropagation converge upon optimal weights with great efficiency.

In this project, we will use Scikit-Learn’s StandardScaler class as shown below. You will notice that only certain columns of the X_train and X_test array are passed in. This is because we do not scale categorical variables as this wrongly converts them into continuous variables.

from sklearn.preprocessing import StandardScaler
X_scale = StandardScaler()
X_train[:, [0, 3, 4, 5, 6, 9]] = X_scale.fit_transform(X_train[:, [0, 3, 4, 5, 6, 9]])
X_test[:, [0, 3, 4, 5, 6, 9]] = X_scale.transform(X_test[:, [0, 3, 4, 5, 6, 9]])

Note that we do not fit the X_scale object on our testing data (we merely run the .transform() method). This is to ensure that our pipeline’s testing results are not skewed: they must reflect how our model will perform in real life.
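The difference between fitting and transforming can be sketched by hand with plain Numpy. This is only an illustration of the idea on made-up numbers, not a replacement for StandardScaler: the mean and standard deviation come from the training column alone and are then reused for the test column.

```python
import numpy as np

# Made-up values for a single numeric column
train_col = np.array([20.0, 30.0, 40.0, 50.0])
test_col = np.array([30.0, 60.0])

# "Fitting" means learning the statistics from the training data only
mean = train_col.mean()
std = train_col.std()  # population standard deviation, which StandardScaler also uses

# "Transforming" applies those training statistics to any data, including the test set
train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(test_scaled)  # the test values are expressed on the training scale
```

Notice that the scaled test column does not need to have a mean of 0; it is simply measured against the training data.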

Encoding Categorical String Data

Before inputting these data into our classifier, we must ensure all the data are numeric, as procedures such as backpropagation won’t be able to function properly otherwise. Let’s take a look at our dataset again to identify which columns must be encoded.

As you can see, we have quite a few categorical columns that must be taken into consideration: Geography, Gender, Has Cr Card, and Is Active Member. Though Has Cr Card and Is Active Member are already encoded in the correct format for our classifier, Gender and Geography are not. So, we must use OneHotEncoding to categorically encode these columns and turn them into integers.

This can be done as shown below.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
geography_encoder = OneHotEncoder(dtype=int, drop='first')
gender_encoder = OneHotEncoder(dtype=int, drop='first')
ct = ColumnTransformer([('geography', geography_encoder, [1]), ('gender', gender_encoder, [2])], remainder='passthrough')

X_train = ct.fit_transform(X_train)
# Reuse the encoders fitted on the training data; do not refit on the test set
X_test = ct.transform(X_test)

In the third and fourth line above, we create two objects of the OneHotEncoder class: geography_encoder and gender_encoder. We pass in the argument drop=’first’ so that we can avoid high multicollinearity between predictors. This argument ensures that the first column created from one hot encoding is removed from the outputted array.

Then, we create ct, an object of the ColumnTransformer class, by passing in both encoders as per the documentation. The remainder=’passthrough’ parameter tells the object to keep all the non-modified columns intact.

In the final two lines, we use ct to create the modified arrays X_train and X_test. At this point, we have successfully scaled and encoded our input data. But we must do one more thing before moving on: we need to convert all data in the dataset to type np.float32. This is because the neural network that we will create expects its input data to be of this type.

X_train = np.asarray(X_train).astype(np.float32)
X_test = np.asarray(X_test).astype(np.float32)
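To see what drop='first' does in the encoding step above, here is a small sketch on a single made-up geography column. The category names and the arrays train_geo and test_geo are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up geography values, used only for illustration
train_geo = np.array([["France"], ["Spain"], ["Germany"], ["France"]])
test_geo = np.array([["Germany"], ["France"]])

encoder = OneHotEncoder(dtype=int, drop='first')

# Fit on the training data only; categories are sorted, so "France" comes first and is dropped
train_encoded = encoder.fit_transform(train_geo).toarray()
# Reuse the fitted encoder for the test data
test_encoded = encoder.transform(test_geo).toarray()

print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(train_encoded)           # columns stand for Germany and Spain
print(test_encoded)
```

A row of all zeros therefore means “France”, so no information is lost by dropping the first column.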

Great! Now that all the data is properly formatted, we can finally begin to create our artificial neural network.

Creating the Network Architecture

The implementation of neural networks in TensorFlow is more complicated than that of normal machine learning models as we have far more control over the model structure. This means that we will have to carefully design our network architecture in a way that prevents overfitting while enabling the model to pick up on accurate patterns in data. Often this comes down to trial and error, which can be a long and frustrating process.

You may notice that the code below is not the same as that in the GitHub repository; this is because I went back and made some more changes to the network architecture to improve accuracy and eliminate overfitting.

The first layer of an artificial neural network is always the input layer, which has as many nodes as there are predictors. Fortunately, TensorFlow will automatically create an input layer when training the model on the dataset.

This means that we can directly begin forming our hidden layers after initializing our model type. Let’s complete the initialization as shown below.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2

model = Sequential()

We haven’t yet used all the imports from the first three lines, but rest assured that they will come in handy once we further create our model architecture.

As you can see, we create an object model of the Sequential() class, which represents a feed-forward neural network. We can create this network’s architecture by using the .add() method of the model object.

In my own experimentation, I found that the churn modeling dataset being used is one that often leads to overfitting. Because of this, our model architecture will involve regularization, a modification that lowers the average value of our network’s weights and thereby curbs overfitting.

The exact mathematics of regularization is beyond the scope of this article. However, it is important to know that regularization adds either the absolute value or the square of each of the model’s weights to the cost.

L2 regularization adds the square of each weight

L1 regularization adds the absolute value of each weight

This modified cost function means that higher weights inevitably result in higher cost, urging gradient descent to find numerically lower values for the network’s weights.
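As a rough numerical illustration, the sketch below computes both penalties for three made-up weights. The names base_cost and lam (the regularization strength) and all the numbers are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up weights and unregularized cost, purely for illustration
weights = np.array([0.5, -1.0, 2.0])
base_cost = 0.30
lam = 0.01  # regularization strength

# L1 adds the absolute value of each weight, scaled by lam
l1_penalty = lam * np.sum(np.abs(weights))   # 0.01 * 3.5
# L2 adds the square of each weight, scaled by lam
l2_penalty = lam * np.sum(weights ** 2)      # 0.01 * 5.25

print(base_cost + l1_penalty)
print(base_cost + l2_penalty)
```

Because the largest weight is squared, it dominates the L2 penalty, which is why L2 regularization pushes hardest against large weights.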

Below, you can see how we initialize an L2 regularizer with a lambda of 0.01. The greater the value of lambda, the more impact regularization will have on our model’s weights.

regularizer = L2(0.01)

Now, we can begin creating the hidden, or intermediate, layers of our neural network. I chose a relatively shallow neural network for this dataset to avoid overfitting as much as possible. However, I highly recommend that you experiment with different architectures as well.

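# Each hidden layer applies the L2 penalty defined above to its weights
model.add(Dense(units=16, activation='relu', kernel_regularizer=regularizer))
model.add(Dense(units=32, activation='relu', kernel_regularizer=regularizer))
model.add(Dense(units=32, activation='relu', kernel_regularizer=regularizer))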

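We use the Rectified Linear Unit (ReLU) activation function in our hidden layers since it tends to lead to better results than a sigmoid function: its gradient does not shrink toward zero for large positive inputs, which usually makes training easier.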

Lastly, we must create the output layer of our neural network. Here we must use a sigmoid function since it outputs the probability of the customer leaving the bank. Since there are only two labels in our dataset, we use one node in our output layer.

model.add(Dense(units=1, activation='sigmoid'))
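For reference, both activation functions used in this network are simple to write out by hand. The Numpy versions below are only a sketch for building intuition; Keras uses its own implementations internally.

```python
import numpy as np

def relu(z):
    # Passes positive values through and clips negative values to zero
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))     # [0. 0. 2.]
print(sigmoid(z))  # roughly [0.119 0.5 0.881]
```

Since the sigmoid output always lies between 0 and 1, it can be read as the probability that a customer leaves.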

Now, all that is left to do is compile our model by inputting the optimizer, loss function, and evaluation metric that we would like to use. Since this is a binary classification problem, we can use binary crossentropy for our loss function.

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
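To get a feel for the loss function, here is a hand-written sketch of binary crossentropy in Numpy. The helper name and the small clipping constant are my own choices for illustration, and the inputs are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob):
    # Average negative log-likelihood of the true labels under the predicted probabilities
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), 1e-7, 1 - 1e-7)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Made-up predictions for one customer who left (1) and one who stayed (0)
confident_and_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_and_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(confident_and_right)  # small loss
print(confident_and_wrong)  # much larger loss
```

The loss grows quickly when the model is confident and wrong, which is exactly the behavior we want to penalize during training.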

Training Our Network

This step of the process is quite simple: all we have to do is train the model created in the previous section on our training data, X_train and y_train. Like in most cases, there is a neat .fit() method that we can use to carry out this training process. This can be implemented as shown below:

model.fit(X_train, y_train, epochs=300)

At this point, our model object is a trained neural network of the Sequential() class. But, we might also want to save this model so that we can use the same weights for later predictions. This can be done using the .save() method.

model.save('model')

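The above line of code, when run, should create a directory called “model” within the directory of the Python script. This folder contains the architecture and weights of our neural network and can be loaded in another Python script. Note that newer Keras releases (Keras 3, which ships with recent TensorFlow versions) may reject a path without an extension and expect a file name ending in .keras or .h5 instead, such as 'model.keras'.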

Testing Our Network

Of course, no machine learning pipeline can be completed without testing. To test our model, we have already created and stored testing data in the variables X_test and y_test. We can import the necessary functions from Scikit-Learn and implement them as shown below.

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

predictions = model.predict(X_test)
predictions = (predictions > 0.5)

cm = confusion_matrix(y_test, predictions)
accuracy = accuracy_score(y_test, predictions)

In the second line of code, we store the model’s test data predictions in the NumPy array predictions. The third line of code transforms this array of probabilities into an array of predicted class labels; this is done with an element-wise comparison against a threshold of 0.5.
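The thresholding step can be tried on its own with a few made-up probabilities. The array below is hypothetical; it only mimics the (n_samples, 1) shape that a one-node output layer produces.

```python
import numpy as np

# Made-up sigmoid outputs, one row per customer
probabilities = np.array([[0.08], [0.51], [0.97], [0.50]])

# Comparing against 0.5 gives booleans; a probability of exactly 0.5 maps to False
labels_bool = probabilities > 0.5
# Optional: flatten to a 1-D array of 0s and 1s
labels_int = labels_bool.astype(int).ravel()

print(labels_int)  # [0 1 1 0]
```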

If we print our accuracy, we should get a value of around 0.8365, which is approximately 84 percent. While this does not seem too bad at first glance, we must also take into consideration the confusion matrix. It should look something like that below:

[[1477  118]
 [ 209  196]]

To analyze this matrix, we must first take a look at the format of the output. Let’s go through this matrix row-by-row.

The first item of the first row contains the number of true negatives.
The second item of the first row contains the number of false positives.
The first item of the second row contains the number of false negatives.
The second item of the second row contains the number of true positives.

After taking a look at the confusion matrix, you may have noticed that the classifier is far more likely to predict a 0 than a 1. It correctly identifies only 196 of the 405 customers who actually left, which means that its accuracy on data with a label of 1 (its recall) is close to 50 percent.
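The numbers behind this observation can be recomputed directly from the confusion matrix reported above. The sketch below copies the four counts by hand and derives accuracy, recall, precision, and the F1 score from them; the variable names are my own.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predicted labels
cm_reported = np.array([[1477, 118],
                        [209, 196]])
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tn + tp) / cm_reported.sum()
recall = tp / (tp + fn)       # share of customers who left that the model caught
precision = tp / (tp + fp)    # share of predicted leavers who really left
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.8365
print(round(recall, 4))     # 0.484
print(round(precision, 4))  # 0.6242
print(round(f1, 4))         # 0.5452
```

The recall of roughly 0.48 is the “close to 50 percent” figure, and it is hidden by the much higher overall accuracy.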

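This problem arises from the fact that the dataset is imbalanced: only about one in five customers has a label of 1. If we wanted to improve this pipeline, we would likely have to address that imbalance, for example by resampling the training data or weighting the classes during training (a log transform helps with skewed feature values, but not with an uneven split of labels). This is something that I recommend you try on your own. I’ll likely cover it in a future project, however.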

Aside from this negative, our machine learning pipeline is quite successful and now you know how to create an artificial neural network for churn modeling! Thanks for reading my article and be sure to leave any feedback in the comments!

Anirudh Pai