Multiple Linear Regression

What is Multiple Linear Regression?

Last time, we discussed a model known as simple linear regression, which is designed to find a relationship between a single independent variable and a single dependent variable. We explained how this model could be used to estimate the profit margin of a lemonade stand when given the average temperature of a certain day.

But what if we also wanted to include the day of the week in our model?

This is where multiple linear regression comes in.

A multiple linear regression model is able to analyze the relationship between several independent variables and a single dependent variable; in the case of the lemonade stand, the model would analyze the effect of both the day of the week and the temperature on the profit margin.

How does Multiple Linear Regression Work?

Model Representation

Much like simple linear regression, multiple linear regression works by changing parameter values to reduce cost, which is the degree of error between the model’s predictions and the values in the training dataset. With simple linear regression, we had two parameters that needed to be tuned: b_0 (the y-intercept) and b_1 (the slope of the line).

With multiple linear regression, however, we could have any number of parameters. Let’s take a look at multiple linear regression’s equation to visualize this.

y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n

Equation of the hyperplane

Let’s break this down into its various components:

  • y represents the dependent variable

  • b_0 represents the dependent variable axis intercept (this is a parameter that our model will optimize)

  • n signifies the number of independent variables in our dataset

  • x_1 through x_n are the independent variables in our dataset

  • The variables b_1 through b_n are coefficient parameters that our model will also tune
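To make this equation concrete, here is a minimal sketch of how a single prediction would be computed once the parameter values are known. The numbers below are made up purely for illustration and are not part of the lemonade stand example.

import numpy as np

# Hypothetical parameter values: b_0 (the intercept) and b_1 ... b_3 (the coefficients)
b_0 = 2.0
b = np.array([1.5, -0.3, 4.0])

# One observation's independent variables: x_1 ... x_3
x = np.array([10.0, 25.0, 3.0])

# y = b_0 + b_1x_1 + b_2x_2 + b_3x_3
y = b_0 + np.dot(b, x)
print(y)  # 2.0 + 15.0 - 7.5 + 12.0 = 21.5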

As we can see, this isn’t just a simple equation of a line. It’s actually an equation of a hyperplane.

So what exactly is a hyperplane?

A hyperplane is essentially the equivalent of a line of best fit for data in three or more dimensions. If we have two independent variables and one dependent variable, the hyperplane would look like this:

A sample hyperplane (image from MyLearningsInAiMl)

In this sample representation, the two horizontal axes represent the independent variables while the vertical axis represents the dependent variable.

So, the regressor tries to create an equation of a hyperplane that best represents the training data it is given.

This means that the regressor will have to try out several different equations to see which hyperplane best fits the data.

But how does it determine how “well” a hyperplane represents the training set?

The Cost Function

It does this by calculating a metric known as cost, which is the degree of error between the hyperplane’s values and those of the training dataset. The cost can be calculated by many different formulas, but the one that linear regression uses is known as the multivariate Mean Squared Error (MSE) Cost Function. By determining how well a hyperplane represents the data, the regression model can tune the values of the parameters (i.e., b_0, b_1, etc.) and optimize accuracy.
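As a rough illustration (the dedicated MSE article mentioned below covers the actual derivation), here is a minimal sketch of how an MSE cost could be computed in code, assuming NumPy arrays of actual and predicted values:

import numpy as np

def mse_cost(y_true, y_pred):
    # Average of the squared differences between the hyperplane's
    # predictions and the actual values in the training dataset
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 9.0])
y_pred = np.array([2.5, 5.5, 8.0])
print(mse_cost(y_true, y_pred))  # (0.25 + 0.25 + 1.0) / 3 = 0.5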

I wrote another article dedicated to both univariate and multivariate MSE, which I highly suggest you check out since MSE is a foundational concept in regression algorithms. Once you finish reading that, you can come back right away and continue with this one!

Okay, so now that the model has the error for its hyperplane, it can tune all the parameter values to reduce the cost.

But how does it do this?

This is where a process called gradient descent comes into play.

Gradient Descent

Gradient descent is a complex algorithm, but it is necessary to learn how it works. Because of its complexity, I have written a whole other article solely dedicated to explaining the intuition and mathematics behind gradient descent. You can check out the article here. Please read the article before proceeding with this one.
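For a rough sense of what gradient descent is doing in the multivariate case, here is a heavily simplified sketch of a single parameter update, assuming the MSE cost described earlier (the dedicated article derives these gradients properly and covers the full training loop):

import numpy as np

def gradient_step(X, y, b_0, b, learning_rate=0.01):
    # X is an (m, n) array of independent variables; y is an (m,) array of targets
    m = len(y)
    predictions = b_0 + X.dot(b)          # the hyperplane's current predictions
    errors = predictions - y
    # Gradients of the MSE cost with respect to the intercept and each coefficient
    grad_b_0 = (2 / m) * errors.sum()
    grad_b = (2 / m) * X.T.dot(errors)
    # Nudge every parameter a small step against its gradient
    return b_0 - learning_rate * grad_b_0, b - learning_rate * grad_b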

After completing gradient descent, our algorithm will have finalized all the parameter values and created the equation of the optimal hyperplane. This equation can then be used to predict future values.

Predicting

Let’s say that our model was trained on a dataset with two independent variables. This would mean that our hyperplane equation would be in this form:

y = b_0 + b_1x_1 + b_2x_2

After gradient descent, our model would have converged on three values for b_0, b_1, and b_2. Let’s say that:

  • b_0 converged to 11

  • b_1 converged to -6

  • b_2 converged to 9

This means that our finalized hyperplane equation looks like this:

y = 11 - 6x_1 + 9x_2

But how can this be used to predict future values? As you’ve probably already guessed, all we need to do is plug in the values of the independent variables into x_1 and x_2. Let’s assume some more values just to work through this process:

  • x_1 is equal to 350

  • x_2 is equal to 46

Now we simply plug our numbers into the finalized hyperplane equation and the resulting value will be the estimated value of our model! Let’s look at what this simplifies to:

y = 11 - 6(350) + 9(46) = 11 - 2100 + 414 = -1675

So our model has returned an estimated value of -1675 for the dependent variable!
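If you’d like to double-check that arithmetic in code, the same calculation looks like this (using the assumed parameter and input values from above):

b_0, b_1, b_2 = 11, -6, 9
x_1, x_2 = 350, 46

y = b_0 + b_1 * x_1 + b_2 * x_2
print(y)  # 11 - 2100 + 414 = -1675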

Recap

Let’s quickly recap how a multiple linear regression model works:

  • The Mean Squared Error (MSE) cost function is used to measure the error of the hyperplane

  • The parameters (b_0 - b_n) are tuned using gradient descent

  • Independent variables are plugged into the finalized equation to estimate a value for y

Now for the really exciting part—coding our very own regressor in Python!

Implementation of Multiple Linear Regression in Python

Library Installation

Now that we’ve talked about the intuition behind multiple linear regression, it’s time to implement the model in code.

Note: The dataset used in this article was downloaded from superdatascience.com. For convenience, all the code and data for this section of the article can be found here.

Before we do this, however, we must install three important libraries: Scikit-Learn, Pandas, and Numpy.

  • Scikit-Learn is a machine learning library that provides machine learning algorithms to perform regression, classification, clustering, and more.

  • Pandas is a Python library that helps in data manipulation and analysis, and it offers data structures that are needed in machine learning.

  • Numpy is another library that makes it easy to work with arrays. It provides several unique functions that will help in data preprocessing.

Fortunately, these libraries can be quickly installed using Pip, Python’s default package-management system. All we have to do is enter the following lines of code into the terminal:

pip3 install numpy
pip3 install pandas
pip3 install scikit-learn

After this is complete, we can begin coding our algorithm in Python!

Step 1: Data Preprocessing

Let’s start by importing the two main libraries that we will use for this project: Numpy and Pandas. This can be done by typing the following lines of code into our Python file.

import numpy as np 
import pandas as pd

Now, we have to import the main dataset that we are going to be using. To do this, we’ll have to use Pandas’s .read_csv function to read the file and convert it into a Pandas DataFrame.

dataset = pd.read_csv('50_Startups.csv')

Let’s take a quick look at our dataset before we proceed.
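A quick way to do this is to print the first few rows of the DataFrame with Pandas’ head method:

print(dataset.head())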

The task is to predict a company’s profit given several predictors: R&D Spend, Administration, Marketing Spend, and State. The first four columns are the independent variables, while the last column, Profit, is the dependent variable.

We need to split our dataset into two different arrays: one for independent variables and one for the dependent variable. We can index the first four columns to create a Numpy array for the independent variables and the last column to create a Numpy array for the dependent variable.

x = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, -1].values

As always, we’ll have to clean the data to ensure that we can input it into the regressor. The first step is to look at the dataset and determine which predictors will need to be cleaned. Let’s take another look at the dataset:

Sample from the dataset

Fortunately, most of our variables are already numerical values, so we won’t need to do much data preprocessing. The one thing we do have to do, however, is categorically encode our state predictor. You’ll notice that there are three categories for this variable:

  • Florida

  • New York

  • California

This means that we’ll have to encode these variables by using Scikit-Learn’s OneHotEncoder and ColumnTransformer. Let’s get right to it!

First, we need to import and instantiate both of these classes. This can be done as shown below.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
encoder = OneHotEncoder(drop='first', dtype=int)
ct = ColumnTransformer([('categorical_encoding', encoder, [3])], remainder='passthrough')

The drop argument in OneHotEncoder ensures that we avoid the dummy variable trap, which occurs when two or more variables have high collinearity (which means that they are highly interrelated). When drop='first' is specified, OneHotEncoder gets rid of the first encoded column, thereby reducing collinearity.

Now all we have to do is apply ct (our instance of ColumnTransformer) on our dataset x by using the fit_transform method. This will categorically encode the fourth column of our dataset (as specified through the column index [3] in the transformer).

x = ct.fit_transform(x)
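If you want to verify what the transformation did, you can print the first row of the transformed array. The two 0/1 dummy columns for State now appear at the front, followed by the passthrough numeric columns:

print(x[0])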

Great! Now we’re done with data preprocessing. We have one last thing to do before we begin to train our regressor; we need to split our dataset into train and test subsets.

Step 2: Creating Train and Test Datasets

To split our x and y arrays even further, we can use another function from Scikit-Learn called train_test_split. This will split the rows of our dataset into two categories: train or test. It will create four separate arrays based on these splits: x_train, x_test, y_train, and y_test.

Let’s start by importing the function as shown below.

from sklearn.model_selection import train_test_split

Now we simply have to call the function with the necessary arguments.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25, random_state=0)

The first two arguments are simply the arrays that we want to split. The test_size argument specifies the percentage of your data that you want to set aside to test your model; in this case, we choose 25 percent. Finally, the random_state argument is simply a seed for the random number generator, and we specify a seed of 0. The random_state argument is only necessary if you want the train and test datasets to be split the same way each time.
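As a quick sanity check, you can print the shapes of the resulting arrays; with test_size=.25, roughly three quarters of the rows should land in the training arrays and one quarter in the test arrays:

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)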

Now we can finally create, train, and test our linear regressor!

Step 3: Creating and Training the Linear Regressor

Scikit-Learn’s LinearRegression class actually also works for several independent variables. That means that we don’t have to do anything differently than when we created our simple linear regression model.

Let’s import and instantiate the class as shown below.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

Now all we have to do is train the model on our training dataset. This can be done by using LinearRegression’s .fit method. We simply have to call the method and pass in the datasets that we want to train our regressor on.

regressor.fit(x_train, y_train)

Now our regressor is trained and ready to predict future values.

Step 4: Testing the Linear Regressor

To test the regressor, we need to use it to predict on our test data. We can use our model’s .predict method to do this.

predictions = regressor.predict(x_test)

Now the model’s predictions are stored in the variable predictions, which is a Numpy array. But how do we measure the performance of our model? Like with simple linear regression, we need to use the R Squared Metric.

However, this time we have several independent variables, which means that we can’t use this metric directly. This is because the R Squared metric has a drawback: each time you add an independent variable, the metric’s value can only stay the same or increase, even if the new variable adds no real predictive power; this leads to a performance rating that is inaccurately high.

To tackle this obstacle, we must manually implement an Adjusted R Squared metric since Scikit-Learn does not provide a function to do this. The formula for Adjusted R Squared is below.

Adjusted R² = 1 - ((1 - R²)(N - 1)) / (N - k - 1), where N is the number of data points and k is the number of independent variables

Source: https://www.statisticshowto.com/adjusted-r2/

First we need to find R Squared by using Scikit-Learn’s r2_score function. Then, we need to plug the R Squared value into the formula above to get Adjusted R Squared.

Let’s start by finding R Squared.

from sklearn.metrics import r2_score
r_squared = r2_score(y_test, predictions)

Now we need to find the number of data values in our test dataset. This can be done by using len() to find the number of rows in x_test.

N = len(x_test)

Now we need to find the number of predictors in x_test. Originally, there were 4 predictors in our dataset, but after categorical encoding we had 6 (the State column became three separate columns of 0s and 1s). However, we must remember that we dropped the first of these columns to avoid the dummy variable trap. So, there are 5 predictors in our dataset.

k = 5
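If you’d rather not hard-code this number, you can also read it directly off the feature array, since x_test already reflects the encoding and the dropped dummy column:

k = x_test.shape[1]  # 5: two dummy columns for State plus the three numeric predictors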

Now all we have to do is plug in these numbers into the formula for adjusted R Squared. Let’s take a look at the formula once more.

Adjusted R² = 1 - ((1 - R²)(N - 1)) / (N - k - 1)

Source: https://www.statisticshowto.com/adjusted-r2/

Now, let’s implement the formula manually through code.

adjusted_r_squared = 1 - ((1 - r_squared) * (N - 1)) / (N - k - 1)

print(f'The adjusted R score of our model is: {adjusted_r_squared}')

We’re done with our program! Now, if we hit run, we’ll receive an Adjusted R Squared score above 0.7 for this dataset, which is a pretty good result for a multiple linear regression model!

And that’s it: we’re done learning about multiple linear regression and its implementation in Python!

I hope you enjoyed my article, and as always, please feel free to leave any feedback you have for me in the comments. If you liked this article, please stay tuned for my next tutorial, which I will be publishing soon.

Thanks for reading!
