Mean Squared Error Cost Function

What Is a Cost Function?

A cost function is a mathematical formula that allows a machine learning algorithm to measure how well its model fits the given data.

A cost function returns an output value, called the cost, which is a numerical value representing the deviation, or degree of error, between the model representation and the data; the greater the cost, the greater the deviation (error).

Thus, an optimal machine learning model would have a cost close to 0.

There are many different cost functions that are used by modern machine learning algorithms, but one of the most popular is known as the Mean Squared Error (MSE) Cost Function.

In this article, we’ll be talking about the MSE Cost Function by using simple and multiple linear regression algorithms as examples.

We’ll first break down the formula for both single and multiple independent variables, and then work through examples so that you can attain a better understanding of the algorithm in practice.

How Does The MSE Cost Function Work?

Univariate MSE

Let’s start by breaking down MSE with one variable. To do this, we’ll need to take a look at the formula for univariate MSE and break down each of the variables and their purposes.

$$J(b_0, b_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2$$

That formula may look a little intimidating at first, but when broken down, it’s really not all that complex. Let’s look at this graph so we can better visualize this:

Scatterplot of the training dataset with the regressor’s line of best fit

This is a scatterplot of the training dataset I showed before, and the blue line represents the regressor’s line of best fit.

  • y-hat (written ŷ, a y with a small caret over it) is a variable used in statistics to represent the predicted value of our model when training.

  • y is the variable that represents the actual value provided in the training dataset.

  • The subscript i on y and y-hat signifies the i’th point in the dataset.

  • m is the number of data points in our dataset.

  • J(b_0, b_1) is the cost, which we will discuss now.

When training, our goal is to make this error as small as possible across all the points in the dataset, and the Mean Squared Error Cost Function is how we measure it.

But how is the cost actually computed?

To answer this question, we need to talk about the math behind the formula. But fear not: when broken down, it isn’t all that complex.

  1. The regressor takes the difference between y-hat and y for every point in the dataset and squares it.

  2. The model takes the sum of all these squared differences and divides it by m, the number of data points.

  3. We divide this by 2 for mathematical convenience when finding the partial derivative of the cost function (don’t worry about this right now). This final value is our cost! See the code sketch after this list.
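Here’s a minimal sketch of these three steps in plain Python. The function name mse_cost and the toy data points are my own for illustration, not from any particular library:

```python
def mse_cost(b0, b1, xs, ys):
    """Univariate MSE cost J(b0, b1) for the line y_hat = b0 + b1 * x."""
    m = len(xs)                        # number of data points
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = b0 + b1 * x            # the model's prediction for this point
        total += (y_hat - y) ** 2      # step 1: squared difference
    return total / (2 * m)             # steps 2 and 3: divide the sum by 2m

# Toy example (made-up numbers): three points lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
print(mse_cost(0.0, 2.0, xs, ys))      # about 0.01: a small cost, a good fit
```

Notice that a lower cost comes straight from smaller gaps between each prediction y-hat and its actual value y.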

Note: Once again, remember that we used linear regression as an example for this section. However, the same intuition is used when we apply MSE in other machine learning algorithms such as polynomial regression or decision tree regression.

Now that we’ve learned about univariate MSE, let’s take a look at the slightly more advanced multivariate MSE.

Multivariate MSE

Fortunately, multivariate MSE builds directly on its univariate counterpart, so there are only a few changes that we have to learn about.

Once again, let’s take a look at the formula and break it down.

$$J(b_0, b_1, \ldots, b_n) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = b_0 + b_1 x_{i,1} + \cdots + b_n x_{i,n}$$

You’ll notice that the cost function formulas for simple and multiple linear regression are almost exactly the same.

The only difference is that the cost function for multiple linear regression takes into account an arbitrary number of parameters (the coefficients for the independent variables, plus the intercept). Let’s break down this formula like we did for simple linear regression.

  • y-hat is the predicted value of the model. In other words, it represents the value of the hyperplane at certain independent variable values.

  • y is the actual dependent value in the training dataset at certain independent variable values.

  • The subscript i on y and y-hat signifies the i’th point in the training dataset.

  • m is the total number of data points in the dataset.

  • J(b_0, b_1, … ,b_n) is the cost.

Let’s visualize this formula with a regression hyperplane in three dimensions to see what it does.

3D scatterplot of data points with the regression hyperplane (image modified; original from Datacadamia)

As we can see, y_i points to the value of the data point, and y_i-hat points to the value of the hyperplane. Our cost function is designed to calculate the average degree of error between all the data points and the predicted value of the hyperplane.

Now that we’ve visualized our cost function, let’s discuss the mathematics behind it.

Now, the algorithm repeats the same steps as it did for univariate MSE!

For your convenience, I have listed the steps once again below:

  1. The regressor takes the difference between y-hat and y for every point in the dataset and squares it.

  2. The model takes the sum of all these squared differences and divides it by m, the number of data points.

  3. We divide this by 2 for mathematical convenience when finding the partial derivative of the cost function (don’t worry about this right now). This final value is our cost! A vectorized sketch of these steps follows below.
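Just as before, here’s a minimal sketch of these steps, this time vectorized with NumPy so it handles any number of independent variables. The function name mse_cost_multi and the toy arrays are again my own for illustration:

```python
import numpy as np

def mse_cost_multi(b, X, y):
    """Multivariate MSE cost J(b_0, b_1, ..., b_n).

    b: shape (n + 1,), the intercept b[0] followed by the n coefficients.
    X: shape (m, n), one row of independent variable values per data point.
    y: shape (m,), the actual dependent values.
    """
    m = X.shape[0]
    y_hat = b[0] + X @ b[1:]                   # hyperplane value at each point
    return np.sum((y_hat - y) ** 2) / (2 * m)  # sum of squares over 2m

# Toy example (made-up numbers): m = 3 points, n = 2 independent variables.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([5.0, 3.0, 6.0])
b = np.array([1.0, 1.0, 1.0])                  # b_0 = 1, b_1 = 1, b_2 = 1
print(mse_cost_multi(b, X, y))                 # about 0.333
```

The loop from the univariate version is gone: the matrix product X @ b[1:] computes every prediction at once, but the value returned is exactly the same cost.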

Great! We’re done learning about the Mean Squared Error Cost Function. If you liked this article, make sure to check back for my upcoming tutorials regarding machine learning algorithms. As always, feel free to leave any feedback you have for me in the comments so that I can improve for my next articles.
