Simple Linear Regression

What is Simple Linear Regression?

This article walks through the workings of simple linear regression, a basic machine learning model designed to predict a continuous (non-categorical) output value given a set of input data.

But what does all this nonsensical jargon really mean? Let me break it down.

Suppose you were running a lemonade stand, and you wanted to predict how much profit you would make on a given day. One factor that you might want to account for is the temperature outside, because the hotter it is, the more likely you are to make a larger profit.

A simple linear regression model takes into consideration the temperature, and after some “magic” it returns an output value: the profit. In other words, it finds the relationship between an independent and dependent variable to make future predictions.

How Does Simple Linear Regression Work?

Model Representation

A simple linear regression algorithm tries to find a linear relationship between two variables. The two variables in the lemonade stand scenario I described before would be the temperature (the independent variable x) and the profit (the dependent variable y).

But how does a regressor (regression model) find the relationship between two variables?

It does this by analyzing past data that shows how the independent variable correlates with the dependent variable. The dataset a machine learning model uses to find a mathematical relationship between variables is called the training dataset.

So, in order to build a linear regression model for our lemonade stand, we need to provide it with training data showing a correlation between temperature and profit margin. Take this sample training dataset, for instance:

Temperature is the independent variable, and profit is the dependent variable

This is where some math comes in. 

We already know that a mathematical function is one that takes in an input and returns an output value. This same concept is applied in linear regression. The regressor will attempt to create an equation of a line that best represents the data it is given. In other words, it tries to create a line of best fit. 

The equation for the line of best fit is: y = b_0 + b_1 * x_1

Let’s break this equation down: 

  • y represents the dependent variable, or in our case, the profit

  • x_1 represents the independent variable, which, in our case, is the temperature

  • b_0 is a parameter (the model will try and assign a constant value to it)

  • b_1 is another parameter which will be “tuned” by the model
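Putting this together, the line of best fit is just a simple function of one input. Here's a minimal sketch of it in Python; the parameter values passed in below are arbitrary placeholders for illustration, not anything the model has learned:

def predict(x_1, b_0, b_1):
    # Line of best fit: y = b_0 + b_1 * x_1
    return b_0 + b_1 * x_1

# Arbitrary placeholder parameters, purely for illustration
print(predict(85, 5, 0.5))  # estimated profit on an 85ºF day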

If you’re a high schooler reading this, you most likely recognized that this equation is the same as y = mx + b, the slope-intercept form of a line. This makes sense, because the line of best fit is essentially just a line which best represents the data. Now let’s talk about how a regressor can determine whether a certain line is the most accurate representation of the data.

The Cost Function

Since the line of best fit won’t go through all the points (usually), there will be a certain degree of error, called the cost, between the model and the training dataset. The regression algorithm fine-tunes the equation by assigning numerical values to b_0 and b_1 in an attempt to minimize the cost as much as possible. In order to find the best values for b_0 and b_1, the linear regressor will use a mathematical formula known as the univariate Mean Squared Error (MSE) Cost Function.

Since MSE is used in so many different machine learning algorithms today, I decided to create another article solely dedicated to teaching it. Please make sure to check it out, as MSE is a fundamental concept when studying regression algorithms.

By using univariate MSE, our algorithm will be able to determine whether certain parameter values work better than others, and to what extent.
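To make this concrete, here is a rough NumPy sketch of what the univariate MSE cost measures for one pair of parameter values. It is only illustrative; the exact formula (for example, whether a factor of 1/2 is included) is covered in the MSE article mentioned above:

import numpy as np

def mse_cost(b_0, b_1, x, y):
    # Predictions of the current line for every training example
    y_hat = b_0 + b_1 * x
    # Average of the squared differences between predictions and actual values
    return np.mean((y - y_hat) ** 2)

A lower cost means the line described by those parameter values fits the training data more closely.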

But how will a linear regression model be able to converge upon optimal parameter values?

Gradient Descent

It does this through a process known as Gradient Descent. This algorithm is really important to learn about, but it is also quite complex. Because of this, I have written another article dedicated to explaining the mathematics and intuition behind gradient descent. I highly recommend reading it before proceeding with simple linear regression.

You will notice that I have broken down gradient descent into two types: univariate and multivariate. Since simple linear regression deals with only one independent variable, it utilizes univariate gradient descent.
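As a rough idea of what univariate gradient descent does, the sketch below repeatedly nudges b_0 and b_1 in the direction that lowers the MSE cost. The learning rate and iteration count are arbitrary, and this is only for intuition (Scikit-Learn's LinearRegression, which we use later, solves for the parameters directly with ordinary least squares rather than running gradient descent):

import numpy as np

def univariate_gradient_descent(x, y, learning_rate=0.0001, iterations=1000):
    b_0, b_1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        y_hat = b_0 + b_1 * x
        # Partial derivatives of the MSE cost with respect to each parameter
        d_b0 = (-2 / m) * np.sum(y - y_hat)
        d_b1 = (-2 / m) * np.sum((y - y_hat) * x)
        # Step both parameters downhill along the cost surface
        b_0 -= learning_rate * d_b0
        b_1 -= learning_rate * d_b1
    return b_0, b_1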

After completing univariate gradient descent, our algorithm will have reached values of b_0 and b_1 which it believes minimize the total cost.

Predicting

Now that we know how our linear regressor trains itself by using gradient descent, let’s talk about how it makes new predictions!

Let’s say that, after training, our algorithm has finalized its equation for the line of best fit as shown below:

Finalized line of best fit equation: y = -28 + 0.47 * x_1

As we can see, the linear regression model assigned final values for both b_0 and b_1:

  • b_0 was given the value -28

  • b_1 was given the value 0.47

So, with this equation, how can the linear regressor estimate the value of the dependent variable when given the independent variable?

The answer is simple: all it does is plug the value of the independent variable into x_1.

For example, let’s say that the regressor was tasked with predicting the profit margin from the lemonade stand on a day with an average temperature of 90ºF. The regressor would substitute 90 for x_1 as shown below:

Equation after substituting 90 for x_1: y = -28 + 0.47 * 90

And then it would simplify the right hand side of the equation to reach a final value for y.

Equation after simplifying: y = -28 + 42.3 = 14.3

From this, we can see that the regressor produced 14.3 as a final value for y, which, in our case, represents the profit margin from the lemonade stand.
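In code, that prediction is nothing more than plugging a number into the equation:

b_0, b_1 = -28, 0.47
temperature = 90
profit = b_0 + b_1 * temperature
print(round(profit, 2))  # 14.3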

Recap

To summarize the intuition behind simple linear regression:

  • A simple linear regression algorithm is designed to find a linear relationship between two variables given a training dataset.

  • The regressor assigns values to b_0 and b_1 that minimize the Mean Squared Error Cost Function. The process by which it arrives at these values is known as gradient descent.

  • After finalizing the equation of the line of best fit, the regressor will predict the value of y by plugging in the value of the independent variable into x_1.

Implementation of Simple Linear Regression in Python

Library Installation

Now that we’ve talked about the intuition behind simple linear regression, it’s time to implement the model in code.

Before we do this, however, we must install four important libraries: Scikit-Learn, Pandas, Numpy, and Matplotlib. 

  • Scikit-Learn is a machine learning library that provides machine learning algorithms to perform regression, classification, clustering, and more.

  • Pandas is a Python library that helps in data manipulation and analysis, and it offers data structures that are needed in machine learning.

  • Numpy is another library that makes it easy to work with arrays. It provides several unique functions that will help in data preprocessing.

  • Matplotlib is a plotting library that will help us visualize our linear regressor’s line of best fit in correlation to the data.

Fortunately, these libraries can be quickly installed by using Pip, Python’s default package-management system. All we have to do is enter the following lines of code into the terminal:

pip3 install scikit-learn
pip3 install pandas
pip3 install numpy
pip3 install matplotlib

After this is complete, we can begin coding our algorithm in Python.

Step 1: Importing the Data

Note: To train the linear regressor, I’m using a great dataset from superdatascience.com. The dataset and my code can be found in my GitHub repository which I have linked here.

As with any Python program, we will begin implementing the linear regressor by importing all the major libraries that we will use.

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

The dataset that I’m using contains two columns: Experience (in years) and Salary. Our goal is to create a linear regression model that can predict an employee’s salary at a certain company given the number of years of experience they already have. Fortunately, this dataset does not contain any words or missing values, and thus, we do not need to perform extensive data cleaning on it.

However, we still need to create a Pandas data frame of our dataset so that we can feed it into our linear regressor.

Sample from the training dataset

As we can see, the first column contains employees’ experience, and the second contains employees’ salary. An employee’s work experience directly affects how much they are paid at a company, which means that experience is our independent variable and salary is our dependent variable. 

As in math and science, the independent variable in machine learning is commonly referred to as X, while the dependent variable is referred to as Y.

We need to split our main dataset into two arrays: one for the experience, and one for the salary. This can be done as shown below:

dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values  # every column except the last (years of experience)
y = dataset.iloc[:, 1].values    # the second column (salary)
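If you want to sanity-check what was just loaded, you can print the first few rows of the data frame and the shapes of the two arrays:

print(dataset.head())    # first five rows of the raw data
print(x.shape, y.shape)  # x is a 2D column of features, y is a 1D array of targets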

And that’s it! The data preprocessing phase is complete. It is important to note that some datasets will require much more cleaning than others, so this code will change depending on what dataset you use for linear regression. Fortunately, Scikit-Learn provides us with a variety of functions to make data preprocessing easier and more efficient. You can read more about this here.

Step 2: Creating Train and Test Datasets

This step is essential to our program because it will allow us to test the accuracy of our model after we train it. We will split the datasets x and y into four more datasets: x_train, y_train, x_test, and y_test. The datasets ending with _train will be used by the model to create a line of best fit, and the datasets ending with _test will be used to test the model with an accuracy metric. 

Once again, we can complete this step with the help of the libraries that we have already imported. To create the train and test datasets, we can use the train_test_split function from Scikit-Learn as shown below:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

As you can see above, train_test_split has a few arguments that we must input in order to ensure that our data is split correctly. The first two arguments are just the datasets that we want to use for splitting. The argument test_size signifies the portion of our dataset that will be devoted to testing our model: x_test and y_test. Finally, the last argument random_state takes in a random-generator seed so that train_test_split splits the dataset the same way each time we run the program.
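You can quickly confirm that roughly one third of the rows ended up in the test sets:

print(len(x_train), len(x_test))  # e.g. a 30-row dataset splits into 20 training rows and 10 test rows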

Step 3: Creating and Training the Linear Regressor

Finally, the part which we’ve all been waiting for: training our machine learning model!

To create and train our linear regressor, we first create a LinearRegression object and assign it to a variable, and then call its .fit method to train it on our dataset.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

The second line of this code creates the regressor object, and the third line “fits” it to the datasets x_train and y_train. However, we are not done building the model yet. We still need to measure the accuracy of our model by testing it with the test datasets we created before.
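As a side note, this is where the b_0 and b_1 parameters from earlier live after training: Scikit-Learn stores the intercept in regressor.intercept_ and the slope in regressor.coef_.

print(regressor.intercept_)  # b_0, the y-intercept of the line of best fit
print(regressor.coef_)       # b_1, the slope (an array containing a single value here)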

Step 4: Testing the Linear Regressor

This step will be just as easy as the last few, because all we need to do is use the regressor’s .predict method on the test datasets we created.

predictions = regressor.predict(x_test)

After executing this line, the variable predictions will contain the regressor’s estimated values for the inputs provided in x_test. The R Squared Accuracy Metric can now be used to measure how well the linear regressor performed after being trained.

from sklearn.metrics import r2_score
score = r2_score(y_test, predictions)
print(score)

The R squared accuracy metric generally ranges from 0 to 1: the closer the metric is to 1, the more accurate the model is.

Note: r2_score is not an accurate measure of the model’s accuracy when there is more than one independent variable. In most cases, there will be more than one independent variable in the dataset, and we will need to use the Adjusted R Squared Metric, which I will talk more about in my next article. 

And that’s it! We’re done creating the linear regressor. After you run all the above lines of code, you should get an R Squared score as output, which will tell you how well your linear model represents the data.

After I ran the above lines of code, I received an R Squared score of approximately 0.976, which is very good! If you’re not getting around the same value, I recommend going back and making sure that you have followed all the steps.
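If you’d like a quick sanity check beyond the single R Squared number, you can also line up a few of the model’s predictions against the actual salaries from the test set:

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
print(comparison.head())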

Even though we have completed our linear regression model, there is still one optional step which we can complete—visualizing the results.

Step 5: Visualizing the Results

This is arguably one of the most exciting parts of building machine learning models since you get the opportunity to see all your work come together. For this portion, we’ll be using the MatPlotLib library that we installed in the beginning of this article.

In order to visualize our data, we need to first plot our train and test data points on two different graphs, and then trace the model’s line of best fit over them. Fortunately, we can implement this quite simply through code. This one-line piece of code uses MatPlotLib’s .scatter method to create a scatter plot of the data points in the training dataset:

plt.scatter(x_train, y_train, color='red')

The .scatter method takes in two main parameters: the x and y-values of the data points. As we can see, we passed in the values of the training data points with x_train and y_train. The third argument we passed just signifies the color that the data points will be.

Now all we have to do is trace the model’s line of best fit through the scatter plot, which can be done as shown below:

plt.plot(x_train, regressor.predict(x_train), color='blue')

Let’s think about what we just did there for a second. Like with the .scatter method, .plot takes in the x and y-values of the data points. Unlike the .scatter method, however, .plot will then draw a line through all the data points.

Like before, we passed in x_train as the first argument. Since we want to visualize our model’s line of best fit, however, we want to run our model on x_train and pass in its predictions into the y-value argument of .plot. This is what we did in our code above.

We can also add labels to the graph axes and create a graph title as shown below. After setting up the graph, we can run plt.show() to create the graph in a new window.

plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

But, we still need to create a scatter plot of our test dataset. Fortunately, this can be achieved using the same intuition we used for the training dataset.

plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, predictions, color='blue')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Here, I’m using predictions to plot the line of best fit since it already holds the model’s predicted values for the test dataset.

Now all we have to do is click run and we’ll get two graphs of our model’s line of best fit and the data points in the training and test datasets.

This is the graph of the training dataset with the model’s line of best fit

As you can see, the model’s line of best fit goes through the training data points very well! However, we should also look at how well our model’s line of best fit represents the test data.

This is the graph of the test set with the model’s line of best fit.

As you can see, the line of best fit is the same in both graphs, which makes sense, since we’re running the same model on both of the datasets.

And that’s it! We’ve finally completed all the steps to creating a simple linear regression model in Python!

Please feel free to leave feedback in the comments so that I can fix any errors and improve for my next article.

The next algorithm I will write about is multiple linear regression, a machine learning algorithm designed to find a linear relationship between several independent and dependent variables in order to produce an estimated output.
