7. Simple Linear Regression#

In this assignment, we will use TV advertising spend to predict Sales with a simple linear regression model.

7.1. Preparations#

This includes importing libraries and the dataset, data cleaning, and data visualization.

7.1.1. Importing libraries and dataset#

# Import the numpy and pandas packages
import numpy as np
import pandas as pd

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns
advertising = pd.read_csv("https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/ml-fundamental/advertising.csv")
advertising.head()
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9

7.1.2. Data Inspection#

advertising.shape
(200, 4)
advertising.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB
advertising.describe()
TV Radio Newspaper Sales
count 200.000000 200.000000 200.000000 200.000000
mean 147.042500 23.264000 30.554000 15.130500
std 85.854236 14.846809 21.778621 5.283892
min 0.700000 0.000000 0.300000 1.600000
25% 74.375000 9.975000 12.750000 11.000000
50% 149.750000 22.900000 25.750000 16.000000
75% 218.825000 36.525000 45.100000 19.050000
max 296.400000 49.600000 114.000000 27.000000

7.1.3. Data Cleaning#

# Check for null values (as a percentage of all rows)
advertising.isnull().sum()*100/advertising.shape[0]
TV           0.0
Radio        0.0
Newspaper    0.0
Sales        0.0
dtype: float64
# Outlier Analysis
fig, axs = plt.subplots(3, figsize = (5,5))
sns.boxplot(advertising['TV'], orient='h', ax = axs[0])
sns.boxplot(advertising['Newspaper'], orient='h', ax = axs[1])
sns.boxplot(advertising['Radio'], orient='h', ax = axs[2])
plt.tight_layout()
[Figure: horizontal boxplots of the TV, Newspaper, and Radio advertising budgets]
# Let's see how Sales is related to the other variables using scatter plots.
sns.pairplot(advertising, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()
[Figure: scatter plots of Sales against TV, Newspaper, and Radio]
# Let's see the correlation between different variables.
sns.heatmap(advertising.corr(), cmap="YlGnBu", annot = True)
plt.show()
[Figure: annotated correlation heatmap of TV, Radio, Newspaper, and Sales]

As is visible from the pairplot and the heatmap, the variable TV seems to be most correlated with Sales. So let’s go ahead and perform simple linear regression using TV as our feature variable.
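To read the same comparison numerically, you can sort each feature's correlation with Sales; this reuses the same corr() call that produced the heatmap above.

# Correlation of every column with Sales, strongest first
advertising.corr()['Sales'].sort_values(ascending=False)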

7.2. Model Building#

In this section, you will need to fill in some code to train the linear regression model. Instructions and hints are provided wherever you need to fill in a blank.

We first assign the feature variable, TV in this case, to the variable X, and the response variable, Sales, to the variable y.

X = advertising['TV']
y = advertising['Sales']

7.2.1. Train-Test Split#

You now need to split the variables into training and testing sets. You'll perform this by importing train_test_split from the sklearn.model_selection library. It is usually good practice to keep 70% of the data in your train dataset and the remaining 30% in your test dataset.

from sklearn.model_selection import train_test_split
___, ___, ___, ___ = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
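If you get stuck, here is one possible completion of the blanks above; the four names follow the conventional return order of train_test_split and match the variables used in the cells below.

# 70% of rows go to the training set, 30% to the test set;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)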
X_train.head()
74     213.4
3      151.5
185    205.0
26     142.9
90     134.3
Name: TV, dtype: float64
assert X_train.iloc[0] == 213.4
assert X_train.iloc[1] == 151.5
assert X_train.iloc[2] == 205.0
assert X_train.iloc[3] == 142.9
assert X_train.iloc[4] == 134.3
y_train.head()
74     17.0
3      16.5
185    22.6
26     15.0
90     14.0
Name: Sales, dtype: float64
assert y_train.iloc[0] == 17.0
assert y_train.iloc[1] == 16.5
assert y_train.iloc[2] == 22.6
assert y_train.iloc[3] == 15.0
assert y_train.iloc[4] == 14.0
# Reshape the 1-D series into 2-D arrays of shape (n_samples, 1), as scikit-learn expects
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
LinearRegression()
# Predict Sales for the test set
y_pred = regressor.predict(X_test)

Visualize the fitted regression line against both the training and test data.

plt.scatter(X_train, y_train, color='red', label='Training data')
plt.scatter(X_test, y_test, color='blue', label='Test data')
plt.plot(X_train, regressor.predict(X_train), color='green', label='Regression line')
plt.title('Sales vs. TV advertising')
plt.xlabel('TV advertising')
plt.ylabel('Sales')
plt.legend()  
plt.show()
[Figure: training data (red), test data (blue), and the fitted regression line (green) for Sales vs. TV advertising]

7.3. Model Evaluation#

First, we need to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression). Let us plot a histogram of the error terms and see what it looks like.

# We need to calculate the difference between the true values and the predicted values.
# Be careful to distinguish between the training set and the test set!

gap = (___ - ___ ) 
fig = plt.figure()
sns.histplot(gap, bins = 15)
fig.suptitle('Error Terms', fontsize = 15)                  # Plot heading 
plt.xlabel('y_test - y_pred', fontsize = 15)                # X-label
plt.show()
[Figure: histogram of the error terms]
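One possible completion of the blank in the cell above: since these residuals are later plotted against X_test, gap should be the test-set error.

# Test-set residuals: true values minus predicted values
gap = (y_test - y_pred)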

Next, look for patterns in the residuals. Ideally, they should be scattered randomly around zero with no visible structure.

plt.scatter(X_test, gap)
plt.show()
[Figure: residuals plotted against X_test]

Second, calculate the MSE and RMSE; these metrics help us evaluate the performance of the linear regression model.
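As a reminder, for $n$ test samples with true values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$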

from sklearn.metrics import mean_squared_error

# calculate the mean squared error
mse = mean_squared_error(___,___)
print(f"Mean squared error: {mse:3.3} ")

# calculate the mean squared error as a percentage of the mean predicted value
mse_per = ___/np.mean(___)
print(f"Mean squared error percentage: {mse_per*100:3.3}%")

# calculate the root mean squared error
rmse = np.___(mean_squared_error(___, ___))
print(f"Root mean squared error: {rmse:3.3}")
Mean squared error: 4.08 
Mean squared error percentage: 27.4%
Root mean squared error: 2.02
assert mse == 4.077556371826948
assert mse_per == 0.27395402104113163
assert rmse == 2.019296008966231
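One possible completion of the blanks, consistent with the printed values and the assertions above:

# Mean squared error between the true and predicted test values
mse = mean_squared_error(y_test, y_pred)

# MSE expressed relative to the mean predicted value
mse_per = mse/np.mean(y_pred)

# RMSE is the square root of the MSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))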

Third, calculate the R-squared (coefficient of determination) on the test set.
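Recall that R-squared is the fraction of the variance of the true values that the model explains, where $\bar{y}$ is the mean of the true values:

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$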

from sklearn.metrics import r2_score
# Fill in the arguments of r2_score. Tip: choose from y_train, y_test, and y_pred
r_squared = r2_score(___, ___)
r_squared
0.7921031601245662
assert r_squared == 0.7921031601245662
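One possible completion of the blank: score the test-set predictions against the true test values, per the tip in the comment.

# R-squared on the test set
r_squared = r2_score(y_test, y_pred)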

7.4. Acknowledgments#

Thanks to Ashish for creating the open-source course Simple Linear Regression, which inspires the majority of the content in this chapter.