Linear Regression

By Chi Kit Yeung in Statistics Python Machine Learning

July 14, 2024

Regression analysis is used whenever we want to model a causal relationship between one or more variables and an effect. In this section of the course I learned how to find the linear regression equation as well as the statistical concepts behind it.

Introduction

It is a form of supervised machine learning where we try to predict a dependent variable, y, using one or more independent variables, x.

A linear regression is a linear approximation of a causal relationship between two or more variables.

  1. Get sample data
  2. Design a model that explains the data
  3. Use the model to make a prediction of the population

Assumptions

We must first check our dataset for the following assumptions before we can perform a linear regression.

1. Linearity

The relationship between the independent variable and the dependent variable is linear. This is the simplest (but important) assumption. The easiest way to check this is to plot both variables on a scatterplot and see if they exhibit a straight line pattern.

There are ways to fix non-linear relationships by applying transformations to one or more variables.

Fixes

  1. Run a non-linear regression
  2. Exponential transformation
  3. Log transformation
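As a small sketch of the transformation approach (the data below is made up for illustration), taking the logarithm of an exponential-looking dependent variable can straighten out the relationship:

import numpy as np
import pandas as pd

# Hypothetical data where y grows roughly exponentially with x
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [2.7, 7.4, 20.1, 54.6, 148.4]})

# After a log transformation, log(y) is approximately linear in x,
# so a linear regression of log(y) on x becomes appropriate
df['log_y'] = np.log(df['y'])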

2. No Endogeneity

This is also known as the ‘Omitted Variable Bias’. Omitted variable bias occurs when a relevant variable is left out of the model. This is reflected in the error term: the factor you forgot about ends up in the error, so the error is no longer random but contains a systematic part (the omitted variable).

A common fix is two-stage least squares (2SLS) regression.

3. Normality and Homoscedasticity

Normality

We can assume that the error term is normally distributed thanks to the central limit theorem, given that the sample size is large enough.

Zero Mean

The error term is also expected to have a mean of zero. Including a constant (intercept) in the regression takes care of this.

Homoscedasticity

This refers to a condition in which the variance of the residual, or error term, in a regression model is constant. Plotted, this looks like observations spread evenly around the regression line; heteroscedasticity, by contrast, shows up as a spread that widens or narrows along the line (a funnel shape).

Some ways to prevent heteroscedasticity:

  1. Look out for Omitted Variable Bias
  2. Remove outliers
  3. Log transformation
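To eyeball this assumption, a common trick is to plot the residuals against the fitted values and look for a funnel shape. A minimal sketch, assuming results is a fitted statsmodels OLS results object like the one created later in this post:

import matplotlib.pyplot as plt

# Residuals vs fitted values: an even band around zero suggests homoscedasticity,
# a funnel shape suggests heteroscedasticity
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, color='grey', lw=1)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()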

4. No Autocorrelation

Also known as ’no serial correlation’. Errors are assumed to be uncorrelated. What this means is that the value of an error should not be dependent on the value of another error. This type of correlation is very common in time-series data. This is the only assumption that cannot be relaxed.

Detection

  • Once again, autocorrelation may be detected with the help of a scatter plot. If the plot shows a pattern, then it may be autocorrelated.
  • Durbin-Watson statistic. The value falls between 0 and 4. A value less than 1 or higher than 3 suggests autocorrelation; a value around 2 is ideal.
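Statsmodels can compute the Durbin-Watson statistic directly from a fitted model's residuals; a minimal sketch, again assuming results is a fitted OLS results object:

from statsmodels.stats.stattools import durbin_watson

# Around 2 means no autocorrelation; values near 0 or 4 indicate strong autocorrelation
durbin_watson(results.resid)

Note that the statsmodels summary table shown later in this post also reports this value directly.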

5. No Multicollinearity

Multicollinearity refers to when two or more of the independent variables are strongly correlated with each other. Suppose the value of variable $a$ is correlated with $b$: there is little point in using both variables in the regression, since knowing one of them is roughly equivalent to knowing both. If variable $b$ can be represented by variable $a$, one of the variables is enough.

Detection

One approach is to find the correlation between each pair of variables; another is to compute the variance inflation factor (VIF) for each independent variable, as in the snippet below.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Select the independent variable columns to check (fill in your own column names)
variables = data[['independent_var_1', 'independent_var_2']]

# Compute the VIF for each independent variable
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif["features"] = variables.columns

Fixes

  1. Drop one of the variables
  2. Combine the variables into one (eg. average)
  3. Keep both with extreme caution

Simple Linear Regression

Linear regression equation

$$ \color{red}\hat{y}_{i}\color{black} = \color{green}b_{0}\color{black}+\color{blue}b_{1}\color{black}x_{1} $$

Linear equation $$ \color{red} y \color{black}= \color{blue} m\color{black}x + \color{green}b $$

The linear regression equation is very similar to the linear equation taught in algebra class, where the value of y is determined by the coefficient m times the variable x plus the constant b. Switching the positions around and renaming the variables a bit, we get the linear regression equation.

Correlation vs Regression

Correlation does not equate to causation!

Using StatsModel package

Python notebook: SimpleLinearRegression (statsmodel).

For this section I’ll use the sample real estate dataset provided by the course as an example. The dataset contains two columns ['price', 'size']. I’ll use this data to create a linear regression model that predicts a property’s price based on its size.

Workflow: Import libraries > Load the data > Declare variables > Fit the model

Packages Used

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the real estate dataset into a DataFrame
data = pd.read_csv('file.csv')

data.head()

        price     size
0  234314.144   643.09
1  228581.528   656.22
2  281626.336   487.29
3  401255.608  1504.75
4  458674.256  1275.46

Define your variables

# Dependent variable
y = data['price']

# Independent variable
x1 = data['size']

Find the Regression

# Add a constant column so the model includes an intercept (b0)
x = sm.add_constant(x1)

# Fit the ordinary least squares model and display the summary table
results = sm.OLS(y,x).fit()
results.summary()

OLS Regression Results:

Dep. Variable: price R-squared: 0.745
Model: OLS Adj. R-squared: 0.742
Method: Least Squares F-statistic: 285.9
Date: Sat, 20 Jul 2024 Prob (F-statistic): 8.13e-31
Time: 12:52:26 Log-Likelihood: -1198.3
No. Observations: 100 AIC: 2401.
Df Residuals: 98 BIC: 2406.
Df Model: 1
Covariance Type: nonrobust

Coefficients table:

            coef       std err        t     P>|t|      [0.025      0.975]
const   1.019e+05     1.19e+04    8.550     0.000    7.83e+04    1.26e+05
size     223.1787       13.199   16.909     0.000     196.986     249.371

Omnibus:           6.262    Durbin-Watson:       2.267
Prob(Omnibus):     0.044    Jarque-Bera (JB):    2.938
Skew:              0.117    Prob(JB):            0.230
Kurtosis:          2.194    Cond. No.            2.75e+03

With the information above we are now able to form the linear regression equation.

Look again at the equation. $$ \hat{y}_{i} = b_{0}+b_{1}x_{1} $$

Looking at the coefficients table, we have the constant (intercept) 1.019e+05 and the coefficient for our independent variable size, 223.1787. With these we can form the equation.

$$ \hat{y}_{i} = 101900 + 223.1787x_{1} $$

In other words,

$$ Price = 101900 + 223.1787 * Size $$

We can now use this equation to predict real estate prices based on size.

# Scatter plot of the observations
plt.scatter(x1, y)

# Regression line built from the coefficients in the summary table
yhat = x1*223.1787 + 101900
fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.xlabel('Size', fontsize=20)
plt.ylabel('Price', fontsize=20)
plt.show()

Evaluating the Model

The model’s summary table provides us with a bunch of different statistics. But what do they mean? First we can look at the std err (standard error) column of the coefficients table. It shows the accuracy of the coefficient estimate; the lower it is, the better. Next we have P>|t| (the p-value). It asks whether the independent variable is actually useful in predicting the dependent variable. As a rule of thumb, a p-value below 0.05 is considered significant.
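If we just want these numbers rather than the whole table, the statsmodels results object also exposes them directly; a quick sketch:

# p-values for the constant and each coefficient
results.pvalues

# standard errors of the coefficient estimates
results.bse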

Using Scikit Learn

Like with StatsModels, the workflow for creating a linear regression model with Scikit Learn is similar. The difference is that scikit-learn expects the features as a 2D array (a feature matrix), so a single feature has to be reshaped or selected as a one-column DataFrame rather than passed as a 1D Series.

Workflow: Import libraries > Load the data > Declare variables > Fit the model

from sklearn.linear_model import LinearRegression

Python notebook: SimpleLinearRegression (sklearn).
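A minimal sketch of the same simple regression in scikit-learn, assuming the same data DataFrame as before; the feature is selected as a one-column DataFrame so that it is 2D, which is the shape scikit-learn expects:

import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('file.csv')

# A 2D feature matrix (one column) and a 1D target
x_matrix = data[['size']]
y = data['price']

# Create and fit the model
reg = LinearRegression()
reg.fit(x_matrix, y)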

Summary Table

With StatsModel, we had the method summary() that gives us a nicely formatted table with all the relevant statistical calculations we need. Scikit learn is a bit different. To get these values, we would need to call a few different methods.

Python notebook: SimpleLinearRegression Summary Table (sklearn)

In the following code snippets, reg is an instance of sklearn’s LinearRegression class that has already been fitted to the data.

# To get the R-squared in sklearn we must call the appropriate method
reg.score(x_matrix,y)
# Getting the coefficients of the regression
# Note that the output is an array, as we usually expect several coefficients
reg.coef_
# Getting the intercept of the regression
# Note that the result is a float as we usually expect a single value
reg.intercept_
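As a quick usage example (the size value is made up), predictions come from the predict() method and also expect a 2D input:

import pandas as pd

# Predict the price of a hypothetical 750 sq ft property
reg.predict(pd.DataFrame({'size': [750]}))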

Statistical Concepts

Sum of Squares

Sum of Squares Total (SST)

SST (Sum of Squares Total): The total sum of squares in a regression analysis, which represents the total variation in the dependent variable, both the part that can and the part that cannot be explained by the independent variables.

$$ \sum_{i=1}^{n}(y_{i} - \bar y)^{2} $$

$$ SST = SSR + SSE $$

Sum of Squares Regression (SSR)

SSR (Sum of Squares Regression): Measures the explained variability of the regression line. This is the variation in the dependent variable that can be explained by the independent variables included in the model. It is calculated as the sum of squared differences between the predicted values and the mean of the dependent variable.

$$ \sum_{i=1}^{n}(\hat y_{i} - \bar y)^{2} $$

Sum of Squares Error (SSE)

SSE (Sum of Squares Error): Measures the variability left unexplained by the regression. This is the sum of squares due to error, which represents the variation in the dependent variable that cannot be explained by the independent variables. It is also known as the “residual” sum of squares. It is calculated as the sum of the squared residuals, or equivalently as the total sum of squares minus the regression sum of squares.

$$ \sum_{i=1}^{n} e_{i}^{2} $$

Summary

In summary, SST represents the total variation in the dependent variable. SSR represents the variation in the dependent variable that can be explained by the independent variables included in the model; in other words, it represents how well the regression fits the data. SSE represents the variation in the dependent variable that cannot be explained by the independent variables.
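To make the decomposition concrete, here is a small sketch that computes the three sums of squares from the statsmodels model fitted earlier (results) and checks that SST = SSR + SSE:

import numpy as np

y_hat = results.fittedvalues

sst = np.sum((y - y.mean())**2)      # total variation
ssr = np.sum((y_hat - y.mean())**2)  # variation explained by the regression
sse = np.sum((y - y_hat)**2)         # unexplained (residual) variation

print(sst, ssr + sse)  # these two should match up to floating point error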

Ordinary Least Squares (OLS)

Ordinary Least Squares (OLS) is a widely used method in statistics and machine learning for estimating the relationship between a dependent variable and one or more independent variables. It is a linear regression technique that finds the best-fitting line or plane to describe a set of data by minimizing the sum of the squared error (SSE) between the predicted values and the actual values. In other words, OLS seeks to find the linear combination of the independent variables that best predicts the dependent variable. Graphically, it finds the line that is simultaneously closest to all the points.

R-Squared - Goodness of Fit

R-squared (\(R^{2}\)) is a statistical measure used to evaluate the performance of a linear regression model. It represents the proportion of the variance in the dependent variable that is explained by the independent variables included in the model.

$$R^2 = \frac{\text{SSR}}{\text{SST}}$$

A higher R-squared value indicates a better fit between the independent and dependent variables, while a lower value suggests that there may be other factors at play that are affecting the relationship between the variables. A perfect fit would result in an R-squared value of 1. There is no definitive rule as to what a good R-squared value is. It depends on the complexity of the topic and the independent variables at play.

F-statistic

The F-statistic tests the overall significance of the regression. The lower the F-statistic, the closer the model is to being non-significant. It is another tool that can be used to compare different models.

Multiple Linear Regression

The simple linear regression models earlier only looked at a single independent variable. However, in most cases it is unlikely that the dependent variable is caused by only a single variable. That’s where the multiple linear regression comes in. Good regression models need to factor in multiple independent variables to address more complex problems.

As there is now more than one independent variable in the equation, a multiple regression can no longer be represented on a 2D visualization.
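For reference, the regression equation simply gains one term per additional independent variable:

$$ \hat{y}_{i} = b_{0} + b_{1}x_{1} + b_{2}x_{2} + \dots + b_{k}x_{k} $$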

Adjusted R-squared

Like the R-squared, the adjusted R-squared explains how well the model explains the data. It takes into account the number of independent variables in the model and penalizes the use of independent variables that do not help increase the explanatory power of the model.

Using StatsModel package

… TBA …

Using Scikit Learn

Python notebook: MultipleLinearRegression

Adjusted R-squared

Earlier in the Simple Linear regression section on using Scikit Learn we learned that to get the R-squared, we need to call the method score(). However, Scikit Learn does not have a readily built method to return the adjusted R-squared. In this case we’ll have to calculate it manually. Let’s look at the formula for adjusted R-squared:

$$ R^2_{\text{adj.}} = 1 - (1-R^2)\cdot \frac{n-1}{n-p-1} $$

  • \( n\): Number of observations
  • \( p\): Number of predictors (independent variables)

With the above information we can define our own function to calculate the adjusted R-squared.

# 'reg' is the fitted LinearRegression instance from earlier
def calc_adjusted_score(x, y):
    r_sq = reg.score(x, y)   # plain R-squared
    n = x.shape[0]           # number of observations
    p = x.shape[1]           # number of predictors

    return 1 - (1-r_sq)*(n-1)/(n-p-1)

calc_adjusted_score(x, y)

>>> 0.39203134825134

Feature Scaling

Feature scaling is a process where we transform input data into a specific range. The purpose of feature scaling is to ensure that all features are on the same scale so that our ML model treats each feature equally. Think back to the earlier property prices example: we used the property’s size as a feature to predict the price. But suppose we want to add the feature year to our regression model. Its values sit on a completely different scale from size, so without scaling the two features would not be directly comparable.

Standardization

The standard scaling method transforms the dataset such that the mean becomes 0 and standard deviation is equal to 1. This is done by applying the following calculation to all the datapoints:

$$ \frac{x - \mu }{\sigma} $$

How to do it

The scikit-learn package provides a way to scale our dataset through its preprocessing module.

from sklearn.preprocessing import StandardScaler
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the input data (x)
# Essentially we are calculating the mean and standard deviation feature-wise 
scaler.fit(x)
# The actual scaling of the data is done through the method 'transform()'
# Let's store it in a new variable, named appropriately
x_scaled = scaler.transform(x)

The scaled features are now stored in our variable x_scaled.
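One important detail: any new observations we later want to predict on should be transformed with the same fitted scaler rather than a newly fitted one, so that they are scaled using the training data's mean and standard deviation. A small sketch, where the feature names and values are hypothetical:

import pandas as pd

# Hypothetical new observations with the same columns the scaler was fitted on
new_data = pd.DataFrame({'size': [750], 'year': [2015]})

# Reuse the already-fitted scaler; do not call fit() again on new data
new_data_scaled = scaler.transform(new_data)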

Feature Selection

How do we detect a variable that is not needed in a model? The process is called feature selection. It improves training speed, simplifies the model, and prevents issues caused by having too many unwanted features.

Feature Selection with p-values

An earlier section already briefly went over using p-values. Once again, the StatsModels summary table nicely provides these values. In the case of Scikit Learn, we’ll have to use other means.

Python notebook: Feature Selection

from sklearn.feature_selection import f_regression
f_regression(x,y)

Output: (array([56.04804786, 0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

Calling the method f_regression returns a tuple of two arrays: the first contains the F-statistics from regressing the dependent variable on each feature individually, and the second contains the corresponding p-values.

Knowing that, we can unpack the result into corresponding variables like so:

f_statistics, p_values = f_regression(x,y)

p_values

array([7.19951844e-11, 6.76291372e-01])

Rounding the values to 3 decimal places.

p_values.round(3)

array([0. , 0.676])

Alternatively, the p-values can also be presented in a dataframe like so:

pd.DataFrame({'feature': x.columns, 'p-value':p_values.round(3)})

feature       p-value
SAT           0.000
Rand 1,2,3    0.676
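Since the p-value for Rand 1,2,3 is well above the 0.05 threshold, it adds no real explanatory power and is a good candidate to drop, while SAT is clearly significant.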

Feature Selection with Feature Scaling

Machine Learning Terms to Know:

‘Weight’ is the term used to refer to the coefficients

‘Bias’ is the term used to refer to the intercept

Training and Testing

Datasets are often split into training and testing sets. This is to avoid overfitting our model to the prepared dataset, which would result in it performing poorly in real-world applications.

Overfitting and Underfitting

Overfitting and underfitting are two common issues that can occur in the machine learning process. Overfitting refers to a situation where a model is trained too well on a specific dataset, which leads to poor performance when tested on new, unseen data. This happens when the model learns the noise in the training data rather than the underlying patterns, resulting in high accuracy on the training set but poor accuracy on new data.

On the other hand, underfitting refers to a situation where a model is not trained well enough, which results in poor accuracy. This may be caused by using bad features or even an incompatible model to predict.

from sklearn.model_selection import train_test_split

# Split features and target together; test_size sets the share of data held out for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)