Linear Regression

Linear regression is a supervised machine learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation. I would like to say it is the starting point of anyone’s ML journey!

Linear regression is one of the simplest and most widely used algorithms for predictive analysis in statistics and machine learning. It describes a linear relationship between the independent and dependent variables.

If there is only one independent variable, it is known as simple linear regression; if there are multiple independent variables, it is called multiple linear regression. The independent variables are known as explanatory variables, and the dependent variable is often referred to as the target variable or factor of interest.

It’s appropriate to use linear regression when we want to predict a continuous numeric outcome and there is a linear or near-linear relationship between the predictor(s) and the target variable. For example, it can be used in sales forecasting, price prediction, and understanding the impact of variables on an outcome.

Suppose we want to predict the value of a variable Y, and past data shows that the value of Y depends on another variable X1. Then we can represent Y as a function of X1:

Y = A0 + A1X1

This representation is a linear equation that relates Y to X1. A0 and A1 are known as coefficients or betas. The number of coefficients grows with the number of independent variables, which makes the model more complex.

With several independent variables, the model looks like this:

Y = A0 + A1X1 + A2X2 + A3X3 +…

When the data points are plotted, the line that models them best is referred to as the best-fit straight line. The goal of the linear regression algorithm is to find the best values for A0, A1, A2, …

After representing the model as a linear equation, the next step is to estimate the values of the coefficients from the available data. When there is only one independent variable, summary statistics (means, standard deviations, correlation, and covariance) can be used to estimate the coefficients. When there is more than one independent variable, Ordinary Least Squares (OLS) is used to estimate the coefficients.
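
To make this concrete, here is a minimal sketch (an assumed, typical workflow using NumPy and Matplotlib with made-up data) that estimates A0 and A1 for a single predictor and plots the resulting best-fit line:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up example data: X1 and a noisy linear response Y
rng = np.random.default_rng(0)
X1 = np.linspace(0, 10, 50)
Y = 2.0 + 1.5 * X1 + rng.normal(0, 1.0, size=X1.shape)

# np.polyfit with degree 1 returns [A1, A0] for the line Y = A0 + A1*X1
A1, A0 = np.polyfit(X1, Y, deg=1)
print(f"Estimated intercept A0 = {A0:.3f}, slope A1 = {A1:.3f}")

# Plot the data points and the best-fit straight line
plt.scatter(X1, Y, label="data points")
plt.plot(X1, A0 + A1 * X1, color="red", label="best-fit line")
plt.xlabel("X1")
plt.ylabel("Y")
plt.legend()
plt.show()
```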

EXAMPLE: Simple Linear Regression (One Independent Variable):

Imagine we want to predict a person’s weight (dependent variable, Y) based on their height (independent variable, X). In simple linear regression, we have one independent variable.

  1. Summary Statistics Approach:
    • Calculate the means of X (mean height) and Y (mean weight).
    • Calculate the standard deviations of X and Y.
    • Calculate the correlation coefficient (r) between X and Y.
      • r = Σ[(deviation of X from its mean) * (deviation of Y from its mean)] / ((n - 1) * (standard deviation of X) * (standard deviation of Y))
    • Calculate the coefficient (slope) of the regression line (b) using the formula: b = r * (standard deviation of Y) / (standard deviation of X)
    • Calculate the intercept (a) of the regression line using the formula: a = mean of Y - (b * mean of X)
    • The estimated linear regression equation is Y = a + bX, where a and b are estimated from the summary statistics (see the sketch below).
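
To illustrate the steps above, here is a minimal sketch in Python (NumPy only, with hypothetical height/weight numbers) that estimates the slope and intercept from the summary statistics:

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) observations
height = np.array([150, 160, 165, 170, 175, 180, 185], dtype=float)  # X
weight = np.array([50, 56, 61, 64, 68, 74, 79], dtype=float)         # Y
n = len(height)

# Summary statistics
mean_x, mean_y = height.mean(), weight.mean()
sd_x = height.std(ddof=1)   # sample standard deviations (divide by n - 1)
sd_y = weight.std(ddof=1)

# Correlation coefficient r
r = np.sum((height - mean_x) * (weight - mean_y)) / ((n - 1) * sd_x * sd_y)

# Slope and intercept from the summary statistics
b = r * sd_y / sd_x
a = mean_y - b * mean_x
print(f"r = {r:.3f}, slope b = {b:.3f}, intercept a = {a:.3f}")

# Predict the weight for a new height, e.g. 172 cm
print("Predicted weight at 172 cm:", a + b * 172)
```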

EXAMPLE: Multiple Linear Regression (More Than One Independent Variable):

Now, let’s consider a scenario where we want to predict a person’s weight (Y) based on their height (X1) and age (X2). In multiple linear regression, we have two independent variables.

  1. Ordinary Least Squares (OLS) Approach:
    • Using OLS, we estimate the coefficients (slopes) of the regression equation Y = a + b1X1 + b2X2.
    • OLS involves minimizing the sum of squared differences between the observed weights and the predicted weights based on both height and age, considering the joint effects of both variables.
    • The OLS algorithm simultaneously estimates the intercept (a), the coefficient for height (b1), and the coefficient for age (b2) to find the best-fitting linear model.

In this example, when we have one independent variable (height), we can use summary statistics to estimate the coefficients. However, when we have more than one independent variable (height and age), we use the OLS method to estimate all the coefficients together, taking into account how both variables collectively influence the dependent variable (weight). OLS provides a more comprehensive and robust approach to modeling with multiple predictors.
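
As a rough illustration (not the only way to run OLS), here is a minimal NumPy sketch with hypothetical height/age/weight data that estimates a, b1, and b2 by minimizing the sum of squared residuals:

```python
import numpy as np

# Hypothetical data: height (cm) and age (years) as predictors, weight (kg) as target
height = np.array([150, 160, 165, 170, 175, 180, 185], dtype=float)  # X1
age    = np.array([25, 32, 28, 40, 35, 45, 50], dtype=float)         # X2
weight = np.array([50, 58, 60, 66, 68, 76, 80], dtype=float)         # Y

# Design matrix with a column of ones for the intercept a
X = np.column_stack([np.ones_like(height), height, age])

# OLS: find the coefficients that minimize the sum of squared residuals
coeffs, *_ = np.linalg.lstsq(X, weight, rcond=None)
a, b1, b2 = coeffs
print(f"a = {a:.3f}, b1 (height) = {b1:.3f}, b2 (age) = {b2:.3f}")

# Predicted weights and the minimized sum of squared residuals
pred = X @ coeffs
print("Sum of squared residuals:", np.sum((weight - pred) ** 2))
```

In practice a library such as scikit-learn or statsmodels would do the same fit for you, but the explicit least-squares call above shows exactly what OLS is minimizing.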

To see whether linear regression is suitable for a given data set, a scatter plot can be used. If the relationship looks linear, we can go for a linear model; if not, we have to apply some transformation to make the relationship linear. Plotting a scatter plot is easy in the case of simple (univariate) linear regression, while for multivariate linear regression we can use two-dimensional pairwise scatter plots, rotating plots, and dynamic graphs.
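
For example, a quick way to eyeball linearity (a sketch assuming the same hypothetical height/age/weight data, using pandas, seaborn, and Matplotlib) is:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data frame with the target and two predictors
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 185],
    "age":    [25, 32, 28, 40, 35, 45, 50],
    "weight": [50, 58, 60, 66, 68, 76, 80],
})

# Simple regression: a single scatter plot is enough
df.plot.scatter(x="height", y="weight", title="height vs weight")

# Multiple regression: pairwise scatter plots of all variables
sns.pairplot(df)
plt.show()
```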

There are several metrics used to evaluate the performance of linear regression models. These metrics serve different purposes and provide different insights into the model’s performance (a short sketch computing several of them follows the list). Here’s why we have multiple metrics in linear regression:

  1. Mean Squared Error (MSE):
    • Purpose: MSE measures the average squared difference between the actual and predicted values.
    • Use Case: MSE is a common metric for assessing the overall accuracy of a linear regression model. Lower MSE values indicate better model performance.
  2. Root Mean Squared Error (RMSE):
    • Purpose: RMSE is the square root of the MSE and provides an interpretable scale of the error in the same units as the target variable.
    • Use Case: RMSE is useful when you want the error metric to be in the same units as the target variable, making it easier to interpret.
  3. Mean Absolute Error (MAE):
    • Purpose: MAE measures the average absolute difference between the actual and predicted values.
    • Use Case: MAE is less sensitive to outliers compared to MSE, making it suitable when you want to assess model accuracy while ignoring the impact of extreme values.
  4. R-squared (R²):
    • Purpose: R-squared quantifies the proportion of variance in the target variable that is explained by the model. It measures the goodness of fit.
    • Use Case: R-squared is valuable for understanding how well the model captures the variation in the data. A higher R-squared indicates a better fit.
  5. Adjusted R-squared:
    • Purpose: Adjusted R-squared is an extension of R-squared that adjusts for the number of predictors in the model.
    • Use Case: Adjusted R-squared helps account for model complexity and is especially useful when comparing models with different numbers of features.
  6. Residual Analysis:
    • Purpose: Residual analysis involves examining the distribution of residuals (the differences between actual and predicted values) to check for assumptions of linear regression, such as normality and homoscedasticity.
    • Use Case: Residual analysis helps identify potential issues with the model and assesses the validity of regression assumptions.
  7. AIC and BIC:
    • Purpose: The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are used for model selection and comparison. They balance model complexity and goodness of fit.
    • Use Case: AIC and BIC are helpful when you want to choose the best model among several candidate models with different features.
  8. F-statistic and p-value:
    • Purpose: The F-statistic and its associated p-value are used to assess whether the overall model is statistically significant.
    • Use Case: These statistics help determine whether there is a linear relationship between the features and the target variable.

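As a wrap-up, here is a minimal sketch (assuming scikit-learn and statsmodels are available, reusing the hypothetical data from the OLS example above) that computes most of the metrics listed:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical height/age/weight data reused from the OLS example above
height = np.array([150, 160, 165, 170, 175, 180, 185], dtype=float)
age    = np.array([25, 32, 28, 40, 35, 45, 50], dtype=float)
weight = np.array([50, 58, 60, 66, 68, 76, 80], dtype=float)

X = sm.add_constant(np.column_stack([height, age]))  # intercept + predictors
model = sm.OLS(weight, X).fit()
pred = model.predict(X)

# Error metrics
mse = mean_squared_error(weight, pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(weight, pred))
print("R²  :", r2_score(weight, pred))

# Goodness-of-fit and model-selection statistics from statsmodels
print("Adjusted R²:", model.rsquared_adj)
print("AIC:", model.aic, " BIC:", model.bic)
print("F-statistic:", model.fvalue, " p-value:", model.f_pvalue)

# Residuals for residual analysis (e.g. plot them against the predictions)
print("Residuals:", model.resid)
```
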
In summary, the choice of which metric(s) to use in linear regression depends on the specific goals of your analysis, the nature of your data, and the assumptions you want to test. Different metrics provide different insights, and it’s common to consider multiple metrics to comprehensively evaluate the performance and validity of a linear regression model.

Happy learning!
