Hypothesis testing can be carried out in linear regression for the following purposes:
- To check whether a predictor is significant for the prediction of the target variable. Two common methods for this are:
  - By the use of p-values: if the p-value of a variable is greater than a certain limit (usually 0.05), the variable is insignificant in the prediction of the target variable.
  - By checking the value of the regression coefficient: if the coefficient corresponding to a predictor is zero, that variable has no linear relationship with the target variable and is insignificant in its prediction.
- To check whether the calculated regression coefficients are good estimators of the actual coefficients.
The Null and Alternate hypotheses used in the case of linear regression are, respectively:

- Null hypothesis (H0): β1 = 0
- Alternate hypothesis (H1): β1 ≠ 0
Thus, if we reject the Null hypothesis, we can say that the coefficient β1 is not equal to zero and hence is significant for the model. On the other hand, if we fail to reject the Null hypothesis, we conclude that the coefficient is insignificant and can be dropped from the model.
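As a quick illustration, here is a minimal sketch of this test in Python using the statsmodels library on synthetic data (the library choice and the data are assumptions made for demonstration; any OLS implementation that reports coefficient p-values would work the same way):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x, so the true beta1 is nonzero
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

X = sm.add_constant(x)              # adds the intercept column (beta0)
results = sm.OLS(y, X).fit()

beta1_pvalue = results.pvalues[1]   # p-value for the slope beta1
if beta1_pvalue < 0.05:
    print(f"Reject the Null hypothesis: beta1 is significant (p = {beta1_pvalue:.4f})")
else:
    print(f"Fail to reject the Null hypothesis: beta1 is insignificant (p = {beta1_pvalue:.4f})")
```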
To check whether the overall model fit is significant, the primary parameter to look at is the F-statistic. While the t-test, along with the p-values for the betas, tests whether each coefficient is significant individually, the F-statistic determines whether the overall model fit, with all the coefficients included, is significant.
The basic idea behind the F-test is that it is a relative comparison between the model you've built and the model without any of the coefficients except β0 (the intercept-only model). If the value of the F-statistic is high, the corresponding Prob(F) will be low, and you can conclude that the model is significant.
On the other hand, if the value of the F-statistic is low, the value of Prob(F) might be higher than the significance level (usually taken as 0.05), which would lead you to conclude that the overall model fit is insignificant and that the intercept-only model may provide a better fit.
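For instance, with statsmodels (again an assumed choice, with synthetic data made up for the example), both the F-statistic and Prob(F) are available directly on the fitted results, so the check described above is a one-line comparison:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with three predictors, one of which is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([0.8, 0.0, -0.5]) + rng.normal(scale=1.0, size=100)

results = sm.OLS(y, sm.add_constant(X)).fit()

print(f"F-statistic: {results.fvalue:.2f}")
print(f"Prob(F):     {results.f_pvalue:.2e}")
if results.f_pvalue < 0.05:
    print("Overall model fit is significant compared to the intercept-only model.")
else:
    print("Overall model fit is insignificant; the intercept-only model may fit as well.")
```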
Apart from that, to test the goodness (extent) of the fit, we look at a parameter called R-squared (for simple linear regression models) or Adjusted R-squared (for multiple linear regression models). If the overall model fit is deemed significant by the F-test, you can go ahead and look at the value of R-squared. This value lies between 0 and 1, with 1 meaning a perfect fit.
A higher value of R-squared indicates a good model, with much of the variance in the data explained by the fitted straight line. For example, an R-squared value of 0.75 means that 75% of the variance in the data is explained by the model. But it is important to remember that R-squared only tells you the extent of the fit; it should not be used to determine whether the model fit is significant.
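Both values can be read off a fitted statsmodels result. The sketch below re-fits the same assumed synthetic model from the earlier example and prints them:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([0.8, 0.0, -0.5]) + rng.normal(scale=1.0, size=100)

results = sm.OLS(y, sm.add_constant(X)).fit()

# R-squared never decreases as predictors are added; Adjusted R-squared
# penalises extra predictors, which is why it is preferred for
# multiple linear regression.
print(f"R-squared:          {results.rsquared:.3f}")
print(f"Adjusted R-squared: {results.rsquared_adj:.3f}")
```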