A Quantile-Quantile (Q-Q) plot is a graphical tool for determining whether two data sets come from populations with a common distribution, such as a Normal, Exponential, or Uniform distribution. This is useful in linear regression, for example, when the training and test data sets are received separately: a Q-Q plot can be used to check that both data sets plausibly come from populations with the same distribution.
It is a probability plot that compares two probability distributions by plotting their quantiles against each other. Quantiles are cut points that divide the range of a probability distribution into contiguous intervals with equal probabilities. The “Q” stands for quantile. A quantile is the value below which a given fraction (or percent) of the data falls. That is, the 0.3 (or 30%) quantile is the value below which 30% of the data fall and above which 70% fall.
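To make the definition concrete, here is a minimal sketch that computes a quantile with NumPy. The data (the integers 1 to 100) and the variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Illustrative data: the integers 1..100 (not from the original text).
data = np.arange(1, 101)

# The 0.3 quantile is the value below which about 30% of the data fall.
q30 = np.quantile(data, 0.3)

# Check the definition: fraction of points strictly below that value.
fraction_below = np.mean(data < q30)

# Quartiles are cut points that split the data into four
# intervals of equal probability.
quartiles = np.quantile(data, [0.25, 0.5, 0.75])

print(q30, fraction_below, quartiles)
```

Note that `np.quantile` interpolates between neighbouring data points by default, so the returned value need not be one of the observations.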
A 45-degree reference line (the line y = x) is also plotted. If the two data sets come from populations with the same distribution, the points should fall approximately along this reference line. If the two data sets come from populations with different distributions, the points will depart noticeably from the reference line.
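The idea behind the reference line can be sketched numerically, without drawing anything. The example below is only an illustration under assumed inputs: the three synthetic samples, the seed, and the helper `mean_distance_from_reference` are all hypothetical. It pairs up the quantiles of two samples at the same probability levels and measures how far the resulting points lie from the line y = x.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Three illustrative samples (assumed for this example only).
normal_a = rng.normal(loc=0.0, scale=1.0, size=1000)
normal_b = rng.normal(loc=0.0, scale=1.0, size=1500)
exponential = rng.exponential(scale=1.0, size=1500)

# Evaluate every sample at the same probability levels, so samples
# of different sizes can still be compared point by point.
probs = np.linspace(0.01, 0.99, 99)


def mean_distance_from_reference(x, y):
    """Mean vertical distance of the Q-Q points from the line y = x."""
    qx = np.quantile(x, probs)
    qy = np.quantile(y, probs)
    return float(np.mean(np.abs(qy - qx)))


dev_same = mean_distance_from_reference(normal_a, normal_b)
dev_diff = mean_distance_from_reference(normal_a, exponential)
print(dev_same, dev_diff)
```

The two Normal samples should give a much smaller distance than the Normal and Exponential pair. This single number is a rough summary, not a formal test; the plot itself shows where the distributions differ.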
The Q-Q plot is used to answer the following questions:
- Do the two data sets come from populations with a common distribution?
- Do the two data sets have a common location and scale?
- Do the two data sets have similar distributional shapes?
- Do the two data sets have similar tail behavior?
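The location and scale question can be illustrated with a small sketch. When two samples share the same distributional shape but differ in location and scale, the Q-Q points still fall on a straight line, just not on the 45-degree one: the slope approximates the ratio of scales and the intercept approximates the shift in location. The samples, seed, and parameter values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Same shape (Normal), different location and scale (assumed values).
x = rng.normal(loc=0.0, scale=1.0, size=2000)
y = rng.normal(loc=2.0, scale=3.0, size=2000)

# Quantiles of both samples at the same probability levels.
probs = np.linspace(0.05, 0.95, 91)
qx = np.quantile(x, probs)
qy = np.quantile(y, probs)

# Fit a straight line through the Q-Q points.
# slope is roughly the scale ratio, intercept roughly the location shift.
slope, intercept = np.polyfit(qx, qy, deg=1)
print(slope, intercept)
```

Here the fitted slope should be close to 3 and the intercept close to 2. Curvature in the Q-Q points, rather than a straight line, would instead point to a difference in shape or tail behavior.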
statsmodels.api provides qqplot and qqplot_2samples to draw a Q-Q plot for a single data set (against a theoretical distribution) and for two data sets, respectively.