Supervised Learning

Supervised learning is a class of machine learning (ML) in which machines learn from examples. The historical data set contains both input and output (also called target or response) values. Because the training data already shows what the target value looks like, it in effect ‘supervises’ the learning process.

Supervised learning uses the training data set to make predictions. A supervised learning algorithm analyzes the training data and produces a mapping function. This function is called a regression function if the output is continuous or a classifier if the output is discrete. A test data set is used to validate the model. A larger training data set often yields higher predictive accuracy.

Now, suppose we want to predict the cost of a particular commodity based on its past sales data. If the available data set includes the past costs of the commodity, the machine can learn the relationship between the inputs and the cost and use it to predict future costs.

We can explain this mathematically. Suppose we want to predict the value of Y. Y is our target data. Let us assume that in the past the value of Y depended on another variable X.
So we can represent Y as a function of X.
Y = f(X)
Using this mapping function we can predict the value of Y for a new value of X. The learning algorithm refines the mapping iteratively as it processes the training data. The predicted value of Y need not be ‘perfect’; it only has to be good enough to be useful. Thus, we can say that the target variable present in the training data set ‘supervises’ the learning. Supervised learning can predict the output/target variable from a high-dimensional set of input variables.
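The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: we assume f is a straight line, fit it to a handful of invented (X, Y) pairs by ordinary least squares, and then use the learned function to predict Y for a new X.

```python
def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data: past values of X with their known targets Y (made up).
X_train = [1.0, 2.0, 3.0, 4.0, 5.0]
Y_train = [2.1, 3.9, 6.2, 8.1, 9.8]

slope, intercept = fit_line(X_train, Y_train)

def f(x):
    """The learned mapping function Y = f(X)."""
    return slope * x + intercept

print(round(f(6.0), 1))  # prediction for an unseen X; prints 11.9
```

The target values in the training set drive the fit: the slope and intercept are chosen precisely to make f(X) track the known Y values as closely as possible.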

Supervised Learning algorithms can handle two categories of problems:

Regression
Classification
In regression, the aim is to predict an output variable that is continuous in nature, i.e., the response variables are real numbers. Applications include forecasting stock prices, energy consumption, or incidence of disease.

In classification, the aim is to assign a class or label from a finite set of classes to an observation. That is, responses are categorical variables. Applications include spam filters, advertisement recommendation systems, and image and speech recognition. Predicting whether a patient will have a heart attack within a year is a classification problem. The possible outcome (class or label) for such a prediction is either “True” or “False”.
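A toy sketch of the heart-attack example can make this concrete. The data below are invented for illustration: each observation is an (age, cholesterol) pair, and a simple 1-nearest-neighbour rule assigns a new patient the label of the most similar training observation.

```python
def nearest_neighbour(train, x):
    """Assign x the label of the closest training observation."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    features, label = min(train, key=lambda row: dist2(row[0], x))
    return label

# Invented training data: ((age, cholesterol), heart attack within a year?)
train = [
    ((45, 180), "False"),
    ((50, 190), "False"),
    ((62, 250), "True"),
    ((70, 280), "True"),
]

print(nearest_neighbour(train, (65, 260)))  # prints True
```

Note the contrast with regression: the output is drawn from a finite set of labels ("True"/"False"), not from a continuous range of numbers.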

In a nutshell, we can say when the output variables are real values (height, cost, and so on) we go for Regression. If the output variables are categorical (colors, places and so on) or binary (Yes or No) in nature then we go for Classification.

There are many algorithms in Supervised Learning but the popular ones are:

Linear Regression for regression problems.
Logistic Regression for classification problems.
Random Forest or Support Vector Machine for classification and regression problems.
Linear Regression is simple to learn and use. Despite its name, Logistic Regression is not a regression technique; the misleading name often causes confusion. It predicts binary outcomes, which makes it well suited to classification problems. The Support Vector Machine (SVM) algorithm can be used for both classification and regression, although it is mostly applied to classification. Random Forest is a versatile ML method capable of handling both regression and classification problems.
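To see why Logistic Regression is a classifier rather than a regressor, here is a minimal from-scratch sketch on invented data: a single feature x, label 1 when x is large, trained by plain gradient descent on the log-loss. Real projects would use a library implementation instead.

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1), read as P(y = 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Invented training data: one feature, binary label.
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    # Gradient of the log-loss over the whole (tiny) data set.
    grad_w = sum((sigmoid(w * xi + b) - yi) * xi for xi, yi in zip(X, y))
    grad_b = sum(sigmoid(w * xi + b) - yi for xi, yi in zip(X, y))
    w -= lr * grad_w
    b -= lr * grad_b

def predict(x):
    """Return the predicted class: 1 if P(y = 1 | x) > 0.5, else 0."""
    return 1 if sigmoid(w * x + b) > 0.5 else 0

print(predict(0.8), predict(3.8))  # prints 0 1
```

Although the model internally computes a continuous probability, the final output is a thresholded class label, which is exactly what makes it a classification algorithm.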

Once you understand the key difference between regression and classification problems, it is easy to explore the usage of more algorithms for each problem.

Just remember this: if you are trying to forecast a target value, use regression algorithms; if you are trying to predict a target class, use classification algorithms.

Happy learning!
