Many organizations rely on machine learning techniques in their day-today workflow, to cut down on the time required to do a job. The reason why these techniques are robust is because they undergo various tests in order to carry out correct predictions about any data fed into them. During this phase, there are also certain errors generated, which can lead to an inconsistent ML model.
Two common errors that we are going to look at in this article are that of bias and Variance, and how a trade-off can be achieved between the two in order to generate a successful ML model.
Let’s first have a look at what creates these kind of errors.
Machine learning techniques or more precisely supervised learning techniques involve training, often the most important stage in the ML workflow. The machine learning model is trained using the training data.
How is this training data prepared? This is done by using a dataset for which the output of the algorithm is known. During the training stage, the algorithm analyzes the training data that is fed and produces patterns which are captured within an inferred function. This inferred function, which is derived after analysis of the training dataset, is the model that would be further used to map new examples.
An ideal model generated from this training data should be able to generalize well. This means, it should learn from the training data and should correctly predict or classify data within any new problem instance.
In general, the more complex the model is, the better it classifies the training data. However, if the model is too complex i.e it will pick up random features i.e. noise in the training data, this is the case of overfitting i.e. the model is said to overfit . On the other hand, if the model is not so complex, or missing out on important dynamics present within the data, then it is a case of underfitting.
Both overfitting and underfitting are basically errors in the ML models or algorithms. Also, it is generally impossible to minimize both these errors at the same time and this leads to a condition called as the Bias-Variance Tradeoff.
Before getting into knowing how to achieve the trade-off, lets simply understand how bias and variance errors occur.
The Bias and Variance Error
Let’s understand each error with the help of an example. Suppose you have 3 training datasets say T1, T2, and T3, and you pass these datasets through a supervised learning algorithm. The algorithm generates three different models say M1, M2, and M3 from each of the training dataset.
Now let’s say you have a new input A. The whole idea is to apply each model on this new input A. Here, there can be two types of errors that can occur. If the output generated by each model on the input A is different(B1, B2, B3), the algorithm is said to have a high Variance Error. On the other hand, if the output from all the three models is same (B) but incorrect, the algorithm is said to have a high Bias Error.
High Variance also means that the algorithm produces a model that is too specific to the training data, which is a typical case of Overfitting. On the other hand, high bias means that the algorithm has not picked up defining patterns from the dataset, this is a case of Underfitting.
Some examples of high-bias ML algorithms are: Linear Regression, Linear Discriminant Analysis and Logistic Regression
Examples of high-variance Ml algorithms are: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
How to achieve a Bias-Variance Trade-off?
For any supervised algorithm, having a high bias error usually means it has low variance error and vise versa. To be more specific, parametric or linear ML algorithms often have a high bias but low variance. On the other hand, non-parametric or non-linear algorithms have vice versa.
The goal of any ML model is to obtain a low variance and a low bias state, which is often a task due to the parametrization of machine learning algorithms.
So how can we achieve a trade-off between the two?
Following are some ways to achieve the Bias-Variance Tradeoff:
By minimizing the total error:
The optimum location for any model is the level of complexity at which the increase in bias is equivalent to the reduction in variance. Practically, there is no analytical method to find the optimal level. One should use an accurate measure for error prediction and explore different levels of model complexity, and then choose the complexity level that reduces the overall error.
Generally resampling based measures such as cross-validation should be preferred over theoretical measures such as Aikake’s Information Criteria.
(The irreducible error is the noise that cannot be reduced by algorithms but can be reduced with better data cleaning.)
Using Bagging and Resampling techniques:
These can be used to reduce the variance in model predictions. In bagging (Bootstrap Aggregating), several replicas of the original dataset are created using random selection with replacement.
One modeling algorithm that makes use of bagging is Random Forests. In Random Forest algorithm, the bias of the full model is equivalent to the bias of a single decision tree–which itself has high variance. By creating many of these trees, in effect a “forest”, and then averaging them the variance of the final model can be greatly reduced over that of a single tree.
Adjusting minor values in algorithms:
Both the k-nearest algorithms and Support Vector Machines(SVM) algorithms have low bias and high variance. But the trade-offs in both these cases can be changed. In the K-nearest algorithm, the value of k can be increased, which would simultaneously increase the number of neighbors that contribute to the prediction. This in turn would increase the bias of the model. Whereas, in the SVM algorithm, the trade-off can be changed by an increase in the C parameter that would influence the violations of the margin allowed in the training data. This will increase the bias but decrease the variance.
Using a proper Machine learning workflow:
This means you have to ensure proper training by:
- Maintaining separate training and test sets – Splitting the dataset into training (50%), testing(25%), and validation sets ( 25%). The training set is to build the model, test set is to check the accuracy of the model, and the validation set is to evaluate the performance of your model hyperparameters.
- Optimizing your model by using systematic cross-validation – A cross-validation technique is a must to fine tune the model parameters, especially for unknown instances. In supervised machine learning, validation or cross-validation is used to find out the predictive accuracy within various models of varying complexity, in order to find the best model.For instance, one can use the k-fold cross validation method. Here, the dataset is divided into k folds. For each fold, train the algorithm on k-1 folds iteratively, using the remaining fold(also called as ‘holdout fold’)as the test set. Repeat this process until each k has acted as a test set. The average of the k recorded errors is called as the cross validation error and can serve as the performance metric for the model.
- Trying out appropriate algorithms – Before relying on any model we need to first ensure that the model works best for our assumptions. One can make use of the No Free Lunch theorem, which states that one model can not work for only one problem. For instance, while using No Free lunch theorem, a random search will do the same as any of the heuristic optimization algorithms.
- Tuning the hyperparameters that can give an impactful performance – Any machine learning model requires different hyperparameters such as constraints, weights or learning rates for generalizing different data patterns. Tuning these hyperparameters is necessary so that the model can optimally solve machine learning problems. Grid search and randomized search are two such methods practiced for hyperparameter tuning.
So, we have listed some of the ways where you can achieve trade-off between the two.
Both bias and variance are related to each other, if you increase one the other decreases and vice versa. By a trade-off, there is an optimal balance in the bias and variance which gives us a model that is neither underfit nor overfit. And finally, the ultimate goal of any supervised machine algorithm lies in isolating the signal from the dataset, and making sure that it eliminates the noise.