When it comes to machine learning, it’s all about the algorithms. But although machine learning algorithms are the bread and butter of a data scientists job role, it’s not always as straightforward as simply picking up an algorithm and running with it. Algorithm selection is incredibly important and often very challenging. There’s always a number of things you have to take into consideration, such as:
- Accuracy: While accuracy is important, it’s not always necessary. In many cases, an approximation is sufficient, in which case, one shouldn’t look for accuracy while giving up on the processing time.
- Training time: This goes hand in hand with accuracy and is not the same for all algorithms. The training time might go up if there are more parameters as well. When time is a big constraint, you should choose an algorithm wisely.
- Linearity: Algorithms that follow linearity assume that the data trends follow a linear path. While this is good for some problems, for others it can result in lowered accuracy.
Once you’ve taken those 3 considerations on board you can start to dig a little deeper. Kaggle did a survey in 2017 asking their readers which algorithms – or ‘ data science methods’ more broadly – respondents were most likely to use at work. Below is a screenshot of the results.
Kaggle’s research offers a useful insight into the algorithms actually being used by data scientists and data analysts today. But we’ve brought together the types of machine learning algorithms that are most important. Every algorithm is useful in different circumstances – the skill is knowing which one to use and when.
10 machine learning algorithms
This is clearly one of the most interpretable ML algorithms. It requires minimal tuning and is easy to explain, being the key reason for its popularity. It shows the relationship between two or more variables and how a change in one of the dependent variables impacts the independent variable. It is used for forecasting sales based on trends, as well as for risk assessment.
Although with a relatively low level of accuracy, a few parameters needed and lesser training times makes it’s quite popular among beginners.
Logistic regression is typically viewed as a special form of Linear Regression, where the output variable is categorical. It’s generally used to predict a binary outcome i.e.True or False, 1 or 0, Yes or No, for a set of independent variables. As you would have already guessed, this algorithm is generally used when the dependent variable is binary.
Like to Linear regression, logistic regression has a low level of accuracy, fewer parameters and lesser training times. It goes without saying that it’s quite popular among beginners too.
These algorithms are mainly decision support tools that use tree-like graphs or models of decisions and possible consequences, including outcomes based on chance-event, utilities, etc. To put it in simple words, you can say decision trees are the least number of yes/no questions to be asked, in order to identify the probability of making the right decision, as often as possible. It lets you tackle the problem at hand in a structured, systematic way to logically deduce the outcome.
Decision Trees are excellent when it comes to accuracy but their training times are a bit longer as compared to other algorithms. They also require a moderate number of parameters, making them not so complicated to arrive at a good combination.
This is a type of classification ML algorithm that’s based on the popular probability theorem by Bayes. It is one of the most popular learning algorithms. It groups similarities together and is usually used for document classification, facial recognition software or for predicting diseases. It generally works well when you have a medium to large data set to train your models.
These have moderate training times and make use of linearity. While this is good, linearity might also bring down accuracy for certain problems. They also do not bank on too many parameters, making it easy to arrive at a good combination, although at the cost of accuracy.
Without a doubt, this one is a popular go-to machine learning algorithm that creates a group of decision trees with random subsets of the data. It uses the ML method of classification and regression. It is simple to use, as just a few lines of code are enough to implement the algorithm. It is used by banks in order to predict high-risk loan applicants or even by hospitals to predict whether a particular patient is likely to develop a chronic disease or not.
With a high accuracy level and moderate training time, it is quite efficient to implement. Moreover, it has average parameters.
K-Means is a popular unsupervised algorithm that is used for cluster analysis and is an iterative and non-deterministic method. It operates on a given dataset through a predefined number of clusters. The output of a K-Means algorithm will be k clusters, with input data partitioned among these clusters. Biggies like Google use K-means to cluster pages by similarities and discover the relevance of search results.
This algorithm has a moderate training time and has good accuracy. It doesn’t consist of many parameters, meaning that it’s easy to arrive at the best possible combination.
K nearest neighbors
K nearest neighbors is a very popular machine learning algorithm which can be used for both regression as well as classification, although it’s majorly used for the latter. Although it is simple, it is extremely effective. It takes little to no time to train, although its accuracy can be heavily degraded by high dimension data since there is not much of a difference between the nearest neighbor and the farthest one.
Support vector machines
SVMs are one of the several examples of supervised ML algorithms dealing with classification. They can be used for either regression or classification, in situations where the training dataset teaches the algorithm about specific classes, so that it can then classify the newly included data.
What sets them apart from other machine learning algorithms is that they are able to separate classes quicker and with lesser overfitting than several other classification algorithms. A few of the biggest pain points that have been resolved using SVMs are display advertising, image-based gender detection and image classification with large feature sets.
These are moderate in their accuracy, as well as their training times, mostly because it assumes linear approximation. On the other hand, they require an average number of parameters to get the work done.
Ensemble methods are techniques that build a set of classifiers and combine the predictions to classify new data points. Bayesian averaging is originally an ensemble method, but newer algorithms include error-correcting output coding, etc.
Although ensemble methods allow you to devise sophisticated algorithms and produce results with a high level of accuracy, they are not preferred so much in industries where interpretability of the algorithm is more important. However, with their high level of accuracy, it makes sense to use them in fields like healthcare, where even the minutest improvement can add a lot of value.
Artificial neural networks
Artificial neural networks are so named because they mimic the functioning and structure of biological neural networks. In these algorithms, information flows through the network and depending on the input and output, the neural network changes in response. One of the most common use cases for ANNs is speech recognition, like in voice-based services. As the information fed to them grows, these algorithms improve.
However, artificial neural networks are imperfect. With great power comes longer training times. They also have several more parameters as compared to other algorithms. That being said, they are very flexible and customizable.
If you want to skill-up in implementing Machine Learning Algorithms, you can check out the following books from Packt: