We frequently hear about machine learning in the media, especially since the recent wave of interest in deep learning. The perpetual improvement of machine learning techniques, combined with the ever increasing amount of data being stored, suggests endless new applications.
Many innovative solutions are emerging: autonomous driving, next-generation supermarkets with implicit payment, next-generation chatbots that interact with you as a human being would, and so on. More than ever, the future seems within reach. But the more extravagant and original the application, the more intimidating it seems to the layman.
The plethora of machine learning algorithms and approaches only reinforces that feeling. In this article, we’ll dispel the myth that ML is out of reach, and give you tips on choosing an ML algorithm to solve your problem with minimal effort.
1. Why and when to use machine learning?
But first of all, what problems can ML solve? ML is trendy, for sure, but we should not forget that ML’s main goal is to help us solve problems that are difficult to solve with traditional programming.
What we can do with ML algorithms is:
- Learn complex decision systems from data (for instance: to forecast quantities, or to predict the likely category of a data point, Fig 1.1)
- Discover latent structure in unexplored data to find patterns that nobody would expect (in Fig 1.2: Picasso’s paintings organized by period and artistic style; ok, people might have expected this one)
- Find anomalies in data (for instance: to automatically raise an alert if something suspicious happens in the data, Fig 1.3).
ML is very useful for automatically processing complex and/or large amounts of data.
First let’s compare the traditional programming approach with the ML approach. As you can see in Figure 2, in the traditional programming approach, we start with data and a program we wrote to take this data as an input. We finally obtain the results as an output.
In the ML approach we do it a little bit differently: we start with data AND the results we know were obtained on this dataset before, and we obtain a trained program as the output. This trained program is then used as the input of a traditional programming approach.
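The contrast can be sketched in a few lines of plain Python. The data, the height threshold and the `learn_threshold` rule below are all invented for illustration: in the traditional approach the rule is hand-written, while in the ML approach the rule is learned from data and known results.

```python
# Traditional programming: the programmer hand-writes the rule.
def is_tall_rule(height_cm):
    return height_cm > 180  # threshold chosen by the programmer

# ML approach (toy sketch): learn the rule (here, a simple threshold)
# from data AND the results we already know for that data.
def learn_threshold(heights, labels):
    tall = [h for h, l in zip(heights, labels) if l == 1]
    short = [h for h, l in zip(heights, labels) if l == 0]
    # midpoint between the tallest "short" and the shortest "tall" example
    return (max(short) + min(tall)) / 2

heights = [150, 160, 170, 185, 190, 200]
labels  = [0,   0,   0,   1,   1,   1]
threshold = learn_threshold(heights, labels)

# The learned program is then used like a traditionally written one.
def is_tall_learned(height_cm):
    return height_cm > threshold
```

The learned function plays exactly the role of the hand-written one; only its origin differs.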
Now you might think:
« Ok great, I now understand when to use ML better and how it differs from the traditional programming approach. »
« Yet, I’d like a more concrete example ».
Ok, let’s give a concrete example then! What about the face detection problem in Figure 3? In the case of face detection, the dataset is composed of images of faces and images of background. The known results are either 1, meaning a face was detected, for face images, or 0, meaning nothing was detected, for background images. The final program is the face predictor. In the end, the « face predictor » program is used as in the traditional programming paradigm: the data are chunks of a bigger image in which we wish to detect faces, and the results are « this image chunk does not contain a face » or « this image chunk contains a face ». In Figure 3, image chunks that are said to contain faces are surrounded by a green border.
2. Defining your problem
At first, it’s very important to define the problem in order to solve it better afterwards. This can easily be done by answering three questions:
- What do you want to do?
- What is available?
- What are your constraints?
What do you want to do?
Do you want to predict a category? That’s classification. For instance, you want to know if an input image belongs to the cat category or the dog category.
Do you want to predict a quantity? That’s regression. For instance: knowing the area of the floor plan of a house, where it is, and whether it has a garage or not, predict its value on the market. In this case, go for a regression approach because you want to predict a price, i.e. a quantity, not a category.
Do you want to detect an anomaly? That’s anomaly detection. Say you want to detect money withdrawal anomalies. Imagine that you live in England, have never been abroad, and money has been withdrawn 5 times in Las Vegas from your bank account. In this case you might want the bank to detect that and prevent it from happening again.
Do you want to discover structure in unexplored data? That’s clustering. For instance: if you have a large amount of website logs, you might want to explore them to see if there are groups of similar visitor behaviors. These groups of visitor behaviors might help you improve your website.
What is available?
How much data do you have? Of course, this depends on the problem you want to solve and the kind of data you’re playing with. Knowing the amount of data you have is important: if you have more than 100,000 data points, you will be able to use every kind of algorithm!
Do your data points have labels? That is, do you know the category of each data point you have? If you know the category an image belongs to, you know its label (Figure 4). If you don’t, the data points are unlabelled (Figure 5).
Do you have a lot of features to work with? The number of features you have might influence your algorithm choice. In the case of house price forecasting, you might need to know the total area of the floor plan of the house, the number of floors, the proximity to the city center, and so on. Up to a point, more features can make your analysis more accurate, but too many or too few features will restrict your choice of algorithm. Having too many features also increases the risk of redundant features: features that are correlated, such as the area of the house and its inner volume, affect the performance of some algorithms.
How many classes do you have? Knowing how many classes (categories) you have is important for some ML algorithms, especially for some exploratory ML algorithms.
What are your constraints?
What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster. This is the case, for instance, for embedded systems.
Does the prediction have to be fast? In real-time applications, it is obviously very important to have a prediction as fast as possible. For instance, in autonomous driving, the classification of road signs must be as fast as possible to avoid accidents.
Does the learning have to be fast? In some circumstances, training models quickly is necessary: sometimes, you need to rapidly update your model, on the fly, with a different dataset.
Two other very important aspects, which we enthusiastic developers have a tendency to forget, are the maintainability of the solution we choose and communication.
Maintainability: it is sometimes wiser to go for a simpler solution giving correct results than for a very sophisticated solution you’re not 100% confident with, even if the latter gives slightly better results. You might not be able to easily update the sophisticated solution or fix a bug in it later.
Communication: we developers sometimes work with non-developers 🙂 For some projects, it is necessary to explain your solution to people of other professions. In this case, it might be wise to go for an ML solution that is easier to explain to a layman.
3. A little bit of theory
A bit of theory is required before going any further. Let’s first talk about the different existing ML approaches: the supervised, the unsupervised and the semi-supervised approach. Then, we’ll talk about two very important notions in ML: bias and variance.
In supervised learning, all our data points are labeled. The goal is to find a good separation between classes, as you can see in Figure 6. Here, we want to correctly separate blue labeled data points and red labeled data points.
In unsupervised learning, the input data points are NOT labeled. The goal here is to group data points by similarity or proximity. Labels may then be attributed to the resulting groups of data points.
In semi-supervised learning, both approaches are mixed. A model is first trained using a few labeled data points. Unlabeled data points are used later to further improve the model. This approach is very interesting because we often encounter situations where we have few labeled data points and a large amount of available unlabeled data points. Famous semi-supervised approaches are Active Learning and Co-Training.
- In Active Learning, a user is periodically asked to manually label data points, which are then incorporated into subsequent training rounds.
- Co-Training is interesting because it does not require any human interaction: two or more predictors are trained on different « views » of the same unlabeled data points (i.e. using different sets of features). The classifiers are then tested against new incoming unlabeled data points, and misclassified data points are reincorporated into the next training rounds to correct these errors. This approach does require defining a measure of confidence to decide whether one or both classifiers misclassified a data point, though.
Bias and variance
Bias and variance are two important notions in ML. They are indicators you should always keep an eye on when training your models. They will allow you to have an idea of what the performance of your model is going to be with new input data.
The bias is the error due to erroneous learning assumptions. It simply means that you did not train your model correctly, i.e., that you wrongly separated your data (Figure 8). A high bias means your model badly underfits the learning data.
The variance is the error due to sensitivity to small fluctuations in the learning dataset. A high variance means you fitted your learning data too well; in that case, the model won’t adapt to new input data points. ML’s main goal is to generate a model that can be generalized to any new input data, so fitting the learning data too closely works against this objective. As a concrete example, imagine you trained a model that overfits your learning data as in Figure 9. In that case, if you want to predict whether a new input data point belongs to the blue or the red class, this model will predict the blue class (Figure 9), whereas you would naturally have expected this new data point to be marked as red, because three red data points surround it.
One of Machine Learning’s biggest goals is to train a model that can be generalized to new data. If your model cannot correctly predict on new data, then your training is useless. As you’ve seen above, too high a variance doesn’t allow your model to generalize correctly to new data. And, quite obviously, too high a bias doesn’t allow the model to learn from the data at all.
One way to estimate how well a model will generalize is K-fold cross-validation: the original learning data is randomly partitioned into K different folds of equal size. At each step, one fold is selected to test the performance of the model, and the other (K-1) folds are used for the training. This step is repeated K times. If your model doesn’t suffer from high variance (aka overfitting in the ML community), then you should have homogeneous performances over the K cases. If your model is performing well (low bias), you should also have a high average performance over the K cases.
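As an illustration, here is what K-fold cross-validation can look like with the scikit-learn library; the dataset (the classic Iris flowers) and the choice of a decision tree are hypothetical, picked only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the remaining one, 5 times.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# Homogeneous scores across the 5 folds suggest low variance;
# a high average score suggests low bias.
print("mean:", scores.mean(), "std:", scores.std())
```

A large spread between the K scores is the warning sign of overfitting described above.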
4. Popular ML algorithms
Now, we’re going to review the most popular ML algorithms. For each algorithm, we’ll talk about its advantages and drawbacks.
Linear Regression is a regression algorithm (but you probably figured that out!). This algorithm’s principle is to find a linear relation within your data (Figure 10). Once the linear relation is found, new values are predicted using this relation.
- Very simple algorithm
- Doesn’t take a lot of memory
- Quite fast
- Easy to explain
- Requires the data to follow a linear trend (see « Polynomial Regression » if you think you need a polynomial fitting)
- Unstable if features are redundant, i.e. if there is multicollinearity (in that case, you should have a look at « Elastic-Net » or « Ridge-Regression »)
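A minimal sketch with scikit-learn, using invented house-price data that happens to be perfectly linear (price = 2000 × area), just to show the fit-then-predict workflow:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: floor area (m^2) vs market price, linear by construction.
X = np.array([[50], [80], [100], [120], [150]])
y = np.array([100_000, 160_000, 200_000, 240_000, 300_000])

model = LinearRegression().fit(X, y)

# Predict the price of a 90 m^2 house from the learned linear relation.
predicted = model.predict([[90]])[0]
```

Because the toy data is exactly linear, the prediction lands on the line at 180,000.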
The Decision Tree algorithm is a classification and regression algorithm. It subdivides learning data into regions having similar features (Figure 11). Descending the tree as in Fig. 12 allows the prediction of the class or value of the new input data point.
- Quite simple
- Easy to communicate about
- Easy to maintain
- Few parameters are required and they are quite intuitive
- Prediction is quite fast
- Can take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be)
- Naturally overfits a lot (it generates high-variance models, it suffers less from that if the branches are pruned, though)
- Not capable of being incrementally improved
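A short scikit-learn sketch (the Iris dataset and the depth limit are arbitrary choices for illustration), showing how limiting the depth acts like pruning and curbs the tree’s natural tendency to overfit:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=3 keeps the tree small: fewer regions, lower variance.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Predicting = descending the tree for one data point.
prediction = tree.predict(X[:1])
```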
Random Forest is a classification and regression algorithm. Here, we train several decision trees. The original learning dataset is randomly divided into several subsets of equal size, and a decision tree is trained on each subset. Note that a random subset of features is selected for the training of each decision tree. During the prediction, all decision trees are descended and the predictions are averaged, for regression, or a majority vote is taken, for classification (Figure 13).
- Robust to overfitting (thus solving one of the biggest disadvantages of decision trees)
- Parameterization remains quite simple and intuitive
- Performs very well when the number of features is big and for large quantities of learning data
- Models generated with Random Forest may take a lot of memory
- Learning may be slow (depending on the parameterization)
- Not possible to iteratively improve the generated models
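As a hypothetical example with scikit-learn (again on Iris, chosen only for brevity), a forest of 100 trees whose classification is a majority vote:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample of the data, with a random
# subset of features considered at each split; predict() takes a majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```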
Boosting is similar to Random Forest in that it trains several smaller models to make a bigger one. In this case, the models are trained one after the other (i.e. model n+1 depends on model n). Here, the smaller models are called « weak predictors ». The Boosting principle is to « increase » the « importance » of the data points that the previous weak predictor misclassified (Figure 14). Similarly, the « importance » of the learning data points that were previously predicted correctly is decreased. By doing these two things, the next weak predictor learns better. Thus, the final predictor (model), a serial combination of the weak predictors, is capable of predicting complex new data. Predicting simply means checking whether new data falls in the blue or the red space, for instance, in the classification problem of Figure 14.
- Parameterization is quite simple, even a very simple weak-predictor may allow the training of a strong model at the end (for instance: having a decision stump as a weak predictor may lead to great performance!)
- Robust to overfitting (as it’s a serial approach, it can be optimized for prediction)
- Performs well for large amounts of data
- Training may be time consuming (especially if we train, on top of it, an optimization approach for the prediction, such as a Cascade or a Soft-Cascade approach)
- May take a lot of memory, depending on the weak-predictor
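One well-known boosting flavor is AdaBoost; here is a hedged scikit-learn sketch (the breast-cancer dataset is an arbitrary choice) whose default weak predictor is exactly the decision stump mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost's default weak predictor is a decision stump (depth-1 tree).
# After each round, misclassified points receive a higher weight so the next
# stump focuses on them; the final model is a weighted vote of all stumps.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
```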
Support Vector Machine (SVM)
The Support Vector Machine finds the separation (here, a hyperplane in an n-dimensional space) that maximizes the margin between two data populations (Figure 16). By maximizing this margin, we mathematically reduce the tendency to overfit the learning data. The separation maximizing the margin between the two populations is based on support vectors: the data points closest to the separation, which define the margin (Figure 16). Once the hyperplane is trained, you only need to store the support vectors for the prediction. This saves a lot of memory when storing the model.
During prediction, you only need to know if your new input data point is “below” or “above” your hyperplane (Figure 17).
- Mathematically designed to reduce the overfitting by maximizing the margin between data points
- Prediction is fast
- Can manage a lot of data and a lot of features (high dimensional problems)
- Doesn’t take too much memory to store
- Can be time consuming to train
- Parameterization can be tricky in some cases
- Communicating isn’t easy
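A small scikit-learn sketch (Iris again, purely for illustration), showing the memory-saving property mentioned above: after training, only the support vectors need to be kept.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Linear-kernel SVM; the learned hyperplanes are defined only by the
# support vectors, a small subset of the training data.
svm = SVC(kernel="linear").fit(X, y)

n_support_vectors = len(svm.support_vectors_)
```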
Neural Networks learn the weights of connections between neurons (Figure 18). The weights are adjusted, learning data point after learning data point as shown in Figure 18. Once all weights are trained, the neural network can be used to predict the class (or a quantity, in case of regression) of a new input data point (Figure 19).
- Very complex models can be trained
- Can be used as a kind of black box, without performing a complex feature engineering before training the model
- Numerous kinds of network structures can be used, with very interesting properties (CNN, RNN, LSTM, etc.). Combined with the « deep » approach, even more complex models can be learned, unleashing new possibilities: object recognition has recently been greatly improved using Deep Neural Networks.
- Very hard to simply explain (people usually say that a Neural Network behaves and learns like a little human brain)
- Parameterization is very complex (what kind of network structure should you choose? What are the best activation functions for your problem?)
- Requires a lot more learning data than usual
- Final model may take a lot of memory.
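A hedged scikit-learn sketch of a tiny fully connected network; the digits dataset and the hidden-layer size are arbitrary choices to keep it small and fast:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# A small fully connected network: 64 input pixels -> 32 hidden units -> 10 classes.
# The connection weights are adjusted iteratively during training.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0).fit(X, y)
```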
The K-Means algorithm
This is the only unsupervised algorithm in this article. The K-Means algorithm discovers groups (or clusters) in non-labelled data. The principle of this algorithm is to first select K random cluster centers among the unlabelled data. Each unlabelled data point is then assigned to the class of its nearest cluster center. After a category has been attributed to each data point, each cluster center is re-estimated from the points assigned to it. These steps are repeated until convergence. After having iterated enough, we have the labels of our previously unlabelled data! (Figure 20).
- Parametrization is intuitive and works well with a lot of data.
- You need to know in advance how many clusters there will be in your data… This may require a lot of trials to « guess » the best number K of clusters to define.
- Clusterization may be different from one run to another due to the random initialization of the algorithm
Advantage or drawback?
- The K-Means algorithm is actually more a partitioning algorithm than a clustering algorithm. It means that, if there is noise in your unlabelled data, it will be incorporated within your final clusters. In case you want to avoid modeling the noise, you might want to go for a more elaborate approach such as the HDBSCAN clustering algorithm or the OPTICS algorithm.
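A small scikit-learn sketch on synthetic, invented data; it also shows the usual mitigation for the random-initialization drawback mentioned above (restarting the algorithm several times):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three hypothetical, well-separated blobs of unlabelled 2-D points.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.randn(50, 2) + [0, 0],
    rng.randn(50, 2) + [10, 0],
    rng.randn(50, 2) + [0, 10],
])

# K must be chosen in advance; n_init restarts the random initialization
# 10 times and keeps the best result, making runs more stable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```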
One-Class Support Vector Machine (OC-SVM)
This is the only anomaly detection algorithm in this article. The principle of the OC-SVM algorithm is very close to that of the SVM algorithm, except that the hyperplane trained here is the one maximizing the margin between the data and the origin, as in Figure 21. In this scenario, there is only one class: the « normal » class, i.e. all the data points belong to one class. If a new input data point is below the hyperplane, it simply means that this specific data point can be considered an anomaly.
Advantages and drawbacks: similar to those of the SVM algorithm presented above.
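A hedged scikit-learn sketch on invented data, in the spirit of the withdrawal-anomaly example earlier: train on « normal » points only, then flag points that fall outside the learned region.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical "normal" behaviour: points clustered around one spot.
rng = np.random.RandomState(0)
X_train = 0.5 * rng.randn(100, 2) + [2, 2]

# nu bounds the fraction of training points allowed to be treated as outliers.
ocsvm = OneClassSVM(nu=0.1, gamma="auto").fit(X_train)

# predict() returns +1 for "normal" points and -1 for anomalies.
normal = ocsvm.predict([[2, 2]])[0]
anomaly = ocsvm.predict([[10, 10]])[0]
```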
5. Choosing which algorithm to use
Now that we’ve been through some of the most popular ML algorithms, this table might help you decide which to use!
* Only unsupervised algorithm presented
** May not require feature engineering
6. Practical advice
Let’s wrap this up with some practical advice! My first advice is not to forget to have a look at your data before doing anything! It may save you a lot of time afterwards. Looking directly at your raw data gives you good insights.
I also deeply recommend you to work iteratively! Amongst the ML algorithms you identified as potential good approaches, you should always begin with algorithms whose parametrization is intuitive and simple. That’ll allow you to quickly define if the approach you picked is or isn’t fitting. This is especially true when you’re working on a Proof Of Concept (POC).
Although it is very important, I won’t go into much detail about features and feature engineering. Depending on the problem, features may be obvious and easy to find in the data; in many cases, that’s enough to get well-performing models. But sometimes, you need additional features for a better training. Be careful about having too many features: you might face the curse of dimensionality and need a lot more data to compensate. Besides, having too many features increases the occurrence of multicollinearity, and that’s not great. Fortunately, the number of dimensions (or features) can be reduced using dimension reduction algorithms (the best-known being PCA). Adding or removing features is the main purpose of feature engineering.
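A small scikit-learn sketch of PCA on synthetic, invented data: 10 features that are really just noisy mixtures of 2 hidden variables, which PCA recovers.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 features, but only 2 truly independent directions
# (the 10 features are noisy linear mixtures of 2 hidden variables).
rng = np.random.RandomState(0)
hidden = rng.randn(200, 2)
X = hidden @ rng.randn(2, 10) + 0.01 * rng.randn(200, 10)

# PCA finds the 2 directions that explain almost all of the variance,
# shrinking 10 correlated features down to 2 informative ones.
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
```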
I also recommend you first experiment with your approach in sandbox mode on a restricted dataset. High-level languages such as R, Matlab or Python are perfect for that. Once, and only once, you have validated your approach in sandbox mode, you can implement it in your product.
In case your problem is non-linear, algorithms such as Naive Bayes and linear or logistic regression are not suitable. Other algorithms may simply require a different parametrization.
Concerning the performance, it is often difficult to know in advance which algorithm is going to perform the best amongst those identified as good approaches. The best way to know is often to try them all and see!
So let’s get going!
You see, machine learning isn’t out of reach! By correctly defining your problem and understanding how these algorithms work, you can quickly identify good approaches. And with more and more practice, you won’t even have to think about it!
If you want to go deeper into the concept of reinforcement learning, you can read this great article.