In my previous blog post, I wrote about the importance of datasets in Machine Learning (if you haven’t read it yet, go check it out here!) and succinctly outlined solutions and sources for building your own.
But one question remained: How can we simply create a trustworthy source of specialized knowledge that suits our needs?
In a recent post, spacemachine.net compared the importance of datasets and algorithms in the most notable breakthroughs in AI:
The take-away of this study is that even though progress in AI algorithms is considerable, the key to solving most of our problems resides in structured, qualified and valuable data.
As I said earlier, an algorithm provided with enough pre-treated and certified data will spot the underlying characteristics more easily and thus perform better, thanks to better data modeling.
But sometimes, we can’t have a large database. We’re limited to a small amount of data and we’re forced to come up with a model based on it.
There’s a field related to Machine Learning which does just that: data augmentation.
The principle behind data augmentation is simple: from already qualified data, we can generate more by modifying our base.
The main benefit of data augmentation is quite obvious: you get more data, and your training and/or testing set gets better.
The second one is that, since we’re adding noise, we get closer to real data and can better measure how our machine learning technique behaves in the real world.
Data augmentation is most commonly used in image recognition, where deep learning shines. Let’s take an example:
Let’s imagine a classification task where we need to classify an image as either our logo or an orange.
Let’s say, for the sake of simplicity, that we have one example in each class.
The goal here is to build a model capable of distinguishing the key observations (or features) we need to differentiate the pictures. With deep learning, we need a considerable number of examples for the algorithm to learn the model.
From 1 to 10
Obviously, such a small number of examples will lead to considerable overfitting, making our solution non-viable.
Overfitting occurs because the algorithm is trying to learn distinctions from a limited dataset.
With so few inputs, the algorithm will fit each and every feature of those specific examples, which is not what we want: we want it to generalize from a set of examples.
By rotating, mirroring, adjusting contrast or grayscale, or randomly cropping our training samples, we can easily multiply our algorithm’s knowledge base: it is possible to generate new images from the original ones.
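These transformations can be sketched in a few lines of numpy. This is a minimal illustration on a single grayscale array, not a production pipeline: the crop size (80%) and contrast factor (1.2) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Generate simple variants of one grayscale image (H x W array)."""
    variants = [
        np.fliplr(image),               # horizontal mirror
        np.flipud(image),               # vertical mirror
        np.rot90(image),                # 90-degree rotation
        np.clip(image * 1.2, 0, 255),   # contrast boost, clamped to valid range
    ]
    # random crop: keep an 80% window of the original
    h, w = image.shape
    ch, cw = int(0.8 * h), int(0.8 * w)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    variants.append(image[top:top + ch, left:left + cw])
    return variants

# one 10x10 "image" becomes five new training samples
original = rng.integers(0, 256, size=(10, 10)).astype(float)
augmented = augment(original)
print(len(augmented))  # 5
```

Each variant preserves the content the label depends on (our logo stays our logo when mirrored), which is exactly why these transformations are safe to apply.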
Imagining ways to do data augmentation with images is quite straightforward, but what if we want to augment something like trading datapoints?
Well, the fact is, there’s no single defined way to create augmented data, so I’ll propose a few ways to enhance a text dataset:
The same scheme applies to text: we need ways to slightly modify the corpus we work with, producing an augmented dataset while keeping the key points our algorithm needs.
The intuitive approach is to replace words with their closest synonyms, and even there, there are a few ways to do it:
Here comes our old friend: WordNet (which I talked about in a previous article). WordNet is an ontology built at Princeton University that models relationships between words, like synonyms, antonyms, hyponyms, and so on. For the record, many sites like thesaurus.com use WordNet internally as their knowledge base.
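Synonym replacement is simple enough to sketch directly. The snippet below uses a tiny hand-built synonym table as a stand-in for a real WordNet lookup (with NLTK you would query `wordnet.synsets(word)` instead); the words and synonyms are illustrative, not exhaustive.

```python
import random

# Hand-built synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "small": ["little", "tiny"],
    "dataset": ["corpus", "collection"],
    "build": ["construct", "create"],
}

def synonym_augment(sentence, seed=0):
    """Replace each word found in the table with a randomly chosen synonym."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        choices = SYNONYMS.get(word.lower())
        out.append(rng.choice(choices) if choices else word)
    return " ".join(out)

print(synonym_augment("we build a small dataset"))
```

Varying the seed (or sampling several times) yields several distinct sentences from one original, each carrying essentially the same meaning.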
How can you beat that? Word embeddings!
Word embeddings are words represented as vectors, basically arrays of a fixed, chosen dimension, holding semantic information meant for computers to understand.
Those mappings are meaningless on their own, but semantically similar words have similar vectors, and vector similarities are easy to compute. This lets us perform calculations with words:
vector(uncle) – vector(man) + vector(woman) ~ vector(aunt)
If you want to go deeper into word vectors, I suggest this article from Christopher Olah’s blog.
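The analogy above can be demonstrated with toy vectors. These 2-dimensional embeddings are chosen by hand purely so the arithmetic works out; real embeddings (word2vec, GloVe) are learned from corpora and have hundreds of dimensions.

```python
import numpy as np

# Hand-picked toy embeddings: second coordinate encodes "feminine".
vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "uncle": np.array([3.0, 0.0]),
    "aunt":  np.array([3.0, 1.0]),
    "boy":   np.array([0.5, 0.0]),
    "girl":  np.array([0.5, 1.0]),
}

def nearest(query, exclude):
    """Return the word whose vector is most cosine-similar to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], query))

# vector(uncle) - vector(man) + vector(woman) lands closest to "aunt"
query = vectors["uncle"] - vectors["man"] + vectors["woman"]
print(nearest(query, exclude={"uncle", "man", "woman"}))  # aunt
```

The same nearest-neighbor lookup, run on a single word's vector instead of an arithmetic combination, is how close synonyms are retrieved from an embedding space.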
If you take a single word, you can retrieve closely related synonyms easily:
Using one of these databases requires a system to disambiguate the word we want to find a synonym for, which can be quite tedious, but we can imagine generating data only from non-ambiguous words.
Another fun approach would be to swap the parts of sentences joined by words like “but”, “even though” or “although” to create a whole new formulation, while keeping the underlying meaning of the phrase.
Moreover, we can even just strip away embellishments like adjectives, adverbs and so on.
But wait, there’s more!
Another benefit of data augmentation is to evaluate the quality of the model you just built.
The usual way of testing the model you just built is to take the whole dataset and split it 80/20, using the former as the training set and the latter as the testing set. This becomes a problem when the original dataset is small, because we end up training our model on only 80% of an already small dataset.
Further along the road comes what’s called cross-validation, which aims to ensure that every example from the original dataset has the same chance of appearing in the training and testing sets. This allows a complete evaluation of the quality of the model trained on our data.
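The mechanics of k-fold cross-validation fit in a few lines of plain Python. Libraries like scikit-learn provide this out of the box (e.g. `KFold`); the sketch below just shows how every example ends up in exactly one test fold.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder
        end = n_samples if i == k - 1 else start + fold_size
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

# every example appears in exactly one test fold
for train, test in k_fold_splits(10, 5):
    print(len(train), len(test))  # 8 2 on each of the 5 folds
```

Training and scoring the model once per fold, then averaging the scores, gives an evaluation that no single lucky or unlucky split can distort.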
Thanks to data augmentation, we can thus compensate for our lack of entries, and do testing and validation without fear of losing examples that are important for generalization.
In short, data augmentation is a wonderful approach to solve several problems we might encounter whilst working with machine learning.
Sometimes, small datasets are all we have, and generating data can be a lifesaver for both training and testing our model.
Paul RENVOISÉ — SAP Conversational AI