What is image recognition in a nutshell
Building robots that can process and understand images has long been a key challenge for the scientific community. How nice would it be to say “bring me a beer” to a personal assistant and have it fetch your favorite brew from the fridge? To complete this single interaction, the assistant must avoid obstacles, recognize the fridge, recognize the beer in the fridge and recognize whoever gave the order. In other words, it must recognize objects in images.
This applies to many applications beyond robotics. Any image sharing service, for example, faces the challenge of automatically checking whether an uploaded picture is NSFW.
These tasks fall under computer vision, a research field that emerged in the 1960s and aims to extract information from images. It is a vast field, and image recognition is one of its sub-categories.
Types of image recognition
Even though it is a sub-category, image recognition is vast in itself and includes different applications. In the broad sense, image recognition focuses on recognizing what constitutes an image.
This could be the kind of objects found in the image along with their location: this is known as object classification. Another application is identification; unlike classifiers, identifiers extract more detailed information about the object. Classic examples are identifiers of handwritten digits or the impressive face recognition on Facebook that automatically tags your friends when you upload a picture.
Finally, detection is a more binary application of recognition: it determines whether a certain condition is met in an image. Boss sensor is a fun example: it automatically hides your screen if your boss is approaching in the open space. It is actually a mix of detection and identification. First, it detects whether a human is present in the image; if so, it identifies whether that human is your boss.
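The two-stage logic described for Boss sensor can be sketched in a few lines. This is only an illustration of the detect-then-identify idea, not Boss sensor's actual code: `detect_human` and `is_boss` are hypothetical callables standing in for whatever models do the real work.

```python
from typing import Callable


def should_hide_screen(
    image: bytes,
    detect_human: Callable[[bytes], bool],  # hypothetical detector
    is_boss: Callable[[bytes], bool],       # hypothetical identifier
) -> bool:
    """Return True when the screen should be hidden."""
    # Stage 1 (detection): a binary question, is there a human at all?
    if not detect_human(image):
        return False
    # Stage 2 (identification): only reached when someone is in the frame.
    return is_boss(image)
```

The identifier is usually the more expensive step, so running it only after a positive detection keeps the common "nobody there" case cheap.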
What makes recognition difficult?
Extracting information from an image is an extremely difficult computational problem and it is hard for humans to grasp this because we are naturally so good at it. In Thinking, Fast and Slow, Daniel Kahneman describes how the human brain is constantly searching for patterns; when we see words for example, we can’t help reading them.
Why is it so hard for machines? First, computers analyze images pixel by pixel unlike humans who have a global view of an image and tend to extract its components and how they relate to each other. As an example, a human would recognize an elephant in any situation: at the zoo, in the jungle or in a circus; a machine would have to be retrained for each situation.
What if we simplified the problem by providing images of elephants alone on a white background? Computers would still struggle with this because the elephants in the images could have different sizes, orientations and positions. Even lighting could affect the recognition. For example, let’s say our recognition algorithm was given the following model for an elephant: “a large grey animal with four legs and big ears.” Give it an image of an elephant viewed from above: the legs are not visible and the model fails.
Existing image recognition APIs and benchmark
The market for image recognition has been growing fast, but past solutions were not efficient enough to open up a recognition API. Recent developments in machine learning have improved the precision of this task and made the recognition API business model a reality.
This is great news for businesses that do not specialize in image processing. There is no need to spend weeks developing a custom ML solution for the recognition problem: the agile way here is to reuse an existing API. The key question then becomes: which one?
Major recognition APIs
The number of recognition APIs has recently exploded; lists of these APIs often exceed 20 different vendors. Some are still experimental and merely demonstrate their capabilities, while other tried and tested vendors offer a paid service. Their packages come with SLAs on response time and precision, so we focused on these APIs for our benchmark.
We compared IBM Visual Recognition, Google Cloud Vision, the recent Amazon Rekognition, Cloudsight, Clarifai and Azure computer vision. The API pricing differs but on average a call would cost you $0.0015. Luckily, the benchmark didn’t cost us anything as they offer a certain number of free calls every month 😉
Cloudy_vision benchmark tool
Developing this benchmark would not be so hard: it boils down to writing client code for each vendor. But again, why do it when there are existing tools? One of these is Cloudy Vision. Developed by Gaurav Oberoi in Python, it contains all the code needed to produce the benchmark. All that’s left to do is get a request token for each API and fetch an image dataset for the benchmark.
The tool is quite smart: results of API calls are stored as JSON files so the benchmark can be interrupted and restarted at any point. At the end, an HTML page is compiled with the input images and the corresponding JSONs, nice!
Recognition APIs return tags and/or captions related to the input image. They also return a confidence score between 0 and 1 for each tag and caption.
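The caching idea is easy to reproduce. The sketch below is not Cloudy Vision's actual code; it is a minimal illustration under our own assumptions: one JSON file per image and vendor, and a hypothetical `call_api` function that takes an image path and returns a JSON-serializable dict.

```python
import json
from pathlib import Path
from typing import Callable


def cached_call(
    image_path: Path,
    vendor: str,
    call_api: Callable[[Path], dict],  # hypothetical vendor client
    cache_dir: Path,
) -> dict:
    """Return the vendor's result for an image, calling the API only on a cache miss."""
    # Assumed naming scheme: <image name>.<vendor>.json
    cache_file = cache_dir / f"{image_path.stem}.{vendor}.json"
    if cache_file.exists():
        # A previous run already paid for this call: reuse its result.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = call_api(image_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this in place, restarting an interrupted benchmark only triggers API calls for the images that have no cached JSON yet, which also helps when you are living on free monthly quotas.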
Setting up the benchmark tool
A handful of image datasets are available on the internet and we went with the Caltech 101 dataset. The first benchmark gave a qualitative overview of each vendor’s capabilities, but we needed more quantitative metrics to make our decision.
To that end, we extended the benchmark to compute averages and standard deviations of response times and number of tags.
Another relevant metric is what we called the matching count. It’s the number of returned tags that match the input image. For this metric, the dataset must be manually tagged with expected tags. The matching count is then computed by comparing the tags returned by an API with the expected tags.
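For illustration, here is roughly what such a summary looks like with Python's standard library. This is a sketch, not the exact code we added: the numbers are made up, and using the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (average, sample standard deviation), rounded to 2 decimals."""
    avg = mean(values)
    # stdev needs at least two data points; fall back to 0.0 otherwise.
    std = stdev(values) if len(values) > 1 else 0.0
    return round(avg, 2), round(std, 2)


# Made-up measurements for one vendor, for illustration only.
response_times = [1.0, 1.2, 1.4]  # seconds per API call
tag_counts = [5, 7, 9]            # tags returned per image

print("response time:", summarize(response_times))
print("tags count:", summarize(tag_counts))
```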
We added the possibility to work with a JSON of expected tags per image in Cloudy Vision. Filling this JSON was straightforward since images in Caltech 101 are bundled in folders named after the image content.
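As a sketch of how such a file can be generated from the folder layout: the function name, the output file name and the JSON shape (relative image path mapped to a list of tags) are our own assumptions here, not necessarily what the extended tool uses.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def expected_tags_from_folders(dataset_dir: Path) -> dict[str, list[str]]:
    """Map each image (path relative to dataset_dir) to a tag taken from its folder name."""
    expected = {}
    # Only look at images sitting directly inside a category folder.
    for image in sorted(dataset_dir.glob("*/*")):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            # e.g. folder "ceiling_fan" becomes the tag "ceiling fan"
            label = image.parent.name.replace("_", " ").lower()
            expected[image.relative_to(dataset_dir).as_posix()] = [label]
    return expected


def write_expected_tags(dataset_dir: Path, output_file: Path) -> None:
    """Dump the mapping to a JSON file (hypothetical file layout)."""
    tags = expected_tags_from_folders(dataset_dir)
    output_file.write_text(json.dumps(tags, indent=2), encoding="utf-8")
```

A single tag per image is the crudest possible ground truth; nothing prevents you from editing the generated JSON by hand to add synonyms.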
When expected tags are present in Cloudy Vision, the matching count average and standard deviation are computed for each API.
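The matching count itself boils down to a set intersection. In the sketch below, case-insensitive exact matching is an assumption on our side (a stricter or fuzzier comparison would give different numbers), and the data is made up.

```python
from statistics import mean, stdev


def matching_count(returned_tags, expected_tags):
    """Number of distinct returned tags that also appear in the expected tags."""
    returned = {tag.strip().lower() for tag in returned_tags}
    expected = {tag.strip().lower() for tag in expected_tags}
    return len(returned & expected)


def matching_count_stats(results, expected):
    """Average and sample standard deviation of the matching count over all images."""
    counts = [
        matching_count(tags, expected.get(image, []))
        for image, tags in results.items()
    ]
    std = stdev(counts) if len(counts) > 1 else 0.0
    return mean(counts), std


# Made-up API output and expected tags, keyed by image path.
results = {
    "elephant/image_0001.jpg": ["Elephant", "animal", "grey"],
    "ceiling_fan/image_0001.jpg": ["lamp", "white"],
}
expected = {
    "elephant/image_0001.jpg": ["elephant"],
    "ceiling_fan/image_0001.jpg": ["ceiling fan"],
}
avg, std = matching_count_stats(results, expected)
```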
A good image recognition API should get the right tags… with great confidence! That’s why we also computed the average and standard deviation of the matching tags’ confidence.
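A possible way to compute this, assuming each returned tag is a dict with hypothetical `name` and `confidence` keys (every vendor has its own response format, so a normalization step would come first). Note that this sketch only averages over tags that actually matched; images with no match contribute nothing, which is one choice among several.

```python
from statistics import mean, stdev


def matching_confidences(results, expected):
    """Collect the confidence of every returned tag that matches an expected tag."""
    confidences = []
    for image, tags in results.items():
        wanted = {tag.lower() for tag in expected.get(image, [])}
        confidences.extend(
            tag["confidence"] for tag in tags if tag["name"].lower() in wanted
        )
    return confidences


# Made-up, already normalized API output keyed by image path.
results = {
    "a.jpg": [
        {"name": "Elephant", "confidence": 0.9},
        {"name": "grey", "confidence": 0.8},
    ],
    "b.jpg": [{"name": "elephant", "confidence": 0.5}],
}
expected = {"a.jpg": ["elephant"], "b.jpg": ["elephant"]}

confidences = matching_confidences(results, expected)
avg_confidence = mean(confidences)
std_confidence = stdev(confidences) if len(confidences) > 1 else 0.0
```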
All these additions can be found here and may be integrated in Cloudy Vision 😉
And finally, the results of our benchmark. (r.t stands for response time)
|Name|avg r.t (s)|std r.t (s)|avg tags count|std tags count|avg matching count|std matching count|avg matching confidence|std matching confidence|
|IBM visual recognition|1.18|0.33|7.45|2.79|1.17|1.18|0.53|0.4|
|Google cloud vision|1.12|0.1|6.73|2.63|0.98|1.14|0.53|0.42|
|Azure computer vision|1.99|0.71|3.2|2.77|0.33|0.53|0.24|0.38|
Different API, different purpose
Some metrics really stand out in the results. First, there aren’t any tag-related values for Cloudsight. Indeed, Cloudsight returns captions rather than tags and some of them were really impressive: “3 bladed ceiling fan with solar panel”. While that’s really cool, the downside is the high response time, which exceeds 10 seconds.
Another interesting metric is the standard deviation of tags count for Clarifai. This vendor always returns 20 tags going from most precise to most generic. The returned tags describe the image at various conceptual levels (content, color, environment, …). The high quantity and variety of tags can be a good thing, depending on the application. Beware though, it can be challenging to extract the right features from these tags!
The other APIs are more traditional. They are all available at similar prices and have acceptable response times and precision. Looking into the HTML more closely, we found out that the API precision depends on the type of image sent: some were better at recognizing planes, others at recognizing animals. That’s why filling the dataset with images taken from your particular use case is a must to make your final choice. Potential partnerships with these vendors should also be taken into account.
So, where does that leave us?
At the end of the benchmark, recognition felt like a standard service but when you think about it, reaching this level of quality is truly impressive. Today, you can literally ask what is on your image and get relevant responses in seconds. An opportunity for us to extend our request endpoint with an image parameter!
The true conclusion of this benchmark and what drove it was very simple: don’t reinvent the wheel! Use services that already exist! Specialized engineers have spent time developing them, sometimes failing hard, so don’t make the same mistakes and use their work to go further. Engineers are expensive, much more expensive than monthly payments to an external service. Look around, even for a simple benchmark code, because someone may have done it before you! Oh, and don’t forget to share what you learn 🙂
You can find the whole story in How we kinda sorta just added computer vision to chatbots.