How can machine learning help automate image classification?

Author: Mateusz Lewandowski, Data Scientist

From this article you will learn:

  • What machine learning image classification is and how it works.
  • How machine learning image classification has evolved.
  • Some examples of successful application of machine learning and neural networks in image recognition.

Have you ever tried to learn a foreign language? At the initial stages, you read laboriously, letter by letter, until the letters start to form a word that you recognize. If you come across a word that you already know, you don’t need the letter-by-letter stage – you recognize the word as an entity and read it easily. Let’s focus on the letter-by-letter part. To do this, you need to know all the letters. As obvious as it might sound, a letter on a screen is a group of pixels that forms a shape, and you recognize those shapes. Since there are 26 letters and you don’t expect to find any others in an English article, we can confidently say that we classify each letter into one of 26 existing classes.

That brings me to the next point: we all classify images. Letters, words, zip codes, famous brands. When driving, we additionally need to “read” (segment) images in the context of street traffic. A ‘trailer’, for instance, belongs to the class ‘car’, hence belongs to ‘traffic’, and should be watched out for. You can also label parts of an image as not belonging to street traffic, for example, a building or a tree. This is called semantic segmentation, of which there are two types, as explained below. Segmenting an image means assigning a label to each pixel. Semantic segmentation does not tell instances of the same class apart, whereas semantic instance segmentation does distinguish each individual person, car, etc. We can’t do without image classification. What is machine learning in image classification? Where does it apply and work best?

A note on semantic segmentation. Roughly speaking, there are two main types:

  • semantic segmentation
  • semantic instance segmentation

In the first type, the task is to tell apart objects of different classes in the picture. In the second type, the task goes one step further and also differentiates between objects of the same class, for example, two people in the same image.

The illustration above clearly presents the difference between the two types of segmentation. In both cases, every pixel in a picture is assigned to one of the classes. In this example, the pixels are assigned either to the class ‘person’ or to a specific person.
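To make the distinction concrete, here is a minimal sketch in plain Python. The tiny mask and the function name `to_instance_mask` are made up for illustration. It derives instance ids by grouping connected pixels, which is only a stand-in: real instance segmentation models predict instances directly, and this shortcut would merge two people who touch each other.

```python
# Semantic mask: every pixel gets a class label (0 = background, 1 = person).
semantic_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]


def to_instance_mask(mask, target_class=1):
    """Give each connected blob of target_class its own instance id (1, 2, ...)."""
    rows, cols = len(mask), len(mask[0])
    instances = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != target_class or instances[r][c]:
                continue  # wrong class, or already assigned to an instance
            next_id += 1
            stack = [(r, c)]  # flood fill over 4-connected neighbours
            while stack:
                y, x = stack.pop()
                if not (0 <= y < rows and 0 <= x < cols):
                    continue
                if mask[y][x] != target_class or instances[y][x]:
                    continue
                instances[y][x] = next_id
                stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return instances, next_id


instance_mask, person_count = to_instance_mask(semantic_mask)
for row in instance_mask:
    print(row)
print("people found:", person_count)
```

In the semantic mask both people share the label 1; in the instance mask each person's pixels carry a separate id.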

This is the same thing that everyone does while driving. Everything in sight needs to be assigned a label, roughly ‘potentially dangerous’ or ‘not dangerous’. For example, when driving down a street in a tourist area where families come with children to relax, it is not unlikely that a child will run into the street. Since there are many children, each of them is a potential danger and has to be considered separately (drive safely).

Machines (algorithms) are becoming better and better at classifying images too. Let me explain how.  

What is image classification machine learning? How does it work?

Every machine learning model combines two components: data and an objective. In image classification, the model is typically a neural network, an artificial structure that mimics the way biological neurons signal one another. Fed with data (in the case of image classification, image data), the model uses a predefined loss function to update its weights so that the difference between the predicted output and the real output is minimized.
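The weight-update idea can be sketched with a toy model that is far simpler than a neural network: a single weight and a bias, fitted by gradient descent on a mean squared error loss. The data, the learning rate, and the function names are assumptions chosen for illustration only.

```python
# Toy data following y = 2x + 1 (made-up values for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse_loss(w, b):
    """Mean squared difference between predicted and real outputs."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(steps=2000, learning_rate=0.05):
    w, b = 0.0, 0.0  # arbitrary starting weights
    for _ in range(steps):
        # Gradients of the loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        # Move each weight a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


initial_loss = mse_loss(0.0, 0.0)
w, b = train()
final_loss = mse_loss(w, b)
print(f"w={w:.3f}, b={b:.3f}, loss before={initial_loss:.3f}, loss after={final_loss:.6f}")
```

A neural network for images follows the same loop, only with millions of weights and gradients computed by backpropagation.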

Hence we can, with relatively little effort, teach a system to recognize images accurately. It turns out that, given a high volume of data and computational resources, it is easier to formally express what you want the system to accomplish, and let a machine work out the rest, than to define how the system should accomplish it.

Image analysis diagram.

How has machine learning image classification evolved? A brief history

The development of algorithms for computer vision has a long history. In terms of neural networks, it can be roughly divided into the era before deep learning and the deep learning era, with AlexNet in 2012 as the cut-off point.

A short glance into the era before deep learning methods reveals:

1) Viola-Jones algorithm (2001) and Haar-like features 

2) Histograms of Oriented Gradients (2005)

3) DPM (Deformable Part-based Model) + Bounding Box Regression

Much more interesting developments came after 2012: R-CNN (2014), SPPNet, Fast R-CNN, Faster R-CNN, and Feature Pyramid Networks. When it comes to image segmentation, there are two milestones: U-Net and Mask R-CNN.

U-Net was specifically designed for biomedical image segmentation. Trained with heavy data augmentation, it makes good use of even the small datasets that are common in medical fields, where the cost of creating annotated datasets is high.

What has changed in computer vision since 2012? Before the rise of deep neural networks, a human engineer was responsible for designing the features to be extracted from pictures. These could be simple, like horizontal or vertical edges, which are very easy to find, or more sophisticated, like histograms of oriented gradients. What worked for one domain did not necessarily work for another. For example, the first real-time face detection algorithm, developed by Viola and Jones, was crafted precisely for that task using ingenious feature selection, but it would not work in other domains, for example, on images of cats vs. dogs, regardless of the hours spent on adaptation.
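As a rough illustration of a hand-designed feature, the sketch below slides two fixed 3x3 edge kernels (Sobel-style) over a toy image in plain Python. The image and the helper name `filter_response` are assumptions for illustration; the point is that every number in the kernels is chosen by a person, not learned.

```python
# Hand-designed 3x3 kernels (Sobel-style). An engineer picks these
# numbers by hand; nothing here is learned from data.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]
HORIZONTAL_EDGE = [[-1, -2, -1],
                   [0, 0, 0],
                   [1, 2, 1]]


def filter_response(image, kernel):
    """Slide the kernel over the image (no padding) and return the responses."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            out_row.append(sum(kernel[i][j] * image[r + i][c + j]
                               for i in range(3) for j in range(3)))
        out.append(out_row)
    return out


# Toy 5x5 grayscale image: dark on the left, bright on the right,
# so it contains one vertical edge and no horizontal edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

vertical = filter_response(image, VERTICAL_EDGE)
horizontal = filter_response(image, HORIZONTAL_EDGE)
print("vertical edge response:  ", vertical[0])
print("horizontal edge response:", horizontal[0])
```

The vertical kernel responds strongly where brightness changes from left to right, while the horizontal kernel stays at zero. A convolutional network uses the same sliding operation but learns the kernel values during training.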

The whole paradigm changed with the advent of deep neural networks, in which each filter learns the weights it needs to extract a given set of features. Engineers no longer handpick the best parameters for each filter. Instead, an engineer selects the architecture, and backpropagation does the work of training the network. This dramatically shortens the loop from idea to proof of concept, which is another major practical reason for the recent popularity of deep learning methods: they greatly reduce the time required to present results.

Successful application of machine learning and neural networks 

Image classification for efficient navigation (Google Maps)

OK, enough of the non-stop glorification of deep learning! Let’s focus on a case study of a successful application of this technology. Google describes how they implemented address recognition at Google Maps.

Given the immense amount of data that Google has collected from the streets, it was not feasible to pay people to painstakingly look through all these images and manually mark visible addresses. Anything that could reduce the burden would bring value to the company. To address that, they designed an algorithm that favored high recall at the cost of lower precision. In this way, engineers could be sure that the vast majority of addresses were picked up. Lower precision also implied a need for a human in the loop, but it was much easier to have that human only glance through the annotated parts of the images and confirm or reject them than to go through everything manually.
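For readers less familiar with the two metrics, here is a small sketch of how precision and recall are computed. The labels are invented for illustration and have nothing to do with Google's actual data.

```python
def precision_recall(predicted, actual):
    """predicted/actual: lists of booleans, True = 'contains an address'."""
    tp = sum(p and a for p, a in zip(predicted, actual))      # correctly flagged
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false alarms
    fn = sum(a and not p for p, a in zip(predicted, actual))  # missed addresses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: 10 image crops, 4 of which truly contain an address.
actual = [True, True, True, True, False, False, False, False, False, False]
# A high-recall detector: it finds all 4 addresses but raises 3 false alarms.
predicted = [True, True, True, True, True, True, True, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here nothing is missed (recall of 1.0), while 3 of the 7 flagged crops are false alarms that a human reviewer can reject with a glance.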

That clearly shows how a carefully designed process can increase trust in what the technology can do. Notice, however, that it still required a human in the loop, and there is nothing intrinsically wrong with that. Summing up, engineers at Google took on an enormous task and automated it as much as was feasible (which does not imply a fully automated process).

The referenced study gives no exact numbers on the savings achieved, but when I ran the experiment myself, I estimated them at 5–10 seconds per address (accepting or rejecting a suggestion vs. finding the address in the image).

Image classification for better healthcare – tumor recognition

Another example is the application of image recognition in the medical field, as mentioned above. It is extremely costly to train a skilled human to annotate X-rays and ultrasound scans, and the cost of a mistake can be a life. Picture a situation in which a patient comes in with a small tumor and gets an X-ray, but the doctor fails to notice the cancer. The patient happily heads home, only to come back some time later with a cancerous growth that is much bigger and requires much more invasive therapy.

That job can be partly automated with a trained neural network that assigns pixels to the classes ‘tumor’ and ‘neutral’, which makes the analysis much faster. To comply with relevant laws, the process might need to include humans in the loop to verify regions that are predicted with less confidence than others. Even so, the process is greatly improved when obvious negatives and obvious positives are classified automatically.
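One way to picture such a human-in-the-loop step is a simple confidence threshold, sketched below. The thresholds, region names, and scores are hypothetical; in a real clinical setting they would have to be chosen and validated with medical experts.

```python
def triage(tumor_probability, low=0.05, high=0.95):
    """Route one predicted region based on the model's confidence.

    The thresholds are illustrative assumptions, not clinical values.
    """
    if tumor_probability >= high:
        return "tumor"         # obvious positive, classified automatically
    if tumor_probability <= low:
        return "neutral"       # obvious negative, classified automatically
    return "human review"      # uncertain, sent to a specialist


# Made-up model outputs: probability that each region contains a tumor.
region_scores = {"region_a": 0.99, "region_b": 0.02, "region_c": 0.60}
decisions = {name: triage(p) for name, p in region_scores.items()}
print(decisions)
```

Only the uncertain region reaches the specialist, which is where the time saving comes from.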

Machine learning model classifies lung cancer slides in under a minute

Image classification for marketing 

Another example of machine learning image classification, this time in marketing:

  • face recognition,
  • emotion recognition,
  • logo recognition.

There are also machine learning photo processing solutions for e-commerce, e.g. for retouching, sharpening, or cutting an element out of its background.

Speaking of logo recognition: currently, marketers have limited possibilities to measure the effectiveness of, for example, sponsorship deals. Logo recognition by machine learning would allow them to calculate the value of activities such as influencer marketing, brand loyalty, or brand recognition, and to measure the “share” of their logo across different social media and digital channels.


How are neural networks for image recognition trained?

Training neural networks from scratch is often prohibitively expensive. In computer vision, a process called transfer learning is widely used and has a much longer history than it does, for example, in natural language processing. This is because many of the features learned by convolutional filters turn out to be reusable.

The process is as follows:

  1. A deep neural network (deep meaning composed of many layers) is trained on some large-scale dataset, for example, the 1,000 image classes of the ImageNet competition.
  2. Once the training is complete, meaning that the desired metric has been achieved, the network is saved. It so happens that the features learned by the filters of convolutional networks are fairly universal and can be reused for classifying custom datasets.

The idea of deep networks is by no means new; it is progress in hardware that has made it possible to train deep architectures. The earlier layers (closest to the input) store more general features, while the final layers (closest to the output, the ‘bottom’ of the network here) store features specific to a particular type of data.

There are many pre-trained networks available, trained on a wide range of domains, for scientific but also for commercial use. To train a network to classify some custom objects, it is enough to load the pre-trained architecture, freeze the weights in the earlier (upper) layers, and fine-tune (adapt the weights of) only the final (bottom) layers. This is much easier than starting with architectural design, carefully optimizing the network whenever it fails to learn, and hoping that everything comes together.
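The freeze-and-fine-tune idea can be sketched without any deep learning library. In the toy example below, `FROZEN_WEIGHTS` stands in for a pre-trained feature extractor (the numbers are made up, not taken from a real network), and only a small classification head is trained on a custom dataset. In practice this is done with a deep learning framework and a genuinely pre-trained model; this is just a minimal illustration of which weights change and which do not.

```python
import math

# Stand-in for a pre-trained feature extractor. In practice these weights
# would come from a network trained on a large dataset; here they are
# made-up numbers so the example runs without any downloads.
FROZEN_WEIGHTS = [[1.0, -1.0],
                  [-1.0, 1.0]]


def extract_features(x):
    """Frozen layer: weighted sums followed by ReLU. Never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_WEIGHTS]


def predict(head, x):
    """Trainable head: logistic regression on top of the frozen features."""
    feats = extract_features(x)
    z = sum(w * f for w, f in zip(head["w"], feats)) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(data, epochs=500, learning_rate=0.5):
    head = {"w": [0.0, 0.0], "b": 0.0}
    for _ in range(epochs):
        for x, label in data:
            feats = extract_features(x)
            # Gradient of the cross-entropy loss with respect to z.
            error = predict(head, x) - label
            # Only the head's weights receive updates.
            head["w"] = [w - learning_rate * error * f
                         for w, f in zip(head["w"], feats)]
            head["b"] -= learning_rate * error
    return head


# Toy custom dataset: label 1 when the first input exceeds the second.
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
head = fine_tune(data)
predictions = [round(predict(head, x)) for x, _ in data]
print("predictions:", predictions)
```

Because the frozen layer already produces useful features, the head has only three numbers to learn, which is why fine-tuning needs far less data and compute than training from scratch.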

Very often, new problems can be solved by fine-tuning pre-trained networks. However, if extremely high performance is required and cannot be achieved with a pre-trained network, the only solution might be to design a custom neural network architecture and start the laborious process of training the network from the beginning. It is impossible to tell which technique is going to be successful before the first trial. Usually, the most effective approach is empirical: try it and see whether it works.

Result of network operation


When properly used, machine learning techniques are very helpful in many fields (image and text classification are just examples), but they are never an end in themselves. They help to reduce human work and hence should be considered a tool with high potential, but the area of application needs to be carefully selected.

Also, the data analyzed and the methods employed to analyze that data have the drawback of being difficult to explain. When a budget is being approved or current business decisions are being altered, it is worth taking the time to properly explain all the ‘why’s and estimate the possible outcomes. 

The field is growing, and the understanding of how it can help business is still evolving. In my opinion, new roles will likely continue to emerge that serve the pipeline and support people’s everyday work. 

