A convolutional neural network (CNN or ConvNet) is one of the most popular algorithms for deep learning, a variant of machine learning in which a model learns to perform classification tasks directly from images, video data, texts or acoustic data.

Table of Contents

  1. Introduction
  2. Convolution neural networks
    1. The mathematical part
    2. The high-level explanation
      1. CONV
      2. RELU
      3. POOL
  3. Code sample - TensorFlow and Keras


Image recognition is the task of taking an image and labelling it. For us humans, this is one of the first skills we learn from the moment we are born and is one that comes naturally and effortlessly. By the time we reach adulthood we are able to immediately recognize patterns and put labels onto objects we see. These skills to quickly identify images, generalized from prior knowledge, are ones that we do not share with our machines.

How a Machine sees a image

When a computer sees an image, it will see an array of pixel values, each between a range of 0 to 255. These values while meaningless to us, are the only input available to a machine. No one knows how exactly we living beings process images but scientists today have figured out a technique to simulate this process, albeit at a basic level. We call this technique deep learning.

A convolutional neural network (CNN or ConvNet) is one of the most popular algorithms for deep learning, CNNs are especially useful for finding patterns in images and thus recognizing objects, faces and scenes. Facebook uses neural nets for their automatic tagging algorithms, Google for their photo search, Amazon for their product recommendations, etc.

Most popular use case of these networks is for image processing. CNNs learn directly from image data. They use patterns to classify images and make manual extraction of features unnecessary.

Within image processing, let’s take a look at CNNs for image classification using tensor flow and keras.

Convolution neural networks

Image recognition used to be done using much simpler methods such as linear regression and comparison of similarities. The results were obviously not very good, even the simple task of recognizing hand-written alphabets proved difficult. Convolution neural networks (CNNs) are supposed to be a step up from what we traditionally do by offering a computationally cheap method of loosely simulating the neural activities of a human brain when it perceives images.

Let us understand what a convolution is without relating it to any of the brain stuff.

The mathematical part

Simplified depiction of a 32x32x3 image

A typical input image will be broken down into its individual pixel components. In the picture above, we have a 32x32 pixel image which has a R, G, and B value attached to each pixel, therefore a 32x32x3 input, also known as an input with 32 height, 32 width, and 3 depth.

applying a 3x3 filter

Math Filtering

A CNN would then take a small 3x3 pixel chunk from the original image and transform it into a single figure in a process called filtering. This is achieved by multiplying a number to each of the pixel value of the original image and summing it up. A simplified example of how the math is done is as described in the picture above. NOW STOP RIGHT HERE! Make sure you understand the mathematics of how to conduct filtering. Re-read the contents if you need to. As for how we arrive at this filter and why it is of the size 3x3, we will explain later in this article.

Since we are dealing with an image of depth 3 (number of colors), we need to imagine a 3x3x3 sized mini image being multiplied and summed up with another 3x3x3 filter. Then by adding another constant term, we will receive a single number result from this transformation.

filtering in action

This same filter will then be applied to every single possible 3x3 pixel on the original image. Notice that there are only 30x30 unique 3x3 squares on a 32x32 image, also remember that a filter will convert a 3x3 pixel image into a single image so the end result of applying a filter onto a 32x32x3 image will result in a 30x30x1 2nd ‘image’.

The high-level explanation

What we are trying to do here is to detect the presence of simple patterns such as horizontal lines and color contrasts from the original image. The process as described above will output a single number. Typically this number will be either positive or negative. We can understand positive as the presence of a certain feature and negative as the absence of the feature.

identifying vertical and horizontal lines in a picture of a face

In the image above, a filter is applied to find vertical and horizontal lines and as we can see, in each of the pictures on the left, only the places where vertical lines are present will show up in white and likewise horizontal lines for the picture on the right.

Going by this idea we can think of filtering as a process of breaking down the original image into a list of presence of simplified structures. By knowing the presence of slanted lines and horizontal lines and other simple basic information, more interesting features such as eyes and nose and mouth then then be identified. If the presence of eyes, mouth and nose are detected, then the classifier will have a pretty high certainty that the image at hand is probably a face. Basically that is what a CNN would do, by doing detective work on the abstract information that it is able to extract from the input image and through a somewhat logical thought process come to the deduction of the correct label to attach to a particular image. The model might not exactly look for eyes or nose, but it would attempt to do something similar in an abstract manner.

structure of a typical CNN, here classifying a car


In the model in the picture, the first layer is a CONV layer. It is nothing new as CONV is just short form for convolution layer.


The RELU layer (short for rectifier layer) is basically a transformation of all negative outputs of the previous layer into 0. As negative numbers would also contribute to the output of the next layer, 0 has a significance in the sense that it will not affect the results of the next layer. Looking back at the high-level definition of how a convolution works, negative numbers should mean the absence of a feature.


Image processing is a very computationally intensive process. To allow our algorithm to run at a decent speed while not compromising accuracy too heavily, we do a form of reduction on the image size in a technique called pooling. The image below shows how it is done. From each 2x2 square, we find the pixel with the largest value, retain it and throw away all the unused pixels we also do this for each depth layer (recall on the input image, it would be each color layer). Doing this transformation would essentially reduce the dimensions of the original image by half on height and another half on weight. Another reason we wish to do this is to converge features of close proximity together such that more complex features can develop sooner.

pooling on a 4x4 input

  • FC: After retrieving all of the advanced features from each image, we combine them together to classify the image to it’s proper label. We do so in the fully connected layer.

Code sample - TensorFlow and Keras

So now let’s take a look at convolutions and pooling in code. We don’t have to do all the math for filtering and compressing, we simply define convolutional and pooling layers to do the job for us.

model = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(2, 2),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')

Here we’re specifying the first convolution. We’re asking keras to generate 64 filters for us. These filters are 3 by 3, their activation is relu, which means the negative values will be thrown way, and finally the input shape is as before, the 28 by 28. That extra 1 just means that we are tallying using a single byte for color depth. As we saw before our image is our gray scale, so we just use one byte.

Complete Code below or Open in Google Colab

import tensorflow as tf

class myCallback(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs={}):
if (logs.get('accuracy') > 0.998):
print("\nReached 99.8% accuracy so cancelling training!")
self.model.stop_training = True


callbacks = myCallback()

mnist = tf.keras.datasets.mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0
model = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(2, 2),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=10, callbacks=[callbacks])
test_loss, test_acc = model.evaluate(test_images, test_labels)