Convolutional Neural Networks 101 — Practical Guide
The Intel Classification Challenge
This article will cover some important concepts of CNNs through hands-on coding and an in-depth exploration of the Intel Image Classification dataset, available on Kaggle. The dataset contains 25 thousand images from 6 different classes: buildings, forests, glaciers, mountains, sea, and streets. The objective is to assign each image to the correct class.
Fundamental Concepts
Convolutional Neural Networks (CNNs) represent a specialized branch within the broader field of Neural Networks, primarily known for their effectiveness in image-related tasks such as classification and recognition. They are an example of ‘Deep Learning’: neural networks composed of multiple hidden layers, each layer designed to progressively extract and refine features from the input data. This layered architecture enables CNNs to learn complex patterns effectively, making them indispensable in the realm of computer vision and image analysis.
Convolution
A convolution is a specialized mathematical operation used in various fields, including deep learning. In the context of CNNs, convolution functions are akin to a scanning process. Imagine a filter (also known as a kernel) sliding across the entire area of an image. This filter begins its journey from the top-left corner and moves across the image horizontally, much like how we read, proceeding downwards row by row until it reaches the bottom.
During this scanning process, the convolution operation involves element-wise multiplication of the filter’s weights with the pixel values of the image. This operation is performed for every position of the filter over the image. The sum of these multiplications at each position gives one output value; collected over all positions, these values form what is known as a ‘feature map.’ This feature map is a condensed representation of certain features or patterns detected in the image, such as edges, textures, or specific shapes.
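To make this concrete, here is a minimal NumPy sketch of the scanning process (a toy example, not part of the project code): a 3x3 vertical-edge filter slides over a tiny grayscale image and produces a feature map that responds only where the filter straddles the dark-to-bright boundary.

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image and sum the element-wise products
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Tiny grayscale image: dark on the left, bright on the right
image = np.array([
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
], dtype=float)

# Simple vertical-edge filter
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

print(convolve2d(image, kernel))  # non-zero values only where the filter covers the boundary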
Data Exploration
Our dataset comes with 25k images separated into 3 distinct folders:
- 14k images inside seg_train directory
- 3k images inside seg_test directory
- 8k images inside seg_pred directory
We noticed that the images in the seg_pred folder are not labeled. They were most likely used as submission data during the challenge: each competitor ran predictions over these images, and only the contest organizers had access to their true classes.
Since we cannot verify whether predictions made over this folder are correct, we are not using the seg_pred dataset. Instead, we split our seg_train dataset into train_dataset and validation_dataset, and the accuracy of our models is measured on the test_dataset built from seg_test.
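A minimal sketch of how this split could be loaded with Keras (the directory paths and the seed are assumptions; adjust them to your local copy of the Kaggle dataset):

import keras

train_dataset = keras.utils.image_dataset_from_directory(
    'seg_train/seg_train',        # assumed path; point this at your seg_train folder
    validation_split=0.2,
    subset='training',
    seed=42,
    image_size=(150, 150),
    label_mode='categorical'
)
validation_dataset = keras.utils.image_dataset_from_directory(
    'seg_train/seg_train',
    validation_split=0.2,
    subset='validation',
    seed=42,
    image_size=(150, 150),
    label_mode='categorical'
)
test_dataset = keras.utils.image_dataset_from_directory(
    'seg_test/seg_test',          # assumed path for the labeled test images
    image_size=(150, 150),
    label_mode='categorical'
)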
Setup
Image Pre-Processing
In our current approach, we are not implementing any form of image pre-processing. However, for those aiming to achieve optimal results, incorporating techniques such as data augmentation and various transformations can be highly beneficial. These methods include rotating, shifting, shearing, and zooming the images. Applying these techniques can significantly enhance the accuracy of your model.
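For reference, here is a minimal sketch of what such augmentation could look like with Keras’ ImageDataGenerator; the parameter values are illustrative, not tuned, and we do not use this in the models below.

from keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings (rotation, shift, shear, zoom)
augmenter = ImageDataGenerator(
    rotation_range=20,        # rotate by up to 20 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10% of the width
    height_shift_range=0.1,   # shift vertically by up to 10% of the height
    shear_range=0.1,          # apply a shearing transformation
    zoom_range=0.1,           # zoom in or out by up to 10%
    horizontal_flip=True      # randomly mirror the images
)
# augmenter.flow_from_directory(...) would then replace the plain dataset loading above.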
Optimizer Adam
For every model variation, we opted to use the same optimizer: Adam.
The Adam optimizer was introduced in a paper presented at the International Conference on Learning Representations (ICLR) in San Diego in 2015.
In brief, here is why we chose Adam as the optimizer for our models:
- Adam is often the default choice due to its strong performance, and it is well regarded in machine learning circles for its efficiency across many types of neural network tasks.
- It combines ideas from two other well-known optimizers, Momentum and RMSprop, inheriting the strengths of both approaches (see the sketch right after this list).
- Momentum helps accelerate the gradient in the right direction, aiding faster convergence, while RMSprop adjusts the learning rate for each parameter. This dual approach enables a more nuanced and effective optimization process.
- In practice, this means Adam can adjust how quickly or slowly each weight in the network is updated, which makes it adaptable to different learning scenarios and datasets.
- It is efficient in terms of memory and computation, requiring relatively little storage, which makes it a practical choice for large-scale problems and sizeable datasets in neural network applications.
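To make the combination concrete, here is a minimal NumPy sketch of a single Adam update step (not part of the project code); m is the momentum term and v is the RMSprop-style running average of squared gradients, following the notation of the original paper.

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum: running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSprop: running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive update
    return w, m, v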
Early Stop
Training Convolutional Neural Networks can be a time-intensive process. Given that our article is focused on educational purposes rather than achieving the highest possible accuracy, we opted not to train the model for an extensive number of epochs.
To manage this, we implemented what is known as ‘Early Stopping’. This callback continuously monitors a specific metric, in our case the validation loss, and halts training once that metric stops improving. With min_delta=0.01 and patience=1, training stops as soon as the validation loss fails to improve by at least 0.01 from one epoch to the next.

import keras

callback = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.01, patience=1)
You can read more about how Keras handles and implements Early Stopping at https://keras.io/api/callbacks/early_stopping/
Code Implementation
In this project, our goal is to compare two types of neural networks: a custom-built CNN and the well-known VGG16 model.
We’ve set up a list to store the definitions of our models. The custom CNN we’ve designed includes three convolutional layers. Notably, the third convolutional layer contains 64 filters, double the number in the first two layers. This design choice reflects our expectation that deeper layers capture more detailed, or ‘specific’, features from the images. Each convolutional layer is followed by a max-pooling layer. Once the data has passed through these layers, we flatten the feature maps. Flattening transforms the processed data into a format suitable for the final stage of our network: a dense layer with a softmax activation function and 6 neurons, one for each of the 6 distinct classes in our classification task.
Additionally, we are utilizing the VGG16 network as a benchmark, with its pretrained convolutional base frozen. As with our custom model, the output of VGG16 goes through a flattening layer, followed by a dense layer.
Let’s now evaluate and compare the performance of these two networks!
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import Adam

neural_networks = []

# Custom CNN: three convolution + max-pooling blocks followed by a softmax classifier
neural_networks.append(
    {
        'description': 'Conv2D_MaxPooling32_Conv2D_MaxPooling32_Conv2D_MaxPooling64',
        'model': Sequential([
            Convolution2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(150, 150, 3)),
            MaxPooling2D(pool_size=(2, 2)),
            Convolution2D(filters=32, kernel_size=(3, 3), activation='relu'),
            MaxPooling2D(pool_size=(2, 2)),
            Convolution2D(filters=64, kernel_size=(3, 3), activation='relu'),
            MaxPooling2D(pool_size=(2, 2)),
            Flatten(),
            Dense(units=6, activation='softmax')
        ]),
        'epochs': 20,
        'optimizer': Adam(learning_rate=0.0001),
        'loss_function': 'categorical_crossentropy',
        'accuracy': None,
        'loss': None,
        'training_time': None,
        'history': None
    }
)
# Appending VGG16 for benchmark: the pretrained convolutional base is frozen,
# so only the final dense classifier is trained
from keras.applications.vgg16 import VGG16

vgg16 = VGG16(include_top=False, input_shape=(150, 150, 3))
vgg16.trainable = False

neural_networks.append(
    {
        'description': 'VGG16',
        'model': Sequential([
            vgg16,
            Flatten(),
            Dense(units=6, activation='softmax')
        ]),
        'epochs': 20,
        'optimizer': Adam(learning_rate=0.0001),
        'loss_function': 'categorical_crossentropy',
        'accuracy': None,
        'loss': None,
        'training_time': None,
        'history': None
    }
)
# Compile every model with its optimizer and loss function
for nn in neural_networks:
    model = nn['model']
    model.compile(optimizer=nn['optimizer'], loss=nn['loss_function'], metrics=['accuracy'])

# Load previously trained weights if they exist; otherwise train the model and save them
for nn in neural_networks:
    model = nn['model']
    try:
        model.load_weights('models_80_20/' + nn['description'] + '.h5')
        print('Loaded model weights from disk')
    except Exception:
        print('Weights not found. Training model...')
        nn['history'] = model.fit(
            train_dataset,
            epochs=nn['epochs'],
            verbose=2,
            validation_data=validation_dataset,
            callbacks=[callback]  # Early Stopping defined earlier
        )
        model.save('models_80_20/' + nn['description'] + '.h5')
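With both models trained, we can measure their accuracy on the held-out test set and store the numbers back into the dictionaries. This is a minimal sketch; it assumes test_dataset was loaded as in the Data Exploration section.

for nn in neural_networks:
    # evaluate returns the loss and the metrics declared at compile time (here, accuracy)
    loss, accuracy = nn['model'].evaluate(test_dataset, verbose=0)
    nn['loss'] = loss
    nn['accuracy'] = accuracy
    print(f"{nn['description']}: accuracy={accuracy:.4f}, loss={loss:.4f}")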
Results
Confusion Matrix
We have created a confusion matrix to understand where our model is making mistakes. The image below shows this matrix. It helps us see that our model has trouble telling the difference between ‘buildings’ and ‘streets’, as well as between ‘glaciers’ and ‘seas’, and ‘glaciers’ and ‘mountains’. But it’s important to remember that some images are hard to classify, even for people. We have included a few of these tricky images side by side with the confusion matrix for you to see.
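For reference, here is a sketch of how such a confusion matrix could be computed for the custom model with scikit-learn; the class names are an assumption, following the alphabetical folder order used by image_dataset_from_directory.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

model = neural_networks[0]['model']   # the custom CNN
y_true, y_pred = [], []
for images, labels in test_dataset:
    preds = model.predict(images, verbose=0)
    y_true.extend(np.argmax(labels.numpy(), axis=1))  # one-hot labels back to class indices
    y_pred.extend(np.argmax(preds, axis=1))

class_names = ['buildings', 'forest', 'glacier', 'mountain', 'sea', 'street']  # assumed folder order
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=class_names).plot()
plt.show()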
Comparison between models
Our custom network’s training halted after 7 epochs, triggered by the Early Stopping setting we implemented. Consequently, to maintain consistency, we also trained the VGG16 model for the same duration, 7 epochs. For both networks, we divided the data into an 80/20 split for training and validation purposes.
The table below shows that the VGG16 model achieved an accuracy that is 9.4% higher than our custom network. This suggests there’s considerable potential for further improvement, possibly through enhanced pre-processing and extending the number of training epochs.
Note: I have experimented with various configurations for the custom network, adjusting the number of layers, filters, and kernel sizes. For the sake of brevity in this article, I have only included the configuration that yielded the best results.
Conclusion
I wrote this article as part of my journey of learning and exploring Convolutional Neural Networks (CNNs).
I hope that it will assist others who are navigating through the same process!
Additionally, to aid in understanding, I’ve created a Concept Map that encapsulates all the topics discussed in this article. This should be particularly beneficial for visual learners, providing a clear and structured overview of the key concepts.
https://www.cmaps.io/maps/72ba9725-9895-424a-8189-106590aa393e#