Convolutional Neural Networks (CNNs)

Understanding the foundations and impact of deep learning for visual data

Early History

The concept of the Convolutional Neural Network (CNN) dates back to the 1980s and 1990s, drawing inspiration from the structure of the visual cortex in animals. Early work by Kunihiko Fukushima introduced the Neocognitron in 1980, a hierarchical, multilayered artificial neural network for pattern recognition. However, it was Yann LeCun and colleagues who, in the late 1980s and early 1990s, established CNNs in their modern form with the LeNet architecture, trained end-to-end by backpropagation for handwritten digit recognition.

ImageNet Example:
Imagine a giant photo album with millions of pictures and thousands of categories, like "cat", "dog", "car", or "banana". ImageNet is this enormous dataset. To test a computer's vision, you ask it to label each picture. For example, the computer sees a photo of a dog and must say, "this is a dog."
Diagram: Computer sorts different images into the right categories.
How: ImageNet gives computers lots of labeled images to practice on, helping them get really good at "seeing" and understanding pictures.
GPU vs CPU Example:
Imagine you want to solve a huge math worksheet. A CPU is like a very smart person working quickly on each problem, one by one. A GPU is like an army of helpers, each working on a problem at the same time. For deep learning, this means GPUs can process many calculations in parallel, making them much faster for training neural networks.
Diagram: CPU (one worker on many tasks) vs GPU (many workers at once).
Difference: CPUs are fast on single tasks, but GPUs have many more cores, so they handle lots of tasks at once — perfect for deep learning!

For many years, CNNs were limited by computational resources and small datasets. The resurgence of CNNs in the 2010s was driven by the advent of large datasets (like ImageNet) and powerful GPUs.

Design of the Network

A Convolutional Neural Network is a type of deep neural network designed to process data with grid-like topology, such as images. The typical architecture includes:

  1. Convolutional Layers:
    What is a filter?
    A filter is like a small window that slides over an image, looking for certain patterns. For example, a filter could look for vertical edges.
    Diagram: A filter (right) scanning a small part of an image (left).
    What is convolution operation?
    Convolution means "multiplying the filter with the part of the image it's on and summing up the result." This gives a number for each spot, showing how well the filter matches that part.
    Diagram: Filter slides and creates a new image showing matches.
    What are local features?
    Local features are things like edges or corners seen in small parts of an image. Filters help find these local details.
    Diagram: Local edge detected by a filter.
  2. Activation Functions:
    What is an activation function?
    Imagine a filter sliding over an image: after calculating numbers for each spot, the activation function decides which numbers should "pass through" to the next layer. It's like a gate that only lets important signals go on.
    How does ReLU work?
    ReLU stands for Rectified Linear Unit. It turns every negative number into zero and leaves positive numbers unchanged.
    Graph: ReLU = max(0, x), mapping input values before activation to output values after activation.
    Why? This helps the network ignore unimportant (negative) signals and focus on useful patterns.
  3. Pooling Layers:
    Easy Example: What does pooling do?
    Pooling is like taking a big picture and shrinking it, keeping only the most important parts. In max pooling, for each small block (like 2×2 squares), you keep only the largest number.
    Diagram: Max pooling keeps only the biggest value in each block.
    How does max pooling work?
    Imagine a 2×2 grid of numbers:
    [3, 1]
    [2, 4]
    The largest value in this 2×2 square is 4, so after max pooling only 4 is kept for this block. The result is a smaller grid that still holds the most important features.
  4. Fully Connected Layers:
    Easy Example: How do fully connected layers work?
    After convolution and pooling, you have a small grid of numbers (features). To make a decision (like "Is this a cat or a dog?"), the CNN turns this grid into a single long list (a flat vector) and connects every number to every output choice.
    Diagram: The grid is flattened to a line and each connects to outputs.
    How does flattening work?
    Imagine a 2×2 grid:
    [1, 4]
    [2, 5]
    After flattening row by row, it becomes a list: [1, 4, 2, 5].
    Then, each number in this list is connected to every output neuron (for example: "cat", "dog", "car").
  5. Output Layer: Generates final predictions, often using softmax for classification.

The design allows CNNs to learn hierarchical representations: lower layers capture simple patterns, while deeper layers detect complex objects and shapes.
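The five layer types above can be sketched in plain NumPy on a toy image (illustrative values only — a real CNN learns its filter weights during training rather than using hand-picked ones):

```python
import numpy as np

# Toy 4x4 "image": bright on the left, dark on the right (a vertical edge).
image = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
], dtype=float)

# Hand-picked 2x2 filter that responds to vertical edges (bright-to-dark).
kernel = np.array([
    [1, -1],
    [1, -1],
], dtype=float)

def convolve2d(img, k):
    """Slide the filter over the image; multiply elementwise and sum at each spot."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(0, x)  # negatives become zero, positives pass through

def max_pool2x2(x):
    h, w = x.shape[0] - x.shape[0] % 2, x.shape[1] - x.shape[1] % 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

feature_map = convolve2d(image, kernel)  # strongest response where the edge is
activated = relu(feature_map)            # gate: only positive responses pass
pooled = max_pool2x2(activated)          # shrink, keeping the biggest values
flat = activated.flatten()               # 3x3 feature map -> 9-element vector

print(feature_map[0])  # [0. 2. 0.] — the filter fires on the edge column
print(pooled)          # [[2.]]
print(softmax(np.array([2.0, 1.0, 0.1])))  # class probabilities summing to 1
```

In a full network, the flattened vector would feed a fully connected layer whose outputs go through softmax; here softmax is shown on made-up scores to keep the sketch short.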

Example & How to Use CNNs for Image Study

Example: Handwritten digit recognition (MNIST)

Diagram: A hand-drawn digit image (simulated).
  1. Prepare the dataset:
    How do we preprocess images?
    Before feeding images to the CNN, we often need to:
    • Resize them to the same shape (like 28×28 pixels),
    • Normalize the pixel values (e.g., from 0–255 to 0–1),
    • (Sometimes) turn colored images into grayscale or do other cleaning steps.
    Diagram: Resize and normalize images before training.
    Easy Example:
    • Resize: Imagine you have a big photo (100×100 pixels) and a small photo (30×30 pixels). To make them the same size for the CNN, you shrink or stretch each one to, say, 28×28 pixels.
      This is like resizing all pictures in your album to fit the same frame.
    • Normalize: Suppose the original pixel values are numbers from 0 (black) to 255 (white). Normalization means dividing all values by 255, so they are now between 0 and 1.
      For example, pixel value 128 becomes 128/255 ≈ 0.5.
  2. Define the CNN: Specify layers (e.g., convolution, pooling, fully connected).
    Diagram: The CNN processes the image through layers step by step.
    Easy Example:
    Think of a CNN as a series of steps:
    1. The image goes in,
    2. it passes through convolutional layers (find features),
    3. then through pooling layers (shrink the data),
    4. then through fully connected layers (make decisions),
    5. and finally out comes the prediction (like "3" or "cat").
  3. Train the network: Use labeled data to optimize weights via backpropagation.
    Diagram: The network learns by comparing its guess to the correct answer and adjusting itself!
    Easy Example:
    Imagine the CNN sees an image of a "3", but it guesses "5". The network checks the true answer ("3"), sees it's wrong, and tweaks its internal settings (weights) to do better next time. This is called optimizing weights and is done using a process called backpropagation.
  4. Evaluate: Test on new images to assess accuracy.
    How to get new images for testing?
    • Use a test set: Most datasets are split into a training set (to learn from) and a test set (to check the network). For MNIST, there are 10,000 test digit images never shown during training.
    • Draw your own: You can create your own images (like drawing a digit in a paint app) and run them through the trained CNN.
    • Find online: Download sample images from the internet (make sure they match your training format, e.g. 28×28 grayscale for MNIST).
    • Camera or phone: Take a photo, crop it, preprocess to the right size and format, and test it with the CNN.
    Diagram: New images (from test set, camera, or hand-drawn) go into the CNN for evaluation.
    Tip: The key is to use images the network has never seen before to check if it's truly learned.
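The guess-check-adjust loop in step 3 can be seen in miniature with gradient descent on a single weight (a toy sketch, not full backpropagation through a CNN; the numbers are made up for illustration):

```python
# A toy "network" with one weight: prediction = w * x.
# We want it to learn that the right answer for x = 1.0 is y = 3.0.
w = 5.0                 # initial guess: the network starts out wrong
x, y_true = 1.0, 3.0
learning_rate = 0.1

for step in range(100):
    y_pred = w * x                     # forward pass: the network's guess
    loss = (y_pred - y_true) ** 2      # squared error: how wrong the guess is
    grad = 2 * (y_pred - y_true) * x   # gradient of the loss w.r.t. the weight
    w -= learning_rate * grad          # adjust the weight to reduce the error

print(round(w, 3))  # w has moved from 5.0 to (approximately) 3.0
```

A real CNN repeats exactly this idea, but for millions of weights at once, with backpropagation computing each weight's gradient through all the layers.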

Sample Python Code (using Keras):


# Import TensorFlow and Keras layers/models
import tensorflow as tf
from tensorflow.keras import layers, models

# Load the MNIST dataset (60,000 training and 10,000 test images, 28x28 grayscale)
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

# Preprocessing: normalize pixel values and add a channel dimension
train_images = train_images / 255.0                # Normalize pixel values from 0-255 to 0-1
train_images = train_images.reshape(-1, 28, 28, 1) # Add a channel axis: (N, 28, 28) -> (N, 28, 28, 1)

# Build the CNN model step by step
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)), # First convolutional layer: 32 filters, 3x3 size, ReLU
    layers.MaxPooling2D((2, 2)),                                           # Pooling: reduces size by taking max in 2x2 blocks
    layers.Conv2D(64, (3, 3), activation='relu'),                          # Second convolutional layer: 64 filters, 3x3, ReLU
    layers.MaxPooling2D((2, 2)),                                           # Second pooling layer
    layers.Flatten(),                                                      # Flatten feature maps into a 1D vector
    layers.Dense(64, activation='relu'),                                   # Fully connected layer with 64 neurons and ReLU
    layers.Dense(10, activation='softmax')                                 # Output layer: 10 neurons (digits 0-9), softmax for probabilities
])

# Choose optimizer, loss function, and metric
model.compile(
    optimizer='adam',                                # Use the Adam optimizer for training
    loss='sparse_categorical_crossentropy',          # For multi-class classification with integer labels
    metrics=['accuracy']                             # Track accuracy during training
)

# Train the model on the training data for 5 epochs
model.fit(train_images, train_labels, epochs=5)

# Evaluate on the held-out test images after training
test_images = test_images / 255.0
test_images = test_images.reshape(-1, 28, 28, 1)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print("Test accuracy:", test_acc)

CNNs are widely used in applications such as facial recognition, medical imaging, object detection, and more.

Major Contributors

About the 2018 Turing Award:
The Turing Award is often called the "Nobel Prize of Computing." In 2018, it was awarded to Yann LeCun, Geoffrey Hinton, and Yoshua Bengio for their breakthroughs in deep learning. Their work made it possible for computers to learn from vast amounts of data, especially using neural networks. This award recognized how their research transformed artificial intelligence, making technologies like speech recognition, computer vision, and language translation possible in everyday life.
About AlexNet and the ImageNet Competition:
AlexNet is a deep convolutional neural network created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. In 2012, AlexNet competed in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a competition in which computers must correctly classify objects in over a million images across 1,000 categories.
  • Why was AlexNet special? It used deep CNNs, the ReLU activation function, data augmentation, and ran efficiently on GPUs for the first time at this scale.
  • Impact: AlexNet achieved an error rate much lower than any previous model (by more than 10 percentage points) and demonstrated that deep learning could outperform traditional computer vision methods.
  • Why did AlexNet's win matter? AlexNet’s victory showed the world the practical power of deep neural networks and started a revolution in AI research and industry. Its success led to rapid advances in image recognition, speech recognition, and more.
In short: AlexNet and the 2018 Turing Award together mark the turning point when deep learning became the foundation of modern artificial intelligence.

References

  1. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
  2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
  3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25, 1097-1105.
  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  5. Stanford CS231n: Convolutional Neural Networks for Visual Recognition (course notes), Stanford University.