Convolutional Neural Networks

Build image classifiers and visualize feature detection

What are Convolutional Neural Networks?

CNNs are specialized neural networks for processing grid-like data such as images. They use convolution operations to automatically learn hierarchical features.

💡 Why CNNs Revolutionized Computer Vision

🔍

Local Connectivity

Each neuron looks at a small region, detecting local patterns

🔄

Parameter Sharing

Same filter applied everywhere - learns once, uses everywhere

📊

Translation Invariance

Recognizes features regardless of position in image

The Convolution Operation: Mathematical Foundation

🧮 How Convolution Actually Works

📊 The Sliding Window Process

Convolution performs element-wise multiplication and summation between a small kernel (filter) and overlapping regions of the input. Think of it as a sliding dot product.

Step-by-Step Example:

1. Position kernel at top-left of input image

2. Multiply each kernel value with corresponding input pixel

3. Sum all products to get single output value

4. Slide kernel by stride amount (typically 1 or 2 pixels)

5. Repeat until entire input is covered

Concrete Example (3×3 kernel on 5×5 input):

Input region:

[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]

Kernel (vertical edge detector):

[[-1, 0, 1],
 [-1, 0, 1],
 [-1, 0, 1]]

Output = (1×-1)+(2×0)+(3×1)+(4×-1)+(5×0)+(6×1)+(7×-1)+(8×0)+(9×1) = 6
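The sliding-window steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: it assumes a square kernel, no padding, and nested lists as images. The function name `conv2d` and the 5×5 test image are made up for this example. Like most deep learning libraries, it computes cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel, stride=1):
    """Valid (no padding) sliding dot product of a square kernel over a 2D image."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply each kernel value with the pixel under it, then sum.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


region = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
single = conv2d(region, edge_kernel)  # one kernel position, so a 1x1 output

# Hypothetical 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9] for _ in range(5)]
feature_map = conv2d(image, edge_kernel)  # 3x3 feature map

print(single)
print(feature_map)
```

For the 3×3 region and kernel shown above, the single output value is 6. On the 5×5 image the response is zero over the flat dark area and 27 wherever the window straddles the dark-to-bright boundary.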

📐 Critical Parameters

Kernel Size (k):

• 1×1: Pointwise, changes channels only

• 3×3: Most common, good efficiency

• 5×5, 7×7: Larger receptive field, more parameters

Modern trend: Stack multiple 3×3 instead of one 5×5 (fewer params, same receptive field)

Stride (s):

• How many pixels kernel moves each step

• Stride=1: Dense feature maps, more computation

• Stride=2: Halves spatial dimensions, replaces pooling

Larger stride = faster but loses information

Padding (p):

• Zeros added around input border

• Valid (p=0): Output shrinks

• Same (p=(k-1)/2): Output size = input size

Padding preserves spatial dimensions and edge information

Output Size Formula:

Output = ⌊(Input + 2×Padding - Kernel) / Stride⌋ + 1

Example 1: Input=28, Kernel=5, Stride=1, Padding=0 → Output=⌊(28-5)/1⌋+1 = 24

Example 2: Input=28, Kernel=3, Stride=1, Padding=1 → Output=⌊(28+2-3)/1⌋+1 = 28 (same!)

Example 3: Input=224, Kernel=7, Stride=2, Padding=3 → Output=⌊(224+6-7)/2⌋+1 = 112
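The output size formula can be checked with a few lines of Python. The helper name `conv_output_size` is an assumption for illustration. Integer division `//` plays the role of the floor.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Output = floor((Input + 2*Padding - Kernel) / Stride) + 1"""
    return (input_size + 2 * padding - kernel) // stride + 1


example_1 = conv_output_size(28, kernel=5, stride=1, padding=0)
example_2 = conv_output_size(28, kernel=3, stride=1, padding=1)
example_3 = conv_output_size(224, kernel=7, stride=2, padding=3)

print(example_1, example_2, example_3)
```

The three calls reproduce Examples 1 to 3: 24, 28 and 112.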

✓ Why Convolution > Fully Connected

• Sparse connectivity: Each output connected to small input region (not all pixels)

• Parameter sharing: Same kernel applied everywhere (9 params for 3×3 vs millions)

• Translation equivariance: Output shifts when input shifts

Example: 28×28 image → FC layer needs 784 × hidden_dim params. Conv 3×3 → only 9 × in_channels × out_channels, independent of image size!

🎯 Real-World Computational Cost

Conv layer: 32 filters, 3×3 kernel, 224×224×3 input (stride 1, same padding)

Operations = (224×224) × (3×3×3) × 32

= 43.4 million multiplications

Parameters: (3×3×3 input channels + 1 bias) × 32 = 896

Contrast with FC: mapping the same 224×224×3 input to just 32 units needs 150,528 × 32 ≈ 4.8M parameters!
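A small sketch for reproducing this kind of cost estimate. The helper names are illustrative, and the multiplication count assumes stride 1 with same padding, so the output is as large as the input. Note how the count scales with the number of input channels.

```python
def conv_params(kernel, in_channels, out_channels, bias=True):
    """(kernel*kernel*in_channels + bias) weights per filter, times the filter count."""
    return (kernel * kernel * in_channels + (1 if bias else 0)) * out_channels


def conv_multiplications(out_h, out_w, kernel, in_channels, out_channels):
    """One multiplication per kernel weight, per output position, per filter."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels


def dense_weights(num_inputs, num_outputs):
    """A fully connected layer has one weight per (input, output) pair."""
    return num_inputs * num_outputs


params = conv_params(3, in_channels=3, out_channels=32)
mults_gray = conv_multiplications(224, 224, 3, in_channels=1, out_channels=32)
mults_rgb = conv_multiplications(224, 224, 3, in_channels=3, out_channels=32)
fc_weights = dense_weights(224 * 224 * 3, 32)

print(params, mults_gray, mults_rgb, fc_weights)
```

With one input channel the layer needs about 14.5 million multiplications. With the three RGB channels assumed by the parameter count (896) it needs about 43.4 million.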

1. Convolution Operation

🎯 Interactive: Sliding Filter Window

Convolution slides a filter (kernel) across the input, computing dot products to create a feature map.

Input Size: 28×28, Kernel: 3×3, Stride: 1, Padding: 0

Output Feature Map Size: 26×26

Output = ⌊(28 + 2×0 - 3) / 1⌋ + 1 = 26

💡 Formula: Output Size = ⌊(Input + 2×Padding - Kernel) / Stride⌋ + 1

2. Convolutional Filters

🔍 Interactive: Common Filter Types

Edge Detection

Detects horizontal edges by finding intensity gradients

Effect on Image

Highlights boundaries where pixel values change rapidly. Essential for detecting shapes and objects.

Pooling: Spatial Downsampling and Invariance

📉 Why Downsample Feature Maps?

🎯 Four Key Benefits of Pooling

1. Dimensionality Reduction:

• Reduces spatial dimensions by factor of pool size (typically 2×2 → 75% reduction)

• Example: 224×224 feature map → 112×112 after 2×2 max pooling

• Computational savings: Next layer processes 4× fewer pixels!

Memory: 224² = 50,176 → 112² = 12,544 (75% reduction)

2. Translation Invariance:

• Small shifts in input don't change pooled output (object moves 1 pixel → same max value)

• Makes network robust to minor variations in object position

• Critical for classification: "cat" detected regardless of exact pixel coordinates

3. Receptive Field Expansion:

• Each neuron in next layer "sees" larger area of original image

• Example: Layer 1 (3×3 conv) sees 3×3 region. After 2×2 pooling, Layer 2 (3×3 conv) sees 8×8 region!

Without pooling:

3 conv layers (3×3) → receptive field = 7×7

With 2×2 pooling between the convs:

3 conv layers + 2 pooling → receptive field = 18×18

4. Overfitting Prevention:

• Aggregates information (summarization acts as regularization)

• Forces network to learn robust features, not memorize exact pixel positions

• Similar effect to dropout but deterministic

⚖️ Max Pooling vs Average Pooling

Max Pooling (Most Common):

Input 2×2 region:

[1, 3]

[2, 4]

Output: max(1,2,3,4) = 4

Why max?

• Captures strongest activations (most prominent features)

• If edge detector fires strongly anywhere in region → preserved

• Dominant paradigm: AlexNet, VGG, ResNet all use max pooling

• Sparse gradients during backpropagation: only the position that held the max value receives gradient

Average Pooling:

Input 2×2 region:

[1, 3]

[2, 4]

Output: avg(1,2,3,4) = 2.5

When to use average?

• Smoother downsampling (all values contribute equally)

• Often used before final classification layer (Global Average Pooling in ResNet)

• Better for preserving background/context information

• Less aggressive than max (retains more information about distribution)
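Both pooling variants can be sketched with one function. This is a minimal version assuming non-overlapping windows (stride equal to pool size) and nested lists as feature maps. `pool2d` and the 4×4 example values are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: reduce each size x size window to one value."""
    if mode == "max":
        reduce_window = max
    elif mode == "avg":
        reduce_window = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'avg'")

    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


region = [[1, 3], [2, 4]]
max_out = pool2d(region, mode="max")
avg_out = pool2d(region, mode="avg")

# Hypothetical 4x4 feature map with one strong activation (8).
fmap = [[1, 3, 0, 0],
        [2, 4, 0, 8],
        [5, 5, 1, 1],
        [5, 5, 1, 1]]
print(max_out, avg_out)
print(pool2d(fmap, mode="max"), pool2d(fmap, mode="avg"))
```

In the 4×4 example, the isolated activation of 8 survives max pooling unchanged but is diluted to 2.0 by average pooling, which is the trade-off described above.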

Modern Alternative: Strided Convolutions

• Some architectures (ResNet variants) use stride=2 convolutions instead of pooling

• Advantage: Learnable downsampling (network learns best way to reduce dimensions)

• Disadvantage: More parameters, more computation

Trend: Hybrid approach - pooling in early layers, strided conv in later layers

📊 Typical CNN Pattern

Input: 224×224×3 (RGB image)

→ Conv 64 filters → 224×224×64

→ MaxPool 2×2 → 112×112×64

→ Conv 128 filters → 112×112×128

→ MaxPool 2×2 → 56×56×128

→ Conv 256 filters → 56×56×256

→ MaxPool 2×2 → 28×28×256

Pattern: Spatial dimensions ↓, Channel depth ↑
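The shape bookkeeping in this pattern can be traced programmatically. The sketch below assumes every conv uses stride 1 with same padding (so only the channel count changes) and every pool uses stride equal to its size. `trace_shapes` and the layer tuples are illustrative names.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer."""
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for kind, value in layers:
        if kind == "conv":    # value = number of filters
            c = value
        elif kind == "pool":  # value = pool size
            h, w = h // value, w // value
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((h, w, c))
    return shapes


layers = [("conv", 64), ("pool", 2),
          ("conv", 128), ("pool", 2),
          ("conv", 256), ("pool", 2)]
shapes = trace_shapes((224, 224, 3), layers)

for shape in shapes:
    print("×".join(str(d) for d in shape))
```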

3. Pooling Layers

📉 Interactive: Downsampling Operations

Pool Size: 2×2

Input (8×8)

Output (4×4)

🔝 Max Pooling

Takes maximum value from each region. Captures strongest activations, preserves most prominent features.

💡 Purpose: Reduce spatial dimensions, increase receptive field, add translation invariance, reduce computation.

4. Feature Map Depth

📚 Interactive: Multiple Filters Learning

Number of Filters: 32

Feature Map 1

Learned Pattern

🔴 Detects vertical edges and lines

Activation Strength

Layer 1:

Layer 2:

Layer 3:

💡 Key Insight: Each filter learns a different feature. More filters = more diverse patterns detected. Typical: 32→64→128→256 filters in deeper layers.

Receptive Fields: The Hierarchical Vision Mechanism

👁️ What Each Neuron "Sees" in the Original Image

🔭 Definition: Receptive Field

The receptive field of a neuron is the region in the input image that can influence that neuron's activation. As you go deeper in the network, receptive fields grow: linearly with stacked stride-1 convolutions, and much faster once pooling or strided layers are added. This lets deeper neurons "see" and integrate information from larger areas.

Biological Inspiration:

• Hubel & Wiesel (1962): Discovered hierarchical organization in cat visual cortex

• V1 neurons: Respond to simple edges, small receptive fields (1-2°)

• V2 neurons: Respond to corners, textures, medium receptive fields (5-10°)

• V4/IT neurons: Respond to complex shapes/objects, large receptive fields (20-50°)

CNNs mirror this hierarchical structure!

📐 Calculating Receptive Field Size

General Formula (with stride 1 everywhere, each layer simply adds kernel_size - 1):

RF_layer = RF_prev + (kernel_size - 1) × product_of_previous_strides

Example: Stack of 3×3 convolutions

• Layer 1: RF = 3×3 (sees 9 pixels)

• Layer 2: RF = 3 + (3-1)×1 = 5×5 (sees 25 pixels)

• Layer 3: RF = 5 + (3-1)×1 = 7×7 (sees 49 pixels)

• Layer 4: RF = 7 + (3-1)×1 = 9×9 (sees 81 pixels)

With Pooling (2×2 max pool, stride=2):

Pooling doubles the rate of receptive field growth!

• Layer 1 (Conv 3×3): RF = 3×3

• MaxPool 2×2: RF = 4×4 (stride doubles effective coverage)

• Layer 2 (Conv 3×3): RF = 4 + (3-1)×2 = 8×8

• MaxPool 2×2: RF = 10×10

• Layer 3 (Conv 3×3): RF = 10 + (3-1)×4 = 18×18

Each pooling layer multiplies stride effect, growing RF exponentially
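The receptive field formula above can be applied layer by layer in a short loop. This sketch treats pooling as just another layer with a kernel size and a stride. The function name `receptive_field` and the `(kernel, stride)` tuple format are assumptions for illustration.

```python
def receptive_field(layers):
    """Apply RF = RF_prev + (kernel - 1) * product_of_previous_strides per layer.

    `layers` is a list of (kernel_size, stride) tuples, input side first.
    Returns the receptive field size after each layer.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # product of the strides of all previous layers
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


conv, pool = (3, 1), (2, 2)
convs_only = receptive_field([conv, conv, conv, conv])
with_pooling = receptive_field([conv, pool, conv, pool, conv])

print(convs_only)
print(with_pooling)
```

The first call reproduces the 3, 5, 7, 9 progression for stacked 3×3 convolutions. The second reproduces 3, 4, 8, 10, 18 for the conv/pool sequence.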

Why Larger Kernels (5×5, 7×7) are Rare:

• VGG insight: Two 3×3 convs have same receptive field as one 5×5

→ 3×3 + 3×3 = 5×5 RF, but 18 params vs 25 params

• Three 3×3 convs = one 7×7 conv (27 params vs 49 params, 45% savings!)

Bonus: More non-linearity (ReLU after each conv) improves learning
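The parameter comparison is easy to verify. The sketch counts weights only (no biases) and assumes every layer keeps the same number of channels. `stacked_conv_weights` is a hypothetical helper, and 64 channels is an arbitrary example value.

```python
def stacked_conv_weights(kernel, num_layers, channels=1):
    """Weights in a stack of kernel x kernel convs with `channels` in and out."""
    return num_layers * kernel * kernel * channels * channels


two_3x3 = stacked_conv_weights(3, num_layers=2)    # same RF as one 5x5
one_5x5 = stacked_conv_weights(5, num_layers=1)
three_3x3 = stacked_conv_weights(3, num_layers=3)  # same RF as one 7x7
one_7x7 = stacked_conv_weights(7, num_layers=1)
savings = 1 - three_3x3 / one_7x7

print(two_3x3, one_5x5, three_3x3, one_7x7, f"{savings:.0%}")

# The ratio is unchanged with real channel counts, e.g. 64 channels:
print(stacked_conv_weights(3, 3, channels=64), stacked_conv_weights(7, 1, channels=64))
```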

🧠 Hierarchical Feature Learning

Early Layers (RF: 3×3 to 15×15):

• Detect: Edges, colors, gradients

• Examples: Vertical line, horizontal line, diagonal edge

• Transferable: Same edge detectors work for cats, cars, buildings

Low-level features, highly reusable

Middle Layers (RF: 20×20 to 80×80):

• Detect: Textures, parts, patterns

• Examples: Eye, wheel, window, fur texture

• Domain-specific: Combine edges into meaningful parts

Mid-level features, moderately reusable

Deep Layers (RF: 100×100 to entire image):

• Detect: Complete objects, scenes

• Examples: Face, car, dog, building

• Task-specific: Optimized for classification/detection task

High-level features, less transferable

Visualization Studies (Zeiler & Fergus, 2014):

Deconvolutional networks revealed what CNNs actually learn:

• Layer 1: Gabor-like edge filters at various orientations

• Layer 2: Color blobs, corner detectors, simple textures

• Layer 3: Mesh patterns, text, complex textures

• Layer 4: Dog faces, bird legs, bicycle parts

• Layer 5: Full objects in various poses

🎯 Design Principle

• Start with small receptive fields (3×3, 5×5)

• Gradually expand through depth and pooling

• Final layers should "see" entire object or significant portion

• Rule of thumb: the final conv layer's RF should be at least as large as the objects you want to recognize

ImageNet (224×224): Need RF ≥ 100×100 in final conv layers

⚠️ Common Pitfall

Shallow networks with large kernels:

• 7×7 conv → RF = 49 pixels ✓

• But: Fewer non-linearities, harder to learn complex functions

Better: Deep networks with small kernels:

• 3×3 → 3×3 → 3×3 → RF = 49 pixels ✓

• Bonus: 3 ReLUs, more expressive power

5. Receptive Field Growth

🔭 Interactive: What Each Layer "Sees"

Network Depth: 1 layer

Receptive Field Size: 3×3 pixels in original image

Early Layers

Detect simple features: edges, colors, textures

Middle Layers

Detect parts: eyes, wheels, windows

Deep Layers

Detect objects: faces, cars, buildings

CNN Architecture Evolution: From AlexNet to Modern Networks

🏗️ Milestones in CNN Design

📅 Historical Timeline

AlexNet (2012) - The Breakthrough:

• ImageNet winner: 15.3% top-5 error (vs about 26% for the best competing approaches), roughly a 40% relative reduction!

• Architecture: 5 conv layers + 3 FC layers, about 60M parameters

• Innovations: ReLU activation, dropout, GPU training (2 GPUs in parallel)

• Impact: Sparked deep learning revolution in computer vision

Key lesson: Deep networks + ReLU + data + GPU = breakthrough performance

VGG (2014) - Simplicity at Scale:

• Insight: Use only 3×3 filters throughout entire network

• VGG-16: 16 layers, 138M parameters. VGG-19: 19 layers, 144M params

• Pattern: Conv-Conv-Pool repeated, channels double after pooling (64→128→256→512)

• Problem: Huge parameter count (FC layers dominate), slow training

Key lesson: Uniformity and depth matter more than fancy filter designs

ResNet (2015) - Solving the Depth Problem:

• Problem solved: Very deep networks (50+ layers) were degrading, not improving

• Solution: Skip connections - F(x) + x instead of just F(x)

• ResNet-50: 50 layers, 25M params. ResNet-152: 152 layers still trainable!

• Impact: 3.6% ImageNet top-5 error. Enabled training of 1000+ layer networks

y = ReLU(F(x, W) + x) ← "x" is the identity shortcut

Key lesson: Let network learn residuals (difference) instead of full mapping
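A toy sketch of the residual formula on plain lists of floats. Here `residual_fn` stands in for F(x, W); in a real ResNet it would be a small stack of conv and batch-norm layers. All names are illustrative.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """y = ReLU(F(x) + x): the block only has to learn the correction F(x)."""
    fx = residual_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, -2.0, 3.0]

# If the learned residual is zero, the block passes x through (up to the ReLU).
identity_like = residual_block(x, lambda v: [0.0] * len(v))

# A non-zero residual nudges the input instead of replacing it.
nudged = residual_block(x, lambda v: [0.5 * vi for vi in v])

print(identity_like, nudged)
```

This illustrates why extra layers need not hurt: a block can fall back to something close to the identity by driving F(x) toward zero.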

🔧 Design Patterns and Trade-offs

Accuracy-Focused Architectures:

ResNet-152, DenseNet-264:

• Very deep (100-200+ layers)

• High parameter count (50-200M)

• ImageNet top-5: 4-5% error

• Training time: Days on 8 GPUs

Use when: Accuracy is paramount, computational budget unlimited

Efficiency-Focused Architectures:

MobileNet, EfficientNet, SqueezeNet:

• Shallow or narrow (10-50 layers)

• Low parameter count (1-10M)

• ImageNet top-5: 10-15% error

• Inference: Real-time on mobile/edge

Use when: Deploying to mobile/IoT, latency critical, limited compute

Architecture Decision Tree:

1. Small dataset (<10K images)? → Use pretrained model + transfer learning

2. Need real-time inference? → MobileNet, EfficientNet-B0

3. Accuracy is critical? → ResNet-50/101, EfficientNet-B7

4. Learning from scratch? → Start simple (VGG-style), add complexity if needed

5. Limited GPU memory? → Reduce batch size, use gradient checkpointing

🧩 Modern Building Blocks

Residual Block (ResNet):

x → Conv(3×3) → BatchNorm → ReLU → Conv(3×3) → BatchNorm → (+ x via skip) → ReLU → output

Gradients flow through skip connection

Inception Block (GoogLeNet):

↙ ↓ ↓ ↘

1×1 3×3 5×5 pool

↘ ↓ ↓ ↙

concatenate

Multi-scale feature extraction in parallel

Depthwise Separable (MobileNet):

x → DepthwiseConv(3×3) [per-channel] → PointwiseConv(1×1) [mix channels] → output

8-9× fewer params than a standard 3×3 conv when the channel count is large!
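The savings can be checked by counting weights (biases ignored). The helper names and the 128-channel example are assumptions for illustration.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Every output channel has its own kernel x kernel filter per input channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel, c_in, c_out):
    """Depthwise: one kernel x kernel filter per input channel.
    Pointwise: a 1x1 conv that mixes c_in channels into c_out channels."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_weights(3, 128, 128)
separable = separable_conv_weights(3, 128, 128)
ratio = standard / separable

print(standard, separable, round(ratio, 2))
```

For 3×3 kernels the ratio approaches 9 as the number of output channels grows, since the cost ratio is 1 / (1/c_out + 1/9).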

6. CNN Architectures

🏗️ Interactive: Famous Architectures

Simple CNN

~1.2M parameters

Conv 32

Pool

Conv 64

Pool

Dense 128

Output

🎯 Simple CNN

Basic architecture for learning CNNs. Good for MNIST, CIFAR-10. Fast training, easy to understand.

7. Parameter Calculation

📐 Interactive: Count Parameters

Input Image Size: 224×224

Number of Classes: 1000

Example Layer: Conv(32 filters, 3×3 kernel)

Kernel Size: 3×3

Input Channels: 3

Output Filters: 32

Calculation:

Params = (3×3×3 + 1) × 32 = 896

(kernel × input_channels + bias) × output_filters

Total Model Estimate: 1,058,816 parameters (simplified)

Memory Required: 4.0 MB @ 32-bit floats

8. Activation Functions

⚡ Interactive: Non-linearity

ReLU: f(x) = max(0, x)

Characteristics

Range: [0, ∞)

Pros: Fast, gradient does not saturate for positive inputs

Cons: Dying ReLU problem

Usage: Most common in CNNs
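A minimal sketch of ReLU and its gradient, to make the "dying ReLU" issue concrete. The gradient at exactly 0 is set to 0 here, which is a common convention rather than a mathematical necessity.

```python
def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]

print(outputs)
print(grads)
```

A neuron whose pre-activation is negative for every training example always gets a gradient of 0, so its weights stop updating. That is the dying ReLU problem.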

9. Data Augmentation

🎨 Interactive: Training Data Tricks

Original Image

🐱

Effect

Base training image without modifications. Start here.

Other Techniques:

Brightness, contrast, color jitter, cutout, mixup

💡 Why Augment? Artificially expand training data, prevent overfitting, improve generalization, make model robust to variations.
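A few of these augmentations can be sketched on a grayscale image stored as nested lists of 0-255 values. Real pipelines would use an image library. The function names and the tiny test image are illustrative only.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng):
    """Cut a random size x size patch; rng is a random.Random instance."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def adjust_brightness(image, factor):
    """Scale pixel values and clamp them to the valid 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
rng = random.Random(0)  # seeded so the crop is reproducible

flipped = horizontal_flip(image)
crop = random_crop(image, 2, rng)
brighter = adjust_brightness([[100, 200]], 1.5)

print(flipped, crop, brighter)
```

During training, such transforms are typically applied on the fly with fresh random choices each epoch, so the model rarely sees the exact same pixels twice.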

Transfer Learning: Standing on the Shoulders of Giants

🎓 Reusing Pre-trained Knowledge

💡 The Core Insight

Transfer learning leverages a model pre-trained on a massive dataset (typically ImageNet: 1.4M images, 1000 classes) and adapts it to your specific task. This is the default approach for almost all computer vision tasks in 2024.

Why Transfer Learning is Dominant:

• 10-100× faster training: Pre-trained features already capture edges, textures, shapes

• 10-100× less data needed: Works with 100-1000 images instead of 100K+

• Better accuracy: Often outperforms training from scratch even with more data

• Lower computational cost: Fine-tuning needs 1 GPU for hours, not 8 GPUs for weeks

When NOT to Use Transfer Learning:

• Domain is radically different: Medical X-rays, satellite imagery (but still helps!)

• You have 10M+ labeled images: Training from scratch might match/exceed transfer learning

• Academic research: Studying learning dynamics from initialization

Reality: Even for medical/satellite, transfer learning is starting point 90% of the time

🔧 Transfer Learning Strategies

Strategy 1: Feature Extraction (Frozen Base)

1. Load pretrained model (e.g., ResNet-50)

2. Freeze all convolutional layers

3. Replace final FC layer (1000 → your_classes)

4. Train only new FC layer

When to use:

• Small dataset (100-1,000 images)

• Similar to ImageNet (natural photos)

• Limited computational budget

Training time: Minutes to hours

Strategy 2: Fine-Tuning (Partially Frozen)

1. Load pretrained model

2. Freeze early layers (generic features)

3. Unfreeze later layers (task-specific)

4. Train with small learning rate

When to use:

• Medium dataset (1,000-100,000 images)

• Somewhat different from ImageNet

• Want to adapt high-level features

Training time: Hours to days

Strategy 3: Full Fine-Tuning (All Unfrozen)

• Unfreeze all layers, train entire network with low learning rate (1e-5 to 1e-4)

• When: Large dataset (100K+ images), domain shift from ImageNet

• Caution: Can overfit if dataset too small, may destroy pretrained features

Best practice: Start with frozen/partial, then full fine-tune if needed

Learning Rate Guidelines:

• New layers (randomly initialized): 1e-3 to 1e-2 (normal rate)

• Fine-tuning pretrained layers: 1e-5 to 1e-4 (10-100× smaller)

• Why smaller? Pretrained weights already near optimal, don't want large updates

Technique: Discriminative learning rates (different LR per layer group)
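One way to turn these guidelines into per-group learning rates is sketched below. It is framework-agnostic: the result is a plain dictionary that you would map onto your optimizer's parameter groups. The geometric spacing between groups and the group names are assumptions of this sketch, not a prescribed recipe.

```python
def discriminative_lrs(group_names, base_lr=1e-5, head_lr=1e-3):
    """Assign base_lr to the earliest layer group and head_lr to the last one,
    spacing the groups in between geometrically."""
    n = len(group_names)
    if n == 1:
        return {group_names[0]: head_lr}
    ratio = (head_lr / base_lr) ** (1 / (n - 1))
    return {name: base_lr * ratio ** i for i, name in enumerate(group_names)}


# Hypothetical layer groups, ordered from input side to output side.
groups = ["early_convs", "late_convs", "new_head"]
lrs = discriminative_lrs(groups)

for name, lr in lrs.items():
    print(f"{name}: {lr:.0e}")
```

With the defaults, the early pretrained layers train at 1e-5, the later pretrained layers at about 1e-4, and the randomly initialized head at 1e-3, matching the ranges listed above.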

📊 Practical Results Comparison

Example Task: Classify 20 dog breeds (2,000 images)

Train from Scratch (ResNet-50):

• Training time: 48 hours (8 GPUs)

• Validation accuracy: 62% (heavy overfitting)

• Problem: Not enough data to learn low-level features

Transfer Learning - Feature Extraction:

• Training time: 20 minutes (1 GPU)

• Validation accuracy: 78%

• Benefit: Pretrained features (edges, textures) already useful

Transfer Learning - Fine-Tuning:

• Training time: 2 hours (1 GPU)

• Validation accuracy: 89%

• Benefit: Adapted high-level features to dog-specific patterns

Feature extraction: 144× faster wall-clock (48 h → 20 min) and +16 points of accuracy. Fine-tuning: 24× faster and +27 points of accuracy! 🎉

🔍 Which Pretrained Model to Choose?

• ResNet-50: Default choice, excellent accuracy/speed balance

• EfficientNet-B0 to B7: Best accuracy per param, scalable

• MobileNetV2: Mobile/edge deployment, real-time inference

• Vision Transformer (ViT): Cutting edge, needs large datasets

Start with ResNet-50 or EfficientNet-B0, optimize later if needed

⚡ Quick Start Checklist

1. Load pretrained model (PyTorch, TensorFlow hubs)

2. Replace final layer: model.fc = nn.Linear(2048, num_classes) (2048 is ResNet-50's feature size)

3. Freeze base: for param in model.parameters(): param.requires_grad = False

4. Unfreeze FC: for param in model.fc.parameters(): param.requires_grad = True

5. Train with Adam, LR=1e-3, monitor validation

6. If accuracy plateaus, unfreeze more layers, reduce LR to 1e-5

10. Transfer Learning

🎓 Interactive: Pre-trained Models

🏗️

ResNet-50

Deep residual network with skip connections

ImageNet Accuracy

76.1%

Parameters

25.6M

Transfer Learning Strategy:

1. Load pre-trained weights (trained on ImageNet)

2. Freeze early layers (they learned general features)

3. Replace final classification layer for your task

4. Fine-tune on your dataset (faster, less data needed)

✓ Benefits: Train 10-100× faster, need 10-100× less data, achieve better accuracy. Start here for real projects!

🎯 Key Takeaways

🔍

Convolution Magic

Sliding filters detect local patterns. Parameter sharing means the same feature detector works everywhere, making CNNs efficient and translation-equivariant (pooling then adds a degree of invariance).

📚

Hierarchical Features

Early layers detect edges, middle layers detect parts (eyes, wheels), deep layers detect objects. Network learns feature hierarchy automatically.

📉

Pooling Reduces Size

Max/average pooling downsamples feature maps. Reduces computation, increases receptive field, adds translation invariance. Essential for deep networks.

🏗️

Famous Architectures

VGG (simple, deep), ResNet (skip connections), MobileNet (efficient). Each innovation solved specific problems. Use pre-trained versions!

🎨

Data Augmentation

Flip, rotate, crop training images. Artificially expands dataset, prevents overfitting, improves generalization. Essential for small datasets.

🎓

Transfer Learning

Use pre-trained models (ImageNet). Fine-tune for your task. Trains faster, needs less data, achieves better results. The default approach for vision tasks.