Convolutional Neural Networks

Build image classifiers and visualize feature detection

What are Convolutional Neural Networks?

CNNs are specialized neural networks for processing grid-like data such as images. They use convolution operations to automatically learn hierarchical features.

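Why CNNs Revolutionized Computer Vision

Local Connectivity
Each neuron looks at a small region of the input, so it detects local patterns
Parameter Sharing
The same filter is applied at every position - a feature is learned once and reused everywhere
Translation Invariance
Convolution is translation-equivariant (a shifted input gives a shifted feature map); combined with pooling, this lets the network recognize features regardless of position in the image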

The Convolution Operation: Mathematical Foundation

How Convolution Actually Works

The Sliding Window Process

Convolution performs element-wise multiplication and summation between a small kernel (filter) and overlapping regions of the input. Think of it as a sliding dot product.

Step-by-Step Example:
1. Position kernel at top-left of input image
2. Multiply each kernel value with corresponding input pixel
3. Sum all products to get single output value
4. Slide kernel by stride amount (typically 1 or 2 pixels)
5. Repeat until entire input is covered
Concrete Example (3×3 kernel applied to one 3×3 region of a 5×5 input):
Input region:
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
Kernel (vertical edge detector):
[[-1, 0, 1],
 [-1, 0, 1],
 [-1, 0, 1]]
Output = (1×-1)+(2×0)+(3×1)+(4×-1)+(5×0)+(6×1)+(7×-1)+(8×0)+(9×1) = 6
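The sliding-window steps can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the helper name `conv2d_valid` is hypothetical, it assumes a square single-channel input and a square kernel, and it uses no padding. Like deep learning frameworks, it does not flip the kernel (strictly speaking this is cross-correlation).

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both square lists of lists), no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            # Multiply each kernel value with the pixel under it and sum.
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


region = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
print(conv2d_valid(region, vertical_edge))  # [[6]]

# A 5x5 input with a dark-to-bright step between columns 1 and 2.
step_image = [[0, 0, 1, 1, 1] for _ in range(5)]
print(conv2d_valid(step_image, vertical_edge))  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

On the step image the filter responds only in the windows that straddle the edge and returns 0 where the window is uniform.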

Critical Parameters

Kernel Size (k):
• 1×1: Pointwise, mixes channels without looking at spatial neighbors
• 3×3: Most common, good efficiency
• 5×5, 7×7: Larger receptive field, more parameters
Modern trend: Stack two 3×3 layers instead of one 5×5 (fewer params, same receptive field)
Stride (s):
• How many pixels the kernel moves at each step
• Stride=1: Dense feature maps, more computation
• Stride=2: Halves spatial dimensions, can replace pooling
Larger stride = faster, but discards spatial detail
Padding (p):
• Zeros added around the input border
• Valid (p=0): Output shrinks
• Same (p=(k-1)/2 for odd k and stride 1): Output size = input size
Padding preserves spatial dimensions and edge information
Output Size Formula:
Output = ⌊(Input + 2×Padding - Kernel) / Stride⌋ + 1
Example 1: Input=28, Kernel=5, Stride=1, Padding=0 → Output=⌊(28-5)/1⌋+1 = 24
Example 2: Input=28, Kernel=3, Stride=1, Padding=1 → Output=⌊(28+2-3)/1⌋+1 = 28 (same!)
Example 3: Input=224, Kernel=7, Stride=2, Padding=3 → Output=⌊(224+6-7)/2⌋+1 = 112
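The output size formula translates directly into integer arithmetic. A small sketch follows; the function name `conv_output_size` is illustrative and not taken from any library.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """floor((input + 2*padding - kernel) / stride) + 1 along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1


print(conv_output_size(28, kernel=5, stride=1, padding=0))   # 24
print(conv_output_size(28, kernel=3, stride=1, padding=1))   # 28
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
```

Integer division (`//`) plays the role of the floor in the formula, which matters in the third example where (224 + 6 - 7) / 2 = 111.5.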

Why Convolution > Fully Connected

• Sparse connectivity: Each output is connected to a small input region (not all pixels)
• Parameter sharing: Same kernel applied everywhere (9 weights for a single-channel 3×3 kernel vs millions)
• Translation equivariance: Output shifts when input shifts
Example: 28×28 image → FC layer needs 784 × hidden_dim weights. A 3×3 conv layer needs only 9 × in_channels × out_channels!

Real-World Computational Cost

Conv layer: 32 filters, 3×3 kernel, 224×224×3 input (stride 1, same padding)
Multiplications = (224×224) × (3×3×3) × 32
≈ 43.4 million (14.5 million per input channel)
Parameters: (3×3×3 input channels + 1 bias) × 32 = 896
Contrast with FC: mapping the flattened 224×224×3 input to just 32 units takes 150,528 × 32 ≈ 4.8M weights!
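To make the cost arithmetic reproducible, here is a small sketch. The helper names are hypothetical, and the count assumes stride 1 with "same" padding, so the output is 224×224. It reports multiplications both per input channel and for all three RGB channels, and compares the parameter count with a dense layer that maps the flattened image to 32 units.

```python
def conv_params(kernel, in_channels, out_channels, bias=True):
    """Weights (plus optional biases) of a square-kernel conv layer."""
    return (kernel * kernel * in_channels + (1 if bias else 0)) * out_channels


def conv_multiplications(out_h, out_w, kernel, in_channels, out_channels):
    """One multiplication per kernel weight, per output position, per filter."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels


def dense_params(n_inputs, n_outputs, bias=True):
    """Weights (plus optional biases) of a fully connected layer."""
    return (n_inputs + (1 if bias else 0)) * n_outputs


print(conv_params(3, in_channels=3, out_channels=32))          # 896
print(conv_multiplications(224, 224, 3, 1, 32))                # 14450688
print(conv_multiplications(224, 224, 3, 3, 32))                # 43352064
print(dense_params(224 * 224 * 3, 32))                         # 4816928
```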

1. Convolution Operation

Interactive: Sliding Filter Window

Convolution slides a filter (kernel) across the input, computing dot products to create a feature map.

Input Parameters

Input Size: 28×28
Kernel: 3×3
Stride: 1
Padding: 0

Output

26×26
Output Feature Map Size
Output = ⌊(28 + 2×0 - 3) / 1⌋ + 1 = 26

Formula: Output Size = ⌊(Input + 2×Padding - Kernel) / Stride⌋ + 1

2. Convolutional Filters

Interactive: Common Filter Types

Edge Detection

Detects horizontal edges by finding intensity gradients

3×3 Kernel
[[-1, -1, -1],
 [ 0,  0,  0],
 [ 1,  1,  1]]
Effect on Image
Highlights boundaries where pixel values change rapidly. Essential for detecting shapes and objects.

Pooling: Spatial Downsampling and Invariance

Why Downsample Feature Maps?

Four Key Benefits of Pooling

1. Dimensionality Reduction:
• Reduces each spatial dimension by the pool size (typically 2×2 → 75% fewer activations)
• Example: 224×224 feature map → 112×112 after 2×2 max pooling
• Computational savings: Next layer processes 4× fewer pixels!
Memory: 224² = 50,176 → 112² = 12,544 (75% reduction)
2. Translation Invariance:
• Small shifts in the input often leave the pooled output unchanged (if the maximum moves 1 pixel but stays inside the same window → same max value)
• Makes network more robust to minor variations in object position
• Useful for classification: "cat" detected regardless of exact pixel coordinates
3. Receptive Field Expansion:
• Each neuron in next layer "sees" larger area of original image
• Example: Layer 1 (3×3 conv) sees 3×3 region. After 2×2 pooling, Layer 2 (3×3 conv) sees 8×8 region!
Without pooling:
3 conv layers (3×3) → receptive field = 7×7
With 2×2 pooling after each of the first two convs:
3 conv layers + 2 pooling → receptive field = 18×18
4. Overfitting Prevention:
• Aggregates information (summarization acts as a mild regularizer)
• Encourages network to learn robust features, not memorize exact pixel positions
• Loosely comparable to dropout in effect, but deterministic

Max Pooling vs Average Pooling

Max Pooling (Most Common):
Input 2×2 region:
[1, 3]
[2, 4]
Output: max(1,2,3,4) = 4
Why max?
• Captures strongest activations (most prominent features)
• If edge detector fires strongly anywhere in region → preserved
• Widely used: AlexNet and VGG downsample with max pooling; ResNet uses a single max pool after its first conv layer
• Sparse gradients during backpropagation (only the max element of each window receives gradient)
Average Pooling:
Input 2×2 region:
[1, 3]
[2, 4]
Output: avg(1,2,3,4) = 2.5
When to use average?
• Smoother downsampling (all values contribute equally)
• Often used before final classification layer (Global Average Pooling in ResNet)
• Better for preserving background/context information
• Less aggressive than max (retains more information about distribution)
Modern Alternative: Strided Convolutions
• Some architectures (ResNet variants) use stride=2 convolutions instead of pooling
• Advantage: Learnable downsampling (network learns best way to reduce dimensions)
• Disadvantage: More parameters, more computation
Trend: Hybrid approach - pooling in early layers, strided conv in later layers
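The two pooling variants differ only in how each window is reduced. The sketch below is a plain-Python illustration with a hypothetical helper `pool2d`; it assumes non-overlapping windows (stride equal to the window size) and input dimensions divisible by that size.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over a list-of-lists feature map."""
    if mode == "max":
        reduce_fn = max
    else:
        reduce_fn = lambda values: sum(values) / len(values)
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for i in range(rows):
        pooled_row = []
        for j in range(cols):
            # Collect the size x size window whose top-left corner is (i*size, j*size).
            window = [feature_map[i * size + di][j * size + dj]
                      for di in range(size) for dj in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


region = [[1, 3], [2, 4]]
print(pool2d(region, mode="max"))  # [[4]]
print(pool2d(region, mode="avg"))  # [[2.5]]
```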

Typical CNN Pattern

Input: 224×224×3 (RGB image)
→ Conv 64 filters → 224×224×64
→ MaxPool 2×2 → 112×112×64
→ Conv 128 filters → 112×112×128
→ MaxPool 2×2 → 56×56×128
→ Conv 256 filters → 56×56×256
→ MaxPool 2×2 → 28×28×256
Pattern: Spatial dimensions ↓, Channel depth ↑
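The shape bookkeeping in this pattern can be traced mechanically. The sketch below uses a hypothetical helper `trace_shapes` and assumes every conv uses "same" padding with stride 1 (so only the channel count changes) and every pool is non-overlapping.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer."""
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for kind, value in layers:
        if kind == "conv":
            # 'same' padding, stride 1: spatial size kept, channels = filter count.
            c = value
        elif kind == "pool":
            # Non-overlapping pooling divides height and width by the pool size.
            h, w = h // value, w // value
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((h, w, c))
    return shapes


layers = [("conv", 64), ("pool", 2), ("conv", 128),
          ("pool", 2), ("conv", 256), ("pool", 2)]
for shape in trace_shapes((224, 224, 3), layers):
    print(shape)
# Final line printed: (28, 28, 256)
```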

3. Pooling Layers

Interactive: Downsampling Operations

Input (4×4 grid of the values shown)

[[3, 7, 9, 8],
 [4, 8, 2, 9],
 [6, 2, 3, 1],
 [5, 7, 2, 6]]

Output (2×2, after 2×2 max pooling with stride 2)

[[8, 9],
 [7, 6]]
Max Pooling
Takes maximum value from each region. Captures strongest activations, preserves most prominent features.

Purpose: Reduce spatial dimensions, increase receptive field, add translation invariance, reduce computation.

4. Feature Map Depth

Interactive: Multiple Filters Learning

Feature Map 1

Learned Pattern
Detects vertical edges and lines

Key Insight: Each filter learns a different feature. More filters = more diverse patterns detected. Typical: 32→64→128→256 filters in deeper layers.

Receptive Fields: The Hierarchical Vision Mechanism

What Each Neuron "Sees" in the Original Image

Definition: Receptive Field

The receptive field of a neuron is the region in the input image that can influence that neuron's activation. As you go deeper in the network, receptive fields grow: linearly with stacked stride-1 convolutions, and much faster once pooling or strided layers are added. This lets deeper neurons "see" and integrate information from larger areas.

Biological Inspiration:
• Hubel & Wiesel (1962): Discovered hierarchical organization in cat visual cortex
• V1 neurons: Respond to simple edges, small receptive fields (roughly 1-2°)
• V2 neurons: Respond to corners, textures, medium receptive fields (roughly 5-10°)
• V4/IT neurons: Respond to complex shapes/objects, large receptive fields (roughly 20-50°)
CNNs mirror this hierarchical structure!

Calculating Receptive Field Size

Recursive Formula:
RF_layer = RF_prev + (kernel_size - 1) × product_of_previous_strides
Example: Stack of 3×3 convolutions (stride 1)
• Layer 1: RF = 3×3 (sees 9 pixels)
• Layer 2: RF = 3 + (3-1)×1 = 5×5 (sees 25 pixels)
• Layer 3: RF = 5 + (3-1)×1 = 7×7 (sees 49 pixels)
• Layer 4: RF = 7 + (3-1)×1 = 9×9 (sees 81 pixels)
With Pooling (2×2 max pool, stride=2):
Each pooling layer doubles the rate of receptive field growth for every layer after it!
• Layer 1 (Conv 3×3): RF = 3×3
• MaxPool 2×2: RF = 3 + (2-1)×1 = 4×4
• Layer 2 (Conv 3×3): RF = 4 + (3-1)×2 = 8×8
• MaxPool 2×2: RF = 8 + (2-1)×2 = 10×10
• Layer 3 (Conv 3×3): RF = 10 + (3-1)×4 = 18×18
Each pooling layer multiplies the stride product, so RF grows roughly exponentially with the number of pooling stages
Why Larger Kernels (5×5, 7×7) are Rare:
• VGG insight: Two 3×3 convs have same receptive field as one 5×5
→ 3×3 + 3×3 = 5×5 RF, but 18 params vs 25 params (per input/output channel pair)
• Three 3×3 convs = one 7×7 conv (27 params vs 49 params, 45% savings!)
Bonus: More non-linearity (ReLU after each conv) improves learning
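The receptive field recurrence is easy to run by hand, but a few lines of code make it simple to check any layer stack. This is a sketch with a hypothetical helper `receptive_fields`; each layer is described only by its kernel size and stride, and `jump` is the running product of the strides of earlier layers.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns the RF size after each layer."""
    rf, jump = 1, 1  # a single input pixel sees itself; no striding applied yet
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # RF_prev + (kernel - 1) * product of previous strides
        jump *= stride
        sizes.append(rf)
    return sizes


# Four stacked 3x3 convolutions, stride 1.
print(receptive_fields([(3, 1)] * 4))  # [3, 5, 7, 9]

# Conv 3x3, MaxPool 2x2 (stride 2), Conv 3x3, MaxPool 2x2, Conv 3x3.
print(receptive_fields([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # [3, 4, 8, 10, 18]
```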

Hierarchical Feature Learning

Early Layers (RF: 3×3 to 15×15):
• Detect: Edges, colors, gradients
• Examples: Vertical line, horizontal line, diagonal edge
• Transferable: Same edge detectors work for cats, cars, buildings
Low-level features, highly reusable
Middle Layers (RF: 20×20 to 80×80):
• Detect: Textures, parts, patterns
• Examples: Eye, wheel, window, fur texture
• Domain-specific: Combine edges into meaningful parts
Mid-level features, moderately reusable
Deep Layers (RF: 100×100 to entire image):
• Detect: Complete objects, scenes
• Examples: Face, car, dog, building
• Task-specific: Optimized for classification/detection task
High-level features, less transferable
Visualization Studies (Zeiler & Fergus, 2014):
Deconvolutional networks revealed what CNNs actually learn:
• Layer 1: Gabor-like edge filters at various orientations
• Layer 2: Color blobs, corner detectors, simple textures
• Layer 3: Mesh patterns, text, complex textures
• Layer 4: Dog faces, bird legs, bicycle parts
• Layer 5: Full objects in various poses

Design Principle

• Start with small receptive fields (3×3, 5×5)
• Gradually expand through depth and pooling
• Final layers should "see" entire object or significant portion
• Rule of thumb: RF of the final conv layers should be at least as large as the objects to be recognized
ImageNet (224×224): Need RF ≥ 100×100 in final conv layers

Common Pitfall

Shallow networks with large kernels:
• 7×7 conv → RF = 7×7 (49 pixels) ✓
• But: Fewer non-linearities, harder to learn complex functions
Better: Deep networks with small kernels:
• 3×3 → 3×3 → 3×3 → RF = 7×7 (49 pixels) ✓
• Bonus: 3 ReLUs, more expressive power

5. Receptive Field Growth

Interactive: What Each Layer "Sees"

Receptive Field Size
Layer 1: 3×3 pixels in original image
Early Layers
Detect simple features: edges, colors, textures
Middle Layers
Detect parts: eyes, wheels, windows
Deep Layers
Detect objects: faces, cars, buildings

CNN Architecture Evolution: From AlexNet to Modern Networks

Milestones in CNN Design

Historical Timeline

AlexNet (2012) - The Breakthrough:
• ImageNet winner: 15.3% top-5 error (vs 26.2% for the runner-up) - about 40% relative improvement!
• Architecture: 5 conv layers + 3 FC layers = 60M parameters
• Innovations: ReLU activation, dropout, GPU training (2 GPUs in parallel)
• Impact: Sparked deep learning revolution in computer vision
Key lesson: Deep networks + ReLU + data + GPU = breakthrough performance
VGG (2014) - Simplicity at Scale:
• Insight: Use only 3×3 filters throughout entire network
• VGG-16: 16 layers, 138M parameters. VGG-19: 19 layers, 144M params
• Pattern: Conv-Conv-Pool repeated, channels double after pooling (64→128→256→512)
• Problem: Huge parameter count (FC layers dominate), slow training
Key lesson: Uniformity and depth matter more than fancy filter designs
ResNet (2015) - Solving the Depth Problem:
• Problem solved: Very deep plain networks were degrading, not improving (a 56-layer plain net trained worse than a 20-layer one)
• Solution: Skip connections - F(x) + x instead of just F(x)
• ResNet-50: 50 layers, 25M params. ResNet-152: 152 layers still trainable!
• Impact: 3.6% ImageNet top-5 error (ensemble). Enabled training of 1000+ layer networks
y = ReLU(F(x, W) + x) ← "x" is the identity shortcut
Key lesson: Let network learn residuals (difference) instead of full mapping
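The residual formula y = ReLU(F(x, W) + x) can be illustrated on a plain list of numbers. In this toy sketch, `residual_block` and the two stand-in branches are hypothetical; a real F would be a stack of conv and batch-norm layers rather than a one-line function.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """y = ReLU(F(x) + x), where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [0.5, 2.0, 1.5]

# If the learned branch outputs zeros, the block passes non-negative x through unchanged.
zero_branch = lambda v: [0.0] * len(v)
print(residual_block(x, zero_branch))  # [0.5, 2.0, 1.5]

# A branch that learns a small correction only nudges the identity mapping.
small_branch = lambda v: [0.1 * vi for vi in v]
print(residual_block(x, small_branch))
```

The zero-branch case shows why extra residual blocks are easy to optimize: doing nothing is already representable, so added depth does not have to relearn the identity.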

Design Patterns and Trade-offs

Accuracy-Focused Architectures:
ResNet-152, DenseNet-264:
• Very deep (100-200+ layers)
• High parameter count (tens of millions or more)
• ImageNet top-5: roughly 4-6% error
• Training time: Days on 8 GPUs
Use when: Accuracy is paramount and the computational budget is large
Efficiency-Focused Architectures:
MobileNet, EfficientNet-B0, SqueezeNet:
• Shallow or narrow (typically 10-50 layers)
• Low parameter count (1-10M)
• ImageNet top-5: roughly 7-20% error, depending on the model
• Inference: Real-time on mobile/edge
Use when: Deploying to mobile/IoT, latency critical, limited compute
Architecture Decision Tree:
1. Small dataset (<10K images)? → Use pretrained model + transfer learning
2. Need real-time inference? → MobileNet, EfficientNet-B0
3. Accuracy is critical? → ResNet-50/101, EfficientNet-B7
4. Learning from scratch? → Start simple (VGG-style), add complexity if needed
5. Limited GPU memory? → Reduce batch size, use gradient checkpointing

Modern Building Blocks

Residual Block (ResNet):
x → Conv(3×3)
→ BatchNorm → ReLU
→ Conv(3×3)
→ BatchNorm
↓ + x (skip)
→ ReLU → output
Gradients flow through skip connection
Inception Block (GoogLeNet):
x
↙ ↓ ↓ ↘
1×1 3×3 5×5 pool
↘ ↓ ↓ ↙
concatenate
Multi-scale feature extraction in parallel
Depthwise Separable (MobileNet):
x → DepthwiseConv(3×3)
[per-channel]
→ PointwiseConv(1×1)
[mix channels]
→ output
8-9× fewer params than standard conv!
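The 8-9× figure for depthwise separable convolutions can be checked by counting weights. The sketch below ignores biases, and the channel counts (64 in, 128 out) are an arbitrary illustration rather than values from a specific MobileNet layer.

```python
def standard_conv_weights(kernel, in_channels, out_channels):
    """Every output filter has a kernel x kernel slice for every input channel."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_weights(kernel, in_channels, out_channels):
    depthwise = kernel * kernel * in_channels   # one spatial filter per input channel
    pointwise = in_channels * out_channels      # 1x1 conv that mixes channels
    return depthwise + pointwise


std = standard_conv_weights(3, 64, 128)
sep = depthwise_separable_weights(3, 64, 128)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For a 3×3 kernel the ratio approaches 9 as the number of output channels grows, because the pointwise term dominates the separable count.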

6. CNN Architectures

Interactive: Famous Architectures

Simple CNN

~1.2M parameters
1. Conv 32
2. Pool
3. Conv 64
4. Pool
5. Dense 128
6. Output
Simple CNN
Basic architecture for learning CNNs. Good for MNIST, CIFAR-10. Fast training, easy to understand.

7. Parameter Calculation

Interactive: Count Parameters

Example Layer: Conv(32 filters, 3×3 kernel)

Kernel Size: 3×3
Input Channels: 3
Output Filters: 32
Calculation:
Params = (3×3×3 + 1) × 32 = 896
(kernel_height × kernel_width × input_channels + bias) × output_filters
Total Model Estimate: 1,058,816 parameters (simplified)
Memory Required: 4.0 MB @ 32-bit floats
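The memory figure follows from the parameter count times the size of one value. A minimal sketch with a hypothetical helper name; it counts weights only and does not include activations, gradients, or optimizer state, which usually dominate memory during training.

```python
def parameter_memory_bytes(n_params, bytes_per_param=4):
    """Storage for the weights alone; 4 bytes per 32-bit float."""
    return n_params * bytes_per_param


n_params = 1_058_816
size_bytes = parameter_memory_bytes(n_params)
print(size_bytes)                    # 4235264
print(round(size_bytes / 2**20, 2))  # 4.04 (MiB)
```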

8. Activation Functions

โšก Interactive: Non-linearity

ReLU: f(x) = max(0, x)

Characteristics

Range:[0, โˆž)
Pros:Fast, no vanishing gradient
Cons:Dying ReLU problem
Usage:Most common in CNNs
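ReLU and its gradient fit in two one-line functions, which also makes the dying-ReLU issue visible: for any non-positive input the gradient is zero, so no learning signal passes through. A minimal sketch for scalar inputs (treating the gradient at exactly 0 as 0 is a common convention, assumed here):

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), relu_grad(x))
```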

9. Data Augmentation

Interactive: Training Data Tricks

Original Image

Effect

Base training image without modifications. Start here.
Other Techniques:
Brightness, contrast, color jitter, cutout, mixup

Why Augment? Artificially expand training data, prevent overfitting, improve generalization, make model robust to variations.
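As a rough illustration of what augmentation does to the pixel grid, here is a sketch of two common transforms, horizontal flip and random crop, on an image stored as a list of rows. The helper names are hypothetical; real pipelines would normally use a library's transform utilities instead.

```python
import random


def horizontal_flip(image):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in image]


def random_crop(image, crop_size, rng=random):
    """Cut a crop_size x crop_size patch at a random position."""
    top = rng.randint(0, len(image) - crop_size)
    left = rng.randint(0, len(image[0]) - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]

# A seeded generator makes the crop reproducible between runs.
crop = random_crop(image, 2, random.Random(0))
print(crop)
```

Each call yields a slightly different view of the same labeled image, which is how augmentation enlarges the effective training set.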

Transfer Learning: Standing on the Shoulders of Giants

Reusing Pre-trained Knowledge

The Core Insight

Transfer learning leverages a model pre-trained on a massive dataset (typically ImageNet: 1.4M images, 1000 classes) and adapts it to your specific task. This is the default approach for almost all computer vision tasks in 2024.

Why Transfer Learning is Dominant:
• Often 10-100× faster training: Pre-trained features already capture edges, textures, shapes
• Often 10-100× less data needed: Works with 100-1000 images instead of 100K+
• Better accuracy: Often outperforms training from scratch even with more data
• Lower computational cost: Fine-tuning needs 1 GPU for hours, not 8 GPUs for weeks
When NOT to Use Transfer Learning:
• Domain is radically different: Medical X-rays, satellite imagery (but still helps!)
• You have 10M+ labeled images: Training from scratch might match/exceed transfer learning
• Academic research: Studying learning dynamics from initialization
Reality: Even for medical/satellite imagery, transfer learning is usually the starting point

Transfer Learning Strategies

Strategy 1: Feature Extraction (Frozen Base)
1. Load pretrained model (e.g., ResNet-50)
2. Freeze all convolutional layers
3. Replace final FC layer (1000 → your_classes)
4. Train only new FC layer
When to use:
• Small dataset (100-1,000 images)
• Similar to ImageNet (natural photos)
• Limited computational budget
Training time: Minutes to hours
Strategy 2: Fine-Tuning (Partially Frozen)
1. Load pretrained model
2. Freeze early layers (generic features)
3. Unfreeze later layers (task-specific)
4. Train with small learning rate
When to use:
• Medium dataset (1,000-100,000 images)
• Somewhat different from ImageNet
• Want to adapt high-level features
Training time: Hours to days
Strategy 3: Full Fine-Tuning (All Unfrozen)
• Unfreeze all layers, train entire network with low learning rate (1e-5 to 1e-4)
• When: Large dataset (100K+ images), domain shift from ImageNet
• Caution: Can overfit if dataset too small, may destroy pretrained features
Best practice: Start with frozen/partial, then full fine-tune if needed
Learning Rate Guidelines:
• New layers (randomly initialized): 1e-3 to 1e-2 (normal rate)
• Fine-tuning pretrained layers: 1e-5 to 1e-4 (10-100× smaller)
• Why smaller? Pretrained weights already near optimal, don't want large updates
Technique: Discriminative learning rates (different LR per layer group)
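One simple way to pick discriminative learning rates is to space them geometrically between the smallest rate (earliest pretrained layers) and the largest (the new head). The sketch below is only an illustration: the helper `discriminative_lrs`, the geometric spacing, and the three group names are assumptions, not something the guidelines above prescribe.

```python
def discriminative_lrs(n_groups, lowest_lr=1e-5, highest_lr=1e-3):
    """Geometrically spaced learning rates, earliest layer group first."""
    if n_groups == 1:
        return [highest_lr]
    ratio = (highest_lr / lowest_lr) ** (1 / (n_groups - 1))
    return [lowest_lr * ratio ** i for i in range(n_groups)]


groups = ["early conv layers", "late conv layers", "new classifier head"]
for name, lr in zip(groups, discriminative_lrs(len(groups))):
    print(f"{name}: {lr:.1e}")
```

With three groups this gives rates of roughly 1e-5, 1e-4 and 1e-3, matching the ranges in the guidelines. The resulting per-group values could then be passed to whichever optimizer API supports per-parameter-group learning rates.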

Practical Results Comparison

Example Task: Classify 20 dog breeds (2,000 images)
Train from Scratch (ResNet-50):
• Training time: 48 hours (8 GPUs)
• Validation accuracy: 62% (heavy overfitting)
• Problem: Not enough data to learn low-level features
Transfer Learning - Feature Extraction:
• Training time: 20 minutes (1 GPU)
• Validation accuracy: 78%
• Benefit: Pretrained features (edges, textures) already useful
Transfer Learning - Fine-Tuning:
• Training time: 2 hours (1 GPU)
• Validation accuracy: 89%
• Benefit: Adapted high-level features to dog-specific patterns
Transfer learning in this example: feature extraction trains 144× faster (+16 accuracy points), fine-tuning trains 24× faster (+27 accuracy points)!

Which Pretrained Model to Choose?

• ResNet-50: Default choice, excellent accuracy/speed balance
• EfficientNet-B0 to B7: Best accuracy per param, scalable
• MobileNetV2: Mobile/edge deployment, real-time inference
• Vision Transformer (ViT): Cutting edge, needs large datasets
Start with ResNet-50 or EfficientNet-B0, optimize later if needed

Quick Start Checklist

1. Load pretrained model (PyTorch, TensorFlow hubs)
2. Replace final layer: model.fc = Linear(2048, num_classes)
3. Freeze base: for param in model.parameters(): param.requires_grad = False
4. Unfreeze FC: for param in model.fc.parameters(): param.requires_grad = True
5. Train with Adam, LR=1e-3, monitor validation
6. If accuracy plateaus, unfreeze more layers, reduce LR to 1e-5

10. Transfer Learning

Interactive: Pre-trained Models

ResNet-50

Deep residual network with skip connections

ImageNet Accuracy: 76.1%
Parameters: 25.6M
Transfer Learning Strategy:
1. Load pre-trained weights (trained on ImageNet)
2. Freeze early layers (they learned general features)
3. Replace final classification layer for your task
4. Fine-tune on your dataset (faster, less data needed)

Benefits: Train 10-100× faster, need 10-100× less data, achieve better accuracy. Start here for real projects!

Key Takeaways

Convolution Magic

Sliding filters detect local patterns. Parameter sharing means the same feature detector works everywhere, making CNNs efficient and, together with pooling, robust to translation.

Hierarchical Features

Early layers detect edges, middle layers detect parts (eyes, wheels), deep layers detect objects. The network learns this feature hierarchy automatically.

Pooling Reduces Size

Max/average pooling downsamples feature maps. Reduces computation, increases receptive field, adds translation invariance. Strided convolutions are a learnable alternative.

Famous Architectures

VGG (simple, deep), ResNet (skip connections), MobileNet (efficient). Each innovation solved specific problems. Use pre-trained versions!

Data Augmentation

Flip, rotate, crop training images. Artificially expands dataset, prevents overfitting, improves generalization. Essential for small datasets.

Transfer Learning

Use pre-trained models (ImageNet). Fine-tune for your task. Trains faster, needs less data, achieves better results. The default approach for vision tasks.