
AI Model Evaluation

Understand metrics, confusion matrices, and model performance


Why Model Evaluation Matters

Model evaluation is how we measure whether our AI actually works. Accuracy alone can be misleading - you also need precision, recall, F1-score, and ROC curves to understand performance. The right metric depends on your problem!

💡 The Core Challenge

🎯

Measure Performance

Quantify how well your model predicts

⚖️

Choose Right Metric

Different problems need different metrics

🔍

Detect Issues

Find overfitting, bias, class imbalance

Confusion Matrix: The Foundation of Classification Metrics

🎯 Understanding the 2×2 Grid

📋 What Is a Confusion Matrix?

Core Concept

A confusion matrix shows where your classifier gets confused - comparing predicted labels vs actual labels

	Predicted: Yes	Predicted: No
Actual: Yes	TP (True Positive)	FN (False Negative)
Actual: No	FP (False Positive)	TN (True Negative)

Why "Confusion" Matrix?

• Diagonal (TP + TN): Correct predictions ✓

• Off-diagonal (FP + FN): Where model gets confused ✗

• Pattern analysis: Which class is harder to predict?

High FP? Model too aggressive (predicts positive too often)

High FN? Model too conservative (misses positives)
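A minimal sketch of how the four cells can be counted from paired labels. It assumes binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the toy data are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # said yes, was yes
        elif predicted == 1 and actual == 0:
            fp += 1          # said yes, was no (false alarm)
        elif predicted == 0 and actual == 1:
            fn += 1          # said no, was yes (miss)
        else:
            tn += 1          # said no, was no
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical toy data, not from any real model
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```

Here FP and FN are equal, so this toy model is neither clearly aggressive nor clearly conservative.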

✅ True Positive (TP): Correct "Yes" Prediction

Definition

Model predicts positive, actual label is positive → Correct!

Real-World Examples

• Email spam: Flagged spam, actually spam ✓

• Disease diagnosis: Predicted sick, patient is sick ✓

• Fraud detection: Flagged fraud, transaction is fraud ✓

→ Good job! Model correctly identified the positive class

❌ False Positive (FP): Type I Error - False Alarm

Definition

Model predicts positive, actual label is negative → Wrong! (False alarm)

Real-World Examples & Impact

• Email spam: Legitimate email marked as spam ✗

Cost: User misses important email, loses trust

• Disease diagnosis: Healthy patient diagnosed sick ✗

Cost: Unnecessary treatment, anxiety, medical expenses

• Fraud detection: Legit transaction blocked ✗

Cost: Customer frustration, lost sales

→ "Crying wolf" - Said yes when should have said no

Type I Error: Rejecting a true null hypothesis (false alarm, false positive)

❌ False Negative (FN): Type II Error - Missed Detection

Definition

Model predicts negative, actual label is positive → Wrong! (Missed it)

Real-World Examples & Impact

• Email spam: Spam email reaches inbox ✗

Cost: User annoyance, potential phishing risk

• Disease diagnosis: Sick patient diagnosed healthy ✗

Cost: Delayed treatment, disease progression, death risk!

• Fraud detection: Fraudulent transaction approved ✗

Cost: Financial loss, identity theft

→ "Missed the target" - Said no when should have said yes

Type II Error: Failing to reject a false null hypothesis (miss, false negative)

✅ True Negative (TN): Correct "No" Prediction

Definition

Model predicts negative, actual label is negative → Correct!

Real-World Examples

• Email spam: Normal email reaches inbox ✓

• Disease diagnosis: Healthy patient diagnosed healthy ✓

• Fraud detection: Legitimate transaction approved ✓

→ Good! Model correctly identified the negative class

Often overlooked but important: In imbalanced datasets, TN can be the largest number

⚖️ FP vs FN: Which Error Is Worse?

When FP Is Worse (Minimize False Alarms)

• Spam filtering: Don't block legitimate emails

• Ad targeting: Don't annoy users with irrelevant ads

• Content moderation: Don't censor valid posts

→ Use Precision as key metric

When FN Is Worse (Catch All Positives)

• Cancer screening: Never miss a cancer case

• Fraud detection: Catch all fraudulent transactions

• Security threats: Detect all intrusions

→ Use Recall as key metric

The Fundamental Tradeoff

• Reduce FP → Increase FN (more conservative predictions)

• Reduce FN → Increase FP (more aggressive predictions)

• Balance both → Use F1-score or adjust threshold

🧮 Complete Example: Medical Diagnosis

Scenario: COVID-19 Test (1000 patients)

Ground truth: 100 actually have COVID, 900 don't

	Predicted +	Predicted -
Actual +	TP: 85	FN: 15
Actual -	FP: 45	TN: 855

TP = 85: Correctly identified 85 COVID+ patients ✓

FP = 45: 45 healthy people told they have COVID ✗ (anxiety, quarantine)

FN = 15: 15 COVID+ patients told they're healthy ✗ (spread disease!)

TN = 855: Correctly identified 855 healthy people ✓

Which error is worse here?

FN is dangerous! Those 15 people spread COVID thinking they're healthy. Better to have more FP (false alarms) than miss positive cases. → Optimize for Recall
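To put numbers on this example, the sketch below plugs the four counts into the standard definitions: accuracy = (TP + TN) / total, recall = TP / (TP + FN), precision = TP / (TP + FP). Only the counts come from the scenario above; the variable names are illustrative.

```python
tp, fp, fn, tn = 85, 45, 15, 855  # counts from the COVID-19 scenario above

total = tp + fp + fn + tn
accuracy = (tp + tn) / total      # 940 / 1000
recall = tp / (tp + fn)           # 85 / 100
precision = tp / (tp + fp)        # 85 / 130

print(f"accuracy={accuracy:.1%} recall={recall:.1%} precision={precision:.1%}")
# accuracy=94.0% recall=85.0% precision=65.4%
```

Accuracy looks strong at 94%, yet 15% of infected patients are still missed, which is why recall is the number to watch here.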

🔑 Key Insight

The confusion matrix is the source of truth. All other metrics (accuracy, precision, recall, F1) are just different ways to summarize these 4 numbers!

💡 Practical Tip

Always visualize the confusion matrix first! It reveals which classes are being confused and helps you decide which metric matters most for your problem.

1. Confusion Matrix: The Foundation

📊 Interactive: Build Your Confusion Matrix

True Positives (TP): 85 (✓ Correct)

False Positives (FP): 15 (✗ Type I Error)

False Negatives (FN): 10 (✗ Type II Error)

True Negatives (TN): 90 (✓ Correct)

Total Predictions: 200

Classification Metrics: From Confusion Matrix to Insight

📏 The Four Core Metrics

✓ Accuracy: Overall Correctness

Formula

Accuracy = (TP + TN) / (TP + TN + FP + FN)

= Correct predictions / Total predictions

Intuition:

What percentage of predictions were correct, regardless of class?

Example:

TP=85, FP=15, FN=10, TN=90

Accuracy = (85+90) / (85+90+15+10) = 175/200 = 87.5%

When to Use Accuracy

✅ Balanced classes: 50-50 or 60-40 split

✅ All errors equal: FP and FN cost the same

✅ Overall performance: Quick sanity check

Example: Image classification (cat vs dog with equal samples)

⚠️ The Accuracy Paradox

Problem: Misleading with imbalanced classes!

Example: Fraud Detection (99% legit, 1% fraud)

• Dumb model: Always predict "not fraud" → 99% accuracy!

• But catches 0% of fraud (all FN)

• Accuracy = 99% but model is useless!

→ Don't use accuracy alone for imbalanced data!
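The paradox is easy to reproduce. This sketch builds a hypothetical dataset of 1,000 transactions with the 99-1 split described above and scores a model that always answers "not fraud"; the data is synthetic and made up for illustration.

```python
# Hypothetical dataset: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = [0] * 990 + [1] * 10
y_pred = [0] * len(y_true)  # "dumb" model: always predicts "not fraud"

correct = sum(a == p for a, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

caught = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
recall = caught / sum(y_true)  # share of actual fraud that was flagged

print(f"accuracy={accuracy:.0%}, fraud caught={recall:.0%}")
# accuracy=99%, fraud caught=0%
```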

🎯 Precision: Avoid False Alarms

Formula

Precision = TP / (TP + FP)

= True positives / Predicted positives

Intuition:

When model says "positive", how often is it actually correct?

Alternative name: Positive Predictive Value (PPV)

Example:

TP=85, FP=15 (model predicted 100 as positive)

Precision = 85 / (85+15) = 85/100 = 85%

Out of 100 positive predictions, 85 were correct

When to Use Precision

✅ False positives are costly: Minimize false alarms

• Spam filter: Don't block legitimate emails

• Product recommendations: Don't annoy users with bad suggestions

• Ad targeting: Don't waste budget on wrong audience

• Content moderation: Don't censor valid posts

→ "When I say yes, I better be right!"

💡 Precision Focus

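Precision ignores FN (missed positives). It only cares about avoiding false positives among your positive predictions. High precision = few false alarms among the cases you flag. Note that this is not the same as a low False Positive Rate, which is FP / (FP + TN).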

🔎 Recall: Catch All Positives (Sensitivity)

Formula

Recall = TP / (TP + FN)

= True positives / Actual positives

Intuition:

Of all actual positive cases, how many did we find?

Alternative names:

• Sensitivity (medical field)

• True Positive Rate (TPR)

• Hit Rate

Example:

TP=85, FN=10 (there were 95 actual positives)

Recall = 85 / (85+10) = 85/95 = 89.5%

We caught 85 out of 95 positive cases

When to Use Recall

✅ False negatives are dangerous: Can't miss positives

• Cancer screening: Never miss a cancer case (lives at stake!)

• Fraud detection: Catch all fraudulent transactions

• Security threats: Detect all intrusions/attacks

• Search engines: Return all relevant documents

→ "Don't let any positive slip through!"

💡 Recall Focus

Recall ignores FP (false alarms). It only cares about catching all actual positives, even if it means more false alarms. High recall = low miss rate.
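The precision and recall formulas above can be sketched as two small functions. The zero-denominator behavior (returning 0.0 when there are no predicted or actual positives) is an assumption made here for convenience; the formulas themselves leave that case undefined.

```python
def precision(tp, fp):
    """TP / (TP + FP): share of positive predictions that were right."""
    predicted_positives = tp + fp
    # Assumed convention: return 0.0 when the model never predicts positive
    return tp / predicted_positives if predicted_positives else 0.0

def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that were found."""
    actual_positives = tp + fn
    # Assumed convention: return 0.0 when there are no actual positives
    return tp / actual_positives if actual_positives else 0.0

tp, fp, fn = 85, 15, 10  # counts from the running example
print(f"precision={precision(tp, fp):.1%}")  # precision=85.0%
print(f"recall={recall(tp, fn):.1%}")        # recall=89.5%
```

Notice that `precision` never looks at FN and `recall` never looks at FP, which is exactly why neither is enough on its own.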

⚖️ Precision-Recall Tradeoff

The Fundamental Tension

↑ Precision (fewer FP):

• Be more conservative in saying "positive"

• Higher threshold → predict positive less often

• Result: Miss more positives (↑ FN) → ↓ Recall

↑ Recall (fewer FN):

• Be more aggressive in saying "positive"

• Lower threshold → predict positive more often

• Result: More false alarms (↑ FP) → ↓ Precision

You usually can't have both perfect!

To catch every positive (100% recall), you'd predict everything as positive → terrible precision. To be 100% sure when you predict positive (100% precision), you'd be very conservative → miss many positives (low recall).

Visual Example: Threshold Effect

Low Threshold (0.3)

Precision: 60%

Recall: 95%

Aggressive - catches most but many false alarms

Medium Threshold (0.5)

Precision: 85%

Recall: 85%

Balanced - good tradeoff

High Threshold (0.7)

Precision: 95%

Recall: 60%

Conservative - very accurate but misses many
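The threshold effect can be reproduced on a handful of scores. The probabilities and labels below are hypothetical, so the resulting percentages differ from the illustrative 60/85/95 figures above, but the direction of the tradeoff is the same: raising the threshold lifts precision and lowers recall.

```python
# Hypothetical predicted probabilities and true labels, made up for illustration
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute (precision, recall)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Assumed convention: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.71, recall=1.00
# threshold=0.5: precision=0.80, recall=0.80
# threshold=0.7: precision=1.00, recall=0.60
```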

🎯 F1-Score: Harmonic Mean of Precision & Recall

Formula

F1 = 2 × (Precision × Recall) / (Precision + Recall)

= Harmonic mean of precision and recall

Why harmonic mean (not arithmetic)?

• Arithmetic mean: (P + R) / 2

Problem: High if either is high (P=100%, R=10% → 55%)

• Harmonic mean: 2PR / (P+R)

Better: Only high if both are high (P=100%, R=10% → 18%)

→ Penalizes extreme imbalance between P and R

Example:

Precision = 85%, Recall = 89.5%

F1 = 2 × (0.85 × 0.895) / (0.85 + 0.895)

F1 = 2 × 0.761 / 1.745 = 87.2%

When to Use F1-Score

✅ Need balance: Both FP and FN matter

✅ Imbalanced classes: Better than accuracy

✅ Single metric: Summarize model performance

✅ Can't decide: Which error is worse?

Example: General classification tasks, model comparison

⚡ F1 Variations

• F1-Score: Equal weight to precision and recall (β=1)

• F2-Score: Weight recall 2× more (β=2) - for recall-critical tasks

• F0.5-Score: Weight precision 2× more (β=0.5) - for precision-critical tasks

Fβ = (1+β²) × (P×R) / (β²×P + R)
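The Fβ formula above, sketched as a function with F1 as the β=1 special case. Returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention; the formula is undefined here
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.85, 85 / 95  # precision and recall from the running example
print(f"F1   = {f_beta(p, r):.3f}")            # F1   = 0.872
print(f"F2   = {f_beta(p, r, beta=2):.3f}")    # F2   = 0.885
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # F0.5 = 0.859

# The harmonic mean punishes lopsided inputs
print(f"{f_beta(1.0, 0.1):.2f}")  # 0.18
```

Because recall (89.5%) is higher than precision (85%) in this example, F2 lands above F1 and F0.5 lands below it.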

📊 Metric Comparison Table

Metric	Formula	Best For	Limitation
Accuracy	(TP+TN)/(Total)	Balanced classes	Misleading when imbalanced
Precision	TP/(TP+FP)	Minimize false alarms	Ignores false negatives
Recall	TP/(TP+FN)	Catch all positives	Ignores false positives
F1-Score	2PR/(P+R)	Balance both, imbalanced data	Hides which metric is weak

🔑 Key Insight

No single metric is best for everything! Choose based on your problem: What's the cost of FP vs FN? Are classes balanced? Report multiple metrics to tell the full story.

💡 Practical Tip

Always report precision, recall, AND F1 together. Don't rely on one metric alone - each tells a different part of the performance story!

2. Classification Metrics

🎯 Interactive: Calculate Key Metrics

Accuracy: 87.5%

Formula: (TP + TN) / (TP + TN + FP + FN)

When to Use: Balanced classes, overall correctness matters

ROC Curves & AUC: Threshold-Independent Evaluation

📈 Understanding ROC Curves

🎯 What Is a ROC Curve?

Full Name & History

ROC: Receiver Operating Characteristic

• Developed in WWII for radar signal detection
• "Receiver" = radar operator distinguishing signal from noise
• Now used universally for binary classification evaluation

Core Concept

ROC curve plots TPR vs FPR at all possible thresholds

• X-axis: False Positive Rate (FPR) = FP / (FP + TN)

How many negatives misclassified as positive

• Y-axis: True Positive Rate (TPR) = TP / (TP + FN)

Same as Recall - how many positives correctly identified

• Each point: One threshold value (0.0 to 1.0)

Why ROC Curves Matter

✅ Threshold-independent: Shows performance across all thresholds

✅ Visual comparison: Easy to compare different models

✅ Cost-agnostic: No assumption about FP vs FN costs

✅ Works with imbalanced data: TPR/FPR unaffected by class ratio

📊 Reading a ROC Curve

Key Reference Points

(0, 0) - Bottom Left

• Threshold = 1.0
• Predicts nothing as positive
• TPR=0%, FPR=0%
• All predictions negative

(1, 1) - Top Right

• Threshold = 0.0
• Predicts everything as positive
• TPR=100%, FPR=100%
• All predictions positive

(0, 1) - Top Left (Perfect!)

• TPR=100%, FPR=0%
• Catches all positives
• No false alarms
• Perfect classifier

Diagonal Line (Random)

• TPR = FPR
• AUC = 0.5
• Random guessing
• No discrimination

How to Interpret the Curve

• Curve hugs top-left corner: Excellent model (high TPR, low FPR)

• Curve follows diagonal: Random model (useless)

• Curve below diagonal: Model worse than random (inverted predictions!)

→ Higher curve = better model across all thresholds

🔢 AUC: Area Under the Curve

What Is AUC?

AUC = Total area under the ROC curve

Ranges from 0.0 to 1.0 (higher is better)

Intuitive Interpretation:

AUC = Probability that classifier ranks a random positive sample higher than a random negative sample

Example: AUC = 0.85

If you pick one positive and one negative sample at random, there's an 85% chance the model gives the positive sample a higher predicted probability than the negative sample.
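The ranking interpretation can be checked directly by comparing every positive score against every negative score. This is a brute-force sketch for small data, with ties counted as half a win (a common convention, assumed here); the scores and labels are hypothetical.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half (assumed convention)
    return wins / (len(pos) * len(neg))

# Hypothetical model outputs, made up for illustration
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_by_ranking(scores, labels)
print(f"AUC = {auc:.3f}")  # AUC = 0.889
```

Of the 9 positive-negative pairs, 8 are ordered correctly; the only mistake is the negative at 0.7 outranking the positive at 0.6.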

AUC Score Guidelines

0.5-0.6: Poor (barely better than random)

0.6-0.7: Fair (some discrimination)

0.7-0.8: Good (acceptable for many tasks)

0.8-0.9: Excellent (strong discrimination)

0.9-1.0: Outstanding (near-perfect)

💡 Why AUC Is Powerful

• Single number: Summarizes entire ROC curve

• Threshold-free: No need to pick optimal threshold

• Insensitive to class ratio: TPR and FPR are each computed within a single class, though the score can look optimistic on a 99-1 split

• Easy comparison: Compare models with one metric

⚡ TPR vs FPR: The Tradeoff

Formula Definitions

TPR = TP / (TP + FN)

• True Positive Rate (Recall, Sensitivity)
• Of actual positives, % correctly identified
• Want HIGH: Catch all positives!

FPR = FP / (FP + TN)

• False Positive Rate
• Of actual negatives, % incorrectly flagged positive
• Want LOW: Minimize false alarms!

Threshold Effect Example

Scenario: Disease diagnosis with 100 sick, 900 healthy

Low Threshold (0.2)

TP=95, FN=5

FP=300, TN=600

TPR = 95/100 = 95%

FPR = 300/900 = 33%

Aggressive: Catch most cases but many false alarms

Med Threshold (0.5)

TP=85, FN=15

FP=90, TN=810

TPR = 85/100 = 85%

FPR = 90/900 = 10%

Balanced: Good tradeoff

High Threshold (0.8)

TP=60, FN=40

FP=18, TN=882

TPR = 60/100 = 60%

FPR = 18/900 = 2%

Conservative: Few false alarms but miss many cases

→ Each threshold gives one (FPR, TPR) point on ROC curve
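The three thresholds above can be turned into ROC points and a rough AUC. This is only a coarse sketch: it connects three measured points plus the (0, 0) and (1, 1) endpoints with straight lines and applies the trapezoidal rule, whereas a real ROC curve uses every distinct threshold.

```python
# (TP, FN, FP, TN) at each threshold, copied from the example above
matrices = {
    0.2: (95, 5, 300, 600),
    0.5: (85, 15, 90, 810),
    0.8: (60, 40, 18, 882),
}

points = [(0.0, 0.0), (1.0, 1.0)]  # endpoints of every ROC curve
for threshold, (tp, fn, fp, tn) in matrices.items():
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    points.append((fpr, tpr))
points.sort()  # order by FPR, left to right

# Trapezoidal rule over the piecewise-linear curve through these points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"approximate AUC = {auc:.3f}")  # approximate AUC = 0.924
```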

📊 ROC vs Precision-Recall Curve

ROC Curve (TPR vs FPR)

✅ Imbalance-robust: Works with 99-1 split

✅ Standard in ML: Widely used & understood

✅ Intuitive: TPR↑ good, FPR↓ good

⚠️ Can be optimistic: With heavy imbalance

Use when: Classes moderately balanced, or want standard metric

PR Curve (Precision vs Recall)

✅ Focus on positives: Both metrics use TP

✅ Better for imbalanced: Emphasizes rare class

✅ Informative: Shows precision-recall tradeoff

⚠️ Less intuitive: Harder to interpret

Use when: Heavy imbalance (99-1), positive class critical

When Classes Are Imbalanced

With 99% negative, 1% positive:
• ROC-AUC: May look good (0.95) because TN is huge → FPR stays low
• PR-AUC: More realistic (0.60) because precision considers FP directly
→ For rare events (fraud, disease), report both ROC-AUC and PR-AUC

🔑 Key Insight

ROC-AUC is threshold-independent - one number summarizes performance across all thresholds. Perfect for model comparison without committing to a threshold!

💡 Practical Tip

AUC 0.8+ is generally good, but for critical applications (medical, security), aim for 0.9+. Use PR-AUC too if classes are heavily imbalanced!

3. ROC Curve & AUC

📈 Interactive: Receiver Operating Characteristic

Classification Threshold: 0.50 (slider: 0.0 = All Positive, 0.5 = Balanced, 1.0 = All Negative)

ROC Space plot: FPR (False Positive Rate) on the x-axis, TPR (True Positive Rate) on the y-axis

True Positive Rate (Recall): 85%

False Positive Rate: 15%

AUC (Area Under Curve): 0.92, Outstanding (0.9-1.0)

💡 ROC Curve: Shows tradeoff between TPR and FPR at different thresholds. Higher AUC = better model. Random classifier = 0.5, perfect = 1.0.

4. Precision-Recall Tradeoff

⚖️ Interactive: Balance the Tradeoff

Decision Threshold: 0.50

Precision: 85% (of predicted positives, how many are correct?)

Recall: 85% (of actual positives, how many did we catch?)

⚖️ Balanced threshold: Good tradeoff between precision and recall.

Cross-Validation: Robust Performance Estimation

🔄 Why Cross-Validation?

❌ The Problem with Single Train-Test Split

High Variance in Performance Estimate

Problem: Performance depends heavily on which samples end up in test set!

Example: 100 samples, 80-20 split

• Split A: Test set happens to be "easy" → 95% accuracy

• Split B: Test set happens to be "hard" → 78% accuracy

• Split C: Test set is "typical" → 86% accuracy

→ Which is the "true" performance? 95%? 78%? 86%? We don't know!

Other Issues with Single Split

• Wastes data: 20% is held out and never used for training

• Lucky/unlucky splits: Random chance affects results

• No confidence interval: Can't estimate uncertainty

• Overfitting risk: Might accidentally select model that works well on that specific test set

✅ K-Fold Cross-Validation: The Solution

Core Idea

Split data into K equal folds, use each fold as test set exactly once

K-Fold Cross-Validation Algorithm:

1. Shuffle dataset randomly

2. Split into K equal-sized folds

3. for i = 1 to K:

• Use fold i as test set

• Use remaining K-1 folds as training set

• Train model on training set

• Evaluate on test set → score_i

4. Final score = mean(score_1, ..., score_K)

5. Report std dev for uncertainty
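Steps 1-3 of the algorithm can be sketched with the standard library alone. The function name `k_fold_indices` and the fixed `seed` are illustrative choices; training and scoring a model on each split is left to the caller.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # step 1: shuffle
    # step 2: k folds; sizes differ by at most 1 if k does not divide n_samples
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):            # step 3: each fold is the test set once
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=100, k=5))
for train, test in splits:
    print(len(train), len(test))  # 80 20 on every line
```

Fitting a model on `train`, scoring it on `test`, and collecting the K scores then gives the inputs for steps 4 and 5.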

Example: 5-Fold CV with 100 Samples

Fold 1: Train on samples 21-100, test on samples 1-20 → Acc = 87%

Fold 2: Train on 1-20, 41-100, test on 21-40 → Acc = 85%

Fold 3: Train on 1-40, 61-100, test on 41-60 → Acc = 89%

Fold 4: Train on 1-60, 81-100, test on 61-80 → Acc = 86%

Fold 5: Train on 1-80, test on 81-100 → Acc = 88%

Final: 87.0% ± 1.4% (mean ± std dev)

→ Every sample used for training 4 times and testing 1 time!
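The summary line can be reproduced from the five fold accuracies. One detail worth knowing: the 1.4% figure matches the population standard deviation (dividing by K); the sample standard deviation (dividing by K-1) is slightly larger. Which one a given tool reports is something to check rather than assume.

```python
import statistics

fold_scores = [0.87, 0.85, 0.89, 0.86, 0.88]  # fold accuracies from the example above

mean = statistics.mean(fold_scores)
std_population = statistics.pstdev(fold_scores)  # divides by K
std_sample = statistics.stdev(fold_scores)       # divides by K - 1

print(f"{mean:.1%} ± {std_population:.1%} (population std)")  # 87.0% ± 1.4%
print(f"{mean:.1%} ± {std_sample:.1%} (sample std)")          # 87.0% ± 1.6%
```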

🔢 Choosing K: How Many Folds?

K = 3-5 (Small K)

✅ Faster: Train 3-5 times

✅ Good for large datasets: Saves time

⚠️ Higher variance: Fewer estimates

⚠️ Less training data per split: Each model sees only 67-80% of the data, so scores lean pessimistic (higher bias)

Use when: Large dataset (>10K samples), computational constraints

K = 10 (Standard)

✅ Industry standard: Most common choice

✅ Good bias-variance balance: Proven empirically

✅ Reliable: Stable estimates

⚠️ 10× training time: Moderate cost

Use when: Default choice, medium datasets (100-10K samples)

K = N (LOOCV)

✅ Maximum data: N-1 samples for training

✅ No randomness: Deterministic

✅ Low bias: Almost all data used

⚠️ Expensive: Train N times!

⚠️ High variance: Training sets overlap almost entirely, so the N scores are highly correlated

Use when: Tiny datasets (<100 samples), need max data utilization

💡 Practical Recommendation

• Default: K = 10 (or K = 5 for large datasets)
• Small data (<100): K = 10 or LOOCV
• Large data (>10K): K = 3-5 or single train-test split (80-20)

⚖️ Stratified K-Fold: For Imbalanced Data

The Problem with Regular K-Fold

With imbalanced classes (e.g., 90-10 split), random folding might create skewed folds:

Example: 100 samples (90 negative, 10 positive), K=5

• Fold 1: 20 negative, 0 positive (0% positive!) ✗

• Fold 2: 18 negative, 2 positive (10% positive) ✓

• Fold 3: 17 negative, 3 positive (15% positive) ⚠️

• Fold 4: 19 negative, 1 positive (5% positive) ⚠️

• Fold 5: 16 negative, 4 positive (20% positive!) ✗

→ Folds have very different class distributions!

Stratified K-Fold Solution

Key idea: Ensure each fold has the same class distribution as the full dataset

Stratified Splitting:

1. Separate samples by class (90 negative, 10 positive)

2. Split each class into K folds independently

• Negatives: 18 per fold (90/5)

• Positives: 2 per fold (10/5)

3. Combine: Each fold has 18 negative + 2 positive = 10% positive ✓

→ All folds now have consistent 90-10 split!
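The stratified splitting steps can be sketched by dealing each class into the folds round-robin. This is an illustrative stdlib version under the assumption that class counts divide evenly by K (as in the 90-10 example); it is not a reimplementation of any library's splitter.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold mirrors the class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)            # step 1: separate by class

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)    # step 2: deal each class round-robin
    return folds                               # step 3: each fold mixes all classes

labels = [0] * 90 + [1] * 10  # 90 negative, 10 positive
folds = stratified_folds(labels, k=5)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # 20 2 on every line
```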

🎯 When to Use Stratified K-Fold

• Imbalanced classes: Always! (Even mild imbalance like 70-30)
• Classification tasks: Stratified splitting is the sensible default
• Small datasets: Especially important with few samples

scikit-learn default: cross_val_score with an integer cv uses StratifiedKFold when the estimator is a classifier

📊 Reporting Cross-Validation Results

What to Report

✅ Mean score: Average performance across folds

✅ Standard deviation: Variability in performance

✅ Min/Max scores: Best and worst fold

✅ Number of folds: E.g., "10-fold CV"

Example Report:

"10-fold cross-validation accuracy: 87.2% ± 2.3%"

(range: 83.5% to 91.0%)

Interpreting Standard Deviation

• Low std (<2%): Stable model, consistent performance

• Medium std (2-5%): Acceptable variance

• High std (>5%): Unstable model or very small dataset

High std suggests model is sensitive to training data → consider more data, regularization, or simpler model

⚡ Benefits Summary

✅ Advantages

• Robust estimate: Uses all data for testing

• Reduces variance: Averages K evaluations

• Uncertainty estimate: Std dev across folds shows how much the score varies

• Data efficient: Every sample used for training and testing

• Detects overfitting: If CV score << train score

⚠️ Disadvantages

• Computational cost: K× training time

• Not for time series: Violates temporal order

• Correlated estimates: Training sets overlap

For time series: Use TimeSeriesSplit or walk-forward validation instead
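A minimal sketch of walk-forward validation with an expanding training window: each split trains only on the past and tests on the next block. The function name and the equal block sizes are assumptions for illustration, and this does not reproduce the exact split sizes of scikit-learn's TimeSeriesSplit.

```python
def walk_forward_splits(n_samples, n_splits):
    """Expanding-window splits: train on everything before the test block."""
    block = n_samples // (n_splits + 1)  # size of each test block
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        test = list(range(i * block, (i + 1) * block))
        yield train, test

splits = list(walk_forward_splits(n_samples=12, n_splits=3))
for train, test in splits:
    print(train, "->", test)
# [0, 1, 2] -> [3, 4, 5]
# [0, 1, 2, 3, 4, 5] -> [6, 7, 8]
# [0, 1, 2, 3, 4, 5, 6, 7, 8] -> [9, 10, 11]
```

Unlike shuffled K-fold, no test index ever precedes a training index, so the model is never evaluated on data older than what it learned from.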

🔑 Key Insight

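Cross-validation is the gold standard for estimating performance from limited data. Use it (typically K=5 or K=10) whenever you can afford to train K times and your samples are not time-ordered. The estimate is far more reliable than a single split, though it still leans slightly pessimistic because each model trains on less than the full dataset.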

💡 Practical Tip

Use StratifiedKFold by default for classification. Report mean ± std dev. If std is high, your model might be unstable or you need more data!

5. K-Fold Cross-Validation

🔄 Interactive: Split Your Data

Number of Folds (K): 5

Fold 1: Test | Train

Fold 2: Train | Test | Train

Fold 3: Train | Test | Train

Fold 4: Train | Test | Train

Fold 5: Train | Test

Avg Accuracy: 87.7%

Std Dev: 1.12

Train Time: 4.0s

🔄 Why K-Fold? Each data point is used for both training and testing. More folds = better estimate but slower. K=5 or K=10 is typical.

6. Bias-Variance Tradeoff

🎯 Interactive: Find the Sweet Spot

Model Complexity: 5 (slider: Simple (High Bias) → Optimal → Complex (High Variance))

Bias (Underfitting): 30%

Variance (Overfitting): 50%

Total Error: 80%

✅ Sweet Spot: Good balance between bias and variance. Optimal generalization!

7. Handling Class Imbalance

⚖️ Interactive: Imbalanced Datasets

Class Ratio: 50% Positive / 50% Negative

Imbalance Ratio: 1:1

Recommended Metric: Accuracy OK

💡 Solutions:

• Oversample minority class (SMOTE)
• Undersample majority class
• Use class weights in loss function
• Choose right metric (F1, PR-AUC)
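As an illustration of the class-weights option, one common heuristic weights each class by n_samples / (n_classes × class count), so rare classes count for more in the loss. The heuristic and the function name are assumptions brought in here, not something the list above specifies.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

labels = [0] * 990 + [1] * 10  # hypothetical 99-1 split
weights = inverse_frequency_weights(labels)
print(weights)  # class 1 gets weight 50.0, class 0 about 0.505
```

With these weights, one misclassified minority sample costs roughly as much as 99 misclassified majority samples, which offsets the 99-1 imbalance.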

8. Choosing the Right Metric

🎯 Interactive: Match Metric to Problem

📧

Spam Detection

✅ Best Metric: Precision

Avoid false positives (legitimate emails marked as spam)

🎯 Priority:

Avoid annoying users with false alarms

9. Detecting Overfitting

🔍 Interactive: Train vs Test Performance

Training Accuracy: 95%

Test Accuracy: 85%

Performance Gap: 10%

Slight Overfit: Acceptable gap

10. Learning Curves

📊 Interactive: Data Size vs Performance

Training Dataset Size: 1000 samples

Learning Curve plot: Training Score and Validation Score versus training dataset size

📈 Insight: Moderate dataset - curves converging. More data would still help.

🎯 Key Takeaways

📊

Confusion Matrix First

Start with TP, FP, FN, TN. Everything else (accuracy, precision, recall, F1) derives from these four numbers. Visualize it!

🎯

Match Metric to Problem

Spam detection? Precision. Cancer screening? Recall. Balanced problem? F1-score or ROC-AUC. Accuracy is often misleading!

📈

ROC-AUC for Binary Classification

ROC curve shows performance across all thresholds. AUC = 0.5 (random), 0.7-0.8 (good), 0.8-0.9 (excellent), 0.9+ (outstanding). Threshold-independent metric.

🔄

Always Cross-Validate

Single train-test split is unreliable. Use K-fold (K=5 or 10) to get robust performance estimate. Report mean ± std dev.

⚖️

Watch for Overfitting

Train accuracy >> test accuracy? Overfitting. Use regularization, more data, simpler model, or dropout. Gap <5% is healthy.

⚖️

Handle Imbalanced Classes

99-1 split? Don't use accuracy! Use F1-score, PR-AUC, or apply SMOTE/class weights. Minority class matters most.