Natural Language Processing
Process text, extract meaning, and build language understanding
What is Natural Language Processing?
Natural Language Processing (NLP) is the bridge between human communication and machine understanding. It enables computers to read, interpret, and generate human language, powering everything from chatbots to translation services.
💡 The Core Challenge
Tokenization: From Text to Processable Units
🔤 Why Tokenization Matters
The Fundamental Problem
⚖️ Three Approaches & Their Trade-offs
🔧 Subword Algorithms
📊 Vocabulary Size Impact
⚡ Modern Practice
🎯 Key Insight
1. Tokenization: Breaking Text Apart
✂️ Interactive: Tokenize Text
Tokenization splits text into smaller units (tokens). Different methods serve different purposes!
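As a rough illustration of the three approaches (word, character, subword), here is a minimal Python sketch. The function names and the tiny subword vocabulary are assumptions made for this example; production subword tokenizers such as BPE or WordPiece learn their vocabulary from data rather than receiving it by hand.

```python
import re

def word_tokenize(text):
    """Split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Treat every character (including spaces) as one token."""
    return list(text)

def toy_subword_tokenize(word, vocab):
    """Greedy longest-match split of one word against a hypothetical vocabulary.

    Falls back to a single character when no longer piece is in the vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary or one char long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(word_tokenize("Hello, world!"))
print(char_tokenize("cat"))
print(toy_subword_tokenize("unhappiness", {"un", "happi", "ness"}))
```

Word tokens keep sequences short but cannot represent unseen words, character tokens never fail but produce long sequences, and subword pieces sit between the two.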
Word Embeddings: Capturing Semantic Meaning
🧠 Distributional Semantics
The Core Hypothesis
"You shall know a word by the company it keeps"
— J.R. Firth (1957)
📊 From One-Hot to Dense Vectors
🔬 Word2Vec: Learning Embeddings from Context
Target: "cat"
Context: "The", "sat", "on", "mat"
Negatives: ("cat", "airplane"), ("cat", "democracy") → minimize
🌐 GloVe: Global Vectors for Word Representation
// X_ij = how often words i and j co-occur
⚡ Semantic Relationships & Vector Arithmetic
// 1 = same direction, 0 = orthogonal, -1 = opposite direction
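Cosine similarity is the usual way to compare embedding vectors on this scale. The sketch below uses made-up 2-dimensional vectors purely for illustration; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not real embeddings.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because the score depends only on direction, two vectors of very different length can still be maximally similar.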
📦 Static vs Contextual
🎯 Pre-trained Embeddings
💡 Key Insight
2. Word Embeddings: Meaning as Vectors
🧮 Interactive: Explore Word Vectors
Words with similar meanings have similar vectors. "King" and "Queen" are close in vector space!
🎯 Key Insight: Word2Vec, GloVe, and FastText learn these embeddings from massive text corpora. Similar contexts → similar vectors!
3. Sentiment Analysis: Understanding Emotion
😊 Interactive: Analyze Sentiment
🔍 How it Works: Sentiment analysis classifies the emotional tone of a text. Approaches can be rule-based (sentiment lexicons), classical ML-based (Naive Bayes, SVM), or deep learning-based (BERT, RoBERTa).
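The rule-based end of that spectrum can be sketched in a few lines. The word lists below are a tiny hypothetical lexicon invented for this example, and the one-word negation rule is a deliberate simplification; real lexicons hold thousands of scored entries.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum word polarities, flipping a word's polarity if a negator precedes it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)  # +1, -1, or 0
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great movie"))
print(lexicon_sentiment("This is not good"))
```

Rules like these are transparent and need no training data, but they miss sarcasm and context, which is where the learned models mentioned above tend to do better.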
4. Named Entity Recognition (NER)
🏷️ Interactive: Extract Entities
NER identifies and classifies named entities: people, organizations, locations, dates, etc.
Apple Inc. is located in Cupertino, California and was founded by Steve Jobs in 1976.
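To make the input and output of NER concrete, here is a toy dictionary-lookup (gazetteer) sketch run on the sentence above. This is not how statistical or neural NER models work; the gazetteer entries, label names, and year pattern are assumptions chosen only to fit this one example.

```python
import re

# Hypothetical gazetteer: known entity strings mapped to labels.
GAZETTEER = {
    "Apple Inc.": "ORG",
    "Cupertino": "LOC",
    "California": "LOC",
    "Steve Jobs": "PERSON",
}

def extract_entities(text):
    """Return (span, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), name, label))
    # Treat four-digit years from 1800 to 2099 as dates (a crude assumption).
    for match in re.finditer(r"\b(?:1[89]|20)\d{2}\b", text):
        found.append((match.start(), match.group(), "DATE"))
    return [(span, label) for _, span, label in sorted(found)]

sentence = ("Apple Inc. is located in Cupertino, California "
            "and was founded by Steve Jobs in 1976.")
for span, label in extract_entities(sentence):
    print(f"{span} -> {label}")
```

A lookup table cannot handle unseen names or ambiguous ones ("Apple" the company versus the fruit), which is why practical systems learn entity boundaries and types from context.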
5. Text Classification
🗂️ Interactive: Classify Text
Automatically categorize text into predefined classes: spam/ham, topic, sentiment, intent, etc.
💡 Applications: Spam filtering, topic labeling, intent detection, priority routing, content moderation, language identification.
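One classical approach to spam/ham classification is multinomial Naive Bayes. The sketch below is a minimal from-scratch version with add-one smoothing; the four training messages are invented for illustration, and the function names are not taken from any library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(tokens, model):
    """Pick the label with the highest log prior plus smoothed log likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
training = [
    ("win free prize now".split(), "spam"),
    ("free money win".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
model = train_naive_bayes(training)
print(predict("free prize".split(), model))
print(predict("meeting tomorrow".split(), model))
```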
6. TF-IDF: Term Importance
📊 Interactive: Calculate TF-IDF
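TF-IDF measures how important a word is to a document relative to a collection of documents: term frequency (TF) rewards words that appear often in the document, and inverse document frequency (IDF) discounts words that appear in many documents. High TF-IDF = distinctive term!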
🎯 Use Case: TF-IDF helps search engines rank documents. Terms with high TF-IDF are most relevant to that specific document!
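A minimal sketch of the calculation, assuming the plain variant TF = count / document length and IDF = log(N / document frequency). Libraries often use smoothed or normalized variants, so their numbers will differ from this one. The three-document corpus is hypothetical.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc` (a token list) against `corpus` (a list of token lists)."""
    tf = Counter(doc)[term] / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Hypothetical toy corpus.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
for term in ["the", "cat", "sat"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

In this corpus "the" appears in every document and scores zero, while "sat" appears in only one and scores highest, which is the "distinctive term" behaviour in miniature.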
Attention Mechanism: Learning What Matters
🎯 Why Attention Revolutionized NLP
The RNN Bottleneck Problem
→ What is sleeping? Information gets lost!
🧮 Self-Attention Mathematics
Query from "sat": "Who performed this action?"
Keys from all words: "cat" has high match!
Value from "cat": Return cat's information
Q = X W^Q
K = X W^K // Each word gets 3 representations
V = X W^V // Learned during training
• √d_k: scaling factor (keeps dot products from growing with d_k, so the softmax does not saturate and gradients do not vanish)
• Result: (seq_len, seq_len) matrix of attention scores
• High scores → high attention
• Each word now has distribution over all other words
• Important words contribute more
• Result: context-aware representation
Similarity scores
Scaling factor
Weighted sum
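The steps above (similarity scores, scaling by √d_k, a softmax over each row, then a weighted sum of values) can be sketched in pure Python. This is an illustrative version that takes Q, K, and V as ready-made lists of lists; the learned projection matrices are left out, and the helper names are assumptions of this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention. Returns (output, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]           # transpose K
    scores = matmul(q, k_t)                        # (seq_len, seq_len) similarities
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]     # each row sums to 1
    return matmul(weights, v), weights             # weighted sum of value vectors

# Hypothetical 2-token input with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, x, x)
print(weights)
print(output)
```

Each row of `weights` is one word's distribution over all words in the sequence, and each row of `output` is the corresponding context-aware representation.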
🎭 Multi-Head Attention
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
⚡ Why Attention Beats RNNs
Attention: O(1) sequential ops, O(n²) parallel ops
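→ With modern GPUs, parallel O(n²) work often beats sequential O(n) steps in wall-clock time, although the quadratic cost becomes a bottleneck for very long sequences.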
🔍 Attention Variants
🎯 Real-World Scale
💡 Key Insight
7. Attention Mechanism
👁️ Interactive: Visualize Attention
Attention helps models focus on relevant words when processing text. Click a word to see what it attends to!
🧠 Transformers: Models like BERT and GPT use multi-head attention to capture different types of relationships simultaneously!
Language Models: Predicting the Next Word
📚 What is a Language Model?
Core Definition
A language model assigns probabilities to sequences of words
📊 Evolution: From N-grams to Neural Networks
Trigram (n=3): P(sat | the cat) = count("the cat sat") / count("the cat")
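The count-based estimate above can be sketched directly. The corpus below is a made-up nine-word example; the context count is taken from trigrams so that the probabilities of all continuations of a context sum to 1.

```python
from collections import Counter

def trigram_probability(tokens, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1 w2) from a token list."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    trigram_counts = Counter(trigrams)
    # Count each two-word context only where it is followed by a third word.
    context_counts = Counter((a, b) for a, b, _ in trigrams)
    if context_counts[(w1, w2)] == 0:
        return 0.0  # unseen context; real systems apply smoothing or backoff
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the cat ran".split()
print(trigram_probability(tokens, "the", "cat", "sat"))
print(trigram_probability(tokens, "the", "cat", "ran"))
```

Here "the cat" occurs twice as a context, once followed by "sat" and once by "ran", so each continuation gets probability 0.5. Any trigram absent from the corpus gets probability zero, which is the sparsity problem that motivates neural language models.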
P(w_t | context) = softmax(W h_t + b)
🎲 Sampling Strategies: Controlling Generation
Focused, deterministic
"The" → 0.95, "A" → 0.04
Balanced sampling
"The" → 0.60, "A" → 0.25
Creative, random
"The" → 0.35, "A" → 0.30
k=3 → Renormalize: [sunny: 0.47, cloudy: 0.29, rainy: 0.24]
Sometimes 2 words, sometimes 50 words (adapts to confidence!)
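The three sampling controls in this section (temperature, top-k, and top-p) can be sketched as small filters over a next-token distribution. The starting probabilities below are assumptions picked so that top-k with k=3 renormalizes to roughly the values shown above; the function names are illustrative and not taken from any library.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature. Lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize. probs: dict of token -> probability."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token distribution.
probs = {"sunny": 0.40, "cloudy": 0.25, "rainy": 0.20, "snowy": 0.10, "foggy": 0.05}
print({t: round(p, 2) for t, p in top_k_filter(probs, 3).items()})
print(list(top_p_filter(probs, 0.6)))
print([round(p, 2) for p in apply_temperature([2.0, 1.0, 0.0], 0.5)])
```

Top-k always keeps a fixed number of candidates, while top-p keeps few candidates when the model is confident and many when the distribution is flat.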
📈 Evaluation Metric: Perplexity
🚀 Modern LLMs: The GPT Architecture
• Objective: Predict next token (autoregressive)
• GPT-3: ~300B training tokens, ~$4.6M compute cost (third-party estimate)
• Result: General language understanding
• ChatGPT: RLHF (Reinforcement Learning from Human Feedback)
• Small dataset, quick training
• Result: Task-specific expert
Can only attend to previous tokens → autoregressive generation
GPT-3: 175B params, GPT-4: ~1.7T params (rumored)
Few-shot prompting without parameter updates!
Reasoning, math, code at sufficient scale
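The constraint that a GPT-style model can only attend to previous tokens is usually enforced with a causal mask on the attention scores. A minimal pure-Python sketch, assuming a square matrix of raw scores in which row i is the query at position i (the function name is an assumption of this example):

```python
import math

def causal_attention_weights(scores):
    """Mask future positions (j > i) with -inf, then apply softmax to each row."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # position i itself is never masked, so m is finite
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical uniform scores for a 3-token sequence.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print([round(w, 2) for w in row])
```

With uniform scores, the first token attends only to itself, the second splits its attention over the first two positions, and so on, which is what lets the model generate text one token at a time.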
🎯 Autoregressive LMs
🔄 Masked LMs
💡 Key Insight
8. Language Model Prediction
🤖 Interactive: Generate Text
Language models predict the next word based on context. Temperature controls randomness!
9. Build Your NLP Pipeline
🔧 Interactive: Design Processing Pipeline
Real NLP systems chain multiple steps. Build your pipeline by adding processing stages!
💡 Pipeline Flow: Text → Tokenization → Cleaning → Feature Extraction → Model → Output. Order matters! Preprocessing comes before modeling.
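The flow above can be sketched as a list of functions applied in order. Everything here is a stand-in: the stopword list is hypothetical, and `keyword_model` is a hand-written rule taking the place of a trained model.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "are", "you", "your", "at"}  # hypothetical list

def tokenize(text):
    """Text -> list of word tokens."""
    return re.findall(r"[A-Za-z']+", text)

def clean(tokens):
    """Lowercase tokens and drop stopwords."""
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def extract_features(tokens):
    """Tokens -> bag-of-words counts."""
    return Counter(tokens)

def keyword_model(features):
    """Stand-in for a trained classifier: flags two hypothetical spam keywords."""
    spam_hits = features["free"] + features["winner"]
    return "spam" if spam_hits > 0 else "ham"

def run_pipeline(text, steps):
    """Feed the output of each step into the next one."""
    data = text
    for step in steps:
        data = step(data)
    return data

PIPELINE = [tokenize, clean, extract_features, keyword_model]
print(run_pipeline("You are a WINNER! Claim your FREE prize", PIPELINE))
print(run_pipeline("See you at the meeting tomorrow", PIPELINE))
```

Reordering the list shows why order matters: `clean` expects tokens and `keyword_model` expects counts, so placing the model before feature extraction would fail.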
🎯 Key Takeaways
Tokenization is the Foundation
Breaking text into tokens is the first critical step. Word, character, or subword - each method has trade-offs. Modern models use subword tokenization (BPE, WordPiece).
Words Become Vectors
Word embeddings (Word2Vec, GloVe, FastText) convert text to numbers while preserving semantic relationships. Similar words have similar vectors in high-dimensional space.
Multiple NLP Tasks
Sentiment analysis, NER, classification, translation, summarization: NLP covers diverse tasks. Each may call for a different architecture, but most share common preprocessing steps.
Attention is Key
The attention mechanism revolutionized NLP. Transformers (BERT, GPT) use self-attention to capture context and relationships between all words simultaneously rather than sequentially.
Pipelines are Powerful
Real NLP systems chain preprocessing, feature extraction, and models into pipelines. spaCy, NLTK, and Hugging Face Transformers provide ready-to-use components.
Modern Era: LLMs
Large Language Models (GPT-4, Claude, LLaMA) are pre-trained on massive text corpora. They excel at few-shot learning, understanding context, and generating human-like text across domains.