
Theoretical Foundations: PAC Learning, MDL & Information Bottleneck

Canonical Papers

A Theory of the Learnable

Valiant, 1984, CACM

Modeling by Shortest Data Description

Rissanen, 1978, Automatica

Deep Learning and the Information Bottleneck Principle

Tishby & Zaslavsky, 2015, IEEE ITW

Core Mathematics

PAC learning: a concept class $\mathcal{C}$ is PAC-learnable if an algorithm can use $n = O\!\left(\frac{1}{\epsilon}\left(d \log \tfrac{1}{\epsilon} + \log \tfrac{1}{\delta}\right)\right)$ examples to output a hypothesis with error $\le \epsilon$ with probability $\ge 1-\delta$, where $d$ is the VC dimension.
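
As a quick numeric sketch (ours, not from Valiant's paper), the bound can serve as a back-of-the-envelope sample-size estimate once a constant is chosen. The function name `pac_sample_bound` and the constant `c` are illustrative assumptions; the theorem fixes the bound only up to a universal constant.

```python
import math

def pac_sample_bound(epsilon: float, delta: float, vc_dim: int, c: float = 1.0) -> int:
    """Illustrative PAC sample-size estimate for a class of VC dimension vc_dim.

    Returns roughly (c / epsilon) * (vc_dim * log(1/epsilon) + log(1/delta)).
    The constant c is a placeholder: the theorem only pins the bound down up to
    a universal constant.
    """
    n = (c / epsilon) * (vc_dim * math.log(1.0 / epsilon) + math.log(1.0 / delta))
    return math.ceil(n)

# Example: axis-aligned rectangles in the plane have VC dimension 4.
print(pac_sample_bound(epsilon=0.05, delta=0.01, vc_dim=4))
```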

MDL: choose the hypothesis $H$ minimizing $L(D, H) = L(H) + L(D \mid H)$.
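
To make the two-part code concrete, here is a minimal sketch that scores a model by the bits needed to describe it ($L(H)$) plus the bits needed to describe the data given the model ($L(D \mid H)$). The coding choices below (a fixed bit budget per parameter, a Gaussian residual code) are illustrative assumptions, not the specific codes analyzed by Rissanen.

```python
import numpy as np

def two_part_code_length(y, y_hat, n_params, bits_per_param=32.0):
    """Crude two-part description length L(H) + L(D|H), in bits.

    L(H):   n_params coded at a fixed precision (an assumed coding scheme).
    L(D|H): Gaussian negative log-likelihood of the residuals, converted to
            bits; with a continuous likelihood this term can be negative, but
            only the comparison between models matters here.
    """
    resid = np.asarray(y) - np.asarray(y_hat)
    sigma2 = max(resid.var(), 1e-12)
    nll_nats = 0.5 * resid.size * (np.log(2 * np.pi * sigma2) + 1.0)
    return n_params * bits_per_param + nll_nats / np.log(2.0)

# Compare a straight line against a degree-9 polynomial on the same noisy data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)
for deg in (1, 9):
    y_hat = np.polyval(np.polyfit(x, y, deg), x)
    print(deg, round(two_part_code_length(y, y_hat, n_params=deg + 1), 1))
```

On this synthetic data the simpler model should come out with the shorter total description, since the extra parameters of the degree-9 fit are not repaid by a correspondingly shorter residual code.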

Information Bottleneck: learn a representation $Z$ of the input $X$ that keeps information about the target $Y$ while compressing $X$:

$$\max_{p(z \mid x)} \; I(Z;Y) - \beta\, I(Z;X)$$
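
For discrete variables the objective can be evaluated exactly. The sketch below, with a made-up joint $p(x, y)$ and a hand-picked encoder $p(z \mid x)$, is our illustration of evaluating the trade-off, not the iterative algorithm from the paper.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    p_joint = np.asarray(p_joint, dtype=float)
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])).sum())

def ib_objective(p_xy, p_z_given_x, beta):
    """Evaluate I(Z;Y) - beta * I(Z;X) for a stochastic encoder p(z|x).

    Z depends on (X, Y) only through X, so
        p(z, x) = p(z|x) p(x)   and   p(z, y) = sum_x p(z|x) p(x, y).
    """
    p_xy = np.asarray(p_xy, dtype=float)        # shape (|X|, |Y|)
    enc = np.asarray(p_z_given_x, dtype=float)  # shape (|Z|, |X|), columns sum to 1
    p_x = p_xy.sum(axis=1)                      # shape (|X|,)
    p_zx = enc * p_x                            # joint over (Z, X)
    p_zy = enc @ p_xy                           # joint over (Z, Y)
    return mutual_information(p_zy) - beta * mutual_information(p_zx)

# Toy joint over X in {0,1,2}, Y in {0,1}, and a 2-state Z that merges x=0 with x=1.
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.10],
                 [0.05, 0.25]])
p_z_given_x = np.array([[1.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
print(ib_objective(p_xy, p_z_given_x, beta=0.1))
```

Sweeping `beta` trades compression against predictiveness: small values favor encoders that keep more of $X$, large values favor coarser $Z$.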


Why It Matters for Modern Models

  • PAC/VC theory gives language and bounds for generalization and sample complexity
  • MDL/compression perspectives underlie "good models should be simple and predictive"
  • The information bottleneck motivates representation compression and offers an account of what intermediate layers learn

Missing Intuition

What is still poorly explained in textbooks and papers:

  • How to connect formal sample-complexity bounds to actual overparameterized models that interpolate
  • Intuitive, visual examples of MDL in deep nets (comparing code lengths of different architectures)

Next Moves

Explore this concept from different angles — like a mathematician would.