
Theoretical Foundations: PAC Learning, MDL & Information Bottleneck

Canonical Papers

A Theory of the Learnable

Valiant, 1984, CACM

Modeling by Shortest Data Description

Rissanen, 1978, Automatica

Deep Learning and the Information Bottleneck Principle

Tishby & Zaslavsky, 2015, IEEE ITW

Core Mathematics

PAC learning: a concept class $\mathcal{C}$ is PAC-learnable if an algorithm can use $n = O\!\left(\frac{1}{\epsilon}\left(d \log \tfrac{1}{\epsilon} + \log \tfrac{1}{\delta}\right)\right)$ examples to output a hypothesis with error $\le \epsilon$ with probability $\ge 1-\delta$, where $d$ is the VC dimension.
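
As a quick numeric sketch (ours, not from Valiant's paper), the bound can serve as a back-of-the-envelope sample-size estimate once a constant is chosen. The function name `pac_sample_bound` and the constant `c` are illustrative assumptions; the theorem fixes the bound only up to a universal constant.

```python
import math

def pac_sample_bound(epsilon: float, delta: float, vc_dim: int, c: float = 1.0) -> int:
    """Illustrative PAC sample-size estimate for a class of VC dimension vc_dim.

    Returns roughly (c / epsilon) * (vc_dim * log(1/epsilon) + log(1/delta)).
    The constant c is a placeholder: the theorem only pins the bound down up to
    a universal constant.
    """
    n = (c / epsilon) * (vc_dim * math.log(1.0 / epsilon) + math.log(1.0 / delta))
    return math.ceil(n)

# Example: axis-aligned rectangles in the plane have VC dimension 4.
print(pac_sample_bound(epsilon=0.05, delta=0.01, vc_dim=4))
```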

MDL: choose the hypothesis $H$ minimizing $L(D, H) = L(H) + L(D \mid H)$.
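
To make the two-part code concrete, here is a minimal sketch that scores a model by the bits needed to describe it ($L(H)$) plus the bits needed to describe the data given the model ($L(D \mid H)$). The coding choices below (a fixed bit budget per parameter, a Gaussian residual code) are illustrative assumptions, not the specific codes analyzed by Rissanen.

```python
import numpy as np

def two_part_code_length(y, y_hat, n_params, bits_per_param=32.0):
    """Crude two-part description length L(H) + L(D|H), in bits.

    L(H):   n_params coded at a fixed precision (an assumed coding scheme).
    L(D|H): Gaussian negative log-likelihood of the residuals, converted to
            bits; with a continuous likelihood this term can be negative, but
            only the comparison between models matters here.
    """
    resid = np.asarray(y) - np.asarray(y_hat)
    sigma2 = max(resid.var(), 1e-12)
    nll_nats = 0.5 * resid.size * (np.log(2 * np.pi * sigma2) + 1.0)
    return n_params * bits_per_param + nll_nats / np.log(2.0)

# Compare a straight line against a degree-9 polynomial on the same noisy data.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)
for deg in (1, 9):
    y_hat = np.polyval(np.polyfit(x, y, deg), x)
    print(deg, round(two_part_code_length(y, y_hat, n_params=deg + 1), 1))
```

On this synthetic data the simpler model should come out with the shorter total description, since the extra parameters of the degree-9 fit are not repaid by a correspondingly shorter residual code.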

Information Bottleneck: learn a representation $Z$ of the input $X$ that keeps information about the target $Y$ while compressing $X$:

$$\max_{p(z \mid x)} \; I(Z;Y) - \beta\, I(Z;X)$$
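
For discrete variables the objective can be evaluated exactly. The sketch below, with a made-up joint $p(x, y)$ and a hand-picked encoder $p(z \mid x)$, is our illustration of evaluating the trade-off, not the iterative algorithm from the paper.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    p_joint = np.asarray(p_joint, dtype=float)
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])).sum())

def ib_objective(p_xy, p_z_given_x, beta):
    """Evaluate I(Z;Y) - beta * I(Z;X) for a stochastic encoder p(z|x).

    Z depends on (X, Y) only through X, so
        p(z, x) = p(z|x) p(x)   and   p(z, y) = sum_x p(z|x) p(x, y).
    """
    p_xy = np.asarray(p_xy, dtype=float)        # shape (|X|, |Y|)
    enc = np.asarray(p_z_given_x, dtype=float)  # shape (|Z|, |X|), columns sum to 1
    p_x = p_xy.sum(axis=1)                      # shape (|X|,)
    p_zx = enc * p_x                            # joint over (Z, X)
    p_zy = enc @ p_xy                           # joint over (Z, Y)
    return mutual_information(p_zy) - beta * mutual_information(p_zx)

# Toy joint over X in {0,1,2}, Y in {0,1}, and a 2-state Z that merges x=0 with x=1.
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.10],
                 [0.05, 0.25]])
p_z_given_x = np.array([[1.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
print(ib_objective(p_xy, p_z_given_x, beta=0.1))
```

Sweeping `beta` trades compression against predictiveness: small values favor encoders that keep more of $X$, large values favor coarser $Z$.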


Why It Matters for Modern Models

  • PAC/VC theory gives language and bounds for generalization and sample complexity
  • MDL/compression perspectives underlie "good models should be simple and predictive"
  • The information bottleneck motivates representation compression and offers an account of what intermediate layers learn

Missing Intuition

What is still poorly explained in textbooks and papers:

  • How to connect formal sample-complexity bounds to actual overparameterized models that interpolate
  • Intuitive, visual examples of MDL in deep nets (comparing code lengths of different architectures)

Next Moves

Explore this concept from different angles — like a mathematician would.