Legacy Concept Lab

Adversarial Examples & Robustness

Reveals that neural networks are "right for the wrong reasons"—decision boundaries are brittle

Concept 44 of 100 · Theory · Phase 8

Phase 8: Scaling, theory & multimodal

Why It Matters for Modern Models

Adversarial examples show that neural networks can be "right for the wrong reasons": their decision boundaries are brittle, so a perturbation too small to notice can flip a prediction
Foundation for understanding model robustness, jailbreaks, and AI safety
Adversarial training remains the most reliable empirical defense, and robust models can generalize better under some kinds of distribution shift

What is still poorly explained in textbooks and papers:

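Adversarial examples exist because of high dimensionality: with thousands of input dimensions, there are many directions in which a small step crosses a decision boundary
Linear hypothesis: even linear models are vulnerable, because a perturbation $\epsilon \cdot \text{sign}(w)$ shifts the dot product $w^\top x$ by $\epsilon \|w\|_1$, which grows with the input dimension
Robustness-accuracy tradeoff: adversarial training typically lowers clean accuracy by roughly 2-10%, with the exact cost depending on the dataset and on $\epsilon$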
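
The linear hypothesis can be checked numerically. The sketch below is an illustration only: the helper name `logit_shift`, the random Gaussian weights, and the value of `eps` are assumptions made for the demo, not part of the original material. It perturbs an input by `eps * sign(w)` and compares the resulting change in the dot product with `eps * ||w||_1` as the dimension grows.

```python
import random

def sign(v):
    """Return -1, 0, or +1."""
    return (v > 0) - (v < 0)

def logit_shift(dim, eps, seed=0):
    """Change in w.x when x is perturbed by eps * sign(w) (hypothetical helper)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # assumed random linear model
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # assumed random input
    x_adv = [xi + eps * sign(wi) for xi, wi in zip(x, w)]
    clean = sum(wi * xi for wi, xi in zip(w, x))
    adv = sum(wi * xi for wi, xi in zip(w, x_adv))
    l1 = sum(abs(wi) for wi in w)
    return adv - clean, eps * l1

# The per-coordinate change stays at eps, but the shift in w.x grows with dim.
for dim in (10, 100, 1000, 10000):
    shift, predicted = logit_shift(dim, eps=0.01)
    print(f"dim={dim:>5}  shift={shift:.3f}  eps*||w||_1={predicted:.3f}")
```

Each coordinate moves by only `eps`, yet the total shift adds up across dimensions, which is why a perturbation that is tiny per pixel can still move a high-dimensional input across a boundary.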

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x L)

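FGSM (Fast Gradient Sign Method) generates an adversarial example in a single step, moving every input coordinate by $\epsilon$ in the direction that increases the loss $L$ for model parameters $\theta$ and true label $y$: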

x_{\text{adv}} = x + \epsilon \cdot \text{sign}(\nabla_x L(\theta, x, y))
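
As a minimal sketch of the FGSM formula, the code below applies it to a tiny logistic-regression classifier, where the input gradient has the closed form `(p - y) * w`. The weights, the input, and `eps` are hypothetical values chosen for illustration; a real attack would obtain the gradient from an autodiff framework rather than by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b, x, y):
    """Binary cross-entropy for a single example, y in {0, 1}."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_x(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """x_adv = x + eps * sign(grad_x L(theta, x, y))."""
    g = grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.5
x, y = [0.4, -0.1, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.4)

print("clean: p =", predict(w, b, x), "loss =", loss(w, b, x, y))
print("adv:   p =", predict(w, b, x_adv), "loss =", loss(w, b, x_adv, y))
```

With only three input dimensions, this toy needs a fairly large `eps` to flip the prediction; in high-dimensional inputs a much smaller per-coordinate budget can have the same effect.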

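PGD (Projected Gradient Descent) repeats the sign-gradient step with a smaller step size $\alpha$, evaluating the gradient at the current iterate and projecting back onto the $\epsilon$-ball $\mathcal{B}_\epsilon(x)$ around the original input after each step:

x^{(t+1)} = \Pi_{\mathcal{B}_\epsilon(x)} \left( x^{(t)} + \alpha \cdot \text{sign}(\nabla_x L(\theta, x^{(t)}, y)) \right)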
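
The PGD iteration can be sketched in the same way. The code below assumes an $\ell_\infty$ ball, so the projection is a per-coordinate clip, and it starts from the clean input rather than a random point inside the ball (a common variant). The logistic-regression model, its weights, and the values of `eps`, `alpha`, and `steps` are hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad_x(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def project_linf(x_adv, x, eps):
    """Clip each coordinate back into [x_i - eps, x_i + eps]."""
    return [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]

def pgd(w, b, x, y, eps, alpha, steps):
    """Iterate the sign-gradient step, projecting after each one."""
    x_adv = list(x)  # assumption: start at the clean input
    for _ in range(steps):
        g = grad_x(w, b, x_adv, y)  # gradient at the current iterate
        x_adv = [a + alpha * sign(gi) for a, gi in zip(x_adv, g)]
        x_adv = project_linf(x_adv, x, eps)
    return x_adv

# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.5
x, y = [0.4, -0.1, 0.2], 1
x_adv = pgd(w, b, x, y, eps=0.4, alpha=0.1, steps=10)

print("clean p =", predict(w, b, x))
print("adv   p =", predict(w, b, x_adv))
```

For a linear model like this one the gradient sign never changes, so PGD simply walks to the same corner of the ball that a single FGSM step would reach. The iteration matters for nonlinear networks, where the gradient direction changes along the path.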

Adversarial training min-max objective:

\min_\theta \mathbb{E}_{(x,y)} \left[ \max_{\|\delta\| \leq \epsilon} L(\theta, x + \delta, y) \right]

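The inner maximization finds the worst-case perturbation $\delta$ within the budget $\epsilon$, and the outer minimization trains $\theta$ against it. The sign-gradient steps used by FGSM and PGD correspond to an $\ell_\infty$ budget. The underlying observation is that small perturbations $\delta$ can cause large changes in model predictions.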
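
A minimal sketch of the min-max objective as a training loop, assuming a logistic-regression model, a synthetic two-blob dataset, and hand-picked hyperparameters (all hypothetical, including the helper names `make_blobs`, `train`, and `accuracy`). The inner maximization is approximated by one FGSM step. For a linear model with an $\ell_\infty$ budget that single step happens to be the exact worst case; for deep networks it is only an approximation, and multi-step PGD is the usual choice.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Inner max (approximate in general): one sign-gradient step on the input."""
    p = predict(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def make_blobs(n, seed=0):
    """Synthetic data: class 1 around (1, 1), class 0 around (-1, -1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        data.append(([rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)], 1))
        data.append(([rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)], 0))
    return data

def train(data, eps, epochs=50, lr=0.1, seed=0):
    """SGD on the loss at x + delta; eps = 0 gives standard training."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x, y in order:
            x_in = fgsm(w, b, x, y, eps) if eps > 0 else x  # inner max
            err = predict(w, b, x_in) - y                   # dL/dz
            w = [wi - lr * err * xi for wi, xi in zip(w, x_in)]  # outer min
            b -= lr * err
    return w, b

def accuracy(w, b, data, eps=0.0):
    """Accuracy on clean inputs (eps = 0) or under an FGSM attack."""
    correct = 0
    for x, y in data:
        x_eval = fgsm(w, b, x, y, eps) if eps > 0 else x
        correct += int((predict(w, b, x_eval) > 0.5) == (y == 1))
    return correct / len(data)

data = make_blobs(n=100)
for name, train_eps in (("standard", 0.0), ("adversarial", 0.3)):
    w, b = train(data, eps=train_eps)
    print(name, "clean:", accuracy(w, b, data),
          "attacked:", accuracy(w, b, data, eps=0.3))
```

On a problem this easy the two models may end up looking similar, so the output is not evidence for any particular robustness gain. The example is meant to show the shape of the loop: attack the current model, then take the gradient step on the attacked input.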

Goodfellow, Shlens, and Szegedy (2015), ICLR
