Machine Learning

Regularization: Ridge, Lasso, and Elastic Net

Ridge, lasso, and elastic net add shape to coefficient space so models trade training fit for shrinkage, sparsity, and more stable validation behavior.

status: published · importance: critical · difficulty 3/5 · math: undergraduate · read: 20m · live demo


Concept Structure

01 Intuition

Start with the picture, metaphor, or geometric mechanism.

02 Math

Make the objects explicit and connect them with notation.

03 Code

Mirror the equations with runnable implementation details.

04 Interactive Demo

Manipulate the mechanism and watch the idea respond.


Conceptual Bridge

What should feel connected as you move through this page.

Carry in: Linear Regression & Least Squares

Bring the mental model from Linear Regression & Least Squares; this page will reuse it instead of restarting from zero.

Work here: Regularization: Ridge, Lasso, and Elastic Net

Ridge, lasso, and elastic net add shape to coefficient space so models trade training fit for shrinkage, sparsity, and more stable validation behavior.

Carry out: Model Selection and Hyperparameter Search

The next edge should feel earned: use the demo prediction here before following Model Selection and Hyperparameter Search.


Intuition

Build the mental picture first so the rest of the page has something to attach to.


You are here because a model with many features can explain the training data in too many ways. Regularization asks each explanation to pay rent: if a coefficient is large, the model must justify why that size is worth the extra complexity.

Before this, know least squares, norms, and why train/dev/test separation matters. By the end, you should be able to predict how ridge, lasso, and elastic net change coefficients, why feature scaling matters, and why the penalty strength $\lambda$ belongs to validation or cross-validation, not the final test set.

Start with least squares. It only asks, "Did the predictions match the training targets?" If two features are correlated, many coefficient pairs can make similar predictions. One fit might put most of the weight on feature 1, another on feature 2, and both can look equally good on the training set.
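A minimal sketch of that ambiguity, assuming a toy dataset in which the second feature is a noisy copy of the first. The seed, sample size, noise levels, and the helper name `train_mse` are illustrative choices, not taken from the page's code witness.

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen only for repeatability
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)     # near-copy of x1 (assumed toy setup)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

def train_mse(w):
    """Mean squared training error of the linear predictor X @ w."""
    return float(np.mean((y - X @ np.asarray(w)) ** 2))

# Three very different coefficient vectors that predict almost the same values.
candidates = {
    "all on x1": [2.0, 0.0],
    "all on x2": [0.0, 2.0],
    "shared": [1.0, 1.0],
}
for name, w in candidates.items():
    print(f"{name:10s} w={w} train MSE={train_mse(w):.3f}")
```

The three training errors should differ only slightly, which is why the training loss alone cannot say which explanation to prefer.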

Regularization adds a second question:

How expensive is this coefficient vector?

Ridge regression uses an $\ell_2$ cost. It dislikes large squared coefficients, so it pulls weights smoothly toward zero. It usually keeps correlated features sharing the load rather than choosing one winner.

Lasso uses an $\ell_1$ cost. Its geometry has sharp corners on the coordinate axes. Those corners make exact zeros natural, so lasso can perform feature selection: some coefficients are not merely small; they disappear.

Elastic net mixes the two. It keeps the lasso's ability to create zeros, but adds enough ridge-like smoothing that groups of correlated features behave less erratically.
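To make the three costs concrete, here is a small sketch that prices two coefficient vectors with the same total weight: one puts everything on a single feature, the other shares it. The function names and the mix value 0.5 are assumptions for illustration, and the $\ell_2$ cost uses the half-squared convention from the Math section.

```python
import numpy as np

def l1_cost(w):
    """Sum of absolute coefficients (the lasso penalty without lambda)."""
    return float(np.sum(np.abs(w)))

def l2_cost(w):
    """Half the sum of squared coefficients (the ridge penalty without lambda)."""
    return 0.5 * float(np.sum(np.square(w)))

def enet_cost(w, mix):
    """Blend of the two costs; mix=1 is pure l1, mix=0 is pure l2."""
    return mix * l1_cost(w) + (1.0 - mix) * l2_cost(w)

one_winner = np.array([2.0, 0.0])
shared = np.array([1.0, 1.0])

for name, w in [("one winner", one_winner), ("shared", shared)]:
    print(name, "l1 =", l1_cost(w), "l2 =", l2_cost(w), "enet(0.5) =", enet_cost(w, 0.5))
```

The $\ell_1$ cost is identical for both vectors, so lasso has no penalty-based reason to share weight. The $\ell_2$ cost is lower for the shared vector, which is the arithmetic behind ridge spreading weight across correlated features.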

Three caveats matter:

Scale matters. The penalty acts on coefficient size, and coefficient size depends on units: the same feature recorded in dollars needs a coefficient 1,000 times smaller than when it is recorded in thousands of dollars, so it would be penalized far less. Standardize features before penalizing, fit the scaler on the training fold only, and apply that same transformation to dev and test rows.
Sparsity is not truth. A zero lasso coefficient means "not chosen under this data, scaling, penalty, and correlation structure," not "causally irrelevant."
Test data stays sealed. Choose $\lambda$ and the ridge/lasso mix using dev data or cross-validation, then evaluate the selected pipeline once on test.
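The scale caveat can be checked numerically. The sketch below assumes a hypothetical income feature stored in two units and uses a plain one-feature least-squares slope; the names `income_dollars`, `ols_slope`, and `standardize` are illustrative, not from the page.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
income_dollars = rng.normal(loc=50_000, scale=10_000, size=n)   # hypothetical feature
income_thousands = income_dollars / 1_000                       # same feature, new unit
y = 0.0002 * income_dollars + rng.normal(scale=0.5, size=n)

train = np.arange(70)        # remaining rows play the role of dev/test

def ols_slope(x, target):
    """One-feature least-squares slope with an unpenalized intercept."""
    xc, tc = x - x.mean(), target - target.mean()
    return float(xc @ tc / (xc @ xc))

def standardize(x, fit_rows):
    """Scale every row using mean and std estimated on fit_rows only."""
    mu, sd = x[fit_rows].mean(), x[fit_rows].std()
    return (x - mu) / sd

slope_dollars = ols_slope(income_dollars[train], y[train])
slope_thousands = ols_slope(income_thousands[train], y[train])
slope_std_d = ols_slope(standardize(income_dollars, train)[train], y[train])
slope_std_k = ols_slope(standardize(income_thousands, train)[train], y[train])

print("raw slopes:", slope_dollars, slope_thousands)
print("standardized slopes:", slope_std_d, slope_std_k)
```

The raw slopes differ by a factor of 1,000 even though the fitted predictions are identical, so a fixed penalty would treat them very differently. After train-only standardization both versions yield the same coefficient.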

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


Let $X\in\mathbb R^{n\times p}$ be a standardized design matrix, let $y\in\mathbb R^n$ be centered targets, and let $w\in\mathbb R^p$ be the coefficients. We handle the intercept separately and do not penalize it. The unregularized least-squares loss is

$$L(w)=\frac{1}{2n}\lVert y-Xw\rVert_2^2.$$

Ridge regression adds a squared $\ell_2$ penalty:

$$J_{\text{ridge}}(w)= \frac{1}{2n}\lVert y-Xw\rVert_2^2+\frac{\lambda}{2}\lVert w\rVert_2^2, \qquad \lambda\ge 0.$$

The gradient is

$$\nabla J_{\text{ridge}}(w)=\frac{1}{n}X^{\mathsf T}(Xw-y)+\lambda w.$$

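Setting the gradient to zero and multiplying by $n$ gives the normal equations $(X^{\mathsf T}X+\lambda n I)w=X^{\mathsf T}y$. For $\lambda>0$ the matrix $X^{\mathsf T}X+\lambda n I$ is positive definite and therefore invertible; for $\lambda=0$ it is invertible only when $X$ has full column rank. When it is invertible, the ridge solution is

$$\hat w_{\text{ridge}}=(X^{\mathsf T}X+\lambda n I)^{-1}X^{\mathsf T}y.$$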

The constants differ across books depending on whether the squared error is summed or averaged, but the invariant is the same: adding a positive multiple of the identity makes large coefficient directions more expensive and stabilizes nearly singular directions.
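As a hedged numerical check of the ridge formulas under this page's $\frac{1}{2n}$ and $\frac{\lambda}{2}$ conventions, the sketch below solves the linear system and confirms that the gradient vanishes at the solution. The synthetic data, seed, and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized design
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=n)
y = y - y.mean()                                  # centered targets

def ridge_gradient(w, lam):
    """Gradient of (1/2n)||y - Xw||^2 + (lam/2)||w||^2."""
    return X.T @ (X @ w - y) / n + lam * w

def ridge_closed_form(lam):
    """Solve (X^T X + lam * n * I) w = X^T y without forming an inverse."""
    return np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)

for lam in [0.0, 0.1, 1.0, 10.0]:
    w = ridge_closed_form(lam)
    grad_norm = np.linalg.norm(ridge_gradient(w, lam))
    print(f"lam={lam:5.1f}  ||w||={np.linalg.norm(w):.4f}  ||grad||={grad_norm:.1e}")
```

The gradient norm should be at floating-point noise level for every $\lambda$, and the coefficient norm should fall as $\lambda$ grows. Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice, not something the page prescribes.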

Lasso replaces squared length with $\ell_1$ length:

$$J_{\text{lasso}}(w)= \frac{1}{2n}\lVert y-Xw\rVert_2^2+\lambda\lVert w\rVert_1.$$

The $\ell_1$ norm is not differentiable at $w_j=0$. The subdifferential of each term $|w_j|$ is

$$\partial |w_j|= \begin{cases} \{1\}, & w_j>0,\\ [-1,1], & w_j=0,\\ \{-1\}, & w_j<0. \end{cases}$$

That interval at zero is the algebraic reason exact zeros can be optimal. In the special case where the empirical Gram matrix satisfies $\frac1n X^{\mathsf T}X=I$, define $z_j=\frac1n x_j^{\mathsf T}y$, where $x_j$ is column $j$ of $X$. The lasso solution is then coordinate-wise soft-thresholding:

$$\hat w_j=\operatorname{sign}(z_j)\max(|z_j|-\lambda,0).$$

Small coordinates vanish. Large coordinates are shifted toward zero by a fixed amount.
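A short sketch of soft-thresholding on a handful of values, checked against a brute-force grid search over the one-dimensional problem $\frac12(w-z_j)^2+\lambda|w|$ that each coordinate solves in the orthonormal case. The input values, the grid, and the helper `brute_force` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    """Vectorized soft-thresholding: shrink toward zero by lam, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-1.2, -0.2, 0.0, 0.3, 0.9])
lam = 0.5
w = soft_threshold(z, lam)

# Brute-force check: minimize 0.5*(w - z_j)^2 + lam*|w| on a fine grid.
grid = np.linspace(-2.0, 2.0, 4001)

def brute_force(zj, lam):
    objective = 0.5 * (grid - zj) ** 2 + lam * np.abs(grid)
    return grid[np.argmin(objective)]

w_brute = np.array([brute_force(zj, lam) for zj in z])
print("soft-threshold:", w)
print("grid search:   ", w_brute)
```

Entries with $|z_j|\le\lambda$ land exactly on zero, and the remaining entries move toward zero by $\lambda$, matching the grid search up to the grid resolution.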

Elastic net blends the two penalties with a mixing parameter $\alpha\in[0,1]$:

$$J_{\text{enet}}(w)= \frac{1}{2n}\lVert y-Xw\rVert_2^2+ \lambda\left(\alpha\lVert w\rVert_1+\frac{1-\alpha}{2}\lVert w\rVert_2^2\right).$$

When $\alpha=0$, this is ridge. When $\alpha=1$, this is lasso. Between them, elastic net can keep sparsity while reducing some of the instability lasso has when predictors are strongly correlated. Exact zero coefficients come from the $\ell_1$ part, so the mix needs $\alpha>0$ for lasso-style zeros.
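The endpoint claims can be verified by writing the three objectives as functions and comparing values at an arbitrary coefficient vector. This is a sketch with made-up data; the function names are assumptions, and the formulas follow the objectives stated on this page.

```python
import numpy as np

def squared_loss(w, X, y):
    """(1/2n) * ||y - Xw||_2^2."""
    return float(np.sum((y - X @ w) ** 2) / (2 * X.shape[0]))

def ridge_objective(w, X, y, lam):
    return squared_loss(w, X, y) + 0.5 * lam * float(w @ w)

def lasso_objective(w, X, y, lam):
    return squared_loss(w, X, y) + lam * float(np.sum(np.abs(w)))

def enet_objective(w, X, y, lam, alpha):
    """alpha is the l1 mix: alpha=0 gives ridge, alpha=1 gives lasso."""
    l1 = float(np.sum(np.abs(w)))
    l2_sq = float(w @ w)
    return squared_loss(w, X, y) + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
w = np.array([0.8, -1.5, 0.0])      # arbitrary test point, not a fitted solution
lam = 0.35

print("enet(alpha=0):", enet_objective(w, X, y, lam, 0.0), "ridge:", ridge_objective(w, X, y, lam))
print("enet(alpha=1):", enet_objective(w, X, y, lam, 1.0), "lasso:", lasso_objective(w, X, y, lam))
```

Because the penalty is linear in $\alpha$, the objective at $\alpha=0.5$ is exactly the average of the ridge and lasso objectives at the same $w$ and $\lambda$.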

The penalty strength $\lambda$ is a hyperparameter. A clean workflow fits candidate pipelines on training folds, chooses $\lambda$ by development loss or cross-validation, optionally refits the selected pipeline by a pre-declared rule, and tests once.
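One way to sketch that workflow, using ridge's closed form so the example stays short. The data, the split sizes, the $\lambda$ grid, and the "no refit" rule are all assumptions made for illustration; a real project would declare its own.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 150, 5
X_raw = rng.normal(size=(n, p))
y_raw = X_raw @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=n)
train, dev, test = np.arange(0, 90), np.arange(90, 120), np.arange(120, 150)

# Preprocessing statistics come from training rows only.
mu, sd = X_raw[train].mean(axis=0), X_raw[train].std(axis=0)
X = (X_raw - mu) / sd
y = y_raw - y_raw[train].mean()

def fit_ridge(rows, lam):
    """Ridge solution on the given rows under the page's 1/(2n) convention."""
    Xr, yr = X[rows], y[rows]
    return np.linalg.solve(Xr.T @ Xr + lam * len(rows) * np.eye(p), Xr.T @ yr)

def mse(rows, w):
    return float(np.mean((X[rows] @ w - y[rows]) ** 2))

# Step 1: score every candidate on dev; test rows are never touched here.
lam_grid = [0.001, 0.01, 0.1, 1.0, 10.0]
dev_scores = {lam: mse(dev, fit_ridge(train, lam)) for lam in lam_grid}
best_lam = min(dev_scores, key=dev_scores.get)

# Step 2: pre-declared rule (assumed here): keep the train-only fit, no refit.
w_final = fit_ridge(train, best_lam)

# Step 3: evaluate once on test, after every choice is frozen.
test_mse = mse(test, w_final)
print("dev scores:", {k: round(v, 3) for k, v in dev_scores.items()})
print("chosen lambda:", best_lam, "test MSE:", round(test_mse, 3))
```

The test score is computed a single time and is not fed back into the choice of $\lambda$; rerunning the grid after looking at it would turn the test set into a second dev set.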

Code

Keep the implementation aligned with the notation so the algorithm is legible.


import numpy as np

rng = np.random.default_rng(7)
n = 120
x1 = rng.normal(size=n)
x2 = 0.92 * x1 + np.sqrt(1 - 0.92**2) * rng.normal(size=n)
x3 = rng.normal(size=n)                     # noise feature
X_raw = np.column_stack([x1, x2, x3])
y_raw = 2.0 * x1 + rng.normal(scale=0.7, size=n)
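train, dev = np.arange(80), np.arange(80, n)
# Fit the scaler and target centering on training rows only, then apply to all rows.
x_mean = X_raw[train].mean(axis=0)
x_std = X_raw[train].std(axis=0)
X = (X_raw - x_mean) / x_std
y_mean = y_raw[train].mean()
y = y_raw - y_mean

def soft_threshold(z, amount):
    # Scalar soft-thresholding: shrink z toward zero by `amount`, clipping at zero.
    return np.sign(z) * max(abs(z) - amount, 0.0)

def fit_elastic_net(alpha, lam, steps=700):
    # Cyclic coordinate descent; alpha is the l1 mix and lam the penalty strength.
    Xtr, ytr = X[train], y[train]
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        for j in range(X.shape[1]):
            # Partial residual: add back feature j's current contribution.
            residual = ytr - Xtr @ w + Xtr[:, j] * w[j]
            rho = np.mean(Xtr[:, j] * residual)
            # The ridge part of the penalty inflates the denominator.
            denom = np.mean(Xtr[:, j] ** 2) + lam * (1 - alpha)
            w[j] = soft_threshold(rho, lam * alpha) / denom
    return w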

for name, alpha in [("ridge", 0.0), ("elastic", 0.5), ("lasso", 1.0)]:
    w = fit_elastic_net(alpha=alpha, lam=0.35)
    mse = np.mean((X[dev] @ w - y[dev]) ** 2)
    print(name, "coef=", np.round(w, 3), "dev MSE=", round(mse, 3))

The code mirrors the math. alpha=0 uses a ridge-style denominator, alpha=1 uses pure soft-thresholding, and alpha=0.5 blends them. Because x1 and x2 are highly correlated, lasso may choose one correlated twin while ridge tends to spread weight across both.

The scaler and target centering are fit on training rows only; dev rows are transformed with those same statistics.
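A sanity check one could run on the coordinate-descent update: with the $\ell_1$ mix set to zero, it should converge to the ridge closed form from the Math section. The sketch below rebuilds the same synthetic data so that it runs on its own; `coordinate_descent_ridge` is a hypothetical name for the $\alpha=0$ special case, not a function from the page.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 120
x1 = rng.normal(size=n)
x2 = 0.92 * x1 + np.sqrt(1 - 0.92**2) * rng.normal(size=n)
x3 = rng.normal(size=n)
X_raw = np.column_stack([x1, x2, x3])
y_raw = 2.0 * x1 + rng.normal(scale=0.7, size=n)

train = np.arange(80)
Xtr = (X_raw[train] - X_raw[train].mean(axis=0)) / X_raw[train].std(axis=0)
ytr = y_raw[train] - y_raw[train].mean()

def coordinate_descent_ridge(lam, steps=700):
    """Elastic-net coordinate update with alpha = 0 (no soft-thresholding)."""
    w = np.zeros(Xtr.shape[1])
    for _ in range(steps):
        for j in range(Xtr.shape[1]):
            partial_residual = ytr - Xtr @ w + Xtr[:, j] * w[j]
            rho = np.mean(Xtr[:, j] * partial_residual)
            w[j] = rho / (np.mean(Xtr[:, j] ** 2) + lam)
    return w

lam = 0.35
w_cd = coordinate_descent_ridge(lam)
w_closed = np.linalg.solve(
    Xtr.T @ Xtr + lam * len(train) * np.eye(Xtr.shape[1]), Xtr.T @ ytr
)
print("coordinate descent:", np.round(w_cd, 6))
print("closed form:       ", np.round(w_closed, 6))
```

Agreement between the two vectors is evidence that the denominator `mean(x_j**2) + lam * (1 - alpha)` matches the $\frac{\lambda}{2}\lVert w\rVert_2^2$ scaling of the ridge objective.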

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


Choose the penalty strength, the $\ell_1$ mix, and how correlated the two signal features are. Before revealing the coefficients, predict the dominant visible behavior: will the model spread weight across the correlated pair, drop one correlated twin, or mostly shrink coefficient magnitudes?

The bars show fitted coefficients for two correlated signal features and one noise feature. The correlation slider controls a population construction target, so the finite sample correlation will be close but not exact. The dev-set readout is there to keep the workflow honest: $\lambda$ and $\alpha$ are modeling choices, so they belong to validation or cross-validation rather than the final test set.
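The gap between the target correlation and the sample correlation can be seen directly. This sketch assumes the demo builds its correlated pair the same way the code witness does (`x2 = rho * x1 + sqrt(1 - rho**2) * noise`); the helper name `correlated_pair` and the sample sizes are illustrative.

```python
import numpy as np

def correlated_pair(rho, n, rng):
    """Two unit-variance features whose population correlation is rho."""
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    return x1, x2

rng = np.random.default_rng(7)
target = 0.92
for n in [40, 120, 5000]:
    x1, x2 = correlated_pair(target, n, rng)
    sample_rho = np.corrcoef(x1, x2)[0, 1]
    print(f"n={n:5d}  target={target}  sample correlation={sample_rho:.4f}")
```

Small samples can land visibly off the target, while large samples settle close to it, so the slider value should be read as the population target rather than the realized correlation.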


Source Grounding

Canonical references for the mechanism on this page.

book · 2023 · An Introduction to Statistical Learning · James, Witten, Hastie, Tibshirani, and Taylor

Source for ridge/lasso objectives, shrinkage, sparsity, scaling caveats, and cross-validated lambda selection.

course-notes · 2019 · CS229 Notes: Regularization and Model Selection · Stanford CS229

Source for regularization as a model-selection/generalization tool and cross-validation of regularized choices.

book · 2016 · Deep Learning: Regularization for Deep Learning · Goodfellow, Bengio, and Courville

Source for parameter-norm penalties, L2 weight decay, L1 sparsity, and bias-variance regularization framing.

paper · 2005 · Regularization and Variable Selection via the Elastic Net · Zou and Hastie

Source for the elastic-net objective, sparsity plus shrinkage framing, and correlated-predictor grouping behavior.

Claim Review

Ridge, lasso, and elastic net add shape to coefficient space so models trade training fit for shrinkage, sparsity, and more stable validation behavior.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 4 references

islr-ridge-lasso, cs229-regularization-model-selection, goodfellow-regularization, zou-hastie-elastic-net

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Ridge adds an L2 penalty that continuously penalizes large coefficient norms and usually shrinks the fitted vector toward zero, lasso adds an L1 penalty that can set coefficients exactly to zero, and elastic net blends both while lambda must be chosen without test-set leakage.

Claim metadata: source checked

ISLR directly supports ridge/lasso objectives, scaling caveats, L1 sparsity, and CV lambda selection; CS229 supports cross-validation/model-selection framing; Goodfellow supports parameter-norm penalties and L1/L2 regularization behavior; Zou and Hastie support the elastic-net objective and grouping behavior under correlated predictors.

Sources: An Introduction to Statistical Learning, CS229 Notes: Regularization and Model Selection, Deep Learning: Regularization for Deep Learning, Regularization and Variable Selection via the Elastic Net

This claim covers linear-model penalty geometry and toy coordinate-descent witnesses; it does not claim lasso always recovers true causal features, that ridge/lasso universally dominate each other, or that neural weight decay is identical to AdamW.

A bounded review summary is present; still check caveats and exact source scope.

Substantively reviewed after train-only preprocessing, demo classifier, correlation-generation, and elastic-net source fixes. Sources support ridge/lasso/elastic-net objectives, L1 sparsity, scaling-within-training-fold caveat, and validation/CV lambda selection. Caveats: toy linear-model witness only; no causal feature-selection guarantee; no claim of coordinate-wise monotone ridge shrinkage or AdamW equivalence.

Reviewer: gpt-pro; reviewed 2026-06-28







Domain

Machine Learning

machine-learning, regularization, ridge, lasso, elastic-net, model-selection

Prerequisites

Linear Regression & Least Squares; Norms; Train/Dev/Test Splits, Cross-Validation, and Leakage

Leads To

Model Selection and Hyperparameter Search; Weight Decay & AdamW: Decoupled Regularization

Related links

Bias-Variance Decomposition; Gradient Descent
