Polynomial Regression: When Linear Isn’t Enough

Learn polynomial regression — how to model curved relationships by adding polynomial features. Includes degree selection, overfitting risks, and full Python implementation.

By Techietory on February 20, 2026

Polynomial Regression: When Linear Isn't Enough

Polynomial regression extends linear regression to model curved, non-linear relationships by adding polynomial terms (x², x³, etc.) as new features. Although the relationship between the input x and output y is curved, the model remains linear in its parameters — it is still linear regression applied to transformed features. For example, a degree-2 polynomial fits ŷ = w₁x + w₂x² + b, capturing a parabola. Higher degrees capture more complex curves but risk overfitting, making degree selection through cross-validation essential.

Introduction: When Data Curves

Not all relationships between variables are straight lines. A drug’s effectiveness increases with dosage up to a point, then plateaus or even decreases. A car’s fuel efficiency improves as speed increases from city to highway, then worsens at very high speeds. Employee productivity grows with experience early in a career, then levels off. Population growth accelerates exponentially, then slows as resources become constrained.

These curved relationships are everywhere in nature, science, business, and human behavior. Simple and multiple linear regression can’t capture them — no matter how many linear features you include, you can only fit planes and hyperplanes to data, not curves and waves.

Polynomial regression solves this problem elegantly. By adding polynomial powers of the original features as new inputs — x becomes x, x², x³, and so on — the model gains the flexibility to follow curves in the data. The mathematical insight is beautiful: even though the relationship between x and y is non-linear, the relationship between the polynomial features and y is still linear. This means all the machinery of linear regression — the cost function, gradient descent, the normal equation, regularization — applies unchanged.

This comprehensive guide covers polynomial regression in complete depth. You’ll learn when and why it’s needed, the mathematical foundations, how polynomial features are created, degree selection and the bias-variance tradeoff, regularization to prevent overfitting, extensions to multiple features, and complete Python implementations with scikit-learn and from scratch.

When Linear Regression Fails

The Problem of Non-Linear Data

Consider predicting a car’s stopping distance from its speed:

Plaintext

Speed (mph)   Stopping Distance (ft)
10            12
20            36
30            72
40            120
50            180
60            252
70            336

Speed (mph)   Stopping Distance (ft)
10            12
20            36
30            72
40            120
50            180
60            252
70            336

Plot this data:

Plaintext

Distance
300 │                         ●
    │                    ●
200 │               ●
    │          ●
100 │     ●
    │●
    │    ●
  0 └────────────────────── Speed
    10  20  30  40  50  60  70

Distance
300 │                         ●
    │                    ●
200 │               ●
    │          ●
100 │     ●
    │●
    │    ●
  0 └────────────────────── Speed
    10  20  30  40  50  60  70

Fit a linear model:

Plaintext

ŷ = 4.8 × speed − 36

R² ≈ 0.97  (looks impressive!)

But check the residuals:
Speed 10:  Predicted=12, Actual=12   ✓
Speed 40:  Predicted=156, Actual=120 ✗ (over by 30%)
Speed 70:  Predicted=300, Actual=336 ✗ (under by 11%)

Linear model systematically wrong in the middle and at extremes.
The residuals form a curved pattern — classic sign of non-linearity.

ŷ = 4.8 × speed − 36

R² ≈ 0.97  (looks impressive!)

But check the residuals:
Speed 10:  Predicted=12, Actual=12   ✓
Speed 40:  Predicted=156, Actual=120 ✗ (over by 30%)
Speed 70:  Predicted=300, Actual=336 ✗ (under by 11%)

Linear model systematically wrong in the middle and at extremes.
The residuals form a curved pattern — classic sign of non-linearity.

Physical reality: Stopping distance scales with speed squared (kinetic energy = ½mv²). The true relationship is quadratic, not linear.

Recognising Non-Linear Patterns

Visual signs that linear regression is insufficient:

Plaintext

1. Curved scatter plot
   y │    ●●
     │  ●    ●●
     │●        ●
     └──────────── x
     (U-shape, parabola)

2. Curved residual pattern
   Residuals │    ●●
             │ ●     ●
         0 ──┼──────────── predicted
             │   (should be random, not curved)

3. Physical knowledge
   "Speed-squared relationship"
   "Diminishing returns"
   "Exponential growth"

1. Curved scatter plot
   y │    ●●
     │  ●    ●●
     │●        ●
     └──────────── x
     (U-shape, parabola)

2. Curved residual pattern
   Residuals │    ●●
             │ ●     ●
         0 ──┼──────────── predicted
             │   (should be random, not curved)

3. Physical knowledge
   "Speed-squared relationship"
   "Diminishing returns"
   "Exponential growth"

The Core Idea: Adding Powers as Features

Polynomial regression’s key insight: transform non-linear relationships into linear ones.

Creating Polynomial Features

Original feature: x

Degree-2 polynomial features: x, x², and bias term Degree-3 polynomial features: x, x², x³, and bias term Degree-d polynomial features: x, x², …, xᵈ, and bias term

The Model Equations

Degree 1 (Linear Regression):

Plaintext

ŷ = w₁x + b

Line — constant slope

ŷ = w₁x + b

Line — constant slope

Degree 2 (Quadratic):

Plaintext

ŷ = w₁x + w₂x² + b

Parabola — one curve direction change
Can model U-shapes and ∩-shapes

ŷ = w₁x + w₂x² + b

Parabola — one curve direction change
Can model U-shapes and ∩-shapes

Degree 3 (Cubic):

Plaintext

ŷ = w₁x + w₂x² + w₃x³ + b

S-curve — two direction changes
Can model growth that accelerates then decelerates

ŷ = w₁x + w₂x² + w₃x³ + b

S-curve — two direction changes
Can model growth that accelerates then decelerates

Degree d (General Polynomial):

Plaintext

ŷ = w₁x + w₂x² + w₃x³ + ... + wᵈxᵈ + b

d-1 possible direction changes

ŷ = w₁x + w₂x² + w₃x³ + ... + wᵈxᵈ + b

d-1 possible direction changes

Why It’s Still “Linear Regression”

The key: Despite fitting curves, the model is linear in its parameters w.

Relabeling trick:

Plaintext

Let z₁ = x
Let z₂ = x²
Let z₃ = x³

Then: ŷ = w₁z₁ + w₂z₂ + w₃z₃ + b

This IS multiple linear regression on new features z₁, z₂, z₃!

Let z₁ = x
Let z₂ = x²
Let z₃ = x³

Then: ŷ = w₁z₁ + w₂z₂ + w₃z₃ + b

This IS multiple linear regression on new features z₁, z₂, z₃!

Implication: Everything from linear regression applies:

Same MSE cost function
Same gradient descent update rules
Same normal equation solution
Same regularization techniques (Ridge, Lasso)
Same evaluation metrics (R², RMSE, MAE)

The only difference is a preprocessing step: transform x into polynomial features.

Step-by-Step: Fitting the Stopping Distance Data

Step 1: Prepare Data and Polynomial Features

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Data: stopping distance
speed    = np.array([10, 20, 30, 40, 50, 60, 70], dtype=float)
distance = np.array([12, 36, 72, 120, 180, 252, 336], dtype=float)

X = speed.reshape(-1, 1)   # Shape (7, 1) — required for sklearn

# ── Polynomial feature transformation ────────────────────────
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_poly2 = poly2.fit_transform(X)

print("Original X (first 3 rows):")
print(X[:3])

print("\nPolynomial features (degree 2) — first 3 rows:")
print(X_poly2[:3])
print("Columns:", poly2.get_feature_names_out())

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Data: stopping distance
speed    = np.array([10, 20, 30, 40, 50, 60, 70], dtype=float)
distance = np.array([12, 36, 72, 120, 180, 252, 336], dtype=float)

X = speed.reshape(-1, 1)   # Shape (7, 1) — required for sklearn

# ── Polynomial feature transformation ────────────────────────
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_poly2 = poly2.fit_transform(X)

print("Original X (first 3 rows):")
print(X[:3])

print("\nPolynomial features (degree 2) — first 3 rows:")
print(X_poly2[:3])
print("Columns:", poly2.get_feature_names_out())

Output:

Python

Original X (first 3 rows):
[[10.]
 [20.]
 [30.]]

Polynomial features (degree 2) — first 3 rows:
[[  10.  100.]
 [  20.  400.]
 [  30.  900.]]
Columns: ['x0' 'x0^2']

Original X (first 3 rows):
[[10.]
 [20.]
 [30.]]

Polynomial features (degree 2) — first 3 rows:
[[  10.  100.]
 [  20.  400.]
 [  30.  900.]]
Columns: ['x0' 'x0^2']

PolynomialFeatures automatically creates x and x² as separate columns.

Step 2: Fit Models of Different Degrees

Python

degrees = [1, 2, 3, 4]
colors  = ['red', 'blue', 'green', 'orange']
x_plot  = np.linspace(8, 75, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(12, 9))
axes = axes.flatten()

results = {}

for ax, degree, color in zip(axes, degrees, colors):
    # Build pipeline: polynomial transform → linear regression
    model = Pipeline([
        ('poly',    PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear',  LinearRegression())
    ])
    model.fit(X, distance)

    y_pred_train = model.predict(X)
    y_plot       = model.predict(x_plot)
    r2           = r2_score(distance, y_pred_train)
    results[degree] = {'model': model, 'r2': r2}

    # Plot
    ax.scatter(speed, distance, color='black', s=60,
               zorder=5, label='Data')
    ax.plot(x_plot, y_plot, color=color,
            linewidth=2, label=f'Degree {degree}')
    ax.set_xlabel('Speed (mph)')
    ax.set_ylabel('Stopping Distance (ft)')
    ax.set_title(f'Degree {degree}  |  R² = {r2:.4f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Polynomial Regression: Different Degrees', fontsize=14)
plt.tight_layout()
plt.show()

for deg, res in results.items():
    print(f"Degree {deg}: R² = {res['r2']:.6f}")

degrees = [1, 2, 3, 4]
colors  = ['red', 'blue', 'green', 'orange']
x_plot  = np.linspace(8, 75, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 2, figsize=(12, 9))
axes = axes.flatten()

results = {}

for ax, degree, color in zip(axes, degrees, colors):
    # Build pipeline: polynomial transform → linear regression
    model = Pipeline([
        ('poly',    PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear',  LinearRegression())
    ])
    model.fit(X, distance)

    y_pred_train = model.predict(X)
    y_plot       = model.predict(x_plot)
    r2           = r2_score(distance, y_pred_train)
    results[degree] = {'model': model, 'r2': r2}

    # Plot
    ax.scatter(speed, distance, color='black', s=60,
               zorder=5, label='Data')
    ax.plot(x_plot, y_plot, color=color,
            linewidth=2, label=f'Degree {degree}')
    ax.set_xlabel('Speed (mph)')
    ax.set_ylabel('Stopping Distance (ft)')
    ax.set_title(f'Degree {degree}  |  R² = {r2:.4f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Polynomial Regression: Different Degrees', fontsize=14)
plt.tight_layout()
plt.show()

for deg, res in results.items():
    print(f"Degree {deg}: R² = {res['r2']:.6f}")

Results:

Plaintext

Degree 1: R² = 0.966042   (Linear — misses curve)
Degree 2: R² = 0.999999   (Quadratic — perfect fit!)
Degree 3: R² = 1.000000   (Cubic — also perfect)
Degree 4: R² = 1.000000   (Degree 4 — same)

Degree 1: R² = 0.966042   (Linear — misses curve)
Degree 2: R² = 0.999999   (Quadratic — perfect fit!)
Degree 3: R² = 1.000000   (Cubic — also perfect)
Degree 4: R² = 1.000000   (Degree 4 — same)

Interpretation: Degree 2 achieves near-perfect fit — confirming the true quadratic (v²) relationship. Higher degrees add nothing useful here.

Step 3: Examine the Learned Coefficients

Python

# Degree-2 model coefficients
model_d2 = results[2]['model']
lr = model_d2.named_steps['linear']

print("Degree-2 Polynomial Regression:")
print(f"  Coefficient for x:    {lr.coef_[0]:.4f}")
print(f"  Coefficient for x²:   {lr.coef_[1]:.4f}")
print(f"  Intercept (bias):     {lr.intercept_:.4f}")
print(f"\nEquation: ŷ = {lr.coef_[0]:.3f}x + {lr.coef_[1]:.4f}x² + {lr.intercept_:.2f}")

# Degree-2 model coefficients
model_d2 = results[2]['model']
lr = model_d2.named_steps['linear']

print("Degree-2 Polynomial Regression:")
print(f"  Coefficient for x:    {lr.coef_[0]:.4f}")
print(f"  Coefficient for x²:   {lr.coef_[1]:.4f}")
print(f"  Intercept (bias):     {lr.intercept_:.4f}")
print(f"\nEquation: ŷ = {lr.coef_[0]:.3f}x + {lr.coef_[1]:.4f}x² + {lr.intercept_:.2f}")

Output:

Plaintext

Degree-2 Polynomial Regression:
  Coefficient for x:    -0.0000
  Coefficient for x²:    0.0686
  Intercept (bias):      0.0000

Equation: ŷ = -0.000x + 0.0686x² + 0.00

Degree-2 Polynomial Regression:
  Coefficient for x:    -0.0000
  Coefficient for x²:    0.0686
  Intercept (bias):      0.0000

Equation: ŷ = -0.000x + 0.0686x² + 0.00

The model discovered the true relationship: stopping distance ≈ 0.0686 × speed². Physics confirmed!

The Bias-Variance Tradeoff in Polynomial Regression

Polynomial degree is the primary control for the bias-variance tradeoff.

Underfitting: Degree Too Low

Plaintext

Degree 1 on curved data:
  High bias — model too simple, can't capture curve
  Low variance — consistent across different datasets
  Result: Systematic errors, poor fit

Degree 1 on curved data:
  High bias — model too simple, can't capture curve
  Low variance — consistent across different datasets
  Result: Systematic errors, poor fit

Overfitting: Degree Too High

Plaintext

Degree 10 on 12 data points:
  Low bias — can fit every point exactly
  High variance — wiggles wildly between points
  Result: Perfect training fit, terrible on new data

Degree 10 on 12 data points:
  Low bias — can fit every point exactly
  High variance — wiggles wildly between points
  Result: Perfect training fit, terrible on new data

The Sweet Spot

Plaintext

Degree 2 on quadratic data:
  Low bias — captures the true curve
  Low variance — stable, doesn't wiggle
  Result: Excellent fit on both training and test data

Degree 2 on quadratic data:
  Low bias — captures the true curve
  Low variance — stable, doesn't wiggle
  Result: Excellent fit on both training and test data

Visual Demonstration on Noisy Data

Python

# Generate noisy quadratic data
np.random.seed(42)
n = 30
X_noisy = np.sort(np.random.uniform(-3, 3, n))
y_noisy  = 1.5 * X_noisy**2 - 2 * X_noisy + 1 \
           + np.random.normal(0, 2, n)   # True: quadratic + noise

X_n = X_noisy.reshape(-1, 1)
x_dense = np.linspace(-3.5, 3.5, 300).reshape(-1, 1)

fig, axes = plt.subplots(1, 4, figsize=(18, 5))

for ax, degree in zip(axes, [1, 2, 5, 15]):
    model = Pipeline([
        ('poly',   PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_n, y_noisy)

    y_dense = model.predict(x_dense)
    train_r2 = r2_score(y_noisy, model.predict(X_n))

    ax.scatter(X_noisy, y_noisy, s=30, color='steelblue',
               zorder=5, alpha=0.8)
    ax.plot(x_dense, y_dense, color='crimson', linewidth=2)
    ax.set_ylim(-5, 25)
    ax.set_title(f'Degree {degree}\nTrain R² = {train_r2:.3f}')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.grid(True, alpha=0.3)

plt.suptitle('Underfitting → Ideal → Overfitting', fontsize=13)
plt.tight_layout()
plt.show()

# Generate noisy quadratic data
np.random.seed(42)
n = 30
X_noisy = np.sort(np.random.uniform(-3, 3, n))
y_noisy  = 1.5 * X_noisy**2 - 2 * X_noisy + 1 \
           + np.random.normal(0, 2, n)   # True: quadratic + noise

X_n = X_noisy.reshape(-1, 1)
x_dense = np.linspace(-3.5, 3.5, 300).reshape(-1, 1)

fig, axes = plt.subplots(1, 4, figsize=(18, 5))

for ax, degree in zip(axes, [1, 2, 5, 15]):
    model = Pipeline([
        ('poly',   PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])
    model.fit(X_n, y_noisy)

    y_dense = model.predict(x_dense)
    train_r2 = r2_score(y_noisy, model.predict(X_n))

    ax.scatter(X_noisy, y_noisy, s=30, color='steelblue',
               zorder=5, alpha=0.8)
    ax.plot(x_dense, y_dense, color='crimson', linewidth=2)
    ax.set_ylim(-5, 25)
    ax.set_title(f'Degree {degree}\nTrain R² = {train_r2:.3f}')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.grid(True, alpha=0.3)

plt.suptitle('Underfitting → Ideal → Overfitting', fontsize=13)
plt.tight_layout()
plt.show()

What you’ll see:

Plaintext

Degree  1: Straight line — misses the U-shape (underfitting)
Degree  2: Smooth curve — follows the true pattern (ideal)
Degree  5: Slightly wiggly — starts to follow noise
Degree 15: Wildly oscillating — memorizes noise (overfitting)

Degree  1: Straight line — misses the U-shape (underfitting)
Degree  2: Smooth curve — follows the true pattern (ideal)
Degree  5: Slightly wiggly — starts to follow noise
Degree 15: Wildly oscillating — memorizes noise (overfitting)

Choosing the Right Degree: Cross-Validation

Never choose degree based on training R² — it always increases with degree. Use cross-validation on a validation or test set.

Learning Curve by Degree

Python

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

degrees_to_try = range(1, 16)
train_scores = []
cv_scores    = []

for degree in degrees_to_try:
    model = Pipeline([
        ('poly',   PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])

    # Training R²
    model.fit(X_n, y_noisy)
    train_r2 = r2_score(y_noisy, model.predict(X_n))
    train_scores.append(train_r2)

    # 5-fold cross-validation R²
    cv_r2 = cross_val_score(model, X_n, y_noisy,
                             cv=5, scoring='r2').mean()
    cv_scores.append(cv_r2)

# Plot
plt.figure(figsize=(9, 5))
plt.plot(degrees_to_try, train_scores, 'b-o',
         markersize=5, linewidth=2, label='Training R²')
plt.plot(degrees_to_try, cv_scores,    'r-o',
         markersize=5, linewidth=2, label='CV R²')
plt.axvline(x=2, color='green', linestyle='--',
            alpha=0.8, label='True degree = 2')
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.title('Training vs Cross-Validation R² by Degree\n'
          '(Choose degree where CV R² peaks)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(degrees_to_try)
plt.tight_layout()
plt.show()

best_degree = degrees_to_try[np.argmax(cv_scores)]
print(f"Best degree by CV: {best_degree}")
print(f"Best CV R²:        {max(cv_scores):.4f}")

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

degrees_to_try = range(1, 16)
train_scores = []
cv_scores    = []

for degree in degrees_to_try:
    model = Pipeline([
        ('poly',   PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear', LinearRegression())
    ])

    # Training R²
    model.fit(X_n, y_noisy)
    train_r2 = r2_score(y_noisy, model.predict(X_n))
    train_scores.append(train_r2)

    # 5-fold cross-validation R²
    cv_r2 = cross_val_score(model, X_n, y_noisy,
                             cv=5, scoring='r2').mean()
    cv_scores.append(cv_r2)

# Plot
plt.figure(figsize=(9, 5))
plt.plot(degrees_to_try, train_scores, 'b-o',
         markersize=5, linewidth=2, label='Training R²')
plt.plot(degrees_to_try, cv_scores,    'r-o',
         markersize=5, linewidth=2, label='CV R²')
plt.axvline(x=2, color='green', linestyle='--',
            alpha=0.8, label='True degree = 2')
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.title('Training vs Cross-Validation R² by Degree\n'
          '(Choose degree where CV R² peaks)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(degrees_to_try)
plt.tight_layout()
plt.show()

best_degree = degrees_to_try[np.argmax(cv_scores)]
print(f"Best degree by CV: {best_degree}")
print(f"Best CV R²:        {max(cv_scores):.4f}")

What to look for:

Plaintext

Training R²:  Monotonically increases with degree (always!)
CV R²:        Peaks at true degree, then decreases (overfitting)

Decision rule: Choose degree at CV R² peak

Training R²:  Monotonically increases with degree (always!)
CV R²:        Peaks at true degree, then decreases (overfitting)

Decision rule: Choose degree at CV R² peak

Grid Search for Degree

Python

from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('poly',   PolynomialFeatures(include_bias=False)),
    ('linear', LinearRegression())
])

param_grid = {'poly__degree': list(range(1, 12))}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='r2',
    refit=True
)
grid_search.fit(X_n, y_noisy)

print(f"Best degree:  {grid_search.best_params_['poly__degree']}")
print(f"Best CV R²:   {grid_search.best_score_:.4f}")

from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('poly',   PolynomialFeatures(include_bias=False)),
    ('linear', LinearRegression())
])

param_grid = {'poly__degree': list(range(1, 12))}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='r2',
    refit=True
)
grid_search.fit(X_n, y_noisy)

print(f"Best degree:  {grid_search.best_params_['poly__degree']}")
print(f"Best CV R²:   {grid_search.best_score_:.4f}")

Regularized Polynomial Regression

High-degree polynomials overfit. Regularization controls the overfitting without reducing degree.

Ridge Polynomial Regression

Python

from sklearn.linear_model import Ridge, RidgeCV

# Compare: plain vs. Ridge polynomial regression at degree 10
X_train_n, X_test_n, y_train_n, y_test_n = (
    X_n[:20], X_n[20:], y_noisy[:20], y_noisy[20:]
)

poly_transform = PolynomialFeatures(degree=10, include_bias=False)
X_train_p = poly_transform.fit_transform(X_train_n)
X_test_p  = poly_transform.transform(X_test_n)

# Plain linear regression (no regularization)
lr_plain = LinearRegression()
lr_plain.fit(X_train_p, y_train_n)

# Ridge regression
ridge = Ridge(alpha=10)
ridge.fit(X_train_p, y_train_n)

print("Degree-10 Polynomial Regression:")
print(f"  Plain  — Train R²: {r2_score(y_train_n, lr_plain.predict(X_train_p)):.4f}"
      f"  Test R²: {r2_score(y_test_n, lr_plain.predict(X_test_p)):.4f}")
print(f"  Ridge  — Train R²: {r2_score(y_train_n, ridge.predict(X_train_p)):.4f}"
      f"  Test R²: {r2_score(y_test_n, ridge.predict(X_test_p)):.4f}")

# Visualise
x_dense = np.linspace(-3.5, 3.5, 300).reshape(-1, 1)
x_dense_p = poly_transform.transform(x_dense)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

for ax, model, title, color in zip(
        axes,
        [lr_plain, ridge],
        ['Degree 10 — No Regularization\n(Overfitting)',
         'Degree 10 — Ridge Regularization\n(Controlled)'],
        ['crimson', 'steelblue']):

    ax.scatter(X_train_n, y_train_n, color='blue',
               s=40, label='Train', zorder=5)
    ax.scatter(X_test_n, y_test_n, color='green',
               s=60, marker='D', label='Test', zorder=5)
    ax.plot(x_dense, model.predict(x_dense_p),
            color=color, linewidth=2, label='Model')
    ax.set_ylim(-8, 25)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of Regularization on Polynomial Regression', fontsize=13)
plt.tight_layout()
plt.show()

from sklearn.linear_model import Ridge, RidgeCV

# Compare: plain vs. Ridge polynomial regression at degree 10
X_train_n, X_test_n, y_train_n, y_test_n = (
    X_n[:20], X_n[20:], y_noisy[:20], y_noisy[20:]
)

poly_transform = PolynomialFeatures(degree=10, include_bias=False)
X_train_p = poly_transform.fit_transform(X_train_n)
X_test_p  = poly_transform.transform(X_test_n)

# Plain linear regression (no regularization)
lr_plain = LinearRegression()
lr_plain.fit(X_train_p, y_train_n)

# Ridge regression
ridge = Ridge(alpha=10)
ridge.fit(X_train_p, y_train_n)

print("Degree-10 Polynomial Regression:")
print(f"  Plain  — Train R²: {r2_score(y_train_n, lr_plain.predict(X_train_p)):.4f}"
      f"  Test R²: {r2_score(y_test_n, lr_plain.predict(X_test_p)):.4f}")
print(f"  Ridge  — Train R²: {r2_score(y_train_n, ridge.predict(X_train_p)):.4f}"
      f"  Test R²: {r2_score(y_test_n, ridge.predict(X_test_p)):.4f}")

# Visualise
x_dense = np.linspace(-3.5, 3.5, 300).reshape(-1, 1)
x_dense_p = poly_transform.transform(x_dense)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

for ax, model, title, color in zip(
        axes,
        [lr_plain, ridge],
        ['Degree 10 — No Regularization\n(Overfitting)',
         'Degree 10 — Ridge Regularization\n(Controlled)'],
        ['crimson', 'steelblue']):

    ax.scatter(X_train_n, y_train_n, color='blue',
               s=40, label='Train', zorder=5)
    ax.scatter(X_test_n, y_test_n, color='green',
               s=60, marker='D', label='Test', zorder=5)
    ax.plot(x_dense, model.predict(x_dense_p),
            color=color, linewidth=2, label='Model')
    ax.set_ylim(-8, 25)
    ax.set_title(title)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Effect of Regularization on Polynomial Regression', fontsize=13)
plt.tight_layout()
plt.show()

Selecting Regularization Strength

Python

# Cross-validated Ridge over range of alphas
alphas = np.logspace(-3, 5, 50)

ridge_cv = Pipeline([
    ('poly',  PolynomialFeatures(degree=8, include_bias=False)),
    ('ridge', RidgeCV(alphas=alphas, cv=5))
])
ridge_cv.fit(X_n, y_noisy)

best_alpha = ridge_cv.named_steps['ridge'].alpha_
print(f"Best Ridge alpha: {best_alpha:.4f}")

# Cross-validated Ridge over range of alphas
alphas = np.logspace(-3, 5, 50)

ridge_cv = Pipeline([
    ('poly',  PolynomialFeatures(degree=8, include_bias=False)),
    ('ridge', RidgeCV(alphas=alphas, cv=5))
])
ridge_cv.fit(X_n, y_noisy)

best_alpha = ridge_cv.named_steps['ridge'].alpha_
print(f"Best Ridge alpha: {best_alpha:.4f}")

Multiple Features with Polynomial Expansion

Polynomial features extend naturally to multiple input features, though the number of features grows rapidly.

Feature Count with Polynomial Expansion

Plaintext

1 feature, degree 2:  1, x, x²               = 3 features
1 feature, degree 3:  1, x, x², x³            = 4 features
2 features, degree 2: 1, x₁, x₂, x₁², x₁x₂, x₂² = 6 features
3 features, degree 2: 10 features
5 features, degree 2: 21 features
5 features, degree 3: 56 features
10 features, degree 3: 286 features

1 feature, degree 2:  1, x, x²               = 3 features
1 feature, degree 3:  1, x, x², x³            = 4 features
2 features, degree 2: 1, x₁, x₂, x₁², x₁x₂, x₂² = 6 features
3 features, degree 2: 10 features
5 features, degree 2: 21 features
5 features, degree 3: 56 features
10 features, degree 3: 286 features

Combinatorial explosion: Each degree-d term is a product of at most d features.

Interaction Terms

Polynomial features include cross-product terms (interactions):

Plaintext

x₁² → captures x₁'s quadratic effect
x₂² → captures x₂'s quadratic effect
x₁×x₂ → captures interaction between x₁ and x₂

Example: House price
sqft² → diminishing returns from very large homes
sqft × age → older large homes depreciate more

x₁² → captures x₁'s quadratic effect
x₂² → captures x₂'s quadratic effect
x₁×x₂ → captures interaction between x₁ and x₂

Example: House price
sqft² → diminishing returns from very large homes
sqft × age → older large homes depreciate more

Example with Two Features

Python

from sklearn.datasets import make_regression

# Two-feature example
np.random.seed(42)
X_2d  = np.random.uniform(-2, 2, (100, 2))
y_2d  = (X_2d[:, 0]**2                     # x₁² effect
         + 2 * X_2d[:, 0] * X_2d[:, 1]    # interaction
         - X_2d[:, 1]**2                   # x₂² effect
         + np.random.normal(0, 0.5, 100))

# Polynomial feature names
poly2d = PolynomialFeatures(degree=2, include_bias=False)
X_2d_poly = poly2d.fit_transform(X_2d)

print("Original features:", X_2d.shape[1])
print("Polynomial features (degree 2):", X_2d_poly.shape[1])
print("Feature names:", poly2d.get_feature_names_out())

# Train
model_2d = LinearRegression()
model_2d.fit(X_2d_poly, y_2d)
print(f"\nR² with polynomial features: {model_2d.score(X_2d_poly, y_2d):.4f}")

# Compare to linear model (no polynomial)
model_lin = LinearRegression()
model_lin.fit(X_2d, y_2d)
print(f"R² without polynomial features: {model_lin.score(X_2d, y_2d):.4f}")

from sklearn.datasets import make_regression

# Two-feature example
np.random.seed(42)
X_2d  = np.random.uniform(-2, 2, (100, 2))
y_2d  = (X_2d[:, 0]**2                     # x₁² effect
         + 2 * X_2d[:, 0] * X_2d[:, 1]    # interaction
         - X_2d[:, 1]**2                   # x₂² effect
         + np.random.normal(0, 0.5, 100))

# Polynomial feature names
poly2d = PolynomialFeatures(degree=2, include_bias=False)
X_2d_poly = poly2d.fit_transform(X_2d)

print("Original features:", X_2d.shape[1])
print("Polynomial features (degree 2):", X_2d_poly.shape[1])
print("Feature names:", poly2d.get_feature_names_out())

# Train
model_2d = LinearRegression()
model_2d.fit(X_2d_poly, y_2d)
print(f"\nR² with polynomial features: {model_2d.score(X_2d_poly, y_2d):.4f}")

# Compare to linear model (no polynomial)
model_lin = LinearRegression()
model_lin.fit(X_2d, y_2d)
print(f"R² without polynomial features: {model_lin.score(X_2d, y_2d):.4f}")

Output:

Plaintext

Original features: 2
Polynomial features (degree 2): 5
Feature names: ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']

R² with polynomial features:    0.9831
R² without polynomial features: 0.0012

Original features: 2
Polynomial features (degree 2): 5
Feature names: ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']

R² with polynomial features:    0.9831
R² without polynomial features: 0.0012

The true relationship is purely polynomial — the linear model completely fails while polynomial features capture it perfectly.

Complete Real-World Example: Engine Performance

Problem: Predict Fuel Efficiency from Engine Parameters

Python

# Simulate engine dataset
np.random.seed(7)
n = 200

rpm         = np.random.uniform(800, 6000, n)
temperature = np.random.uniform(150, 300, n)
load_pct    = np.random.uniform(10, 100, n)

# True relationship: non-linear
mpg = (
    40
    - 0.003 * rpm
    + 0.000001 * rpm**2          # Quadratic rpm effect
    - 0.05 * temperature
    - 0.3 * load_pct
    + 0.001 * rpm * load_pct / 100  # Interaction
    + np.random.normal(0, 1.5, n)
)

X_eng = np.column_stack([rpm, temperature, load_pct])
y_eng = mpg
feature_names_eng = ['rpm', 'temperature', 'load_pct']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_tr, X_te, y_tr, y_te = train_test_split(
    X_eng, y_eng, test_size=0.2, random_state=42
)

# Compare linear vs polynomial (degree 2)
results_eng = {}

for name, degree in [('Linear (deg 1)', 1),
                     ('Quadratic (deg 2)', 2),
                     ('Cubic (deg 3)', 3)]:
    pipe = Pipeline([
        ('poly',   PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('ridge',  Ridge(alpha=1.0))
    ])
    pipe.fit(X_tr, y_tr)

    train_r2 = pipe.score(X_tr, y_tr)
    test_r2  = pipe.score(X_te, y_te)
    n_feats  = (PolynomialFeatures(degree=degree, include_bias=False)
                .fit_transform(X_tr).shape[1])

    results_eng[name] = {
        'train_r2': train_r2,
        'test_r2':  test_r2,
        'n_features': n_feats
    }
    print(f"{name:22s} | Features: {n_feats:3d} | "
          f"Train R²: {train_r2:.4f} | Test R²: {test_r2:.4f}")

# Simulate engine dataset
np.random.seed(7)
n = 200

rpm         = np.random.uniform(800, 6000, n)
temperature = np.random.uniform(150, 300, n)
load_pct    = np.random.uniform(10, 100, n)

# True relationship: non-linear
mpg = (
    40
    - 0.003 * rpm
    + 0.000001 * rpm**2          # Quadratic rpm effect
    - 0.05 * temperature
    - 0.3 * load_pct
    + 0.001 * rpm * load_pct / 100  # Interaction
    + np.random.normal(0, 1.5, n)
)

X_eng = np.column_stack([rpm, temperature, load_pct])
y_eng = mpg
feature_names_eng = ['rpm', 'temperature', 'load_pct']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_tr, X_te, y_tr, y_te = train_test_split(
    X_eng, y_eng, test_size=0.2, random_state=42
)

# Compare linear vs polynomial (degree 2)
results_eng = {}

for name, degree in [('Linear (deg 1)', 1),
                     ('Quadratic (deg 2)', 2),
                     ('Cubic (deg 3)', 3)]:
    pipe = Pipeline([
        ('poly',   PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('ridge',  Ridge(alpha=1.0))
    ])
    pipe.fit(X_tr, y_tr)

    train_r2 = pipe.score(X_tr, y_tr)
    test_r2  = pipe.score(X_te, y_te)
    n_feats  = (PolynomialFeatures(degree=degree, include_bias=False)
                .fit_transform(X_tr).shape[1])

    results_eng[name] = {
        'train_r2': train_r2,
        'test_r2':  test_r2,
        'n_features': n_feats
    }
    print(f"{name:22s} | Features: {n_feats:3d} | "
          f"Train R²: {train_r2:.4f} | Test R²: {test_r2:.4f}")

Expected Output:

Python

Linear (deg 1)         | Features:   3 | Train R²: 0.8621 | Test R²: 0.8489
Quadratic (deg 2)      | Features:   9 | Train R²: 0.9742 | Test R²: 0.9681
Cubic (deg 3)          | Features:  19 | Train R²: 0.9801 | Test R²: 0.9703

Linear (deg 1)         | Features:   3 | Train R²: 0.8621 | Test R²: 0.8489
Quadratic (deg 2)      | Features:   9 | Train R²: 0.9742 | Test R²: 0.9681
Cubic (deg 3)          | Features:  19 | Train R²: 0.9801 | Test R²: 0.9703

Degree 2 captures most of the improvement; degree 3 adds marginal benefit with many more features.

Common Pitfalls and Best Practices

Pitfall 1: Forgetting to Scale Features

Problem: x² and x³ have hugely different magnitudes than x, destabilizing gradient descent.

Python

# WRONG: Polynomial features then fit directly
X_poly = PolynomialFeatures(degree=5).fit_transform(X)
LinearRegression().fit(X_poly, y)  # May fail to converge

# RIGHT: Scale AFTER polynomial expansion
Pipeline([
    ('poly',   PolynomialFeatures(degree=5, include_bias=False)),
    ('scaler', StandardScaler()),   # Scale the expanded features
    ('linear', LinearRegression())
])

# WRONG: Polynomial features then fit directly
X_poly = PolynomialFeatures(degree=5).fit_transform(X)
LinearRegression().fit(X_poly, y)  # May fail to converge

# RIGHT: Scale AFTER polynomial expansion
Pipeline([
    ('poly',   PolynomialFeatures(degree=5, include_bias=False)),
    ('scaler', StandardScaler()),   # Scale the expanded features
    ('linear', LinearRegression())
])

Pitfall 2: Using Training R² to Select Degree

Problem: Training R² always increases with degree — useless for selection.

Python

# WRONG: Choose degree with best training R²
for degree in range(1, 20):
    ...
    print(f"Training R² = {train_r2}")  # Always increases!

# RIGHT: Use cross-validation or test set
cv_r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()

# WRONG: Choose degree with best training R²
for degree in range(1, 20):
    ...
    print(f"Training R² = {train_r2}")  # Always increases!

# RIGHT: Use cross-validation or test set
cv_r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()

Pitfall 3: Applying to Very High Degree Without Regularization

Problem: Degree 15+ without regularization leads to extreme overfitting.

Python

# RISKY: High degree, no regularization
Pipeline([
    ('poly',   PolynomialFeatures(degree=15)),
    ('linear', LinearRegression())         # Overfits badly
])

# SAFE: High degree with Ridge
Pipeline([
    ('poly',   PolynomialFeatures(degree=15)),
    ('scaler', StandardScaler()),
    ('ridge',  Ridge(alpha=10))            # Controls overfitting
])

# RISKY: High degree, no regularization
Pipeline([
    ('poly',   PolynomialFeatures(degree=15)),
    ('linear', LinearRegression())         # Overfits badly
])

# SAFE: High degree with Ridge
Pipeline([
    ('poly',   PolynomialFeatures(degree=15)),
    ('scaler', StandardScaler()),
    ('ridge',  Ridge(alpha=10))            # Controls overfitting
])

Pitfall 4: Feature Explosion with Many Input Features

Problem: 20 features at degree 3 → 1,771 polynomial features.

Plaintext

Features: 20, Degree: 3
Combinations: C(20+3, 3) = 1,771 features

With 1,000 training examples → severe overfitting risk

Features: 20, Degree: 3
Combinations: C(20+3, 3) = 1,771 features

With 1,000 training examples → severe overfitting risk

Solution: Use only low-degree (2) with many features, or select features first.

Pitfall 5: Extrapolation Disasters

Problem: High-degree polynomials behave wildly outside the training range.

Python

# Degree-10 model trained on x ∈ [0, 10]
# Prediction at x = 11: may be enormous
# Prediction at x = 15: completely unreliable

# Always warn users: polynomial models are only valid within training range
print(f"Valid prediction range: [{X.min():.1f}, {X.max():.1f}]")

# Degree-10 model trained on x ∈ [0, 10]
# Prediction at x = 11: may be enormous
# Prediction at x = 15: completely unreliable

# Always warn users: polynomial models are only valid within training range
print(f"Valid prediction range: [{X.min():.1f}, {X.max():.1f}]")

When to Use Polynomial Regression

Use When:

Plaintext

✓ Scatter plot shows a clear curve (quadratic, S-shape)
✓ Physical knowledge suggests non-linear relationship
  (kinetic energy, compound interest, drug dose-response)
✓ Residual plot from linear model shows curved pattern
✓ Relatively low-dimensional input (1-5 features)
✓ Have enough data to support added parameters

✓ Scatter plot shows a clear curve (quadratic, S-shape)
✓ Physical knowledge suggests non-linear relationship
  (kinetic energy, compound interest, drug dose-response)
✓ Residual plot from linear model shows curved pattern
✓ Relatively low-dimensional input (1-5 features)
✓ Have enough data to support added parameters

Consider Alternatives When:

Plaintext

→ Many input features: Use tree models (Random Forest, XGBoost)
→ Relationship unknown/complex: Try gradient boosting or neural nets
→ Very noisy data: Higher variance with polynomial features
→ Need to extrapolate: Polynomial extrapolation is unreliable
→ Interpretability critical: Polynomial coefficients hard to interpret

→ Many input features: Use tree models (Random Forest, XGBoost)
→ Relationship unknown/complex: Try gradient boosting or neural nets
→ Very noisy data: Higher variance with polynomial features
→ Need to extrapolate: Polynomial extrapolation is unreliable
→ Interpretability critical: Polynomial coefficients hard to interpret

Comparison: Linear vs. Polynomial Regression

Aspect	Linear Regression	Polynomial Regression
Decision boundary	Straight line/plane	Curve/surface
Equation	ŷ = Xw + b	ŷ = X_poly·w + b
Parameters	n + 1	Depends on degree and n
Feature engineering	None needed	Polynomial expansion
Bias	High (for curved data)	Tunable via degree
Variance	Low	Grows with degree
Overfitting risk	Low	High at large degrees
Interpretability	High	Decreases with degree
Regularization	Ridge/Lasso	Ridge/Lasso (more important)
Degree selection	N/A	Cross-validation
Extrapolation	Linear, predictable	Unreliable
Best for	Linear data	Curved, polynomial relationships

Conclusion: Curves Within the Linear Framework

Polynomial regression is a powerful and elegant solution to one of linear regression’s most obvious limitations. By creating polynomial features from the original inputs, it extends the linear framework to model curved relationships — all without changing any of the underlying machinery.

The central insight — that a non-linear relationship between x and y can become linear when expressed in terms of polynomial features z₁=x, z₂=x², z₃=x³ — is one of machine learning’s most instructive ideas. It shows that “linear regression” really means “linear in the parameters,” not “linear in the raw inputs.” This opens the door to a huge class of feature transformations (logarithms, square roots, interactions, ratios) that all fit within the standard linear regression framework.

The key lessons:

Degree selection requires cross-validation. Training R² always increases — only held-out data reveals whether higher degrees actually generalize.

Regularization is your safety net. Ridge regression with polynomial features gives you flexibility without wild overfitting, especially at higher degrees.

Feature scaling is mandatory. After polynomial expansion, features like x³ and x are on completely different scales — standardize before fitting.

Be cautious with extrapolation. Polynomial models curve sharply outside their training range, making predictions unreliable there.

Feature explosion demands care. With multiple inputs, degree-2 polynomial features multiply quickly — use Ridge regularization or feature selection to manage.

Polynomial regression sits at a beautiful intersection: it leverages the elegant simplicity of linear regression while capturing the curved complexity of real-world data. Master it, and you have a flexible, interpretable tool for modeling the non-linear patterns that simple linear regression can never reach.

0 Comments

Inline Feedbacks

View all comments

Discover More

Click For More

Polynomial Regression: When Linear Isn’t Enough

Introduction: When Data Curves

When Linear Regression Fails

The Problem of Non-Linear Data

Recognising Non-Linear Patterns

The Core Idea: Adding Powers as Features

Creating Polynomial Features

The Model Equations

Why It’s Still “Linear Regression”

Step-by-Step: Fitting the Stopping Distance Data

Step 1: Prepare Data and Polynomial Features

Step 2: Fit Models of Different Degrees

Step 3: Examine the Learned Coefficients

The Bias-Variance Tradeoff in Polynomial Regression

Underfitting: Degree Too Low

Overfitting: Degree Too High

The Sweet Spot

Visual Demonstration on Noisy Data

Choosing the Right Degree: Cross-Validation

Learning Curve by Degree

Grid Search for Degree

Regularized Polynomial Regression

Ridge Polynomial Regression

Selecting Regularization Strength

Multiple Features with Polynomial Expansion

Feature Count with Polynomial Expansion

Interaction Terms

Example with Two Features

Complete Real-World Example: Engine Performance

Problem: Predict Fuel Efficiency from Engine Parameters

Common Pitfalls and Best Practices

Pitfall 1: Forgetting to Scale Features

Pitfall 2: Using Training R² to Select Degree

Pitfall 3: Applying to Very High Degree Without Regularization

Pitfall 4: Feature Explosion with Many Input Features

Pitfall 5: Extrapolation Disasters

When to Use Polynomial Regression

Use When:

Consider Alternatives When:

Comparison: Linear vs. Polynomial Regression

Conclusion: Curves Within the Linear Framework

Discover More

Apple Patches Actively Exploited WebKit Zero-Days – Why This One Matters For Everyone

Understanding the Cost Function in Linear Regression

Alphabet Announces Massive AI Capex Increase, Shares Fall

Choosing the Right Raspberry Pi Model: A Beginner’s Guide

Default Arguments in C++ Functions: Flexible Function Calls

Finding Total Resistance in Parallel Circuits: The Reciprocal Method