Choosing the Right K Value in KNN

Learn how to choose the optimal K in K-Nearest Neighbors. Covers cross-validation sweeps, elbow method, bias-variance tradeoff, odd K rule, and Python implementations.

By Techietory on May 19, 2026

The optimal K in K-Nearest Neighbors is found by cross-validated grid search: train and evaluate the model at multiple K values, then select the K that maximizes validation performance. Small K values (K=1) produce complex, noisy decision boundaries that overfit; large K values produce overly smooth boundaries that underfit. For many datasets the best K falls somewhere between 3 and 30, and cross-validation gives a nearly unbiased estimate of where this sweet spot lies. The winning score itself is slightly optimistic, because it is the maximum over many candidates.

Introduction

Among the hyperparameters in machine learning, K in KNN has an unusually direct and interpretable effect on model behavior. Every increase in K smooths the decision boundary. Every decrease makes it more complex. There is no obscure internal mechanism — you can literally watch the boundary change by plotting it as K increases from 1 to 100.

This interpretability is a gift for understanding the bias-variance tradeoff. K=1 is the purest possible high-variance model: each prediction is determined entirely by the single nearest training point, memorizing the training set perfectly while generalizing poorly. K=N (all training samples) is the purest possible high-bias model: every prediction becomes the global majority class, ignoring all local structure. Somewhere between these extremes, your data has an optimal K.

Finding that K is not guesswork — it is a systematic process using cross-validation. This article covers every aspect of K selection: the theoretical tradeoffs, practical selection methods from simple cross-validation sweeps to statistically rigorous tests, common rules of thumb and when they fail, how K interacts with dataset size and dimensionality, and a complete Python implementation of the selection workflow.

The Bias-Variance Tradeoff Through the Lens of K

Before diving into selection methods, it helps to have precise intuition for what K controls.

K = 1: Maximum Variance, Zero Bias on Training Data

With K=1, each prediction is made solely by the single nearest training neighbor. This model has zero training error (barring identical points with conflicting labels): every training point is its own nearest neighbor, so it predicts every training label perfectly. But the decision boundary traces around every individual point, including mislabeled examples and outliers. On unseen data, performance suffers because the model has memorized noise rather than learned structure.

The variance of a K=1 classifier is high: small perturbations to the training set (adding or removing a few points, changing a few labels) can cause large changes to the decision boundary in regions near those points.
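One way to make this instability visible is to refit the classifier on bootstrap resamples of the training set and measure how often its predictions change. The sketch below is an illustration under assumptions, not part of the original workflow: the helper name `bootstrap_instability`, the dataset settings, and the number of resamples are all made up for the demonstration. Features are left unscaled because the two moons features already share a similar range.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier


def bootstrap_instability(X, y, X_query, k, n_boot=40, seed=0):
    """Average per-point disagreement of KNN predictions across bootstrap refits."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_boot, len(X_query)))
    for b in range(n_boot):
        # Resample the training set with replacement and refit
        idx = rng.integers(0, len(X), size=len(X))
        model = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        preds[b] = model.predict(X_query)
    p = preds.mean(axis=0)  # fraction of refits that vote for class 1
    # p * (1 - p) is 0 when every refit agrees and 0.25 when they split evenly
    return float(np.mean(p * (1 - p)))


# Hypothetical settings chosen only for this demonstration
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_query, _ = make_moons(n_samples=500, noise=0.3, random_state=1)

instability = {k: bootstrap_instability(X, y, X_query, k) for k in (1, 15)}
for k, value in instability.items():
    print(f"K={k:>2}: prediction instability = {value:.4f}")
```

On noisy data like this, the K=1 value should come out noticeably higher than the K=15 value, which is the variance gap described above.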

K = N: Maximum Bias, Minimum Variance

With K=N (using all training samples) and uniform weights, every prediction is identical: the global majority class. This is the simplest possible predictor: it ignores all features entirely. The variance is essentially zero, since adding or removing a few training points rarely changes the global majority class. But bias is at its maximum: the model fails to capture any local structure.
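Both extremes are easy to check directly. The following is a minimal sketch on made-up toy data (two overlapping Gaussian blobs with a 60/40 class split; the data and variable names are assumptions for illustration only).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical toy data: 60 points of class 0 and 40 of class 1 (overlapping blobs)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(1.5, 1.0, size=(40, 2))])
y = np.array([0] * 60 + [1] * 40)

# K=1: each training point is its own nearest neighbor
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_acc_k1 = knn_1.score(X, y)

# K=N: every query sees all N points, so the vote is always the global majority
knn_n = KNeighborsClassifier(n_neighbors=len(y)).fit(X, y)
preds_kn = knn_n.predict(rng.normal(0.0, 3.0, size=(200, 2)))

print(f"K=1 training accuracy: {train_acc_k1:.3f}")
print(f"K=N distinct predictions: {np.unique(preds_kn)}")
```

The K=1 model scores perfectly on its own training data despite the class overlap, while the K=N model returns class 0 (the majority) for every query, wherever it lies.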

The Optimal K: The Bias-Variance Sweet Spot

The optimal K minimizes total prediction error, which decomposes into squared bias, variance, and irreducible error:

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

As K increases from 1 to N:

Bias increases: the model’s predictions become increasingly smoothed, losing local precision
Variance decreases: each prediction averages more neighbors, becoming more stable

The optimal K sits at the bottom of the total error curve — where the benefit of lower variance no longer outweighs the cost of higher bias.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')


def visualize_k_effect(X, y, k_values=[1, 3, 7, 15, 30, 100], figsize=(18, 10)):
    """
    Visualize how decision boundaries change as K increases.
    Shows the bias-variance tradeoff directly.
    """
    scaler_v = StandardScaler()
    X_s = scaler_v.fit_transform(X)

    x_min, x_max = X_s[:, 0].min() - 0.5, X_s[:, 0].max() + 0.5
    y_min, y_max = X_s[:, 1].min() - 0.5, X_s[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 250),
                          np.linspace(y_min, y_max, 250))
    grid = np.c_[xx.ravel(), yy.ravel()]

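    n_cols = len(k_values)
    # squeeze=False keeps axes 2-D even for a single K, so the loop below always works
    fig, axes = plt.subplots(1, n_cols, figsize=figsize, squeeze=False)
    axes = axes.ravel()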

    colors_cls = ['steelblue', 'coral']
    colors_bg  = ['#d0e8f8', '#fde8e0']

    for ax, k in zip(axes, k_values):
        knn_v = KNeighborsClassifier(n_neighbors=k)
        knn_v.fit(X_s, y)

        Z = knn_v.predict(grid).reshape(xx.shape)

        ax.contourf(xx, yy, Z, alpha=0.4,
                    colors=colors_bg[:len(np.unique(y))])
        ax.contour(xx, yy, Z, colors='black', linewidths=0.7, alpha=0.5)

        for cls_i, color in zip(np.unique(y), colors_cls):
            mask = y == cls_i
            ax.scatter(X_s[mask, 0], X_s[mask, 1],
                       c=color, edgecolors='white', s=35,
                       linewidth=0.4, alpha=0.85)

        train_acc = knn_v.score(X_s, y)
        ax.set_title(f'K = {k}\nTrain acc = {train_acc:.3f}',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Feature 1', fontsize=9)
        ax.set_ylabel('Feature 2', fontsize=9)
        ax.grid(True, alpha=0.2)

        # Label the regime (rough heuristic cutoffs for illustration, not a statistical test)
        if k <= 2:
            regime = "Overfit"
            color_label = '#cc3333'
        elif k >= len(y) // 3:
            regime = "Underfit"
            color_label = '#3366cc'
        else:
            regime = "Balanced"
            color_label = '#33aa33'

        ax.text(0.05, 0.05, regime,
                transform=ax.transAxes, fontsize=10,
                color=color_label, fontweight='bold',
                bbox=dict(boxstyle='round', fc='white', alpha=0.8))

    plt.suptitle('Effect of K on KNN Decision Boundary\n'
                 '(Left = high variance/overfit → Right = high bias/underfit)',
                 fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.savefig('knn_k_effect_boundaries.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("Saved: knn_k_effect_boundaries.png")


np.random.seed(42)
X_vis, y_vis = make_moons(n_samples=300, noise=0.25, random_state=42)
visualize_k_effect(X_vis, y_vis, k_values=[1, 3, 7, 15, 30, 100])

Method 1: Cross-Validated Grid Search

The most reliable and general method for K selection is cross-validated grid search: evaluate the model at every candidate K value using cross-validation, and select the K with the best average validation score.
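For a quick version of this search, scikit-learn's `GridSearchCV` can run the whole sweep in one call. This is a compact sketch rather than the article's full workflow: the Wine dataset, the K range of 1 to 30, and the 5-fold setup are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 31))},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["knn__n_neighbors"]
print(f"Best K: {best_k}, mean CV accuracy: {search.best_score_:.4f}")
```

Putting the scaler inside the pipeline matters for KNN: scaling on the full dataset before splitting would leak validation-fold statistics into the distance computation.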

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, load_wine, load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def knn_k_selection_cv(X, y, k_range=None, cv_folds=5, scoring='accuracy',
                         n_repeats=3, random_state=42):
    """
    Select optimal K for KNN using repeated stratified cross-validation.

    Args:
        X, y:         Features and labels
        k_range:      List/array of K values to try
        cv_folds:     Number of CV folds
        scoring:      Sklearn metric string
        n_repeats:    Number of CV repetitions for lower-variance estimates
        random_state: Random seed

    Returns:
        Dictionary with results, optimal K, and plot data
    """
    from sklearn.model_selection import RepeatedStratifiedKFold

    if k_range is None:
        n = len(y)
        k_range = list(range(1, min(31, n // cv_folds)))

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn',    KNeighborsClassifier())
    ])

    cv = RepeatedStratifiedKFold(n_splits=cv_folds, n_repeats=n_repeats,
                                  random_state=random_state)

    mean_scores = []
    std_scores  = []

    for k in k_range:
        pipeline.set_params(knn__n_neighbors=k)
        scores = cross_val_score(pipeline, X, y, cv=cv,
                                  scoring=scoring, n_jobs=-1)
        mean_scores.append(scores.mean())
        std_scores.append(scores.std())

    mean_scores = np.array(mean_scores)
    std_scores  = np.array(std_scores)
    best_idx    = np.argmax(mean_scores)
    best_k      = k_range[best_idx]
    best_score  = mean_scores[best_idx]

    return {
        'k_range':     np.array(k_range),
        'mean_scores': mean_scores,
        'std_scores':  std_scores,
        'best_k':      best_k,
        'best_score':  best_score,
        'best_idx':    best_idx,
    }


def plot_k_selection_curve(results, title='KNN: K Selection Curve',
                            figsize=(10, 6)):
    """
    Plot mean CV score ± 1 std across K values.
    Marks the optimal K and the '1-SE rule' K.
    """
    k_range  = results['k_range']
    means    = results['mean_scores']
    stds     = results['std_scores']
    best_k   = results['best_k']
    best_idx = results['best_idx']

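    # 1-SE-style rule: select the largest K whose mean score is within the tolerance
    # of the best. Larger K means a smoother, simpler model.
    # Note: the tolerance used here is the std of the CV scores, which is wider than
    # a true standard error (std / sqrt(number of scores)), so it favors larger K.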
    one_se_threshold = means[best_idx] - stds[best_idx]
    one_se_candidates = [k for k, m in zip(k_range, means) if m >= one_se_threshold]
    k_1se = max(one_se_candidates)  # Prefer largest K (simplest) within 1-SE

    fig, ax = plt.subplots(figsize=figsize)

    ax.plot(k_range, means, 'o-', color='steelblue',
            linewidth=2.5, markersize=7, label='CV score (mean)')
    ax.fill_between(k_range, means - stds, means + stds,
                    alpha=0.15, color='steelblue', label='±1 std')

    # Mark optimal K
    ax.axvline(x=best_k, color='coral', linestyle='--', lw=2,
               label=f'Best K = {best_k} (score={means[best_idx]:.4f})')

    # Mark 1-SE K (if different from best)
    if k_1se != best_k:
        ax.axvline(x=k_1se, color='mediumseagreen', linestyle=':', lw=2,
                   label=f'1-SE rule K = {k_1se} (simpler model)')

    # Mark 1-SE threshold
    ax.axhline(y=one_se_threshold, color='gray', linestyle=':', lw=1.5,
               alpha=0.6, label=f'1-SE threshold = {one_se_threshold:.4f}')

    ax.set_xlabel('K (Number of Neighbors)', fontsize=12)
    ax.set_ylabel(f'CV {results.get("scoring", "Score")}', fontsize=12)
    ax.set_title(f'{title}\n(Best K={best_k}, Score={means[best_idx]:.4f})',
                 fontsize=13, fontweight='bold')
    ax.legend(fontsize=10, loc='lower right')
    ax.grid(True, alpha=0.3)

    # Add text annotations for the regimes
    ax.axvspan(k_range[0], k_range[len(k_range)//4],
               alpha=0.04, color='red', label='_nolegend_')
    ax.axvspan(k_range[3*len(k_range)//4], k_range[-1],
               alpha=0.04, color='blue', label='_nolegend_')

    ax.text(k_range[1], ax.get_ylim()[0] + 0.01,
            'Overfit\nregion', fontsize=8, color='darkred', alpha=0.7)
    ax.text(k_range[-3], ax.get_ylim()[0] + 0.01,
            'Underfit\nregion', fontsize=8, color='darkblue', alpha=0.7,
            ha='right')

    plt.tight_layout()
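    # Build the filename from the title so repeated calls do not overwrite each other
    slug = ''.join(c.lower() if c.isalnum() else '_' for c in title).strip('_')
    filename = f'{slug}.png'
    plt.savefig(filename, dpi=150)
    plt.show()
    print(f"Saved: {filename}")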

    return best_k, k_1se


# Run on multiple datasets
datasets = {
    'Wine':          load_wine(),
    'Breast Cancer': load_breast_cancer(),
}

print("=== K Selection via Cross-Validation ===\n")

for name, data in datasets.items():
    results = knn_k_selection_cv(
        data.data, data.target,
        k_range=list(range(1, 41)),
        cv_folds=5, scoring='accuracy', n_repeats=3
    )
    results['scoring'] = 'Accuracy'

    best_k, k_1se = plot_k_selection_curve(
        results, title=f'KNN K-Selection: {name} Dataset'
    )

    print(f"\n  {name} Dataset:")
    print(f"    Samples: {len(data.target)}, Features: {data.data.shape[1]}")
    print(f"    Best K (max score):  K={results['best_k']}, "
          f"Score={results['best_score']:.4f}")
    print(f"    1-SE Rule K:         K={k_1se} (prefers simpler model)")
    print()

The 1-Standard Error Rule: Prefer Simpler Models

When multiple K values produce statistically indistinguishable performance, which should you choose? The 1-Standard Error Rule from Leo Breiman’s work on decision trees (later adapted broadly) provides a principled answer: select the simplest model whose performance is within 1 standard error of the best.

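For KNN, “simpler” means larger K (smoother boundary, fewer effective parameters). The 1-SE rule therefore recommends the largest K whose mean CV score is at least (best_score − SE), where SE is the standard error of the best score: the standard deviation of its fold scores divided by the square root of the number of folds. The code in this article uses the raw standard deviation (best_std) as the tolerance instead, which gives a wider band and so leans further toward large K.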

Why this makes sense:

A model at K=7 with CV score 0.912 ± 0.015 and a model at K=15 with CV score 0.908 ± 0.014 are not meaningfully different in performance. The difference (0.004) is well within the noise level (0.015). The K=15 model is simpler, will generalize more smoothly, and is less sensitive to noise near the boundary. Prefer it.
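For comparison, here is a sketch of the rule using a true standard error (standard deviation of the fold scores divided by the square root of the number of folds). The function name `one_se_rule_true_se` and the fold scores are hypothetical, made up so the arithmetic is easy to follow. With repeated cross-validation the fold scores are correlated, so this standard error should be read as approximate.

```python
import numpy as np


def one_se_rule_true_se(k_values, fold_scores):
    """
    fold_scores: array of shape (n_k, n_folds), one row of CV scores per K.
    Returns (chosen K, threshold), where chosen K is the largest K whose
    mean score is within one standard error of the best mean score.
    """
    fold_scores = np.asarray(fold_scores, dtype=float)
    means = fold_scores.mean(axis=1)
    # Standard error of the mean = sample std / sqrt(number of folds)
    ses = fold_scores.std(axis=1, ddof=1) / np.sqrt(fold_scores.shape[1])
    best = int(np.argmax(means))
    threshold = means[best] - ses[best]
    valid = [k for k, m in zip(k_values, means) if m >= threshold]
    return max(valid), threshold


# Hypothetical fold scores, invented for illustration
k_values = [1, 5, 9, 13]
fold_scores = [
    [0.80, 0.84, 0.82, 0.86, 0.78],  # K=1,  mean 0.820
    [0.90, 0.92, 0.88, 0.94, 0.91],  # K=5,  mean 0.910 (best)
    [0.90, 0.91, 0.89, 0.92, 0.90],  # K=9,  mean 0.904
    [0.85, 0.86, 0.84, 0.87, 0.83],  # K=13, mean 0.850
]

chosen_k, threshold = one_se_rule_true_se(k_values, fold_scores)
print(f"1-SE threshold: {threshold:.3f}, chosen K: {chosen_k}")
```

With these made-up scores the best mean is at K=5 (0.910) with a standard error of 0.010, so the threshold is 0.900. K=9 (0.904) clears it and K=13 (0.850) does not, so the rule returns K=9.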

Python

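import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier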

def one_se_rule(k_range, mean_scores, std_scores):
    """
    Apply the 1-Standard Error Rule to select K.
    
    Returns the largest K whose mean score is within
    1 standard error of the best score.
    
    This prefers simpler (larger K, smoother) models
    when performance differences are not meaningful.
    """
    best_idx = np.argmax(mean_scores)
    threshold = mean_scores[best_idx] - std_scores[best_idx]
    
    # All K values where score >= threshold
    valid_ks = [k for k, score in zip(k_range, mean_scores)
                if score >= threshold]
    
    # Return the largest valid K (simplest model)
    return max(valid_ks)


def compare_k_selection_rules(X, y, k_range, cv_folds=10, random_state=42):
    """
    Compare different K selection rules on the same dataset.
    
    Rules compared:
    1. Maximum CV score (most common)
    2. 1-Standard Error Rule (prefer simpler)
    3. sqrt(n) rule of thumb
    4. Odd K closest to sqrt(n) (binary classification)
    """
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])

    cv = RepeatedStratifiedKFold(n_splits=cv_folds, n_repeats=5,
                                  random_state=random_state)

    mean_scores, std_scores = [], []
    for k in k_range:
        pipeline.set_params(knn__n_neighbors=k)
        sc = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
        mean_scores.append(sc.mean())
        std_scores.append(sc.std())

    mean_scores = np.array(mean_scores)
    std_scores  = np.array(std_scores)

    n = len(y)

    # Rule 1: Maximum CV score
    k_max = k_range[np.argmax(mean_scores)]

    # Rule 2: 1-SE Rule
    k_1se = one_se_rule(k_range, mean_scores, std_scores)

    # Rule 3: sqrt(n)
    k_sqrt = int(np.round(np.sqrt(n)))
    k_sqrt = max(1, min(k_sqrt, max(k_range)))

    # Rule 4: Odd K nearest to sqrt(n) (reduces ties in binary classification)
    k_odd = k_sqrt if k_sqrt % 2 == 1 else k_sqrt + 1
    if k_odd > max(k_range):
        k_odd = max(1, k_sqrt - 1)  # step down instead, so K stays odd and in range

    rules = {
        f'Max CV score':      k_max,
        f'1-SE Rule':         k_1se,
        f'sqrt(n)={k_sqrt}':  k_sqrt,
        f'Odd sqrt(n)':       k_odd,
    }

    print(f"  n={n}, sqrt(n)≈{np.sqrt(n):.1f}")
    print(f"\n  {'Rule':<25} | {'K':>4} | {'CV Score':>10} | {'Assessment'}")
    print("  " + "-" * 60)

    best_idx = int(np.argmax(mean_scores))
    best_score = mean_scores[best_idx]
    for rule_name, k_val in rules.items():
        if k_val not in k_range:
            # No CV score exists for a K outside the evaluated range
            print(f"  {rule_name:<25} | {k_val:>4} | {'N/A':>10} | not evaluated")
            continue
        score = mean_scores[list(k_range).index(k_val)]
        within_1se = (best_score - score) <= std_scores[best_idx]
        flag = "✓ within 1-SE" if within_1se else "✗ outside 1-SE"
        print(f"  {rule_name:<25} | {k_val:>4} | {score:>10.4f} | {flag}")

    return rules, mean_scores, std_scores


np.random.seed(42)
X_rule, y_rule = make_classification(
    n_samples=600, n_features=15, n_informative=10, random_state=42
)
k_candidates = list(range(1, 51))

print("=== K Selection Rules Comparison ===\n")
rules, means, stds = compare_k_selection_rules(X_rule, y_rule, k_candidates)

Rules of Thumb: When to Use Them and When Not To

Several rules of thumb for choosing K appear frequently in the literature. Understanding their derivations helps you know when they apply.

The Square Root Rule: K ≈ √n

The most popular rule of thumb sets K ≈ √n, where n is the number of training samples. This emerges from an asymptotic analysis: under certain regularity conditions, the optimal K for minimizing mean squared error in KNN regression scales as O(n^(2/(d+2))), and for d=2 features this simplifies to approximately √n.
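A heuristic sketch of where that exponent comes from, assuming a Lipschitz regression function and noise variance $\sigma^2$ (the constant $C$ is a placeholder, not a value from any particular dataset): averaging K neighbors gives a variance of roughly $\sigma^2 / K$, while the K-th neighbor lies at a distance of order $(K/n)^{1/d}$, so the squared bias scales as $(K/n)^{2/d}$. Leaving out the irreducible term, which does not depend on K:

$\text{Error}(K) \approx C \left(\frac{K}{n}\right)^{2/d} + \frac{\sigma^2}{K}$

Setting the derivative with respect to K to zero balances the two terms:

$\frac{2C}{d} \cdot \frac{K^{2/d - 1}}{n^{2/d}} = \frac{\sigma^2}{K^2} \;\Rightarrow\; K^{(d+2)/d} \propto n^{2/d} \;\Rightarrow\; K^{*} \propto n^{2/(d+2)}$

For d=2 the exponent is 1/2, which gives the √n rule. For d=10 it drops to 1/6, so under the same assumptions the optimal K grows much more slowly with n.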

When it applies well: Balanced binary classification, moderate dimensionality (2–10 features), n in the hundreds to low thousands.

When it fails:

High dimensionality: the optimal K grows more slowly in high dimensions
Class imbalance: √n may select far too many neighbors when the minority class is sparse
Small datasets: √50 ≈ 7, which is already 14% of the data
Non-uniform data: when class density varies strongly across the feature space

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def evaluate_sqrt_rule(n_values, n_features=10, cv_folds=5, random_state=42):
    """
    Test how well the sqrt(n) rule tracks the true optimal K
    across different dataset sizes.
    """
    np.random.seed(random_state)

    results = []

    for n in n_values:
        X_n, y_n = make_classification(
            n_samples=n, n_features=n_features,
            n_informative=max(3, n_features // 2),
            random_state=random_state
        )

        k_sqrt = max(1, int(np.round(np.sqrt(n))))
        k_range = list(range(1, min(k_sqrt * 3 + 1, n // 5)))

        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsClassifier())
        ])

        cv = RepeatedStratifiedKFold(n_splits=cv_folds, n_repeats=3,
                                      random_state=random_state)

        mean_scores = []
        for k in k_range:
            pipeline.set_params(knn__n_neighbors=k)
            sc = cross_val_score(pipeline, X_n, y_n, cv=cv,
                                  scoring='accuracy', n_jobs=-1)
            mean_scores.append(sc.mean())

        k_true_best = k_range[np.argmax(mean_scores)]
        score_sqrt  = mean_scores[k_range.index(k_sqrt)] if k_sqrt in k_range else None
        score_best  = max(mean_scores)
        # Compare against None explicitly: a legitimate score of 0.0 is falsy
        gap         = score_best - score_sqrt if score_sqrt is not None else None

        results.append({
            'n':          n,
            'k_sqrt':     k_sqrt,
            'k_best':     k_true_best,
            'score_sqrt': score_sqrt,
            'score_best': score_best,
            'gap':        gap,
        })

    print("=== sqrt(n) Rule Evaluation ===\n")
    print(f"  {'n':>6} | {'√n':>5} | {'sqrt(n) K':>10} | {'True Best K':>12} | "
          f"{'Sqrt Score':>11} | {'Best Score':>11} | {'Gap':>6}")
    print("  " + "-" * 80)

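    for r in results:
        gap_str  = f"{r['gap']:.4f}" if r['gap'] is not None else "N/A"
        # k_sqrt can fall outside the searched range (small n), leaving score_sqrt as None
        sqrt_str = f"{r['score_sqrt']:.4f}" if r['score_sqrt'] is not None else "N/A"
        print(f"  {r['n']:>6,} | {np.sqrt(r['n']):>5.1f} | {r['k_sqrt']:>10} | "
              f"{r['k_best']:>12} | {sqrt_str:>11} | "
              f"{r['score_best']:>11.4f} | {gap_str:>6}")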

    print("\n  Interpretation: Small gap = sqrt(n) rule works well here.")
    print("  Large gap = cross-validation would give better results.")


evaluate_sqrt_rule(n_values=[100, 300, 500, 1000, 2000, 5000])

The Odd-K Rule for Binary Classification

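When K is even, a binary vote can tie: K/2 neighbors vote for class 0 and K/2 vote for class 1. The algorithm must then break the tie somehow, usually by defaulting to the smaller class index, which introduces an asymmetric bias. With uniform (unweighted) voting, an odd K rules out ties entirely in binary classification. With three or more classes, ties remain possible for any K greater than 1, so the rule is specific to the two-class case.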

Python

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def compare_odd_vs_even_k(X, y, k_pairs=None):
    """
    Compare adjacent odd and even K values to demonstrate
    tie-breaking effects in binary classification.
    """
    if k_pairs is None:
        k_pairs = [(2, 3), (4, 5), (6, 7), (8, 9), (10, 11), (14, 15), (20, 21)]

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])

    print("=== Odd vs Even K in Binary Classification ===\n")
    print(f"  {'K pair (even, odd)':>22} | {'Even K Score':>13} | "
          f"{'Odd K Score':>12} | {'Odd Better?':>11}")
    print("  " + "-" * 68)

    odd_better_count = 0
    for k_even, k_odd in k_pairs:
        pipeline.set_params(knn__n_neighbors=k_even)
        score_even = cross_val_score(
            pipeline, X, y, cv=5, scoring='accuracy', n_jobs=-1
        ).mean()

        pipeline.set_params(knn__n_neighbors=k_odd)
        score_odd = cross_val_score(
            pipeline, X, y, cv=5, scoring='accuracy', n_jobs=-1
        ).mean()

        odd_better = score_odd >= score_even
        if odd_better:
            odd_better_count += 1
        flag = "✓ Yes" if odd_better else "✗ No"
        print(f"  {'(' + str(k_even) + ', ' + str(k_odd) + ')':>22} | "
              f"{score_even:>13.4f} | {score_odd:>12.4f} | {flag:>11}")

    # The comparison above uses >=, so equal scores count in favor of odd K
    print(f"\n  Odd K matched or outperformed even K: {odd_better_count}/{len(k_pairs)} comparisons")
    print("  Conclusion: Prefer odd K for binary classification to avoid ties.")


np.random.seed(42)
X_odd, y_odd = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=2, random_state=42
)

compare_odd_vs_even_k(X_odd, y_odd)
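
The tie mechanism described above can also be inspected directly instead of through accuracy. The following is a minimal sketch, not part of the original workflow: the helper name count_vote_ties, the dataset, and the 300/100 split are illustrative assumptions. It uses NearestNeighbors to fetch each query point's K neighbors and counts how often the binary vote splits exactly in half.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors


def count_vote_ties(X_train, y_train, X_query, k):
    """Count query points whose k nearest neighbors split exactly 50/50 (labels 0/1)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, neighbor_idx = nn.kneighbors(X_query)
    votes_for_class_1 = y_train[neighbor_idx].sum(axis=1)
    # A tie means exactly k/2 votes for class 1, which can only happen when k is even
    return int(np.sum(2 * votes_for_class_1 == k))


# Hypothetical data and split, chosen only for illustration
X_tie, y_tie = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=2, random_state=0
)
X_train, y_train = X_tie[:300], y_tie[:300]
X_query = X_tie[300:]

tie_counts = {k: count_vote_ties(X_train, y_train, X_query, k) for k in range(1, 11)}
for k, ties in tie_counts.items():
    parity = "odd" if k % 2 else "even"
    print(f"  K={k:>2} ({parity:>4}): {ties} tied votes out of {len(X_query)} queries")
```

Odd K reports zero ties by construction. How often an even K ties depends on the data, so the even-K counts are best read as a rough measure of how many predictions hinge on the tie-breaking rule.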

Method 2: Grid Search with Scikit-learn

Scikit-learn’s GridSearchCV provides a clean, production-ready API for K selection with additional features: parallel evaluation, refit on best parameters, and direct integration with Pipelines.

Python

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Load dataset and make holdout split
data = load_breast_cancer()
X_bc, y_bc = data.data, data.target

X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_bc, y_bc, test_size=0.15, random_state=42, stratify=y_bc
)

print("=== GridSearchCV for K and Metric Selection ===\n")
print(f"  Dev set: {len(y_dev)} samples | Holdout: {len(y_hold)} samples\n")

# Pipeline: scaling + KNN
pipeline_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Parameter grid: K values + distance metrics + weighting schemes
param_grid = {
    'knn__n_neighbors': list(range(1, 31)),
    'knn__weights':     ['uniform', 'distance'],
    'knn__metric':      ['euclidean', 'manhattan'],
}

# 5-fold stratified cross-validation
from sklearn.model_selection import StratifiedKFold
cv_gs = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    pipeline_gs,
    param_grid,
    cv=cv_gs,
    scoring='roc_auc',           # Use AUC for medical data
    n_jobs=-1,
    refit=True,                  # Refit best model on all dev data
    verbose=0,
    return_train_score=True
)

grid_search.fit(X_dev, y_dev)

print(f"  Best parameters found:")
for param, val in grid_search.best_params_.items():
    param_clean = param.replace('knn__', '')
    print(f"    {param_clean}: {val}")

print(f"\n  Best CV AUC: {grid_search.best_score_:.4f}")

# Evaluate best model on holdout (only once!)
best_model = grid_search.best_estimator_
from sklearn.metrics import roc_auc_score, accuracy_score
y_hold_proba = best_model.predict_proba(X_hold)[:, 1]
y_hold_pred  = best_model.predict(X_hold)

holdout_auc = roc_auc_score(y_hold, y_hold_proba)
holdout_acc = accuracy_score(y_hold, y_hold_pred)

print(f"\n  Holdout AUC:      {holdout_auc:.4f}")
print(f"  Holdout Accuracy: {holdout_acc:.4f}")
print(f"  CV-Holdout gap:   {grid_search.best_score_ - holdout_auc:+.4f}")

# Show top 10 parameter combinations
import pandas as pd

cv_results_df = pd.DataFrame(grid_search.cv_results_)
top_10 = (cv_results_df[['param_knn__n_neighbors',
                           'param_knn__weights',
                           'param_knn__metric',
                           'mean_test_score',
                           'std_test_score',
                           'mean_train_score']]
           .sort_values('mean_test_score', ascending=False)
           .head(10))

print(f"\n  Top 10 configurations by CV AUC:\n")
print(f"  {'K':>4} | {'Weights':>10} | {'Metric':>10} | "
      f"{'Test AUC':>9} | {'±':>7} | {'Train AUC':>9} | Overfit?")
print("  " + "-" * 72)

for _, row in top_10.iterrows():
    gap = row['mean_train_score'] - row['mean_test_score']
    flag = "⚠" if gap > 0.05 else ""
    print(f"  {int(row['param_knn__n_neighbors']):>4} | "
          f"{str(row['param_knn__weights']):>10} | "
          f"{str(row['param_knn__metric']):>10} | "
          f"{row['mean_test_score']:>9.4f} | "
          f"{row['std_test_score']:>7.4f} | "
          f"{row['mean_train_score']:>9.4f} | {flag}")

How K Interacts with Dataset Properties

The optimal K is not independent of your dataset. Several dataset properties shift the optimal K up or down.

Effect of Dataset Size

As n grows, the optimal K generally grows too (roughly as √n), but more slowly in high dimensions. The intuition: with more data, each neighborhood contains more points, so you can afford to expand K without losing resolution.
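
The scaling described above can be made concrete with a small calculation. This is a sketch under a stated assumption, not a selection procedure: it uses the classical asymptotic rate K ∝ n^(4/(d+4)) from k-NN regression theory, which holds under smoothness assumptions and has an unknown constant (set to 1 here purely for illustration). The function name k_from_rate is hypothetical. At d = 4 the rate coincides with √n; as d increases the exponent shrinks, so K grows more slowly with n.

```python
import numpy as np


def k_from_rate(n, d, c=1.0):
    """Heuristic K ~ c * n**(4 / (d + 4)); c=1 is an arbitrary illustrative constant."""
    return max(1, int(round(c * n ** (4.0 / (d + 4)))))


sample_sizes = [100, 1_000, 10_000]
dims = [2, 4, 10, 50]

print(f"  {'n':>7} | {'sqrt(n)':>8} | " + " | ".join(f"d={d:<3}" for d in dims))
for n in sample_sizes:
    row = " | ".join(f"{k_from_rate(n, d):>5}" for d in dims)
    print(f"  {n:>7,} | {np.sqrt(n):>8.1f} | {row}")
```

These values are order-of-magnitude guides only. For a specific dataset, cross-validation remains the way to pick K.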

Effect of Class Imbalance

With severe class imbalance (e.g., 95% class 0, 5% class 1), the K nearest neighbors of any point will contain mostly class 0 examples — not necessarily because they are more similar, but because they are everywhere. This pushes the model to always predict class 0.

Mitigations: use weights='distance' so that closer neighbors count for more, reduce K (fewer neighbors means the vote is less dominated by the majority class), or combine KNN with oversampling of the minority class.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score

def analyze_k_vs_imbalance(prevalences, n_total=1000, k_range=None):
    """
    Show how class imbalance shifts the optimal K value.
    Uses F1 score (better for imbalanced data) rather than accuracy.
    """
    if k_range is None:
        k_range = list(range(1, 31))

    np.random.seed(42)
    f1_scorer = make_scorer(f1_score, zero_division=0)

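    fig, axes = plt.subplots(1, len(prevalences), figsize=(16, 5))
    # plt.subplots returns a bare Axes for a single panel; wrap it so zip() below still works
    axes = np.atleast_1d(axes)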

    print("=== Optimal K vs Class Imbalance ===\n")
    print(f"  {'Prevalence':>12} | {'Optimal K (F1)':>15} | {'Best F1':>9}")
    print("  " + "-" * 42)

    for ax, prev in zip(axes, prevalences):
        X_imb, y_imb = make_classification(
            n_samples=n_total, n_features=10, n_informative=6,
            weights=[1 - prev, prev], random_state=42
        )

        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsClassifier())
        ])

        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        means = []

        for k in k_range:
            pipeline.set_params(knn__n_neighbors=k)
            sc = cross_val_score(pipeline, X_imb, y_imb,
                                  cv=cv, scoring=f1_scorer, n_jobs=-1)
            means.append(sc.mean())

        means = np.array(means)
        best_k = k_range[np.argmax(means)]

        ax.plot(k_range, means, 'o-', color='steelblue', lw=2, markersize=5)
        ax.axvline(x=best_k, color='coral', linestyle='--', lw=2,
                   label=f'Best K={best_k}')
        ax.set_title(f'Prevalence = {prev*100:.0f}%\nBest K = {best_k}',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('K', fontsize=10)
        ax.set_ylabel('CV F1 Score', fontsize=10)
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)

        print(f"  {prev*100:>11.0f}% | {best_k:>15} | {means.max():>9.4f}")

    plt.suptitle('Effect of Class Imbalance on Optimal K\n'
                 '(Severe imbalance requires lower K to preserve minority class signal)',
                 fontsize=13, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.savefig('knn_k_vs_imbalance.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\nSaved: knn_k_vs_imbalance.png")


analyze_k_vs_imbalance(prevalences=[0.40, 0.20, 0.10, 0.05])
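
Two of the mitigations listed above, distance weighting and a smaller K, can be compared side by side. The following is a minimal sketch on synthetic data with roughly 5% minority prevalence; the dataset, the two K values, and the variable names are illustrative assumptions, and the outcome will differ on real data. Oversampling is left out because it typically requires an additional library.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% minority class (label 1); an illustrative assumption
X_w, y_w = make_classification(
    n_samples=1000, n_features=10, n_informative=6,
    weights=[0.95, 0.05], random_state=42
)
cv_w = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

f1_by_config = {}
for k in (3, 15):
    for weights in ("uniform", "distance"):
        model = Pipeline([
            ("scaler", StandardScaler()),
            ("knn", KNeighborsClassifier(n_neighbors=k, weights=weights)),
        ])
        # scoring="f1" evaluates the positive class (label 1), the minority here
        scores = cross_val_score(model, X_w, y_w, cv=cv_w, scoring="f1")
        f1_by_config[(k, weights)] = scores.mean()

for (k, weights), f1 in f1_by_config.items():
    print(f"  K={k:>2}, weights={weights:<8}: mean CV F1 = {f1:.4f}")
```

Reading the four numbers together shows whether the smaller K, the distance weighting, or both help on a given dataset, which is more reliable than assuming either mitigation always works.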

Effect of Dimensionality

In high dimensions, the curse of dimensionality means that neighbors are not meaningfully “near.” Using small K in high dimensions amplifies this problem: the single nearest neighbor may be nearly as far away as a random point. Larger K provides some averaging that partially compensates, but the fundamental problem remains.

Python

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def analyze_k_vs_dimensionality(feature_counts, n_informative_frac=0.5):
    """
    Show how the optimal K shifts with increasing dimensionality.
    """
    np.random.seed(42)
    n_total = 800
    k_range = list(range(1, 41))

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    print("=== Optimal K vs Feature Dimensionality ===\n")
    print(f"  {'Features':>10} | {'Informative':>12} | {'Optimal K':>10} | "
          f"{'Best Acc':>9} | Regime")
    print("  " + "-" * 58)

    optimal_ks = []

    for n_feat in feature_counts:
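        # Cap informative features at n_feat; otherwise n_feat=2 requests 3 informative
        # features and a negative n_redundant, and make_classification fails
        n_info = min(n_feat, max(3, int(n_feat * n_informative_frac)))
        n_red  = min(n_feat - n_info, n_info)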

        X_d, y_d = make_classification(
            n_samples=n_total, n_features=n_feat,
            n_informative=n_info, n_redundant=n_red,
            random_state=42
        )

        means = []
        for k in k_range:
            pipeline.set_params(knn__n_neighbors=k)
            sc = cross_val_score(pipeline, X_d, y_d,
                                  cv=cv, scoring='accuracy', n_jobs=-1)
            means.append(sc.mean())

        means = np.array(means)
        best_k = k_range[np.argmax(means)]
        optimal_ks.append(best_k)

        regime = ("Low-dim (KNN works well)" if n_feat <= 10
                  else "High-dim (KNN struggles)" if n_feat >= 50
                  else "Moderate")

        print(f"  {n_feat:>10} | {n_info:>12} | {best_k:>10} | "
              f"{means.max():>9.4f} | {regime}")

    print(f"\n  Pattern: Optimal K {'tends to increase' if optimal_ks[-1] > optimal_ks[0] else 'varies'} "
          f"with dimensionality.")
    print("  In very high dimensions, KNN performance degrades regardless of K.")
    print("  Consider PCA or feature selection before applying KNN.")


analyze_k_vs_dimensionality(feature_counts=[2, 5, 10, 20, 50, 100])
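
The claim that the nearest neighbor may be nearly as far away as a random point can be checked numerically. The sketch below is an illustration under simple assumptions (points drawn uniformly from the unit hypercube, Euclidean distance; the function name is hypothetical), not a statement about any particular real dataset. It reports the ratio of the nearest-neighbor distance to the mean distance: values near 0 mean the nearest neighbor is genuinely close, values near 1 mean it is barely closer than an average point.

```python
import numpy as np


def nearest_to_mean_distance_ratio(n_dims, n_points=500, n_queries=50, seed=0):
    """Average ratio of nearest-neighbor distance to mean distance for uniform random data."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    queries = rng.random((n_queries, n_dims))
    ratios = []
    for q in queries:
        dists = np.linalg.norm(points - q, axis=1)
        ratios.append(dists.min() / dists.mean())
    return float(np.mean(ratios))


concentration = {d: nearest_to_mean_distance_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in concentration.items():
    print(f"  d={d:>4}: nearest / mean distance = {ratio:.3f}")
```

As the ratio approaches 1, the identity of the "nearest" neighbors becomes close to arbitrary, which is why averaging over a larger K helps only partially and dimensionality reduction is usually the more effective remedy.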

Method 3: Learning Curve for K — Visualizing Bias-Variance Tradeoff

Rather than just reporting the optimal K, plotting the full training and validation score curves as K varies gives a visual diagnosis of the model’s behavior. This is the K-axis analogue of the training-size learning curve from Article 70.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def plot_k_bias_variance_curve(X, y, k_range=None, cv_folds=5,
                                 scoring='accuracy', title="KNN Bias-Variance Curve"):
    """
    Plot both training and validation scores as K varies.

    Shows the classic bias-variance picture:
    - Training score starts high (K=1 = perfect memorization) and falls
    - Validation score starts low, peaks, then falls
    - The gap between them quantifies overfitting at each K
    """
    if k_range is None:
        k_range = list(range(1, min(61, len(y) // cv_folds)))

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])

    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)

    train_means, train_stds = [], []
    val_means,   val_stds   = [], []

    from sklearn.model_selection import cross_validate
    for k in k_range:
        pipeline.set_params(knn__n_neighbors=k)
        result = cross_validate(
            pipeline, X, y, cv=cv, scoring=scoring,
            return_train_score=True, n_jobs=-1
        )
        train_means.append(result['train_score'].mean())
        train_stds.append(result['train_score'].std())
        val_means.append(result['test_score'].mean())
        val_stds.append(result['test_score'].std())

    train_means = np.array(train_means)
    val_means   = np.array(val_means)
    train_stds  = np.array(train_stds)
    val_stds    = np.array(val_stds)

    gaps = train_means - val_means
    best_val_idx = np.argmax(val_means)
    best_k = k_range[best_val_idx]

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(11, 9), sharex=True)

    # Top: training and validation scores
    ax1.plot(k_range, train_means, 'o-', color='steelblue', lw=2.5,
             markersize=5, label='Training score')
    ax1.fill_between(k_range, train_means - train_stds,
                     train_means + train_stds, alpha=0.12, color='steelblue')

    ax1.plot(k_range, val_means, 's-', color='coral', lw=2.5,
             markersize=5, label='Validation score (CV)')
    ax1.fill_between(k_range, val_means - val_stds,
                     val_means + val_stds, alpha=0.12, color='coral')

    ax1.axvline(x=best_k, color='black', linestyle='--', lw=1.8,
                label=f'Best K = {best_k}')
    ax1.set_ylabel(scoring.replace('_', ' ').title(), fontsize=11)
    ax1.set_title(f'{title}\n(Best K={best_k}, Val={val_means[best_val_idx]:.4f})',
                  fontsize=12, fontweight='bold')
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3)

    # Annotate regimes
    ax1.text(k_range[1], ax1.get_ylim()[0] + 0.01,
             'Overfit\n(high variance)', fontsize=8, color='darkred',
             ha='left', alpha=0.7)
    ax1.text(k_range[-3], ax1.get_ylim()[0] + 0.01,
             'Underfit\n(high bias)', fontsize=8, color='darkblue',
             ha='right', alpha=0.7)

    # Bottom: gap between training and validation (overfitting degree)
    ax2.fill_between(k_range, gaps, 0, where=(gaps >= 0),
                     alpha=0.3, color='coral', label='Train-Val gap (overfit)')
    ax2.plot(k_range, gaps, '-', color='coral', lw=2)
    ax2.axvline(x=best_k, color='black', linestyle='--', lw=1.8)
    ax2.axhline(y=0, color='gray', lw=1, alpha=0.5)
    ax2.set_xlabel('K (Number of Neighbors)', fontsize=11)
    ax2.set_ylabel('Train - Val Gap', fontsize=11)
    ax2.set_title('Overfitting Degree by K\n(Large gap = memorizing, small gap = generalizing)',
                  fontsize=11, fontweight='bold')
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('knn_bias_variance_curve.png', dpi=150)
    plt.show()
    print("Saved: knn_bias_variance_curve.png")

    # Print summary table for key K values
    print(f"\n  Key K values summary:")
    print(f"  {'K':>4} | {'Train':>8} | {'Val':>8} | {'Gap':>7} | {'Regime'}")
    print("  " + "-" * 48)

    highlight_ks = [1, 3, 5, best_k] + k_range[::max(1, len(k_range)//6)]
    highlight_ks = sorted(set(highlight_ks))

    for k in highlight_ks:
        if k in k_range:
            idx = k_range.index(k)
            gap = gaps[idx]
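            # Check for the optimum first so the best K is always labeled as such,
            # even when its train-validation gap is large
            if idx == best_val_idx:
                regime = "★ OPTIMAL"
            elif gap > 0.15:
                regime = "High variance"
            elif gap < 0.02 and val_means[idx] < val_means[best_val_idx] - 0.05:
                regime = "High bias"
            else:
                regime = "Balanced"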
            print(f"  {k:>4} | {train_means[idx]:>8.4f} | {val_means[idx]:>8.4f} | "
                  f"{gap:>7.4f} | {regime}")

    return best_k, val_means[best_val_idx]


np.random.seed(42)
X_bv, y_bv = make_classification(
    n_samples=600, n_features=12, n_informative=8, random_state=42
)

plot_k_bias_variance_curve(X_bv, y_bv, k_range=list(range(1, 61)),
                            scoring='accuracy',
                            title="KNN Bias-Variance Tradeoff Curve")
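
When only the numbers behind the curve are needed, scikit-learn's validation_curve computes the per-fold training and validation scores for a parameter sweep in one call. The following is a compact sketch of that alternative, without the plotting and summary table of the function above; the variable names and the K range of 1 to 30 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_vc, y_vc = make_classification(
    n_samples=600, n_features=12, n_informative=8, random_state=42
)
k_values = np.arange(1, 31)

pipeline_vc = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Each returned array has shape (len(k_values), n_splits)
train_scores, val_scores = validation_curve(
    pipeline_vc, X_vc, y_vc,
    param_name="knn__n_neighbors", param_range=k_values,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_k_vc = int(k_values[np.argmax(val_mean)])

print(f"  K=1:  train={train_mean[0]:.4f}, val={val_mean[0]:.4f}")
print(f"  Best: K={best_k_vc}, train={train_mean[best_k_vc - 1]:.4f}, "
      f"val={val_mean.max():.4f}")
```

The K=1 row shows the memorization effect directly: each training point is its own nearest neighbor, so the training score is perfect while the validation score is not.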

Method 4: Statistical Significance Testing Between K Values

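When two K values produce very similar cross-validation scores, you might wonder whether the difference is real or just sampling noise. A paired t-test on the per-fold CV scores gives a useful first check. Because CV folds share most of their training data, the fold scores are not fully independent, so treat borderline p-values with caution, especially when many pairs of K values are compared at once.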
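To make the test concrete, here is a minimal sketch of what a paired t-test computes from fold scores: the mean of the per-fold differences divided by its standard error, t = mean(d) / (s_d / √n). The fold accuracies below are made-up numbers for illustration only, and the variable names (`scores_k5`, `scores_k9`) are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two K values on the same 10 folds
# (illustrative numbers, not results from any real dataset).
scores_k5 = np.array([0.86, 0.90, 0.84, 0.88, 0.92, 0.86, 0.88, 0.90, 0.84, 0.88])
scores_k9 = np.array([0.84, 0.90, 0.82, 0.88, 0.90, 0.86, 0.86, 0.88, 0.84, 0.86])

# Paired test: work on the per-fold differences, not on the two score lists.
d = scores_k5 - scores_k9
n = len(d)

# t statistic = mean difference / standard error of the mean difference
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's paired t-test should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_rel(scores_k5, scores_k9)

print(f"manual: t={t_manual:.3f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.3f}, p={p_scipy:.4f}")
```

Working on the differences is what makes the test paired: a fold that is hard for both K values contributes a small difference rather than two low scores.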

Python

import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def compare_k_values_statistically(X, y, k_values_to_compare,
                                     cv_folds=10, random_state=42):
    """
    Use paired t-test to determine if differences between K values
    are statistically significant.

    Since all models are evaluated on the same folds, fold-level
    scores are paired: differences in difficulty between folds cancel
    out, giving more statistical power than an unpaired test.

    Caveat: CV folds share most of their training data, so fold scores
    are not fully independent and the p-values tend to be optimistic.

    Args:
        k_values_to_compare: List of K values to compare pairwise
        cv_folds:            Number of folds (= paired observations per test)
    """
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])

    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=random_state)

    # Collect per-fold scores for each K
    from sklearn.base import clone  # fresh, unfitted copy of the pipeline per fold

    fold_scores = {}
    for k in k_values_to_compare:
        pipeline.set_params(knn__n_neighbors=k)
        scores = []
        for train_idx, val_idx in cv.split(X, y):
            # The scaler is refit on each training fold, so no validation
            # data leaks into the scaling statistics.
            p = clone(pipeline)
            p.fit(X[train_idx], y[train_idx])
            scores.append(p.score(X[val_idx], y[val_idx]))
        fold_scores[k] = np.array(scores)

    print("=== Statistical Comparison of K Values ===\n")
    print(f"  {'K':>4} | {'Mean Acc':>10} | {'Std':>7}")
    print("  " + "-" * 27)
    for k in k_values_to_compare:
        print(f"  {k:>4} | {fold_scores[k].mean():>10.4f} | {fold_scores[k].std():>7.4f}")

    print(f"\n  Pairwise paired t-tests (α=0.05):\n")
    print(f"  {'K1 vs K2':>12} | {'Δ Mean':>8} | {'t-stat':>8} | {'p-value':>9} | {'Significant?':>13} | {'Prefer'}")
    print("  " + "-" * 75)

    for i, k1 in enumerate(k_values_to_compare):
        for k2 in k_values_to_compare[i+1:]:
            # Paired test on the per-fold scores (same folds for both K values)
            t_stat, p_val = stats.ttest_rel(fold_scores[k1], fold_scores[k2])
            significant = p_val < 0.05
            delta = fold_scores[k1].mean() - fold_scores[k2].mean()

            if significant:
                prefer = f"K={k1}" if delta > 0 else f"K={k2}"
            else:
                prefer = f"K={max(k1, k2)} (simpler)"  # Prefer larger K if tied

            sig_str = "✓ Yes" if significant else "✗ No (tied)"
            pair = f"{k1} vs {k2}"
            print(f"  {pair:>12} | {delta:>+8.4f} | {t_stat:>8.3f} | "
                  f"{p_val:>9.4f} | {sig_str:>13} | {prefer}")

    print("\n  Interpretation:")
    print("  Non-significant difference → no evidence to prefer one K over the other.")
    print("  Among tied candidates, pick the larger K (simpler model), in the spirit of the 1-SE rule.")


np.random.seed(42)
X_stat, y_stat = make_classification(
    n_samples=500, n_features=10, n_informative=7, random_state=42
)

compare_k_values_statistically(
    X_stat, y_stat,
    k_values_to_compare=[5, 7, 9, 11, 15, 20],
    cv_folds=10
)

Complete K Selection Workflow

Putting all methods together into a single, reusable workflow:

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import (
    train_test_split, cross_val_score, RepeatedStratifiedKFold
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score

def full_knn_k_selection_workflow(X, y, dataset_name="Dataset",
                                   holdout_size=0.15,
                                   k_min=1, k_max=50,
                                   cv_folds=5, cv_repeats=5,
                                   random_state=42):
    """
    Complete, production-quality K selection workflow.

    Steps:
    1. Reserve holdout test set (touch only at the very end)
    2. Cross-validated K sweep on development set
    3. Compare selection rules (max CV score, 1-SE, sqrt(n))
    4. Choose the recommended K via the 1-SE rule
    5. Evaluate final model on holdout (once)
    6. Plot the K selection curve and return a summary

    Args:
        X, y:           Features and labels
        dataset_name:   For display
        holdout_size:   Fraction for final test set
        k_min, k_max:   Range of K values to search
        cv_folds:       Number of cross-validation folds
        cv_repeats:     Number of CV repetitions
        random_state:   Random seed
    """
    print(f"{'='*60}")
    print(f"  KNN K-Selection Workflow: {dataset_name}")
    print(f"{'='*60}")
    print(f"\n  Dataset: {len(y):,} samples, {X.shape[1]} features, "
          f"{len(np.unique(y))} classes")

    # Step 1: Holdout split
    X_dev, X_hold, y_dev, y_hold = train_test_split(
        X, y, test_size=holdout_size, random_state=random_state, stratify=y
    )
    print(f"  Dev: {len(y_dev):,} | Holdout: {len(y_hold):,}\n")

    # Step 2: K sweep on dev set
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier())
    ])

    k_range = list(range(k_min, min(k_max + 1, len(y_dev) // cv_folds)))
    cv = RepeatedStratifiedKFold(n_splits=cv_folds, n_repeats=cv_repeats,
                                  random_state=random_state)

    print("  Step 2: K sweep (cross-validation)")
    mean_scores, std_scores = [], []

    for k in k_range:
        pipeline.set_params(knn__n_neighbors=k)
        sc = cross_val_score(pipeline, X_dev, y_dev, cv=cv,
                              scoring='accuracy', n_jobs=-1)
        mean_scores.append(sc.mean())
        std_scores.append(sc.std())

    mean_scores = np.array(mean_scores)
    std_scores  = np.array(std_scores)

    # Step 3: Apply selection rules
    best_idx  = np.argmax(mean_scores)
    k_best    = k_range[best_idx]
    score_best = mean_scores[best_idx]

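    # 1-SE rule: the margin is the standard error of the mean CV score,
    # i.e. std / sqrt(number of CV scores). Repeated-CV scores are correlated,
    # so treat this as an approximation.
    n_scores = cv_folds * cv_repeats
    std_err = std_scores[best_idx] / np.sqrt(n_scores)
    threshold = score_best - std_err
    k_1se = max(k for k, s in zip(k_range, mean_scores) if s >= threshold)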

    # sqrt(n) rule
    k_sqrt = max(1, int(round(np.sqrt(len(y_dev)))))
    k_sqrt = min(k_sqrt, max(k_range))
    # Make odd for binary classification
    if len(np.unique(y)) == 2 and k_sqrt % 2 == 0:
        k_sqrt_odd = k_sqrt + 1
    else:
        k_sqrt_odd = k_sqrt

    print(f"\n  {'Rule':<25} | {'K':>4} | {'CV Score':>9}")
    print("  " + "-" * 42)
    for rule_name, k_val in [
        ('Max CV score',       k_best),
        ('1-SE Rule',          k_1se),
        ('sqrt(n) rule',       k_sqrt),
        ('Odd sqrt(n)',        k_sqrt_odd),
    ]:
        if k_val in k_range:
            score_str = f"{mean_scores[k_range.index(k_val)]:.4f}"
        else:
            score_str = "n/a"  # K lies outside the searched range
        print(f"  {rule_name:<25} | {k_val:>4} | {score_str:>9}")

    # Step 4: Final model using recommended K (1-SE rule)
    recommended_k = k_1se
    print(f"\n  Recommended K (1-SE rule): {recommended_k}")

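    # Step 5: Evaluate on holdout
    # Use the same settings as the CV sweep (default uniform weights) so the
    # CV estimate and the holdout score describe the same model.
    best_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=recommended_k))
    ])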
    best_pipeline.fit(X_dev, y_dev)

    y_hold_pred  = best_pipeline.predict(X_hold)
    holdout_acc  = accuracy_score(y_hold, y_hold_pred)
    holdout_f1   = f1_score(y_hold, y_hold_pred, average='weighted')

    cv_score_at_k = mean_scores[list(k_range).index(recommended_k)]

    print(f"\n  Step 5: Final Holdout Evaluation (first and only look)")
    print(f"  {'Metric':<20} | {'CV Estimate':>12} | {'Holdout':>8} | Gap")
    print("  " + "-" * 52)
    print(f"  {'Accuracy':<20} | {cv_score_at_k:>12.4f} | {holdout_acc:>8.4f} | "
          f"{holdout_acc - cv_score_at_k:+.4f}")
    print(f"  {'F1 (weighted)':<20} | {'N/A':>12} | {holdout_f1:>8.4f} | N/A")

    # Plot K selection curve
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(k_range, mean_scores, 'o-', color='steelblue', lw=2.5,
            markersize=6, label='CV Accuracy (mean)')
    ax.fill_between(k_range, mean_scores - std_scores,
                    mean_scores + std_scores, alpha=0.15, color='steelblue')
    ax.axvline(x=k_best, color='coral', linestyle='--', lw=2,
               label=f'Best K={k_best} (acc={score_best:.4f})')
    ax.axvline(x=k_1se, color='mediumseagreen', linestyle=':', lw=2,
               label=f'1-SE K={k_1se} (simpler, recommended)')
    ax.axhline(y=threshold, color='gray', linestyle=':', lw=1.5, alpha=0.7,
               label=f'1-SE threshold = {threshold:.4f}')
    ax.set_xlabel('K (Number of Neighbors)', fontsize=12)
    ax.set_ylabel('CV Accuracy', fontsize=12)
    ax.set_title(f'{dataset_name}: K Selection Curve\n'
                 f'(Recommended K={recommended_k}, Holdout Acc={holdout_acc:.4f})',
                 fontsize=12, fontweight='bold')
    ax.legend(fontsize=10, loc='lower right')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(f'knn_k_selection_{dataset_name.replace(" ", "_")}.png', dpi=150)
    plt.show()
    print(f"\n  Saved: knn_k_selection_{dataset_name.replace(' ', '_')}.png")

    return {
        'recommended_k': recommended_k,
        'k_best':        k_best,
        'k_1se':         k_1se,
        'cv_score':      cv_score_at_k,
        'holdout_acc':   holdout_acc,
    }


# Run on Wine dataset
wine = load_wine()
results_wine = full_knn_k_selection_workflow(
    wine.data, wine.target,
    dataset_name="Wine Dataset",
    k_min=1, k_max=40,
    cv_folds=5, cv_repeats=5
)

Common Mistakes in K Selection

Mistake 1: Testing K on the test set. This is the most harmful mistake. If you try K=1, 3, 5, … 30, evaluate each on your test set, and pick the best, you have effectively tuned K on the test set, and the reported performance will be optimistically biased. Always select K by cross-validation on the training/development set, and touch the test set only once, after K is fixed.
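The optimistic bias can be seen in a small sketch on pure-noise data, where no K can truly do better than chance. The dataset, the K grid, and the names `leaky_score` and `honest_score` are assumptions made for this illustration, not part of the workflow above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Pure-noise data: features carry no information about the labels,
# so the expected accuracy of any K is 50%.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

k_values = list(range(1, 31, 2))
cv_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())
    test_scores.append(knn.fit(X_train, y_train).score(X_test, y_test))

# Wrong: choose K by peeking at the test set, then report that same score.
leaky_k = k_values[int(np.argmax(test_scores))]
leaky_score = max(test_scores)

# Right: choose K by cross-validation, then look at the test set once.
honest_k = k_values[int(np.argmax(cv_scores))]
honest_score = test_scores[k_values.index(honest_k)]

print(f"K picked on test set: {leaky_k:>2}  reported accuracy: {leaky_score:.3f}")
print(f"K picked by CV:       {honest_k:>2}  test accuracy:     {honest_score:.3f}")
```

The test-set-selected score is the maximum over all candidates, so it can never be lower than the honest one; on noise data, any amount by which it exceeds 50% is pure selection bias.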

Mistake 2: Using accuracy as the sole metric on imbalanced data. With 95% class 0 prevalence, a K so large (say K=500) that the model always predicts class 0 still achieves 95% accuracy while never detecting class 1. Use F1, AUC, or recall/precision depending on your cost asymmetry. The optimal K under accuracy and under F1 can differ dramatically on imbalanced data.
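The arithmetic behind this can be checked directly. The sketch below uses a hypothetical 95/5 label split and a stand-in "model" that always predicts the majority class, which is what KNN degenerates to when K is very large.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95 samples of class 0, 5 of class 1 (the class of interest)
y_true = np.array([0] * 95 + [1] * 5)

# Stand-in for an oversmoothed KNN: always predict the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions are made
f1_minority = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"Accuracy:        {acc:.2f}")          # 0.95
print(f"F1 (class 1):    {f1_minority:.2f}")  # 0.00
```

A K sweep scored on accuracy would rate this model highly, while one scored on minority-class F1 would rank it last.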

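Mistake 3: Forgetting to normalize. K selection results measured on unnormalized features are unreliable, because the features with the largest magnitudes dominate the distance calculations. Always scale before searching K, and fit the scaler inside the cross-validation pipeline so that it only ever sees the training folds.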
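As a rough illustration, the sketch below compares the same K with and without scaling on scikit-learn's Wine dataset, whose features span very different ranges. The choice of K=5 and of 5-fold CV is arbitrary and only meant to show the effect.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Same K, once on raw features and once with scaling inside the pipeline
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = Pipeline([
    ('scaler', StandardScaler()),  # refit on each training fold by cross_val_score
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

raw_acc = cross_val_score(raw_knn, X, y, cv=cv, scoring='accuracy').mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=cv, scoring='accuracy').mean()

print(f"Unscaled KNN (K=5): {raw_acc:.4f}")
print(f"Scaled KNN   (K=5): {scaled_acc:.4f}")
```

Because the scaler sits inside the pipeline, each CV split computes its means and standard deviations from the training portion only.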

Mistake 4: Setting K without considering n. K=50 on a dataset with 100 training samples means each prediction averages over 50% of all training data, which is usually extreme oversmoothing. As a sanity check, K should generally be much smaller than n: at most 10–15% of the training size.

Mistake 5: Not trying odd K values for binary classification. When your K sweep returns an even value, also try the adjacent odd values. With uniform voting they cannot produce a tied vote between two classes, so no arbitrary tie-breaking is needed, and they often perform equally well or marginally better.
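A tied vote is easy to reproduce on a toy example. The four one-dimensional training points and the query below are hypothetical, chosen so that the two nearest neighbors belong to different classes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D training set: two points per class
X_train = np.array([[0.0], [1.0], [3.5], [5.0]])
y_train = np.array([0, 0, 1, 1])
query = np.array([[2.0]])  # distances to training points: 2.0, 1.0, 1.5, 3.0

# Even K: the two nearest neighbors (1.0 and 3.5) vote 1-1, a tie
proba_k2 = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train).predict_proba(query)

# Odd K: the third neighbor (0.0) breaks the tie, the vote is 2-1
proba_k3 = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict_proba(query)

print("K=2 class probabilities:", proba_k2[0])  # [0.5 0.5]
print("K=3 class probabilities:", proba_k3[0])  # [0.667 0.333] approximately
```

With K=2 the predicted label depends on how the library resolves the 50/50 split rather than on the data; with K=3 the majority is unambiguous.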

Python

import numpy as np

def k_selection_sanity_checks(k_recommended, n_train, n_classes):
    """
    Run sanity checks on a recommended K value.
    """
    print("=== K Selection Sanity Checks ===\n")
    issues = []

    # Check 1: K relative to training size
    # (threshold follows the 10-15% guideline from Mistake 4)
    k_pct = k_recommended / n_train * 100
    if k_pct > 15:
        issues.append(f"K={k_recommended} is {k_pct:.1f}% of training set: likely oversmoothing")
    elif k_pct < 0.5:
        issues.append(f"K={k_recommended} is very small relative to {n_train} training samples: high variance risk")

    # Check 2: Odd K for binary classification
    if n_classes == 2 and k_recommended % 2 == 0:
        issues.append(f"K={k_recommended} is even for binary classification — ties possible, try K={k_recommended+1}")

    # Check 3: K=1 special case
    if k_recommended == 1:
        issues.append("K=1 is the highest-variance setting — consider if this truly generalizes or just memorizes")

    # Check 4: Very large K
    if k_recommended > 50:
        issues.append(f"K={k_recommended} is very large — verify the model isn't just predicting the majority class")

    print(f"  Recommended K:   {k_recommended}")
    print(f"  Training size:   {n_train:,}")
    print(f"  K / n:           {k_pct:.2f}%")
    print(f"  Classes:         {n_classes}")

    if issues:
        print(f"\n  ⚠ Issues detected:")
        for issue in issues:
            print(f"    - {issue}")
    else:
        print(f"\n  ✓ No issues detected — K={k_recommended} looks reasonable")

# Test on several scenarios
print("Scenario 1: Reasonable K")
k_selection_sanity_checks(k_recommended=7, n_train=300, n_classes=2)
print("\nScenario 2: Even K for binary")
k_selection_sanity_checks(k_recommended=10, n_train=300, n_classes=2)
print("\nScenario 3: Very large K")
k_selection_sanity_checks(k_recommended=80, n_train=300, n_classes=3)

Summary

Choosing K in KNN is one of the clearest examples of the bias-variance tradeoff in practice. Increasing K generally raises bias and lowers variance; the optimal K is the one that minimizes their combined contribution to error (squared bias plus variance) on your specific dataset.

Cross-validated grid search is the most reliable K selection method. Search a range from K=1 up to roughly √n or 2√n, use repeated stratified k-fold cross-validation for stability, and apply the 1-SE rule to prefer the simpler (larger K) model when performance differences are negligible. This workflow produces good K choices consistently across diverse datasets.

Rules of thumb like K ≈ √n provide useful starting points but should be treated as hypotheses to test, not final answers. The optimal K depends on dataset size, dimensionality, class balance, and noise level in ways that no simple formula captures perfectly.

A few habits pay dividends every time: always scale features before searching K, use F1 or AUC rather than accuracy for imbalanced data, prefer odd K for binary classification, and never use your holdout test set to select K.
