Anomaly Detection: Finding Outliers in Your Data

Master anomaly detection from first principles. Learn Isolation Forest, Local Outlier Factor, One-Class SVM, statistical methods, and PCA reconstruction error — with complete Python implementations.

By Techietory on May 21, 2026

Anomaly detection finds data points that don’t fit the expected pattern, without needing labeled examples of anomalies. There are four main approaches: statistical methods (for example, flagging points more than 3 standard deviations from the mean), density-based methods (points in low-density regions, such as DBSCAN’s noise labels or Local Outlier Factor), isolation-based methods (points that are easy to isolate, as in Isolation Forest), and reconstruction-based methods (points with high PCA or autoencoder reconstruction error). For tabular data, Isolation Forest is a common default starting point.

Introduction

Fraud. Equipment failure before it happens. A hospital patient whose vital signs deviate from all similar patients. A network packet with a suspicious payload. These are anomalies — data points that don’t fit the learned pattern of normal behavior.

Anomaly detection occupies an unusual place in the machine learning taxonomy: it is usually treated as an unsupervised problem (labeled examples of fraud are rare, since fraud is confirmed only after the fact), yet its results are judged against a definition of “normal” that must itself be learned from data. The challenge is to define normal precisely enough to separate genuine anomalies from natural variation, without any anomaly examples to learn from.

The applications are wide: financial fraud detection, network intrusion detection, predictive equipment maintenance, medical outlier detection, data quality monitoring, and content moderation. In each case, anomalies are rare, often unknown in advance, and expensive to miss.

This article covers the complete landscape of anomaly detection: statistical methods (z-score, IQR), density-based methods (LOF, DBSCAN), isolation-based methods (Isolation Forest), reconstruction-based methods (PCA reconstruction error), and practical considerations including threshold selection, evaluation, and production deployment.

The Anomaly Detection Problem

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)


def visualize_anomaly_types(figsize=(16, 5)):
    """
    Three canonical types of anomalies:
    1. Point anomaly:   a single point far from all others
    2. Contextual:      a point normal in one context, anomalous in another
    3. Collective:      a group of points anomalous together, each normal alone
    """
    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # ── Type 1: Point anomaly ────────────────────────────────────
    ax = axes[0]
    X_normal = np.random.randn(200, 2)
    X_anom   = np.array([[5, 5], [-4, 3], [3, -4]])
    ax.scatter(X_normal[:, 0], X_normal[:, 1], c='steelblue', s=25,
               alpha=0.6, edgecolors='white', linewidth=0.3, label='Normal')
    ax.scatter(X_anom[:, 0], X_anom[:, 1], c='red', s=150,
               marker='*', zorder=5, label='Point anomaly')
    ax.set_title('1. Point Anomaly\n'
                 'Individual point far from all others',
                 fontsize=10, fontweight='bold')
    ax.legend(fontsize=9); ax.grid(True, alpha=0.2)
    ax.set_xlim(-5, 7); ax.set_ylim(-5, 7)

    # ── Type 2: Contextual anomaly ───────────────────────────────
    ax = axes[1]
    t = np.linspace(0, 4 * np.pi, 200)
    signal    = np.sin(t) + np.random.randn(200) * 0.15
    anom_idx  = [80, 81, 82]
    signal_with_anom = signal.copy()
    # Values near the signal's peak level, placed at a trough (sin(t) ≈ -0.94 here),
    # so they stay inside the global range and are anomalous only in context
    signal_with_anom[anom_idx] = [0.80, 0.90, 0.85]

    ax.plot(t, signal_with_anom, 'steelblue', lw=1.5, alpha=0.7)
    ax.scatter(t[anom_idx], signal_with_anom[anom_idx], c='red',
               s=80, zorder=5, label='Contextual anomaly')
    ax.set_title('2. Contextual Anomaly\n'
                 'Normal value in wrong context\n'
                 '(peak-level value at a trough)',
                 fontsize=10, fontweight='bold')
    ax.set_xlabel('Time', fontsize=9); ax.set_ylabel('Value', fontsize=9)
    ax.legend(fontsize=9); ax.grid(True, alpha=0.2)

    # ── Type 3: Collective anomaly ───────────────────────────────
    ax = axes[2]
    np.random.seed(42)
    X_norm3 = np.random.randn(200, 2) * 1.5
    # Collective anomaly: a tight clump whose coordinates are individually
    # unremarkable (about 1.3 std from the mean) but far too dense as a group
    X_coll = np.random.randn(15, 2) * 0.08 + np.array([2, 2])
    ax.scatter(X_norm3[:, 0], X_norm3[:, 1], c='steelblue', s=25,
               alpha=0.6, edgecolors='white', linewidth=0.3, label='Normal')
    ax.scatter(X_coll[:, 0], X_coll[:, 1], c='red', s=60,
               edgecolors='white', linewidth=0.5, zorder=4,
               label='Collective anomaly\n(normal individually,\nabnormal as a group)')
    ax.set_title('3. Collective Anomaly\n'
                 'Group of points jointly anomalous\n'
                 '(each within normal range alone)',
                 fontsize=10, fontweight='bold')
    ax.legend(fontsize=7, loc='upper left'); ax.grid(True, alpha=0.2)

    plt.suptitle('Three Types of Anomalies', fontsize=13, fontweight='bold',
                 y=1.02)
    plt.tight_layout()
    plt.savefig('anomaly_types.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("Saved: anomaly_types.png")


visualize_anomaly_types()
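
To make the contextual case concrete, here is a minimal sketch (not part of the original listing) that places peak-level values at a trough of a noisy sine wave and compares a global z-score with a local one. It assumes a centered rolling median is an adequate local baseline; the window of 11 samples, the 3.5 cutoff, and the helper name `rolling_median` are illustrative choices, not recommendations.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(t) + rng.normal(scale=0.15, size=t.size)

anom_idx = np.array([80, 81, 82])       # sin(t) is close to -0.94 at these indices
signal[anom_idx] = [0.80, 0.90, 0.85]   # inside the global range, wrong for the context

# Global z-score: compares every value with the overall mean and spread
global_z = np.abs((signal - signal.mean()) / signal.std())


def rolling_median(x, window=11):
    """Centered rolling median; the ends are padded by repeating edge values."""
    half = window // 2
    padded = np.pad(x, half, mode='edge')
    return np.median(sliding_window_view(padded, window), axis=1)


# Local z-score: compares every value with its neighbourhood, scaled by the
# median absolute deviation (MAD) of the residuals
residual = signal - rolling_median(signal)
centered = residual - np.median(residual)
mad = np.median(np.abs(centered))
local_z = np.abs(centered) / (1.4826 * mad)

print("max global |z| at injected points:", global_z[anom_idx].max().round(2))
print("min local  |z| at injected points:", local_z[anom_idx].min().round(2))
print("flagged by global z > 3:  ", np.flatnonzero(global_z > 3.0))
print("flagged by local  z > 3.5:", np.flatnonzero(local_z > 3.5))
```

The global z-score cannot see the injected points because their values are ordinary for the series as a whole. Only a baseline that encodes the context (here, the neighbouring samples) exposes them.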

Method 1: Statistical Methods

Statistical anomaly detection models each feature’s distribution, either with an explicit assumption (the z-score assumes roughly Gaussian data) or with order statistics (the IQR rule). Points that fall far out in the tails of that distribution are flagged as anomalies.

Z-Score Method

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def zscore_anomaly_detection(X, threshold=3.0, feature_names=None):
    """
    Z-score anomaly detection: flag points more than `threshold` standard
    deviations from the mean on any feature.

    Works best when:
    - Data is approximately Gaussian
    - Anomalies are global outliers (far from the mean)

    Fails when:
    - Data is multimodal (multiple clusters)
    - Anomalies are local (normal globally but anomalous in local context)

    Args:
        X:            Feature matrix (n_samples, n_features)
        threshold:    Z-score threshold (3.0 flags ~0.27% of Gaussian values
                      per feature; the any-feature rate grows with n_features)
        feature_names: Optional feature names

    Returns:
        anomaly_mask: Boolean array, True = anomaly
        z_scores:     Z-score array (n_samples, n_features)
    """
    z_scores    = np.abs(stats.zscore(X, axis=0))
    anomaly_mask = (z_scores > threshold).any(axis=1)

    if feature_names is None:
        feature_names = [f'feature_{i}' for i in range(X.shape[1])]

    print(f"=== Z-Score Anomaly Detection ===\n")
    print(f"  Threshold: ±{threshold} std")
    print(f"  Total samples:    {len(X)}")
    print(f"  Anomalies found:  {anomaly_mask.sum()} "
          f"({anomaly_mask.mean()*100:.2f}%)\n")

    # Per-feature breakdown
    print(f"  {'Feature':<20} | {'Mean':>8} | {'Std':>8} | "
          f"{'Max |z|':>9} | {'Outliers'}")
    print(f"  {'-'*62}")
    for i, fname in enumerate(feature_names):
        feat_z = z_scores[:, i]
        n_out  = (feat_z > threshold).sum()
        print(f"  {fname:<20} | {X[:, i].mean():>8.3f} | "
              f"{X[:, i].std():>8.3f} | {feat_z.max():>9.3f} | "
              f"{n_out} flagged")

    return anomaly_mask, z_scores


# IQR method (non-parametric, robust to non-Gaussian distributions)
def iqr_anomaly_detection(X, multiplier=1.5, feature_names=None):
    """
    IQR (Interquartile Range) anomaly detection.
    Flags points outside [Q1 - k*IQR, Q3 + k*IQR].

    More robust than z-score for skewed or non-Gaussian distributions.
    k=1.5 is the standard Tukey fence; k=3.0 is the "extreme outlier" fence.
    """
    Q1  = np.percentile(X, 25, axis=0)
    Q3  = np.percentile(X, 75, axis=0)
    IQR = Q3 - Q1

    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR

    below_lower = (X < lower).any(axis=1)
    above_upper = (X > upper).any(axis=1)
    anomaly_mask = below_lower | above_upper

    if feature_names is None:
        feature_names = [f'feature_{i}' for i in range(X.shape[1])]

    print(f"=== IQR Anomaly Detection ===\n")
    print(f"  Multiplier: {multiplier} × IQR (Tukey fences)")
    print(f"  Anomalies: {anomaly_mask.sum()} ({anomaly_mask.mean()*100:.2f}%)")

    return anomaly_mask, lower, upper


# Demonstrate on synthetic data
np.random.seed(42)
X_stat = np.vstack([
    np.random.randn(200, 3),           # Normal data
    np.array([[5, 5, 5], [-5, -5, -5],  # Point anomalies
               [0, 8, 0], [7, 0, -6]])
])
y_stat_true = np.array([0]*200 + [1]*4)

zscore_mask, z_scores = zscore_anomaly_detection(
    X_stat, threshold=3.0,
    feature_names=['feature_0', 'feature_1', 'feature_2']
)

from sklearn.metrics import precision_score, recall_score, f1_score
print(f"\n  Evaluation:")
print(f"    Precision: {precision_score(y_stat_true, zscore_mask):.4f}")
print(f"    Recall:    {recall_score(y_stat_true, zscore_mask):.4f}")
print(f"    F1:        {f1_score(y_stat_true, zscore_mask):.4f}")
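
The listing above defines an IQR detector but only exercises the z-score one. The following self-contained sketch applies Tukey fences to data built the same way (200 Gaussian rows plus the same four injected rows). It uses its own compact helper, `tukey_fence_mask`, which is a hypothetical name, and NumPy's `default_rng`, so the exact counts will differ from the listing above.

```python
import numpy as np


def tukey_fence_mask(X, k=1.5):
    """Flag rows where any feature lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return ((X < lower) | (X > upper)).any(axis=1)


rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(size=(200, 3)),                            # normal rows
    np.array([[5, 5, 5], [-5, -5, -5],                    # injected anomalies
              [0, 8, 0], [7, 0, -6]], dtype=float),
])
y_true = np.array([0] * 200 + [1] * 4)

for k in (1.5, 3.0):
    mask = tukey_fence_mask(X, k)
    caught = int((mask & (y_true == 1)).sum())
    false_alarms = int((mask & (y_true == 0)).sum())
    print(f"k={k}: flagged={mask.sum()}, "
          f"injected anomalies caught={caught}/4, false alarms={false_alarms}")
```

For Gaussian data the k=1.5 fence sits near 2.7 standard deviations, so a few false alarms among the normal rows are expected. Raising k trades those false alarms against the risk of missing milder anomalies.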

Method 2: Isolation Forest

Isolation Forest (Liu et al., 2008) is one of the most widely used anomaly detection algorithms and a common default for tabular data. Its core insight: anomalies are few and different, so random splits isolate them more easily than normal points.

The Core Idea

Build many random decision trees (isolation trees). At each node, choose a random feature and a random threshold between that feature’s minimum and maximum. Keep splitting until each point is isolated in its own leaf or a depth limit is reached. Anomalies are isolated in very few splits (a short average path length) because they sit far from dense clusters. Normal points need many splits because they are surrounded by similar points.
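
Path lengths become comparable across sample sizes once they are normalized. The sketch below follows the scoring rule from the Isolation Forest paper, s(x, n) = 2^(-E[h(x)] / c(n)), where c(n) is the average path length of an unsuccessful binary-search-tree lookup over n points. The function names are hypothetical and the harmonic number is approximated as ln(i) + 0.5772, so treat this as an illustration rather than a reference implementation.

```python
import numpy as np

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = np.log(n - 1) + EULER_GAMMA      # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Score in (0, 1]: near 1 = anomalous, well below 0.5 = normal."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # points per tree
for h in (2.0, 5.0, average_path_length(n), 15.0):
    print(f"mean path length {h:6.2f} -> score {anomaly_score(h, n):.3f}")
```

A point whose mean path length equals c(n) scores exactly 0.5, and shorter paths push the score toward 1. This is why a forest whose scores all sit near 0.5 has found no clear anomalies.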

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.datasets import make_blobs


class IsolationTreeFromScratch:
    """
    Single isolation tree — educational implementation.
    In production use sklearn's IsolationForest.
    """

    def __init__(self, max_depth=None, random_state=None):
        self.max_depth    = max_depth
        self.random_state = random_state
        self.root_        = None

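    def fit(self, X, depth=0):
        """Build the isolation tree recursively and return its root node."""
        # Pass the seed through unchanged: `seed or fallback` would discard a
        # valid seed of 0, and RandomState(None) already seeds from the OS.
        rng = np.random.RandomState(self.random_state)
        self.root_ = self._build(X, depth, rng)
        return self.root_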

    def _build(self, X, depth, rng):
        n, d = X.shape

        # Stop conditions: isolated or max depth
        if n <= 1 or (self.max_depth is not None and depth >= self.max_depth):
            return {'type': 'leaf', 'size': n, 'depth': depth}

        # Random split: random feature and random threshold
        feat      = rng.randint(d)
        feat_min  = X[:, feat].min()
        feat_max  = X[:, feat].max()

        if feat_min == feat_max:
            return {'type': 'leaf', 'size': n, 'depth': depth}

        threshold = rng.uniform(feat_min, feat_max)

        left_mask  = X[:, feat] <= threshold
        right_mask = ~left_mask

        return {
            'type':      'internal',
            'feature':   feat,
            'threshold': threshold,
            'depth':     depth,
            'left':      self._build(X[left_mask],  depth + 1, rng),
            'right':     self._build(X[right_mask], depth + 1, rng),
        }

    def path_length(self, node, x):
        """Compute path length (depth) to isolate a single point."""
        if node['type'] == 'leaf':
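            # Adjustment for leaves that still hold several points because
            # splitting stopped early: c(n) estimates the extra path length
            # that isolating n points would have required
            n = node['size']
            if n <= 1:
                c = 0.0
            elif n == 2:
                c = 1.0  # exact value; the log approximation is poor here
            else:
                c = 2 * (np.log(n - 1) + 0.5772) - 2 * (n - 1) / n
            return node['depth'] + c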

        if x[node['feature']] <= node['threshold']:
            return self.path_length(node['left'], x)
        else:
            return self.path_length(node['right'], x)


def demonstrate_isolation_concept(figsize=(14, 6)):
    """
    Show why anomalies are isolated faster than normal points.
    """
    np.random.seed(42)
    X_norm  = np.random.randn(200, 2)
    X_anom  = np.array([[5, 5], [-4, 3], [0.5, 5.0]])

    X_all   = np.vstack([X_norm, X_anom])
    y_all   = np.array([0]*200 + [1]*3)

    # Build a few isolation trees and measure path lengths
    path_lengths_normal = []
    path_lengths_anom   = []

    for seed in range(100):
        tree = IsolationTreeFromScratch(max_depth=15, random_state=seed)
        root = tree.fit(X_all)
        for x in X_norm[:50]:
            path_lengths_normal.append(tree.path_length(root, x))
        for x in X_anom:
            path_lengths_anom.append(tree.path_length(root, x))

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Panel 1: Path length distributions
    ax = axes[0]
    ax.hist(path_lengths_normal, bins=25, alpha=0.7, color='steelblue',
             density=True, label=f'Normal (mean={np.mean(path_lengths_normal):.2f})')
    ax.hist(path_lengths_anom, bins=15, alpha=0.7, color='coral',
             density=True, label=f'Anomaly (mean={np.mean(path_lengths_anom):.2f})')
    ax.set_xlabel('Path Length to Isolation', fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title('Isolation Forest: Path Length Distribution\n'
                 'Anomalies reach isolation in fewer splits',
                 fontsize=11, fontweight='bold')
    ax.legend(fontsize=9); ax.grid(True, alpha=0.3)

    # Panel 2: Anomaly scores on scatter
    ax = axes[1]
    iso = IsolationForest(n_estimators=100, contamination='auto',
                           random_state=42)
    iso.fit(X_all)
    scores = -iso.score_samples(X_all)  # Higher = more anomalous

    sc = ax.scatter(X_all[:, 0], X_all[:, 1], c=scores,
                     cmap='RdYlGn_r', s=40, edgecolors='white',
                     linewidth=0.3, alpha=0.85)
    plt.colorbar(sc, ax=ax, label='Anomaly Score\n(red = more anomalous)')
    ax.scatter(X_anom[:, 0], X_anom[:, 1], s=200, facecolors='none',
               edgecolors='red', linewidth=2.5, zorder=5,
               label='True anomalies')
    ax.set_title('Isolation Forest: Anomaly Scores\n'
                 'Dense regions = normal (green), sparse = anomalous (red)',
                 fontsize=11, fontweight='bold')
    ax.legend(fontsize=9); ax.grid(True, alpha=0.2)

    plt.suptitle('How Isolation Forest Works: Short Paths = Anomalies',
                 fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.savefig('isolation_forest_concept.png', dpi=150)
    plt.show()
    print("Saved: isolation_forest_concept.png")

    print(f"\n  Mean path length — Normal: {np.mean(path_lengths_normal):.2f}")
    print(f"  Mean path length — Anomaly: {np.mean(path_lengths_anom):.2f}")
    print(f"  Ratio: {np.mean(path_lengths_normal)/np.mean(path_lengths_anom):.2f}× longer for normal points")


demonstrate_isolation_concept()
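
scikit-learn exposes three related outputs whose sign conventions are easy to confuse. The sketch below, on small synthetic data invented for this example, checks how `score_samples`, `decision_function`, `offset_`, and `predict` relate to one another.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(size=(200, 2)),      # dense normal cloud
               [[5.0, 5.0], [-4.0, 3.0]]])      # two obvious outliers

iso = IsolationForest(n_estimators=100, contamination='auto',
                      random_state=42).fit(X)

raw = iso.score_samples(X)           # negated paper score: lower = more anomalous
shifted = iso.decision_function(X)   # raw minus offset_; negative = anomaly
labels = iso.predict(X)              # -1 = anomaly, +1 = normal

print("offset_ with contamination='auto':", iso.offset_)
print("decision_function == score_samples - offset_:",
      np.allclose(shifted, raw - iso.offset_))
print("predict == -1 exactly where decision_function < 0:",
      np.array_equal(labels, np.where(shifted < 0, -1, 1)))
print("raw scores of the two outliers:", raw[-2:].round(3))
print("median raw score:              ", np.median(raw).round(3))
```

With a numeric `contamination`, `offset_` is instead set from the training scores so that roughly that fraction of the training points falls below zero. Custom thresholds are therefore usually simplest to apply on `score_samples` directly.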

Isolation Forest in Practice

Python

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from sklearn.metrics import (precision_score, recall_score, f1_score,
                              roc_auc_score, average_precision_score,
                              confusion_matrix)
import matplotlib.pyplot as plt


def isolation_forest_complete(X_train, X_test, y_test,
                                contamination=0.05,
                                dataset_name="Dataset"):
    """
    Complete Isolation Forest workflow:
    1. Fit on training data (assumed mostly clean)
    2. Score test data
    3. Evaluate at different thresholds
    4. Choose threshold by maximizing F1 or by contamination rate
    """
    scaler   = StandardScaler()
    X_tr_s   = scaler.fit_transform(X_train)
    X_te_s   = scaler.transform(X_test)

    iso = IsolationForest(
        n_estimators=200,
        contamination=contamination,  # Expected fraction of anomalies
        max_samples='auto',           # Subsample size per tree: min(256, n_samples)
        max_features=1.0,             # Fraction of features per tree
        bootstrap=False,              # Without replacement is standard
        random_state=42,
        n_jobs=-1,
    )
    iso.fit(X_tr_s)

    # score_samples: lower = more anomalous; negate so higher = more anomalous
    scores     = -iso.score_samples(X_te_s)
    pred_labels = (iso.predict(X_te_s) == -1).astype(int)  # -1=anomaly

    print(f"=== Isolation Forest: {dataset_name} ===\n")
    print(f"  Training set: {len(X_train)}")
    print(f"  Test set:     {len(X_test)} ({y_test.sum()} true anomalies, "
          f"{y_test.mean()*100:.1f}%)\n")

    print(f"  Default threshold (contamination={contamination}):")
    print(f"    Precision: {precision_score(y_test, pred_labels):.4f}")
    print(f"    Recall:    {recall_score(y_test, pred_labels):.4f}")
    print(f"    F1:        {f1_score(y_test, pred_labels):.4f}")
    print(f"    AUC-ROC:   {roc_auc_score(y_test, scores):.4f}")
    print(f"    Avg Prec:  {average_precision_score(y_test, scores):.4f}")

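    # Find the threshold that maximizes F1. This tunes on the test labels, so
    # the reported F1 is optimistic; use a separate validation split in practice.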
    thresholds = np.percentile(scores, np.linspace(80, 99.5, 50))
    best_f1    = 0
    best_thresh = None
    for t in thresholds:
        preds = (scores >= t).astype(int)
        f1    = f1_score(y_test, preds, zero_division=0)
        if f1 > best_f1:
            best_f1    = f1
            best_thresh = t

    if best_thresh is not None:
        best_preds = (scores >= best_thresh).astype(int)
        print(f"\n  Best F1 threshold (scores ≥ {best_thresh:.4f}):")
        print(f"    Precision: {precision_score(y_test, best_preds):.4f}")
        print(f"    Recall:    {recall_score(y_test, best_preds):.4f}")
        print(f"    F1:        {best_f1:.4f}")

    return iso, scaler, scores


# Generate a synthetic fraud-like dataset
np.random.seed(42)
n_normal = 5000
n_anom   = 100

X_normal = np.random.randn(n_normal, 5)
X_anom   = np.random.randn(n_anom, 5) * 0.5 + np.array([3, -2, 3, -2, 3])

X_full  = np.vstack([X_normal, X_anom])
y_full  = np.array([0]*n_normal + [1]*n_anom)

from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_full, y_full, test_size=0.3, random_state=42, stratify=y_full
)

iso_model, iso_scaler, iso_scores = isolation_forest_complete(
    X_tr, X_te, y_te, contamination=0.02, dataset_name="Synthetic Fraud"
)
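
Choosing the F1-maximizing threshold on the same labels used for reporting inflates the result. A hedged alternative is sketched below: pick the threshold on a validation split and report on a held-out test split. The 60/20/20 split, the percentile grid, and the variable names are assumptions made for this illustration, and the data is regenerated with `default_rng`, so the numbers will not match the listing above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(size=(5000, 5)),
    rng.normal(scale=0.5, size=(100, 5)) + np.array([3, -2, 3, -2, 3]),
])
y = np.array([0] * 5000 + [1] * 100)

# 60% train, 20% validation (threshold selection), 20% test (final report)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

iso = IsolationForest(n_estimators=200, random_state=42).fit(X_train)
val_scores = -iso.score_samples(X_val)     # higher = more anomalous
test_scores = -iso.score_samples(X_test)

# Candidate thresholds come from the validation scores only
candidates = np.percentile(val_scores, np.linspace(80, 99.5, 50))
val_f1 = [f1_score(y_val, (val_scores >= c).astype(int), zero_division=0)
          for c in candidates]
best = candidates[int(np.argmax(val_f1))]

test_f1 = f1_score(y_test, (test_scores >= best).astype(int), zero_division=0)
print(f"threshold chosen on validation: {best:.4f} (val F1 = {max(val_f1):.4f})")
print(f"F1 on untouched test split:     {test_f1:.4f}")
```

The labels are used here only to select and evaluate the threshold; the forest itself is still fitted without them. When no labels exist at all, the threshold has to come from the expected contamination rate or from the review capacity of whoever handles the alerts.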

Method 3: Local Outlier Factor (LOF)

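Local Outlier Factor compares the local density around a point to the local densities around its nearest neighbors. A score near 1 means the point is about as densely surrounded as its neighbors; points in regions significantly less dense than their neighbors' regions receive high LOF scores.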

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.datasets import make_blobs


def local_outlier_factor_demo(figsize=(14, 6)):
    """
    Demonstrate LOF's key advantage: detecting LOCAL outliers.
    A point may be normal globally (within the data range) but anomalous
    locally (in a region where no other similar points exist).

    This is the local-outlier problem that global methods like
    z-score and Isolation Forest can miss.
    """
    np.random.seed(42)

    # Two dense clusters + points that are normal globally but
    # anomalous locally (between clusters)
    X_c1     = np.random.randn(150, 2) * 0.5 + [0, 0]
    X_c2     = np.random.randn(150, 2) * 0.5 + [5, 5]
    X_bridge = np.array([[2.5, 2.5], [2.8, 2.2], [2.0, 2.8]])  # Between clusters

    X_all = np.vstack([X_c1, X_c2, X_bridge])
    y_all = np.array([0]*300 + [1]*3)

    lof    = LocalOutlierFactor(n_neighbors=20, contamination=0.02,
                                 novelty=False)
    labels = lof.fit_predict(X_all)  # -1 = outlier
    scores = -lof.negative_outlier_factor_  # Higher = more anomalous

    from sklearn.ensemble import IsolationForest
    iso    = IsolationForest(n_estimators=100, contamination=0.02,
                              random_state=42)
    iso_scores = -iso.fit(X_all).score_samples(X_all)

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    for ax, method_scores, title in [
        (axes[0], scores,      'LOF: Local Outlier Factor'),
        (axes[1], iso_scores,  'Isolation Forest'),
    ]:
        sc = ax.scatter(X_all[:, 0], X_all[:, 1],
                         c=method_scores, cmap='RdYlGn_r',
                         s=40, edgecolors='white', linewidth=0.3)
        ax.scatter(X_bridge[:, 0], X_bridge[:, 1], s=200,
                   facecolors='none', edgecolors='black',
                   linewidth=2.5, zorder=5, label='True anomalies')
        plt.colorbar(sc, ax=ax, label='Anomaly score')
        ax.set_title(f'{title}\n(True anomalies are circled)',
                     fontsize=11, fontweight='bold')
        ax.legend(fontsize=9); ax.grid(True, alpha=0.2)

    plt.suptitle('LOF vs Isolation Forest: Local vs Global Anomaly Detection\n'
                 'LOF detects anomalies in low-density regions between clusters',
                 fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.savefig('lof_vs_isoforest.png', dpi=150)
    plt.show()
    print("Saved: lof_vs_isoforest.png")

    # Evaluate (average_precision_score is already imported at the top)
    ap_lof = average_precision_score(y_all, scores)
    ap_iso = average_precision_score(y_all, iso_scores)
    print(f"\n  Average Precision (higher = better):")
    print(f"    LOF:              {ap_lof:.4f}")
    print(f"    Isolation Forest: {ap_iso:.4f}")
    print(f"\n  LOF is designed to detect LOCAL outliers; compare the scores above.")
    print(f"  Between-cluster points lie inside the global data range, "
          f"so global methods can miss them.")


local_outlier_factor_demo()
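
The demo above scores the same points it was fitted on (`novelty=False`). To score points that arrive later, LOF has to be fitted with `novelty=True`, which enables `score_samples` and `predict` on unseen data. A minimal sketch, assuming the same two-cluster layout as the demo; `X_ref` and `X_new` are hypothetical names introduced for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Reference data: two tight clusters, as in the demo
X_ref = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(150, 2)),
])

# New points to score after fitting
X_new = np.array([
    [0.1, -0.2],   # inside cluster 1
    [5.2,  4.9],   # inside cluster 2
    [2.5,  2.5],   # in the gap between the clusters
])

lof_nov = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_ref)

new_scores = -lof_nov.score_samples(X_new)  # higher = more anomalous
new_labels = lof_nov.predict(X_new)         # -1 = outlier, 1 = inlier

for point, s, lab in zip(X_new, new_scores, new_labels):
    print(point, round(float(s), 2), lab)
```

In novelty mode, `predict` and `score_samples` are meant for new data only; scikit-learn's documentation advises against calling them on the training set.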

Method 4: PCA Reconstruction Error

PCA-based anomaly detection works by compressing data to k components and then reconstructing it. Normal points reconstruct well because their variation is captured by the principal components. Anomalies — points that don’t fit the normal pattern — have high reconstruction error.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score


def pca_reconstruction_anomaly(X_train, X_test, y_test,
                                  n_components=None, variance_threshold=0.95,
                                  dataset_name="Dataset"):
    """
    PCA reconstruction error anomaly detection.

    Algorithm:
    1. Fit PCA on (mostly clean) training data
    2. For each test point: compress to k components, reconstruct
    3. Anomaly score = ||x - PCA_reconstruct(x)||²
       (the code uses the per-feature mean, which ranks points identically)
    4. High reconstruction error → anomaly (doesn't fit normal space)

    The k choice: keep enough components to explain 95% of normal variance.
    Anomalies tend to have structure in the directions that carry the
    remaining 5% of variance.
    """
    scaler  = StandardScaler()
    X_tr_s  = scaler.fit_transform(X_train)
    X_te_s  = scaler.transform(X_test)

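    # Choose k: either explicit or by variance threshold
    if n_components is None:
        pca_full = PCA()
        pca_full.fit(X_tr_s)
        cumvar = np.cumsum(pca_full.explained_variance_ratio_)
        n_components = int(np.searchsorted(cumvar, variance_threshold)) + 1
        # Leave at least one component out: with all components retained the
        # reconstruction is exact and every error is ~0
        n_components = max(1, min(n_components, X_tr_s.shape[1] - 1))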

    pca     = PCA(n_components=n_components)
    pca.fit(X_tr_s)

    # Reconstruction error on test set
    X_te_rec  = pca.inverse_transform(pca.transform(X_te_s))
    rec_errors = np.mean((X_te_s - X_te_rec) ** 2, axis=1)

    auc  = roc_auc_score(y_test, rec_errors)
    ap   = average_precision_score(y_test, rec_errors)

    print(f"=== PCA Reconstruction Error: {dataset_name} ===\n")
    print(f"  n_components: {n_components} "
          f"(explains {pca.explained_variance_ratio_.sum()*100:.1f}% variance)")
    print(f"  AUC-ROC: {auc:.4f}")
    print(f"  Avg Precision: {ap:.4f}")

    # Threshold at 99th percentile of training reconstruction errors
    X_tr_rec_train  = pca.inverse_transform(pca.transform(X_tr_s))
    train_rec_errors = np.mean((X_tr_s - X_tr_rec_train) ** 2, axis=1)
    threshold_99 = np.percentile(train_rec_errors, 99)

    preds_99 = (rec_errors > threshold_99).astype(int)
    from sklearn.metrics import precision_score, recall_score, f1_score
    print(f"\n  At 99th percentile training threshold ({threshold_99:.4f}):")
    print(f"    Precision: {precision_score(y_test, preds_99):.4f}")
    print(f"    Recall:    {recall_score(y_test, preds_99):.4f}")
    print(f"    F1:        {f1_score(y_test, preds_99):.4f}")

    return rec_errors, pca, scaler


# Use the same synthetic fraud dataset
pca_scores, pca_model, pca_scaler = pca_reconstruction_anomaly(
    X_tr, X_te, y_te,
    variance_threshold=0.95,
    dataset_name="Synthetic Fraud"
)
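
To make the reconstruction step concrete, here is a minimal sketch that computes the error by hand from the fitted components and checks it against `inverse_transform`. The data (`X_lin`) and the two probe points are hypothetical, chosen so that normal points vary mostly along one direction. Note that this sketch sums the squared residuals, while the function above averages them; the two differ only by a constant factor.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 3 features that mostly vary along the direction (1, 2, -1)
latent = rng.normal(size=(500, 1))
X_lin  = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.1 * rng.normal(size=(500, 3))

pca_k = PCA(n_components=1).fit(X_lin)
W  = pca_k.components_   # shape (k, n_features), orthonormal rows
mu = pca_k.mean_


def recon_error(X):
    """Squared distance from each row to the k-dimensional PCA subspace."""
    Z     = (X - mu) @ W.T   # project onto the principal components
    X_hat = Z @ W + mu       # map back to the original feature space
    return np.sum((X - X_hat) ** 2, axis=1)


on_line  = np.array([[2.0,  4.0, -2.0]])   # follows the learned direction
off_line = np.array([[2.0, -4.0,  2.0]])   # similar magnitude, wrong direction

err_manual  = recon_error(X_lin)
err_sklearn = np.sum(
    (X_lin - pca_k.inverse_transform(pca_k.transform(X_lin))) ** 2, axis=1)

print(recon_error(on_line)[0], recon_error(off_line)[0])
```

Both probe points have ordinary per-feature values; only the second violates the correlation structure, which is exactly what the reconstruction error picks up.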

Comparing All Methods: Benchmark

Python

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split
import time


def benchmark_anomaly_detectors(datasets_info):
    """
    Comprehensive benchmark of anomaly detection methods.

    Methods compared:
    - Isolation Forest (fast, scalable, handles high dimensions)
    - Local Outlier Factor (local density, good for clustered data)
    - One-Class SVM (kernel-based, good for small datasets)
    - PCA Reconstruction Error (fast, interpretable in low dimensions)
    - Z-Score (baseline statistical method)
    """
    from scipy import stats as scipy_stats

    methods = {
        'Isolation Forest':    lambda: IsolationForest(n_estimators=100,
                                                         random_state=42, n_jobs=-1),
        'LOF':                 lambda: LocalOutlierFactor(n_neighbors=20,
                                                           contamination='auto',
                                                           novelty=True),
        'One-Class SVM':       lambda: OneClassSVM(kernel='rbf', gamma='scale', nu=0.05),
    }

    print("=== Anomaly Detector Benchmark ===\n")
    print(f"  Metrics: AUC-ROC and Average Precision (AUPRC)")
    print(f"  Higher = better for both metrics\n")

    for ds_name, (X_tr, X_te, y_te) in datasets_info.items():
        scaler  = StandardScaler()
        X_tr_s  = scaler.fit_transform(X_tr)
        X_te_s  = scaler.transform(X_te)
        n_anom  = y_te.sum()
        n_total = len(y_te)

        print(f"  Dataset: {ds_name} "
              f"({n_total} test samples, {n_anom} anomalies = "
              f"{n_anom/n_total*100:.1f}%)")
        print(f"  {'Method':<22} | {'AUC-ROC':>9} | {'AUPRC':>9} | "
              f"{'Time(s)':>8}")
        print(f"  {'-'*56}")

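        # PCA reconstruction error
        # Keep at most n_features - 1 components: with all components retained
        # the reconstruction is exact and the error carries no signal
        t0    = time.perf_counter()
        k_b   = max(1, min(10, X_tr_s.shape[1] - 1))
        pca_b = PCA(n_components=k_b)
        pca_b.fit(X_tr_s)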
        rec   = np.mean((X_te_s - pca_b.inverse_transform(
            pca_b.transform(X_te_s))) ** 2, axis=1)
        t_pca = time.perf_counter() - t0
        auc_pca = roc_auc_score(y_te, rec)
        ap_pca  = average_precision_score(y_te, rec)
        print(f"  {'PCA Recon Error':<22} | {auc_pca:>9.4f} | "
              f"{ap_pca:>9.4f} | {t_pca:>8.3f}")

        # Z-Score (univariate baseline): X_te_s is already standardized with
        # training statistics, so the largest absolute value per row is the score
        t0 = time.perf_counter()
        z  = np.abs(X_te_s).max(axis=1)
        t_z  = time.perf_counter() - t0
        auc_z = roc_auc_score(y_te, z)
        ap_z  = average_precision_score(y_te, z)
        print(f"  {'Z-Score':<22} | {auc_z:>9.4f} | {ap_z:>9.4f} | {t_z:>8.4f}")

        for name, make_method in methods.items():
            t0 = time.perf_counter()
            try:
                clf = make_method()
                clf.fit(X_tr_s)
                if hasattr(clf, 'score_samples'):
                    scores = -clf.score_samples(X_te_s)
                elif hasattr(clf, 'decision_function'):
                    scores = -clf.decision_function(X_te_s)
                else:
                    scores = (-clf.predict(X_te_s)).astype(float)
                t_method = time.perf_counter() - t0

                auc = roc_auc_score(y_te, scores)
                ap  = average_precision_score(y_te, scores)
                print(f"  {name:<22} | {auc:>9.4f} | {ap:>9.4f} | "
                      f"{t_method:>8.3f}")
            except Exception as e:
                print(f"  {name:<22} | {'ERROR':>9} | {'ERROR':>9} | {str(e)[:20]}")
        print()


# Two datasets for benchmark: the 5-feature synthetic fraud data from above,
# plus a 10-feature Gaussian-shift dataset built below
datasets_bench = {
    'Synthetic Fraud (5D)':
        (X_tr, X_te, y_te),
}

np.random.seed(123)
X2_normal = np.random.randn(2000, 10)
X2_anom   = np.random.randn(100, 10) * 0.3 + 3
X2        = np.vstack([X2_normal, X2_anom])
y2        = np.array([0]*2000 + [1]*100)
X2_tr, X2_te, y2_tr, y2_te = train_test_split(
    X2, y2, test_size=0.3, random_state=42, stratify=y2
)
datasets_bench['10D Gaussian Shift'] = (X2_tr, X2_te, y2_te)

benchmark_anomaly_detectors(datasets_bench)

Threshold Selection and Evaluation

Anomaly detection models output scores, not labels. Choosing the threshold that converts scores to predictions is a critical practical step.

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_recall_curve, roc_curve,
                              roc_auc_score, average_precision_score,
                              f1_score)


def threshold_selection_analysis(y_true, scores, method_name="Method",
                                   figsize=(14, 5)):
    """
    Comprehensive threshold selection analysis.

    Three threshold strategies:
    1. By contamination rate (if known)
    2. By maximizing F1 score (if some labeled anomalies are available)
    3. By maximizing precision at fixed recall (when recall matters most)
    """
    auc_roc = roc_auc_score(y_true, scores)
    ap      = average_precision_score(y_true, scores)

    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # ROC curve
    ax = axes[0]
    fpr, tpr, thresholds_roc = roc_curve(y_true, scores)
    ax.plot(fpr, tpr, 'steelblue', lw=2.5, label=f'AUC = {auc_roc:.4f}')
    ax.plot([0, 1], [0, 1], 'k--', lw=1, alpha=0.5)
    ax.set_xlabel('False Positive Rate', fontsize=11)
    ax.set_ylabel('True Positive Rate', fontsize=11)
    ax.set_title(f'ROC Curve: {method_name}', fontsize=10, fontweight='bold')
    ax.legend(fontsize=9); ax.grid(True, alpha=0.3)

    # Precision-Recall curve
    ax = axes[1]
    prec, rec, thresholds_pr = precision_recall_curve(y_true, scores)
    ax.plot(rec, prec, 'coral', lw=2.5, label=f'AUPRC = {ap:.4f}')
    baseline = y_true.mean()
    ax.axhline(y=baseline, color='gray', linestyle='--', lw=1.5,
               label=f'Baseline = {baseline:.3f}')
    ax.set_xlabel('Recall', fontsize=11)
    ax.set_ylabel('Precision', fontsize=11)
    ax.set_title('Precision-Recall Curve\n(Use AUPRC for imbalanced anomalies)',
                 fontsize=10, fontweight='bold')
    ax.legend(fontsize=9); ax.grid(True, alpha=0.3)

    # F1 vs threshold
    ax = axes[2]
    thresholds_grid = np.percentile(scores, np.linspace(50, 99.9, 100))
    f1_vals   = []
    prec_vals = []
    rec_vals  = []

    from sklearn.metrics import precision_score, recall_score

    for t in thresholds_grid:
        preds = (scores >= t).astype(int)
        f1_vals.append(f1_score(y_true, preds, zero_division=0))
        prec_vals.append(precision_score(y_true, preds, zero_division=0))
        rec_vals.append(recall_score(y_true, preds, zero_division=0))

    best_f1_idx = np.argmax(f1_vals)
    best_f1_t   = thresholds_grid[best_f1_idx]

    ax.plot(thresholds_grid, f1_vals,   'steelblue', lw=2, label='F1')
    ax.plot(thresholds_grid, prec_vals, 'coral',     lw=2, label='Precision')
    ax.plot(thresholds_grid, rec_vals,  'mediumseagreen', lw=2, label='Recall')
    ax.axvline(x=best_f1_t, color='black', linestyle='--', lw=1.5,
               label=f'Best F1 threshold')
    ax.set_xlabel('Score Threshold', fontsize=11)
    ax.set_ylabel('Metric Value', fontsize=11)
    ax.set_title('Metrics vs Threshold\nChoose threshold by business need',
                 fontsize=10, fontweight='bold')
    ax.legend(fontsize=8); ax.grid(True, alpha=0.3)

    plt.suptitle(f'Threshold Selection: {method_name}', fontsize=12,
                 fontweight='bold')
    plt.tight_layout()
    plt.savefig('anomaly_threshold_selection.png', dpi=150)
    plt.show()
    print("Saved: anomaly_threshold_selection.png")

    print(f"\n  AUC-ROC:        {auc_roc:.4f}")
    print(f"  AUPRC:          {ap:.4f}")
    print(f"  Best F1 thresh: {best_f1_t:.4f} "
          f"(F1={max(f1_vals):.4f})")

    print(f"\n  Threshold selection guidance:")
    print(f"    - If false positives are expensive: set threshold to maximize precision")
    print(f"    - If false negatives are expensive: set threshold to maximize recall")
    print(f"    - If balanced: maximize F1")
    print(f"    - If contamination rate is known: use that percentile directly")


threshold_selection_analysis(y_te, iso_scores, "Isolation Forest")
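
The third strategy in the docstring, best precision at a fixed recall, is not computed by the function above. A minimal sketch of one way to do it with `precision_recall_curve`; the helper name `threshold_for_recall` and the demo scores are hypothetical, introduced only for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, min_recall=0.9):
    """Return (threshold, precision, recall) with the best precision among
    thresholds whose recall is at least min_recall."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    # prec and rec have one more entry than thr; drop the final point,
    # which has no threshold attached
    prec, rec = prec[:-1], rec[:-1]
    ok = rec >= min_recall
    if not ok.any():
        raise ValueError("no threshold reaches the requested recall")
    best = int(np.argmax(np.where(ok, prec, -1.0)))
    return thr[best], prec[best], rec[best]


# Hypothetical scores: anomalies score higher on average than normal points
rng    = np.random.default_rng(0)
y_demo = np.array([0] * 950 + [1] * 50)
s_demo = np.concatenate([rng.normal(0, 1, 950), rng.normal(3, 1, 50)])

t, p, r = threshold_for_recall(y_demo, s_demo, min_recall=0.9)
preds   = (s_demo >= t).astype(int)   # flag everything at or above the threshold

print(f"threshold={t:.3f}  precision={p:.3f}  recall={r:.3f}")
```

Like F1 maximization, this needs labeled anomalies, so in practice the threshold should be chosen on a validation set and then checked on separate data.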

Time-Series Anomaly Detection

Many real anomaly detection problems are temporal: sensor readings, stock prices, server metrics. Time series anomalies require special treatment because context matters — a value normal during the day may be anomalous at 3 AM.
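
The day-versus-3-AM point can be sketched with a contextual z-score: instead of one mean and standard deviation for the whole series, each reading is compared with other readings taken at the same hour. The hourly load series below is hypothetical, and the single injected reading is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 60
hours  = np.tile(np.arange(24), n_days)   # hour of day for each reading

# Hypothetical load: high during the day (8:00-20:00), low at night
base = np.where((hours >= 8) & (hours < 20), 100.0, 20.0)
load = base + rng.normal(0, 5, size=hours.size)

# Inject a daytime-sized reading at 3 AM on the last day
idx_3am = (n_days - 1) * 24 + 3
load[idx_3am] = 100.0

# Global z-score: one mean and std for the whole series
z_global = np.abs(load - load.mean()) / load.std()

# Contextual z-score: compare each reading with readings at the same hour
z_context = np.empty_like(load)
for h in range(24):
    mask = hours == h
    z_context[mask] = np.abs(load[mask] - load[mask].mean()) / load[mask].std()

print(f"global z at 3 AM:     {z_global[idx_3am]:.2f}")
print(f"contextual z at 3 AM: {z_context[idx_3am]:.2f}")
```

The injected value is unremarkable against the whole series but stands out against other 3 AM readings, which is the behavior a context-free detector cannot provide.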

Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest


def time_series_anomaly_demo(figsize=(14, 8)):
    """
    Demonstrate time-series anomaly detection using rolling statistics
    and Isolation Forest on sliding windows.

    Techniques:
    1. Rolling z-score: flag values more than k stds from rolling mean
    2. Seasonal decomposition: flag residuals after removing trend+seasonality
    3. Sliding-window Isolation Forest: use recent window as features
    """
    np.random.seed(42)
    n = 500
    t = np.arange(n)

    # Simulate a time series: trend + seasonality + noise
    trend    = 0.01 * t
    seasonal = 2 * np.sin(2 * np.pi * t / 50)
    noise    = np.random.randn(n) * 0.5
    series   = trend + seasonal + noise

    # Inject anomalies
    anom_idx = [100, 200, 201, 350, 400]
    for idx in anom_idx:
        series[idx] += np.random.choice([-6, 6])

    y_true_ts = np.zeros(n, dtype=int)
    y_true_ts[anom_idx] = 1

    # Method 1: Rolling z-score
    window = 30
    rolling_mean = np.array([series[max(0,i-window):i+1].mean()
                              for i in range(n)])
    rolling_std  = np.array([series[max(0,i-window):i+1].std()
                              for i in range(n)])
    rolling_std  = np.where(rolling_std < 0.01, 0.01, rolling_std)  # Avoid div/0
    rolling_z    = np.abs((series - rolling_mean) / rolling_std)
    anomaly_rolling = rolling_z > 3

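    # Method 2: Sliding window isolation forest
    # Each window ends at time step i, so the score stored at position i
    # describes the point at i and its win_size - 1 predecessors.
    win_size   = 10
    X_windows  = np.array([series[i - win_size + 1:i + 1]
                            for i in range(win_size, n)])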
    iso_ts     = IsolationForest(n_estimators=100, contamination=0.02,
                                  random_state=42)
    iso_ts.fit(X_windows)
    iso_scores_ts = -iso_ts.score_samples(X_windows)
    iso_anom_ts   = iso_ts.predict(X_windows) == -1
    # Pad to original length
    iso_scores_padded = np.zeros(n)
    iso_anom_padded   = np.zeros(n, dtype=bool)
    iso_scores_padded[win_size:] = iso_scores_ts
    iso_anom_padded[win_size:]   = iso_anom_ts

    fig, axes = plt.subplots(3, 1, figsize=figsize, sharex=True)

    # Raw series
    ax = axes[0]
    ax.plot(t, series, 'steelblue', lw=1.5, alpha=0.8)
    ax.scatter(t[y_true_ts == 1], series[y_true_ts == 1],
               c='red', s=100, zorder=5, marker='*', label='True anomaly')
    ax.set_title('Time Series with Injected Anomalies', fontsize=10,
                 fontweight='bold')
    ax.set_ylabel('Value'); ax.legend(fontsize=8); ax.grid(True, alpha=0.2)

    # Rolling z-score
    ax = axes[1]
    ax.plot(t, rolling_z, 'coral', lw=1.5)
    ax.axhline(y=3, color='black', linestyle='--', lw=1.5, alpha=0.7,
               label='Threshold (3σ)')
    ax.scatter(t[anomaly_rolling], rolling_z[anomaly_rolling],
               c='red', s=80, zorder=5)
    for idx in anom_idx:
        ax.axvline(x=idx, color='red', lw=0.8, alpha=0.3)
    ax.set_title('Rolling Z-Score (window=30)', fontsize=10, fontweight='bold')
    ax.set_ylabel('|Z-score|'); ax.legend(fontsize=8); ax.grid(True, alpha=0.2)

    # Isolation Forest sliding window
    ax = axes[2]
    ax.plot(t, iso_scores_padded, 'mediumseagreen', lw=1.5)
    ax.scatter(t[iso_anom_padded], iso_scores_padded[iso_anom_padded],
               c='red', s=80, zorder=5, label='Flagged anomaly')
    for idx in anom_idx:
        ax.axvline(x=idx, color='red', lw=0.8, alpha=0.3)
    ax.set_title('Isolation Forest (sliding window=10)', fontsize=10,
                 fontweight='bold')
    ax.set_xlabel('Time step', fontsize=10)
    ax.set_ylabel('Anomaly score'); ax.legend(fontsize=8); ax.grid(True, alpha=0.2)

    plt.suptitle('Time-Series Anomaly Detection Methods', fontsize=12,
                 fontweight='bold')
    plt.tight_layout()
    plt.savefig('timeseries_anomaly.png', dpi=150)
    plt.show()
    print("Saved: timeseries_anomaly.png")

    # Quick evaluation
    from sklearn.metrics import f1_score
    print(f"\n  Detection performance:")
    print(f"    Rolling Z-Score F1:  {f1_score(y_true_ts, anomaly_rolling.astype(int)):.4f}")
    print(f"    IsoForest F1:        "
          f"{f1_score(y_true_ts[win_size:], iso_anom_ts.astype(int)):.4f}")


time_series_anomaly_demo()
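
The point that a value can be normal during the day but anomalous at 3 AM is a contextual anomaly, which a global z-score would miss. One way to catch it, sketched below under stated assumptions, is to compare each point with the typical value at the same position in the cycle. This is an illustrative sketch rather than part of the demo above: `seasonal_residual_anomalies` is a hypothetical helper, the period is assumed known, the series is assumed to have no strong trend, and the hourly data is synthetic.

```python
import numpy as np


def seasonal_residual_anomalies(series, period, k=3.5):
    """Flag points that deviate from the typical value at their seasonal phase.

    Assumes a known period and no strong trend (detrend first otherwise).
    Median and MAD are used so the anomalies do not inflate the baseline.
    """
    series = np.asarray(series, dtype=float)
    phase = np.arange(len(series)) % period

    # Typical value at each position in the cycle (e.g. each hour of the day)
    profile = np.array([np.median(series[phase == p]) for p in range(period)])
    residual = series - profile[phase]

    # 1.4826 * MAD approximates the standard deviation for Gaussian noise
    centered = np.abs(residual - np.median(residual))
    mad = np.median(centered)
    robust_z = centered / (1.4826 * mad + 1e-12)
    return robust_z > k, robust_z


# Synthetic hourly load: high during the day, low at night
rng = np.random.default_rng(42)
hours = np.arange(24 * 30)                       # 30 days of hourly readings
load = (10 + 5 * np.sin(2 * np.pi * (hours - 6) / 24)
        + rng.normal(0, 0.3, hours.size))
anomaly_pos = 3 + 24 * 12
load[anomaly_pos] = 14.0     # a daytime-level value at 3 AM on day 12

flags, z = seasonal_residual_anomalies(load, period=24)
global_z = abs(load[anomaly_pos] - load.mean()) / load.std()
print("Flagged indices:", np.flatnonzero(flags).tolist())
print(f"Global z-score of the 3 AM point: {global_z:.2f}")
print(f"Seasonal robust z of the same point: {z[anomaly_pos]:.2f}")
```

The injected value sits well within the overall range of the series, so its global z-score stays small, while its deviation from the usual 3 AM level is large. With real data the period usually has to be estimated, and several overlapping cycles (daily, weekly) may need separate profiles.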

Production Anomaly Detection Pipeline

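A production-ready anomaly detection system requires more than a fitted model: it needs calibrated and monitored thresholds, drift detection, alert logging, and regular retraining. The pipeline below covers threshold calibration, a simple mean-shift drift check, and model serialization. Alert logging and retraining schedules depend on your infrastructure and are not shown.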

Python

import numpy as np
import joblib
import os
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def build_production_anomaly_detector(X_train, feature_names=None,
                                         contamination=0.01,
                                         random_state=42):
    """
    Production anomaly detector with:
    - StandardScaler + IsolationForest in a Pipeline
    - Threshold calibrated on training data
    - Drift detection: flag if new data distribution shifts
    - Serialization for deployment
    """
    if feature_names is None:
        feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]

    # Build pipeline
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('isoforest', IsolationForest(
            n_estimators=200,
            contamination=contamination,
            random_state=random_state,
            n_jobs=-1,
        ))
    ])
    pipe.fit(X_train)

    # Calibrate threshold on training data
    train_scores = -pipe.named_steps['isoforest'].score_samples(
        pipe.named_steps['scaler'].transform(X_train)
    )

    # Thresholds at different percentiles
    thresholds = {
        'conservative':  np.percentile(train_scores, 99.5),  # 0.5% flagged
        'standard':      np.percentile(train_scores, 99.0),  # 1.0% flagged
        'sensitive':     np.percentile(train_scores, 97.0),  # 3.0% flagged
    }

    # Drift detection: save training distribution statistics
    scaler_fit = pipe.named_steps['scaler']
    drift_stats = {
        'train_mean':   scaler_fit.mean_.copy(),
        'train_std':    scaler_fit.scale_.copy(),
        'train_score_mean': train_scores.mean(),
        'train_score_std':  train_scores.std(),
        'feature_names': feature_names,
    }

    model_path = 'anomaly_detector.joblib'
    joblib.dump({
        'pipeline':     pipe,
        'thresholds':   thresholds,
        'drift_stats':  drift_stats,
        'contamination': contamination,
    }, model_path)

    print(f"=== Production Anomaly Detector Built ===\n")
    print(f"  Training samples: {len(X_train)}")
    print(f"  Contamination rate: {contamination*100:.1f}%\n")
    print(f"  Calibrated thresholds:")
    for name, t in thresholds.items():
        pct_flagged = (train_scores >= t).mean() * 100
        print(f"    {name:<14}: {t:.4f} ({pct_flagged:.1f}% of training flagged)")
    print(f"\n  Model saved: {model_path} "
          f"({os.path.getsize(model_path)/1024:.0f} KB)")

    return pipe, thresholds, drift_stats


def score_new_data(model_artifact_path, X_new, threshold_level='standard'):
    """
    Score new data with drift detection.
    Returns anomaly scores, predictions, and a drift warning if applicable.
    """
    artifact = joblib.load(model_artifact_path)
    pipe       = artifact['pipeline']
    thresholds = artifact['thresholds']
    drift_stats = artifact['drift_stats']

    # Compute anomaly scores
    X_s      = pipe.named_steps['scaler'].transform(X_new)
    scores   = -pipe.named_steps['isoforest'].score_samples(X_s)
    threshold = thresholds[threshold_level]
    predictions = (scores >= threshold).astype(int)

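    # Drift detection: compare the batch mean of each feature to the training
    # mean, measured in units of the training standard deviation. This is a
    # coarse check that only catches large shifts in feature means.
    new_mean = X_new.mean(axis=0)
    mean_shift = np.abs(new_mean - drift_stats['train_mean']) / (
        drift_stats['train_std'] + 1e-8
    )
    max_shift   = mean_shift.max()
    drift_flag  = max_shift > 3.0  # Batch mean moved more than 3 training stds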

    result = {
        'scores':      scores,
        'predictions': predictions,
        'n_anomalies': predictions.sum(),
        'pct_anomalies': predictions.mean() * 100,
        'drift_detected': drift_flag,
        'max_feature_shift': max_shift,
    }

    if drift_flag:
        shifted_feat = drift_stats['feature_names'][np.argmax(mean_shift)]
        result['drift_feature'] = shifted_feat
        print(f"  ⚠ DRIFT DETECTED: {shifted_feat} shifted by "
              f"{max_shift:.1f}σ — consider retraining")

    return result


# Demonstrate
pipe_prod, thresholds_prod, drift_prod = build_production_anomaly_detector(
    X_tr, feature_names=[f'feature_{i}' for i in range(X_tr.shape[1])]
)

# Score new data
result = score_new_data('anomaly_detector.joblib', X_te)
print(f"\n  Scoring {len(X_te)} new samples:")
print(f"    Anomalies detected: {result['n_anomalies']} ({result['pct_anomalies']:.1f}%)")
print(f"    Drift detected: {result['drift_detected']}")

# Cleanup
os.remove('anomaly_detector.joblib')
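
The pipeline above stores `train_score_mean` and `train_score_std` in `drift_stats` but never reads them back. One possible use, sketched here as an assumption rather than as the article's method, is a second drift check on the anomaly scores themselves: if the average score of a new batch moves away from the training baseline, the model's notion of normal may be out of date. The helper name `score_drift_check` and the alert level `z_limit` are hypothetical, and the numbers are synthetic.

```python
import numpy as np


def score_drift_check(new_scores, train_score_mean, train_score_std,
                      z_limit=3.0):
    """Compare the mean anomaly score of a new batch to the training baseline.

    Uses the standard error of the batch mean, so larger batches can
    detect smaller shifts. z_limit is a hypothetical alerting level.
    """
    new_scores = np.asarray(new_scores, dtype=float)
    std_err = train_score_std / np.sqrt(len(new_scores))
    z = (new_scores.mean() - train_score_mean) / (std_err + 1e-12)
    return {'score_shift_z': float(z), 'score_drift': bool(abs(z) > z_limit)}


# Synthetic example: baseline values stand in for drift_stats entries
rng = np.random.default_rng(0)
baseline_mean, baseline_std = 0.45, 0.03
shifted_batch = rng.normal(0.50, 0.03, 200)      # scores drifted upward

print(score_drift_check(shifted_batch, baseline_mean, baseline_std))
```

In `score_new_data`, the two baseline values would come from `drift_stats['train_score_mean']` and `drift_stats['train_score_std']`. Because the test uses the standard error, it is sensitive on large batches; in practice the alert level may need loosening to avoid flagging harmless small shifts.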

Summary

Anomaly detection finds the rare, unexpected data points that don’t fit the learned pattern of normal behavior — without requiring labeled examples of anomalies. The choice of method depends on the type of anomaly, the dataset size, and whether anomalies are global (far from center) or local (in low-density regions).

Statistical methods (z-score, IQR) are fast, interpretable, and appropriate for approximately Gaussian, low-dimensional data. They fail on multimodal data or when anomalies are local. Use them as baselines and for simple, well-understood features.

Isolation Forest is the recommended default for tabular data. It is fast, scales to millions of points, copes with moderately high-dimensional data, and requires little hyperparameter tuning, although many irrelevant features can dilute its scores. Its core insight, that anomalies are isolated in fewer random splits, is elegant and empirically robust.

Local Outlier Factor detects anomalies that are locally sparse: normal globally but anomalous in their local neighborhood. It is the right choice when data contains multiple clusters of different densities. Its main limitation in scikit-learn is that the default LocalOutlierFactor only scores the data it was fitted on; to score unseen data, fit it with novelty=True and then call predict or score_samples.

PCA reconstruction error is a natural fit when you have pre-existing PCA infrastructure or when the anomalous dimensions are the ones not captured by the principal components. It is fast, interpretable, and pairs naturally with visualization.

One-Class SVM works well on small datasets with complex normal distributions. It is slower than Isolation Forest and requires kernel and hyperparameter tuning.

Threshold selection is a business decision: set it to maximize precision (minimize false alarms), recall (minimize missed anomalies), or F1 (balance both). The AUPRC (area under the precision-recall curve) is the most informative single metric for anomaly detection because most datasets are highly imbalanced — far more normal points than anomalies.

Evaluation Without Labels: Practical Considerations

Most anomaly detection deployments lack ground truth labels. This creates a circular evaluation problem: you cannot measure performance without labeled anomalies, but you need performance measurements to tune the detector. Several practical strategies address this:

Python

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def anomaly_detection_evaluation_strategies(X_clean):
    """
    Four strategies for sanity-checking an anomaly detector without labels.
    """
    print("=== Evaluation Strategies Without Labels ===\n")

    scaler = StandardScaler()
    X_s    = scaler.fit_transform(X_clean)

    iso = IsolationForest(n_estimators=100, contamination=0.02,
                           random_state=42)
    iso.fit(X_s)
    scores = -iso.score_samples(X_s)

    print("  1. Internal consistency: score distribution shape")
    print(f"     Score mean: {scores.mean():.4f}")
    print(f"     Score std:  {scores.std():.4f}")
    print(f"     Skewness:   {float(np.mean((scores - scores.mean())**3) / scores.std()**3):.4f}")
    print(f"     Expect: right-skewed (few very high scores = anomalies)\n")

    print("  2. Contamination sweep: the setting only moves the threshold, so flagged counts should track it")
    for cont in [0.005, 0.01, 0.02, 0.05]:
        iso_c  = IsolationForest(n_estimators=100, contamination=cont, random_state=42)
        iso_c.fit(X_s)
        n_flagged = (iso_c.predict(X_s) == -1).sum()
        print(f"     contamination={cont:.3f}: {n_flagged} flagged ({n_flagged/len(X_s)*100:.1f}%)")

    print("\n  3. Synthetic injection: inject known anomalies and measure detection rate")
    np.random.seed(42)
    X_inject = X_s.copy()
    n_injected = 20
    inject_idx = np.random.choice(len(X_s), n_injected, replace=False)
    X_inject[inject_idx] += np.random.randn(n_injected, X_s.shape[1]) * 5
    scores_inject = -IsolationForest(n_estimators=100, contamination=0.05,
                                      random_state=42).fit(X_inject).score_samples(X_inject)
    threshold = np.percentile(scores_inject, 95)
    detected  = (scores_inject[inject_idx] >= threshold).mean()
    print(f"     Injected {n_injected} anomalies; detector found {detected*100:.0f}% at 5% threshold")

    print("\n  4. Expert review: sample top-k anomalies for manual inspection")
    top_k = 10
    top_idx = np.argsort(scores)[-top_k:][::-1]
    print(f"     Top {top_k} anomaly candidates by score:")
    print(f"     Indices: {top_idx.tolist()}")
    print(f"     Scores:  {scores[top_idx].round(4).tolist()}")
    print(f"     Present these to domain experts for labeling")


np.random.seed(42)
X_eval = np.random.randn(1000, 5)
anomaly_detection_evaluation_strategies(X_eval)
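
Once domain experts have reviewed the top-k candidates from strategy 4, their verdicts give a first labeled estimate of detector quality: precision at k. The sketch below is one way to compute it, under assumptions that are not in the original code: `precision_at_k` is a hypothetical helper, the review results are assumed to arrive as a dict mapping sample index to 0 or 1, and the scores and verdicts are made up.

```python
import numpy as np


def precision_at_k(scores, reviewed_labels, k):
    """Fraction of the k highest-scoring points that reviewers confirmed.

    reviewed_labels: dict {index: 0 or 1} from manual review
    (hypothetical format). Unreviewed points among the top k are
    excluded from the estimate.
    """
    top_idx = np.argsort(scores)[-k:][::-1]
    verdicts = [reviewed_labels[i] for i in top_idx.tolist()
                if i in reviewed_labels]
    if not verdicts:
        return float('nan'), 0
    return float(np.mean(verdicts)), len(verdicts)


# Made-up scores; reviewers confirmed points 1 and 3 and rejected point 5
scores = np.array([0.41, 0.72, 0.39, 0.66, 0.44, 0.58, 0.40, 0.43])
reviewed = {1: 1, 3: 1, 5: 0}

p, n_reviewed = precision_at_k(scores, reviewed, k=3)
print(f"Precision@3 = {p:.2f} over {n_reviewed} reviewed points")
# prints: Precision@3 = 0.67 over 3 reviewed points
```

Precision at k says nothing about anomalies the detector ranked low, so it complements the injection test rather than replacing it. The reviewed labels can also be kept as a small evaluation set for later threshold tuning.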

Decision Guide: Which Method to Use

Data type / size	Cluster structure	Labels available	Recommended method
Any, tabular	No	No	Isolation Forest (default)
Small (<10K)	Multiple clusters	No	LOF (local density)
Any, Gaussian	No	No	Z-Score / IQR (fast baseline)
Any, PCA preprocessed	No	No	PCA reconstruction error
Time series	Seasonal	No	Rolling z-score + IsoForest
Any	Yes	Some	Semi-supervised one-class models

When in doubt: start with Isolation Forest, compute AUPRC, then try LOF if data has multiple density regions. Always compare both against the z-score baseline.
