Binary classification is a supervised machine learning task where the model predicts one of exactly two possible outcomes — yes or no, spam or not spam, fraud or legitimate, sick or healthy. The model learns a decision boundary that separates the two classes in feature space, then assigns new examples to one class based on which side of the boundary they fall on. Binary classification is the most fundamental classification task, and mastering it — including algorithm selection, evaluation metrics, threshold tuning, and handling class imbalance — provides the conceptual foundation for all classification problems.
Introduction: The World of Yes-or-No Decisions
Every day, machine learning systems make millions of binary decisions: Is this transaction fraudulent? Should this loan be approved? Is this X-ray showing a tumor? Will this customer cancel their subscription? Does this image contain a face? Is this review positive or negative?
These are binary classification problems — situations where the answer is one of exactly two possibilities. The output space is discrete and binary: 0 or 1, False or True, negative or positive, class A or class B. Binary classification is the most common and fundamental classification task in machine learning, and getting it right requires more than just fitting a model.
It requires choosing the right algorithm for your data, selecting evaluation metrics that match your problem’s priorities (accuracy alone is often misleading), tuning the decision threshold to balance different types of errors, handling class imbalance when one outcome is rare, and understanding what makes a binary classifier truly useful in practice.
This comprehensive guide covers the complete landscape of binary classification. You’ll learn the formal problem definition, the taxonomy of binary classification algorithms, evaluation metrics and their tradeoffs, threshold optimization, handling imbalanced classes, probability calibration, and complete Python implementations across multiple real-world scenarios.
What is Binary Classification?
The Formal Definition
Binary classification: A supervised learning task where:
- Input: Feature vector x = [x₁, x₂, …, xₙ]
- Output: Class label y ∈ {0, 1}
- Goal: Learn a function f(x) → {0, 1}
The two classes:
Positive class (y=1): The "event of interest"
Examples: fraud, spam, disease, churn, default
Negative class (y=0): The "baseline" or "absence"
Examples: legitimate, not spam, healthy, retained, repaid
Convention: The positive class is typically the rarer or more consequential outcome.
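As a quick illustrative check (the labels below are made up), you can inspect the class frequencies to confirm which outcome is the rare, positive one:
import numpy as np
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])  # made-up labels for illustration
counts = np.bincount(y)
print(f"Class 0: {counts[0]}   Class 1: {counts[1]}")  # Class 0: 8   Class 1: 2
print(f"Positive rate: {y.mean():.1%}")                # Positive rate: 20.0%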
Real-World Binary Classification Problems
Finance:
Credit card fraud detection: Fraud (1) vs. Legitimate (0)
Loan default prediction: Default (1) vs. Repaid (0)
Stock movement prediction: Up (1) vs. Down (0)
Healthcare:
Disease diagnosis: Positive (1) vs. Negative (0)
Tumor classification: Malignant (1) vs. Benign (0)
Readmission prediction: Readmitted (1) vs. Not (0)
Technology:
Spam detection: Spam (1) vs. Ham (0)
Intrusion detection: Attack (1) vs. Normal (0)
Sentiment analysis: Positive (1) vs. Negative (0)
Business:
Customer churn: Churned (1) vs. Retained (0)
Click-through prediction: Clicked (1) vs. Not clicked (0)
Product defect detection: Defective (1) vs. Good (0)
How Binary Classifiers Work
The Two-Step Process
Every binary classifier follows the same two-step process:
Step 1: Learn a scoring function
score(x) = some measure of "how much like class 1" x is
Step 2: Apply a threshold
If score(x) ≥ threshold → Predict class 1
If score(x) < threshold → Predict class 0
The scoring function varies by algorithm (a short thresholding sketch follows this list):
Logistic Regression: score = σ(wᵀx + b) = probability
Decision Tree: score = fraction of class-1 examples in leaf
Random Forest: score = average probability across trees
SVM: score = distance from hyperplane
Neural Network: score = σ(output neuron)
Naive Bayes: score = P(class 1 | features)
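Step 2 is just a comparison against the threshold. A minimal sketch, with made-up scores standing in for any of the scoring functions above:
import numpy as np
scores = np.array([0.12, 0.47, 0.51, 0.86, 0.30])  # made-up model scores
threshold = 0.5
y_pred = (scores >= threshold).astype(int)
print(y_pred)  # [0 0 1 1 0]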
The Decision Boundary
The decision boundary is the surface in feature space where the classifier is exactly uncertain — where score(x) = threshold.
1D case (one feature):
score
1.0 │                ╭──────────
    │             ╭──╯
0.5 │- - - - - ╭──╯- - - - - - -  ← threshold
    │       ╭──╯
0.0 │───────╯
    └────────────┬────────────── x
                 x* = decision boundary
   Left of x*:  predict class 0
   Right of x*: predict class 1
2D case (two features):
Feature 2
│          ●     ●
│      ●      ●     ●      ← Class 1
│   ●      ●     ●
│─────────────────────────  ← Decision boundary
│   ○    ○     ○     ○
│      ○     ○     ○       ← Class 0
│         ○     ○
└─────────────────────────  Feature 1
Binary Classification Algorithms
Logistic Regression
Best for: Linearly separable data, when probabilities matter, interpretability needed
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=1.0, max_iter=1000)
lr.fit(X_train, y_train)
probs = lr.predict_proba(X_test)[:, 1]  # Probability of class 1
Strengths: Fast, interpretable, well-calibrated probabilities, works well with many features
Weaknesses: Only linear decision boundaries, needs feature scaling, struggles with highly non-linear relationships
Decision Tree
Best for: Non-linear data, interpretability, mixed feature types
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
dt.fit(X_train, y_train)
probs = dt.predict_proba(X_test)[:, 1]
Strengths: Handles non-linearity, no scaling needed, interpretable rules
Weaknesses: Prone to overfitting without pruning, unstable (high variance)
Random Forest
Best for: General purpose, when performance matters, robust to noisy features
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
random_state=42)
rf.fit(X_train, y_train)
probs = rf.predict_proba(X_test)[:, 1]
Strengths: Excellent performance, handles non-linearity, feature importance, robust
Weaknesses: Slower training, less interpretable, memory intensive
Gradient Boosting (XGBoost / LightGBM)
Best for: Structured/tabular data, competitions, maximum performance
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
max_depth=3)
gb.fit(X_train, y_train)
probs = gb.predict_proba(X_test)[:, 1]
Strengths: State-of-the-art on tabular data, native missing-value handling (in XGBoost/LightGBM), feature importance
Weaknesses: Many hyperparameters, slower training, prone to overfitting
Support Vector Machine (SVM)
Best for: High-dimensional data, small-medium datasets, text classification
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0, probability=True)
svm.fit(X_train, y_train)
probs = svm.predict_proba(X_test)[:, 1]
Strengths: Effective in high dimensions, kernel trick for non-linearity
Weaknesses: Slow on large datasets, kernel/parameter selection tricky, memory intensive
K-Nearest Neighbors (KNN)
Best for: Small datasets, simple problems, baseline models
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
probs = knn.predict_proba(X_test)[:, 1]
Strengths: Simple, no training needed, naturally multi-class
Weaknesses: Slow at inference, needs scaling, poor with high dimensions
Naive Bayes
Best for: Text classification, very large datasets, when data is sparse
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
probs = nb.predict_proba(X_test)[:, 1]
Strengths: Very fast, works well with small data, text problems, interpretable
Weaknesses: Independence assumption often violated, poor probability calibration
Evaluation Metrics: Beyond Accuracy
Accuracy alone is insufficient for binary classification — especially with imbalanced classes.
The Confusion Matrix
Every binary prediction falls into one of four categories:
                         Predicted
                      0                 1
Actual   0     True Neg (TN)     False Pos (FP)
         1     False Neg (FN)    True Pos (TP)
TN: Correctly predicted class 0
TP: Correctly predicted class 1
FP: Incorrectly predicted class 1 (False alarm)
FN: Incorrectly predicted class 0 (Missed detection)
Real-world names:
Disease testing:
TN = Healthy person → Negative test ✓
TP = Sick person → Positive test ✓
FP = Healthy person → Positive test (false alarm — unnecessary treatment)
FN = Sick person → Negative test (missed diagnosis — dangerous!)
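A minimal sketch (with made-up labels) of how sklearn produces these four counts; the ravel() order is TN, FP, FN, TP:
from sklearn.metrics import confusion_matrix
# Made-up example: 10 patients, 1 = sick, 0 = healthy
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")  # TN=5  FP=1  FN=1  TP=3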
Key Metrics
Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Proportion of all predictions that are correct.
Problem: Misleading with imbalanced classes.
99% of transactions are legitimate → predict all negative → 99% accuracy!
But catches 0% of fraud. Useless model.
Precision (Positive Predictive Value):
Precision = TP / (TP + FP)
"Of all examples predicted positive, what fraction are actually positive?"
High precision = few false alarms
Important when: False positives are costly
Spam filter: Don't want to mark legitimate email as spam
Legal system: Don't want to convict innocent people
Recall (Sensitivity / True Positive Rate):
Recall = TP / (TP + FN)
"Of all actual positives, what fraction did we correctly identify?"
High recall = few missed detections
Important when: False negatives are costly
Cancer detection: Don't want to miss a real tumor
Fraud: Don't want to miss actual fraud
F1 Score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
= Harmonic mean of precision and recall
Balances both metrics.
Best single metric when classes are imbalanced.
Range: [0, 1], higher is better.
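A worked example with made-up counts (TP=40, FP=10, FN=20) shows how the harmonic mean pulls F1 toward the weaker of the two:
tp, fp, fn = 40, 10, 20  # made-up counts
precision = tp / (tp + fp)                        # 40/50 = 0.800
recall = tp / (tp + fn)                           # 40/60 ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
# precision=0.800  recall=0.667  f1=0.727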
Specificity (True Negative Rate):
Specificity = TN / (TN + FP)
"Of all actual negatives, what fraction did we correctly identify?"
Complement of the false positive rate: Specificity = 1 − FPR.
The Precision-Recall Tradeoff:
Lowering the decision threshold:
→ More positives predicted
→ Higher recall (catch more true positives)
→ Lower precision (more false positives)
Raising the decision threshold:
→ Fewer positives predicted
→ Lower recall (miss more true positives)
→ Higher precision (fewer false positives)
Cannot maximize both simultaneously — must choose based on problem needs.
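A minimal illustration of the tradeoff, using made-up labels and scores evaluated at a low and a high threshold:
import numpy as np
from sklearn.metrics import precision_score, recall_score
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])            # made-up labels
scores = np.array([0.10, 0.20, 0.30, 0.45, 0.55, 0.65,
                   0.35, 0.60, 0.80, 0.90])                   # made-up scores
for t in (0.3, 0.7):
    y_pred = (scores >= t).astype(int)
    print(f"threshold={t}:  precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
# Lower threshold → higher recall but lower precision; higher threshold → the reverse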
ROC-AUC: Threshold-Independent Performance
ROC Curve: Plots True Positive Rate (Recall) vs. False Positive Rate at every threshold
TPR (Recall)
1.0 │        ╭──────────────
    │     ╭──╯
    │   ╭─╯          ← Good classifier (area = 0.92)
    │ ╭─╯
0.5 │╭╯        ╱
    ││      ╱        ← Random classifier (area = 0.5, diagonal)
    ││   ╱
0.0 └┴─────────────────────── FPR
    0          0.5          1.0
AUC (Area Under Curve):
AUC = 1.0: Perfect classifier
AUC = 0.5: Random classifier (diagonal line)
AUC = 0.0: Perfectly wrong classifier
AUC ≈ probability that model ranks a random positive
example higher than a random negative example.
Threshold-independent: measures ranking quality rather than performance at any single threshold.
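To make the ranking interpretation concrete, a small self-contained check (labels and scores below are made up) compares the fraction of correctly ranked positive/negative pairs against roc_auc_score:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                    # made-up labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])   # made-up scores
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()   # fraction of pairs ranked correctly
print(f"Pairwise ranking probability: {pairwise:.3f}")
print(f"roc_auc_score:                {roc_auc_score(y_true, scores):.3f}")
# Both print the same value (0.750 here); with ties, AUC counts 0.5 per tied pair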
Practical Code: All Metrics at Once
import numpy as np
import pandas as pd
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, roc_curve,
precision_recall_curve, average_precision_score
)
import matplotlib.pyplot as plt
def evaluate_binary_classifier(y_true, y_pred, y_prob, model_name="Model"):
    """
    Comprehensive evaluation of a binary classifier.
    y_true: true labels
    y_pred: predicted labels (after threshold)
    y_prob: predicted probabilities for class 1
    """
    print(f"\n{'='*55}")
    print(f"  {model_name} — Evaluation Report")
    print(f"{'='*55}")
    # Core metrics
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob)
    print(f"  Accuracy:  {acc:.4f}")
    print(f"  Precision: {prec:.4f}")
    print(f"  Recall:    {rec:.4f}")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  ROC-AUC:   {auc:.4f}")
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    print("\n  Confusion Matrix:")
    print(f"    TN={tn:5d}  FP={fp:5d}")
    print(f"    FN={fn:5d}  TP={tp:5d}")
    # Classification report
    print(f"\n{classification_report(y_true, y_pred)}")
    return {'accuracy': acc, 'precision': prec,
            'recall': rec, 'f1': f1, 'auc': auc}
Complete Binary Classification Pipeline
Problem: Credit Card Fraud Detection
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
# ── 1. Simulate credit card fraud dataset ─────────────────────
np.random.seed(42)
X_fraud, y_fraud = make_classification(
n_samples=10000,
n_features=20,
n_informative=10,
n_redundant=5,
weights=[0.97, 0.03], # 97% legitimate, 3% fraud
flip_y=0.01,
random_state=42
)
print(f"Total examples: {len(y_fraud):,}")
print(f"Legitimate (class 0): {(y_fraud==0).sum():,} ({(y_fraud==0).mean()*100:.1f}%)")
print(f"Fraud (class 1): {(y_fraud==1).sum():,} ({(y_fraud==1).mean()*100:.1f}%)")
print(f"\nClass imbalance: {(y_fraud==0).sum() / (y_fraud==1).sum():.0f}:1")
# ── 2. Split ────────────────────────────────────────────────────
X_tr, X_te, y_tr, y_te = train_test_split(
X_fraud, y_fraud, test_size=0.2,
random_state=42, stratify=y_fraud # Preserve class ratio
)
# ── 3. Build and compare multiple classifiers ──────────────────
classifiers = {
'Logistic Regression': Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(C=1.0, max_iter=1000))
]),
'Random Forest': Pipeline([
('scaler', StandardScaler()),
('clf', RandomForestClassifier(n_estimators=100,
max_depth=8,
random_state=42))
]),
'Gradient Boosting': Pipeline([
('scaler', StandardScaler()),
('clf', GradientBoostingClassifier(n_estimators=100,
learning_rate=0.1,
max_depth=4,
random_state=42))
])
}
results = {}
for name, pipe in classifiers.items():
    pipe.fit(X_tr, y_tr)
    y_pred = pipe.predict(X_te)
    y_prob = pipe.predict_proba(X_te)[:, 1]
    metrics = evaluate_binary_classifier(y_te, y_pred, y_prob, name)
    results[name] = {'pipe': pipe, 'y_prob': y_prob, 'metrics': metrics}
Comparing ROC Curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# ── ROC curves ─────────────────────────────────────────────────
ax = axes[0]
colors = ['steelblue', 'coral', 'seagreen']
for (name, res), color in zip(results.items(), colors):
    fpr, tpr, _ = roc_curve(y_te, res['y_prob'])
    auc = res['metrics']['auc']
    ax.plot(fpr, tpr, color=color, linewidth=2,
            label=f"{name} (AUC={auc:.3f})")
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate (Recall)')
ax.set_title('ROC Curves — Fraud Detection')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
# ── Precision-Recall curves (better for imbalanced) ────────────
ax = axes[1]
for (name, res), color in zip(results.items(), colors):
    prec_c, rec_c, _ = precision_recall_curve(y_te, res['y_prob'])
    ap = average_precision_score(y_te, res['y_prob'])
    ax.plot(rec_c, prec_c, color=color, linewidth=2,
            label=f"{name} (AP={ap:.3f})")
ax.axhline(y=y_te.mean(), color='black', linestyle='--',
linewidth=1, label=f'Baseline (={y_te.mean():.3f})')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curves — Fraud Detection\n'
'(Better metric for imbalanced classes)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Threshold Optimization
The default threshold of 0.5 is rarely optimal. Choose it based on your problem’s costs.
Finding the Optimal Threshold
def find_optimal_threshold(y_true, y_prob, metric='f1'):
    """
    Find the decision threshold that maximizes a given metric.
    metric: 'f1', 'recall', or 'precision'
    """
    thresholds = np.linspace(0.01, 0.99, 200)
    scores = []
    for t in thresholds:
        y_pred_t = (y_prob >= t).astype(int)
        if metric == 'f1':
            score = f1_score(y_true, y_pred_t, zero_division=0)
        elif metric == 'recall':
            score = recall_score(y_true, y_pred_t, zero_division=0)
        elif metric == 'precision':
            score = precision_score(y_true, y_pred_t, zero_division=0)
        else:
            raise ValueError(f"Unsupported metric: {metric}")
        scores.append(score)
    best_idx = np.argmax(scores)
    best_threshold = thresholds[best_idx]
    best_score = scores[best_idx]
    return best_threshold, best_score, thresholds, scores
# Apply to best model (Gradient Boosting)
gb_probs = results['Gradient Boosting']['y_prob']
best_t, best_f1, thresholds, f1_scores = find_optimal_threshold(
y_te, gb_probs, metric='f1'
)
print(f"Default threshold (0.5) F1: "
f"{f1_score(y_te, (gb_probs>=0.5).astype(int)):.4f}")
print(f"Optimal threshold ({best_t:.2f}) F1: {best_f1:.4f}")
# Visualise threshold vs metrics
precs, recs, pr_thresholds = precision_recall_curve(y_te, gb_probs)
plt.figure(figsize=(10, 5))
plt.plot(thresholds, f1_scores, 'b-', linewidth=2, label='F1 Score')
# Align precision/recall to same threshold axis
plt.plot(pr_thresholds,
precs[:-1], 'g--', linewidth=1.5, label='Precision')
plt.plot(pr_thresholds,
recs[:-1], 'r--', linewidth=1.5, label='Recall')
plt.axvline(x=best_t, color='purple', linestyle=':',
linewidth=2, label=f'Optimal t={best_t:.2f}')
plt.axvline(x=0.5, color='gray', linestyle=':',
linewidth=1.5, label='Default t=0.5')
plt.xlabel('Decision Threshold')
plt.ylabel('Score')
plt.title('Metrics vs. Decision Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Cost-Based Threshold Selection
def cost_based_threshold(y_true, y_prob, cost_fp=1, cost_fn=5):
    """
    Select the threshold minimizing total misclassification cost.
    cost_fp: Cost of a false positive (predict fraud, is legitimate)
    cost_fn: Cost of a false negative (miss actual fraud)
    """
    thresholds = np.linspace(0.01, 0.99, 200)
    costs = []
    for t in thresholds:
        y_pred_t = (y_prob >= t).astype(int)
        cm = confusion_matrix(y_true, y_pred_t)
        tn, fp, fn, tp = cm.ravel()
        total_cost = cost_fp * fp + cost_fn * fn
        costs.append(total_cost)
    best_t = thresholds[np.argmin(costs)]
    print(f"Cost-minimizing threshold: {best_t:.3f}")
    print(f"  FP cost={cost_fp}, FN cost={cost_fn}")
    print(f"  (Missing fraud is {cost_fn}x more costly than a false alarm)")
    return best_t
# Fraud: missing fraud (FN) is 10x more costly than false alarm (FP)
cost_threshold = cost_based_threshold(y_te, gb_probs,
                                      cost_fp=1, cost_fn=10)
Handling Class Imbalance
Imbalanced datasets — where one class is rare — require special treatment.
Why Imbalance is Problematic
Fraud dataset: 97% legitimate (0), 3% fraud (1)
Naive model: "Always predict 0"
Accuracy: 97% ← Looks great!
Fraud caught: 0 ← Completely useless!
The majority class dominates training.
Model learns to predict the majority class for everything.
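To see this failure mode on the fraud split created above, a majority-class baseline (a sketch using sklearn's DummyClassifier) scores high accuracy but zero recall:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
# Baseline that always predicts the majority class (legitimate)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_tr, y_tr)
y_pred_dummy = dummy.predict(X_te)
print(f"Accuracy: {accuracy_score(y_te, y_pred_dummy):.3f}")  # high (≈ the majority rate)
print(f"Recall:   {recall_score(y_te, y_pred_dummy):.3f}")    # 0.000 (no fraud caught)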
Strategy 1: Adjust Class Weights
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' automatically weights classes
# inversely proportional to their frequency
lr_balanced = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(class_weight='balanced',
max_iter=1000))
])
lr_balanced.fit(X_tr, y_tr)
y_pred_bal = lr_balanced.predict(X_te)
y_prob_bal = lr_balanced.predict_proba(X_te)[:, 1]
print("With class_weight='balanced':")
print(f" Recall (fraud): {recall_score(y_te, y_pred_bal):.4f}")
print(f" Precision (fraud): {precision_score(y_te, y_pred_bal):.4f}")
print(f"  F1: {f1_score(y_te, y_pred_bal):.4f}")
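For reference, 'balanced' derives each weight from the training-set class counts as n_samples / (n_classes * count_c); a minimal sketch of that computation on y_tr:
import numpy as np
# What class_weight='balanced' computes under the hood
counts = np.bincount(y_tr)
weights = len(y_tr) / (2 * counts)   # n_samples / (n_classes * count_c)
print(f"Weight for class 0 (majority): {weights[0]:.2f}")  # ≈ 0.5, down-weighted
print(f"Weight for class 1 (minority): {weights[1]:.2f}")  # much larger, up-weighted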
Strategy 2: Oversampling (SMOTE)
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# SMOTE: Synthetic Minority Over-sampling Technique
# Creates synthetic examples of the minority class
smote_pipe = ImbPipeline([
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('clf', LogisticRegression(max_iter=1000))
])
smote_pipe.fit(X_tr, y_tr)
y_pred_smote = smote_pipe.predict(X_te)
y_prob_smote = smote_pipe.predict_proba(X_te)[:, 1]
print("With SMOTE oversampling:")
print(f" Recall: {recall_score(y_te, y_pred_smote):.4f}")
print(f" Precision: {precision_score(y_te, y_pred_smote):.4f}")
print(f"  F1: {f1_score(y_te, y_pred_smote):.4f}")
Strategy 3: Undersampling
from imblearn.under_sampling import RandomUnderSampler
# Reduce majority class to balance ratio
under_pipe = ImbPipeline([
('scaler', StandardScaler()),
('under', RandomUnderSampler(sampling_strategy=0.5,
random_state=42)),
('clf', LogisticRegression(max_iter=1000))
])
under_pipe.fit(X_tr, y_tr)
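To keep the comparison consistent with the strategies above, the undersampled pipeline can be evaluated the same way (a sketch mirroring the earlier snippets):
y_pred_under = under_pipe.predict(X_te)
print("With random undersampling:")
print(f"  Recall:    {recall_score(y_te, y_pred_under):.4f}")
print(f"  Precision: {precision_score(y_te, y_pred_under):.4f}")
print(f"  F1:        {f1_score(y_te, y_pred_under):.4f}")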
Strategy 4: Use AUC/F1 Instead of Accuracy
# Always evaluate with imbalance-aware metrics
from sklearn.model_selection import cross_val_score
for name, pipe in classifiers.items():
    # F1 score (handles imbalance)
    f1_cv = cross_val_score(pipe, X_fraud, y_fraud,
                            cv=5, scoring='f1').mean()
    # ROC-AUC (threshold-independent)
    auc_cv = cross_val_score(pipe, X_fraud, y_fraud,
                             cv=5, scoring='roc_auc').mean()
    print(f"{name:25s}: F1={f1_cv:.4f}  AUC={auc_cv:.4f}")
Strategy 5: Stratified Sampling
# Always use stratify=y when splitting imbalanced data
X_tr, X_te, y_tr, y_te = train_test_split(
X_fraud, y_fraud,
test_size=0.2,
random_state=42,
stratify=y_fraud # Preserves class ratio in both splits
)
print("Training set class ratio:")
print(f" Class 0: {(y_tr==0).mean()*100:.1f}%")
print(f"  Class 1: {(y_tr==1).mean()*100:.1f}%")
Cross-Validation for Binary Classification
from sklearn.model_selection import StratifiedKFold, cross_validate
# Stratified K-Fold preserves class balance across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Evaluate multiple metrics simultaneously
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
best_model = classifiers['Gradient Boosting']
cv_results = cross_validate(best_model, X_fraud, y_fraud,
cv=cv, scoring=scoring)
print("5-Fold Cross-Validation Results (Gradient Boosting):")
for metric, scores in cv_results.items():
    if metric.startswith('test_'):
        name = metric.replace('test_', '')
        print(f"  {name:12s}: {scores.mean():.4f} ± {scores.std():.4f}")
The Probability-vs-Label Decision
A critical design choice: does your application need probabilities or just labels?
When You Need Probabilities
Use case: Fraud risk scoring
Don't want binary "fraud/not fraud"
Want "risk score: 89% probability of fraud"
→ Allows prioritization (investigate highest risk first)
→ Allows different thresholds for different contexts
Use case: Medical diagnosis
Don't want binary "sick/healthy"
Want "73% probability of disease"
→ Doctor can weigh against clinical judgment
→ Risk communication to the patient
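As a sketch of probability-driven prioritization (reusing the test-set fraud probabilities gb_probs from the threshold section above), rank the cases and review the highest-risk ones first:
import numpy as np
top_k = 20
ranked = np.argsort(gb_probs)[::-1]   # indices sorted from highest to lowest risk
review_queue = ranked[:top_k]
print(f"Actual fraud among the top {top_k} highest-risk cases: "
      f"{int(y_te[review_queue].sum())} / {top_k}")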
When You Need Labels
Use case: Email spam filter
Just need: "Move to spam? Yes/No"
Binary label sufficient
Use case: Production alert
Just need: "Trigger alert? Yes/No"
Binary action is the end result
Probability Calibration
Some models produce poorly calibrated probabilities — the predicted probability doesn’t match true frequency.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
# Random Forest: often overconfident (probabilities near 0 or 1)
rf_pipe = results['Random Forest']['pipe']
rf_probs = results['Random Forest']['y_prob']
# Calibrate using Platt scaling (sigmoid) or isotonic regression
rf_calibrated = CalibratedClassifierCV(
rf_pipe.named_steps['clf'],
method='sigmoid',
cv=5
)
# Note: need to transform X first since we're bypassing pipeline
scaler_standalone = StandardScaler().fit(X_tr)
rf_calibrated.fit(scaler_standalone.transform(X_tr), y_tr)
rf_cal_probs = rf_calibrated.predict_proba(
scaler_standalone.transform(X_te)
)[:, 1]
# Compare calibration
fig, ax = plt.subplots(figsize=(7, 5))
for probs, label, color in [
        (rf_probs, 'RF (uncalibrated)', 'coral'),
        (rf_cal_probs, 'RF (calibrated)', 'steelblue')]:
    prob_true, prob_pred = calibration_curve(y_te, probs, n_bins=10)
    ax.plot(prob_pred, prob_true, 's-', color=color,
            linewidth=2, label=label)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1.5,
label='Perfect calibration')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives (True Probability)')
ax.set_title('Calibration Curves')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
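To put a number on the visual comparison, the Brier score (mean squared error between predicted probabilities and outcomes; lower is better) is a simple summary, sketched here for the two Random Forest variants:
from sklearn.metrics import brier_score_loss
print(f"Brier score, uncalibrated RF: {brier_score_loss(y_te, rf_probs):.4f}")
print(f"Brier score, calibrated RF:   {brier_score_loss(y_te, rf_cal_probs):.4f}")
# Lower is better; calibration typically reduces (or at least does not worsen) the score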
Algorithm Selection Guide
| Algorithm | Data Size | Linearity | Interpretability | Speed | Best Use Case |
|---|---|---|---|---|---|
| Logistic Reg. | Any | Linear only | High | Very fast | Baseline, interpretable, probabilities needed |
| Decision Tree | Small-Medium | Non-linear | High | Fast | Rules needed, mixed features |
| Random Forest | Medium-Large | Non-linear | Medium | Medium | General purpose, tabular data |
| Gradient Boosting | Medium-Large | Non-linear | Low | Slow | Max performance, tabular data |
| SVM | Small-Medium | Non-linear (kernel) | Low | Slow | High-dim, text, small data |
| KNN | Small | Non-linear | Medium | Very slow | Baseline, simple problems |
| Naive Bayes | Any | Linear (feature indep.) | High | Very fast | Text classification, sparse data |
| Neural Network | Large | Any shape | Very low | Slow | Images, text, complex patterns |
Conclusion: The Building Block of Classification
Binary classification is the foundation of all machine learning classification work. Every concept introduced here — decision boundaries, probability outputs, precision-recall tradeoffs, threshold optimization, class imbalance handling — generalizes directly to multi-class classification, sequence labeling, object detection, and virtually every other classification task.
The most important lessons from this guide:
Accuracy alone is dangerous. With imbalanced classes — which describe most real-world classification problems — accuracy is misleading. A model that predicts the majority class for everything achieves high accuracy while being completely useless. Always evaluate with precision, recall, F1, and ROC-AUC.
Threshold 0.5 is rarely optimal. The default threshold is a starting point, not a decision. Every classification problem has different costs for false positives and false negatives. Tune the threshold based on what matters in your specific application — medical screening demands high recall, fraud alerting may need high precision, and most problems require a deliberate balance.
Class imbalance requires active handling. Don’t let the majority class dominate. Use class weights, SMOTE, stratified sampling, and imbalance-aware metrics to ensure the minority class receives appropriate attention during training.
Probabilities are more valuable than labels. Whenever possible, preserve probability outputs. They enable flexible threshold selection, risk ranking, calibration assessment, and combination with domain knowledge in ways that hard labels never can.
Start simple, then add complexity. Logistic regression is fast, interpretable, and surprisingly powerful. It should be your first classifier on any new problem. Only escalate to Random Forest, Gradient Boosting, or neural networks when logistic regression genuinely falls short.
Binary classification is where machine learning meets real decisions. Master it, and you have the tools to build systems that catch fraud, diagnose diseases, filter spam, and make countless other yes-or-no decisions that create real value in the world.