Writing Reproducible Data Science Code

Learn how to write reproducible data science code. Master random seeds, environment management, config files, pipelines, experiment tracking, and DVC best practices.

Reproducible data science code is code that produces identical results every time it is run, by anyone, on any compatible machine — now and in the future. Achieving reproducibility requires controlling all sources of randomness through fixed seeds, pinning exact software versions in documented environments, separating configuration from code, version-controlling both code and data, and building deterministic pipelines that can be run end-to-end from raw data to final results with a single command.

Introduction

A researcher publishes a landmark machine learning paper claiming their new architecture achieves state-of-the-art performance on a standard benchmark. Other researchers try to reproduce the result. Months of effort later, the best anyone can get is several percentage points below the claimed number. The original authors can’t reproduce it either — their original training run is gone, the exact environment is unrecoverable, and the random seeds were never recorded.

This scenario, known as the reproducibility crisis, is widespread across data science and machine learning. A 2021 analysis found that the majority of published machine learning results could not be fully reproduced. In industry, the problem is equally costly: models that “worked in the notebook” mysteriously underperform in production, experiments from three months ago can’t be replicated for an audit, and teams spend weeks re-deriving results that should have been a matter of clicking “run.”

Reproducibility is not a luxury or an academic concern — it is a fundamental property of trustworthy, professional data science work. It enables you to verify your own results, share work with confidence, audit models in production, build on previous experiments without starting from scratch, and hand off projects to teammates without the dreaded “it only works on my machine” caveat.

This guide covers every dimension of reproducibility in data science: the sources of non-reproducibility you need to control, the tools and patterns that achieve control, and how to build workflows where reproducibility is the default rather than an afterthought.

Why Data Science Code Is Non-Reproducible by Default

Before learning to achieve reproducibility, understand the forces working against it. Data science code is non-reproducible by default for several interconnected reasons.

Source 1: Uncontrolled Randomness

Machine learning algorithms make random choices at numerous points: weight initialization in neural networks, data shuffling before training, bootstrap sampling in random forests, random feature subsets at each tree split, train/test splitting, dropout during training, k-fold cross-validation splits. If these random choices aren’t fixed, every run produces a different result.

Python
from sklearn.model_selection import train_test_split

# Non-reproducible: Different split every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Reproducible: Same split every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Source 2: Evolving Software Dependencies

Python packages release new versions constantly. A change in pandas’ groupby behavior between 1.5 and 2.0, a bug fix in scikit-learn’s feature importance calculation, or a numerical precision change in numpy can alter results — even when your code is identical.

Python
# Code whose behavior changed across pandas versions
# pandas < 2.1: forward-fills silently
# pandas >= 2.1: raises FutureWarning — the `method` argument is deprecated
#               in favor of df.ffill()
df.fillna(method='ffill')

Without pinned dependencies, the same code run six months later may produce different results.

Source 3: Hidden Notebook State

Jupyter Notebooks accumulate hidden state through out-of-order cell execution. A variable defined in one session persists to the next. A cell run twice modifies state cumulatively. The notebook appears to work but only because it depends on the specific execution history of the current session — a history that can never be exactly recreated.
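The failure mode can be simulated in plain Python; the comments below play the role of cells:

```python
# "Cell 1" — run first
threshold = 0.5

# "Cell 2" — run second
passed = 0.7 > threshold
print(passed)  # True

# Later, "Cell 1" is edited to threshold = 0.9 but never re-run. The notebook
# on disk now shows threshold = 0.9 next to a saved output of True — yet
# Restart & Run All would print False. Code and outputs have silently diverged.
```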

Source 4: Floating-Point Non-Determinism

Floating-point arithmetic can produce different results depending on CPU architecture, thread execution order, and hardware-specific optimizations. Parallel computations on GPU are especially susceptible — the order in which threads complete varies, and floating-point addition is not associative at machine precision.
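The non-associativity is easy to demonstrate in plain Python:

```python
a, b, c = 0.1, 0.2, 0.3

print((a + b) + c)  # 0.6000000000000001
print(a + (b + c))  # 0.6

# The same numbers summed in a different order give a different result —
# which is why parallel reductions with varying thread order are non-deterministic
print((a + b) + c == a + (b + c))  # False
```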

Source 5: Data Drift and Undocumented Data Sources

If your code reads from a live database or file path that changes over time, running the same code later produces different results because the input data has changed. Without versioning the specific data snapshot used for each experiment, results become impossible to reproduce.
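One mitigation is to freeze live sources into immutable, date-stamped files before analysis begins. A minimal sketch — the function name and directory layout are illustrative, not from this project:

```python
from datetime import date
from pathlib import Path

import pandas as pd

def snapshot(df: pd.DataFrame, name: str, data_dir: str = "data/raw") -> Path:
    """Freeze a live query result to an immutable, date-stamped CSV."""
    path = Path(data_dir) / f"{name}_{date.today():%Y%m%d}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path

# Downstream code then reads the snapshot, never the live source
```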

Source 6: Undocumented Manual Steps

“I adjusted the learning rate by hand after epoch 5.” “I removed three outliers that I noticed looked wrong.” “I ran the feature engineering twice because the first run seemed off.” These undocumented manual interventions are invisible in the code but materially affect results.

Pillar 1: Controlling Randomness with Seeds

The most accessible reproducibility improvement is setting random seeds everywhere randomness is used.

Setting Seeds Comprehensively

Different libraries have different random number generators, each of which must be seeded independently:

Python
import random
import numpy as np
import os

def set_all_seeds(seed: int = 42) -> None:
    """
    Set all random seeds for reproducible results.
    
    Call this function at the very beginning of every script and 
    notebook before any data loading or processing.
    
    Parameters
    ----------
    seed : int, optional
        The random seed value. 42 is conventional. By default 42.
    """
    # Python's built-in random module
    random.seed(seed)
    
    # NumPy
    np.random.seed(seed)
    
    # Python hash seed — affects str/bytes hashing and set iteration order.
    # NOTE: only takes effect if set before the interpreter starts
    # (e.g. PYTHONHASHSEED=42 python train.py); setting it here documents
    # intent and covers any subprocesses launched later.
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    # TensorFlow (if used)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
    
    # PyTorch (if used)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # For full determinism on CUDA (may reduce performance)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

# Call once at the start of every script/notebook
set_all_seeds(seed=42)

The random_state Parameter Convention

Scikit-learn uses a consistent random_state parameter across all algorithms that involve randomness:

Python
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

RANDOM_STATE = 42  # Define once, use everywhere

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y,              # Maintain class distribution
    random_state=RANDOM_STATE
)

# Cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Models
rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    random_state=RANDOM_STATE
)

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    random_state=RANDOM_STATE
)

lr = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_STATE
)

# Hyperparameter search
from sklearn.model_selection import RandomizedSearchCV
search = RandomizedSearchCV(
    rf,
    param_distributions={...},
    n_iter=50,
    random_state=RANDOM_STATE  # Controls which hyperparameter combinations are tried
)

Centralizing the seed value in a single constant (RANDOM_STATE = 42) means you can change it in one place and all models/splits update consistently — useful for sensitivity analyses that check whether your conclusions are seed-dependent.
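A seed-sensitivity check can be sketched as a loop over candidate seeds — here on synthetic data with an illustrative random forest:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

aucs = []
for seed in [1, 7, 42, 123, 2024]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# A small spread indicates conclusions are not an artifact of one lucky seed
print(f"AUC mean={np.mean(aucs):.3f}, spread={np.max(aucs) - np.min(aucs):.3f}")
```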

Documenting Seed Choices

YAML
# configs/config.yaml
reproducibility:
  random_seed: 42
  # WHY 42: conventional choice; verified on 2024-09-15 that results are
  # stable (±0.002 AUC-ROC) across seeds 1, 7, 42, 123, 2024
  # See notebooks/exploratory/08_seed_sensitivity.ipynb

Pillar 2: Environment Reproducibility

Identical code plus identical data plus different software versions can still produce different results. Environment reproducibility means capturing the exact software stack that produced a given result.

The requirements.txt Hierarchy

For maximum reproducibility, maintain two requirements files:

Plaintext
# requirements.txt — exact pinned versions for full reproducibility
# Generated by: pip freeze > requirements.txt
# Last updated: 2024-09-15

certifi==2023.7.22
joblib==1.3.2
matplotlib==3.7.2
numpy==1.25.2
pandas==2.0.3
scikit-learn==1.3.0
scipy==1.11.2
seaborn==0.12.2
xgboost==1.7.6

Plaintext
# requirements-base.txt — minimum version constraints for flexibility
# Used when installing in environments where exact versions create conflicts

numpy>=1.24,<2.0
pandas>=2.0,<3.0
scikit-learn>=1.3,<2.0
matplotlib>=3.7
seaborn>=0.12
xgboost>=1.7

The pinned requirements.txt guarantees exact reproduction on any machine. The flexible requirements-base.txt allows installation in environments with other constraints (Docker base images, shared clusters).

conda environment.yml for Full Stack Reproducibility

For projects using conda (especially those with GPU dependencies or compiled scientific libraries), environment.yml captures the complete environment including Python version and conda packages:

YAML
# environment.yml
name: churn-model-v2
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11.4
  - pip=23.2.1
  - numpy=1.25.2
  - pandas=2.0.3
  - scikit-learn=1.3.0
  - matplotlib=3.7.2
  - seaborn=0.12.2
  - scipy=1.11.2
  - jupyterlab=4.0.5
  - pip:
    - xgboost==1.7.6
    - shap==0.42.1
    - mlflow==2.6.0
    - pandera==0.17.0

Docker for Hermetic Environment Reproducibility

For the strongest environment reproducibility guarantee — ensuring not just Python packages but the entire system environment (OS version, system libraries, CUDA version) is controlled — use Docker:

Dockerfile
# Dockerfile

# Pin the base image exactly; for maximum reproducibility, pin by SHA digest
# (FROM python@sha256:<digest>) rather than by tag alone
FROM python:3.11.4-slim-bullseye

# Install system dependencies at pinned versions
RUN apt-get update && apt-get install -y \
    libgomp1=12.2.0-14 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install Python dependencies first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set Python hash seed for reproducibility
ENV PYTHONHASHSEED=42

CMD ["python", "src/train_model.py", "--config", "configs/config.yaml"]

With the base image, system libraries, and Python packages all pinned, this Dockerfile preserves the computational environment for as long as those artifacts remain downloadable. For true long-term reproducibility, push the built image to a registry and reference it by digest — the image itself then becomes the preserved artifact.

Documenting the Environment at Result Time

When publishing a model result or sharing an analysis, record the exact environment state:

Python
import platform
import sys
from importlib.metadata import version, PackageNotFoundError

def print_environment_info():
    """Print complete environment information for reproducibility documentation."""
    print(f"Python version: {sys.version}")
    print(f"Platform: {platform.platform()}")
    print(f"Architecture: {platform.machine()}")
    print()
    
    # importlib.metadata is the stdlib replacement for the deprecated pkg_resources
    key_packages = ['numpy', 'pandas', 'scikit-learn', 'xgboost', 
                    'matplotlib', 'scipy', 'tensorflow', 'torch']
    
    print("Key package versions:")
    for pkg in key_packages:
        try:
            print(f"  {pkg}: {version(pkg)}")
        except PackageNotFoundError:
            print(f"  {pkg}: not installed")

print_environment_info()

Plaintext
Python version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
Platform: Linux-5.15.0-1040-aws-x86_64-with-glibc2.31
Architecture: x86_64

Key package versions:
  numpy: 1.25.2
  pandas: 2.0.3
  scikit-learn: 1.3.0
  xgboost: 1.7.6
  matplotlib: 3.7.2
  scipy: 1.11.2

Include this output in your notebook results or model documentation.

Pillar 3: Separating Configuration from Code

Hardcoded values — file paths, column names, hyperparameters, thresholds — are one of the most common reproducibility killers. When they’re embedded in code, changing an experiment requires modifying source files, which creates ambiguity about what changed between runs.

The Configuration File Approach

Centralize all experiment parameters in version-controlled YAML or JSON files:

YAML
# configs/config.yaml

experiment:
  name: "xgboost_v3_tuned"
  description: "XGBoost with learning rate decay and L1 regularization"
  date: "2024-09-15"
  author: "Jane Smith"

reproducibility:
  random_seed: 42

paths:
  raw_data: "data/raw/transactions_2024_q3.csv"
  processed_data: "data/processed/features_v2.parquet"
  model_output: "models/xgboost_v3.pkl"
  metrics_output: "reports/metrics/xgboost_v3_metrics.json"
  figures_dir: "reports/figures/"

data:
  target_column: "churned"
  id_column: "customer_id"
  date_column: "transaction_date"
  categorical_columns: ["channel", "product_category", "region"]
  numerical_columns: ["amount", "frequency_90d", "recency_days"]
  test_size: 0.2
  validation_size: 0.1

model:
  algorithm: "xgboost"
  hyperparameters:
    n_estimators: 500
    learning_rate: 0.05
    max_depth: 6
    min_child_weight: 3
    subsample: 0.8
    colsample_bytree: 0.8
    gamma: 0.1
    reg_alpha: 0.2
    reg_lambda: 1.5
    scale_pos_weight: 3.2

evaluation:
  primary_metric: "roc_auc"
  threshold: 0.42
  metrics: ["roc_auc", "f1", "precision", "recall", "average_precision"]

Load and use consistently:

Python
import yaml
from pathlib import Path
from typing import Any

from xgboost import XGBClassifier

def load_config(config_path: str = "configs/config.yaml") -> dict[str, Any]:
    """Load configuration from YAML file."""
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config


def train(config_path: str = "configs/config.yaml") -> None:
    config = load_config(config_path)
    
    seed = config['reproducibility']['random_seed']
    set_all_seeds(seed)
    
    # All paths from config — never hardcoded
    data_path = config['paths']['processed_data']
    model_path = config['paths']['model_output']
    
    # All hyperparameters from config
    hyperparams = config['model']['hyperparameters']
    model = XGBClassifier(**hyperparams, random_state=seed)
    
    # ...


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="configs/config.yaml")
    args = parser.parse_args()
    train(config_path=args.config)

Now different experiments use different config files — and the exact config used for each result is recorded alongside the results.

Versioning Config Files with Experiments

Create a config file for each significant experiment variant:

Plaintext
configs/
├── config.yaml             ← Current/default config
├── experiments/
│   ├── exp_01_baseline.yaml
│   ├── exp_02_feature_engineering_v2.yaml
│   ├── exp_03_xgboost_default.yaml
│   ├── exp_04_xgboost_tuned.yaml
│   └── exp_05_xgboost_v2_more_regularization.yaml

The config file becomes the identity of the experiment — commit each new config to Git and the complete experiment is reproducible forever.

Pillar 4: Data Version Control

Code versioning with Git is well understood. Data versioning is equally important but much less commonly practiced.

The Problem

Plaintext
Code version: commit a3f8c2d  (Git commit — easy to track)
Data version: transactions_2024_q3.csv  (???  — which version? updated monthly!)

When a colleague reports a discrepancy in your model’s performance, you need to know not just what code was used, but what data. Without data versioning, this question is unanswerable.

DVC: Data Version Control

DVC extends Git semantics to data and model files. DVC tracks data files by their content hash, stores them in remote storage (S3, GCS, Azure Blob), and records the mapping between Git commits and data versions.

Bash
# Set up DVC
pip install dvc dvc-s3
dvc init

# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store

# Track a data file
dvc add data/raw/transactions_2024_q3.csv
# Creates data/raw/transactions_2024_q3.csv.dvc — a small text file committed to Git;
# the CSV itself goes into .dvc/cache and is pushed to remote storage

# Commit the .dvc file to Git
git add data/raw/transactions_2024_q3.csv.dvc
git commit -m "Add Q3 2024 transaction data"

# Push data to S3
dvc push

# Six months later, anyone can reproduce exactly:
git checkout a3f8c2d        # Check out the code version
dvc pull                    # Pull the exact data version that code used
python src/train_model.py   # Identical results guaranteed

The .dvc file is tiny (a few lines of YAML), lives in Git, and records a content hash of the data file (MD5 by default). Checking out a commit and running dvc pull retrieves exactly the data that was used when that commit was made.
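For reference, the contents of a .dvc file look roughly like this — field names vary slightly across DVC versions, and the hash shown is illustrative:

```yaml
outs:
- md5: 3b4f8c2d9e1a6f7b8c3d4e5f6a7b8c9d
  size: 52428800
  path: transactions_2024_q3.csv
```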

Tracking Processed Data and Models

DVC can also track intermediate data and trained models:

Bash
# Track processed features
dvc add data/processed/features_v2.parquet

# Track trained model
dvc add models/xgboost_v3.pkl

# Now you have full lineage: raw data → processed → model
# All three are tied to specific Git commits and reproducible

Lightweight Alternative: Data Hashing

If DVC is too heavyweight for a project, at minimum record the hash of your data files alongside your results:

Python
import hashlib

def compute_file_hash(filepath: str, algorithm: str = 'sha256') -> str:
    """Compute cryptographic hash of a file for reproducibility documentation."""
    h = hashlib.new(algorithm)
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

# Record with your results
data_hash = compute_file_hash("data/raw/transactions_2024_q3.csv")
print(f"Training data SHA256: {data_hash}")
# Training data SHA256: 3b4f8c2d9e1a6f7b8c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b

Record this hash in your experiment log and model documentation. When someone asks “which data version trained this model?”, you can verify by comparing hashes.
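To make the hash part of the permanent record rather than a one-off print, write it into the same JSON file as your metrics. A small self-contained sketch (the function names and file layout are assumptions, not project conventions):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(filepath: str) -> str:
    """SHA256 of a file, read in chunks to handle large data."""
    h = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def record_run(metrics: dict, data_path: str, out_path: str) -> None:
    """Save metrics together with the hash of the data that produced them."""
    record = {
        'data_file': data_path,
        'data_sha256': sha256_of(data_path),
        'metrics': metrics,
    }
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text(json.dumps(record, indent=2))
```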

Pillar 5: Building End-to-End Reproducible Pipelines

Reproducibility isn’t just about individual scripts — it’s about the entire pipeline from raw data to final results being executable deterministically.

The End-to-End Pipeline Pattern

Every data science project should have a single command that runs the complete pipeline:

Python
# src/pipeline.py

import argparse
import logging
from pathlib import Path

from src.data.make_dataset import download_raw_data
from src.data.preprocess import preprocess_transactions
from src.features.build_features import build_feature_matrix
from src.models.train_model import train_model
from src.models.evaluate_model import evaluate_model
from src.utils.config import load_config
from src.utils.reproducibility import set_all_seeds, log_environment

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_pipeline(config_path: str) -> dict:
    """
    Run the complete ML pipeline end-to-end.
    
    Parameters
    ----------
    config_path : str
        Path to the experiment configuration YAML file.
    
    Returns
    -------
    dict
        Dictionary of evaluation metrics from the trained model.
    """
    # Load configuration
    config = load_config(config_path)
    logger.info(f"Running pipeline: {config['experiment']['name']}")
    
    # Set reproducibility
    seed = config['reproducibility']['random_seed']
    set_all_seeds(seed)
    logger.info(f"Random seed set to {seed}")
    
    # Log environment for documentation
    log_environment()
    
    # Step 1: Data ingestion
    logger.info("Step 1: Data ingestion")
    raw_data = download_raw_data(config['paths']['raw_data'])
    
    # Step 2: Preprocessing
    logger.info("Step 2: Preprocessing")
    clean_data = preprocess_transactions(raw_data, config)
    clean_data.to_parquet(
        config['paths']['interim_data'], 
        index=False
    )
    
    # Step 3: Feature engineering
    logger.info("Step 3: Feature engineering")
    features = build_feature_matrix(clean_data, config)
    features.to_parquet(
        config['paths']['processed_data'],
        index=False
    )
    
    # Step 4: Model training
    logger.info("Step 4: Model training")
    model, metrics = train_model(features, config)
    
    # Step 5: Evaluation
    logger.info("Step 5: Evaluation")
    final_metrics = evaluate_model(model, features, config)
    
    # Save results
    import json
    metrics_path = config['paths']['metrics_output']
    Path(metrics_path).parent.mkdir(parents=True, exist_ok=True)
    
    results = {
        'experiment_name': config['experiment']['name'],
        'config_path': config_path,
        'random_seed': seed,
        'metrics': final_metrics
    }
    
    with open(metrics_path, 'w') as f:
        json.dump(results, f, indent=2)
    
    logger.info(f"Pipeline complete. Metrics: {final_metrics}")
    return final_metrics


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the ML pipeline")
    parser.add_argument(
        "--config", 
        type=str, 
        default="configs/config.yaml",
        help="Path to configuration file"
    )
    args = parser.parse_args()
    run_pipeline(args.config)

Run any experiment:

Bash
# Run with default config
python src/pipeline.py

# Run a specific experiment
python src/pipeline.py --config configs/experiments/exp_04_xgboost_tuned.yaml

# Reproduce a historical experiment from Git history
git checkout a3f8c2d
python src/pipeline.py --config configs/experiments/exp_03_xgboost_default.yaml

Makefile Automation

Expose pipeline commands through a Makefile for discoverability:

Makefile
# Makefile

PYTHON = python
CONFIG = configs/config.yaml

.PHONY: all data features train evaluate test clean reproduce

all: data features train evaluate

data:
	$(PYTHON) src/data/make_dataset.py --config $(CONFIG)

features:
	$(PYTHON) src/features/build_features.py --config $(CONFIG)

train:
	$(PYTHON) src/models/train_model.py --config $(CONFIG)

evaluate:
	$(PYTHON) src/models/evaluate_model.py --config $(CONFIG)

# Run complete pipeline
pipeline:
	$(PYTHON) src/pipeline.py --config $(CONFIG)

# Reproduce a specific experiment
reproduce:
	@echo "Reproducing experiment: $(EXPERIMENT)"
	$(PYTHON) src/pipeline.py --config configs/experiments/$(EXPERIMENT).yaml

test:
	pytest tests/ -v --tb=short

# Reproduction check: stash local changes, pull the tracked data, rerun the
# pipeline, and compare metrics against the recorded baseline
reproduce-check:
	git stash
	dvc pull
	$(PYTHON) src/pipeline.py --config $(CONFIG)
	$(PYTHON) scripts/compare_metrics.py --expected reports/expected_metrics.json

Usage:

Bash
make pipeline                                    # Run default pipeline
make reproduce EXPERIMENT=exp_04_xgboost_tuned  # Reproduce specific experiment
make reproduce-check                             # Verify full reproducibility

Pillar 6: Experiment Tracking

Manually managing experiment results in files and spreadsheets is error-prone and doesn’t scale. Dedicated experiment tracking tools record parameters, metrics, artifacts, and environment automatically.

MLflow: The Open-Source Standard

MLflow is the most widely adopted open-source experiment tracking tool. It logs parameters, metrics, artifacts, and model files for each experiment run, and provides a web UI for comparing runs.

Python
import mlflow
import mlflow.sklearn

from src.utils.config import load_config
from src.utils.reproducibility import set_all_seeds

def train_with_tracking(config_path: str) -> None:
    config = load_config(config_path)
    seed = config['reproducibility']['random_seed']
    set_all_seeds(seed)
    
    # Start an MLflow run
    with mlflow.start_run(run_name=config['experiment']['name']):
        
        # Log all configuration parameters
        mlflow.log_params({
            'random_seed': seed,
            'algorithm': config['model']['algorithm'],
            **config['model']['hyperparameters']
        })
        
        # Log data information
        mlflow.log_param('training_data', config['paths']['processed_data'])
        mlflow.log_param('test_size', config['data']['test_size'])
        
        # ... load data, train model ...
        
        # Log metrics
        mlflow.log_metrics({
            'roc_auc': metrics['roc_auc'],
            'f1_score': metrics['f1'],
            'precision': metrics['precision'],
            'recall': metrics['recall']
        })
        
        # Log the trained model
        mlflow.sklearn.log_model(model, "model")
        
        # Log the config file as an artifact
        mlflow.log_artifact(config_path, "config")
        
        # Log figures
        mlflow.log_artifact("reports/figures/roc_curve.png", "figures")
        
        print(f"Run ID: {mlflow.active_run().info.run_id}")
        print(f"AUC-ROC: {metrics['roc_auc']:.4f}")

Start the MLflow UI:

Bash
mlflow ui
# Opens at http://localhost:5000

The UI shows all experiments in a table, allows sorting and filtering by metric, and lets you compare the parameters and metrics of any two runs side by side — answering “what changed between the run that got 0.891 and the one that got 0.914?” in seconds.

Reproducing a Specific MLflow Run

Python
import mlflow

# Load a specific run by its ID
run_id = "3f8c2d9e1a6f7b8c3d4e5f6a7b8c9d0e"
run = mlflow.get_run(run_id)

# Get all parameters from that run
params = run.data.params
print(f"Random seed used: {params['random_seed']}")
print(f"Learning rate: {params['learning_rate']}")

# Load the exact model artifact
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")

Weights & Biases and Neptune

For teams with more complex needs — deep learning training curves, hyperparameter sweep visualization, team collaboration — Weights & Biases (W&B) and Neptune offer richer tracking capabilities with hosted infrastructure, though they require API keys and have usage costs beyond free tiers.

Python
import wandb

wandb.init(
    project="customer-churn",
    config=config,
    name=config['experiment']['name']
)

# Log metrics during training
for epoch in range(n_epochs):
    wandb.log({
        'train_loss': train_loss,
        'val_auc': val_auc,
        'epoch': epoch
    })

wandb.finish()

Pillar 7: Making Notebooks Reproducible

Notebooks require special attention because their stateful execution model works against reproducibility.

The Notebook Reproducibility Checklist

Before sharing or archiving any notebook:

1. Restart and run all cells

  • Kernel → Restart & Run All — if the notebook fails, it is not reproducible
  • This is non-negotiable for report notebooks shared with others

2. Clear and re-run, verify determinism

  • Run it twice. Do the results match exactly?
  • If not, there’s uncontrolled randomness somewhere

3. Set the seed at the very first code cell

Python
# Cell 1 — always the first code cell in every notebook
import random
import numpy as np
import os

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
os.environ['PYTHONHASHSEED'] = str(SEED)

4. Use absolute or project-relative paths

Python
from pathlib import Path

# Don't:
df = pd.read_csv("/Users/jane/Desktop/project/data/data.csv")  # Absolute

# Do:
PROJECT_ROOT = Path(__file__).parent.parent  # Or Path("../..") in a notebook
df = pd.read_csv(PROJECT_ROOT / "data" / "raw" / "data.csv")

5. Print versions and environment info

Python
# First markdown cell or early code cell
import sys, pandas, numpy, sklearn
print(f"Python: {sys.version}")
print(f"pandas: {pandas.__version__}")
print(f"numpy: {numpy.__version__}")
print(f"scikit-learn: {sklearn.__version__}")

Papermill: Parameterized Notebook Execution

Papermill executes notebooks programmatically with injected parameters, enabling notebooks to be part of automated, reproducible pipelines:

Python
# In the notebook: give one cell the "parameters" tag in Jupyter.
# Papermill injects overriding values in a new cell right after it.

# parameters
DATA_PATH = "data/processed/features_v2.parquet"
MODEL_CONFIG = "configs/config.yaml"
OUTPUT_PATH = "reports/run_20240915/"
SEED = 42

Bash
# Execute the notebook with different parameters
papermill notebooks/reports/model_evaluation.ipynb \
    reports/run_20240916/evaluation.ipynb \
    -p DATA_PATH "data/processed/features_v3.parquet" \
    -p MODEL_CONFIG "configs/experiments/exp_05.yaml" \
    -p SEED 42

# The output notebook contains all results with the injected parameters

This transforms notebooks from interactive documents into reproducible, parameterizable pipeline components.

Pillar 8: Testing for Reproducibility

Automated tests can verify reproducibility as part of your CI/CD pipeline — catching accidental non-reproducibility before it reaches production.

Reproducibility Tests

Python
# tests/test_reproducibility.py

import numpy as np
import pandas as pd

from src.utils.reproducibility import set_all_seeds
from src.features.build_features import build_feature_matrix
from src.models.train_model import train_model
from src.utils.config import load_config
from src.pipeline import run_pipeline


class TestReproducibility:
    
    def test_feature_engineering_is_deterministic(self, sample_transactions):
        """Feature engineering must produce identical output across calls."""
        config = load_config("configs/config.yaml")
        
        set_all_seeds(42)
        features_run1 = build_feature_matrix(sample_transactions, config)
        
        set_all_seeds(42)
        features_run2 = build_feature_matrix(sample_transactions, config)
        
        pd.testing.assert_frame_equal(features_run1, features_run2)
    
    def test_model_training_is_reproducible(self, sample_features):
        """Two training runs with the same seed must produce identical predictions."""
        config = load_config("configs/config.yaml")
        
        set_all_seeds(42)
        model1, _ = train_model(sample_features, config)
        predictions1 = model1.predict_proba(sample_features.drop('churned', axis=1))[:, 1]
        
        set_all_seeds(42)
        model2, _ = train_model(sample_features, config)
        predictions2 = model2.predict_proba(sample_features.drop('churned', axis=1))[:, 1]
        
        np.testing.assert_array_equal(predictions1, predictions2,
            err_msg="Model predictions differ across runs with same seed")
    
    def test_different_seeds_produce_different_results(self, sample_features):
        """Verify that seed actually affects training (not silently ignored)."""
        config = load_config("configs/config.yaml")
        
        set_all_seeds(42)
        model1, _ = train_model(sample_features, config)
        preds1 = model1.predict_proba(sample_features.drop('churned', axis=1))[:, 1]
        
        set_all_seeds(999)
        model2, _ = train_model(sample_features, config)
        preds2 = model2.predict_proba(sample_features.drop('churned', axis=1))[:, 1]
        
        # Different seeds should produce measurably different models
        assert not np.array_equal(preds1, preds2), \
            "Different seeds produced identical results — seed may not be working"
    
    def test_pipeline_metrics_match_expected(self):
        """Golden test: full pipeline must reproduce known-good metrics."""
        EXPECTED_AUC = 0.9142  # Recorded from verified run on 2024-09-15
        TOLERANCE = 0.001
        
        metrics = run_pipeline("configs/config.yaml")
        
        assert abs(metrics['roc_auc'] - EXPECTED_AUC) < TOLERANCE, \
            f"AUC-ROC {metrics['roc_auc']:.4f} differs from expected {EXPECTED_AUC:.4f}"

The last test — a golden test — is particularly powerful: it records the exact metric achieved on a specific date with a specific environment, and fails the build if future changes cause any unexplained deviation. This catches dependency upgrades, code refactors, and other changes that silently affect model performance.
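The tests above all call a `set_all_seeds()` helper. As a reference, a minimal sketch covering only the standard library and NumPy, with the framework-specific calls noted in comments, might look like:

```python
import os
import random

import numpy as np


def set_all_seeds(seed: int = 42) -> None:
    """Seed every common source of randomness in one call."""
    # Recorded for any subprocesses; note that PYTHONHASHSEED must be set
    # before the interpreter starts to affect str/bytes hash randomization.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)      # Python's built-in random module
    np.random.seed(seed)   # NumPy's legacy global generator
    # If PyTorch / TensorFlow are installed, also call:
    #   torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    #   tf.random.set_seed(seed)
```

Calling this once at the top of every entry-point script (not buried inside a function that may or may not run) is what makes the back-to-back runs in the tests above comparable.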

Reproducibility Documentation: The Experiment Report

Every significant experiment should be accompanied by a structured report that contains everything needed to reproduce it:

Plaintext
# Experiment Report: XGBoost v3 — Tuned Hyperparameters

## Experiment Summary
- **Date**: 2024-09-15
- **Author**: Jane Smith
- **Goal**: Improve AUC-ROC over XGBoost v2 baseline (0.908) using 
  Bayesian hyperparameter optimization

## How to Reproduce

    git checkout a3f8c2d
    dvc pull
    python src/pipeline.py --config configs/experiments/exp_04_xgboost_tuned.yaml

Expected result: AUC-ROC = 0.9142 ± 0.0005

## Environment
- Python: 3.11.4
- Key packages: numpy==1.25.2, pandas==2.0.3, scikit-learn==1.3.0, 
  xgboost==1.7.6
- Platform: Linux x86_64 (AWS t3.xlarge)

## Data
- Training: data/raw/transactions_2024_q1_q2.csv  
  (SHA256: 3b4f8c2d9e1a6f7b8c3d4e5f6a7b8c9d...)
- Test: data/raw/transactions_2024_q3.csv  
  (SHA256: 7f2a1b9c4e5d6a3b8c2d1e4f5a6b7c8d...)

## Configuration
Full config: configs/experiments/exp_04_xgboost_tuned.yaml  
Random seed: 42 (stable across seeds 1, 7, 42, 123 — sensitivity verified)

## Results

| Metric | Value |
|--------|-------|
| AUC-ROC | 0.9142 |
| F1 Score | 0.847 |
| Precision | 0.856 |
| Recall | 0.839 |

## What Changed from v2
- Reduced learning_rate from 0.1 to 0.05 (slower learning, better generalization)
- Added L1 regularization (reg_alpha=0.2)  
- Increased n_estimators from 300 to 500 (more trees to compensate for lower lr)

## MLflow Run
- Run ID: 3f8c2d9e1a6f7b8c3d4e5f6a7b8c9d0e
- Experiment: customer-churn-model
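The SHA256 fingerprints in the Data section above can be computed with a small streaming helper (the function name here is illustrative, not part of the project code):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in chunks so multi-gigabyte datasets never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running it once per raw data file and pasting the digest into the report lets anyone verify later that they are training on byte-identical data.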

Common Reproducibility Pitfalls and Their Fixes

| Pitfall | Symptom | Fix |
|---------|---------|-----|
| Missing random seed | Results differ between runs | Set seed in set_all_seeds() at script start |
| Unpinned dependencies | Results differ after pip install --upgrade | Use pip freeze > requirements.txt |
| Fitting on test data | Optimistic validation metrics | Fit all preprocessors only on training data |
| Absolute file paths | Code fails on other machines | Use relative paths + pathlib.Path |
| Manual steps not in code | "It only works when I run it manually" | Automate all steps in pipeline scripts |
| Data changes over time | Can't reproduce old results | Version data with DVC; record data hashes |
| Notebook hidden state | Results depend on execution order | Restart & Run All before sharing |
| No experiment records | "What params produced that result?" | Use MLflow or a structured experiment log |
| PyTorch non-determinism | GPU results differ each run | Set torch.backends.cudnn.deterministic = True |
| Parallel processing order | Results differ with multiprocessing | Use deterministic algorithms; sort before parallel ops |
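As a sketch of the last fix in the table: `Executor.map` yields results in submission order regardless of which task finishes first, so sorting inputs once and mapping over them gives output that is identical on every run (this helper is illustrative, not from the pipeline code, and assumes the items are sortable):

```python
from concurrent.futures import ThreadPoolExecutor


def deterministic_parallel_map(func, items, max_workers=4):
    """Apply func to items in parallel, with output order fixed by input order."""
    ordered = sorted(items)  # fix the processing order up front
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, not completion order,
        # so the result list does not depend on worker scheduling.
        return list(pool.map(func, ordered))
```

The same idea applies to `ProcessPoolExecutor` and to chunked Pandas operations: decide the order deterministically before fanning out, and reassemble by that order rather than by completion time.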

Summary

Reproducibility is not a single technique but a discipline — a set of interconnected practices that together guarantee your results can be trusted, verified, and recreated. The eight pillars covered in this guide — controlling randomness through seeds, pinning environments, separating configuration from code, versioning data with DVC, building end-to-end pipelines, tracking experiments with MLflow, making notebooks reproducible, and testing for reproducibility — address the full range of sources through which non-reproducibility creeps into data science work.

The payoff for this investment is substantial and compounds over time. Reproducible projects can be audited, extended, and handed off without loss of knowledge. Reproducible experiments enable genuine scientific comparison of approaches. Reproducible models can be debugged in production when they fail. And reproducible workflows build the professional trust that distinguishes mature data science practice from ad-hoc analysis.

The practical path forward is incremental: start with seeds and pinned dependencies (the highest-impact, lowest-effort changes), then add config files and a pipeline script, then DVC for data, then MLflow for experiment tracking. Each step builds on the previous, and the cumulative effect is a fully reproducible, professionally credible data science practice.

Key Takeaways

  • Reproducibility has many interconnected sources of failure — randomness, software versions, notebook state, floating-point precision, data drift, and undocumented manual steps among them — and each must be addressed for full reproducibility
  • set_all_seeds() must seed Python’s random module, NumPy, PyTorch, TensorFlow, and PYTHONHASHSEED — seeding only one or two is insufficient for full determinism
  • pip freeze > requirements.txt captures the exact software environment; environment.yml does the same for conda — always pin exact versions for reproducible experiments
  • Config YAML files separate all experiment parameters from code — enabling different experiments to run by swapping config files, with Git tracking what changed between them
  • DVC extends Git semantics to data files, enabling git checkout + dvc pull to retrieve the exact code and data used for any historical result
  • MLflow (or equivalent) automatically records parameters, metrics, artifacts, and environment for every training run, answering “what parameters produced that result?” without manual documentation
  • Kernel → Restart & Run All before sharing any notebook is non-negotiable — it’s the only way to verify a notebook is actually reproducible rather than depending on hidden session state
  • Golden tests — automated tests that verify the pipeline produces metrics matching a known-good result from a verified run — catch reproducibility regressions before they reach production