Documentation Best Practices for Data Science Code


Documentation in data science code encompasses everything that explains what your code does, why decisions were made, and how to use or reproduce your work — including inline comments, function docstrings, README files, data dictionaries, experiment logs, and architectural overviews. Good documentation is not an afterthought but an integral part of professional data science practice that dramatically improves code maintainability, reproducibility, collaboration, and the long-term value of your work.

Introduction

There is a joke in software development: “Code never lies, but comments sometimes do.” In data science, this joke has a darker dimension. Data science code without documentation doesn’t just confuse people — it destroys value. A machine learning model that consumed weeks of computational resources and produced state-of-the-art results is nearly worthless if no one can understand what data it was trained on, what preprocessing was applied, what the output means, or how to use it in production.

Documentation is the bridge between what your code does and what any person — including future you — understands about what your code does. That gap can be enormous. The algorithmic genius who wrote a complex feature engineering pipeline on a Tuesday afternoon knows exactly what every transformation means. Three months later, after dozens of other projects, that same person stares at their own code with genuine confusion.

Despite its importance, documentation is the most consistently neglected practice in data science. The reasons are understandable: documentation feels like slowing down, results feel more important than process, and the exploratory nature of data science work makes it feel premature to document something before you know if it will be kept. These are real tensions, not just excuses.

This guide navigates those tensions honestly. It provides concrete, practical documentation practices scaled appropriately to the stage and stakes of your work — from lightweight inline comments you should write while coding, to structured docstrings for production functions, to comprehensive project documentation that makes your work truly reproducible and transferable.

The Documentation Spectrum: From Code to Project Level

Documentation in data science exists at multiple levels of granularity, each serving different purposes and audiences.

Plaintext
Granularity Level    │ Type                  │ Audience
─────────────────────┼───────────────────────┼────────────────────────────
Most Granular        │ Inline comments       │ Yourself, code reviewers
                     │ Function docstrings   │ API users, your future self
                     │ Module docstrings     │ Developers of the module
                     │ Notebook markdown     │ Analysts, technical stakeholders
                     │ README files          │ Everyone starting with the project
                     │ Data dictionaries     │ Anyone working with the data
                     │ Architecture docs     │ Technical leads, new team members
Most Broad           │ Project documentation │ All stakeholders, public users

Good documentation practice means attending to all these levels — not just writing docstrings and calling it done. Let’s explore each level in depth.

Level 1: Inline Comments

Inline comments are the most immediate form of documentation — notes written directly alongside code that explain specific decisions, caveats, or non-obvious logic.

When to Write Inline Comments

The most common mistake with inline comments is writing too many of them — explaining what the code does rather than why it does it. Well-written code is largely self-explanatory in terms of what it does; comments add value by explaining why.

Don’t explain the obvious:

Python
# BAD: This comment adds nothing
# Create a copy of the DataFrame
df_clean = df.copy()

# BAD: Just restating what the code clearly does
# Sort by date in descending order
df = df.sort_values('date', ascending=False)

Do explain the non-obvious:

Python
# GOOD: Explains WHY, not just what
# Create a copy to avoid mutating the original DataFrame — downstream 
# functions assume the input is unchanged
df_clean = df.copy()

# GOOD: Explains a business-domain reason that isn't obvious from code
# Sort descending because our feature engineering uses the most recent 
# transaction first (recency bias in RFM model)
df = df.sort_values('date', ascending=False)

# GOOD: Documents a known issue or constraint
# Using clip instead of filtering to avoid losing rows with occasional
# negative amounts — finance team confirmed these are valid refund entries
df['amount'] = df['amount'].clip(lower=0)

# GOOD: Explains a performance trade-off
# Converting to category dtype before merge reduces memory from 1.2GB to 
# 340MB — critical for this dataset size; tested on 2024-08-12
df['category'] = df['category'].astype('category')

Comments for Data Science–Specific Situations

Data science code has unique patterns that benefit from specific commenting styles.

Documenting magic numbers:

Python
# Threshold determined by precision-recall analysis on validation set
# See notebooks/exploratory/04_threshold_selection.ipynb, cell 12
CHURN_PROBABILITY_THRESHOLD = 0.42

# Minimum transactions needed for reliable RFM scores — based on domain 
# expert input from customer success team (2024-09 review)
MIN_TRANSACTION_COUNT = 3

# 99th percentile of amount distribution in training data
# Values above this are likely data entry errors
AMOUNT_OUTLIER_CEILING = 9850.0

Documenting data assumptions:

Python
# Assumes input df has already passed schema validation
# (customer_id: str, transaction_date: datetime, amount: float)
def compute_rfm_scores(df: pd.DataFrame) -> pd.DataFrame:
    ...

# NOTE: Missing values in 'channel' are treated as 'unknown' by design —
# verified with data engineering team that NaN means no channel tracking,
# not a data quality issue
df['channel'] = df['channel'].fillna('unknown')

Documenting workarounds:

Python
# WORKAROUND: scikit-learn 1.3 changed the default max_features behavior 
# for RandomForest — explicitly setting to 'sqrt' for backward compatibility 
# with older model configs. Remove when all configs are updated to 1.3+ syntax.
model = RandomForestClassifier(n_estimators=300, max_features='sqrt')

# HACK: Upstream data sometimes sends dates in MM/DD/YYYY instead of 
# ISO format. This dual-parse handles both until data source is fixed.
# Tracked in JIRA ticket DS-247.
try:
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
except ValueError:
    df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

TODO and FIXME Comments

Use structured tags that your IDE and team can search for systematically:

Plaintext
# TODO: Add input validation for negative amounts after DS-301 is resolved
# TODO: Consider caching this computation — called 3x in the pipeline

# FIXME: This function silently drops rows with NaN in product_id
# Expected behavior: raise ValueError or impute. See DS-298.

# HACK: Temporary fix for production outage 2024-11-14 — replace with
# proper solution after incident review

# NOTE: This is intentionally slow — optimizing it would break the 
# incremental update logic in downstream_processor.py

Level 2: Function and Method Docstrings

A docstring is a string literal that appears as the first statement of a module, function, class, or method. Python’s help() function and IDE tooltips display this string, making it the primary API documentation for your code.
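
The mechanism is worth seeing once: the string literal is stored on the function object itself, which is what `help()` and IDE tooltips read (a toy function for illustration):

```python
def clip_negative(x: float) -> float:
    """Return x, with negative values clipped to zero."""
    return max(x, 0.0)

# The first string literal in the body becomes the __doc__ attribute,
# which help(clip_negative) and IDE tooltips display
print(clip_negative.__doc__)
```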

Why Docstrings Matter More in Data Science

In general software development, function names and type signatures often tell you most of what you need to know about a function. In data science, they frequently don’t:

Python
# What does this return? A DataFrame? A numpy array? What shape?
# What do the parameters mean? What does 'mode' accept?
def compute_features(df, customer_id, mode):
    ...

Data science functions deal with DataFrames that have specific column requirements, arrays of specific shapes, models that must have been fitted before use, and domain-specific concepts. Docstrings are essential for making these implicit requirements explicit.
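
A docstring can be paired with an explicit runtime check so the requirement is enforced, not just described. A minimal sketch, assuming the column names used elsewhere in this guide:

```python
import pandas as pd

REQUIRED_COLUMNS = {"customer_id", "transaction_date", "amount"}

def validate_schema(df: pd.DataFrame) -> None:
    """Raise KeyError if any required column is missing from df."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise KeyError(f"Missing required columns: {sorted(missing)}")

# A frame missing 'transaction_date' fails fast with a clear message
df = pd.DataFrame(columns=["customer_id", "amount"])
try:
    validate_schema(df)
except KeyError as err:
    print(err)  # names the missing column
```

Failing fast like this turns an implicit docstring requirement into an error a caller sees immediately.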

Docstring Formats

There are three widely used docstring formats in the Python ecosystem:

NumPy/SciPy Format (most popular in data science):

Python
def calculate_customer_metrics(
    df: pd.DataFrame,
    customer_col: str = 'customer_id',
    amount_col: str = 'amount',
    date_col: str = 'transaction_date',
    reference_date: Optional[str] = None
) -> pd.DataFrame:
    """
    Calculate RFM (Recency, Frequency, Monetary) metrics for each customer.
    
    Computes the three classic customer segmentation metrics from a 
    transaction DataFrame. Reference date defaults to today if not provided,
    making the function suitable for both historical analysis and live scoring.

    Parameters
    ----------
    df : pd.DataFrame
        Transaction data. Must contain columns for customer ID, amount, 
        and transaction date. Rows with null values in any of these columns 
        are silently dropped before computation.
    customer_col : str, optional
        Name of the customer identifier column, by default 'customer_id'.
    amount_col : str, optional
        Name of the transaction amount column (must be numeric), 
        by default 'amount'.
    date_col : str, optional
        Name of the transaction date column (must be datetime or 
        parseable string), by default 'transaction_date'.
    reference_date : str or None, optional
        ISO-format date string (e.g., '2024-09-30') used as "today" for 
        recency calculation. If None, uses the current date. Useful for 
        reproducible historical analysis, by default None.

    Returns
    -------
    pd.DataFrame
        One row per customer with columns:
        - customer_id: Original customer identifier
        - recency_days: Days since most recent transaction
        - frequency: Total number of transactions
        - monetary_total: Sum of all transaction amounts
        - monetary_avg: Mean transaction amount

    Raises
    ------
    ValueError
        If df is empty after dropping null values in required columns.
    KeyError
        If any of customer_col, amount_col, or date_col are not in df.columns.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     'customer_id': ['A', 'A', 'B'],
    ...     'amount': [100.0, 250.0, 75.0],
    ...     'transaction_date': ['2024-08-01', '2024-09-15', '2024-09-01']
    ... })
    >>> result = calculate_customer_metrics(df, reference_date='2024-09-30')
    >>> print(result[['customer_id', 'recency_days', 'frequency']].to_string())
       customer_id  recency_days  frequency
    0           A            15          2
    1           B            29          1

    Notes
    -----
    RFM scoring thresholds are not applied here — this function returns 
    raw metric values. Apply score_rfm_metrics() to convert to 1-5 scores.
    
    Reference: Hughes, A.M. (1994). Strategic Database Marketing.
    """

Google Format (concise, readable in source):

Python
def split_features_target(
    df: pd.DataFrame,
    target_col: str,
    drop_cols: Optional[List[str]] = None
) -> Tuple[pd.DataFrame, pd.Series]:
    """Split a DataFrame into feature matrix X and target vector y.

    Args:
        df: Input DataFrame containing both features and target.
        target_col: Name of the column to use as the prediction target.
        drop_cols: Additional columns to exclude from X (e.g., ID columns
            that shouldn't be features). Defaults to None.

    Returns:
        A tuple (X, y) where X is the feature DataFrame and y is the 
        target Series. Both share the same index as the input DataFrame.

    Raises:
        KeyError: If target_col or any column in drop_cols is not present 
            in df.

    Example:
        >>> X, y = split_features_target(df, 'churned', drop_cols=['customer_id'])
        >>> print(f"Features: {X.shape}, Target: {y.shape}")
        Features: (10000, 47), Target: (10000,)
    """

reStructuredText (reST) Format (used by Sphinx):

Python
def normalize_features(
    X: np.ndarray,
    method: str = 'minmax'
) -> np.ndarray:
    """
    Normalize a feature matrix using the specified method.

    :param X: Feature matrix of shape (n_samples, n_features).
    :type X: numpy.ndarray
    :param method: Normalization method. One of 'minmax' (scales to [0,1]),
        'zscore' (standardizes to mean=0, std=1), or 'robust' (scales using
        median and IQR, resistant to outliers). Defaults to 'minmax'.
    :type method: str
    :raises ValueError: If method is not one of the accepted values.
    :returns: Normalized feature matrix with same shape as X.
    :rtype: numpy.ndarray
    """

Which format to choose? NumPy format is the standard in data science (used by numpy, scipy, pandas, scikit-learn themselves). Google format is more concise and suitable for shorter functions. Pick one and be consistent across your project — document the choice in a CONTRIBUTING.md file.
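
Consistency can also be checked programmatically. A minimal sketch of a docstring audit — the module and function names here are a toy example, not part of any real project:

```python
import inspect
import types

# Toy module with one documented and one undocumented function
mod = types.ModuleType("demo")
exec(
    "def documented():\n"
    "    'Has a docstring.'\n"
    "def undocumented():\n"
    "    pass\n",
    mod.__dict__,
)

def audit_docstrings(module):
    """Return names of public functions in module that lack a docstring."""
    return [
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_") and not inspect.getdoc(obj)
    ]

print(audit_docstrings(mod))  # → ['undocumented']
```

Linters such as pydocstyle automate the same idea across a whole codebase.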

Docstring Quality Checklist

For every function docstring, verify it answers these questions:

| Question | Docstring Element |
|----------|-------------------|
| What does this function do? | First-line summary (imperative mood: “Calculate…” not “Calculates…”) |
| What are the inputs? | Parameters section with type, description, defaults, valid values |
| What does it return? | Returns section with type and description |
| What can go wrong? | Raises section with exception types and conditions |
| How do I use it? | Examples section with runnable code |
| Are there edge cases? | Notes section for caveats, assumptions, references |
| Has anything changed? | (Optional) Deprecated section if behavior has changed |
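
Examples sections written in doctest style can be executed directly, which keeps them honest. A minimal sketch using the standard library’s doctest module:

```python
import doctest

def double(x: float) -> float:
    """Double a number.

    Examples
    --------
    >>> double(2.5)
    5.0
    """
    return 2 * x

# Run every >>> example found in double's docstring
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(double, "double", globs={"double": double}):
    runner.run(test)
print(f"{runner.failures} failures out of {runner.tries} examples")
```

Running `python -m doctest your_module.py` applies the same check to every docstring in a file.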

Class Docstrings

Classes in data science code — custom transformers, model wrappers, data loaders — need class-level docstrings that describe the class’s purpose and usage pattern:

Python
class ChurnFeatureTransformer(BaseEstimator, TransformerMixin):
    """
    Build customer churn prediction features from transaction history.
    
    This transformer computes RFM metrics, behavioral ratios, and time-based 
    features from a transaction DataFrame and returns a feature matrix 
    suitable for machine learning models.
    
    Designed to work in scikit-learn Pipelines — implements fit() and 
    transform() following the sklearn transformer API.
    
    Parameters
    ----------
    reference_date : str or None, optional
        ISO date for recency calculation. If None, uses current date.
        Set explicitly for reproducible offline analysis.
    include_channel_features : bool, optional
        Whether to include channel-based behavioral features (requires
        a 'channel' column in input data). By default True.
    n_transaction_bins : int, optional
        Number of frequency quantile bins for discretizing transaction 
        counts. By default 5.
    
    Attributes
    ----------
    feature_names_ : list of str
        Names of generated features, set after fit().
    n_features_out_ : int
        Number of output features, set after fit().
    
    Examples
    --------
    >>> transformer = ChurnFeatureTransformer(reference_date='2024-09-30')
    >>> X_features = transformer.fit_transform(transactions_df)
    >>> print(transformer.feature_names_[:3])
    ['recency_days', 'frequency', 'monetary_total']
    
    Notes
    -----
    The transformer requires the input DataFrame to contain at minimum:
    'customer_id', 'transaction_date', and 'amount' columns.
    See build_features.py for column name configuration.
    """
    
    def __init__(
        self,
        reference_date: Optional[str] = None,
        include_channel_features: bool = True,
        n_transaction_bins: int = 5
    ):
        self.reference_date = reference_date
        self.include_channel_features = include_channel_features
        self.n_transaction_bins = n_transaction_bins

Level 3: Module-Level Documentation

Every Python file in your src/ directory should begin with a module-level docstring explaining the file’s purpose, contents, and relationships to other modules:

Python
"""
src/features/build_features.py
===============================

Feature engineering pipeline for the customer churn prediction model.

This module transforms cleaned transaction data (output of 
src/data/preprocess.py) into a feature matrix ready for model training.

Main entry point:
    build_feature_matrix(transactions_df) → feature_df

Key transformations:
    - RFM metrics (recency, frequency, monetary value)
    - Channel behavior ratios (mobile/web/store split)
    - Time-based features (days since first purchase, purchase velocity)
    - Product diversity features (unique categories, avg category concentration)

Feature output schema:
    See configs/feature_schema.yaml for complete feature definitions 
    and expected value ranges.

Dependencies:
    pandas >= 2.0
    numpy >= 1.24
    scikit-learn >= 1.3 (for ChurnFeatureTransformer base classes)

Author: Jane Smith <jane.smith@company.com>
Last updated: 2024-09-15
"""

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from typing import Optional, List, Tuple

This module docstring acts as a map: someone opening the file for the first time immediately understands what it contains, how it fits into the larger project, and what the main function to call is.

Level 4: Notebook Documentation with Markdown

Jupyter Notebooks are often the primary way data science work is communicated to technical and semi-technical audiences. Treating them as executable documents — not just runnable code — requires strategic use of Markdown cells.

The Anatomy of a Well-Documented Notebook

A professional data science notebook follows this structure:

Plaintext
Cell 1 (Markdown): Title and Purpose
  - What is this notebook about?
  - What question does it answer?
  - Who is the intended audience?
  - When was it created / last updated?

Cell 2 (Markdown): Setup and Imports Context
  - Brief note on environment and data sources

Cell 3 (Code): Imports

Cell 4 (Code): Configuration and constants

Cell 5 (Markdown): Section 1 — Data Loading
  - What data are we loading?
  - Where does it come from?
  - What should it look like?

Cell 6 (Code): Data loading code
Cell 7 (Code): Shape/head/info inspection

Cell 8 (Markdown): Section 2 — Exploratory Analysis
  - What are we looking for?
  - What hypotheses are we testing?

... and so on, alternating Markdown explanation with Code execution

Example: Well-Documented Notebook Opening

Plaintext
# Customer Churn — Feature Engineering Exploration

**Purpose**: Explore and validate candidate features for the churn 
prediction model. Identify which transaction-based behavioral signals 
correlate most strongly with 30-day churn.

**Inputs**: 
- `data/processed/transactions_clean.parquet` — cleaned transaction data
- `data/processed/churn_labels.parquet` — 30-day churn labels

**Outputs**: 
- Feature correlation analysis (saved to `reports/figures/`)
- Feature shortlist for `src/features/build_features.py`

**Author**: Jane Smith  
**Created**: 2024-09-10  
**Last updated**: 2024-09-15  
**Status**: Complete — findings incorporated into build_features.py
Plaintext
## 1. Data Overview

Loading the Q3 2024 transaction data (after preprocessing in 
`notebooks/exploratory/01_data_cleaning.ipynb`). 

Expect ~500,000 rows, one row per transaction, covering 85,000 unique 
customers. The churn labels are binary (1 = churned within 30 days of 
observation date, 0 = retained).
Python
# Code cell: data loading
df = pd.read_parquet("data/processed/transactions_clean.parquet")
labels = pd.read_parquet("data/processed/churn_labels.parquet")

print(f"Transactions: {df.shape}")
print(f"Customers with labels: {labels.shape}")
print(f"Churn rate: {labels['churned'].mean():.1%}")
Plaintext
## 2. RFM Feature Analysis

RFM (Recency, Frequency, Monetary) metrics are the classic foundation of 
customer segmentation. We expect all three to be predictive of churn:

- **Recency**: Customers who haven't purchased recently are more likely 
  to have churned
- **Frequency**: Habitual purchasers are more loyal
- **Monetary**: High-value customers may have different churn patterns

We'll first compute raw RFM metrics, then examine their distributions 
and correlations with the churn label.

This alternation between Markdown explanation and code execution creates a document that reads like a coherent analytical narrative while remaining fully executable.

Key Markdown Documentation Practices in Notebooks

Always explain what a visualization shows, not just display it:

Plaintext
### Observation

The distribution is right-skewed with a long tail — most customers have 
recency under 30 days, but a significant segment hasn't purchased in 90+ 
days. These high-recency customers correlate strongly with churn 
(see churn rate by recency bucket below).

This suggests recency should be log-transformed or bucketed for the model.

Document surprising or counter-intuitive findings:

Plaintext
**Unexpected finding**: High monetary value customers show *higher* churn 
rates in the 60-120 day recency bucket. This may indicate bulk purchasers 
who make large one-time transactions rather than loyal repeat customers.
Flagged for product team discussion — may warrant a separate customer 
segment model.

Record decisions made during analysis:

Plaintext
**Decision**: Using the median rather than mean for monetary features 
due to extreme outliers (max transaction = $48,000 — likely B2B account). 
Mean would be heavily skewed by these outliers. See outlier analysis in 
cell 14.

Level 5: README Files

The README is the most important documentation file in any project — it’s the front door. Writing a great README is a distinct skill.

The Complete Data Science README Template

Plaintext
# [Project Name]

[1-2 sentence description of what this project does and the business 
problem it solves]

## Problem Statement

[3-5 sentences explaining the business context: What decision does this 
model support? What was the previous approach? What improvement does this 
deliver?]

## Approach

[Brief description of your methodology: data sources, modeling approach, 
evaluation strategy]

## Results

| Model | AUC-ROC | F1 Score | Precision@30% | 
|-------|---------|----------|---------------|
| Baseline (LR) | 0.823 | 0.764 | 0.612 |
| Random Forest | 0.891 | 0.819 | 0.714 |
| **XGBoost (final)** | **0.914** | **0.847** | **0.761** |

[1-2 sentences interpreting the results and their business implications]

## Project Structure

    ├── data/
    │   ├── raw/             ← Source data (DVC-tracked, not in Git)
    │   └── processed/       ← Feature-engineered data
    ├── notebooks/
    │   ├── exploratory/     ← EDA and experiments
    │   └── reports/         ← Polished analysis reports  
    ├── src/                 ← Python modules
    ├── tests/               ← Unit tests
    ├── configs/             ← Configuration YAML files
    └── models/              ← Trained model files

## Setup

**Prerequisites**: Python 3.11, Git, DVC

    git clone git@github.com:org/project-name.git
    cd project-name
    python -m venv venv
    source venv/bin/activate    # Windows: venv\Scripts\activate
    pip install -r requirements-dev.txt
    dvc pull                    # Download data (requires AWS credentials)

## Running the Pipeline

    make pipeline       # Full data → features → train → evaluate
    make train          # Train model only (uses existing processed data)
    make evaluate       # Evaluate the latest trained model
    make test           # Run unit tests

## Key Files

| File | Description |
|------|-------------|
| `configs/config.yaml` | All file paths, column names, training parameters |
| `src/features/build_features.py` | Feature engineering logic |
| `src/models/train_model.py` | Model training script |
| `notebooks/reports/06_final_evaluation.ipynb` | Final model analysis |

## Data

- **Source**: Internal CRM database export (Q3 2024)
- **Size**: ~500K transactions, 85K customers
- **Access**: DVC remote at `s3://company-data/churn-model/`
- **Schema**: See `data/raw/README.md`
- **Refresh cadence**: Monthly export

## Model Details

- **Algorithm**: XGBoost classifier
- **Features**: 47 RFM + behavioral features (see `configs/feature_schema.yaml`)
- **Target**: Binary churn within 30 days
- **Training data**: Q1-Q2 2024 transactions
- **Validation**: Q3 2024 holdout set
- **Retraining trigger**: When monthly AUC-ROC drops below 0.88

## Contributing

1. Create a feature branch: `git switch -c feature/your-feature`
2. Make your changes, add tests
3. Run `make lint test` to verify quality checks pass
4. Open a pull request against `main`

## Authors

- Jane Smith (jane@company.com) — modeling, feature engineering
- Bob Jones (bob@company.com) — data pipeline, infrastructure

## License

Internal use only — contact the Data Science team for access.

Level 6: Data Dictionaries

A data dictionary is documentation that describes every variable in your dataset — its name, type, description, valid values, units, and source. It is arguably the most valuable documentation artifact in a data science project and the most frequently skipped.

What a Data Dictionary Contains

Plaintext
# Data Dictionary: Customer Transactions

**Dataset**: transactions_clean.parquet
**Last updated**: 2024-09-15
**Row count**: ~500,000
**Grain**: One row per transaction

## Column Definitions

| Column | Type | Description | Example | Notes |
|--------|------|-------------|---------|-------|
| `transaction_id` | str | Unique identifier for each transaction | "TXN_20240901_001234" | UUID format, never null |
| `customer_id` | str | Unique customer identifier | "CUST_98765" | Joins to customers table |
| `transaction_date` | datetime | Date and time of purchase (UTC) | 2024-09-01 14:23:11 | Null for ~0.1% of records (data entry gap) |
| `amount` | float | Transaction amount in USD | 149.99 | Always positive; negative = data error |
| `product_id` | str | Product purchased | "PROD_electronics_001" | Joins to products table |
| `product_category` | str | Top-level product category | "electronics" | One of 12 categories (see Categories table) |
| `channel` | str | Purchase channel | "mobile_app" | One of: web, mobile_app, store, phone; null = unknown |
| `is_returned` | bool | Whether the item was subsequently returned | False | Set at time of return processing |
| `promotion_applied` | bool | Whether a promotional discount was used | True | |
| `discount_amount` | float | Discount applied in USD | 15.00 | 0.0 if no promotion |

## Categories Reference

| product_category | Description | Example Products |
|-----------------|-------------|-----------------|
| electronics | Consumer electronics | Phones, laptops, tablets |
| apparel | Clothing and accessories | Shirts, shoes, bags |
| home | Home goods and furniture | Furniture, decor, kitchen |
| ... | | |

## Known Data Quality Issues

1. **Missing transaction_date (~0.1%)**: Records from legacy POS system 
   (2019-2020) occasionally have null timestamps. These are excluded from 
   recency features but counted in frequency metrics.
   
2. **Channel null (~2%)**: Pre-2021 web transactions don't have channel 
   attribution. Treated as 'unknown' in channel features.
   
3. **Amount outliers**: Transactions above $5,000 are predominantly B2B 
   accounts. Consider filtering or separate treatment in consumer-focused models.

## Source System

CRM database: `crm_production.transactions` table  
Extraction query: `scripts/data_extraction.sql`  
Refresh: Monthly on the 1st

Store the data dictionary in a docs/ directory or as a README within the data/processed/ directory. It is one of the first documents anyone should read when joining a project.
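
A starter skeleton can be generated from the DataFrame itself and then annotated by hand. A sketch — the helper name and sample columns are illustrative, not from any real pipeline:

```python
import pandas as pd

def data_dictionary_skeleton(df: pd.DataFrame) -> str:
    """Render a Markdown table skeleton for a data dictionary."""
    lines = [
        "| Column | Type | Description | Example | Notes |",
        "|--------|------|-------------|---------|-------|",
    ]
    for col in df.columns:
        # Use the first non-null value as the example, if one exists
        example = df[col].dropna().iloc[0] if df[col].notna().any() else ""
        lines.append(f"| `{col}` | {df[col].dtype} | TODO | {example} | |")
    return "\n".join(lines)

df = pd.DataFrame({
    "customer_id": ["CUST_1", "CUST_2"],
    "amount": [149.99, 75.0],
})
print(data_dictionary_skeleton(df))
```

The generated TODO descriptions are the part only a human can fill in — units, valid ranges, and known quirks.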

Level 7: API Documentation with Sphinx and MkDocs

For data science projects that expose Python modules as libraries, or for teams that want browsable documentation websites, automatic documentation generation tools are invaluable.

Sphinx: The Python Documentation Standard

Sphinx is the standard documentation tool for Python projects — used by Python itself, NumPy, pandas, scikit-learn, and thousands of libraries.

Bash
pip install sphinx sphinx-rtd-theme
cd docs/
sphinx-quickstart

Sphinx reads your docstrings and generates HTML documentation automatically:

Bash
# Generate HTML docs from your docstrings
make html

# View the generated documentation
open _build/html/index.html

The autodoc extension extracts docstrings from your modules:

reStructuredText
.. automodule:: src.features.build_features
   :members:
   :undoc-members:
   :show-inheritance:

For data science projects, the NumPy docstring format is specifically designed to render beautifully with Sphinx + the numpydoc extension.
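
Enabling this in `conf.py` looks roughly like the following — a sketch, since the exact extension list depends on your setup, and numpydoc must be pip-installed separately:

```python
# docs/conf.py (excerpt)
extensions = [
    "sphinx.ext.autodoc",   # pull API docs from docstrings
    "sphinx.ext.viewcode",  # link rendered docs to highlighted source
    "numpydoc",             # parse NumPy-style Parameters/Returns sections
]
html_theme = "sphinx_rtd_theme"
```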

MkDocs: Modern, Simpler Documentation

MkDocs with the Material theme is increasingly popular for data science projects — it’s simpler than Sphinx and produces beautiful, modern documentation websites.

Bash
pip install mkdocs mkdocs-material mkdocstrings[python]
mkdocs new .

Configure mkdocs.yml:

YAML
site_name: Customer Churn Model
theme:
  name: material
  palette:
    primary: indigo

plugins:
  - search
  - mkdocstrings:
      handlers:
        python:
          options:
            docstring_style: numpy

nav:
  - Home: index.md
  - Getting Started: getting_started.md
  - Data Dictionary: data_dictionary.md
  - API Reference:
    - Feature Engineering: api/features.md
    - Model Training: api/models.md
  - Results: results.md

Serve locally: mkdocs serve → opens at http://127.0.0.1:8000

Deploy to GitHub Pages: mkdocs gh-deploy

Documenting Experiments and Model Decisions

One of the most valuable but most neglected documentation practices in data science is recording the why behind modeling decisions — what was tried, what didn’t work, and why the final approach was chosen.

The Experiment Log

Maintain a running EXPERIMENTS.md or use a dedicated experiment tracking tool (MLflow, Weights & Biases, Neptune) to record:

Plaintext
# Experiment Log: Customer Churn Model

## 2024-08-15: Baseline Experiments
**Goal**: Establish baseline performance with simple models  
**Data**: Q1-Q2 2024 transactions (70/30 train/val split)  
**Results**:
- Logistic Regression: AUC-ROC 0.823, F1 0.764
- Decision Tree: AUC-ROC 0.801, F1 0.741

**Finding**: LR surprisingly competitive — good linear signal in raw features  
**Next step**: Try ensemble methods

---

## 2024-08-22: Feature Engineering Round 1
**Goal**: Test whether RFM features improve over raw transaction counts  
**Changes**: Added recency_days, frequency_90d, monetary_avg features  
**Results**:
- Random Forest: AUC-ROC 0.891 (↑ from 0.823 baseline)

**Finding**: RFM features provide large improvement  
**Next step**: Add channel and time-based features

---

## 2024-09-05: XGBoost Tuning
**Goal**: Optimize XGBoost with full feature set  
**Method**: 5-fold CV grid search over learning_rate, max_depth, n_estimators  
**Best params**: lr=0.05, max_depth=6, n_estimators=500  
**Results**: AUC-ROC 0.914 on Q3 holdout  

**Why XGBoost over Neural Network**: Tested a simple MLP (3 layers, 128/64/32 
units) — AUC-ROC 0.907, lower than XGBoost, much slower to train, much 
harder to explain to stakeholders. XGBoost selected as final model.

**Why not SHAP + threshold tuning first**: Tried threshold=0.35 based on 
precision-recall tradeoff — see notebook 05_threshold_analysis.ipynb.
Settled on 0.42 based on business requirement (flag top 20% of customers).

This log answers the question every new team member and every future model auditor will ask: “Why is the model built this way?”
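Even without a tracking service, a small helper can keep plain-file entries consistent. A minimal sketch, assuming a Markdown log like the one above (the helper name and entry fields are illustrative, not a standard):

```python
from datetime import date
from pathlib import Path

def log_experiment(path, title, goal, results, finding, next_step):
    """Append a dated, consistently formatted entry to a Markdown experiment log."""
    entry = "\n".join([
        f"## {date.today().isoformat()}: {title}",
        f"**Goal**: {goal}  ",
        "**Results**:",
        *[f"- {r}" for r in results],
        "",
        f"**Finding**: {finding}  ",
        f"**Next step**: {next_step}",
        "",
        "---",
        "",
    ])
    log = Path(path)
    # Write the top-level header only on first use, then append entries
    header = "" if log.exists() else "# Experiment Log\n\n"
    with log.open("a", encoding="utf-8") as f:
        f.write(header + entry)

log_experiment(
    "EXPERIMENTS.md",
    title="Baseline Experiments",
    goal="Establish baseline performance with simple models",
    results=["Logistic Regression: AUC-ROC 0.823, F1 0.764"],
    finding="LR surprisingly competitive",
    next_step="Try ensemble methods",
)
```

Because every entry passes through the same function, the log keeps a uniform shape that is easy to scan months later.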

Documenting Model Cards

For models deployed to production, a model card is a standardized summary document (introduced by Google) that provides structured information about a model’s capabilities, limitations, and appropriate use cases:

Plaintext
# Model Card: Customer Churn Predictor v2.3

## Model Details
- **Type**: XGBoost binary classifier
- **Version**: 2.3.0
- **Trained**: 2024-09-12
- **Author**: Jane Smith, Data Science Team

## Intended Use
- **Primary use**: Identify customers at risk of churning within 30 days
- **Intended users**: Customer success team, for proactive outreach campaigns
- **Out-of-scope**: Not suitable for B2B accounts (>$5K transaction values)

## Training Data
- **Source**: Q1-Q2 2024 transactions (Jan 1 – Jun 30, 2024)
- **Size**: 68,000 customers, 380,000 transactions
- **Label**: Churned within 30 days of observation date

## Evaluation Data
- **Source**: Q3 2024 holdout (Jul 1 – Sep 30, 2024)
- **Size**: 17,000 customers

## Performance Metrics
| Metric | Value |
|--------|-------|
| AUC-ROC | 0.914 |
| F1 Score (threshold=0.42) | 0.847 |
| Precision | 0.856 |
| Recall | 0.839 |

## Limitations and Biases
- Performance may degrade for new customers (<90 days, <3 transactions)
- Seasonal patterns (holiday shopping) not fully captured — monitor Nov/Dec
- B2B accounts systematically over-predicted as churners — filter by account type

## Monitoring
- **Drift alert threshold**: Monthly AUC-ROC < 0.88
- **Data freshness**: Must retrain if >60 days since training data cutoff
- **Responsible team**: Data Science (slack: #ds-churn-model)
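Model cards are easiest to keep current when they are generated from code at training time rather than written by hand. A minimal sketch using a dataclass that renders itself to Markdown (field names and the example values are illustrative; adapt to your team's template):

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Minimal model card that renders itself to Markdown."""
    name: str
    version: str
    model_type: str
    intended_use: str
    metrics: dict = field(default_factory=dict)
    limitations: list = field(default_factory=list)

    def to_markdown(self) -> str:
        lines = [
            f"# Model Card: {self.name} v{self.version}",
            f"- **Type**: {self.model_type}",
            f"- **Primary use**: {self.intended_use}",
            "",
            "| Metric | Value |",
            "|--------|-------|",
        ]
        lines += [f"| {k} | {v} |" for k, v in self.metrics.items()]
        lines += ["", "## Limitations"]
        lines += [f"- {item}" for item in self.limitations]
        return "\n".join(lines)

card = ModelCard(
    name="Customer Churn Predictor",
    version="2.3.0",
    model_type="XGBoost binary classifier",
    intended_use="Identify customers at risk of churning within 30 days",
    metrics={"AUC-ROC": 0.914, "F1 (threshold=0.42)": 0.847},
    limitations=["Performance may degrade for customers with <3 transactions"],
)
print(card.to_markdown())
```

Emitting the card as an artifact of the training run means the metrics in the card can never silently lag behind the deployed model.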

Documentation Anti-Patterns to Avoid

Anti-Pattern 1: Stale Comments

Comments that were once accurate but no longer reflect the code are actively harmful — they mislead readers into thinking code does something it doesn’t:

Python
# BAD: Comment says one thing, code does another
# Filter to customers with more than 5 transactions
df = df[df['transaction_count'] > 3]  # ← Code uses 3, comment says 5

Stale comments are often worse than no comments. Keep comments in sync with code, or use descriptive variable names that make comments unnecessary.
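One way to eliminate stale-comment risk is to move the magic number into a descriptively named constant, so the code documents itself. A hedged sketch of the same filter (names are illustrative; the original example uses a pandas DataFrame, simplified here to plain dicts):

```python
# The name carries the meaning, so there is no comment to drift out of date
MIN_TRANSACTIONS_FOR_INCLUSION = 3

def keep_active_customers(rows):
    """Keep customers with more transactions than the named threshold."""
    return [
        r for r in rows
        if r["transaction_count"] > MIN_TRANSACTIONS_FOR_INCLUSION
    ]

rows = [
    {"id": 1, "transaction_count": 2},
    {"id": 2, "transaction_count": 7},
]
print(keep_active_customers(rows))  # [{'id': 2, 'transaction_count': 7}]
```

If the threshold changes, only the constant changes — there is no separate English sentence to forget to update.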

Anti-Pattern 2: Over-Documentation of Obvious Code

Python
# BAD: Wastes space, adds noise
# Import pandas
import pandas as pd

# Set x to 10
x = 10

# Iterate over the list
for item in items:
    # Process each item
    process(item)

Over-commenting makes code harder to read, not easier — the signal-to-noise ratio drops.

Anti-Pattern 3: Lying Docstrings

A docstring that describes what a function was intended to do rather than what it actually does is dangerous. Keep docstrings synchronized with behavior as code evolves.

Anti-Pattern 4: Missing the Why

The most common documentation gap: explaining what code does but never why:

Python
# BAD: Explains what but not why
# Calculate log transform
amount_log = np.log1p(df['amount'])

# GOOD: Explains both what and why
# Log-transform amount to reduce right skew — histogram in EDA showed 
# distribution spans 3 orders of magnitude; log transform makes it 
# approximately normal, improving tree model splits
amount_log = np.log1p(df['amount'])

Building a Documentation Culture

Individual documentation practices matter, but the most effective documentation happens when it’s a team norm, not an individual heroic effort.

Make documentation part of the definition of done: A feature or analysis is not finished until the key functions have docstrings, the README is updated, and any new data is added to the data dictionary. This is a team agreement, not an individual preference.

Review documentation in code review: When reviewing a pull request, explicitly check: Are new functions documented? Is the README updated? Are any non-obvious decisions commented?

Write documentation before or during coding: The act of writing a docstring before writing code (similar to test-driven development) forces clarity about what a function should do. If you can’t explain it clearly in a docstring, you may not fully understand the problem yet.

Use templates to lower the friction: Provide docstring templates, README templates, and notebook header templates. Lower the effort required to document well, and more documentation gets written.
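With a template, the writer only fills in blanks. A hedged sketch of a function documented against a NumPy-style docstring template (the function and its defaults are illustrative):

```python
def clip_amount(value, lo=0.0, hi=10_000.0):
    """Clip a transaction amount to a plausible range.

    Parameters
    ----------
    value : float
        Raw transaction amount.
    lo : float, default=0.0
        Minimum plausible amount; smaller values are treated as data errors.
    hi : float, default=10_000.0
        Maximum plausible amount.

    Returns
    -------
    float
        The amount clipped into [lo, hi].

    Raises
    ------
    ValueError
        If lo > hi.

    Examples
    --------
    >>> clip_amount(-5.0)
    0.0
    >>> clip_amount(250.0)
    250.0
    """
    if lo > hi:
        raise ValueError("lo must be <= hi")
    return max(lo, min(hi, value))
```

Checking a skeleton like this into the repo (as an editor snippet or a CONTRIBUTING example) means every new function starts from the same structure rather than a blank line.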

Summary

Documentation is not separate from data science work — it is an integral dimension of quality data science work. The question is never “should I document?” but “what level of documentation is appropriate for this stage and audience?”

The spectrum runs from quick inline comments explaining non-obvious decisions, through carefully structured function docstrings that make your API self-explaining, through notebook Markdown cells that transform executable code into readable analysis, through comprehensive READMEs and data dictionaries that make projects understandable to anyone, to model cards that communicate capabilities and limitations to production users.

The investment in documentation pays compound returns. A function documented today costs 10 minutes. Understanding an undocumented function six months from now — or explaining it to a new team member, or reproducing its behavior in a different context — costs orders of magnitude more. Documentation is not overhead. It is the mechanism through which the value of your analytical work persists and compounds over time.

Key Takeaways

  • Documentation exists at multiple levels — inline comments, docstrings, module headers, notebook Markdown, READMEs, data dictionaries, and model cards — each serving different audiences and purposes
  • Inline comments should explain why decisions were made, not what the code does — well-written code shows the what; comments reveal the non-obvious reasoning
  • NumPy-style docstrings are the data science standard, covering parameters (with types and defaults), return values, exceptions raised, and runnable examples
  • Data dictionaries — describing every column’s name, type, meaning, valid values, and known issues — are among the most valuable and most neglected documentation artifacts in data science
  • Notebook documentation alternates Markdown explanation cells with code cells to create executable analytical narratives, not just collections of code
  • Experiment logs record what was tried, what didn’t work, and why the final approach was chosen — answering the “why is it built this way?” questions that every future team member will ask
  • Model cards provide standardized production model documentation covering intended use, training data, performance metrics, and known limitations
  • Documentation culture matters as much as individual practice — making documentation part of the definition of done and reviewing it in code review makes good documentation a team norm