Documentation in data science code encompasses everything that explains what your code does, why decisions were made, and how to use or reproduce your work — including inline comments, function docstrings, README files, data dictionaries, experiment logs, and architectural overviews. Good documentation is not an afterthought but an integral part of professional data science practice that dramatically improves code maintainability, reproducibility, collaboration, and the long-term value of your work.
Introduction
There is a joke in software development: “Code never lies, but comments sometimes do.” In data science, this joke has a darker dimension. Data science code without documentation doesn’t just confuse people — it destroys value. A machine learning model trained over weeks at significant computational cost, producing state-of-the-art results, is nearly worthless if no one can understand what data it was trained on, what preprocessing was applied, what the output means, or how to use it in production.
Documentation is the bridge between what your code does and what any person — including future you — understands about what your code does. That gap can be enormous. The algorithmic genius who wrote a complex feature engineering pipeline on a Tuesday afternoon knows exactly what every transformation means. Three months later, after dozens of other projects, that same person stares at their own code with genuine confusion.
Despite its importance, documentation is the most consistently neglected practice in data science. The reasons are understandable: documentation feels like slowing down, results feel more important than process, and the exploratory nature of data science work makes it feel premature to document something before you know if it will be kept. These are real tensions, not just excuses.
This guide navigates those tensions honestly. It provides concrete, practical documentation practices scaled appropriately to the stage and stakes of your work — from lightweight inline comments you should write while coding, to structured docstrings for production functions, to comprehensive project documentation that makes your work truly reproducible and transferable.
The Documentation Spectrum: From Code to Project Level
Documentation in data science exists at multiple levels of granularity, each serving different purposes and audiences.
Granularity Level │ Type │ Audience
─────────────────────┼───────────────────────┼────────────────────────────
Most Granular │ Inline comments │ Yourself, code reviewers
│ Function docstrings │ API users, your future self
│ Module docstrings │ Developers of the module
│ Notebook markdown │ Analysts, technical stakeholders
│ README files │ Everyone starting with the project
│ Data dictionaries │ Anyone working with the data
│ Architecture docs │ Technical leads, new team members
Most Broad           │ Project documentation │ All stakeholders, public users

Good documentation practice means attending to all these levels — not just writing docstrings and calling it done. Let’s explore each level in depth.
Level 1: Inline Comments
Inline comments are the most immediate form of documentation — notes written directly alongside code that explain specific decisions, caveats, or non-obvious logic.
When to Write Inline Comments
The most common mistake with inline comments is writing too many of them — explaining what the code does rather than why it does it. Well-written code is largely self-explanatory in terms of what it does; comments add value by explaining why.
Don’t explain the obvious:
# BAD: This comment adds nothing
# Create a copy of the DataFrame
df_clean = df.copy()
# BAD: Just restating what the code clearly does
# Sort by date in descending order
df = df.sort_values('date', ascending=False)

Do explain the non-obvious:
# GOOD: Explains WHY, not just what
# Create a copy to avoid mutating the original DataFrame — downstream
# functions assume the input is unchanged
df_clean = df.copy()
# GOOD: Explains a business-domain reason that isn't obvious from code
# Sort descending because our feature engineering uses the most recent
# transaction first (recency bias in RFM model)
df = df.sort_values('date', ascending=False)
# GOOD: Documents a known issue or constraint
# Using clip instead of filtering to avoid losing rows with occasional
# negative amounts — finance team confirmed these are valid refund entries
df['amount'] = df['amount'].clip(lower=0)
# GOOD: Explains a performance trade-off
# Converting to category dtype before merge reduces memory from 1.2GB to
# 340MB — critical for this dataset size; tested on 2024-08-12
df['category'] = df['category'].astype('category')

Comments for Data Science–Specific Situations
Data science code has unique patterns that benefit from specific commenting styles.
Documenting magic numbers:
# Threshold determined by precision-recall analysis on validation set
# See notebooks/exploratory/04_threshold_selection.ipynb, cell 12
CHURN_PROBABILITY_THRESHOLD = 0.42
# Minimum transactions needed for reliable RFM scores — based on domain
# expert input from customer success team (2024-09 review)
MIN_TRANSACTION_COUNT = 3
# 99th percentile of amount distribution in training data
# Values above this are likely data entry errors
AMOUNT_OUTLIER_CEILING = 9850.0

Documenting data assumptions:
# Assumes input df has already passed schema validation
# (customer_id: str, transaction_date: datetime, amount: float)
def compute_rfm_scores(df: pd.DataFrame) -> pd.DataFrame:
    ...
# NOTE: Missing values in 'channel' are treated as 'unknown' by design —
# verified with data engineering team that NaN means no channel tracking,
# not a data quality issue
df['channel'] = df['channel'].fillna('unknown')

Documenting workarounds:
# WORKAROUND: scikit-learn 1.3 changed the default max_features behavior
# for RandomForest — explicitly setting to 'sqrt' for backward compatibility
# with older model configs. Remove when all configs are updated to 1.3+ syntax.
model = RandomForestClassifier(n_estimators=300, max_features='sqrt')
# HACK: Upstream data sometimes sends dates in MM/DD/YYYY instead of
# ISO format. This dual-parse handles both until data source is fixed.
# Tracked in JIRA ticket DS-247.
try:
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
except ValueError:
    df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

TODO and FIXME Comments
Use structured tags that your IDE and team can search for systematically:
# TODO: Add input validation for negative amounts after DS-301 is resolved
# TODO: Consider caching this computation — called 3x in the pipeline
# FIXME: This function silently drops rows with NaN in product_id
# Expected behavior: raise ValueError or impute. See DS-298.
# HACK: Temporary fix for production outage 2024-11-14 — replace with
# proper solution after incident review
# NOTE: This is intentionally slow — optimizing it would break the
# incremental update logic in downstream_processor.py

Level 2: Function and Method Docstrings
A docstring is a string literal that appears as the first statement of a module, function, class, or method. Python’s help() function and IDE tooltips display this string, making it the primary API documentation for your code.
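Concretely, the docstring is just the function’s `__doc__` attribute, which `help()` reads. A minimal sketch (the `recency_days` helper is hypothetical, not from the project described later):

```python
from datetime import date

def recency_days(last_purchase: str, reference: str) -> int:
    """Return whole days between last_purchase and reference (ISO dates)."""
    return (date.fromisoformat(reference) - date.fromisoformat(last_purchase)).days

# help(recency_days) renders exactly this string in the REPL and in IDE tooltips
print(recency_days.__doc__)
```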
Why Docstrings Matter More in Data Science
In general software development, function names and type signatures often tell you most of what you need to know about a function. In data science, they frequently don’t:
# What does this return? A DataFrame? A numpy array? What shape?
# What do the parameters mean? What does 'mode' accept?
def compute_features(df, customer_id, mode):
    ...

Data science functions deal with DataFrames that have specific column requirements, arrays of specific shapes, models that must have been fitted before use, and domain-specific concepts. Docstrings are essential for making these implicit requirements explicit.
Docstring Formats
There are three widely used docstring formats in the Python ecosystem:
NumPy/SciPy Format (most popular in data science):
def calculate_customer_metrics(
    df: pd.DataFrame,
    customer_col: str = 'customer_id',
    amount_col: str = 'amount',
    date_col: str = 'transaction_date',
    reference_date: Optional[str] = None
) -> pd.DataFrame:
    """
    Calculate RFM (Recency, Frequency, Monetary) metrics for each customer.

    Computes the three classic customer segmentation metrics from a
    transaction DataFrame. Reference date defaults to today if not provided,
    making the function suitable for both historical analysis and live scoring.

    Parameters
    ----------
    df : pd.DataFrame
        Transaction data. Must contain columns for customer ID, amount,
        and transaction date. Rows with null values in any of these columns
        are silently dropped before computation.
    customer_col : str, optional
        Name of the customer identifier column, by default 'customer_id'.
    amount_col : str, optional
        Name of the transaction amount column (must be numeric),
        by default 'amount'.
    date_col : str, optional
        Name of the transaction date column (must be datetime or
        parseable string), by default 'transaction_date'.
    reference_date : str or None, optional
        ISO-format date string (e.g., '2024-09-30') used as "today" for
        recency calculation. If None, uses the current date. Useful for
        reproducible historical analysis, by default None.

    Returns
    -------
    pd.DataFrame
        One row per customer with columns:

        - customer_id: Original customer identifier
        - recency_days: Days since most recent transaction
        - frequency: Total number of transactions
        - monetary_total: Sum of all transaction amounts
        - monetary_avg: Mean transaction amount

    Raises
    ------
    ValueError
        If df is empty after dropping null values in required columns.
    KeyError
        If any of customer_col, amount_col, or date_col are not in df.columns.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     'customer_id': ['A', 'A', 'B'],
    ...     'amount': [100.0, 250.0, 75.0],
    ...     'transaction_date': ['2024-08-01', '2024-09-15', '2024-09-01']
    ... })
    >>> result = calculate_customer_metrics(df, reference_date='2024-09-30')
    >>> print(result[['customer_id', 'recency_days', 'frequency']].to_string())
      customer_id  recency_days  frequency
    0           A            15          2
    1           B            29          1

    Notes
    -----
    RFM scoring thresholds are not applied here — this function returns
    raw metric values. Apply score_rfm_metrics() to convert to 1-5 scores.

    Reference: Hughes, A.M. (1994). Strategic Database Marketing.
    """

Google Format (concise, readable in source):
def split_features_target(
    df: pd.DataFrame,
    target_col: str,
    drop_cols: Optional[List[str]] = None
) -> Tuple[pd.DataFrame, pd.Series]:
    """Split a DataFrame into feature matrix X and target vector y.

    Args:
        df: Input DataFrame containing both features and target.
        target_col: Name of the column to use as the prediction target.
        drop_cols: Additional columns to exclude from X (e.g., ID columns
            that shouldn't be features). Defaults to None.

    Returns:
        A tuple (X, y) where X is the feature DataFrame and y is the
        target Series. Both share the same index as the input DataFrame.

    Raises:
        KeyError: If target_col or any column in drop_cols is not present
            in df.

    Example:
        >>> X, y = split_features_target(df, 'churned', drop_cols=['customer_id'])
        >>> print(f"Features: {X.shape}, Target: {y.shape}")
        Features: (10000, 47), Target: (10000,)
    """

reStructuredText (reST) Format (used by Sphinx):
def normalize_features(
    X: np.ndarray,
    method: str = 'minmax'
) -> np.ndarray:
    """
    Normalize a feature matrix using the specified method.

    :param X: Feature matrix of shape (n_samples, n_features).
    :type X: numpy.ndarray
    :param method: Normalization method. One of 'minmax' (scales to [0,1]),
        'zscore' (standardizes to mean=0, std=1), or 'robust' (scales using
        median and IQR, resistant to outliers). Defaults to 'minmax'.
    :type method: str
    :raises ValueError: If method is not one of the accepted values.
    :returns: Normalized feature matrix with same shape as X.
    :rtype: numpy.ndarray
    """

Which format to choose? NumPy format is the standard in data science (used by numpy, scipy, pandas, and scikit-learn themselves). Google format is more concise and suitable for shorter functions. Pick one and be consistent across your project — document the choice in a CONTRIBUTING.md file.
Docstring Quality Checklist
For every function docstring, verify it answers these questions:
| Question | Docstring Element |
|---|---|
| What does this function do? | First-line summary (imperative mood: “Calculate…” not “Calculates…”) |
| What are the inputs? | Parameters section with type, description, defaults, valid values |
| What does it return? | Returns section with type and description |
| What can go wrong? | Raises section with exception types and conditions |
| How do I use it? | Examples section with runnable code |
| Are there edge cases? | Notes section for caveats, assumptions, references |
| Has anything changed? | (Optional) Deprecated section if behavior has changed |
Class Docstrings
Classes in data science code — custom transformers, model wrappers, data loaders — need class-level docstrings that describe the class’s purpose and usage pattern:
class ChurnFeatureTransformer(BaseEstimator, TransformerMixin):
    """
    Build customer churn prediction features from transaction history.

    This transformer computes RFM metrics, behavioral ratios, and time-based
    features from a transaction DataFrame and returns a feature matrix
    suitable for machine learning models.

    Designed to work in scikit-learn Pipelines — implements fit() and
    transform() following the sklearn transformer API.

    Parameters
    ----------
    reference_date : str or None, optional
        ISO date for recency calculation. If None, uses current date.
        Set explicitly for reproducible offline analysis.
    include_channel_features : bool, optional
        Whether to include channel-based behavioral features (requires
        a 'channel' column in input data). By default True.
    n_transaction_bins : int, optional
        Number of frequency quantile bins for discretizing transaction
        counts. By default 5.

    Attributes
    ----------
    feature_names_ : list of str
        Names of generated features, set after fit().
    n_features_out_ : int
        Number of output features, set after fit().

    Examples
    --------
    >>> transformer = ChurnFeatureTransformer(reference_date='2024-09-30')
    >>> X_features = transformer.fit_transform(transactions_df)
    >>> print(transformer.feature_names_[:5])
    ['recency_days', 'frequency', 'monetary_total', 'monetary_avg',
     'days_since_first_transaction']

    Notes
    -----
    The transformer requires the input DataFrame to contain at minimum:
    'customer_id', 'transaction_date', and 'amount' columns.
    See build_features.py for column name configuration.
    """

    def __init__(
        self,
        reference_date: Optional[str] = None,
        include_channel_features: bool = True,
        n_transaction_bins: int = 5
    ):
        self.reference_date = reference_date
        self.include_channel_features = include_channel_features
        self.n_transaction_bins = n_transaction_bins

Level 3: Module-Level Documentation
Every Python file in your src/ directory should begin with a module-level docstring explaining the file’s purpose, contents, and relationships to other modules:
"""
src/features/build_features.py
===============================
Feature engineering pipeline for the customer churn prediction model.
This module transforms cleaned transaction data (output of
src/data/preprocess.py) into a feature matrix ready for model training.
Main entry point:
build_feature_matrix(transactions_df) → feature_df
Key transformations:
- RFM metrics (recency, frequency, monetary value)
- Channel behavior ratios (mobile/web/store split)
- Time-based features (days since first purchase, purchase velocity)
- Product diversity features (unique categories, avg category concentration)
Feature output schema:
See configs/feature_schema.yaml for complete feature definitions
and expected value ranges.
Dependencies:
pandas >= 2.0
numpy >= 1.24
scikit-learn >= 1.3 (for ChurnFeatureTransformer base classes)
Author: Jane Smith <jane.smith@company.com>
Last updated: 2024-09-15
"""
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from typing import Optional, List, Tuple

This module docstring acts as a map: someone opening the file for the first time immediately understands what it contains, how it fits into the larger project, and what the main function to call is.
Level 4: Notebook Documentation with Markdown
Jupyter Notebooks are often the primary way data science work is communicated to technical and semi-technical audiences. Treating them as executable documents — not just runnable code — requires strategic use of Markdown cells.
The Anatomy of a Well-Documented Notebook
A professional data science notebook follows this structure:
Cell 1 (Markdown): Title and Purpose
- What is this notebook about?
- What question does it answer?
- Who is the intended audience?
- When was it created / last updated?
Cell 2 (Markdown): Setup and Imports Context
- Brief note on environment and data sources
Cell 3 (Code): Imports
Cell 4 (Code): Configuration and constants
Cell 5 (Markdown): Section 1 — Data Loading
- What data are we loading?
- Where does it come from?
- What should it look like?
Cell 6 (Code): Data loading code
Cell 7 (Code): Shape/head/info inspection
Cell 8 (Markdown): Section 2 — Exploratory Analysis
- What are we looking for?
- What hypotheses are we testing?
... and so on, alternating Markdown explanation with Code execution

Example: Well-Documented Notebook Opening
# Customer Churn — Feature Engineering Exploration
**Purpose**: Explore and validate candidate features for the churn
prediction model. Identify which transaction-based behavioral signals
correlate most strongly with 30-day churn.
**Inputs**:
- `data/processed/transactions_clean.parquet` — cleaned transaction data
- `data/processed/churn_labels.parquet` — 30-day churn labels
**Outputs**:
- Feature correlation analysis (saved to `reports/figures/`)
- Feature shortlist for `src/features/build_features.py`
**Author**: Jane Smith
**Created**: 2024-09-10
**Last updated**: 2024-09-15
**Status**: Complete — findings incorporated into build_features.py

## 1. Data Overview
Loading the Q3 2024 transaction data (after preprocessing in
`notebooks/exploratory/01_data_cleaning.ipynb`).
Expect ~500,000 rows, one row per transaction, covering 85,000 unique
customers. The churn labels are binary (1 = churned within 30 days of
observation date, 0 = retained).

# Code cell: data loading
df = pd.read_parquet("data/processed/transactions_clean.parquet")
labels = pd.read_parquet("data/processed/churn_labels.parquet")
print(f"Transactions: {df.shape}")
print(f"Customers with labels: {labels.shape}")
print(f"Churn rate: {labels['churned'].mean():.1%}")
## 2. RFM Feature Analysis
RFM (Recency, Frequency, Monetary) metrics are the classic foundation of
customer segmentation. We expect all three to be predictive of churn:
- **Recency**: Customers who haven't purchased recently are more likely
to have churned
- **Frequency**: Habitual purchasers are more loyal
- **Monetary**: High-value customers may have different churn patterns
We'll first compute raw RFM metrics, then examine their distributions
and correlations with the churn label.

This alternation between Markdown explanation and code execution creates a document that reads like a coherent analytical narrative while remaining fully executable.
Key Markdown Documentation Practices in Notebooks
Always explain what a visualization shows, not just display it:
### Observation
The distribution is right-skewed with a long tail — most customers have
recency under 30 days, but a significant segment hasn't purchased in 90+
days. These high-recency customers correlate strongly with churn
(see churn rate by recency bucket below).
This suggests recency should be log-transformed or bucketed for the model.

Document surprising or counter-intuitive findings:
**Unexpected finding**: High monetary value customers show *higher* churn
rates in the 60-120 day recency bucket. This may indicate bulk purchasers
who make large one-time transactions rather than loyal repeat customers.
Flagged for product team discussion — may warrant a separate customer
segment model.

Record decisions made during analysis:
**Decision**: Using the median rather than mean for monetary features
due to extreme outliers (max transaction = $48,000 — likely B2B account).
Mean would be heavily skewed by these outliers. See outlier analysis in
cell 14.

Level 5: README Files
The README is the most important documentation file in any project — it’s the front door. Writing a great README is a distinct skill.
The Complete Data Science README Template
# [Project Name]
[1-2 sentence description of what this project does and the business
problem it solves]
## Problem Statement
[3-5 sentences explaining the business context: What decision does this
model support? What was the previous approach? What improvement does this
deliver?]
## Approach
[Brief description of your methodology: data sources, modeling approach,
evaluation strategy]
## Results
| Model | AUC-ROC | F1 Score | Precision@30% |
|-------|---------|----------|---------------|
| Baseline (LR) | 0.823 | 0.764 | 0.612 |
| Random Forest | 0.891 | 0.819 | 0.714 |
| **XGBoost (final)** | **0.914** | **0.847** | **0.761** |
[1-2 sentences interpreting the results and their business implications]
## Project Structure
├── data/
│ ├── raw/ ← Source data (DVC-tracked, not in Git)
│ └── processed/ ← Feature-engineered data
├── notebooks/
│ ├── exploratory/ ← EDA and experiments
│ └── reports/ ← Polished analysis reports
├── src/ ← Python modules
├── tests/ ← Unit tests
├── configs/ ← Configuration YAML files
└── models/ ← Trained model files
## Setup
**Prerequisites**: Python 3.11, Git, DVC
git clone git@github.com:org/project-name.git
cd project-name
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements-dev.txt
dvc pull # Download data (requires AWS credentials)
## Running the Pipeline
make pipeline # Full data → features → train → evaluate
make train # Train model only (uses existing processed data)
make evaluate # Evaluate the latest trained model
make test # Run unit tests
## Key Files
| File | Description |
|------|-------------|
| `configs/config.yaml` | All file paths, column names, training parameters |
| `src/features/build_features.py` | Feature engineering logic |
| `src/models/train_model.py` | Model training script |
| `notebooks/reports/06_final_evaluation.ipynb` | Final model analysis |
## Data
- **Source**: Internal CRM database export (Q3 2024)
- **Size**: ~500K transactions, 85K customers
- **Access**: DVC remote at `s3://company-data/churn-model/`
- **Schema**: See `data/raw/README.md`
- **Refresh cadence**: Monthly export
## Model Details
- **Algorithm**: XGBoost classifier
- **Features**: 47 RFM + behavioral features (see `configs/feature_schema.yaml`)
- **Target**: Binary churn within 30 days
- **Training data**: Q1-Q2 2024 transactions
- **Validation**: Q3 2024 holdout set
- **Retraining trigger**: When monthly AUC-ROC drops below 0.88
## Contributing
1. Create a feature branch: `git switch -c feature/your-feature`
2. Make your changes, add tests
3. Run `make lint test` to verify quality checks pass
4. Open a pull request against `main`
## Authors
- Jane Smith (jane@company.com) — modeling, feature engineering
- Bob Jones (bob@company.com) — data pipeline, infrastructure
## License
Internal use only — contact the Data Science team for access.

Level 6: Data Dictionaries
A data dictionary is documentation that describes every variable in your dataset — its name, type, description, valid values, units, and source. It is arguably the most valuable documentation artifact in a data science project and the most frequently skipped.
What a Data Dictionary Contains
# Data Dictionary: Customer Transactions
**Dataset**: transactions_clean.parquet
**Last updated**: 2024-09-15
**Row count**: ~500,000
**Grain**: One row per transaction
## Column Definitions
| Column | Type | Description | Example | Notes |
|--------|------|-------------|---------|-------|
| `transaction_id` | str | Unique identifier for each transaction | "TXN_20240901_001234" | UUID format, never null |
| `customer_id` | str | Unique customer identifier | "CUST_98765" | Joins to customers table |
| `transaction_date` | datetime | Date and time of purchase (UTC) | 2024-09-01 14:23:11 | Null for ~0.1% of records (data entry gap) |
| `amount` | float | Transaction amount in USD | 149.99 | Always positive; negative = data error |
| `product_id` | str | Product purchased | "PROD_electronics_001" | Joins to products table |
| `product_category` | str | Top-level product category | "electronics" | One of 12 categories (see Categories table) |
| `channel` | str | Purchase channel | "mobile_app" | One of: web, mobile_app, store, phone; null = unknown |
| `is_returned` | bool | Whether the item was subsequently returned | False | Set at time of return processing |
| `promotion_applied` | bool | Whether a promotional discount was used | True | |
| `discount_amount` | float | Discount applied in USD | 15.00 | 0.0 if no promotion |
## Categories Reference
| product_category | Description | Example Products |
|-----------------|-------------|-----------------|
| electronics | Consumer electronics | Phones, laptops, tablets |
| apparel | Clothing and accessories | Shirts, shoes, bags |
| home | Home goods and furniture | Furniture, decor, kitchen |
| ... | | |
## Known Data Quality Issues
1. **Missing transaction_date (~0.1%)**: Records from legacy POS system
(2019-2020) occasionally have null timestamps. These are excluded from
recency features but counted in frequency metrics.
2. **Channel null (~2%)**: Pre-2021 web transactions don't have channel
attribution. Treated as 'unknown' in channel features.
3. **Amount outliers**: Transactions above $5,000 are predominantly B2B
accounts. Consider filtering or separate treatment in consumer-focused models.
## Source System
CRM database: `crm_production.transactions` table
Extraction query: `scripts/data_extraction.sql`
Refresh: Monthly on the 1st

Store the data dictionary in a docs/ directory or as a README within the data/processed/ directory. It is one of the first documents anyone should read when joining a project.
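A data dictionary pays double if part of it is machine-checkable. The sketch below turns a few of the column definitions above into an automated schema check; `EXPECTED_SCHEMA` and `validate_schema` are illustrative names, not artifacts of the project described here:

```python
import pandas as pd

# Hypothetical subset of the dictionary above, expressed as expected pandas dtypes
EXPECTED_SCHEMA = {
    "transaction_id": "object",
    "amount": "float64",
    "is_returned": "bool",
}

def validate_schema(df: pd.DataFrame, schema: dict) -> list:
    """Return human-readable schema violations (empty list = dictionary and data agree)."""
    problems = []
    for col, dtype in schema.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems
```

Running this check at the top of a pipeline catches silent schema drift the moment the monthly extract changes.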
Level 7: API Documentation with Sphinx and MkDocs
For data science projects that expose Python modules as libraries, or for teams that want browsable documentation websites, automatic documentation generation tools are invaluable.
Sphinx: The Python Documentation Standard
Sphinx is the standard documentation tool for Python projects — used by Python itself, NumPy, pandas, scikit-learn, and thousands of libraries.
pip install sphinx sphinx-rtd-theme
cd docs/
sphinx-quickstart

Sphinx reads your docstrings and generates HTML documentation automatically:
# Generate HTML docs from your docstrings
make html
# View the generated documentation
open _build/html/index.html

The autodoc extension extracts docstrings from your modules:
.. automodule:: src.features.build_features
   :members:
   :undoc-members:
   :show-inheritance:

For data science projects, the NumPy docstring format is specifically designed to render beautifully with Sphinx + the numpydoc extension.
MkDocs: Modern, Simpler Documentation
MkDocs with the Material theme is increasingly popular for data science projects — it’s simpler than Sphinx and produces beautiful, modern documentation websites.
pip install mkdocs mkdocs-material mkdocstrings[python]
mkdocs new .

Configure mkdocs.yml:
site_name: Customer Churn Model
theme:
  name: material
  palette:
    primary: indigo
plugins:
  - search
  - mkdocstrings:
      handlers:
        python:
          options:
            docstring_style: numpy
nav:
  - Home: index.md
  - Getting Started: getting_started.md
  - Data Dictionary: data_dictionary.md
  - API Reference:
      - Feature Engineering: api/features.md
      - Model Training: api/models.md
  - Results: results.md

Serve locally: mkdocs serve → opens at http://127.0.0.1:8000
Deploy to GitHub Pages: mkdocs gh-deploy
Documenting Experiments and Model Decisions
One of the most valuable but most neglected documentation practices in data science is recording the why behind modeling decisions — what was tried, what didn’t work, and why the final approach was chosen.
The Experiment Log
Maintain a running EXPERIMENTS.md or use a dedicated experiment tracking tool (MLflow, Weights & Biases, Neptune) to record:
# Experiment Log: Customer Churn Model
## 2024-08-15: Baseline Experiments
**Goal**: Establish baseline performance with simple models
**Data**: Q1-Q2 2024 transactions (70/30 train/val split)
**Results**:
- Logistic Regression: AUC-ROC 0.823, F1 0.764
- Decision Tree: AUC-ROC 0.801, F1 0.741
**Finding**: LR surprisingly competitive — good linear signal in raw features
**Next step**: Try ensemble methods
---
## 2024-08-22: Feature Engineering Round 1
**Goal**: Test whether RFM features improve over raw transaction counts
**Changes**: Added recency_days, frequency_90d, monetary_avg features
**Results**:
- Random Forest: AUC-ROC 0.891 (↑ from 0.823 baseline)
**Finding**: RFM features provide large improvement
**Next step**: Add channel and time-based features
---
## 2024-09-05: XGBoost Tuning
**Goal**: Optimize XGBoost with full feature set
**Method**: 5-fold CV grid search over learning_rate, max_depth, n_estimators
**Best params**: lr=0.05, max_depth=6, n_estimators=500
**Results**: AUC-ROC 0.914 on Q3 holdout
**Why XGBoost over Neural Network**: Tested a simple MLP (3 layers, 128/64/32
units) — AUC-ROC 0.907, lower than XGBoost, much slower to train, much
harder to explain to stakeholders. XGBoost selected as final model.
**Threshold selection**: Tried threshold=0.35 based on the precision-recall
tradeoff — see notebook 05_threshold_analysis.ipynb. Settled on 0.42 based
on a business requirement (flag the top 20% of customers).

This log answers the question every new team member and every future model auditor will ask: “Why is the model built this way?”
Documenting Model Cards
For models deployed to production, a model card is a standardized summary document (introduced by Google) that provides structured information about a model’s capabilities, limitations, and appropriate use cases:
# Model Card: Customer Churn Predictor v2.3
## Model Details
- **Type**: XGBoost binary classifier
- **Version**: 2.3.0
- **Trained**: 2024-09-12
- **Author**: Jane Smith, Data Science Team
## Intended Use
- **Primary use**: Identify customers at risk of churning within 30 days
- **Intended users**: Customer success team, for proactive outreach campaigns
- **Out-of-scope**: Not suitable for B2B accounts (>$5K transaction values)
## Training Data
- **Source**: Q1-Q2 2024 transactions (Jan 1 – Jun 30, 2024)
- **Size**: 68,000 customers, 380,000 transactions
- **Label**: Churned within 30 days of observation date
## Evaluation Data
- **Source**: Q3 2024 holdout (Jul 1 – Sep 30, 2024)
- **Size**: 17,000 customers
## Performance Metrics
| Metric | Value |
|--------|-------|
| AUC-ROC | 0.914 |
| F1 Score (threshold=0.42) | 0.847 |
| Precision | 0.856 |
| Recall | 0.839 |
## Limitations and Biases
- Performance may degrade for new customers (<90 days, <3 transactions)
- Seasonal patterns (holiday shopping) not fully captured — monitor Nov/Dec
- B2B accounts systematically over-predicted as churners — filter by account type
## Monitoring
- **Drift alert threshold**: Monthly AUC-ROC < 0.88
- **Data freshness**: Must retrain if >60 days since training data cutoff
- **Responsible team**: Data Science (slack: #ds-churn-model)

Documentation Anti-Patterns to Avoid
Anti-Pattern 1: Stale Comments
Comments that were once accurate but no longer reflect the code are actively harmful — they mislead readers into thinking code does something it doesn’t:
# BAD: Comment says one thing, code does another
# Filter to customers with more than 5 transactions
df = df[df['transaction_count'] > 3]  # ← Code uses 3, comment says 5

Stale comments are often worse than no comments. Keep comments in sync with code, or use descriptive variable names that make comments unnecessary.
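One way to make the comment unnecessary is to name the threshold. A sketch (the constant and function names are illustrative):

```python
# The name carries the intent, so there is no prose comment to fall out of date
MIN_TRANSACTIONS_FOR_INCLUSION = 3

def is_included(transaction_count: int) -> bool:
    """Return True when a customer has enough transactions to be scored."""
    return transaction_count > MIN_TRANSACTIONS_FOR_INCLUSION
```

If the threshold changes, it changes in exactly one place, and no comment can contradict it.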
Anti-Pattern 2: Over-Documentation of Obvious Code
# BAD: Wastes space, adds noise
# Import pandas
import pandas as pd
# Set x to 10
x = 10
# Iterate over the list
for item in items:
    # Process each item
    process(item)

Over-commenting makes code harder to read, not easier — the signal-to-noise ratio drops.
Anti-Pattern 3: Lying Docstrings
A docstring that describes what a function was intended to do rather than what it actually does is dangerous. Keep docstrings synchronized with behavior as code evolves.
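A minimal illustration of the problem, using a hypothetical `impute_missing` helper: the implementation was changed from mean to median imputation, but the docstring still promises the mean.

```python
import statistics

def impute_missing(values):
    """Replace None entries with the column mean."""
    # The body was later switched to the median (more robust to
    # outliers), but the docstring above was never updated -- a
    # caller trusting it will reason about the wrong statistic.
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]
```

A reader who only skims the docstring will model this function incorrectly, which is exactly why docstrings must change in the same commit as the behavior they describe.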
Anti-Pattern 4: Missing the Why
The most common documentation gap: explaining what code does but never why:
# BAD: Explains what but not why
# Calculate log transform
amount_log = np.log1p(df['amount'])
# GOOD: Explains both what and why
# Log-transform amount to reduce right skew — histogram in EDA showed
# distribution spans 3 orders of magnitude; log transform makes it
# approximately normal, improving tree model splits
amount_log = np.log1p(df['amount'])

Building a Documentation Culture
Individual documentation practices matter, but the most effective documentation happens when it’s a team norm, not an individual heroic effort.
Make documentation part of the definition of done: A feature or analysis is not finished until the key functions have docstrings, the README is updated, and any new data is added to the data dictionary. This is a team agreement, not an individual preference.
Review documentation in code review: When reviewing a pull request, explicitly check: Are new functions documented? Is the README updated? Are any non-obvious decisions commented?
Write documentation before or during coding: The act of writing a docstring before writing code (similar to test-driven development) forces clarity about what a function should do. If you can’t explain it clearly in a docstring, you may not fully understand the problem yet.
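As a sketch of that docstring-first workflow (the function and its inputs are hypothetical): the docstring is written before the body, and the act of specifying inputs, outputs, and edge cases pins down what the function must handle.

```python
def churn_rate(events, window_days=30):
    """Fraction of customers with no event in the trailing window.

    Parameters
    ----------
    events : dict
        Maps customer id to days since that customer's last event.
    window_days : int, default 30
        A customer counts as churned if their last event was more
        than this many days ago.

    Returns
    -------
    float
        Churned customers divided by total customers; 0.0 if
        `events` is empty.
    """
    # Writing the docstring first surfaced the empty-input edge
    # case, which the body now handles explicitly.
    if not events:
        return 0.0
    churned = sum(1 for days in events.values() if days > window_days)
    return churned / len(events)
```

Notice that the "Returns" section forced a decision about the empty-input case before any code existed.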
Use templates to lower the friction: Provide docstring templates, README templates, and notebook header templates. Lower the effort required to document well, and more documentation gets written.
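A team's docstring template can be as simple as a copy-paste skeleton like this one (the section names follow the NumPy style; the placeholder names are obviously illustrative):

```python
def function_name(param):
    """One-line summary of what the function does.

    Parameters
    ----------
    param : type
        What the parameter means and any constraints on it.

    Returns
    -------
    type
        What the return value represents.

    Raises
    ------
    ExceptionType
        When and why it is raised.
    """
    raise NotImplementedError  # replace with the implementation
```

Dropping a skeleton like this into a snippets file or editor template means documenting a new function starts with filling in blanks rather than a blank page.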
Summary
Documentation is not separate from data science work — it is an integral dimension of quality data science work. The question is never “should I document?” but “what level of documentation is appropriate for this stage and audience?”
The spectrum runs from quick inline comments explaining non-obvious decisions, through carefully structured function docstrings that make your API self-explaining, through notebook Markdown cells that transform executable code into readable analysis, through comprehensive READMEs and data dictionaries that make projects understandable to anyone, to model cards that communicate capabilities and limitations to production users.
The investment in documentation pays compound returns. A function documented today costs 10 minutes. Understanding an undocumented function six months from now — or explaining it to a new team member, or reproducing its behavior in a different context — costs orders of magnitude more. Documentation is not overhead. It is the mechanism through which the value of your analytical work persists and compounds over time.
Key Takeaways
- Documentation exists at multiple levels — inline comments, docstrings, module headers, notebook Markdown, READMEs, data dictionaries, and model cards — each serving different audiences and purposes
- Inline comments should explain why decisions were made, not what the code does — well-written code shows the what; comments reveal the non-obvious reasoning
- NumPy-style docstrings are the data science standard, covering parameters (with types and defaults), return values, exceptions raised, and runnable examples
- Data dictionaries — describing every column’s name, type, meaning, valid values, and known issues — are among the most valuable and most neglected documentation artifacts in data science
- Notebook documentation alternates Markdown explanation cells with code cells to create executable analytical narratives, not just collections of code
- Experiment logs record what was tried, what didn’t work, and why the final approach was chosen — answering the “why is it built this way?” questions that every future team member will ask
- Model cards provide standardized production model documentation covering intended use, training data, performance metrics, and known limitations
- Documentation culture matters as much as individual practice — making documentation part of the definition of done and reviewing it in code review makes good documentation a team norm