Organizing Your Data Science Project Files

Learn how to organize data science project files professionally. Explore proven folder structures, naming conventions, cookiecutter templates, and collaboration best practices.

By Techietory on February 28, 2026

Organizing Your Data Science Project Files

A well-organized data science project uses a consistent, logical folder structure that separates raw data from processed data, exploratory notebooks from production code, and configuration from source logic — making the project easy to navigate, reproduce, and hand off to teammates. The most widely adopted standard is the Cookiecutter Data Science template, which provides a battle-tested directory structure used by data science teams at organizations worldwide.

Introduction

Imagine opening a data science project you worked on eight months ago. You need to reproduce a model result for a stakeholder presentation. Where is the data? Which notebook has the final version of the feature engineering? Is model_v3_final_USE_THIS.pkl actually the best model, or was there a model_v4 somewhere? Why are there three files called preprocessing.py in different directories?

This scenario, embarrassingly common in data science, is entirely preventable. The difference between a project you can confidently revisit after months away and a project that’s effectively lost to time is almost entirely a matter of organization.

Unlike software engineering, where organizational patterns like MVC (Model-View-Controller) have been standardized for decades, data science is a younger discipline that evolved partly from ad-hoc research workflows. Many data scientists learned their craft through Kaggle notebooks, academic papers, and self-directed tutorials — all of which prioritize getting answers fast over building maintainable project structures.

The good news is that the data science community has developed strong organizational standards over the past decade. These standards, refined through hard experience at companies ranging from small startups to major tech firms, strike a careful balance: structured enough to be navigable and reproducible, flexible enough to accommodate the messiness of real data science work.

This guide covers everything you need to organize your data science projects like a professional: from the fundamental principles behind good organization, through specific directory structures and naming conventions, to templates and tools that make the right structure the default rather than an extra effort.

Why Project Organization Matters

Before diving into how to organize, it’s worth establishing why organization matters so deeply in data science specifically — because the stakes are higher here than in many other domains.

Reproducibility

Data science is fundamentally about producing results that can be trusted and verified. A model’s performance claim means nothing if no one can reproduce it. Good project organization means that every result is traceable: which data was used, how it was preprocessed, which code version trained the model, and what parameters were set. Without this traceability, you’re producing art, not science.

The Handoff Problem

Projects change hands constantly in professional settings. You might leave a team, bring in a new data scientist to scale a project, or need a colleague to cover for you during an absence. An organized project can be understood by someone new in hours; a disorganized one requires weeks of archaeology and still leaves uncertainty.

The Return Problem

You will return to your own projects after months away. Memory is unreliable. The mental model you had while building the project will have faded. Organized projects with clear structure, naming conventions, and documentation are self-explanatory — they don’t depend on you remembering things you’ve forgotten.

Collaboration at Scale

As data science teams grow, uncoordinated individual styles create friction. When five team members have five different folder structures, every new project requires everyone to relearn navigation. A shared standard means anyone can immediately find what they need in any team member’s project.

Debugging and Iteration

Data science involves a lot of iteration — trying different models, different features, different preprocessing approaches. Without structure, iteration creates an accumulating pile of half-finished experiments with unclear relationships. With structure, each experiment is clearly contextualized: what was tried, when, with what result.

Core Principles of Good Data Science Project Organization

Before specific structures and templates, some universal principles that should guide all organizational decisions.

Principle 1: Separate Data from Code

Data and code have fundamentally different natures and should never be mixed:

Data is often large, binary, and should not be version-controlled in Git
Data changes on its own (upstream sources update, new data arrives)
Data has different access controls than code (often more sensitive)
Code generates processed data from raw data — they have a directional relationship

Keep all data in a dedicated data/ directory, with clear subdirectories for raw, processed, and intermediate states. Keep code completely separate.

Principle 2: Raw Data Is Sacred

Raw data — the original, unmodified data you received from a source — must be treated as read-only and never modified. All transformations produce new files in separate directories. This ensures you can always go back to the original source and reproduce your work from scratch.

Think of raw data like a master tape recording: you make copies to work with, but you never edit the original.

Principle 3: Separate Exploration from Production

Exploratory notebooks are messy by nature — they contain dead ends, experimental code, half-written analyses, and commented-out approaches you might return to. Production code must be clean, tested, and reliable. These two categories need separate homes and different standards of care.

Principle 4: Make the Project Self-Documenting

The directory names, file names, and project structure itself should communicate what the project contains and how it flows. Someone who has never seen the project should be able to understand its general shape by browsing the file tree for two minutes.

Principle 5: Automate the Boring Parts

Configuration, paths, environment setup, and pipeline execution should all be codified so they require no manual intervention or memorization. A Makefile, a configuration file, and a well-written README eliminate the hidden tribal knowledge that makes projects fragile.

The Standard Data Science Project Structure

This is the directory structure used — with variations — by professional data science teams worldwide. It’s informed by the Cookiecutter Data Science template (drivendata.github.io/cookiecutter-data-science), which is itself the distillation of industry best practices.

Plaintext

project-name/
│
├── data/
│   ├── raw/                    ← Original, immutable source data
│   ├── external/               ← Data from external sources (third-party)
│   ├── interim/                ← Partially processed, intermediate data
│   └── processed/              ← Final, clean data ready for modeling
│
├── notebooks/
│   ├── exploratory/            ← Messy EDA, experiments, dead ends — anything goes
│   │   ├── 01_eda_initial.ipynb
│   │   ├── 02_feature_ideas.ipynb
│   │   └── 03_model_experiments.ipynb
│   └── reports/                ← Clean, polished notebooks for sharing results
│       └── 01_final_analysis.ipynb
│
├── src/                        ← Source code (Python modules)
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── make_dataset.py     ← Scripts to download or generate data
│   │   └── preprocess.py       ← Data cleaning and transformation
│   ├── features/
│   │   ├── __init__.py
│   │   └── build_features.py   ← Feature engineering
│   ├── models/
│   │   ├── __init__.py
│   │   ├── train_model.py      ← Model training
│   │   ├── predict_model.py    ← Model inference
│   │   └── evaluate_model.py   ← Model evaluation
│   └── visualization/
│       ├── __init__.py
│       └── visualize.py        ← Reusable visualization functions
│
├── models/                     ← Trained model files (often gitignored)
│   ├── baseline_lr.pkl
│   └── best_model_xgb.pkl
│
├── reports/                    ← Generated analysis and output documents
│   ├── figures/                ← Saved plots and visualizations
│   │   ├── feature_importance.png
│   │   └── confusion_matrix.png
│   └── final_report.pdf
│
├── tests/                      ← Unit and integration tests
│   ├── __init__.py
│   ├── test_preprocess.py
│   ├── test_features.py
│   └── test_models.py
│
├── configs/                    ← Configuration files
│   ├── config.yaml             ← Main project configuration
│   └── model_config.yaml       ← Model hyperparameters
│
├── .github/
│   └── workflows/
│       └── tests.yml           ← CI/CD pipeline
│
├── .gitignore                  ← Files excluded from version control
├── .env.example                ← Template for environment variables (no secrets)
├── Makefile                    ← Automation commands
├── README.md                   ← Project documentation
├── requirements.txt            ← Production dependencies
├── requirements-dev.txt        ← Development + testing dependencies
└── pyproject.toml              ← Tool configuration (black, pytest, etc.)

project-name/
│
├── data/
│   ├── raw/                    ← Original, immutable source data
│   ├── external/               ← Data from external sources (third-party)
│   ├── interim/                ← Partially processed, intermediate data
│   └── processed/              ← Final, clean data ready for modeling
│
├── notebooks/
│   ├── exploratory/            ← Messy EDA, experiments, dead ends — anything goes
│   │   ├── 01_eda_initial.ipynb
│   │   ├── 02_feature_ideas.ipynb
│   │   └── 03_model_experiments.ipynb
│   └── reports/                ← Clean, polished notebooks for sharing results
│       └── 01_final_analysis.ipynb
│
├── src/                        ← Source code (Python modules)
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── make_dataset.py     ← Scripts to download or generate data
│   │   └── preprocess.py       ← Data cleaning and transformation
│   ├── features/
│   │   ├── __init__.py
│   │   └── build_features.py   ← Feature engineering
│   ├── models/
│   │   ├── __init__.py
│   │   ├── train_model.py      ← Model training
│   │   ├── predict_model.py    ← Model inference
│   │   └── evaluate_model.py   ← Model evaluation
│   └── visualization/
│       ├── __init__.py
│       └── visualize.py        ← Reusable visualization functions
│
├── models/                     ← Trained model files (often gitignored)
│   ├── baseline_lr.pkl
│   └── best_model_xgb.pkl
│
├── reports/                    ← Generated analysis and output documents
│   ├── figures/                ← Saved plots and visualizations
│   │   ├── feature_importance.png
│   │   └── confusion_matrix.png
│   └── final_report.pdf
│
├── tests/                      ← Unit and integration tests
│   ├── __init__.py
│   ├── test_preprocess.py
│   ├── test_features.py
│   └── test_models.py
│
├── configs/                    ← Configuration files
│   ├── config.yaml             ← Main project configuration
│   └── model_config.yaml       ← Model hyperparameters
│
├── .github/
│   └── workflows/
│       └── tests.yml           ← CI/CD pipeline
│
├── .gitignore                  ← Files excluded from version control
├── .env.example                ← Template for environment variables (no secrets)
├── Makefile                    ← Automation commands
├── README.md                   ← Project documentation
├── requirements.txt            ← Production dependencies
├── requirements-dev.txt        ← Development + testing dependencies
└── pyproject.toml              ← Tool configuration (black, pytest, etc.)

Let’s go through each directory in detail.

Directory Deep Dive

The `data/` Directory

This is one of the most important directories to get right.

Plaintext

data/
├── raw/          ← Never modified after receipt. Gitignored.
├── external/     ← Third-party data: Census data, market data, etc.
├── interim/      ← Output of partial processing (e.g., after deduplication)
└── processed/    ← Final, model-ready data. Gitignored.

data/
├── raw/          ← Never modified after receipt. Gitignored.
├── external/     ← Third-party data: Census data, market data, etc.
├── interim/      ← Output of partial processing (e.g., after deduplication)
└── processed/    ← Final, model-ready data. Gitignored.

Key rules:

data/raw/ is read-only. No script ever writes to it. It represents the ground truth of what you received.
Every subdirectory is typically gitignored (data files are too large and too sensitive for Git). Use cloud storage (S3, GCS, Azure Blob) or DVC for data version control.
Add a data/raw/README.md describing where the data came from, when it was collected, what format it’s in, and any known issues. This readme is committed to Git even though the data files aren’t.

Plaintext

# data/raw/README.md

## Customer Transaction Data

**Source**: Internal CRM database export
**Date collected**: 2024-09-15
**Format**: CSV
**File**: transactions_2024_q3.csv (412 MB, gitignored)
**Description**: All customer transactions for Q3 2024. 
**Schema**: customer_id (str), transaction_date (datetime), amount (float), 
            product_id (str), channel (str)
**Known issues**: ~2% of records have null values in 'channel' column.
**Access**: Available in S3 bucket s3://company-data/raw/transactions/

# data/raw/README.md

## Customer Transaction Data

**Source**: Internal CRM database export
**Date collected**: 2024-09-15
**Format**: CSV
**File**: transactions_2024_q3.csv (412 MB, gitignored)
**Description**: All customer transactions for Q3 2024. 
**Schema**: customer_id (str), transaction_date (datetime), amount (float), 
            product_id (str), channel (str)
**Known issues**: ~2% of records have null values in 'channel' column.
**Access**: Available in S3 bucket s3://company-data/raw/transactions/

This README costs 10 minutes to write and saves hours of confusion later.

The `notebooks/` Directory

Plaintext

notebooks/
├── exploratory/    ← Personal lab notebooks, messy is fine
└── reports/        ← Polished notebooks for sharing

notebooks/
├── exploratory/    ← Personal lab notebooks, messy is fine
└── reports/        ← Polished notebooks for sharing

Notebook naming convention: Number notebooks with a prefix to enforce chronological ordering and include a brief description:

Plaintext

01_initial_data_exploration.ipynb
02_missing_value_analysis.ipynb
03_feature_correlation_study.ipynb
04_baseline_model_experiments.ipynb
05_xgboost_hyperparameter_tuning.ipynb

01_initial_data_exploration.ipynb
02_missing_value_analysis.ipynb
03_feature_correlation_study.ipynb
04_baseline_model_experiments.ipynb
05_xgboost_hyperparameter_tuning.ipynb

This naming gives you a natural narrative: someone new to the project can read notebooks in order and follow the analytical journey. The number prefix also keeps them sorted correctly in file explorers.

The split between exploratory/ and reports/ is important. Exploratory notebooks are your personal lab: messy, iterative, containing dead ends, debug output, and commented-out experiments. They don’t need to run cleanly from top to bottom — they’re your thinking made visible.

Report notebooks are cleaned-up presentations of findings. They should run cleanly from Kernel → Restart and Run All, contain Markdown explanations of each section, and be suitable for sharing with stakeholders or committing as project documentation.

A practical rule: exploratory notebooks are for you; report notebooks are for everyone else.

The `src/` Directory

This is where your reusable, production-quality Python code lives. As you work in notebooks and discover patterns and functions worth keeping, you extract them here.

Plaintext

src/
├── __init__.py
├── data/
│   ├── make_dataset.py     ← Download, generate, or receive raw data
│   └── preprocess.py       ← Clean and transform raw data
├── features/
│   └── build_features.py   ← Create model features from processed data
├── models/
│   ├── train_model.py       ← Training logic
│   ├── predict_model.py     ← Inference logic
│   └── evaluate_model.py    ← Metrics and evaluation
└── visualization/
    └── visualize.py         ← Reusable plot functions

src/
├── __init__.py
├── data/
│   ├── make_dataset.py     ← Download, generate, or receive raw data
│   └── preprocess.py       ← Clean and transform raw data
├── features/
│   └── build_features.py   ← Create model features from processed data
├── models/
│   ├── train_model.py       ← Training logic
│   ├── predict_model.py     ← Inference logic
│   └── evaluate_model.py    ← Metrics and evaluation
└── visualization/
    └── visualize.py         ← Reusable plot functions

The __init__.py files make each directory a Python package, enabling clean imports:

Python

# In a notebook or script:
from src.data.preprocess import clean_transactions
from src.features.build_features import add_customer_lifetime_value
from src.models.train_model import train_xgboost

# In a notebook or script:
from src.data.preprocess import clean_transactions
from src.features.build_features import add_customer_lifetime_value
from src.models.train_model import train_xgboost

This import pattern means your notebooks stay clean: they orchestrate functions rather than containing all the logic inline.

A well-developed src/ directory is the hallmark of a mature data science project. It represents the crystallization of exploratory work into a stable, tested codebase.

The `configs/` Directory

Hardcoding values like file paths, model hyperparameters, and column names throughout your code is a maintenance nightmare. Config files centralize all these values:

YAML

# configs/config.yaml

paths:
  raw_data: "data/raw/transactions_2024_q3.csv"
  processed_data: "data/processed/transactions_clean.csv"
  model_output: "models/best_model.pkl"

data:
  target_column: "churned"
  id_column: "customer_id"
  date_column: "transaction_date"
  categorical_columns:
    - "channel"
    - "product_category"
    - "region"

training:
  test_size: 0.2
  random_state: 42
  cv_folds: 5

logging:
  level: "INFO"
  file: "logs/training.log"

# configs/config.yaml

paths:
  raw_data: "data/raw/transactions_2024_q3.csv"
  processed_data: "data/processed/transactions_clean.csv"
  model_output: "models/best_model.pkl"

data:
  target_column: "churned"
  id_column: "customer_id"
  date_column: "transaction_date"
  categorical_columns:
    - "channel"
    - "product_category"
    - "region"

training:
  test_size: 0.2
  random_state: 42
  cv_folds: 5

logging:
  level: "INFO"
  file: "logs/training.log"

YAML

# configs/model_config.yaml

xgboost:
  n_estimators: 500
  learning_rate: 0.05
  max_depth: 6
  min_child_weight: 3
  subsample: 0.8
  colsample_bytree: 0.8
  gamma: 0.1
  reg_alpha: 0.1
  reg_lambda: 1.0
  random_state: 42

random_forest:
  n_estimators: 300
  max_depth: 10
  min_samples_split: 5
  min_samples_leaf: 2
  random_state: 42

# configs/model_config.yaml

xgboost:
  n_estimators: 500
  learning_rate: 0.05
  max_depth: 6
  min_child_weight: 3
  subsample: 0.8
  colsample_bytree: 0.8
  gamma: 0.1
  reg_alpha: 0.1
  reg_lambda: 1.0
  random_state: 42

random_forest:
  n_estimators: 300
  max_depth: 10
  min_samples_split: 5
  min_samples_leaf: 2
  random_state: 42

Load configs in your Python code with PyYAML:

Python

import yaml
from pathlib import Path

def load_config(config_path: str = "configs/config.yaml") -> dict:
    with open(config_path, 'r') as f:
        return yaml.safe_load(f)

config = load_config()
raw_data_path = config['paths']['raw_data']
target_col = config['data']['target_column']

import yaml
from pathlib import Path

def load_config(config_path: str = "configs/config.yaml") -> dict:
    with open(config_path, 'r') as f:
        return yaml.safe_load(f)

config = load_config()
raw_data_path = config['paths']['raw_data']
target_col = config['data']['target_column']

Now changing a file path or hyperparameter requires editing one YAML file, not hunting through Python code across multiple modules.

The `tests/` Directory

Tests are frequently skipped in data science projects. This is a mistake that compounds over time — untested code is fragile code, and fragile code becomes impossible to refactor.

Plaintext

tests/
├── __init__.py
├── test_preprocess.py
├── test_features.py
└── test_models.py

tests/
├── __init__.py
├── test_preprocess.py
├── test_features.py
└── test_models.py

Tests don’t need to be exhaustive to be valuable. Even a handful of tests for your most critical functions provides enormous value:

Python

# tests/test_preprocess.py
import pandas as pd
import pytest
from src.data.preprocess import clean_transactions, handle_missing_values

class TestCleanTransactions:
    
    def test_removes_duplicate_rows(self):
        df = pd.DataFrame({
            'customer_id': ['A', 'A', 'B'],
            'amount': [100.0, 100.0, 200.0]
        })
        result = clean_transactions(df)
        assert len(result) == 2
    
    def test_preserves_all_columns(self):
        df = pd.DataFrame({
            'customer_id': ['A', 'B'],
            'amount': [100.0, 200.0],
            'channel': ['web', 'mobile']
        })
        result = clean_transactions(df)
        assert set(result.columns) == set(df.columns)
    
    def test_handles_empty_dataframe(self):
        df = pd.DataFrame()
        result = clean_transactions(df)
        assert len(result) == 0

class TestHandleMissingValues:
    
    def test_fills_numeric_with_median(self):
        df = pd.DataFrame({'amount': [100.0, None, 300.0]})
        result = handle_missing_values(df)
        assert result['amount'].isnull().sum() == 0
        assert result['amount'].iloc[1] == 200.0  # Median of 100, 300

# tests/test_preprocess.py
import pandas as pd
import pytest
from src.data.preprocess import clean_transactions, handle_missing_values

class TestCleanTransactions:
    
    def test_removes_duplicate_rows(self):
        df = pd.DataFrame({
            'customer_id': ['A', 'A', 'B'],
            'amount': [100.0, 100.0, 200.0]
        })
        result = clean_transactions(df)
        assert len(result) == 2
    
    def test_preserves_all_columns(self):
        df = pd.DataFrame({
            'customer_id': ['A', 'B'],
            'amount': [100.0, 200.0],
            'channel': ['web', 'mobile']
        })
        result = clean_transactions(df)
        assert set(result.columns) == set(df.columns)
    
    def test_handles_empty_dataframe(self):
        df = pd.DataFrame()
        result = clean_transactions(df)
        assert len(result) == 0

class TestHandleMissingValues:
    
    def test_fills_numeric_with_median(self):
        df = pd.DataFrame({'amount': [100.0, None, 300.0]})
        result = handle_missing_values(df)
        assert result['amount'].isnull().sum() == 0
        assert result['amount'].iloc[1] == 200.0  # Median of 100, 300

Run with: pytest tests/ -v

The `Makefile`

A Makefile codifies common project commands into named, executable targets — eliminating “how do I run this again?” questions:

YAML

# Makefile

.PHONY: env data features train evaluate test lint clean

# Set up the development environment
env:
	python -m venv venv
	source venv/bin/activate && pip install -r requirements-dev.txt
	@echo "Environment ready. Run 'source venv/bin/activate' to activate."

# Download and prepare raw data
data:
	python src/data/make_dataset.py --config configs/config.yaml

# Build features from processed data
features:
	python src/features/build_features.py --config configs/config.yaml

# Train the model
train:
	python src/models/train_model.py --config configs/config.yaml

# Evaluate the model
evaluate:
	python src/models/evaluate_model.py --config configs/config.yaml

# Run the full pipeline
pipeline: data features train evaluate
	@echo "Full pipeline complete."

# Run tests
test:
	pytest tests/ -v --tb=short

# Run linting and formatting checks
lint:
	black --check src/ tests/
	flake8 src/ tests/

# Format code
format:
	black src/ tests/
	isort src/ tests/

# Clean generated files
clean:
	find . -type f -name "*.pyc" -delete
	find . -type d -name "__pycache__" -exec rm -rf {} +
	find . -name ".ipynb_checkpoints" -exec rm -rf {} +

# Makefile

.PHONY: env data features train evaluate test lint clean

# Set up the development environment
env:
	python -m venv venv
	source venv/bin/activate && pip install -r requirements-dev.txt
	@echo "Environment ready. Run 'source venv/bin/activate' to activate."

# Download and prepare raw data
data:
	python src/data/make_dataset.py --config configs/config.yaml

# Build features from processed data
features:
	python src/features/build_features.py --config configs/config.yaml

# Train the model
train:
	python src/models/train_model.py --config configs/config.yaml

# Evaluate the model
evaluate:
	python src/models/evaluate_model.py --config configs/config.yaml

# Run the full pipeline
pipeline: data features train evaluate
	@echo "Full pipeline complete."

# Run tests
test:
	pytest tests/ -v --tb=short

# Run linting and formatting checks
lint:
	black --check src/ tests/
	flake8 src/ tests/

# Format code
format:
	black src/ tests/
	isort src/ tests/

# Clean generated files
clean:
	find . -type f -name "*.pyc" -delete
	find . -type d -name "__pycache__" -exec rm -rf {} +
	find . -name ".ipynb_checkpoints" -exec rm -rf {} +

With this Makefile, a new team member can run make env to set up their environment, make pipeline to run the entire data pipeline, and make test to run tests — without reading any documentation about which scripts to run in which order.

File Naming Conventions

Consistent naming conventions make file navigation fast and intuitive. Here are the standards used by professional data science teams.

General Rules

Use lowercase with underscores for all filenames: feature_engineering.py not FeatureEngineering.py or feature-engineering.py
Be descriptive but concise: train_xgboost_classifier.py not train.py or this_is_the_training_script_for_the_xgboost_model_v2.py
Avoid version numbers in filenames: use Git for versioning, not model_v3_final_FINAL.pkl
Use ISO date format when dates are needed: data_20240915.csv not data_Sept_15.csv

Notebook Naming

Plaintext

# Pattern: NN_brief_description.ipynb
# NN = two-digit number for ordering

01_data_exploration.ipynb
02_missing_value_analysis.ipynb
03_feature_engineering_experiments.ipynb
04_baseline_model_comparison.ipynb
05_xgboost_optimization.ipynb
06_final_model_evaluation.ipynb

# Pattern: NN_brief_description.ipynb
# NN = two-digit number for ordering

01_data_exploration.ipynb
02_missing_value_analysis.ipynb
03_feature_engineering_experiments.ipynb
04_baseline_model_comparison.ipynb
05_xgboost_optimization.ipynb
06_final_model_evaluation.ipynb

Script Naming

Use the verb-noun pattern for scripts that perform actions:

Plaintext

# Good — verb describes what the script does
make_dataset.py
build_features.py
train_model.py
evaluate_model.py
generate_report.py
clean_transactions.py

# Avoid — too generic
main.py (unless it truly is a main entry point)
utils.py (what utilities? be specific)
helpers.py (helpers for what?)
script.py

# Good — verb describes what the script does
make_dataset.py
build_features.py
train_model.py
evaluate_model.py
generate_report.py
clean_transactions.py

# Avoid — too generic
main.py (unless it truly is a main entry point)
utils.py (what utilities? be specific)
helpers.py (helpers for what?)
script.py

Data File Naming

Plaintext

# Pattern: description_YYYYMMDD.extension or description_version.extension

transactions_raw_20240915.csv       ← Raw data with acquisition date
customers_processed_v2.parquet      ← Processed data with version
features_train_set.parquet          ← Role-specific (train/test)
features_test_set.parquet

# Pattern: description_YYYYMMDD.extension or description_version.extension

transactions_raw_20240915.csv       ← Raw data with acquisition date
customers_processed_v2.parquet      ← Processed data with version
features_train_set.parquet          ← Role-specific (train/test)
features_test_set.parquet

Model File Naming

Plaintext

# Pattern: algorithm_experiment_description.extension

baseline_logistic_regression.pkl
xgboost_v1_default_params.pkl
xgboost_v2_tuned_lr0.05_depth6.pkl
best_model_xgboost.pkl              ← Clearly marked best performer

# Pattern: algorithm_experiment_description.extension

baseline_logistic_regression.pkl
xgboost_v1_default_params.pkl
xgboost_v2_tuned_lr0.05_depth6.pkl
best_model_xgboost.pkl              ← Clearly marked best performer

Adapting the Structure for Different Project Types

The standard structure works well for most projects, but different problem types call for slight adaptations.

Small Exploratory Projects

For quick, one-person analyses that won’t be deployed or shared widely, a lighter structure is appropriate:

Plaintext

quick-analysis/
├── data/
│   └── raw/
├── notebooks/
│   └── 01_analysis.ipynb
├── .gitignore
├── requirements.txt
└── README.md

quick-analysis/
├── data/
│   └── raw/
├── notebooks/
│   └── 01_analysis.ipynb
├── .gitignore
├── requirements.txt
└── README.md

Don’t over-engineer short-lived work. The full structure shines for projects that will grow, be shared, or need to be maintained.

Machine Learning Pipeline Projects

For projects that deploy models into production:

Plaintext

ml-pipeline/
├── data/                   ← (standard)
├── src/
│   ├── data/               ← (standard)
│   ├── features/           ← (standard)
│   ├── models/             ← (standard)
│   └── api/                ← NEW: model serving code
│       ├── app.py          ← Flask/FastAPI application
│       ├── schemas.py      ← Request/response schemas
│       └── predict.py      ← Prediction endpoint logic
├── notebooks/              ← (standard)
├── tests/                  ← (standard)
├── configs/                ← (standard)
├── docker/                 ← NEW: containerization
│   ├── Dockerfile
│   └── docker-compose.yml
├── deployment/             ← NEW: infrastructure as code
│   ├── kubernetes/
│   └── terraform/
└── monitoring/             ← NEW: model monitoring
    └── drift_detection.py

ml-pipeline/
├── data/                   ← (standard)
├── src/
│   ├── data/               ← (standard)
│   ├── features/           ← (standard)
│   ├── models/             ← (standard)
│   └── api/                ← NEW: model serving code
│       ├── app.py          ← Flask/FastAPI application
│       ├── schemas.py      ← Request/response schemas
│       └── predict.py      ← Prediction endpoint logic
├── notebooks/              ← (standard)
├── tests/                  ← (standard)
├── configs/                ← (standard)
├── docker/                 ← NEW: containerization
│   ├── Dockerfile
│   └── docker-compose.yml
├── deployment/             ← NEW: infrastructure as code
│   ├── kubernetes/
│   └── terraform/
└── monitoring/             ← NEW: model monitoring
    └── drift_detection.py

NLP Projects

Natural language processing projects add specialized data directories:

Plaintext

nlp-project/
├── data/
│   ├── raw/                ← Original documents, corpora
│   ├── interim/
│   │   └── tokenized/      ← Tokenized text
│   └── processed/
│       └── embeddings/     ← Vector representations
├── src/
│   ├── data/
│   │   ├── scraper.py      ← Data collection scripts
│   │   └── tokenizer.py    ← Text tokenization
│   ├── features/
│   │   └── text_features.py ← TF-IDF, embeddings, etc.
│   └── models/
│       ├── fine_tune.py    ← Model fine-tuning
│       └── evaluate.py
├── models/
│   └── fine_tuned/         ← Fine-tuned model checkpoints
└── notebooks/

nlp-project/
├── data/
│   ├── raw/                ← Original documents, corpora
│   ├── interim/
│   │   └── tokenized/      ← Tokenized text
│   └── processed/
│       └── embeddings/     ← Vector representations
├── src/
│   ├── data/
│   │   ├── scraper.py      ← Data collection scripts
│   │   └── tokenizer.py    ← Text tokenization
│   ├── features/
│   │   └── text_features.py ← TF-IDF, embeddings, etc.
│   └── models/
│       ├── fine_tune.py    ← Model fine-tuning
│       └── evaluate.py
├── models/
│   └── fine_tuned/         ← Fine-tuned model checkpoints
└── notebooks/

Computer Vision Projects

Plaintext

cv-project/
├── data/
│   ├── raw/
│   │   ├── images/         ← Original images
│   │   └── annotations/    ← Labels, masks, bounding boxes
│   └── processed/
│       └── augmented/      ← Augmented training images
├── src/
│   ├── data/
│   │   ├── dataset.py      ← PyTorch/TF Dataset class
│   │   └── augmentation.py ← Data augmentation pipeline
│   └── models/
│       ├── architecture.py ← Neural network architecture
│       └── train.py
├── models/
│   └── checkpoints/        ← Training checkpoints
└── notebooks/

cv-project/
├── data/
│   ├── raw/
│   │   ├── images/         ← Original images
│   │   └── annotations/    ← Labels, masks, bounding boxes
│   └── processed/
│       └── augmented/      ← Augmented training images
├── src/
│   ├── data/
│   │   ├── dataset.py      ← PyTorch/TF Dataset class
│   │   └── augmentation.py ← Data augmentation pipeline
│   └── models/
│       ├── architecture.py ← Neural network architecture
│       └── train.py
├── models/
│   └── checkpoints/        ← Training checkpoints
└── notebooks/

The Cookiecutter Data Science Template

Rather than setting up this structure from scratch each time, use the Cookiecutter Data Science template to generate it automatically.

Using Cookiecutter

Bash

# Install cookiecutter
pip install cookiecutter

# Generate a new project from the template
cookiecutter https://github.com/drivendata/cookiecutter-data-science

# Install cookiecutter
pip install cookiecutter

# Generate a new project from the template
cookiecutter https://github.com/drivendata/cookiecutter-data-science

Cookiecutter asks a series of questions and generates the complete directory structure:

Plaintext

project_name [project_name]: customer-churn-model
repo_name [customer-churn-model]: customer-churn-model
author_name [Your name]: Jane Smith
description [A short description of the project.]: Predict customer churn using transaction history
open_source_license [MIT]: MIT
python_interpreter [python3]: python3

project_name [project_name]: customer-churn-model
repo_name [customer-churn-model]: customer-churn-model
author_name [Your name]: Jane Smith
description [A short description of the project.]: Predict customer churn using transaction history
open_source_license [MIT]: MIT
python_interpreter [python3]: python3

The generated structure is ready to use immediately, with placeholder README files in each directory explaining its purpose.

Customizing the Template

If your team has specific needs, you can fork the Cookiecutter Data Science template and add your organization’s standard files: company-specific .gitignore additions, your team’s Makefile targets, standard CI/CD configurations, and custom documentation templates. Every new project then starts from your organization’s opinionated standard.

Managing the Data Directory with DVC

Since data files are typically gitignored, you need a separate system for data version control. DVC (Data Version Control) extends Git-like semantics to data and model files.

DVC Basics

Plaintext

# Install DVC
pip install dvc

# Initialize DVC in your Git repository
dvc init

# Add a data file to DVC tracking
dvc add data/raw/transactions_2024_q3.csv
# Creates: data/raw/transactions_2024_q3.csv.dvc (small metadata file)
# The .dvc file IS committed to Git
# The actual CSV is NOT committed (DVC manages it separately)

# Configure remote storage (e.g., S3)
dvc remote add -d myremote s3://my-bucket/dvc-store

# Push data to remote storage
dvc push

# Teammates pull data
dvc pull

# Install DVC
pip install dvc

# Initialize DVC in your Git repository
dvc init

# Add a data file to DVC tracking
dvc add data/raw/transactions_2024_q3.csv
# Creates: data/raw/transactions_2024_q3.csv.dvc (small metadata file)
# The .dvc file IS committed to Git
# The actual CSV is NOT committed (DVC manages it separately)

# Configure remote storage (e.g., S3)
dvc remote add -d myremote s3://my-bucket/dvc-store

# Push data to remote storage
dvc push

# Teammates pull data
dvc pull

With DVC, your data/raw/transactions_2024_q3.csv.dvc file is committed to Git and records the exact version of the data (via a hash). Teammates run dvc pull and get the exact same data file. The data lives in S3 (or GCS, Azure, or even a shared network drive), not in Git.

This gives you data version control that parallels code version control — you can dvc checkout to get the data that corresponds to any Git commit, making complete experiment reproducibility possible.

Documentation Within the Project

Good project organization isn’t just about directory structure — it’s about the documentation embedded throughout the project.

The README.md

The project-level README should cover:

Plaintext

# Customer Churn Prediction

Predict which customers are likely to churn in the next 30 days 
using transaction history and behavioral features.

## Project Organization

    ├── data/
    │   ├── raw/           ← Original transaction data (DVC-managed)
    │   └── processed/     ← Cleaned, feature-engineered data
    ├── notebooks/         ← Exploratory and report notebooks
    ├── src/               ← Python source modules
    ├── tests/             ← Unit tests
    ├── configs/           ← Configuration files
    └── models/            ← Trained model artifacts

## Setup

Prerequisites: Python 3.11, DVC

    git clone git@github.com:team/customer-churn.git
    cd customer-churn
    make env
    source venv/bin/activate
    dvc pull   # Download data from remote storage

## Running the Pipeline

    make pipeline  # Run complete data → features → train → evaluate pipeline

## Results

| Model         | AUC-ROC | F1 Score | Precision | Recall |
|---------------|---------|----------|-----------|--------|
| Baseline (LR) | 0.823   | 0.764    | 0.771     | 0.757  |
| Random Forest | 0.891   | 0.819    | 0.834     | 0.804  |
| **XGBoost**   | **0.914** | **0.847** | **0.856** | **0.839** |

Best model: XGBoost (see `notebooks/reports/06_final_model_evaluation.ipynb`)

# Customer Churn Prediction

Predict which customers are likely to churn in the next 30 days 
using transaction history and behavioral features.

## Project Organization

    ├── data/
    │   ├── raw/           ← Original transaction data (DVC-managed)
    │   └── processed/     ← Cleaned, feature-engineered data
    ├── notebooks/         ← Exploratory and report notebooks
    ├── src/               ← Python source modules
    ├── tests/             ← Unit tests
    ├── configs/           ← Configuration files
    └── models/            ← Trained model artifacts

## Setup

Prerequisites: Python 3.11, DVC

    git clone git@github.com:team/customer-churn.git
    cd customer-churn
    make env
    source venv/bin/activate
    dvc pull   # Download data from remote storage

## Running the Pipeline

    make pipeline  # Run complete data → features → train → evaluate pipeline

## Results

| Model         | AUC-ROC | F1 Score | Precision | Recall |
|---------------|---------|----------|-----------|--------|
| Baseline (LR) | 0.823   | 0.764    | 0.771     | 0.757  |
| Random Forest | 0.891   | 0.819    | 0.834     | 0.804  |
| **XGBoost**   | **0.914** | **0.847** | **0.856** | **0.839** |

Best model: XGBoost (see `notebooks/reports/06_final_model_evaluation.ipynb`)

README Files in Subdirectories

Each major subdirectory should have its own README explaining its contents:

Plaintext

# data/raw/README.md
Source: CRM database export, Q3 2024.
Files: transactions_2024_q3.csv (DVC-tracked). See .dvc file.
Schema: customer_id, transaction_date, amount, product_id, channel.
Access: Request access via s3://company-data/raw/ (ask DevOps team).

# data/raw/README.md
Source: CRM database export, Q3 2024.
Files: transactions_2024_q3.csv (DVC-tracked). See .dvc file.
Schema: customer_id, transaction_date, amount, product_id, channel.
Access: Request access via s3://company-data/raw/ (ask DevOps team).

Plaintext

# models/README.md
Models are saved as pickle files with scikit-learn/XGBoost.
Naming: algorithm_experiment_description.pkl
Best production model: best_model_xgboost.pkl (AUC-ROC: 0.914)
Training details: notebooks/reports/06_final_model_evaluation.ipynb

# models/README.md
Models are saved as pickle files with scikit-learn/XGBoost.
Naming: algorithm_experiment_description.pkl
Best production model: best_model_xgboost.pkl (AUC-ROC: 0.914)
Training details: notebooks/reports/06_final_model_evaluation.ipynb

These in-directory READMEs are small time investments that make the project self-navigating.

Common Anti-Patterns to Avoid

Recognizing bad patterns is as important as knowing good ones.

Anti-Pattern 1: Everything in One Directory

Plaintext

project/  ← All files dumped here
├── analysis.ipynb
├── analysis_v2.ipynb
├── analysis_final.ipynb
├── data.csv
├── clean_data.csv
├── model.pkl
├── model_final.pkl
├── preprocessing.py
├── train.py
└── utils.py

project/  ← All files dumped here
├── analysis.ipynb
├── analysis_v2.ipynb
├── analysis_final.ipynb
├── data.csv
├── clean_data.csv
├── model.pkl
├── model_final.pkl
├── preprocessing.py
├── train.py
└── utils.py

No separation of concerns, no navigability, no reproducibility. A project archaeologist’s nightmare.

Anti-Pattern 2: Version Numbers in Filenames

Plaintext

model_v1.pkl
model_v2.pkl
model_v2_improved.pkl
model_v3_FINAL.pkl
model_v3_FINAL_fixed.pkl
model_v3_FINAL_fixed_USE_THIS.pkl

model_v1.pkl
model_v2.pkl
model_v2_improved.pkl
model_v3_FINAL.pkl
model_v3_FINAL_fixed.pkl
model_v3_FINAL_fixed_USE_THIS.pkl

This is what Git is for. Use one best_model.pkl and let commit history tell the story of how you got there.

Anti-Pattern 3: Hardcoded Paths

Python

# Fragile — breaks on anyone else's machine
df = pd.read_csv("/Users/jane/Desktop/project/data/my_data.csv")

# Better — relative path using config
df = pd.read_csv(config['paths']['raw_data'])

# Also better — using pathlib
from pathlib import Path
DATA_DIR = Path(__file__).parent.parent / "data"
df = pd.read_csv(DATA_DIR / "raw" / "transactions.csv")

# Fragile — breaks on anyone else's machine
df = pd.read_csv("/Users/jane/Desktop/project/data/my_data.csv")

# Better — relative path using config
df = pd.read_csv(config['paths']['raw_data'])

# Also better — using pathlib
from pathlib import Path
DATA_DIR = Path(__file__).parent.parent / "data"
df = pd.read_csv(DATA_DIR / "raw" / "transactions.csv")

Hardcoded absolute paths guarantee that your code runs only on your machine.

Anti-Pattern 4: Committing Data to Git

Large data files in Git repositories slow down every clone, push, and pull for everyone on the team and fill up the repository. Use DVC, Git LFS, or simply gitignore data directories and document access via README.

Anti-Pattern 5: Notebooks as Production Code

Notebooks are excellent for exploration and communication, but they shouldn’t be the operational code that runs your pipeline. Complex, multi-notebook pipelines that must be run in a specific order, manually, are brittle and non-automatable. Extract the logic into src/ modules and orchestrate them with a Makefile or pipeline tool.

Evolving Your Structure Over Time

Projects grow. A structure appropriate for day one may feel constraining on day ninety. The key is to evolve deliberately rather than letting entropy accumulate.

Signs your structure needs refactoring:

You’re spending more than a minute finding files you know exist
New team members ask where to put things more than once per day
Your notebooks directory has more than 20 files with no subdirectory organization
The same logic appears in three different places with slight variations
Nobody knows which model file is the current production model

When to refactor:

At natural project milestones (completing EDA, completing initial model, preparing for deployment)
When onboarding new team members (their confusion is useful signal)
Before sharing the project publicly or presenting it to stakeholders

Refactoring directory structure is like refactoring code — it should be done regularly in small increments, not deferred until the project is completely unmaintainable.

Summary

Organizing data science project files is not bureaucratic overhead — it is the infrastructure of reproducible, collaborative, maintainable work. The investment in a clean structure pays dividends every time you return to a project, onboard a colleague, reproduce a result, or hand off your work to a team.

The standard structure — with clearly separated data/, notebooks/, src/, tests/, configs/, and models/ directories — reflects hard-won experience about how data science projects actually work: data flows from raw to processed, exploration yields code that gets extracted into modules, modules get tested, and configuration drives behavior without hardcoding.

The practical tools that support this structure — Cookiecutter for generating it, DVC for data versioning, Makefiles for automation, config YAML files for parameterization — transform good intentions into consistent execution. Use them on every project, from the very first commit, and the organizational habits they instill will make you a significantly more effective data scientist.

Key Takeaways

Good project organization enables reproducibility, collaboration, and the ability to return to projects months later without confusion — it is not optional overhead but a fundamental professional practice
Raw data is sacred and read-only; all processing produces new files in separate directories, ensuring you can always reproduce work from the original source
The standard structure separates data from code, exploration from production, and configuration from logic — each separation solves a distinct category of organizational problem
Notebook naming with numeric prefixes (01_eda.ipynb, 02_features.ipynb) creates a navigable narrative of the analytical process
Config YAML files eliminate hardcoded values and make experiments parameterizable — one file to change file paths and hyperparameters across the entire project
A Makefile codifies common commands so anyone can run make train or make test without reading documentation about which scripts to run in which order
DVC (Data Version Control) extends Git semantics to large data and model files, making complete experiment reproducibility possible without bloating the Git repository
Version numbers in filenames are an anti-pattern — use Git for version control and keep one canonical filename for each artifact

0 Comments

Inline Feedbacks

View all comments

Discover More

Click For More

Organizing Your Data Science Project Files

Introduction

Why Project Organization Matters

Reproducibility

The Handoff Problem

The Return Problem

Collaboration at Scale

Debugging and Iteration

Core Principles of Good Data Science Project Organization

Principle 1: Separate Data from Code

Principle 2: Raw Data Is Sacred

Principle 3: Separate Exploration from Production

Principle 4: Make the Project Self-Documenting

Principle 5: Automate the Boring Parts

The Standard Data Science Project Structure

Directory Deep Dive

The data/ Directory

The notebooks/ Directory

The src/ Directory

The configs/ Directory

The tests/ Directory

The Makefile

File Naming Conventions

General Rules

Notebook Naming

Script Naming

Data File Naming

Model File Naming

Adapting the Structure for Different Project Types

Small Exploratory Projects

Machine Learning Pipeline Projects

NLP Projects

Computer Vision Projects

The Cookiecutter Data Science Template

Using Cookiecutter

Customizing the Template

Managing the Data Directory with DVC

DVC Basics

Documentation Within the Project

The README.md

README Files in Subdirectories

Common Anti-Patterns to Avoid

Anti-Pattern 1: Everything in One Directory

Anti-Pattern 2: Version Numbers in Filenames

Anti-Pattern 3: Hardcoded Paths

Anti-Pattern 4: Committing Data to Git

Anti-Pattern 5: Notebooks as Production Code

Evolving Your Structure Over Time

Summary

Key Takeaways

Discover More

Understanding System Architecture: The Blueprint of Your Operating System

Introduction to JavaScript – Basics and Fundamentals

The History of Robotics: From Ancient Automata to Modern Machines

Understanding Force and Torque in Robot Design

The Role of Inductors: Understanding Magnetic Energy Storage

Interactive Data Visualization: Adding Filters and Interactivity

The `data/` Directory

The `notebooks/` Directory

The `src/` Directory

The `configs/` Directory

The `tests/` Directory

The `Makefile`