Version Control for Data Scientists: Git Basics

Learn Git basics for data science. Master version control with commits, branches, merges, and best practices to manage your data science projects professionally.

By Techietory on February 28, 2026

Version Control for Data Scientists: Git Basics

Git is a distributed version control system that tracks changes to files over time, allowing you to revisit any previous state of your project, collaborate with teammates without overwriting each other’s work, and maintain a complete history of every modification ever made. For data scientists, Git is an essential tool that transforms chaotic, file-based workflows (“model_v2_FINAL_fixed.py”) into structured, professional project management.

Introduction

Picture this scenario: you’ve spent three weeks building a machine learning model. It’s working reasonably well, and you decide to try a new feature engineering approach. Two days of work later, the model is performing worse than before — and you can’t quite remember what the original code looked like. You search through files named model_v1.py, model_v2_new.py, model_v2_new_FIXED.py, and model_FINAL_use_this_one.py, trying to piece together what you had before.

Every data scientist has lived some version of this story. It’s painful, time-consuming, and completely avoidable.

Git is the solution. It’s the industry-standard version control system used by virtually every professional software developer and an increasing number of data scientists. With Git, every change you make is recorded, every version of your project is accessible, and collaboration with teammates becomes structured and conflict-free.

This guide will take you from zero Git knowledge to confidently using version control in your daily data science work. We’ll cover what Git is, why it matters specifically for data scientists, how to set it up, and how to use the core commands that you’ll reach for every single day.

What Is Version Control?

Before diving into Git specifically, let’s understand what version control means and why it was invented.

Version control is a system that records changes to a set of files over time so that you can recall specific versions later. Think of it like an infinitely detailed “undo history” for your entire project — not just the last 20 changes, but every change ever made, by anyone, with notes explaining why each change was made.

The Problem Version Control Solves

Without version control, developers and data scientists resort to manual versioning — duplicating files, adding dates or version numbers to filenames, keeping “backup” folders. This approach:

Creates clutter and confusion about which version is current
Makes it impossible to understand why changes were made
Makes collaboration a nightmare (who changed what? did their changes overwrite mine?)
Provides no way to easily roll back a bad change
Offers no history of the project’s evolution

With version control, you have a single source of truth for your project, a complete annotated history of every change, and the ability to branch off in experimental directions without risk.

Types of Version Control Systems

There are two main models of version control:

Centralized Version Control (CVS, Subversion/SVN): All version history is stored on a single central server. Team members check out files from this central server, make changes, and check them back in. This creates a single point of failure and requires constant network access.

Distributed Version Control (Git, Mercurial): Every team member has a complete copy of the repository and its entire history on their local machine. Changes can be made, committed, and reviewed offline. Synchronization with remote servers (like GitHub) happens explicitly when you choose.

Git is by far the most dominant distributed version control system in use today, powering the workflows of millions of developers and data scientists worldwide.

Why Git Matters Specifically for Data Scientists

Data scientists have unique version control needs that differ slightly from traditional software developers. Here’s why Git is especially valuable in data science work:

Tracking Experiments and Model Iterations

Machine learning involves constant experimentation. You try different algorithms, hyperparameters, preprocessing steps, and feature combinations. Without version control, it’s nearly impossible to remember what combination of choices produced your best result three weeks ago.

With Git, you commit a working version of your code, create a branch to try a new approach, and if the new approach doesn’t pan out, you simply switch back to the previous commit. Your best model is always recoverable.

Collaboration Without Conflict

Data science projects increasingly involve teams — data engineers building pipelines, data scientists building models, ML engineers deploying them. Without version control, multiple people working on the same codebase leads to files being overwritten and work being lost. Git provides structured tools for merging contributions from multiple people.

Reproducibility

Reproducibility is a core principle of good science. If someone asks you to reproduce results from a model you trained six months ago, you need to know exactly what code was used. Git lets you tag specific commits as “the code used for the Q3 2024 model” and check out that exact state at any time.

Code Quality and Review

Teams using Git can implement code review workflows where changes are reviewed by peers before being merged into the main codebase. This catches bugs, improves code quality, and spreads knowledge across the team.

Installing and Configuring Git

Installation

On macOS: Git is often pre-installed. Check by running:

Bash

git --version

git --version

If not installed, the easiest approach is via Homebrew:

Bash

brew install git

brew install git

Or install Xcode Command Line Tools:

Bash

xcode-select --install

xcode-select --install

On Linux (Ubuntu/Debian):

Bash

sudo apt update
sudo apt install git

sudo apt update
sudo apt install git

On Windows: Download Git for Windows from git-scm.com. The installer includes Git Bash — a terminal emulator that lets you use Git with Unix-style commands on Windows. Most data scientists on Windows use Git Bash or WSL (Windows Subsystem for Linux).

Initial Configuration

After installing Git, the first thing to do is configure your identity. Git attaches your name and email to every commit you make:

Bash

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

The --global flag means these settings apply to all Git repositories on your machine. You can also set a default text editor (used for writing commit messages):

Bash

# Set VS Code as the default editor
git config --global core.editor "code --wait"

# Or set nano (simpler for beginners)
git config --global core.editor "nano"

# Set VS Code as the default editor
git config --global core.editor "code --wait"

# Or set nano (simpler for beginners)
git config --global core.editor "nano"

Verify your configuration:

Bash

git config --list

git config --list

Core Git Concepts

Before learning commands, you need to understand Git’s mental model. These are the foundational concepts everything else builds on.

The Repository

A repository (or “repo”) is a directory that Git is tracking. It contains your project files plus a hidden .git folder where Git stores all the version history and metadata. Everything Git needs to know about your project’s history is inside that .git folder.

There are two types of repositories:

Local repository: The copy on your computer
Remote repository: A copy hosted on a server, typically GitHub, GitLab, or Bitbucket

The Three Areas of Git

This is one of the most important concepts to understand. Git manages your files across three distinct areas:

1. Working Directory (Working Tree) This is your actual project folder — the files you see and edit. When you create, modify, or delete files, those changes exist in the working directory but are not yet tracked by Git.

2. Staging Area (Index) Think of the staging area as a “preparation zone” where you assemble changes you intend to commit. You explicitly add files to the staging area to say “these are the changes I want to include in my next commit.”

3. Repository (Commit History) Once you commit the staged changes, they are permanently recorded in Git’s history. Each commit is a snapshot of your project at a specific point in time.

The workflow is: Work → Stage → Commit

Plaintext

Working Directory  →  Staging Area  →  Repository
  (git add)           (git commit)

Working Directory  →  Staging Area  →  Repository
  (git add)           (git commit)

Commits

A commit is a snapshot of your project at a specific moment in time. Each commit contains:

A unique identifier (SHA hash like a3f8c2d)
The author’s name and email
A timestamp
A commit message describing what changed and why
A reference to the previous commit (parent)

Commits are the core unit of Git. All of Git’s power — branching, merging, reverting, time-traveling — operates on commits.

Branches

A branch is a lightweight pointer to a specific commit. The default branch is typically called main (or master in older repositories). When you create a new branch, you create a parallel line of development where you can make changes independently from the main codebase.

Branches are how you work on new features or experiments without affecting your stable, working code.

The HEAD

HEAD is a special pointer that indicates where you currently are in the repository — which commit (or branch) is your “current position.” When you’re on the main branch and make a new commit, HEAD advances to point to the new commit.

Essential Git Commands

Now let’s get into the commands you’ll use every day. We’ll build up from the simplest workflows to more advanced operations.

Starting a Repository

Initialize a new repository:

Bash

cd my-data-science-project
git init

cd my-data-science-project
git init

This creates the .git folder in your current directory, turning it into a Git repository. Your files are still untracked — you haven’t told Git to watch anything yet.

Clone an existing repository:

Bash

git clone https://github.com/username/repository-name.git

git clone https://github.com/username/repository-name.git

This downloads a complete copy of an existing repository (including all history) to your local machine. You’ll use this when joining a project that already exists on GitHub.

Checking Status and History

Check the current state of your repository:

Bash

git status

git status

This is the command you’ll run most frequently. It tells you:

Which files have been modified
Which files are staged (ready to commit)
Which files are untracked (new files Git doesn’t know about)

Example output:

Plaintext

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
        modified:   model_training.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        new_features.py

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
        modified:   model_training.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        new_features.py

View commit history:

Bash

git log

git log

Shows a list of all commits in reverse chronological order, with hash, author, date, and message.

For a more compact view:

Bash

git log --oneline

git log --oneline

Output:

Plaintext

a3f8c2d Add random forest model comparison
b91e4af Fix missing value handling in preprocessing
c72d1e3 Initial data exploration notebook

a3f8c2d Add random forest model comparison
b91e4af Fix missing value handling in preprocessing
c72d1e3 Initial data exploration notebook

See what changed in a file:

Bash

git diff model_training.py

git diff model_training.py

Shows the specific lines added (in green, prefixed with +) and removed (in red, prefixed with -) since your last commit.

Staging and Committing

Stage a specific file:

Bash

git add model_training.py

git add model_training.py

Stage multiple files:

Bash

git add model_training.py feature_engineering.py

git add model_training.py feature_engineering.py

Stage all changed files:

Bash

git add .

git add .

The . means “everything in the current directory and subdirectories.” Use this carefully — make sure you know what you’re staging.

Unstage a file (if you staged it by mistake):

Bash

git restore --staged model_training.py

git restore --staged model_training.py

Commit staged changes:

Bash

git commit -m "Add gradient boosting model with tuned hyperparameters"

git commit -m "Add gradient boosting model with tuned hyperparameters"

The -m flag lets you write the commit message inline. If you omit -m, Git opens your configured text editor for you to write the message.

Stage and commit in one step (for tracked files only):

Bash

git commit -am "Fix bug in data preprocessing pipeline"

git commit -am "Fix bug in data preprocessing pipeline"

The -a flag automatically stages all modifications to already-tracked files. Note: this doesn’t include new (untracked) files.

Writing Good Commit Messages

This deserves special attention. A commit message is a note to your future self and your teammates explaining why a change was made. Vague messages like “fix stuff” or “update” are nearly useless. Good commit messages have lasting value.

The conventional structure:

Plaintext

Short summary (50 characters or less)

More detailed explanation if necessary. Wrap at 72 characters.
Explain what and why, not how (the code shows how).

- Bullet points are fine for multiple changes
- Reference issue numbers if relevant: Fixes #42

Short summary (50 characters or less)

More detailed explanation if necessary. Wrap at 72 characters.
Explain what and why, not how (the code shows how).

- Bullet points are fine for multiple changes
- Reference issue numbers if relevant: Fixes #42

Examples of poor vs. good commit messages:

Poor Message	Good Message
`fix`	`Fix KeyError in feature engineering when column is missing`
`update model`	`Switch from RandomForest to XGBoost, improves AUC by 3%`
`changes`	`Refactor preprocessing pipeline for readability`
`wip`	`Add initial implementation of SMOTE for class imbalance`
`stuff`	`Remove hardcoded file paths, use config.yaml instead`

A good commit message answers: “If applied, this commit will ___________.” For example: “If applied, this commit will fix the KeyError in feature engineering when a column is missing.”

Branching

Create a new branch:

Bash

git branch experiment-xgboost

git branch experiment-xgboost

Switch to a branch:

Bash

git checkout experiment-xgboost
# Or using the newer syntax:
git switch experiment-xgboost

git checkout experiment-xgboost
# Or using the newer syntax:
git switch experiment-xgboost

Create and switch to a new branch in one command:

Bash

git checkout -b experiment-xgboost
# Or:
git switch -c experiment-xgboost

git checkout -b experiment-xgboost
# Or:
git switch -c experiment-xgboost

List all branches:

Bash

git branch

git branch

The current branch is marked with an asterisk (*).

Delete a branch (after merging):

Bash

git branch -d experiment-xgboost

git branch -d experiment-xgboost

A Branching Workflow Example

Here’s how a data scientist might use branches to safely experiment:

Bash

# Start on main branch with your stable model
git status
# On branch main, nothing to commit

# Create a branch to try XGBoost
git switch -c experiment-xgboost

# Make your experimental changes
# Edit model_training.py to use XGBoost...

# Commit your experiment
git add model_training.py
git commit -m "Replace RandomForest with XGBoost, initial attempt"

# More experiments...
git add .
git commit -m "Tune XGBoost learning rate and max depth"

# If the experiment succeeds, merge it into main
git switch main
git merge experiment-xgboost

# If it fails, just delete the branch - main is untouched
git switch main
git branch -d experiment-xgboost

# Start on main branch with your stable model
git status
# On branch main, nothing to commit

# Create a branch to try XGBoost
git switch -c experiment-xgboost

# Make your experimental changes
# Edit model_training.py to use XGBoost...

# Commit your experiment
git add model_training.py
git commit -m "Replace RandomForest with XGBoost, initial attempt"

# More experiments...
git add .
git commit -m "Tune XGBoost learning rate and max depth"

# If the experiment succeeds, merge it into main
git switch main
git merge experiment-xgboost

# If it fails, just delete the branch - main is untouched
git switch main
git branch -d experiment-xgboost

Your main branch always contains working, stable code. Experiments live in their own branches until they’re proven.

Merging

Merge a branch into your current branch:

Bash

git switch main
git merge experiment-xgboost

git switch main
git merge experiment-xgboost

If there are no conflicting changes, Git performs a clean merge automatically. If two branches modified the same lines of the same file differently, Git creates a merge conflict that you must resolve manually.

Resolving Merge Conflicts

A merge conflict looks like this in your file:

Bash

<<<<<<< HEAD
model = RandomForestClassifier(n_estimators=100)
=======
model = XGBClassifier(learning_rate=0.1, max_depth=5)
>>>>>>> experiment-xgboost

<<<<<<< HEAD
model = RandomForestClassifier(n_estimators=100)
=======
model = XGBClassifier(learning_rate=0.1, max_depth=5)
>>>>>>> experiment-xgboost

Git is showing you both versions. You need to edit the file to keep what you want, removing the conflict markers:

Bash

# Keep XGBoost version
model = XGBClassifier(learning_rate=0.1, max_depth=5)

# Keep XGBoost version
model = XGBClassifier(learning_rate=0.1, max_depth=5)

After editing, stage and commit to complete the merge:

Bash

git add model_training.py
git commit -m "Merge XGBoost experiment into main"

git add model_training.py
git commit -m "Merge XGBoost experiment into main"

Undoing Changes

One of Git’s most powerful features is the ability to undo mistakes. There are several ways to do this depending on your situation.

Discard changes in a file (before staging):

Bash

git restore model_training.py

git restore model_training.py

This reverts the file to the state of your last commit. Warning: this is irreversible for uncommitted changes.

Undo the last commit (keep changes in working directory):

Bash

git reset HEAD~1

git reset HEAD~1

This “uncommits” your last commit, putting the changes back into your working directory unstaged. Useful when you committed too early.

Create a new commit that reverses a previous commit:

Bash

git revert a3f8c2d

git revert a3f8c2d

This is the safe way to undo a commit that’s already been shared with others. It creates a new commit that applies the inverse of the specified commit, preserving history.

View a previous version of a file without changing your current state:

Bash

git show a3f8c2d:model_training.py

git show a3f8c2d:model_training.py

Working with Remote Repositories

So far, everything we’ve discussed happens on your local machine. Remote repositories (typically on GitHub) enable collaboration and provide a backup of your work.

Connecting to a Remote

When you clone a repository, the remote connection is set up automatically. When you initialize a new local repository, you need to connect it to a remote:

Bash

git remote add origin https://github.com/yourusername/your-repo.git

git remote add origin https://github.com/yourusername/your-repo.git

origin is the conventional name for your primary remote. You can have multiple remotes with different names.

List remote connections:

Bash

git remote -v

git remote -v

Pushing and Pulling

Push your commits to the remote:

Bash

git push origin main

git push origin main

This sends your local main branch’s commits to the remote repository.

Push a new branch to the remote:

Bash

git push origin experiment-xgboost

git push origin experiment-xgboost

Pull changes from the remote (fetch + merge):

Bash

git pull origin main

git pull origin main

This downloads new commits from the remote main branch and merges them into your local main branch. Use this to get your teammates’ latest changes.

Fetch without merging:

Bash

git fetch origin

git fetch origin

Downloads remote changes but doesn’t merge them into your current branch. Useful for reviewing what teammates have pushed before integrating their changes.

The .gitignore File

Not everything in your project directory should be version-controlled. The .gitignore file tells Git which files and directories to ignore completely.

What to Ignore in Data Science Projects

For data science projects, you typically want to ignore:

Large data files: Raw datasets (CSV, parquet, HDF5 files) are often too large for Git and belong in cloud storage (S3, GCS) or data versioning tools (DVC)
Model binary files: Trained model files (.pkl, .joblib, .h5) can be large and change frequently
Virtual environment folders: venv/, env/, .conda/
Python cache: __pycache__/, *.pyc, *.pyo
Jupyter checkpoint folders: .ipynb_checkpoints/
Environment/secret files: .env files containing API keys and passwords
OS-generated files: .DS_Store (macOS), Thumbs.db (Windows)
IDE configuration folders: .vscode/ (sometimes), .idea/

A Data Science .gitignore Template

Create a .gitignore file in your project root:

Plaintext

# Data files (store in S3, GCS, or DVC instead)
data/raw/
data/processed/
*.csv
*.parquet
*.h5
*.hdf5

# Model files
models/
*.pkl
*.joblib
*.h5

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/

# Virtual environments
venv/
env/
.conda/
.env

# Jupyter
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db

# Secrets and environment variables
.env
*.secret
config/secrets.yaml

# IDE
.vscode/settings.json
.idea/

# Data files (store in S3, GCS, or DVC instead)
data/raw/
data/processed/
*.csv
*.parquet
*.h5
*.hdf5

# Model files
models/
*.pkl
*.joblib
*.h5

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/

# Virtual environments
venv/
env/
.conda/
.env

# Jupyter
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db

# Secrets and environment variables
.env
*.secret
config/secrets.yaml

# IDE
.vscode/settings.json
.idea/

Important note on data files: The .gitignore approach ignores data files from Git entirely. For actual data version control (tracking which version of your dataset was used to train each model), look into DVC (Data Version Control) — a Git-like tool specifically designed for data and model files.

A Complete Data Science Git Workflow

Let’s walk through a realistic, end-to-end Git workflow for a data science project, from initialization to collaboration.

Setting Up a New Project

Bash

# Create project directory
mkdir customer-churn-prediction
cd customer-churn-prediction

# Initialize Git
git init

# Create project structure
mkdir data notebooks src tests
touch README.md .gitignore requirements.txt

# Add your .gitignore content
nano .gitignore
# (add data science gitignore content from above)

# Initial commit
git add README.md .gitignore requirements.txt
git commit -m "Initialize project structure"

# Connect to GitHub (after creating the repo on GitHub)
git remote add origin https://github.com/yourname/customer-churn-prediction.git
git push -u origin main

# Create project directory
mkdir customer-churn-prediction
cd customer-churn-prediction

# Initialize Git
git init

# Create project structure
mkdir data notebooks src tests
touch README.md .gitignore requirements.txt

# Add your .gitignore content
nano .gitignore
# (add data science gitignore content from above)

# Initial commit
git add README.md .gitignore requirements.txt
git commit -m "Initialize project structure"

# Connect to GitHub (after creating the repo on GitHub)
git remote add origin https://github.com/yourname/customer-churn-prediction.git
git push -u origin main

The Daily Development Cycle

Bash

# Start your day by pulling latest changes from teammates
git pull origin main

# Check status before starting work
git status

# Create a branch for your new work
git switch -c feature/add-feature-importance-plot

# Do your data science work...
# Edit notebooks/02_model_training.ipynb
# Edit src/visualization.py

# Review what changed
git diff src/visualization.py

# Stage your changes
git add src/visualization.py
git add notebooks/02_model_training.ipynb

# Commit with a descriptive message
git commit -m "Add feature importance bar chart to model evaluation"

# Push your branch to GitHub
git push origin feature/add-feature-importance-plot

# Open a Pull Request on GitHub for code review
# (done through the GitHub web interface)

# After approval and merge, clean up
git switch main
git pull origin main
git branch -d feature/add-feature-importance-plot

# Start your day by pulling latest changes from teammates
git pull origin main

# Check status before starting work
git status

# Create a branch for your new work
git switch -c feature/add-feature-importance-plot

# Do your data science work...
# Edit notebooks/02_model_training.ipynb
# Edit src/visualization.py

# Review what changed
git diff src/visualization.py

# Stage your changes
git add src/visualization.py
git add notebooks/02_model_training.ipynb

# Commit with a descriptive message
git commit -m "Add feature importance bar chart to model evaluation"

# Push your branch to GitHub
git push origin feature/add-feature-importance-plot

# Open a Pull Request on GitHub for code review
# (done through the GitHub web interface)

# After approval and merge, clean up
git switch main
git pull origin main
git branch -d feature/add-feature-importance-plot

Git for Jupyter Notebooks: Special Considerations

As mentioned in the previous article, Jupyter Notebook files (.ipynb) present specific challenges for Git because they store output cells in JSON format. Here’s how to handle this professionally.

Option 1: Strip Outputs Before Committing

Manually clear all outputs before committing a notebook: in Jupyter, go to Cell > All Output > Clear. Then commit the clean notebook.

Option 2: Use nbstripout

nbstripout is a tool that automatically strips notebook outputs before they’re committed to Git, acting as a Git filter:

Bash

pip install nbstripout
# Install as a Git filter for the current repository
nbstripout --install

pip install nbstripout
# Install as a Git filter for the current repository
nbstripout --install

After installation, whenever you git add a notebook, outputs are automatically stripped. Teammates running the notebook will re-generate outputs locally — but the notebook code stays clean and version-controllable.

Option 3: Use Jupytext

Jupytext synchronizes notebooks with plain .py scripts:

Bash

pip install jupytext

pip install jupytext

Configure Jupytext to auto-generate a .py version of every notebook. Commit the .py file (which diffs beautifully) and optionally ignore the .ipynb in .gitignore.

Git Commands Quick Reference

Here’s a comprehensive reference table for the commands covered in this guide:

Command	Description
`git init`	Initialize a new Git repository
`git clone <url>`	Clone a remote repository locally
`git status`	Show working directory and staging area status
`git log --oneline`	View compact commit history
`git diff <file>`	Show unstaged changes in a file
`git add <file>`	Stage a file for commit
`git add .`	Stage all changed files
`git commit -m "message"`	Commit staged changes with a message
`git push origin <branch>`	Push branch to remote
`git pull origin <branch>`	Pull and merge remote changes
`git fetch origin`	Download remote changes without merging
`git branch`	List branches
`git switch -c <branch>`	Create and switch to a new branch
`git switch <branch>`	Switch to an existing branch
`git merge <branch>`	Merge a branch into the current branch
`git branch -d <branch>`	Delete a merged branch
`git restore <file>`	Discard unstaged changes in a file
`git reset HEAD~1`	Undo last commit, keep changes staged
`git revert <hash>`	Create a new commit reversing a past commit
`git stash`	Temporarily save uncommitted changes
`git stash pop`	Restore stashed changes
`git remote -v`	List remote connections
`git remote add origin <url>`	Add a remote connection
`git config --list`	Show Git configuration

Common Git Mistakes and How to Fix Them

“I committed to the wrong branch”

Bash

# Move the last commit to a new branch
git switch -c correct-branch
git switch wrong-branch
git reset HEAD~1

# Move the last commit to a new branch
git switch -c correct-branch
git switch wrong-branch
git reset HEAD~1

“I accidentally committed a large data file”

Bash

# Remove the file from tracking (add to .gitignore first!)
git rm --cached large_dataset.csv
echo "large_dataset.csv" >> .gitignore
git add .gitignore
git commit -m "Remove large data file from tracking"

# Remove the file from tracking (add to .gitignore first!)
git rm --cached large_dataset.csv
echo "large_dataset.csv" >> .gitignore
git add .gitignore
git commit -m "Remove large data file from tracking"

Note: If you’ve already pushed this to a remote repository, the file is in the history and needs more advanced techniques (like git filter-branch or BFG Repo-Cleaner) to remove completely.

“I accidentally committed my API key”

This is serious. Treat the key as compromised — rotate it immediately through the service provider. Then remove it from Git history using git filter-branch or BFG Repo-Cleaner. Going forward, store secrets in .env files and add .env to .gitignore.

“My merge created a mess and I want to start over”

Bash

# Abort the merge and return to pre-merge state
git merge --abort

# Abort the merge and return to pre-merge state
git merge --abort

“I want to see what the file looked like 3 commits ago”

Bash

git show HEAD~3:model_training.py

git show HEAD~3:model_training.py

Or check it out temporarily:

Bash

git checkout HEAD~3 -- model_training.py
# Review the file, then restore the current version:
git checkout HEAD -- model_training.py

git checkout HEAD~3 -- model_training.py
# Review the file, then restore the current version:
git checkout HEAD -- model_training.py

Git Stash: Saving Work in Progress

Sometimes you’re in the middle of working on something when you need to quickly switch to another branch (maybe a bug was found in production). You’re not ready to commit your half-finished work, but you also don’t want to lose it.

git stash is the solution:

Bash

# Save your uncommitted changes temporarily
git stash

# Your working directory is now clean — switch branches freely
git switch main
# Fix the bug...
git commit -m "Fix critical bug in production pipeline"

# Switch back to your feature branch
git switch feature/my-work

# Restore your stashed changes
git stash pop

# Save your uncommitted changes temporarily
git stash

# Your working directory is now clean — switch branches freely
git switch main
# Fix the bug...
git commit -m "Fix critical bug in production pipeline"

# Switch back to your feature branch
git switch feature/my-work

# Restore your stashed changes
git stash pop

You can have multiple stashes and list them with git stash list.

Best Practices for Data Scientists Using Git

Commit Early and Often

Small, focused commits are much easier to understand, review, and revert than large, sweeping changes. Commit whenever you complete a logical unit of work, even if it’s just a small improvement.

One Logical Change Per Commit

Each commit should represent a single logical change. Don’t bundle “Fix missing value handling, add new features, and update README” into one commit. Three separate commits with clear messages are far more valuable.

Never Commit Directly to Main

Treat your main branch as sacred — it should always contain working, tested code. Do all your work in feature branches and merge to main only when the work is ready.

Write Meaningful Commit Messages

Invest 30 seconds in writing a clear commit message. Your future self will thank you six months from now when you’re trying to understand why you made a change.

Use a Consistent Branch Naming Convention

Clear branch names make project navigation much easier:

feature/add-shap-explanations
experiment/lstm-time-series
fix/handle-null-dates
refactor/simplify-preprocessing

Keep Data Out of Git

Data files belong in dedicated storage (S3, Google Cloud Storage, local data directories) or data version control tools like DVC. Git is for code, not data.

Review `git status` and `git diff` Before Every Commit

Before running git commit, always run git status to confirm what you’re committing and git diff --staged to review the actual line-level changes. This prevents accidental commits of debug code, secrets, or unintended files.

Beyond the Basics: What to Learn Next

Once you’re comfortable with the concepts in this article, several advanced Git topics will further improve your workflow:

Git rebase offers an alternative to merging that rewrites commit history to create a cleaner, linear project history. It’s controversial but widely used.

Interactive rebase (git rebase -i) lets you rewrite, squash, and reorder commits before sharing them — useful for cleaning up messy experimental work into clean, logical commits.

Git tags let you mark specific commits as significant milestones, like v1.0-baseline-model or 2024-q3-production-release.

GitHub Pull Requests (covered in the next article) provide the formal workflow for proposing, reviewing, and merging changes in team environments.

Git hooks are scripts that run automatically at specific points in the Git workflow — like running your test suite before every commit to prevent broken code from being committed.

DVC (Data Version Control) extends Git-like versioning to data files and ML models, enabling true reproducibility of ML experiments including the exact dataset and model weights used.

Summary

Git is an indispensable tool for data scientists, transforming chaotic, file-duplication-based workflows into professional, structured project management. The core workflow you’ll use every day is simple: make changes in your working directory, stage the changes you want to record with git add, and commit them with a clear message using git commit.

Branches let you experiment safely without affecting your stable main codebase. Remote repositories on GitHub provide collaboration infrastructure and off-site backup. The .gitignore file keeps data files, secrets, and generated files out of your repository.

The best way to learn Git is to start using it on every project, even personal ones with no collaborators. The habits you build — small, frequent commits; descriptive messages; feature branches; reviewing diffs before committing — become second nature quickly, and the protection and clarity they provide make you a significantly more effective and professional data scientist.

Key Takeaways

Git tracks file changes over time, enabling you to recover any previous version of your project and collaborate without overwriting teammates’ work
The three areas of Git — working directory, staging area, and repository — define a clear workflow: edit, stage, commit
Branches let you experiment safely in parallel without affecting your stable main branch
Good commit messages (descriptive, explaining the why) are an investment in your project’s long-term maintainability
The .gitignore file is essential for keeping data files, secrets, and generated files out of version control
Jupyter Notebooks require special handling in Git due to their JSON format and stored outputs — use nbstripout for clean diffs
The daily Git rhythm is: git pull → git switch -c (new branch) → do work → git add → git commit → git push
Start using Git on all your projects immediately — the habits it builds are foundational to professional data science work

0 Comments

Inline Feedbacks

View all comments

Discover More

Click For More

Version Control for Data Scientists: Git Basics

Introduction

What Is Version Control?

The Problem Version Control Solves

Types of Version Control Systems

Why Git Matters Specifically for Data Scientists

Tracking Experiments and Model Iterations

Collaboration Without Conflict

Reproducibility

Code Quality and Review

Installing and Configuring Git

Installation

Initial Configuration

Core Git Concepts

The Repository

The Three Areas of Git

Commits

Branches

The HEAD

Essential Git Commands

Starting a Repository

Checking Status and History

Staging and Committing

Writing Good Commit Messages

Branching

A Branching Workflow Example

Merging

Resolving Merge Conflicts

Undoing Changes

Working with Remote Repositories

Connecting to a Remote

Pushing and Pulling

The .gitignore File

What to Ignore in Data Science Projects

A Data Science .gitignore Template

A Complete Data Science Git Workflow

Setting Up a New Project

The Daily Development Cycle

Git for Jupyter Notebooks: Special Considerations

Option 1: Strip Outputs Before Committing

Option 2: Use nbstripout

Option 3: Use Jupytext

Git Commands Quick Reference

Common Git Mistakes and How to Fix Them

“I committed to the wrong branch”

“I accidentally committed a large data file”

“I accidentally committed my API key”

“My merge created a mess and I want to start over”

“I want to see what the file looked like 3 commits ago”

Git Stash: Saving Work in Progress

Best Practices for Data Scientists Using Git

Commit Early and Often

One Logical Change Per Commit

Never Commit Directly to Main

Write Meaningful Commit Messages

Use a Consistent Branch Naming Convention

Keep Data Out of Git

Review git status and git diff Before Every Commit

Beyond the Basics: What to Learn Next

Summary

Key Takeaways

Discover More

Understanding System Architecture: The Blueprint of Your Operating System

Introduction to JavaScript – Basics and Fundamentals

The History of Robotics: From Ancient Automata to Modern Machines

Understanding Force and Torque in Robot Design

The Role of Inductors: Understanding Magnetic Energy Storage

Interactive Data Visualization: Adding Filters and Interactivity

Review `git status` and `git diff` Before Every Commit