Debugging in data science is the systematic process of identifying and fixing errors in Python code, data transformations, and model pipelines — ranging from straightforward syntax errors that prevent code from running, to subtle logic errors that silently corrupt results and produce convincing but incorrect output. Effective data science debugging requires a combination of Python debugging tools (print statements, the pdb debugger, IDE breakpoints), data inspection techniques (shape checks, value audits, distribution comparisons), and disciplined mental models for tracing problems through multi-step pipelines.
Introduction
Every data scientist, at every experience level, spends a significant portion of their working time debugging. Studies of software developers suggest that debugging accounts for roughly 50% of total development time — and in data science, where pipelines are longer, data is messier, and failures can be silent, that proportion may be even higher.
What separates fast debuggers from slow ones is not intelligence — it’s methodology. Inexperienced debuggers attack bugs randomly: change something, run again, change something else, run again, growing increasingly frustrated with each iteration. Experienced debuggers approach bugs like scientists: form a hypothesis about the cause, design a test that would confirm or refute it, execute the test, update the hypothesis, and repeat. This systematic approach finds bugs faster, with less wasted effort, and produces better understanding of the code in the process.
Data science debugging has unique characteristics that distinguish it from general Python debugging. Data scientists deal with bugs that don’t raise errors at all — a wrong merge key that silently duplicates rows, a feature scaling applied to test data using test statistics instead of training statistics, a label encoding that maps categories differently on different subsets. These “silent failures” can propagate through an entire pipeline, producing models that train and evaluate without errors but perform poorly or incorrectly in production.
This guide covers the full spectrum of debugging in data science: the types of bugs you’ll encounter, the tools available for finding them, systematic strategies for isolating problems, and patterns for the most common data science–specific bugs. By the end, you’ll have a methodological toolkit that makes debugging faster, less frustrating, and even intellectually satisfying.
Understanding the Types of Bugs in Data Science
Different bugs require different debugging approaches. Knowing what kind of bug you’re dealing with is the first step toward fixing it efficiently.
Type 1: Syntax Errors
Syntax errors are the simplest category — Python’s parser can’t understand the code and refuses to run it at all. These are usually trivially identified because Python tells you exactly where the problem is.
# SyntaxError: Missing closing parenthesis
df = pd.read_csv("data.csv"
# IndentationError: Function body must be indented
def clean_data(df):
df = df.dropna() # ← Should be indented
return df
# SyntaxError: Using = instead of == in a condition
if df.shape[0] = 0: # ← Should be ==
    raise ValueError("Empty DataFrame")

Debugging approach: Read the error message carefully — Python tells you the file and line number. Modern IDEs (VS Code, PyCharm) underline syntax errors in red before you even run the code.
Type 2: Runtime Errors (Exceptions)
Runtime errors occur during execution — the code is syntactically valid but fails when run because of an unexpected condition.
Common data science runtime errors:
# KeyError: Column doesn't exist in DataFrame
df['revenue_per_customer'] # ← Column named 'revenue' not 'revenue_per_customer'
# TypeError: Wrong type passed to a function
np.log("5")  # ← Can't take the log of a string
# Note: np.log(-5) is a silent failure — it returns nan with only a RuntimeWarning, not an exception
# ValueError: Shape mismatch
scaler.transform(X_test) # ← X_test has different number of features than X_train
# AttributeError: Method doesn't exist
df.drop_duplicte() # ← Typo: should be drop_duplicates()
# IndexError: Out of bounds
features_list[50] # ← List only has 47 elements
# FileNotFoundError: Wrong path
pd.read_csv("data/transactions.csv")  # ← File is in data/raw/transactions.csv

Debugging approach: Python’s traceback tells you exactly what happened and where. Read it from bottom to top — the bottom line is the immediate error; the lines above show how execution arrived there.
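Many of these runtime errors can be surfaced earlier with a cheap guard before the failing operation. A minimal sketch (the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 250.0]})

# Check for the columns the next step needs, instead of letting a
# KeyError surface deep inside a transformation
required = ["revenue", "revenue_per_customer"]
missing = [c for c in required if c not in df.columns]
if missing:
    print(f"Missing columns: {missing}")
```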
Type 3: Logic Errors
Logic errors are the most dangerous category. The code runs without raising any exception, but it produces wrong results. In data science, these are extraordinarily common and often go undetected for a long time.
# Logic error: Wrong merge key — creates data leakage
# Intended: join transactions to customers
merged = pd.merge(transactions, customers, on='id')
# Problem: 'id' is transaction_id in transactions but customer_id in customers
# Result: exploded DataFrame with many-to-many joins, silent data corruption
# Logic error: Fitting scaler on test data — data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Correct
X_test_scaled = scaler.fit_transform(X_test) # Wrong! Should be .transform()
# Problem: Test set statistics contaminate training evaluation
# Logic error: Off-by-one in rolling window
df['rolling_avg_7d'] = df['amount'].rolling(window=7).mean()
# But the index wasn't sorted by date first — rolling on unsorted data is meaningless
# Logic error: Wrong comparison operator
high_value = df[df['amount'] > 100] # Should be >= 100
# Excludes exactly 100, possibly a significant category

Debugging approach: Logic errors require data inspection and reasoning. You can’t rely on error messages — you need to verify assumptions at each step of the pipeline. This is where most debugging time is actually spent.
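One lightweight defense against such silent logic errors is a sanity check immediately after the suspect step. A small sketch with toy data, illustrating the exploding-merge case:

```python
import pandas as pd

# Toy tables with duplicate join keys on both sides — a many-to-many join
transactions = pd.DataFrame({"id": [1, 1, 2], "amount": [10, 10, 30]})
customers = pd.DataFrame({"id": [1, 1, 2], "segment": ["a", "b", "c"]})

merged = pd.merge(transactions, customers, on="id")
# A merge should not normally grow the left table; if it does, investigate
if len(merged) > len(transactions):
    print(f"WARNING: merge grew the table from {len(transactions)} "
          f"to {len(merged)} rows")
```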
Type 4: Performance Bugs
These aren’t errors in the traditional sense — the code produces correct output, but far too slowly or using far too much memory.
# Performance bug: Iterating row-by-row instead of using vectorized operations
for i, row in df.iterrows():
    df.at[i, 'score'] = row['amount'] * 0.1 + row['frequency'] * 5
# ~1000x slower than: df['score'] = df['amount'] * 0.1 + df['frequency'] * 5
# Memory bug: Loading entire dataset when only a sample is needed
df = pd.read_csv("massive_file.csv") # 8GB file, crashes on 16GB machine
# Better: df = pd.read_csv("massive_file.csv", nrows=10000) for development
# Performance bug: Unnecessary DataFrame copies in a loop
results = []
for chunk in data_chunks:
    results.append(process(chunk))
final = pd.concat(results)  # This is actually fine — one concat at the end
# vs:
final = pd.DataFrame()
for chunk in data_chunks:
    final = pd.concat([final, process(chunk)])  # O(n²) — terrible for large n

Debugging approach: Profiling tools (cProfile, line_profiler, memory_profiler) identify where time and memory are actually being spent.
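As a minimal illustration of that profiling workflow, the standard library's cProfile reports where time is spent; the deliberately slow function here is a made-up example:

```python
import cProfile
import io
import pstats

def slow_square_sum(n):
    # Deliberately unvectorized loop — the profiler will show it dominates runtime
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_square_sum(200_000)
profiler.disable()

# Print the most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

For per-line timings rather than per-function totals, line_profiler and memory_profiler follow the same decorate-and-report pattern.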
Tool 1: Print Debugging — The Humble Workhorse
Despite more sophisticated tools existing, print-based debugging remains the most-used debugging technique in data science, for good reason: it’s fast, requires no setup, and works everywhere — scripts, notebooks, remote servers, and CI/CD pipelines.
Effective Print Debugging
The key to effective print debugging is being systematic rather than scattering prints randomly. Place prints at boundaries — before and after major transformations — and make them informative:
def preprocess_pipeline(df):
    print(f"[START] Input shape: {df.shape}")
    print(f"[START] Null counts:\n{df.isnull().sum()}")
    # Step 1: Remove duplicates
    df = df.drop_duplicates()
    print(f"[AFTER drop_duplicates] Shape: {df.shape}")
    # Step 2: Handle missing values
    prev_rows = df.shape[0]
    df = df.dropna(subset=['customer_id', 'amount'])
    print(f"[AFTER dropna] Shape: {df.shape}, "
          f"Rows removed: {prev_rows - df.shape[0]}")
    # Step 3: Type conversion
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    print(f"[AFTER type conversion] "
          f"date dtype: {df['transaction_date'].dtype}")
    # Step 4: Filter invalid dates
    df = df[df['transaction_date'] >= '2020-01-01']
    print(f"[AFTER date filter] Shape: {df.shape}")
    # Step 5: Compute derived columns
    df['amount_log'] = np.log1p(df['amount'])
    print(f"[AFTER log transform] "
          f"amount_log stats:\n{df['amount_log'].describe()}")
    print(f"[END] Final shape: {df.shape}")
    return df

This style of instrumentation — sometimes called “logging checkpoints” — makes it immediately obvious at which step the data changes unexpectedly.
Data-Specific Print Diagnostics
For data science debugging, the most useful things to print aren’t just shapes — they’re data-level summaries:
def debug_dataframe(df, label=""):
    """Print comprehensive diagnostic information about a DataFrame."""
    header = f"=== DataFrame Debug: {label} ===" if label else "=== DataFrame Debug ==="
    print(header)
    print(f"Shape: {df.shape}")
    print(f"Dtypes:\n{df.dtypes}")
    print(f"Null counts:\n{df.isnull().sum()[df.isnull().sum() > 0]}")
    print(f"Duplicate rows: {df.duplicated().sum()}")
    for col in df.select_dtypes(include='number').columns[:5]:
        print(f"{col}: min={df[col].min():.3f}, max={df[col].max():.3f}, "
              f"mean={df[col].mean():.3f}, nulls={df[col].isnull().sum()}")
    print()
# Use it throughout your pipeline
raw_df = pd.read_csv("data.csv")
debug_dataframe(raw_df, "Raw data")
clean_df = clean_data(raw_df)
debug_dataframe(clean_df, "After cleaning")
feature_df = build_features(clean_df)
debug_dataframe(feature_df, "After feature engineering")

Removing Print Statements: The Logging Module
Production code shouldn’t have print() statements scattered throughout it. Use Python’s logging module instead — it supports log levels, can be turned on/off without changing code, and writes to files as well as the console:
import logging
logging.basicConfig(
level=logging.DEBUG,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
def preprocess_pipeline(df):
    logger.info(f"Starting preprocessing. Input shape: {df.shape}")
    df = df.drop_duplicates()
    logger.debug(f"After drop_duplicates: {df.shape}")
    df = df.dropna(subset=['customer_id'])
    logger.info(f"After dropna: {df.shape}")
    if df.shape[0] == 0:
        logger.error("DataFrame is empty after preprocessing!")
        raise ValueError("Empty DataFrame after preprocessing")
    logger.info("Preprocessing complete.")
    return df

In production, set level=logging.INFO (suppress DEBUG messages). During debugging, set level=logging.DEBUG to see all messages. No code changes needed — just a configuration change.
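One way to make that switch purely configurational is to read the level from an environment variable. A minimal sketch — the LOG_LEVEL variable name is a convention assumed here, not a logging-module built-in:

```python
import logging
import os

# Hypothetical convention: control verbosity via a LOG_LEVEL environment
# variable, so switching DEBUG/INFO needs no code edit
level_name = os.environ.get("LOG_LEVEL", "INFO")
logging.basicConfig(
    level=getattr(logging, level_name.upper(), logging.INFO),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)
logger.debug("Only visible when LOG_LEVEL=DEBUG")
logger.info("Visible at INFO and below")
```

Running `LOG_LEVEL=DEBUG python pipeline.py` then turns on the verbose checkpoints without touching the source.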
Tool 2: Python’s Built-in Debugger — pdb
pdb (Python Debugger) is Python’s built-in interactive debugger. It lets you pause execution at any point, inspect variables, execute arbitrary expressions, and step through code line by line — all without leaving the terminal.
Starting pdb
Method 1: Insert a breakpoint in code
def build_features(df):
    rfm = compute_rfm(df)
    # Pause execution here and open interactive debugger
    breakpoint()  # Python 3.7+ (replaces the older: import pdb; pdb.set_trace())
    features = merge_feature_sets(rfm, behavioral_features)
    return features

When execution hits breakpoint(), Python pauses and opens the pdb interactive prompt:
> /path/to/your/file.py(8)build_features()
-> features = merge_feature_sets(rfm, behavioral_features)
(Pdb)

Method 2: Launch a script directly in pdb

python -m pdb src/train_model.py --config configs/config.yaml

Method 3: Post-mortem debugging — investigate after a crash
import pdb
try:
    result = run_pipeline(df)
except Exception as e:
    print(f"Error: {e}")
    pdb.post_mortem()  # Opens debugger at the point of the crash

Essential pdb Commands
| Command | Shortcut | Description |
|---|---|---|
| help | h | Show all commands |
| list | l | Show current code context (11 lines around current position) |
| next | n | Execute next line (step over function calls) |
| step | s | Step into the next function call |
| return | r | Continue until current function returns |
| continue | c | Resume execution until next breakpoint |
| quit | q | Exit the debugger |
| print(expr) | p expr | Evaluate and print an expression |
| pp expr | — | Pretty-print an expression (better for dicts/lists) |
| where | w | Show the call stack (where am I in the program?) |
| up | u | Move up one level in the call stack |
| down | d | Move down one level in the call stack |
| break n | b n | Set a breakpoint at line n |
| !statement | — | Execute arbitrary Python in the current context |
pdb in Action: A Data Science Example
# Debugging a mysterious row count drop in preprocessing
(Pdb) l
5 df = df.drop_duplicates()
6 print(f"After dedup: {df.shape}")
7
8 -> df = df.merge(customers, on='customer_id', how='inner')
9 print(f"After merge: {df.shape}")
10
(Pdb) p df.shape
(45231, 8)
(Pdb) p customers.shape
(38940, 5)
(Pdb) p df['customer_id'].nunique()
45231
(Pdb) p customers['customer_id'].nunique()
38940
(Pdb) p len(set(df['customer_id']) - set(customers['customer_id']))
6291
# Found it! 6,291 transaction customer_ids don't exist in the customers table
# The inner merge will drop these rows — that's the missing rows we're seeing
# Fix: investigate why customers table is missing these IDs
(Pdb) c  # continue

This interaction found the root cause in under 2 minutes.
ipdb: The Enhanced Debugger
ipdb is a drop-in replacement for pdb that adds IPython-style tab completion, syntax highlighting, and better display of objects:
pip install ipdb

import ipdb
ipdb.set_trace()  # Or just breakpoint(), if PYTHONBREAKPOINT is set to ipdb.set_trace

Set ipdb as the default debugger:

export PYTHONBREAKPOINT=ipdb.set_trace

Now breakpoint() everywhere opens ipdb automatically.
Tool 3: IDE Debuggers — PyCharm and VS Code
For the most powerful debugging experience, use your IDE’s graphical debugger. IDE debuggers provide everything pdb does but with a visual interface that shows variables, call stacks, and code simultaneously — dramatically reducing cognitive load.
Setting Breakpoints in VS Code
Click in the margin to the left of any line number — a red dot appears. Run the file with F5 (Debug mode). Execution pauses at the breakpoint.
The Debug panel shows:
- Variables: Every variable in the current scope with current values. DataFrames show their shape and dtypes. Click the arrow to expand objects.
- Watch: Type any expression (e.g., df.shape, df['amount'].isnull().sum()) and VS Code evaluates it continuously as you step through code
- Debug Console: A REPL where you can type arbitrary Python in the current execution context
Debug Console: The Hidden Gem
The Debug Console (or “Evaluate Expression” in PyCharm) is the most useful debugging tool most data scientists underuse. While paused at a breakpoint, the debug console lets you run arbitrary code in the current scope:
> df.shape
(45231, 8)
> df['customer_id'].duplicated().sum()
0
> df.dtypes
customer_id object
transaction_date object ← Should be datetime!
amount float64
> pd.to_datetime(df['transaction_date']).head()
0 2024-01-15
1 2024-01-16
...
> df['transaction_date'].head()
0    "2024-01-15"   ← It's a string, not datetime — found the bug!

The debug console lets you explore the data interactively without modifying your code — test fixes before implementing them, verify assumptions, understand data properties.
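Once the console reveals that the date column is a string, the fix can be tried on a toy frame before touching the pipeline. A minimal sketch:

```python
import pandas as pd

# Reproduce the bug: dates stored as strings (dtype object)
df = pd.DataFrame({"transaction_date": ["2024-01-15", "2024-01-16"]})
print(df["transaction_date"].dtype)  # object — date comparisons and .dt misbehave

# The fix: parse to datetime64 before any date filtering or arithmetic
df["transaction_date"] = pd.to_datetime(df["transaction_date"])
print(df["transaction_date"].dtype)  # datetime64[ns]
```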
Conditional Breakpoints
For data science pipelines processing thousands of rows, you don’t want to pause at a breakpoint for every row — you want to pause when something specific goes wrong. Both VS Code and PyCharm support conditional breakpoints:
# Right-click a breakpoint → Add Condition
# Pause when the DataFrame drops below expected size
df.shape[0] < 40000
# Pause when a suspicious value appears
df['amount'].max() > 50000
# Pause on a specific iteration
customer_id == 'CUST_98765'
# Pause when an unexpected NaN appears
df['customer_id'].isnull().any()

Debugging Data Science–Specific Bugs
Beyond general Python debugging, data science has its own category of bugs that require specific diagnostic approaches.
Bug Category 1: Silent Row Loss
One of the most common data science bugs is rows disappearing from your DataFrame without errors or warnings. Your pipeline starts with 50,000 rows and ends with 43,000, and you don’t know where the 7,000 went.
Systematic approach — count rows at every step:
def preprocessing_with_audit(df, verbose=True):
    audit = {'start': len(df)}
    df = df.drop_duplicates()
    audit['after_dedup'] = len(df)
    df = df.dropna(subset=['customer_id', 'amount'])
    audit['after_dropna_required'] = len(df)
    df = df[df['amount'] > 0]
    audit['after_positive_amount'] = len(df)
    df = df[df['transaction_date'] >= '2020-01-01']
    audit['after_date_filter'] = len(df)
    df = df.merge(customers, on='customer_id', how='inner')
    audit['after_merge'] = len(df)
    if verbose:
        print("\n=== Row Count Audit ===")
        prev = audit['start']
        for step, count in audit.items():
            diff = count - prev if step != 'start' else 0
            flag = " ← LARGE DROP" if abs(diff) > 1000 else ""
            print(f"{step:35s}: {count:7,d} ({diff:+,d}){flag}")
            prev = count
    return df

Output:
=== Row Count Audit ===
start : 50,000 (+0)
after_dedup : 49,847 (-153)
after_dropna_required : 49,821 (-26)
after_positive_amount : 49,798 (-23)
after_date_filter : 49,798 (+0)
after_merge : 43,507 (-6,291) ← LARGE DROPImmediately clear: the merge is responsible for the large drop.
Bug Category 2: Shape and Type Mismatches
Shape and type errors are common when code written for one dataset is applied to another:
def diagnose_ml_data(X_train, X_test, y_train, y_test, feature_names=None):
    """Print comprehensive diagnostics before model training."""
    print("=== ML Data Diagnostics ===")
    print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
    # Check for shape compatibility
    if X_train.shape[1] != X_test.shape[1]:
        print(f"ERROR: Feature count mismatch! "
              f"Train: {X_train.shape[1]}, Test: {X_test.shape[1]}")
    # Check target distribution
    print("\ny_train class distribution:")
    print(pd.Series(y_train).value_counts(normalize=True).round(3))
    print("\ny_test class distribution:")
    print(pd.Series(y_test).value_counts(normalize=True).round(3))
    # Check for NaN/Inf values
    if isinstance(X_train, np.ndarray):
        n_nan = np.isnan(X_train).sum()
        n_inf = np.isinf(X_train).sum()
    else:  # DataFrame
        n_nan = X_train.isnull().sum().sum()
        n_inf = np.isinf(X_train.select_dtypes(include='number')).sum().sum()
    if n_nan > 0:
        print(f"\nWARNING: {n_nan} NaN values in X_train!")
    if n_inf > 0:
        print(f"\nWARNING: {n_inf} infinite values in X_train!")
    # Feature value ranges
    if feature_names is not None and isinstance(X_train, np.ndarray):
        print("\nFeature value ranges (first 10):")
        for i, name in enumerate(feature_names[:10]):
            col = X_train[:, i]
            print(f"  {name:30s}: [{col.min():.3f}, {col.max():.3f}], "
                  f"mean={col.mean():.3f}")

Bug Category 3: Data Leakage
Data leakage is when information from the test set “leaks” into the training process, producing optimistic evaluation metrics that don’t reflect real-world performance. It’s one of the most insidious bugs in machine learning because the model trains successfully and evaluates well — but fails in production.
Common leakage patterns and how to detect them:
# LEAKAGE BUG 1: Fitting preprocessors on entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fits on ALL data including test
X_train_scaled = X_scaled[:n_train]
X_test_scaled = X_scaled[n_train:]
# Problem: Test set statistics influenced the scaler
# CORRECT: Fit only on training data
X_train, X_test = X[:n_train], X[n_train:]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit only on train
X_test_scaled = scaler.transform(X_test) # Transform (no fit) on test
# LEAKAGE BUG 2: Including future information as a feature
df['next_month_revenue'] = df.groupby('customer_id')['revenue'].shift(-1)
# This feature requires knowing the future — data leakage!
# LEAKAGE BUG 3: Target encoding computed on full dataset
df['category_avg_revenue'] = df.groupby('category')['revenue'].transform('mean')
# Computed using test set revenue values — leakage!
# Fix: compute target encoding only on training fold in cross-validation
# DIAGNOSTIC: Check for suspiciously high feature-target correlations
feature_target_corr = pd.DataFrame({
'feature': X_train.columns,
'corr_with_target': [X_train[col].corr(y_train) for col in X_train.columns]
}).sort_values('corr_with_target', ascending=False)
print("Top correlated features (investigate if > 0.9):")
print(feature_target_corr.head(10))

Bug Category 4: Merge Bugs
Incorrect DataFrame merges are among the most common silent corruption sources:
def safe_merge(left, right, on, how='inner', validate=None, verbose=True):
    """
    Merge with comprehensive diagnostic output to catch common merge bugs.
    """
    left_rows = len(left)
    right_rows = len(right)
    # Check for duplicate keys before merging
    # (Series.duplicated for a single key column, DataFrame.duplicated for a key list)
    left_dupes = left[on].duplicated().sum()
    right_dupes = right[on].duplicated().sum()
    if verbose:
        print(f"Merging: {left_rows:,} rows × {right_rows:,} rows on '{on}' ({how})")
        if left_dupes > 0:
            print(f"  WARNING: {left_dupes:,} duplicate keys in left DataFrame")
        if right_dupes > 0:
            print(f"  WARNING: {right_dupes:,} duplicate keys in right DataFrame")
    # Perform the merge
    merged = pd.merge(left, right, on=on, how=how, validate=validate)
    if verbose:
        actual = len(merged)
        print(f"  Result: {actual:,} rows")
        if how == 'inner' and actual < left_rows * 0.9:
            print(f"  WARNING: Lost {left_rows - actual:,} rows ({(1 - actual/left_rows):.1%})")
        if actual > max(left_rows, right_rows):
            print(f"  WARNING: Row explosion! Result larger than either input")
            print(f"  Likely cause: Many-to-many join (duplicate keys on both sides)")
    return merged
# Usage
transactions_enriched = safe_merge(
    transactions, customers,
    on='customer_id',
    how='inner',
    validate='many_to_one'  # Each transaction should match one customer
)

Bug Category 5: Label Encoding Inconsistencies
When categorical features are encoded differently between training and inference:
# Common bug: Using different encoding in training vs. inference
# Training:
train_df['channel_encoded'] = train_df['channel'].map(
{'web': 0, 'mobile': 1, 'store': 2}
)
# Inference — a new category appeared in production data
inference_df['channel_encoded'] = inference_df['channel'].map(
{'web': 0, 'mobile': 1, 'store': 2}
)
# If 'phone' appears: maps to NaN silently!
# Better: Use sklearn's LabelEncoder or OrdinalEncoder with handle_unknown
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train_df[['channel']])
# Training and inference use same fitted encoder
train_encoded = encoder.transform(train_df[['channel']])
inference_encoded = encoder.transform(inference_df[['channel']])
# Unknown categories get -1 instead of NaN — detectable, not silent
# Diagnostic: check for NaN after encoding
def check_encoding_coverage(df, encoded_col, original_col):
    n_nan = df[encoded_col].isnull().sum()
    if n_nan > 0:
        unseen = df[df[encoded_col].isnull()][original_col].value_counts()
        print(f"WARNING: {n_nan} NaN after encoding '{original_col}'")
        print(f"Unseen categories:\n{unseen}")

Debugging Strategies and Mental Models
Having tools is necessary but not sufficient — you also need systematic strategies for using them.
Strategy 1: Binary Search Debugging
When a bug occurs somewhere in a long pipeline, binary search isolates it efficiently. Instead of checking every step sequentially, check the midpoint. If the midpoint is correct, the bug is in the second half; if not, it’s in the first half. Repeat until isolated.
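This search can be packaged as a small reusable helper. Since a real pipeline and its correctness check depend on your project, they are modeled here as a hypothetical list of step functions and a predicate:

```python
def find_first_bad_step(data, steps, is_correct):
    """Binary-search an ordered list of step functions for the first one
    whose output fails is_correct. Assumes the raw input is correct and
    the final output is not."""
    def run_to(n):
        out = data
        for step in steps[:n]:
            out = step(out)
        return out

    lo, hi = 0, len(steps)  # invariant: correct after lo steps, wrong after hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_correct(run_to(mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1  # 0-based index of the first bad step

# Tiny demonstration: the third step (index 2) introduces the bug
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 100, lambda x: x + 1]
bad = find_first_bad_step(0, steps, is_correct=lambda v: v >= 0)
print(bad)  # → 2
```

Re-running earlier steps repeatedly is wasteful for slow pipelines; caching intermediate outputs to disk makes each probe cheap.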
# Pipeline: Steps 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10
# Bug: final output is wrong
# Round 1: Check step 5 (midpoint)
intermediate = run_pipeline_to_step(df, stop_after=5)
is_correct(intermediate) # True → bug is in steps 6-10
# Round 2: Check step 7-8 area
intermediate = run_pipeline_to_step(df, stop_after=8)
is_correct(intermediate) # False → bug is in steps 6-8
# Round 3: Check step 6 or 7
intermediate = run_pipeline_to_step(df, stop_after=6)
is_correct(intermediate) # False → bug is in step 6
# Found! 3 checks instead of up to 10

Strategy 2: Simplify and Isolate
When a bug is hard to reproduce or understand, simplify the input until you have a minimal reproducible example:
# Original failing code: complex pipeline with 500K rows
result = run_full_pipeline(large_df) # Fails mysteriously
# Step 1: Try with a tiny sample
small_df = large_df.head(100)
result = run_full_pipeline(small_df) # Still fails → not a scale issue
# Step 2: Construct a minimal DataFrame that reproduces the bug
minimal_df = pd.DataFrame({
'customer_id': ['A', 'A', 'B'],
'amount': [100.0, 200.0, None], # ← Null amount is the issue
'date': ['2024-01-01', '2024-01-02', '2024-01-03']
})
result = run_full_pipeline(minimal_df) # Fails → found the minimal reproduction
# Now it's clear: the bug is triggered by null values in 'amount'Minimal reproducible examples have a second benefit: they’re easy to share with colleagues or post in StackOverflow when you need help.
Strategy 3: Verify Assumptions Explicitly
The most common root cause of data science bugs is an assumption about the data that turns out to be false. The fix: stop assuming, start verifying.
def verify_assumptions(df):
    """Explicitly check all assumptions before processing."""
    # Assumption: all required columns are present (checked first — the
    # remaining checks would raise KeyError on a missing column)
    required_cols = ['customer_id', 'transaction_date', 'amount', 'channel']
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise AssertionError(f"Missing required columns: {missing}")
    errors = []
    # Assumption: customer_id is never null
    null_ids = df['customer_id'].isnull().sum()
    if null_ids > 0:
        errors.append(f"customer_id has {null_ids} nulls")
    # Assumption: customer_id is unique (one row per customer)
    dupes = df['customer_id'].duplicated().sum()
    if dupes > 0:
        errors.append(f"customer_id has {dupes} duplicates (not unique)")
    # Assumption: amount is always positive
    negative = (df['amount'] <= 0).sum()
    if negative > 0:
        errors.append(f"{negative} rows have non-positive amount")
    # Assumption: date column is datetime type
    if df['transaction_date'].dtype != 'datetime64[ns]':
        errors.append(f"transaction_date dtype is {df['transaction_date'].dtype}, expected datetime")
    if errors:
        for error in errors:
            print(f"ASSUMPTION VIOLATED: {error}")
        raise AssertionError(f"{len(errors)} assumption violations found")
    print("All assumptions verified.")
    return df

Using assert statements for lightweight assumption checking:
def compute_rfm_scores(df):
    assert 'customer_id' in df.columns, "Missing customer_id column"
    assert df['customer_id'].notna().all(), "customer_id contains nulls"
    assert (df['amount'] > 0).all(), "amount must be positive for monetary score"
    assert df['transaction_date'].dtype == 'datetime64[ns]', \
        f"Expected datetime, got {df['transaction_date'].dtype}"
    # ... rest of function

Strategy 4: Rubber Duck Debugging
Explain your code aloud — to a colleague, or even to an inanimate object (the “rubber duck”). The act of articulating what you expect the code to do, step by step, frequently reveals the discrepancy between what you think the code does and what it actually does. The moment you find yourself saying “…and then this should return X… wait, actually it returns Y because…” — that’s the bug.
Strategy 5: Read the Traceback Carefully
Python’s error tracebacks contain exactly the information needed to find most bugs, but they require careful reading. Many beginners read only the last line (the exception message) and miss the crucial context in the lines above it.
Traceback (most recent call last):
File "src/train_model.py", line 87, in main
X_train_scaled = preprocess_features(X_train, scaler) ← Entered here
File "src/preprocessing.py", line 34, in preprocess_features
X_scaled = scaler.transform(X) ← Called this
File "sklearn/preprocessing/_data.py", line 970, in transform
check_is_fitted(self) ← Failed here
File "sklearn/utils/validation.py", line 1463, in check_is_fitted
raise NotFittedError(...)
sklearn.exceptions.NotFittedError: This StandardScaler instance is not
fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Reading from bottom to top: the scaler isn’t fitted. Why? Look at line 34: scaler.transform(X) — the scaler was created but fit() was never called. Look at line 87: preprocess_features(X_train, scaler) — the scaler was passed in as a parameter. The bug: the caller is responsible for fitting the scaler before passing it.
Debugging in Jupyter Notebooks
Notebooks have unique debugging characteristics because of their stateful, non-linear execution model.
The Hidden State Problem
The most common notebook bug is the hidden state problem: you run cells out of order, redefine a variable, and now the notebook’s behavior depends on which cells you ran, in what order — not just on the code in the cells.
# Cell 1:
x = 10
# Cell 2:
x = x * 2
print(x) # Prints 20 if Cell 1 ran first, 40 if Cell 2 ran twice, etc.
# Cell 3:
x = 100
# The value of x is now unknowable without knowing execution order

Diagnosis and fix: Run Kernel → Restart & Run All regularly. This is the only way to verify that your notebook actually works from top to bottom. Do this before sharing any notebook or before drawing conclusions from results.
%debug Magic Command
After a notebook cell raises an exception, run %debug in the next cell to open an interactive pdb session at the point of the crash:
# Cell raises an exception:
result = process_data(df)
# ValueError: cannot convert float NaN to integer
# In the very next cell:
%debug
# Opens pdb at the point of the error — inspect df, intermediate values, etc.

The %pdb Magic
Enable automatic post-mortem debugging whenever a cell raises an exception:
%pdb on
# Now every exception automatically opens pdb — no need to manually add %debug

Debugging Kernel Issues
If your notebook kernel dies (memory error, crash) or produces results that seem inconsistent with the code:
- Kernel → Restart & Clear Output: Starts fresh but keeps your code
- Kernel → Restart & Run All: Verifies clean, sequential execution
- Check memory usage — large DataFrames in multiple variables can exhaust RAM
- Use del variable_name and gc.collect() to free memory from variables you no longer need
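A minimal sketch of that last point — keep only the small derived result and explicitly release the large intermediate (the data here is synthetic):

```python
import gc

import numpy as np
import pandas as pd

# A large intermediate DataFrame (synthetic data for illustration)
raw_df = pd.DataFrame(np.random.rand(100_000, 20))

summary = raw_df.describe()  # keep only the small summary downstream

del raw_df    # drop the last reference to the large object
gc.collect()  # ask the garbage collector to reclaim the memory now
```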
Building Defensive Data Science Code
The best debugging strategy is preventing bugs from hiding. Defensive coding practices make bugs loud and early rather than silent and late.
Validate Data Schemas
Use libraries like pandera or great_expectations to define and enforce data schemas:
import pandera as pa
transactions_schema = pa.DataFrameSchema({
"customer_id": pa.Column(str, nullable=False),
"transaction_date": pa.Column(pa.DateTime, nullable=False),
"amount": pa.Column(float, pa.Check.greater_than(0), nullable=False),
"channel": pa.Column(str, pa.Check.isin(
['web', 'mobile_app', 'store', 'phone', 'unknown']
), nullable=True),
}, checks=[
pa.Check(lambda df: df['amount'].max() < 100_000,
error="Extreme amount values — likely data error")
])
@pa.check_input(transactions_schema)
def compute_rfm_scores(df):
    # Function guaranteed to receive valid data
    ...

If the DataFrame violates the schema, the decorator raises a clear, descriptive error — before the bug can propagate deep into the pipeline.
Use Python Type Hints
Type hints make function contracts explicit and enable static analysis tools (mypy, Pylance) to catch type errors before runtime:
from typing import Tuple, Optional
import pandas as pd
import numpy as np
def split_features_target(
    df: pd.DataFrame,
    target_col: str,
    drop_cols: Optional[list[str]] = None
) -> Tuple[pd.DataFrame, pd.Series]:
    """Split DataFrame into features and target."""
    cols_to_drop = [target_col] + (drop_cols or [])
    X = df.drop(columns=cols_to_drop)
    y = df[target_col]
    return X, y

A Debugging Checklist for Data Science Projects
return X, yA Debugging Checklist for Data Science Projects
When you encounter a bug, work through this checklist systematically:
1. Read the error message and traceback completely (bottom to top)
2. Identify which type of bug this is (syntax, runtime, logic, performance)
3. Locate the boundary (which step first produces wrong output?)
4. Check data shape and types at the identified boundary
5. Check for null values in unexpected columns
6. Check for duplicate rows or unexpected row counts
7. Verify the merge keys if a merge is involved
8. Check whether preprocessors were fit on training data only (leakage check)
9. Simplify to a minimal reproduction (small DataFrame, simple inputs)
10. Verify your assumptions explicitly with assert statements
11. Restart and rerun the notebook/script from scratch
12. Ask a rubber duck — explain the code step by step to force clarity
Summary
Debugging in data science requires a richer toolkit than debugging in general software development because the failure modes are richer. Code can run without errors while silently producing incorrect results — through wrong merge keys, data leakage, label encoding inconsistencies, or corrupted transformations that leave no trace in exception logs.
The systematic approach beats the random approach every time. Whether you’re using print() statements to audit row counts at pipeline boundaries, pdb to interactively explore state at a specific line, an IDE debugger to visually inspect DataFrames mid-execution, or defensive schema validation to make violations explicit at the point of entry — the goal is the same: form a hypothesis, design a test, execute it, and update your understanding.
Data science bugs have their own taxonomy — silent row loss, shape mismatches, data leakage, merge explosions, encoding inconsistencies — and each category has proven diagnostic patterns. Learning to recognize these patterns transforms debugging from an anxious, open-ended struggle into a structured, focused investigation that typically resolves in minutes rather than hours.
Key Takeaways
- Data science has four main bug categories: syntax errors (code won’t run), runtime errors (exceptions), logic errors (wrong results, no error), and performance bugs (correct but too slow/memory-intensive) — each requiring different debugging approaches
- Logic errors are the most dangerous: they produce convincing-looking results without raising any exceptions, making them the hardest to detect and the most likely to cause real-world harm
- Print debugging with data-specific diagnostics (shape, dtypes, null counts, value ranges) at every pipeline boundary is the fastest way to isolate where data goes wrong
- Python’s pdb debugger (and its enhanced variant ipdb) provides an interactive REPL at any point in execution — use it for complex bugs that require exploring the program’s state
- The Kernel → Restart & Run All command in Jupyter Notebooks is the only reliable way to confirm your notebook works correctly from top to bottom — run it before sharing any results
- Data leakage (fitting preprocessors on test data, using future information as features) is among the most insidious data science bugs because it produces optimistic evaluation metrics that fail in production
- Defensive programming — schema validation with pandera, explicit assertion of assumptions, comprehensive merge diagnostics, type hints — prevents bugs from hiding by making violations visible immediately at the point of entry