Debugging Python Code in Data Science Projects

Master debugging Python code in data science. Learn print debugging, pdb, IDE debuggers, common data science bugs, and systematic strategies to fix errors fast.

Debugging in data science is the systematic process of identifying and fixing errors in Python code, data transformations, and model pipelines — ranging from straightforward syntax errors that prevent code from running, to subtle logic errors that silently corrupt results and produce convincing but incorrect output. Effective data science debugging requires a combination of Python debugging tools (print statements, the pdb debugger, IDE breakpoints), data inspection techniques (shape checks, value audits, distribution comparisons), and disciplined mental models for tracing problems through multi-step pipelines.

Introduction

Every data scientist, at every experience level, spends a significant portion of their working time debugging. Studies of software developers suggest that debugging accounts for roughly 50% of total development time — and in data science, where pipelines are longer, data is messier, and failures can be silent, that proportion may be even higher.

What separates fast debuggers from slow ones is not intelligence — it’s methodology. Inexperienced debuggers attack bugs randomly: change something, run again, change something else, run again, growing increasingly frustrated with each iteration. Experienced debuggers approach bugs like scientists: form a hypothesis about the cause, design a test that would confirm or refute it, execute the test, update the hypothesis, and repeat. This systematic approach finds bugs faster, with less wasted effort, and produces better understanding of the code in the process.

Data science debugging has unique characteristics that distinguish it from general Python debugging. Data scientists deal with bugs that don’t raise errors at all — a wrong merge key that silently duplicates rows, a feature scaling applied to test data using test statistics instead of training statistics, a label encoding that maps categories differently on different subsets. These “silent failures” can propagate through an entire pipeline, producing models that train and evaluate without errors but perform poorly or incorrectly in production.

This guide covers the full spectrum of debugging in data science: the types of bugs you’ll encounter, the tools available for finding them, systematic strategies for isolating problems, and patterns for the most common data science–specific bugs. By the end, you’ll have a methodological toolkit that makes debugging faster, less frustrating, and even intellectually satisfying.

Understanding the Types of Bugs in Data Science

Different bugs require different debugging approaches. Knowing what kind of bug you’re dealing with is the first step toward fixing it efficiently.

Type 1: Syntax Errors

Syntax errors are the simplest category — Python’s parser can’t understand the code and refuses to run it at all. These are usually trivially identified because Python tells you exactly where the problem is.

Python
# SyntaxError: Missing closing parenthesis
df = pd.read_csv("data.csv"

# SyntaxError: Invalid indentation
def clean_data(df):
df = df.dropna()   # ← Should be indented
    return df

# SyntaxError: Using = instead of == in a condition
if df.shape[0] = 0:   # ← Should be ==
    raise ValueError("Empty DataFrame")

Debugging approach: Read the error message carefully — Python tells you the file and line number. Modern IDEs (VS Code, PyCharm) underline syntax errors in red before you even run the code.

Type 2: Runtime Errors (Exceptions)

Runtime errors occur during execution — the code is syntactically valid but fails when run because of an unexpected condition.

Common data science runtime errors:

Python
# KeyError: Column doesn't exist in DataFrame
df['revenue_per_customer']   # ← Column named 'revenue' not 'revenue_per_customer'

# TypeError: Wrong type passed to a function
np.log("100")   # ← String passed where a number is expected — raises TypeError

# ValueError: Shape mismatch
scaler.transform(X_test)   # ← X_test has different number of features than X_train

# AttributeError: Method doesn't exist
df.drop_duplicte()   # ← Typo: should be drop_duplicates()

# IndexError: Out of bounds
features_list[50]   # ← List only has 47 elements

# FileNotFoundError: Wrong path
pd.read_csv("data/transactions.csv")   # ← File is in data/raw/transactions.csv

Debugging approach: Python’s traceback tells you exactly what happened and where. Read it from bottom to top — the bottom is the immediate error, tracing upward shows how execution arrived there.

Type 3: Logic Errors

Logic errors are the most dangerous category. The code runs without raising any exception, but it produces wrong results. In data science, these are extraordinarily common and often go undetected for a long time.

Python
# Logic error: Wrong merge key — creates data leakage
# Intended: join transactions to customers
merged = pd.merge(transactions, customers, on='id')  
# Problem: 'id' is transaction_id in transactions but customer_id in customers
# Result: exploded DataFrame with many-to-many joins, silent data corruption

# Logic error: Fitting scaler on test data — data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Correct
X_test_scaled = scaler.fit_transform(X_test)    # Wrong! Should be .transform()
# Problem: Test data is scaled with its own statistics, so evaluation
# no longer reflects how the model will see production data

# Logic error: Rolling average computed on unsorted data
df['rolling_avg_7d'] = df['amount'].rolling(window=7).mean()
# But the DataFrame wasn't sorted by date first — rolling over unsorted rows is meaningless

# Logic error: Wrong comparison operator
high_value = df[df['amount'] > 100]   # Should be >= 100
# Excludes exactly 100, possibly a significant category

Debugging approach: Logic errors require data inspection and reasoning. You can’t rely on error messages — you need to verify assumptions at each step of the pipeline. This is where most debugging time is actually spent.

Type 4: Performance Bugs

These aren’t errors in the traditional sense — the code produces correct output, but far too slowly or using far too much memory.

Python
# Performance bug: Iterating row-by-row instead of using vectorized operations
for i, row in df.iterrows():
    df.at[i, 'score'] = row['amount'] * 0.1 + row['frequency'] * 5
# 1000x slower than: df['score'] = df['amount'] * 0.1 + df['frequency'] * 5

# Memory bug: Loading entire dataset when only a sample is needed
df = pd.read_csv("massive_file.csv")  # 8GB file, crashes on 16GB machine
# Better: df = pd.read_csv("massive_file.csv", nrows=10000) for development

# Performance bug: Unnecessary DataFrame copies in a loop
results = []
for chunk in data_chunks:
    results.append(process(chunk))
final = pd.concat(results)  # This is actually fine
# vs:
final = pd.DataFrame()
for chunk in data_chunks:
    final = pd.concat([final, process(chunk)])  # O(n²) — terrible for large n

Debugging approach: Profiling tools (cProfile, line_profiler, memory_profiler) identify where time and memory are actually being spent.
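As a minimal illustration, the standard library's cProfile can wrap a single call and report where time actually goes. The slow_score function below is a hypothetical stand-in for a pipeline step:

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": np.random.rand(10_000) * 100,
    "frequency": np.random.randint(1, 20, size=10_000),
})

def slow_score(frame):
    # Row-by-row loop: exactly the kind of hotspot a profiler exposes
    scores = []
    for _, row in frame.iterrows():
        scores.append(row["amount"] * 0.1 + row["frequency"] * 5)
    return pd.Series(scores, index=frame.index)

profiler = cProfile.Profile()
profiler.enable()
slow_score(df)
profiler.disable()

# Show the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report will show slow_score and the iterrows machinery dominating the cumulative time, which points directly at the vectorized rewrite.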

Tool 1: Print Debugging — The Humble Workhorse

Although more sophisticated tools exist, print-based debugging remains the most-used debugging technique in data science, for good reason: it’s fast, requires no setup, and works everywhere — scripts, notebooks, remote servers, and CI/CD pipelines.

Effective Print Debugging

The key to effective print debugging is being systematic rather than scattering prints randomly. Place prints at boundaries — before and after major transformations — and make them informative:

Python
def preprocess_pipeline(df):
    print(f"[START] Input shape: {df.shape}")
    print(f"[START] Null counts:\n{df.isnull().sum()}")
    
    # Step 1: Remove duplicates
    df = df.drop_duplicates()
    print(f"[AFTER drop_duplicates] Shape: {df.shape}")
    
    # Step 2: Handle missing values
    prev_rows = df.shape[0]
    df = df.dropna(subset=['customer_id', 'amount'])
    print(f"[AFTER dropna] Shape: {df.shape}, "
          f"Rows removed: {prev_rows - df.shape[0]}")
    
    # Step 3: Type conversion
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    print(f"[AFTER type conversion] "
          f"date dtype: {df['transaction_date'].dtype}")
    
    # Step 4: Filter invalid dates
    df = df[df['transaction_date'] >= '2020-01-01']
    print(f"[AFTER date filter] Shape: {df.shape}")
    
    # Step 5: Compute derived columns
    df['amount_log'] = np.log1p(df['amount'])
    print(f"[AFTER log transform] "
          f"amount_log stats:\n{df['amount_log'].describe()}")
    
    print(f"[END] Final shape: {df.shape}")
    return df

This style of instrumentation — sometimes called “logging checkpoints” — makes it immediately obvious at which step the data changes unexpectedly.

Data-Specific Print Diagnostics

For data science debugging, the most useful things to print aren’t just shapes — they’re data-level summaries:

Python
def debug_dataframe(df, label=""):
    """Print comprehensive diagnostic information about a DataFrame."""
    header = f"=== DataFrame Debug: {label} ===" if label else "=== DataFrame Debug ==="
    print(header)
    print(f"Shape: {df.shape}")
    print(f"Dtypes:\n{df.dtypes}")
    nulls = df.isnull().sum()
    print(f"Null counts:\n{nulls[nulls > 0]}")
    print(f"Duplicate rows: {df.duplicated().sum()}")
    for col in df.select_dtypes(include='number').columns[:5]:
        print(f"{col}: min={df[col].min():.3f}, max={df[col].max():.3f}, "
              f"mean={df[col].mean():.3f}, nulls={df[col].isnull().sum()}")
    print()

# Use it throughout your pipeline
raw_df = pd.read_csv("data.csv")
debug_dataframe(raw_df, "Raw data")

clean_df = clean_data(raw_df)
debug_dataframe(clean_df, "After cleaning")

feature_df = build_features(clean_df)
debug_dataframe(feature_df, "After feature engineering")

Removing Print Statements: The Logging Module

Production code shouldn’t have print() statements scattered throughout it. Use Python’s logging module instead — it supports log levels, can be turned on/off without changing code, and writes to files as well as the console:

Python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def preprocess_pipeline(df):
    logger.info(f"Starting preprocessing. Input shape: {df.shape}")
    
    df = df.drop_duplicates()
    logger.debug(f"After drop_duplicates: {df.shape}")
    
    df = df.dropna(subset=['customer_id'])
    logger.info(f"After dropna: {df.shape}")
    
    if df.shape[0] == 0:
        logger.error("DataFrame is empty after preprocessing!")
        raise ValueError("Empty DataFrame after preprocessing")
    
    logger.info("Preprocessing complete.")
    return df

In production, set level=logging.INFO (suppress DEBUG messages). During debugging, set level=logging.DEBUG to see all messages. No code changes needed — just a configuration change.
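One way to make that switch purely configuration-driven is to read the level from an environment variable. The LOG_LEVEL name below is an arbitrary choice for this sketch:

```python
import logging
import os

# Level comes from the environment; INFO is the production default.
# LOG_LEVEL is an arbitrary variable name chosen for this sketch.
level_name = os.environ.get("LOG_LEVEL", "INFO")
logging.basicConfig(
    level=getattr(logging, level_name.upper(), logging.INFO),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    force=True,  # replace any prior configuration (Python 3.8+)
)
logger = logging.getLogger(__name__)

logger.debug("Only visible when LOG_LEVEL=DEBUG")
logger.info("Visible at INFO and below")
```

Run with LOG_LEVEL=DEBUG python pipeline.py to see debug output; no code edits required.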

Tool 2: Python’s Built-in Debugger — pdb

pdb (Python Debugger) is Python’s built-in interactive debugger. It lets you pause execution at any point, inspect variables, execute arbitrary expressions, and step through code line by line — all without leaving the terminal.

Starting pdb

Method 1: Insert a breakpoint in code

Python
def build_features(df, behavioral_features):
    rfm = compute_rfm(df)
    
    # Pause execution here and open interactive debugger
    breakpoint()  # Python 3.7+ (replaces the older: import pdb; pdb.set_trace())
    
    features = merge_feature_sets(rfm, behavioral_features)
    return features

When execution hits breakpoint(), Python pauses and opens the pdb interactive prompt:

Plaintext
> /path/to/your/file.py(8)build_features()
-> features = merge_feature_sets(rfm, behavioral_features)
(Pdb) 

Method 2: Launch a script directly in pdb

Bash
python -m pdb src/train_model.py --config configs/config.yaml

Method 3: Post-mortem debugging — investigate after a crash

Python
import pdb

try:
    result = run_pipeline(df)
except Exception as e:
    print(f"Error: {e}")
    pdb.post_mortem()  # Opens debugger at the point of the crash

Essential pdb Commands

Command      Shortcut   Description
help         h          Show all commands
list         l          Show code context (11 lines around the current position)
next         n          Execute the next line (step over function calls)
step         s          Step into the next function call
return       r          Continue until the current function returns
continue     c          Resume execution until the next breakpoint
quit         q          Exit the debugger
p expr       p          Evaluate and print an expression
pp expr      pp         Pretty-print an expression (better for dicts/lists)
where        w          Show the call stack (where am I in the program?)
up           u          Move up one level in the call stack
down         d          Move down one level in the call stack
break n      b n        Set a breakpoint at line n
!statement   !          Execute arbitrary Python in the current context

pdb in Action: A Data Science Example

Plaintext
# Debugging a mysterious row count drop in preprocessing
(Pdb) l
  5     df = df.drop_duplicates()
  6     print(f"After dedup: {df.shape}")
  7     
  8  -> df = df.merge(customers, on='customer_id', how='inner')
  9     print(f"After merge: {df.shape}")
 10

(Pdb) p df.shape
(45231, 8)

(Pdb) p customers.shape
(38940, 5)

(Pdb) p df['customer_id'].nunique()
45231

(Pdb) p customers['customer_id'].nunique()
38940

(Pdb) p len(set(df['customer_id']) - set(customers['customer_id']))
6291

# Found it! 6,291 transaction customer_ids don't exist in the customers table
# The inner merge will drop these rows — that's the missing rows we're seeing
# Fix: investigate why customers table is missing these IDs

(Pdb) c  # continue

This interaction found the root cause in under 2 minutes.

ipdb: The Enhanced Debugger

ipdb is a drop-in replacement for pdb that adds IPython-style tab completion, syntax highlighting, and better display of objects:

Bash
pip install ipdb

Python
import ipdb
ipdb.set_trace()  # Or just: breakpoint()  (if PYTHONBREAKPOINT=ipdb)

Set ipdb as the default debugger from your shell:

Bash
export PYTHONBREAKPOINT=ipdb.set_trace

Now breakpoint() everywhere opens ipdb automatically. Setting PYTHONBREAKPOINT=0 disables every breakpoint() call without touching the code.

Tool 3: IDE Debuggers — PyCharm and VS Code

For the most powerful debugging experience, use your IDE’s graphical debugger. IDE debuggers provide everything pdb does but with a visual interface that shows variables, call stacks, and code simultaneously — dramatically reducing cognitive load.

Setting Breakpoints in VS Code

Click in the margin to the left of any line number — a red dot appears. Run the file with F5 (Debug mode). Execution pauses at the breakpoint.

The Debug panel shows:

  • Variables: Every variable in the current scope with current values. DataFrames show their shape and dtypes. Click the arrow to expand objects.
  • Watch: Type any expression (e.g., df.shape, df['amount'].isnull().sum()) and VS Code evaluates it continuously as you step through code
  • Call Stack: The chain of function calls that led here
  • Debug Console: A REPL where you can type arbitrary Python in the current execution context

Debug Console: The Hidden Gem

The Debug Console (or “Evaluate Expression” in PyCharm) is the most useful debugging tool most data scientists underuse. While paused at a breakpoint, the debug console lets you run arbitrary code in the current scope:

Plaintext
> df.shape
(45231, 8)

> df['customer_id'].duplicated().sum()
0

> df.dtypes
customer_id        object
transaction_date   object   ← Should be datetime!
amount             float64

> pd.to_datetime(df['transaction_date']).head()
0   2024-01-15
1   2024-01-16
...

> df['transaction_date'].head()
0    "2024-01-15"   ← It's a string, not datetime — found the bug!

The debug console lets you explore the data interactively without modifying your code — test fixes before implementing them, verify assumptions, understand data properties.

Conditional Breakpoints

For data science pipelines processing thousands of rows, you don’t want to pause at a breakpoint for every row — you want to pause when something specific goes wrong. Both VS Code and PyCharm support conditional breakpoints:

Python
# Right-click a breakpoint → Add Condition

# Pause when the DataFrame drops below expected size
df.shape[0] < 40000

# Pause when a suspicious value appears
df['amount'].max() > 50000

# Pause on a specific iteration
customer_id == 'CUST_98765'

# Pause when an unexpected NaN appears
df['customer_id'].isnull().any()
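If your editor doesn't support conditional breakpoints, the same effect is a one-liner in plain Python: guard breakpoint() with the condition yourself. A sketch, with an arbitrary 20% row-loss threshold:

```python
import pandas as pd

DEBUG = False  # flip to True while investigating

def filter_recent(df, cutoff="2020-01-01"):
    filtered = df[df["transaction_date"] >= cutoff]
    # Conditional breakpoint: pause only if the filter drops more than
    # 20% of rows (an arbitrary threshold for this sketch)
    if DEBUG and len(filtered) < 0.8 * len(df):
        breakpoint()  # opens pdb with df and filtered in scope
    return filtered

df = pd.DataFrame({
    "transaction_date": ["2019-05-01", "2024-01-15", "2024-02-01"],
    "amount": [10.0, 20.0, 30.0],
})
print(filter_recent(df).shape)  # → (2, 2)
```

Because the guard is ordinary code, it also works on remote servers where no IDE is attached.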

Debugging Data Science–Specific Bugs

Beyond general Python debugging, data science has its own category of bugs that require specific diagnostic approaches.

Bug Category 1: Silent Row Loss

One of the most common data science bugs is rows disappearing from your DataFrame without errors or warnings. Your pipeline starts with 50,000 rows and ends with 43,000, and you don’t know where the 7,000 went.

Systematic approach — count rows at every step:

Python
def preprocessing_with_audit(df, customers, verbose=True):
    audit = {'start': len(df)}
    
    df = df.drop_duplicates()
    audit['after_dedup'] = len(df)
    
    df = df.dropna(subset=['customer_id', 'amount'])
    audit['after_dropna_required'] = len(df)
    
    df = df[df['amount'] > 0]
    audit['after_positive_amount'] = len(df)
    
    df = df[df['transaction_date'] >= '2020-01-01']
    audit['after_date_filter'] = len(df)
    
    df = df.merge(customers, on='customer_id', how='inner')
    audit['after_merge'] = len(df)
    
    if verbose:
        print("\n=== Row Count Audit ===")
        prev = audit['start']
        for step, count in audit.items():
            diff = count - prev if step != 'start' else 0
            flag = " ← LARGE DROP" if abs(diff) > 1000 else ""
            print(f"{step:35s}: {count:7,d}  ({diff:+,d}){flag}")
            prev = count
    
    return df

Output:

Plaintext
=== Row Count Audit ===
start                              :  50,000  (+0)
after_dedup                        :  49,847  (-153)
after_dropna_required              :  49,821  (-26)
after_positive_amount              :  49,798  (-23)
after_date_filter                  :  49,798  (+0)
after_merge                        :  43,507  (-6,291)  ← LARGE DROP

Immediately clear: the merge is responsible for the large drop.
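Once the merge is implicated, pandas' indicator parameter of merge pinpoints exactly which rows fail to match. A sketch on toy data:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": ["A", "B"],
    "segment": ["gold", "silver"],
})

# how='left' keeps every transaction; indicator=True adds a _merge column
# marking each row 'both' (matched) or 'left_only' (no matching customer)
checked = transactions.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} transactions have no matching customer:")
print(unmatched["customer_id"].tolist())  # → ['C', 'D']
```

Inspecting the unmatched keys usually reveals the root cause: stale reference data, type mismatches, or leading/trailing whitespace in the key column.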

Bug Category 2: Shape and Type Mismatches

Shape and type errors are common when code written for one dataset is applied to another:

Python
def diagnose_ml_data(X_train, X_test, y_train, y_test, feature_names=None):
    """Print comprehensive diagnostics before model training."""
    print("=== ML Data Diagnostics ===")
    print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
    
    # Check for shape compatibility
    if X_train.shape[1] != X_test.shape[1]:
        print(f"ERROR: Feature count mismatch! "
              f"Train: {X_train.shape[1]}, Test: {X_test.shape[1]}")
    
    # Check target distribution
    print(f"\ny_train class distribution:")
    print(pd.Series(y_train).value_counts(normalize=True).round(3))
    print(f"\ny_test class distribution:")
    print(pd.Series(y_test).value_counts(normalize=True).round(3))
    
    # Check for NaN/Inf values
    if isinstance(X_train, np.ndarray):
        n_nan = np.isnan(X_train).sum()
        n_inf = np.isinf(X_train).sum()
    else:  # DataFrame
        n_nan = X_train.isnull().sum().sum()
        n_inf = np.isinf(X_train.select_dtypes(include='number')).sum().sum()
    
    if n_nan > 0:
        print(f"\nWARNING: {n_nan} NaN values in X_train!")
    if n_inf > 0:
        print(f"\nWARNING: {n_inf} infinite values in X_train!")
    
    # Feature value ranges
    if feature_names and isinstance(X_train, np.ndarray):
        print("\nFeature value ranges (first 10):")
        for i, name in enumerate(feature_names[:10]):
            col = X_train[:, i]
            print(f"  {name:30s}: [{col.min():.3f}, {col.max():.3f}], "
                  f"mean={col.mean():.3f}")

Bug Category 3: Data Leakage

Data leakage is when information from the test set “leaks” into the training process, producing optimistic evaluation metrics that don’t reflect real-world performance. It’s one of the most insidious bugs in machine learning because the model trains successfully and evaluates well — but fails in production.

Common leakage patterns and how to detect them:

Python
# LEAKAGE BUG 1: Fitting preprocessors on entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)        # Fits on ALL data including test
X_train_scaled = X_scaled[:n_train]
X_test_scaled = X_scaled[n_train:]
# Problem: Test set statistics influenced the scaler

# CORRECT: Fit only on training data
X_train, X_test = X[:n_train], X[n_train:]
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # Fit only on train
X_test_scaled = scaler.transform(X_test)         # Transform (no fit) on test

# LEAKAGE BUG 2: Including future information as a feature
df['next_month_revenue'] = df.groupby('customer_id')['revenue'].shift(-1)
# This feature requires knowing the future — data leakage!

# LEAKAGE BUG 3: Target encoding computed on full dataset
df['category_avg_revenue'] = df.groupby('category')['revenue'].transform('mean')
# Computed using test set revenue values — leakage!
# Fix: compute target encoding only on training fold in cross-validation

# DIAGNOSTIC: Check for suspiciously high feature-target correlations
feature_target_corr = pd.DataFrame({
    'feature': X_train.columns,
    'corr_with_target': [X_train[col].corr(y_train) for col in X_train.columns]
}).sort_values('corr_with_target', ascending=False)

print("Top correlated features (investigate if > 0.9):")
print(feature_target_corr.head(10))
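For the preprocessor-fitting class of leakage, a structural fix is to put the scaler inside a scikit-learn Pipeline, so cross-validation refits it on each training fold only. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The scaler is refitted inside every CV fold, so each test fold is
# only transformed, never fitted on: no leakage by construction
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same pattern covers encoders and imputers: anything that learns from data belongs inside the pipeline, not before the split.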

Bug Category 4: Merge Bugs

Incorrect DataFrame merges are among the most common silent corruption sources:

Python
def safe_merge(left, right, on, how='inner', validate=None, verbose=True):
    """
    Merge with comprehensive diagnostic output to catch common merge bugs.
    """
    left_rows = len(left)
    right_rows = len(right)
    
    # Check for duplicate keys before merging
    # left[on] is a Series for a single key column and a DataFrame for a
    # list of key columns; .duplicated() works on both
    left_dupes = left[on].duplicated().sum()
    right_dupes = right[on].duplicated().sum()
    
    if verbose:
        print(f"Merging: {left_rows:,} rows × {right_rows:,} rows on '{on}' ({how})")
        if left_dupes > 0:
            print(f"  WARNING: {left_dupes:,} duplicate keys in left DataFrame")
        if right_dupes > 0:
            print(f"  WARNING: {right_dupes:,} duplicate keys in right DataFrame")
    
    # Perform the merge
    merged = pd.merge(left, right, on=on, how=how, validate=validate)
    
    if verbose:
        actual = len(merged)
        print(f"  Result: {actual:,} rows")
        
        if how == 'inner' and actual < left_rows * 0.9:
            print(f"  WARNING: Lost {left_rows - actual:,} rows ({(1 - actual/left_rows):.1%})")
        
        if actual > max(left_rows, right_rows):
            print(f"  WARNING: Row explosion! Result larger than either input")
            print(f"  Likely cause: Many-to-many join (duplicate keys on both sides)")
    
    return merged

# Usage
transactions_enriched = safe_merge(
    transactions, customers, 
    on='customer_id', 
    how='inner',
    validate='many_to_one'  # Each transaction should match one customer
)

Bug Category 5: Label Encoding Inconsistencies

When categorical features are encoded differently between training and inference:

Python
# Common bug: Using different encoding in training vs. inference
# Training:
train_df['channel_encoded'] = train_df['channel'].map(
    {'web': 0, 'mobile': 1, 'store': 2}
)

# Inference — a new category appeared in production data
inference_df['channel_encoded'] = inference_df['channel'].map(
    {'web': 0, 'mobile': 1, 'store': 2}
)
# If 'phone' appears: maps to NaN silently!

# Better: Use sklearn's LabelEncoder or OrdinalEncoder with handle_unknown
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train_df[['channel']])

# Training and inference use same fitted encoder
train_encoded = encoder.transform(train_df[['channel']])
inference_encoded = encoder.transform(inference_df[['channel']])
# Unknown categories get -1 instead of NaN — detectable, not silent

# Diagnostic: check for NaN after encoding
def check_encoding_coverage(df, encoded_col, original_col):
    n_nan = df[encoded_col].isnull().sum()
    if n_nan > 0:
        unseen = df[df[encoded_col].isnull()][original_col].value_counts()
        print(f"WARNING: {n_nan} NaN after encoding '{original_col}'")
        print(f"Unseen categories:\n{unseen}")

Debugging Strategies and Mental Models

Having tools is necessary but not sufficient — you also need systematic strategies for using them.

Strategy 1: Binary Search Debugging

When a bug occurs somewhere in a long pipeline, binary search isolates it efficiently. Instead of checking every step sequentially, check the midpoint. If the midpoint is correct, the bug is in the second half; if not, it’s in the first half. Repeat until isolated.

Python
# Pipeline: Steps 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 → 9 → 10
# Bug: final output is wrong

# Round 1: Check step 5 (midpoint)
intermediate = run_pipeline_to_step(df, stop_after=5)
is_correct(intermediate)   # True → bug is in steps 6-10

# Round 2: Bug is in steps 6-10 → check step 8 (midpoint)
intermediate = run_pipeline_to_step(df, stop_after=8)
is_correct(intermediate)   # False → bug is in steps 6-8

# Round 3: Bug is in steps 6-8 → check step 6
intermediate = run_pipeline_to_step(df, stop_after=6)
is_correct(intermediate)   # False → bug is in step 6

# Found! 3 checks instead of up to 10
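The run_pipeline_to_step helper above is hypothetical; one cheap way to get it is to represent the pipeline as an ordered list of step functions. A sketch with trivial stand-in steps:

```python
import pandas as pd

# Trivial stand-in steps; in a real pipeline these are your transformations
def drop_dupes(df):
    return df.drop_duplicates()

def drop_null_amount(df):
    return df.dropna(subset=["amount"])

def positive_only(df):
    return df[df["amount"] > 0]

PIPELINE = [drop_dupes, drop_null_amount, positive_only]

def run_pipeline_to_step(df, stop_after):
    """Run only the first `stop_after` steps and return the intermediate result."""
    for step in PIPELINE[:stop_after]:
        df = step(df)
    return df

df = pd.DataFrame({"amount": [10.0, 10.0, None, -5.0]})
print(run_pipeline_to_step(df, stop_after=2).shape)  # → (2, 1)
```

Structuring the pipeline this way also makes it trivial to insert audit prints between any two steps.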

Strategy 2: Simplify and Isolate

When a bug is hard to reproduce or understand, simplify the input until you have a minimal reproducible example:

Python
# Original failing code: complex pipeline with 500K rows
result = run_full_pipeline(large_df)  # Fails mysteriously

# Step 1: Try with a tiny sample
small_df = large_df.head(100)
result = run_full_pipeline(small_df)  # Still fails → not a scale issue

# Step 2: Construct a minimal DataFrame that reproduces the bug
minimal_df = pd.DataFrame({
    'customer_id': ['A', 'A', 'B'],
    'amount': [100.0, 200.0, None],  # ← Null amount is the issue
    'date': ['2024-01-01', '2024-01-02', '2024-01-03']
})
result = run_full_pipeline(minimal_df)  # Fails → found the minimal reproduction

# Now it's clear: the bug is triggered by null values in 'amount'

Minimal reproducible examples have a second benefit: they’re easy to share with colleagues or post on Stack Overflow when you need help.

Strategy 3: Verify Assumptions Explicitly

The most common root cause of data science bugs is an assumption about the data that turns out to be false. The fix: stop assuming, start verifying.

Python
def verify_assumptions(df):
    """Explicitly check all assumptions before processing."""
    errors = []
    
    # Assumption: customer_id is never null
    null_ids = df['customer_id'].isnull().sum()
    if null_ids > 0:
        errors.append(f"customer_id has {null_ids} nulls")
    
    # Assumption: customer_id is unique (one row per customer)
    dupes = df['customer_id'].duplicated().sum()
    if dupes > 0:
        errors.append(f"customer_id has {dupes} duplicates (not unique)")
    
    # Assumption: amount is always positive
    negative = (df['amount'] <= 0).sum()
    if negative > 0:
        errors.append(f"{negative} rows have non-positive amount")
    
    # Assumption: date column is datetime type
    if df['transaction_date'].dtype != 'datetime64[ns]':
        errors.append(f"transaction_date dtype is {df['transaction_date'].dtype}, expected datetime")
    
    # Assumption: all required columns are present
    required_cols = ['customer_id', 'transaction_date', 'amount', 'channel']
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        errors.append(f"Missing required columns: {missing}")
    
    if errors:
        for error in errors:
            print(f"ASSUMPTION VIOLATED: {error}")
        raise AssertionError(f"{len(errors)} assumption violations found")
    else:
        print("All assumptions verified.")
    
    return df

Using assert statements for lightweight assumption checking:

Python
def compute_rfm_scores(df):
    assert 'customer_id' in df.columns, "Missing customer_id column"
    assert df['customer_id'].notna().all(), "customer_id contains nulls"
    assert (df['amount'] > 0).all(), "amount must be positive for monetary score"
    assert df['transaction_date'].dtype == 'datetime64[ns]', \
        f"Expected datetime, got {df['transaction_date'].dtype}"
    
    # ... rest of function

Strategy 4: Rubber Duck Debugging

Explain your code aloud — to a colleague, or even to an inanimate object (the “rubber duck”). The act of articulating what you expect the code to do, step by step, frequently reveals the discrepancy between what you think the code does and what it actually does. The moment you find yourself saying “…and then this should return X… wait, actually it returns Y because…” — that’s the bug.

Strategy 5: Read the Traceback Carefully

Python’s error tracebacks contain exactly the information needed to find most bugs, but they require careful reading. Many beginners read only the last line (the exception message) and miss the crucial context in the lines above it.

Plaintext
Traceback (most recent call last):
  File "src/train_model.py", line 87, in main
    X_train_scaled = preprocess_features(X_train, scaler)  ← Entered here
  File "src/preprocessing.py", line 34, in preprocess_features
    X_scaled = scaler.transform(X)                          ← Called this
  File "sklearn/preprocessing/_data.py", line 970, in transform
    check_is_fitted(self)                                    ← Failed here
  File "sklearn/utils/validation.py", line 1463, in check_is_fitted
    raise NotFittedError(...)                               
sklearn.exceptions.NotFittedError: This StandardScaler instance is not
fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Reading from bottom to top: the scaler isn’t fitted. Why? Look at line 34: scaler.transform(X) — the scaler was created but fit() was never called. Look at line 87: preprocess_features(X_train, scaler) — the scaler was passed in as a parameter. The bug: the caller is responsible for fitting the scaler before passing it.

Debugging in Jupyter Notebooks

Notebooks have unique debugging characteristics because of their stateful, non-linear execution model.

The Hidden State Problem

The most common notebook bug is the hidden state problem: you run cells out of order, redefine a variable, and now the notebook’s behavior depends on which cells you ran, in what order — not just on the code in the cells.

Python
# Cell 1:
x = 10

# Cell 2:
x = x * 2
print(x)  # Prints 20 if Cell 1 ran first, 40 if Cell 2 ran twice, etc.

# Cell 3:
x = 100

# The value of x is now unknowable without knowing execution order

Diagnosis and fix: Run Kernel → Restart & Run All regularly. This is the only way to verify that your notebook actually works from top to bottom. Do this before sharing any notebook or before drawing conclusions from results.

%debug Magic Command

After a notebook cell raises an exception, run %debug in the next cell to open an interactive pdb session at the point of the crash:

Python
# Cell raises an exception:
result = process_data(df)
# ValueError: cannot convert float NaN to integer

# In the very next cell:
%debug
# Opens pdb at the point of the error — inspect df, intermediate values, etc.

The %pdb Magic

Enable automatic post-mortem debugging whenever a cell raises an exception:

Python
%pdb on
# Now every exception automatically opens pdb — no need to manually add %debug

Debugging Kernel Issues

If your notebook kernel dies (memory error, crash) or produces results that seem inconsistent with the code:

  1. Kernel → Restart & Clear Output: Starts fresh but keeps your code
  2. Kernel → Restart & Run All: Verifies clean, sequential execution
  3. Check memory usage — large DataFrames in multiple variables can exhaust RAM
  4. Use del variable_name and gc.collect() to free memory from variables you no longer need
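Step 4 can be sketched as follows — dropping the reference to a large intermediate DataFrame and asking the garbage collector to reclaim the memory immediately (df_raw and df_small are illustrative names, not part of any real pipeline):

Python
```python
import gc
import pandas as pd

df_raw = pd.DataFrame({"x": range(1_000_000)})           # large intermediate
df_small = df_raw.sample(1_000, random_state=0)          # keep only what you need

del df_raw     # drop the reference to the large object
gc.collect()   # reclaim the memory now rather than at some later point
```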

Building Defensive Data Science Code

The best debugging strategy is preventing bugs from hiding. Defensive coding practices make bugs loud and early rather than silent and late.

Validate Data Schemas

Use libraries like pandera or great_expectations to define and enforce data schemas:

Python
import pandera as pa

transactions_schema = pa.DataFrameSchema({
    "customer_id": pa.Column(str, nullable=False),
    "transaction_date": pa.Column(pa.DateTime, nullable=False),
    "amount": pa.Column(float, pa.Check.greater_than(0), nullable=False),
    "channel": pa.Column(str, pa.Check.isin(
        ['web', 'mobile_app', 'store', 'phone', 'unknown']
    ), nullable=True),
}, checks=[
    pa.Check(lambda df: df['amount'].max() < 100_000, 
             error="Extreme amount values — likely data error")
])

@pa.check_input(transactions_schema)
def compute_rfm_scores(df):
    # Function guaranteed to receive valid data
    ...

If the DataFrame violates the schema, the decorator raises a clear, descriptive error — before the bug can propagate deep into the pipeline.

Use Python Type Hints

Type hints make function contracts explicit and enable static analysis tools (mypy, Pylance) to catch type errors before runtime:

Python
from typing import Tuple, Optional
import pandas as pd
import numpy as np

def split_features_target(
    df: pd.DataFrame,
    target_col: str,
    drop_cols: Optional[list[str]] = None
) -> Tuple[pd.DataFrame, pd.Series]:
    """Split DataFrame into features and target."""
    cols_to_drop = [target_col] + (drop_cols or [])
    X = df.drop(columns=cols_to_drop)
    y = df[target_col]
    return X, y

A Debugging Checklist for Data Science Projects

When you encounter a bug, work through this checklist systematically:

1. Read the error message and traceback completely (bottom to top)
2. Identify which type of bug this is (syntax, runtime, logic, performance)
3. Locate the boundary (which step first produces wrong output?)
4. Check data shape and types at the identified boundary
5. Check for null values in unexpected columns
6. Check for duplicate rows or unexpected row counts
7. Verify the merge keys if a merge is involved
8. Check whether preprocessors were fit on training data only (leakage check)
9. Simplify to a minimal reproduction (small DataFrame, simple inputs)
10. Verify your assumptions explicitly with assert statements
11. Restart and rerun the notebook/script from scratch
12. Ask a rubber duck — explain the code step by step to force clarity
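Several of the checklist items (shape checks, null checks, duplicate checks, merge-key verification, explicit assertions) can be combined into a few lines at any pipeline boundary. A minimal sketch, using illustrative DataFrames and column names:

Python
```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["EU", "US"]})

# validate= makes pandas itself raise if the merge keys aren't many-to-one
merged = orders.merge(customers, on="customer_id", how="left",
                      validate="many_to_one")

assert len(merged) == len(orders), "merge changed the row count — check keys"
assert merged["region"].notna().all(), "unmatched customer_ids produced nulls"
assert not merged.duplicated().any(), "unexpected duplicate rows after merge"
```

The validate argument turns a silent merge explosion into a loud MergeError at the exact line where the bug originates — precisely the "loud and early" behavior defensive coding aims for.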

Summary

Debugging in data science requires a richer toolkit than debugging in general software development because the failure modes are richer. Code can run without errors while silently producing incorrect results — through wrong merge keys, data leakage, label encoding inconsistencies, or corrupted transformations that leave no trace in exception logs.

The systematic approach beats the random approach every time. Whether you’re using print() statements to audit row counts at pipeline boundaries, pdb to interactively explore state at a specific line, an IDE debugger to visually inspect DataFrames mid-execution, or defensive schema validation to make violations explicit at the point of entry — the goal is the same: form a hypothesis, design a test, execute it, and update your understanding.

Data science bugs have their own taxonomy — silent row loss, shape mismatches, data leakage, merge explosions, encoding inconsistencies — and each category has proven diagnostic patterns. Learning to recognize these patterns transforms debugging from an anxious, open-ended struggle into a structured, focused investigation that typically resolves in minutes rather than hours.

Key Takeaways

  • Data science has four main bug categories: syntax errors (code won’t run), runtime errors (exceptions), logic errors (wrong results, no error), and performance bugs (correct but too slow/memory-intensive) — each requiring different debugging approaches
  • Logic errors are the most dangerous: they produce convincing-looking results without raising any exceptions, making them the hardest to detect and the most likely to cause real-world harm
  • Print debugging with data-specific diagnostics (shape, dtypes, null counts, value ranges) at every pipeline boundary is the fastest way to isolate where data goes wrong
  • Python’s pdb debugger (and its enhanced variant ipdb) provides an interactive REPL at any point in execution — use it for complex bugs that require exploring the program’s state
  • IDE debuggers (VS Code, PyCharm) add visual variable inspection, conditional breakpoints, and a debug console — dramatically reducing cognitive load for complex debugging sessions
  • The Kernel → Restart & Run All command in Jupyter Notebooks is the only reliable way to confirm your notebook works correctly from top to bottom — run it before sharing any results
  • Data leakage (fitting preprocessors on test data, using future information as features) is among the most insidious data science bugs because it produces optimistic evaluation metrics that fail in production
  • Defensive programming — schema validation with pandera, explicit assertion of assumptions, comprehensive merge diagnostics, type hints — prevents bugs from hiding by making violations visible immediately at the point of entry