Imagine you are learning to cook for the first time. You have spent weeks reading cookbooks, watching cooking shows, and studying techniques for chopping vegetables, sautéing, baking, and seasoning. You understand the theory of how heat transforms ingredients, the chemistry of why certain flavor combinations work, and the principles of balancing textures and tastes. You can explain the Maillard reaction that creates browning, describe the difference between simmering and boiling, and recite the proper way to dice an onion. Yet despite all this theoretical knowledge, you have never actually turned on a stove, held a knife, or produced an edible dish from raw ingredients. The moment of truth arrives when you finally stand in a real kitchen with real ingredients and real equipment, attempting to create your first complete meal from start to finish. Suddenly, all the theory you absorbed becomes concrete. You discover that holding the knife feels different than you expected from watching videos. The timing of when to add ingredients matters more than you realized from reading recipes. The sensory cues of smell, sound, and appearance that indicate doneness cannot be learned from books alone. Most importantly, you gain the confidence that comes only from actually doing, from taking ingredients and transforming them into a finished dish through your own actions. This first complete cooking experience, however imperfect the result, fundamentally transforms your relationship with cooking from abstract knowledge to practical capability. This is precisely the transformation that writing your first complete machine learning script provides on your journey to becoming a practitioner.
Up until this point in your learning journey, you have built substantial theoretical knowledge about machine learning concepts, mathematical foundations, Python programming, data manipulation libraries, and data preprocessing techniques. Each topic has been presented in isolation, allowing you to understand individual components deeply without the complexity of how they fit together. This isolation served an important pedagogical purpose, letting you master each piece without being overwhelmed by the full workflow. However, real machine learning projects do not separate these components cleanly. Data arrives messy and must be loaded, explored, cleaned, and transformed. Models must be selected, configured, trained on prepared data, and evaluated to understand their performance. Predictions must be generated and interpreted in the context of the problem you are solving. Understanding how all these pieces connect together in a complete workflow, experiencing the decisions you must make at each step, and seeing a working example from beginning to end provides essential context that isolated component knowledge cannot give you.
The power of creating your first complete machine learning script extends beyond just connecting concepts you have already learned. Writing actual code that performs real machine learning on real data builds confidence in ways that reading about machine learning never can. When you see your code successfully load data, train a model, and produce predictions, the abstract concepts crystallize into concrete reality. You gain visceral understanding of what machine learning actually does, moving from knowing that algorithms learn from data to experiencing how your code causes that learning to happen. You encounter practical issues that theoretical descriptions gloss over, like handling data format quirks, choosing appropriate parameter values, and interpreting error messages when things go wrong. Most importantly, you develop the troubleshooting mindset essential for independent work, learning to diagnose problems, search for solutions, and iteratively refine your code until it works. These practical skills only emerge through hands-on experience, no matter how thoroughly you understand the theory.
Yet attempting your first machine learning script can feel overwhelming when you face the blank page or empty code file, even with strong theoretical knowledge. Where do you start? What libraries do you import? How do you structure your code? What should happen in what order? How do you know if your results make sense? These questions create a gap between understanding individual components and knowing how to assemble them into a working whole. The secret to crossing this gap is following a clear, annotated example that shows every step in order, explains why each step is necessary, and demonstrates the complete workflow from importing libraries to evaluating results. Once you have seen one complete example and understood its structure, you can adapt that structure to new problems, substituting different datasets, different preprocessing steps, and different models while maintaining the same overall workflow pattern. The first example provides the scaffold on which all future projects build.
In this comprehensive guide, we will build your first complete machine learning script from scratch, line by line and concept by concept, with detailed explanations of every step. We will work with a real dataset that has genuine patterns to learn, giving you meaningful results rather than toy examples. We will follow the standard machine learning workflow that appears in virtually all projects, giving you a template you can reuse. We will start by clearly defining our problem and understanding our data. We will load the data into a pandas DataFrame and explore it to understand its characteristics. We will preprocess the data to handle any quality issues and prepare it for modeling. We will split our data into training and testing sets to enable proper evaluation. We will select and train a scikit-learn model on the training data. We will use the trained model to make predictions on the test data. We will evaluate those predictions using appropriate metrics to understand model performance. We will interpret the results and understand what they tell us about our model’s effectiveness. Throughout, every line of code will be shown and explained, every decision justified, and every concept connected to the theory you have already learned. By the end, you will have a complete, working machine learning script that you ran yourself, and more importantly, you will understand the structure and workflow that generalizes to any machine learning project you tackle in the future.
Understanding Our Problem and Dataset
Before writing any code, we must clearly understand what problem we are solving and what data we have available. Machine learning is ultimately about solving problems with data, so starting with clear problem definition and data understanding ensures our technical work serves a meaningful purpose.
The Problem: Predicting Diabetes Progression
For our first machine learning script, we will tackle a regression problem where we predict the progression of diabetes in patients one year after baseline measurements were taken. This problem uses a real medical dataset that has been used extensively in machine learning research and education, making it an excellent choice for learning. Regression problems, where we predict a continuous numerical value rather than discrete categories, introduce fundamental machine learning concepts while being, in some respects, simpler than classification.
The dataset we will use is the diabetes dataset available through scikit-learn, which contains measurements from 442 diabetes patients. For each patient, we have ten baseline variables including age, sex, body mass index, average blood pressure, and six blood serum measurements. Our target variable is a quantitative measure of disease progression one year after these baseline measurements. The goal is to learn the relationship between the baseline measurements and disease progression so we can predict likely progression for new patients based on their baseline measurements.
This problem has real-world significance. If we can accurately predict disease progression from baseline measurements, doctors can identify patients at high risk of rapid progression and intervene earlier with more aggressive treatment. Conversely, patients predicted to have slow progression might avoid unnecessary intensive interventions. While our educational exploration of this dataset will not directly impact medical care, the problem structure and type of relationship we learn mirrors many real predictive modeling applications in healthcare and beyond.
Understanding the problem context helps us make informed decisions throughout the modeling process. We know that we need to predict a continuous value, making this a regression rather than classification task. We know that accuracy in prediction has potential health consequences, suggesting we should evaluate multiple models and understand their uncertainty rather than accepting the first model we try. We know that medical measurements can have outliers or measurement errors, meaning preprocessing and data quality checks are particularly important. These contextual insights, derived from understanding the problem rather than just the data mechanics, guide our technical choices.
Understanding What Makes a Good Machine Learning Problem
Before proceeding with our specific diabetes prediction task, it helps to understand what characteristics make a problem suitable for machine learning. Machine learning excels when you have patterns in data that are too complex to capture with simple rules but consistent enough to learn from examples. If the relationship between inputs and outputs is perfectly described by a simple formula or a few logical rules, you probably do not need machine learning. If the relationship is completely random with no consistent patterns, machine learning cannot help because there is nothing to learn.
Good machine learning problems have sufficient data with examples that cover the range of situations you want to make predictions about. Our diabetes dataset with 442 examples is modest but sufficient for learning basic patterns, though larger datasets generally enable learning more complex relationships. The data should include features that actually relate to what you want to predict, containing information that could plausibly help make predictions. Random features unrelated to the outcome will not help and might even hurt by adding noise.
The problem should have a clear definition of success that you can measure. For our regression task, we can measure how close predictions are to actual progression values using metrics like mean squared error. This measurability lets us determine whether our model performs well and compare different approaches objectively. Without clear success criteria, machine learning becomes guesswork without feedback about whether you are improving.
These characteristics apply broadly across machine learning applications. When you encounter potential machine learning projects in your work, evaluating them against these criteria helps you identify which problems are likely to benefit from machine learning versus which need different approaches. Recognizing that our diabetes prediction problem has suitable characteristics gives us confidence that proceeding with machine learning is appropriate rather than a forced application of technique to an unsuitable problem.
Setting Up Our Environment and Importing Libraries
With our problem clearly defined, we begin writing code by setting up our Python environment and importing the libraries we will use throughout the script. This setup phase establishes the tools we will use for data manipulation, modeling, and evaluation.
Importing Essential Libraries
We start our script by importing the libraries that provide the functionality we need. Each import statement makes functions and classes from a library available in our script under a convenient name. The conventional import structure for machine learning scripts follows established patterns that make code readable to others familiar with the data science ecosystem.
First, we import NumPy, which we will use for numerical operations and array manipulations. The import statement reads import numpy as np, following the universal convention of abbreviating numpy as np. This abbreviation appears so consistently in the Python data science community that seeing it immediately signals you are working with numerical data.
Next, we import pandas for data manipulation, using import pandas as pd. The pd abbreviation is equally universal for pandas. We will use pandas DataFrames to load, explore, and preprocess our data before converting it to NumPy arrays for modeling. The DataFrame structure provides intuitive operations for working with tabular data that make the initial data handling much cleaner than working with raw arrays.
We then import matplotlib for visualization, using import matplotlib.pyplot as plt. The pyplot module provides MATLAB-style plotting functions that we will use to create visualizations of our data and results. Visualization helps us understand data characteristics and model behavior in ways that numerical summaries alone cannot convey.
From scikit-learn, we import several components. We import the datasets module to access the diabetes dataset with from sklearn import datasets. We import train_test_split for dividing our data into training and testing sets with from sklearn.model_selection import train_test_split. We import LinearRegression as our model with from sklearn.linear_model import LinearRegression. We import metrics for evaluating our model with from sklearn import metrics. These selective imports bring in just the scikit-learn components we need rather than importing the entire library, keeping our namespace clean and making dependencies explicit.
The complete import block at the beginning of our script is shown below, with each import on its own line for clarity. This import structure establishes our toolkit for the work ahead, and you will see similar import blocks at the beginning of virtually every machine learning script you encounter or write.
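```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
```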
Why These Specific Libraries
Understanding why we import each library helps you appreciate what role each plays in the machine learning workflow. NumPy provides the fundamental array data structure and mathematical operations that everything else builds on. Without NumPy, we would not have efficient numerical computing in Python. Pandas builds on NumPy to add labeled, tabular data structures and data manipulation operations that make working with datasets much more intuitive than raw arrays. Without pandas, data loading and preprocessing would require substantially more code.
Matplotlib enables us to visualize our data and results, turning abstract numbers into concrete visual patterns we can understand. Visualization is not optional decoration but rather an essential tool for understanding data distributions, identifying problems, and validating results. Without visualization, we would miss patterns that are obvious visually but hidden in numerical summaries.
Scikit-learn provides the machine learning algorithms, evaluation metrics, and utilities that implement the actual learning and prediction. While we could implement linear regression from scratch using NumPy, scikit-learn provides tested, optimized implementations with convenient interfaces that let us focus on using machine learning rather than implementing it. The consistency of scikit-learn interfaces across different algorithms means that once you learn to use one scikit-learn model, you can easily use others by changing just the model class while keeping the rest of your workflow the same.
Together, these libraries form the standard stack for machine learning in Python. The fact that they are all open source, widely used, well documented, and actively maintained means you are learning tools that you will use throughout your machine learning career and that align with industry and research practices. Understanding this ecosystem gives you confidence that the skills you are building are valuable beyond this single tutorial.
Loading and Exploring Our Data
With our libraries imported and ready, we proceed to load our data and explore it to understand what we are working with. This exploration phase is crucial in real projects because data rarely arrives in perfect form, and understanding data characteristics guides all subsequent decisions.
Loading the Diabetes Dataset
Scikit-learn includes several small datasets that are perfect for learning and experimentation. The diabetes dataset is one of these built-in datasets, meaning we can load it with a simple function call without downloading files or handling data formats. We load the dataset by calling datasets.load_diabetes and assigning the result to a variable: diabetes = datasets.load_diabetes().
This function returns an object that contains several attributes. The data attribute contains the feature matrix as a NumPy array where rows are patients and columns are the ten baseline measurements. The target attribute contains the disease progression measurements as a one-dimensional NumPy array. The feature_names attribute contains a list of names for the ten features. The DESCR attribute contains a detailed description of the dataset including information about how it was collected and what each feature represents.
To better understand what we have loaded, we can examine these components. Looking at diabetes.data.shape tells us the dimensions of our feature matrix. This returns the tuple (442, 10), confirming we have 442 patient records with ten features each. Looking at diabetes.target.shape returns (442,), confirming we have 442 target values, one for each patient as expected. Examining diabetes.feature_names shows us the list of feature names including age, sex, bmi for body mass index, bp for blood pressure, and six blood serum measurements abbreviated as s1 through s6.
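Expressed as code, the loading and inspection steps described above might look like this:

```python
# Load the built-in diabetes dataset
diabetes = datasets.load_diabetes()

# Confirm the data loaded as expected
print(diabetes.data.shape)     # (442, 10): 442 patients, 10 features
print(diabetes.target.shape)   # (442,): one progression value per patient
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']
```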
This initial examination confirms our data loaded correctly and matches our expectations. We have the right number of examples and features, and the data is already in numerical form suitable for modeling. In real projects with data from external sources, this loading phase might involve reading CSV files, handling missing values during loading, and dealing with data format inconsistencies, but for our educational example, the built-in dataset provides clean data that lets us focus on the modeling workflow.
Creating a DataFrame for Easier Exploration
While our data is already in NumPy arrays that we could use directly for modeling, converting it to a pandas DataFrame makes exploration much more convenient. DataFrames provide labeled access to columns and intuitive methods for summarizing and visualizing data that are more convenient than working with raw arrays during the exploration phase. We will convert back to arrays when we train our model, but for now, the DataFrame structure helps us understand our data better.
We create a DataFrame by passing the feature matrix to the pandas DataFrame constructor along with column names: df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names). This creates a DataFrame called df where each column corresponds to one feature with the appropriate name. We then add the target values as a new column in the DataFrame with df['target'] = diabetes.target. Now we have all our data in a single DataFrame where we can easily reference columns by name.
To see what our data actually looks like, we display the first few rows with df.head(). This shows us the first five rows by default, giving us a concrete sense of the data values rather than just abstract descriptions. We see that features are numerical values with various ranges. Some features like age and bmi appear to be relatively small numbers while others like bp have different scales. The target values are also numerical, representing disease progression with various magnitudes.
We can compute summary statistics for all columns with df.describe(). This generates a table showing count, mean, standard deviation, minimum, quartiles, and maximum for each column. These summary statistics reveal important characteristics. We notice that all features have 442 non-null values, confirming there is no missing data. We see that different features have very different ranges and scales, which will matter when we preprocess our data. The target variable has a mean around 152 and spans from roughly 25 to 346, giving us a sense of the range of disease progression values we are trying to predict.
These summary statistics might reveal problems in real-world datasets. If minimums or maximums seem implausible, that suggests errors or outliers. If standard deviations are zero for a feature, that feature is constant and provides no information. If counts differ between columns, that indicates missing data in some columns. For our clean educational dataset, the summaries confirm everything looks reasonable, but developing the habit of checking these summaries helps you catch problems early in real projects.
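Collected together, the DataFrame construction and exploration steps from this subsection might read:

```python
# Build a DataFrame with named columns and attach the target as an extra column
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target

# First few rows and summary statistics
print(df.head())
print(df.describe())
```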
Visualizing Data Distributions
Numbers alone do not always reveal patterns that visualization makes obvious. Creating histograms of our features and target shows their distributions, helping us understand whether they are normally distributed, skewed, or have other interesting characteristics. Visualization also reveals outliers or unusual patterns that summary statistics might miss.
We can create a histogram of the target variable to see how disease progression is distributed across our patients. Using matplotlib, we call plt.hist(df['target'], bins=30), followed by plt.xlabel('Disease Progression'), plt.ylabel('Frequency'), plt.title('Distribution of Target Variable'), and finally plt.show(). This creates a histogram with thirty bins showing the frequency of different progression values.
Looking at this histogram reveals that disease progression roughly follows a bell-curve shape with most values near the center around 150 and fewer values at the extremes. This approximately normal distribution is encouraging for regression modeling because many regression algorithms implicitly assume normally distributed target variables. If the distribution were heavily skewed or had multiple distinct peaks, we might need to transform the target before modeling or use specialized algorithms.
We can similarly visualize feature distributions, either creating separate histograms for each feature or using pandas built-in visualization that creates a grid of histograms. Calling df.hist(figsize=(12, 10)), followed by plt.tight_layout() and plt.show(), creates a grid showing histograms for all columns including our features and target. This comprehensive view lets us quickly scan all variables and identify any with unusual distributions.
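A minimal version of the visualization code described above:

```python
# Histogram of the target variable
plt.hist(df['target'], bins=30)
plt.xlabel('Disease Progression')
plt.ylabel('Frequency')
plt.title('Distribution of Target Variable')
plt.show()

# Grid of histograms for every column, features and target alike
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()
```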
These visualizations serve multiple purposes beyond just understanding current data. They help us identify features that might benefit from transformation to make them more normally distributed. They reveal features with outliers that might need special handling. They show whether features have sufficient variation to be informative or are nearly constant. They help us understand data scale differences that will require standardization before modeling. Investing time in visualization during exploration prevents problems during modeling and provides insights that guide preprocessing decisions.
Preprocessing Our Data
With our data loaded and understood, we prepare it for modeling through preprocessing steps that address data quality issues and transform data into a form suitable for our algorithm. While our educational dataset is relatively clean, going through preprocessing steps establishes good habits for real-world projects where preprocessing is essential.
Checking for Missing Values
The first preprocessing check is verifying whether our data contains missing values. Many machine learning algorithms cannot handle missing values and will fail if you pass data containing them. Even if algorithms can handle missingness, you might want to fill or remove missing values based on your understanding of why values are missing.
We check for missing values with df.isnull().sum(). This creates a boolean DataFrame indicating which values are null, then sums the True values for each column, giving us the count of missing values per column. For our diabetes dataset, this returns zeros for all columns, confirming no missing data. In real projects, you would see non-zero counts for columns with missing values, and you would need to decide whether to fill missing values using imputation, remove rows or columns with too much missingness, or use algorithms that handle missingness natively.
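The check itself is a single line:

```python
# Count missing values per column; all zeros means no missing data
print(df.isnull().sum())
```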
Even though our dataset has no missing values, understanding how to check for them establishes a crucial habit. Missing values are ubiquitous in real-world data, and failing to check for them leads to hard-to-debug errors when models fail or produce nonsensical results. Making missing value checks automatic in your workflow prevents these problems.
Separating Features and Target
Before we split our data for training and testing, we need to separate our features, which are the input variables the model will learn from, from our target, which is what we want to predict. This separation is necessary because we will pass features and target separately to our training and evaluation functions.
We create a features DataFrame containing all columns except the target with X = df.drop('target', axis=1). The drop method removes the specified column, and axis=1 indicates we are dropping a column rather than a row. We conventionally use capital X for features, following a common machine learning notation. We create a target Series containing just the target column with y = df['target']. We conventionally use lowercase y for targets.
This X and y separation appears in almost every machine learning script you will encounter. The capital X represents a matrix where rows are examples and columns are features. The lowercase y represents a vector of target values, one for each example. These names are mathematical conventions from linear algebra and statistics that have been adopted throughout the machine learning community, making code immediately recognizable to anyone familiar with machine learning.
After separation, we can verify our shapes with X.shape and y.shape. We should see that X has shape (442, 10), confirming we have all 442 patients and all ten features, and y has shape (442,), confirming we have 442 target values. These shape checks ensure separation worked correctly before we proceed.
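In code, the separation and shape checks look like this:

```python
# Separate the features (X) from the target (y)
X = df.drop('target', axis=1)
y = df['target']

print(X.shape)  # (442, 10)
print(y.shape)  # (442,)
```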
Splitting Data Into Training and Test Sets
One of the most fundamental principles in machine learning is that you must evaluate your model on data it has not seen during training. If you train and test on the same data, your evaluation is meaningless because the model might have simply memorized the training examples rather than learning generalizable patterns. To enable proper evaluation, we split our data into separate training and test sets before any training happens.
The train_test_split function from scikit-learn makes this splitting straightforward. We call it with our features and target, specifying what fraction of data to use for testing: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). This splits our data so that twenty percent goes to testing and eighty percent to training, which is a common split ratio. The random_state parameter seeds the random number generator so the split is reproducible, meaning running the code multiple times produces the same split.
After splitting, we have four variables. X_train contains the features for training examples, y_train contains their targets, X_test contains the features for test examples, and y_test contains their targets. We can verify the shapes with X_train.shape, X_test.shape, y_train.shape, and y_test.shape. We should see that X_train has roughly eighty percent of the examples, X_test has roughly twenty percent, and the y arrays have corresponding sizes.
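The split and the shape verification, collected into one place:

```python
# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, y_train.shape)  # roughly 80% of the rows
print(X_test.shape, y_test.shape)    # roughly 20% of the rows
```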
This split is crucial for honest evaluation. We will train our model using only X_train and y_train. The model never sees X_test or y_test during training. When we evaluate the model, we will use X_test to generate predictions and compare them to y_test to see how well the model performs on previously unseen examples. This simulates how the model would perform on genuinely new data in deployment, giving us realistic performance estimates rather than overly optimistic training performance.
Feature Scaling
Many machine learning algorithms perform better or require that features are on similar scales. Our summary statistics revealed that our features have different ranges and standard deviations. Some features might span zero to one while others span negative ten to ten, and these scale differences can cause problems for algorithms that are sensitive to feature scales.
For linear regression, which we will use in this tutorial, feature scaling is not strictly necessary for correctness because the algorithm learns different coefficients for different features that account for scale differences. However, scaling can improve numerical stability and make coefficient magnitudes more interpretable. More importantly, many other algorithms including gradient-based methods and distance-based methods require scaling for good performance, so establishing the habit of scaling features is valuable even when it is not strictly necessary for the current algorithm.
We standardize features to have zero mean and unit variance using scikit-learn's StandardScaler. First, we import it with from sklearn.preprocessing import StandardScaler. We create a scaler object with scaler = StandardScaler(). We fit the scaler on the training features with scaler.fit(X_train). This computes the mean and standard deviation for each feature from the training data. We then transform both training and test features using these statistics with X_train_scaled = scaler.transform(X_train) and X_test_scaled = scaler.transform(X_test).
The critical point is that we fit the scaler only on training data, then use those same statistics to transform both training and test data. This prevents data leakage where information from test data inappropriately influences preprocessing. If we computed scaling statistics from all the data or fit the scaler separately on test data, information would leak from test to training, making our evaluation optimistic and unrealistic.
After scaling, our X_train_scaled and X_test_scaled arrays contain standardized features where each feature has mean approximately zero and standard deviation approximately one in the training set. Test set features use the same transformation, so their means and standard deviations might differ slightly from zero and one respectively, which is correct because we are applying a transformation learned from training data rather than computing new statistics from test data.
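Putting the scaling steps together:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to both training and test features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```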
Training Our First Model
With our data properly loaded, explored, and preprocessed, we are finally ready to train a machine learning model. This training phase is where machine learning actually happens, where the algorithm learns patterns from training data by adjusting model parameters to minimize prediction errors.
Selecting a Model: Linear Regression
For our first machine learning script, we use linear regression, one of the simplest and most interpretable machine learning algorithms. Linear regression models the relationship between features and target as a weighted sum, where each feature has a coefficient indicating its contribution to the prediction. Despite its simplicity, linear regression works surprisingly well for many real-world problems and provides a strong baseline for comparison with more complex algorithms.
Linear regression makes the assumption that the target variable is approximately a linear combination of the features plus some random noise. Mathematically, the prediction equals a constant intercept term plus the sum of each feature multiplied by its coefficient. The algorithm learns the coefficient values that minimize the squared differences between predictions and actual targets across the training data. This minimization has a closed-form solution, meaning linear regression can find the optimal coefficients exactly without iterative optimization.
The simplicity and interpretability of linear regression make it ideal for learning. You can examine the learned coefficients to understand which features the model deems most important and in what direction they influence predictions. Positive coefficients mean higher feature values predict higher targets, negative coefficients mean higher feature values predict lower targets, and coefficient magnitudes indicate strength of influence. This transparency helps build intuition about what models learn.
We create a linear regression model with model = LinearRegression(). This creates a model object that we will train and use for prediction. Scikit-learn models follow a consistent interface where you create a model object, call its fit method with training data to train it, and call its predict method with new data to generate predictions. This consistency means that once you know how to use linear regression, using other scikit-learn models just involves changing the model class in this creation step.
Training the Model
Training happens by calling the model's fit method with our scaled training features and training targets: model.fit(X_train_scaled, y_train). This single line causes the algorithm to compute coefficient values that best predict the training targets from the training features according to the least squares criterion.
After fitting completes, the model object contains the learned parameters. We can access the coefficients with model.coef_ which returns an array of ten coefficients, one for each feature. We can access the intercept with model.intercept_ which returns the constant term in the linear model. These learned parameters define our trained model and determine how it will make predictions on new data.
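The model creation, training, and parameter inspection described above, as code:

```python
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Inspect what was learned
print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the constant term
```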
Examining the coefficients provides insight into what the model learned. Positive coefficients mean the model learned that higher values of that feature tend to associate with higher disease progression. Negative coefficients mean higher values associate with lower progression. For example, if the BMI coefficient is large and positive, the model learned that higher BMI predicts faster disease progression, which aligns with medical understanding of diabetes. If the coefficient for one of the blood serum measurements is negative, the model learned that higher values of that measurement associate with slower progression.
The fact that these coefficients have interpretable meanings is a major advantage of linear regression for learning and for many real applications where model interpretability matters. In domains like healthcare, finance, and legal applications, being able to explain why a model made a particular prediction based on feature contributions can be as important as prediction accuracy itself. More complex models like neural networks might achieve slightly better accuracy but lack this interpretability.
Understanding What Training Accomplished
It is worth pausing to appreciate what just happened when we called fit. We provided the algorithm with examples of patients and their disease progression. The algorithm examined these examples and adjusted its coefficients to make predictions that match the progression values as closely as possible. Through this process, the model learned the pattern relating baseline measurements to future disease progression.
This learning happened automatically through mathematical optimization rather than through explicit programming of rules. We did not tell the model that BMI matters or how blood serum values relate to progression. The model discovered these relationships from data by finding coefficient values that minimize prediction errors. This automatic learning from examples is the essence of machine learning and the reason machine learning is powerful for problems where the relevant patterns are too complex or subtle to capture in hand-coded rules.
The model now encapsulates knowledge extracted from our training data. When we give it new patient measurements, it will apply the learned coefficients to make predictions about likely disease progression. These predictions reflect patterns the model learned from the training examples, generalized to new situations. Whether these predictions are accurate on new data is the question we will answer in the evaluation phase.
Making Predictions and Evaluating Performance
With our model trained, we use it to make predictions on our test set and evaluate how well those predictions match the actual progression values. This evaluation tells us whether the model successfully learned generalizable patterns rather than just memorizing training data.
Generating Predictions on Test Data
Making predictions is straightforward using the model's predict method. We call it with our scaled test features, and it returns predicted progression values for each test patient: y_pred = model.predict(X_test_scaled). This generates predictions for all test examples in one function call.
The y_pred array now contains predicted disease progression values, one for each patient in the test set. These are the model’s best guesses about progression based on baseline measurements, computed using the linear relationship the model learned during training. Comparing these predictions to the actual progression values in y_test tells us how accurate the model is.
We can look at a few predictions versus actual values to get a concrete sense of performance. We might print the first ten predictions and actuals with a loop or using array indexing. If we see that predictions are generally close to actuals, that is encouraging. If predictions are wildly different from actuals, that suggests the model failed to learn useful patterns.
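A simple way to generate the predictions and eyeball the first few against the actual values; the exact print formatting here is just one illustrative choice:

```python
# Predict progression for the held-out test patients
y_pred = model.predict(X_test_scaled)

# Compare the first ten predictions to the actual values
for actual, predicted in zip(y_test.values[:10], y_pred[:10]):
    print(f"actual: {actual:6.1f}   predicted: {predicted:6.1f}")
```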
Looking at individual predictions helps build intuition, but we need quantitative metrics to rigorously evaluate performance across all test examples. Different metrics capture different aspects of prediction quality, and understanding multiple metrics provides a complete picture of model performance.
Computing Evaluation Metrics
For regression problems, several standard metrics quantify prediction quality. Mean Absolute Error or MAE averages the absolute differences between predictions and actuals, giving an intuitive measure of typical prediction error in the same units as the target variable. Mean Squared Error or MSE averages the squared differences, penalizing larger errors more heavily than MAE. Root Mean Squared Error or RMSE is the square root of MSE, returning error magnitude to the original scale while retaining MSE’s property of penalizing large errors more. R-squared or the coefficient of determination measures what fraction of target variance the model explains, with one being perfect prediction and zero being no better than predicting the mean.
We compute these metrics using scikit-learn's metrics module. For MAE, we call metrics.mean_absolute_error(y_test, y_pred). For MSE, we call metrics.mean_squared_error(y_test, y_pred). For R-squared, we call metrics.r2_score(y_test, y_pred). Each function takes the actual values and predicted values and returns the metric value.
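Computing the metrics, with RMSE derived as the square root of MSE:

```python
# Quantify prediction quality on the test set
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```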
Looking at the computed metrics, we might see an MAE around 50, meaning predictions are typically off by about 50 units of disease progression. The MSE will be larger due to squaring, perhaps around 3,000. The R-squared might be around 0.5, meaning the model explains about half the variance in disease progression. These specific values depend on our random split and are less important than understanding what the metrics represent.
An R-squared of 0.5 means our model performs substantially better than simply predicting the mean for all patients but still has room for improvement. Disease progression varies by more than our model can explain from the baseline measurements alone, which makes sense given that disease progression depends on many factors including genetics, treatment adherence, lifestyle changes, and other variables not in our dataset. Our model provides useful predictive information from baseline measurements while acknowledging it cannot perfectly predict complex biological processes.
Interpreting Results in Context
Understanding whether our metrics indicate good or poor performance requires domain context. In medical prediction, an R-squared of 0.5 might be excellent if the phenomenon is inherently highly variable and hard to predict from available data. In other domains like physics where relationships are more deterministic, the same R-squared might indicate poor modeling.
For our diabetes progression task, predicting progression with even moderate accuracy from baseline measurements has potential value. Doctors could use these predictions to stratify patients into risk groups, focusing resources on patients predicted to have rapid progression. The model’s errors are large enough that individual predictions should be used cautiously, perhaps as one input among many in clinical decision-making rather than as definitive progression forecasts.
This interpretation highlights an important principle in applied machine learning. Models are tools that provide information, not oracles that provide certain truth. Understanding model limitations and using predictions appropriately given those limitations is as important as achieving good metric values. A model that provides useful but imperfect predictions deployed with appropriate caution can deliver value even when its accuracy is far from perfect.
Visualizing Predictions Versus Actuals
Beyond numerical metrics, visualizing predictions against actual values helps us understand model behavior and identify patterns in errors. A scatter plot with actual values on one axis and predicted values on the other shows how well predictions track reality.
We create this visualization with plt.scatter(y_test, y_pred, alpha=0.6). We add a diagonal line representing perfect prediction with plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linewidth=2). We add labels and a title with plt.xlabel('Actual Disease Progression'), plt.ylabel('Predicted Disease Progression'), and plt.title('Predictions vs Actuals'). Finally, we display the plot with plt.show().
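The complete plotting code for this figure:

```python
# Scatter of predictions against actuals, with a reference line for perfect prediction
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         color='red', linewidth=2)
plt.xlabel('Actual Disease Progression')
plt.ylabel('Predicted Disease Progression')
plt.title('Predictions vs Actuals')
plt.show()
```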
Looking at this scatter plot, points close to the diagonal line represent accurate predictions where predicted and actual values nearly match. Points far from the line represent larger errors. If points scatter randomly around the line with no obvious patterns, that suggests our model’s errors are unsystematic. If points consistently fall above or below the line in certain ranges, that indicates systematic bias where the model consistently over-predicts or under-predicts for certain target values.
For our linear regression model on this dataset, we typically see points scattered around the diagonal with more scatter than we would like but no strong systematic patterns. The scatter indicates the model makes errors, which we already knew from our metrics, but the lack of strong patterns suggests those errors are not systematically biased in ways that would indicate a fundamental problem with the model.
This visualization makes the abstract concept of model performance concrete. Seeing the actual points and their deviations from perfect prediction builds intuition about model behavior in ways that metric numbers alone do not convey. Developing the habit of visualizing predictions helps you catch problems and understand model behavior more deeply.
Understanding What We Accomplished
Having completed our first machine learning script from start to finish, it is valuable to step back and understand what we accomplished and how the pieces fit together.
The Complete Workflow
We followed a systematic workflow that appears in virtually every machine learning project. We started by understanding our problem and data. We loaded data into appropriate structures using pandas. We explored the data with summary statistics and visualizations to understand its characteristics. We preprocessed the data by separating features and target, splitting into training and test sets, and scaling features. We selected a model appropriate for our task. We trained the model on training data by calling its fit method. We used the trained model to make predictions on test data. We evaluated predictions using quantitative metrics and visualizations. We interpreted results in the context of our problem.
This workflow is not specific to linear regression or diabetes prediction. The same structure applies whether you are predicting disease progression, classifying images, forecasting sales, or recommending products. The specific preprocessing steps, model choice, and evaluation metrics change based on your problem type and data characteristics, but the overall flow from understanding through loading through preprocessing through training through evaluation remains constant. Having seen this complete flow once, you can recognize it and apply it to new problems.
What the Model Learned
Our model learned that baseline measurements contain information about future disease progression. Through examining training examples, it discovered that certain baseline values tend to associate with higher or lower progression. It encoded this discovered knowledge as coefficient values that weight each feature’s contribution to predictions. When given new baseline measurements, it combines them according to these learned weights to predict likely progression.
The model did not perfectly predict progression for test patients, but its predictions were substantially better than random guessing or always predicting the mean. This improvement over naive baselines demonstrates that the model learned genuine patterns from training data and successfully generalized those patterns to new data. The learning was automatic, emerging from the optimization process rather than from explicit rules we programmed.
The Role of Each Component
Understanding what each component contributed helps you appreciate the complete machine learning ecosystem. NumPy provided the array structure and mathematical operations underlying all numerical computation. Pandas gave us intuitive tools for loading and exploring data, making the initial data understanding phase much more convenient than working with raw arrays. Matplotlib enabled visualizations that revealed patterns and confirmed our understanding. Scikit-learn provided the model, training algorithm, evaluation metrics, and preprocessing tools, implementing complex functionality with simple interfaces.
Each component played its role in the workflow, and removing any of them would make the task substantially harder. Without pandas, data exploration would require more code. Without matplotlib, understanding data and results would be limited to examining numbers. Without scikit-learn, we would need to implement algorithms ourselves. The power of the Python data science ecosystem is that these well-designed, interoperable tools enable complex workflows with relatively little code.
From First Script to Future Projects
This first machine learning script provides a template for future work. When you encounter new machine learning problems, you can follow the same workflow structure we established here. Load and understand your specific data. Preprocess appropriately for your problem characteristics. Select a model suitable for your task type. Train on a training set. Evaluate on a test set. Interpret results in context.
The specific details change with different problems. Classification problems use different evaluation metrics than regression. Image data requires different preprocessing than tabular data. Some problems need more sophisticated models than linear regression. But the workflow skeleton remains constant, and having internalized it from this example, you can adapt it confidently to new situations.
Improving and Extending Your Script
Having completed a working baseline script, natural next steps involve improving model performance and extending functionality. While we will not implement all extensions in this tutorial, understanding what you might do next gives you a roadmap for continued learning.
Trying Different Models
Linear regression was a good starting point because of its simplicity and interpretability, but scikit-learn provides many other algorithms that might perform better on this data. You could try polynomial regression to capture nonlinear relationships. You could try decision trees or random forests, which automatically capture interactions between features. You could try support vector machines or neural networks for more flexible modeling.
The beauty of scikit-learn's consistent interface is that trying different models requires changing just a few lines of code. Instead of model = LinearRegression(), you might write model = RandomForestRegressor() after importing the appropriate class. The rest of your workflow including preprocessing, training with fit, predicting with predict, and evaluation remains identical. This consistency enables rapid experimentation to find the best model for your data.
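As a sketch of how small the change is, swapping in a random forest might look like the following; the default settings and the random_state value are illustrative assumptions, not tuned choices:

```python
from sklearn.ensemble import RandomForestRegressor

# Only the model creation line changes; fit, predict, and evaluation stay the same.
# random_state=42 is an arbitrary seed for reproducibility, not a tuned setting.
model = RandomForestRegressor(random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
print(metrics.r2_score(y_test, y_pred))
```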
Feature Engineering
Our model used the ten baseline features as provided, but creating new features from existing ones often improves performance. You might create interaction features multiplying pairs of existing features to capture relationships. You might create polynomial features including squares or higher powers of features. You might create domain-specific features based on medical knowledge, such as ratios or combinations of measurements that have clinical meaning.
Scikit-learn’s PolynomialFeatures class automates creating polynomial and interaction features. Adding this to your preprocessing pipeline can improve model performance by giving it more flexible features to work with. However, creating too many features risks overfitting, where the model learns training-specific noise rather than generalizable patterns, so feature engineering requires careful evaluation.
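A minimal sketch of how this might fit into our existing workflow, with degree 2 chosen purely for illustration:

```python
from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds squared terms and pairwise interactions (an illustrative choice)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

model = LinearRegression()
model.fit(X_train_poly, y_train)
print(metrics.r2_score(y_test, model.predict(X_test_poly)))
```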
Hyperparameter Tuning
Most models have hyperparameters, which are settings you specify before training that control model behavior. Linear regression has few hyperparameters, but models like random forests or neural networks have many including the number of trees or layers, regularization strength, and learning rates. Finding good hyperparameter values can substantially improve performance.
Scikit-learn’s GridSearchCV class automates hyperparameter search, trying all combinations of specified parameter values and selecting the combination that performs best in cross-validation. Using grid search ensures you systematically find good hyperparameters rather than relying on defaults or guesswork. However, grid search can be computationally expensive for complex models with many hyperparameters.
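Because linear regression has almost nothing to tune, the sketch below uses a random forest to illustrate the mechanics; the model choice and the grid values are illustrative assumptions rather than recommendations:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative placeholder grid; real projects choose values based on the problem
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
search.fit(X_train_scaled, y_train)

print(search.best_params_)  # the best combination found
print(search.best_score_)   # its mean cross-validation score
```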
Cross-Validation
Our single train-test split might not represent performance accurately if the split happened to be particularly easy or hard. Cross-validation evaluates models more robustly by splitting data multiple ways, training and evaluating on each split, and averaging results. This gives a more stable estimate of performance that is less sensitive to the particular random split.
Scikit-learn’s cross_val_score function performs cross-validation with a single function call, returning scores for each fold that you can average. Adopting cross-validation as your standard evaluation approach provides more reliable performance estimates than single splits, though it requires more computation since you train multiple models.
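A minimal sketch using five folds on our training data; the number of folds is a common but arbitrary choice:

```python
from sklearn.model_selection import cross_val_score

# Five-fold cross-validation of the linear model; for regressors the default
# score is R-squared
scores = cross_val_score(LinearRegression(), X_train_scaled, y_train, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # the averaged estimate
```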
Conclusion: From Theory to Practice
You have now completed your first machine learning script from absolute beginning to working end. You loaded real data, explored it to understand characteristics, preprocessed it appropriately, trained a model to learn patterns, and evaluated that model to understand its performance. More importantly, you experienced the complete workflow that generalizes to any machine learning project you will tackle in the future.
This transition from consuming theory to producing working code represents a crucial milestone in your learning journey. You moved from knowing that machine learning exists and understanding its concepts to actually doing machine learning yourself. The abstract idea of algorithms learning from data became concrete when you saw your own code train a model and that model make predictions. The workflow that might have seemed overwhelming when described abstractly now exists as working code you can run, modify, and extend.
The confidence this accomplishment builds cannot be overstated. You proved to yourself that you can write machine learning code that works. When you encounter new machine learning problems, you now have both the conceptual understanding and the practical template to tackle them. You understand not just what machine learning is but how to make it happen through code. This practical capability, combined with your theoretical foundation, positions you to learn advanced topics and work on real projects.
As you continue your machine learning journey, you will build many more scripts and projects. Some will be simple extensions of what you did here, trying different models or datasets. Others will introduce new concepts like classification, deep learning, or unsupervised learning. Each project will build on the foundation you established here, following similar workflows while adding new complexity. The pattern you learned of understanding problems, loading and preparing data, training models, and evaluating results remains constant even as the specific techniques become more sophisticated.
Welcome to the practical world of machine learning, where you write code that learns from data and produces useful predictions. Continue building projects, experimenting with different approaches, learning from both successes and failures, and progressively expanding your capabilities. The combination of solid theoretical understanding and hands-on coding experience makes you a capable machine learning practitioner ready to tackle real-world problems.







