Introduction to Data Warehousing Concepts

Learn data warehousing fundamentals: OLTP vs OLAP, star schema, dimension and fact tables, slowly changing dimensions, data lakes, lakehouses, and modern cloud warehouses explained clearly.

Introduction to Data Warehousing Concepts

A data warehouse is a centralized analytical database that consolidates data from multiple operational systems — sales platforms, CRM systems, product databases, financial tools — into a single, integrated store optimized for reporting and analysis rather than transaction processing. Data warehouses organize data into fact tables (which record events and measurements like sales transactions) and dimension tables (which provide context like customer, product, and date information), connected in a star schema pattern. Modern cloud data warehouses like BigQuery, Snowflake, and Redshift can store petabytes of data and execute analytical queries across billions of rows in seconds, making them the primary data source for data scientists in enterprise environments.

Introduction

Walk into any large technology company, financial institution, retailer, or healthcare organization and ask where their analytical data lives. The answer, almost universally, involves a data warehouse. It might be called different things — the analytics layer, the reporting database, the data platform, the gold layer — but the underlying concept is the same: a centralized repository where cleaned, integrated, historical data lives and where analysts, data scientists, and business intelligence tools come to ask questions.

For data scientists, understanding data warehousing concepts is not optional background knowledge — it is a prerequisite for effective work in almost any organizational setting. You need to know why the data is structured the way it is, what a fact table versus a dimension table is, why your query is joining to a “dim_” prefixed table, what the “SCD” columns mean, why some historical records have date ranges instead of single dates, and why aggregating across certain combinations of dimensions produces unexpected results.

This article provides the conceptual foundation: what data warehouses are and why they exist, the architectural patterns that organize warehouse data (star schema, snowflake schema), the building blocks of dimensional modeling (facts, dimensions, slowly changing dimensions), the modern evolution toward data lakes and lakehouses, and the practical skills for working effectively with warehouse data as a data scientist.

The Problem Data Warehouses Solve

To understand data warehouses, start with the problem they solve.

Operational Systems Are Not Designed for Analysis

Every organization runs operational systems — also called OLTP systems (Online Transaction Processing) — that handle day-to-day business activities:

  • An e-commerce platform records purchases, updates inventory, and processes payments
  • A CRM system tracks customer interactions and support tickets
  • A finance system processes invoices and payroll
  • A product database manages catalog information and pricing

These systems are optimized for a specific workload: many small, fast read-write transactions. When a customer places an order, the system must look up their address, check inventory, decrement stock, record the transaction, and send a confirmation — all in milliseconds.

OLTP databases are designed for this: normalized schemas (data split into many small tables to minimize redundancy), row-oriented storage (optimized for reading and writing complete records one at a time), and indices on primary keys for fast single-row lookups.

This design is terrible for analytical queries. Consider: “What was the month-over-month revenue growth by product category for each geographic region over the past three years, and how does it correlate with our marketing spend?” This query needs to:

  • Scan millions of transaction records
  • Join across orders, products, customers, and marketing tables
  • Aggregate across time, geography, and category dimensions
  • Compare across multiple time periods

Running this against a production OLTP database would lock tables, slow down real customers, and probably time out anyway. The query patterns are fundamentally different.

The OLTP vs. OLAP Distinction

CharacteristicOLTPOLAP (Data Warehouse)
Primary purposeRun the businessAnalyze the business
Query patternMany small reads/writesFew large reads, no writes
Data volumeCurrent stateHistorical accumulation
Schema designNormalized (many small tables)Denormalized (fewer wide tables)
OptimizationFast single-row operationsFast multi-row aggregations
UsersApplications, softwareAnalysts, data scientists, BI tools
Update frequencyConstant (every transaction)Batch (nightly or hourly)
Time horizonNow (current state)History (trends over time)
Example systemsPostgreSQL, MySQL, OracleBigQuery, Snowflake, Redshift

OLAP stands for Online Analytical Processing — queries that scan large volumes of data to compute aggregations, identify trends, and answer business questions.

Data warehouses are OLAP systems: they store historical data, are designed for complex analytical queries, and are separated from production OLTP systems to prevent analytical workloads from impacting operational performance.

The Architecture of a Data Warehouse

Data warehouses don’t just receive raw operational data — they transform and organize it. The classic architecture flows through several layers:

The Three-Layer Architecture

Plaintext
Source Systems                          Landing / Staging
(OLTP databases,            ──ETL──►    (Raw data, as-is from
 APIs, flat files,                       source systems, minimal
 event streams)                          transformation)


                                        Core / Integration
                                        (Cleansed, standardized,
                                         integrated data; business
                                         key mapping)


                                        Presentation / Data Mart
                                        (Star schemas, subject-area
                                         marts for analysts and BI)


                                        Analysts / Data Scientists
                                        BI Tools / Dashboards / ML Models

Staging layer: Raw data from source systems, loaded as-is with minimal transformation. Serves as an audit trail — you can always trace back to what was received from the source.

Core layer (also called the integration layer or ODS — Operational Data Store): Cleaned, standardized data. Business keys mapped to warehouse surrogate keys. Data quality rules applied. All sources integrated into a consistent model.

Presentation layer (also called the data mart layer): Dimensional models (star schemas) organized by subject area — a sales mart, a customer mart, a finance mart. Optimized for query performance and ease of use by analysts.

Modern cloud warehouses and dbt-based pipelines often use a similar three-layer pattern called bronze (raw) → silver (cleaned) → gold (aggregated/marts), which we’ll revisit in the data lakehouse section.

Dimensional Modeling: The Language of Data Warehouses

Dimensional modeling is the methodology for organizing data warehouse tables. Developed by Ralph Kimball and published in “The Data Warehouse Toolkit” (1996), it remains the dominant framework for analytical data modeling today.

The core insight: separate the measurements of business events (what happened, how much, how many) from the context of those events (who, what, where, when, why).

Measurements live in fact tables. Context lives in dimension tables.

Fact Tables: What Happened

A fact table records business events or measurements. Each row represents a specific event at the declared grain of the table.

Python
import pandas as pd

# Sales fact table (order-item grain — one row per line item per order)
sales_fact = pd.DataFrame({
    # Surrogate foreign keys — these are integers that join to dimension tables
    "order_item_key":  [1, 2, 3, 4, 5],
    "order_date_key":  [20240901, 20240901, 20240903, 20240905, 20240905],
    "customer_key":    [1001, 1001, 1002, 1003, 1003],
    "product_key":     [501, 502, 501, 503, 504],
    "store_key":       [201, 201, 201, 202, 202],
    "promotion_key":   [0, 0, 301, 0, 301],  # 0 = no promotion

    # Measures — the numeric "facts" being recorded
    "quantity_sold":   [2, 1, 3, 1, 2],
    "unit_price":      [149.99, 49.99, 149.99, 89.99, 29.99],
    "discount_amount": [0.00, 0.00, 15.00, 4.50, 0.00],
    "line_revenue":    [299.98, 49.99, 434.97, 85.49, 59.98],
    "cost_of_goods":   [90.00, 20.00, 90.00, 45.00, 12.00],
    "gross_profit":    [209.98, 29.99, 344.97, 40.49, 47.98]
})

print("Sales fact table:")
print(sales_fact)
print(f"\nGrain: one row per order line item (order_item_key is unique)")

Key properties of fact tables:

  • Rows represent events at a specific grain (defined by the combination of dimension keys)
  • Contain mostly foreign keys (linking to dimensions) and measures (numeric facts)
  • Can be very large — billions of rows for high-volume business processes
  • Typically narrow in the dimension key columns, with many measure columns
  • Additive measures (revenue, quantity) can be summed across all dimensions; semi-additive measures (inventory balance) only across some; non-additive measures (unit price, ratios) cannot be summed

Types of fact tables:

Plaintext
Transaction fact (finest grain): One row per event
├── Sales order line items
├── Web clickstream events
└── Financial transactions

Periodic snapshot: One row per entity per period
├── Account balances (one row per account per month)
├── Inventory levels (one row per product per day)
└── Customer health scores (one row per customer per week)

Accumulating snapshot: One row per process instance, updated as it progresses
├── Order fulfillment (one row per order, updated from placed → shipped → delivered)
├── Loan origination (one row per application, updated through stages)
└── Customer onboarding (one row per customer, tracks completion of steps)

Dimension Tables: The Context

Dimension tables provide the descriptive context that makes measurements meaningful. They answer the who, what, where, and when of each fact.

Python
# Customer dimension table
customer_dim = pd.DataFrame({
    # Surrogate key (warehouse-generated, never changes)
    "customer_key": [1001, 1002, 1003, 1004],

    # Natural/business key (from the source system)
    "customer_id":  ["CUST_001", "CUST_002", "CUST_003", "CUST_004"],

    # Descriptive attributes — lots of them
    "first_name":   ["Jane",   "Bob",   "Alice",  "Carlos"],
    "last_name":    ["Smith",  "Johnson","Williams","Garcia"],
    "email":        ["jane@email.com", "bob@email.com",
                     "alice@email.com", "carlos@email.com"],
    "city":         ["Austin", "Seattle", "Chicago", "Miami"],
    "state":        ["TX",     "WA",      "IL",      "FL"],
    "country":      ["USA",    "USA",     "USA",     "USA"],
    "region":       ["South",  "West",    "Midwest", "South"],
    "age_band":     ["35-44",  "25-34",   "45-54",   "25-34"],
    "income_band":  ["75-100K","50-75K",  "100K+",   "50-75K"],
    "customer_segment": ["Premium", "Standard", "Premium", "Standard"],
    "acquisition_channel": ["Organic", "Paid Search", "Referral", "Email"],

    # Record metadata (for slowly changing dimensions — discussed later)
    "effective_date": pd.to_datetime(["2022-03-14", "2021-11-28",
                                       "2023-01-05", "2022-07-19"]),
    "expiry_date":    pd.to_datetime(["9999-12-31", "9999-12-31",
                                       "9999-12-31", "9999-12-31"]),
    "is_current":     [True, True, True, True]
})

# Date dimension table — one row per calendar date, pre-populated years in advance
date_dim = pd.DataFrame({
    "date_key":    [20240901, 20240902, 20240903, 20240904, 20240905],
    "full_date":   pd.to_datetime(["2024-09-01", "2024-09-02", "2024-09-03",
                                    "2024-09-04", "2024-09-05"]),
    "day_of_week": ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday"],
    "is_weekend":  [True, False, False, False, False],
    "week_number": [35, 36, 36, 36, 36],
    "month_number":[9, 9, 9, 9, 9],
    "month_name":  ["September"] * 5,
    "quarter":     [3, 3, 3, 3, 3],
    "year":        [2024] * 5,
    "is_holiday":  [False, False, False, False, False],
    "fiscal_year": [2025] * 5,    # Company fiscal year may differ from calendar
    "fiscal_quarter": ["FQ1"] * 5
})

# Product dimension table
product_dim = pd.DataFrame({
    "product_key":      [501, 502, 503, 504],
    "product_id":       ["PROD_001", "PROD_002", "PROD_003", "PROD_004"],
    "product_name":     ["Wireless Headphones", "USB-C Hub", "Running Shoes", "Yoga Mat"],
    "brand":            ["SoundMax", "TechGear", "FitStride", "ZenFlex"],
    "subcategory":      ["Audio", "Accessories", "Athletic Footwear", "Fitness Equipment"],
    "category":         ["Electronics", "Electronics", "Apparel", "Sports"],
    "department":       ["Technology", "Technology", "Fashion", "Health"],
    "unit_cost":        [45.00, 20.00, 45.00, 12.00],
    "list_price":       [149.99, 49.99, 89.99, 29.99],
    "is_active":        [True, True, True, True]
})

print("Product dimension table:")
print(product_dim[["product_key", "product_name", "category", "list_price"]])

Key properties of dimension tables:

  • Relatively small compared to fact tables (thousands to millions of rows, not billions)
  • Wide — many descriptive columns (50-100+ columns is common in large dimensions)
  • Contain the surrogate key (warehouse-generated) and the natural/business key (from the source)
  • Include all the attributes users want to group by, filter on, or display in reports
  • The date dimension is special — pre-populated with every calendar date for decades, with all possible temporal attributes pre-computed

The Star Schema

The star schema is the most common dimensional model pattern. It gets its name from its visual structure: a fact table at the center with dimension tables radiating outward like the points of a star.

Plaintext
[date_dim]
                            |
                            | date_key
                            |
[store_dim] ──store_key── [sales_fact] ──product_key── [product_dim]
                            |
                            | customer_key
                            |
                        [customer_dim]
                            |
                            | promotion_key
                            |
                        [promotion_dim]
Python
import pandas as pd

# Star schema query: revenue by category and region for Q3 2024
def star_schema_query(sales_fact, date_dim, customer_dim, product_dim):
    """
    Demonstrate a star schema join — the fundamental analytical query pattern.
    """
    # Step 1: Join fact to date dimension to filter to Q3 2024
    with_date = sales_fact.merge(
        date_dim[["date_key", "quarter", "year", "month_name"]],
        on="date_key",
        how="inner"
    )
    q3_2024 = with_date[(with_date["year"] == 2024) &
                          (with_date["quarter"] == 3)]

    # Step 2: Join to customer dimension for geographic context
    with_customer = q3_2024.merge(
        customer_dim[["customer_key", "region", "customer_segment"]],
        on="customer_key",
        how="inner"
    )

    # Step 3: Join to product dimension for category context
    with_product = with_customer.merge(
        product_dim[["product_key", "category", "department"]],
        on="product_key",
        how="inner"
    )

    # Step 4: Aggregate — this is what the analyst actually wants
    result = (
        with_product
        .groupby(["region", "category"])
        .agg(
            total_revenue   = ("line_revenue",  "sum"),
            total_units     = ("quantity_sold",  "sum"),
            gross_profit    = ("gross_profit",   "sum"),
            n_transactions  = ("order_item_key", "count")
        )
        .reset_index()
        .sort_values("total_revenue", ascending=False)
    )

    result["margin_pct"] = (result["gross_profit"] / result["total_revenue"] * 100).round(1)

    return result

revenue_by_region_category = star_schema_query(
    sales_fact, date_dim, customer_dim, product_dim
)
print(revenue_by_region_category)

Why the Star Schema Works for Analysis

The star schema trades storage efficiency (denormalization means some redundancy) for query simplicity and performance:

Simple queries: Every analytical question follows the same pattern — start at the fact table, join to the relevant dimensions, filter and aggregate. Analysts learn one pattern that works everywhere.

Fast queries: Modern columnar warehouses execute star schema joins extremely efficiently. The fact table is columnar (fast scans), dimension tables fit in memory or are cached (fast lookups), and the query optimizer knows how to handle this well-understood pattern.

Self-documenting: The naming conventions (dim_customer, fact_sales) make the model’s intent obvious. Foreign keys in the fact table clearly point to their dimension tables.

The Snowflake Schema

A snowflake schema normalizes the dimension tables further — subcategory is a separate table joined to category, city is a separate table joined to state, which is joined to country. This reduces storage redundancy but requires more joins in queries and is generally harder to use.

Plaintext
                   [dim_product]
                         |
                product_category_key
                         |
               [dim_product_category]
                         |
                   department_key
                         |
                  [dim_department]

Most practitioners prefer star schemas (denormalized dimensions) over snowflake schemas because:

  • Storage is cheap; query complexity is expensive
  • Fewer joins mean simpler queries and better performance
  • Dimension tables are small — the redundancy cost is minimal

Surrogate Keys vs. Natural Keys

One of the most important (and initially confusing) design patterns in dimensional modeling is the use of surrogate keys.

Natural key: The business identifier from the source system. customer_id = "CUST_001", product_sku = "HDPH-PRO-WH". These are meaningful but can change, can be recycled, and may not be unique across multiple source systems.

Surrogate key: A warehouse-generated integer that uniquely identifies each dimension record. customer_key = 1001. Has no business meaning — just a join key.

Python
import pandas as pd

# Why surrogate keys matter: the same natural key can mean different things over time

# Original customer record
customer_v1 = {
    "customer_key": 1001,         # Surrogate: never changes
    "customer_id":  "CUST_001",   # Natural key: stays the same
    "name":         "Jane Smith",
    "city":         "Austin",
    "segment":      "Standard",
    "effective_date": "2022-03-14",
    "expiry_date":    "2024-06-30",
    "is_current":     False
}

# Customer moved and was promoted — NEW surrogate key for the new version
customer_v2 = {
    "customer_key": 5823,         # New surrogate key for new record
    "customer_id":  "CUST_001",   # Same natural key (same person)
    "name":         "Jane Smith-Torres",
    "city":         "San Francisco",
    "segment":      "Premium",
    "effective_date": "2024-07-01",
    "expiry_date":    "9999-12-31",
    "is_current":     True
}

# Historical sales facts JOIN to customer_key=1001 → "Jane was Standard in Austin"
# Recent sales facts JOIN to customer_key=5823 → "Jane is Premium in San Francisco"
# Natural key alone can't distinguish these two states

Slowly Changing Dimensions (SCD)

Dimension attributes change over time — customers move, products are recategorized, employees change departments. Slowly Changing Dimensions (SCDs) define how to handle these changes in a warehouse.

SCD Type 1: Overwrite (No History)

The simplest approach: just update the record. No history is preserved.

Python
# SCD Type 1: Update in place — lose the history
# BEFORE: Jane lives in Austin, segment=Standard
customer_dim.loc[customer_dim["customer_key"] == 1001, "city"] = "San Francisco"
customer_dim.loc[customer_dim["customer_key"] == 1001, "segment"] = "Premium"
# AFTER: Jane now shows San Francisco/Premium for ALL historical facts too
# Historical analysis is permanently affected — intended for corrections, not real changes

Use SCD Type 1 when: the change is a correction (a typo was fixed), when history genuinely doesn’t matter (a field being added retroactively), or when storage is very limited.

SCD Type 2: Add New Row (Full History)

The most commonly used SCD type: add a new row for each change, preserving the old row with its effective date range.

Python
import pandas as pd

def scd_type_2_update(
    dim_df: pd.DataFrame,
    natural_key_col: str,
    natural_key_value: str,
    updates: dict,
    change_date: str,
    surrogate_key_col: str = "surrogate_key"
) -> pd.DataFrame:
    """
    Apply a SCD Type 2 update to a dimension table.

    Expires the current record (sets expiry_date and is_current=False)
    and inserts a new record with the updated attribute values.

    Parameters
    ----------
    dim_df : pd.DataFrame
        The dimension table to update.
    natural_key_col : str
        Column containing the business/natural key.
    natural_key_value : str
        Value of the natural key for the record being updated.
    updates : dict
        Dict of {column_name: new_value} for changed attributes.
    change_date : str
        Date the change became effective (ISO format string).
    surrogate_key_col : str
        Name of the surrogate key column.

    Returns
    -------
    pd.DataFrame
        Updated dimension table with expired old record and new current record.
    """
    dim_df = dim_df.copy()
    change_ts = pd.Timestamp(change_date)

    # Find the current active record
    current_mask = (
        (dim_df[natural_key_col] == natural_key_value) &
        (dim_df["is_current"] == True)
    )
    current_record = dim_df[current_mask].iloc[0].to_dict()

    # Expire the current record
    dim_df.loc[current_mask, "expiry_date"] = change_ts - pd.Timedelta(days=1)
    dim_df.loc[current_mask, "is_current"]  = False

    # Create the new record
    new_record = current_record.copy()
    new_record.update(updates)
    new_record["effective_date"] = change_ts
    new_record["expiry_date"]    = pd.Timestamp("9999-12-31")
    new_record["is_current"]     = True
    # Generate new surrogate key
    new_record[surrogate_key_col] = dim_df[surrogate_key_col].max() + 1

    # Append new record
    new_row_df = pd.DataFrame([new_record])
    dim_df = pd.concat([dim_df, new_row_df], ignore_index=True)

    return dim_df


# Apply SCD Type 2: Jane moved to San Francisco and was upgraded to Premium
customer_dim_updated = scd_type_2_update(
    customer_dim,
    natural_key_col="customer_id",
    natural_key_value="CUST_001",
    updates={"city": "San Francisco", "state": "CA",
              "region": "West", "customer_segment": "Premium"},
    change_date="2024-07-01",
    surrogate_key_col="customer_key"
)

jane_records = customer_dim_updated[
    customer_dim_updated["customer_id"] == "CUST_001"
][["customer_key", "city", "customer_segment", "effective_date", "expiry_date", "is_current"]]

print("Jane Smith's dimension history after SCD Type 2 update:")
print(jane_records.to_string(index=False))

Output:

Plaintext
customer_key     city  customer_segment effective_date  expiry_date  is_current
        1001   Austin          Standard     2022-03-14   2024-06-30       False
        1005  San Francisco      Premium    2024-07-01   9999-12-31        True

Now historical sales facts that join to customer_key=1001 correctly show Jane as “Standard/Austin” for her old purchases, while recent sales join to customer_key=1005 and show “Premium/San Francisco.”

SCD Type 3: Add New Column (Limited History)

Adds a new column to preserve the previous value — only one level of history:

Python
# SCD Type 3: Keep both current and previous values
customer_dim_type3 = pd.DataFrame({
    "customer_key":         [1001],
    "customer_id":          ["CUST_001"],
    "name":                 ["Jane Smith"],
    # Current values
    "current_city":         ["San Francisco"],
    "current_segment":      ["Premium"],
    # Previous values (one level of history only)
    "previous_city":        ["Austin"],
    "previous_segment":     ["Standard"],
    # When the change happened
    "last_change_date":     pd.to_datetime(["2024-07-01"])
})

SCD Type 3 is simple but limited — it only stores the most recent previous value. If Jane moved three times, only the last previous city is preserved. Use it when you only need to compare current vs. previous and space is very constrained.

Date Dimension: The Special Case

The date dimension is the most universally present dimension in any warehouse. Unlike other dimensions, it’s static — every possible date is pre-populated at warehouse setup time.

Python
import pandas as pd
import numpy as np

def build_date_dimension(start_date: str = "2020-01-01",
                          end_date: str   = "2030-12-31") -> pd.DataFrame:
    """
    Build a comprehensive date dimension table.

    Pre-computes every possible temporal attribute for every date
    so that queries can filter and group without date functions.

    Parameters
    ----------
    start_date, end_date : str
        Date range for the dimension (cover all historical + future dates).

    Returns
    -------
    pd.DataFrame
        One row per calendar date with all temporal attributes.
    """
    dates = pd.date_range(start=start_date, end=end_date, freq="D")

    df = pd.DataFrame({
        # Surrogate key as integer YYYYMMDD — easy to remember and use
        "date_key":         dates.strftime("%Y%m%d").astype(int),
        "full_date":        dates,

        # Calendar attributes
        "year":             dates.year,
        "quarter":          dates.quarter,
        "month_number":     dates.month,
        "month_name":       dates.strftime("%B"),
        "month_abbrev":     dates.strftime("%b"),
        "week_of_year":     dates.isocalendar().week.astype(int),
        "day_of_year":      dates.day_of_year,
        "day_of_month":     dates.day,
        "day_of_week":      dates.dayofweek,     # 0=Mon
        "day_name":         dates.strftime("%A"),
        "day_abbrev":       dates.strftime("%a"),

        # Boolean flags
        "is_weekday":       dates.dayofweek < 5,
        "is_weekend":       dates.dayofweek >= 5,
        "is_month_start":   dates.is_month_start,
        "is_month_end":     dates.is_month_end,
        "is_quarter_start": dates.is_quarter_start,
        "is_quarter_end":   dates.is_quarter_end,
        "is_year_start":    dates.is_year_start,
        "is_year_end":      dates.is_year_end,

        # Formatted strings for display
        "year_month":       dates.strftime("%Y-%m"),
        "year_quarter":     dates.strftime("%Y") + "-Q" + dates.quarter.astype(str),
        "month_year_label": dates.strftime("%b %Y"),

        # Fiscal year (example: fiscal year starts October 1)
        "fiscal_year":      np.where(dates.month >= 10,
                                      dates.year + 1, dates.year),
        "fiscal_quarter":   np.select(
            [dates.month.isin([10,11,12]), dates.month.isin([1,2,3]),
             dates.month.isin([4,5,6])],
            [1, 2, 3], default=4
        )
    })

    return df

date_dim_full = build_date_dimension("2020-01-01", "2030-12-31")
print(f"Date dimension: {len(date_dim_full):,} rows")
print(date_dim_full.loc[date_dim_full["full_date"] == "2024-09-15"].T)

The date dimension is used in every analytical query involving time. Pre-computing is_weekend, fiscal_quarter, month_name, and hundreds of other attributes eliminates the need for date functions in analytical queries — queries join to the date dimension and filter like WHERE is_weekend = TRUE rather than computing day-of-week at query time.

Data Lakes and Lakehouses

While the traditional data warehouse model has served organizations well, the past decade has seen significant architectural evolution driven by the rise of big data and cloud storage.

The Traditional Warehouse Limitations

Traditional data warehouses had real constraints:

  • Schema-on-write: Data must conform to a defined schema before loading. Unstructured or semi-structured data (text, images, logs) couldn’t be stored.
  • Expensive scaling: Adding storage and compute meant adding hardware — expensive and slow.
  • One team controls it: Centralized governance meant data teams controlled what could be analyzed and by whom.
  • Not for ML: Training machine learning models on raw logs or unstructured data required exporting to a separate system.

The Data Lake

A data lake stores raw data of any type — structured, semi-structured (JSON, XML), unstructured (text, images, audio) — in cheap cloud object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage) in whatever format it arrives.

The core idea: schema-on-read — define the schema when you query the data, not when you store it. This gives maximum flexibility: store everything now, figure out how to use it later.

Plaintext
Data Lake (e.g., Amazon S3)
├── raw/
│   ├── transactions/year=2024/month=09/day=15/txns_001.parquet
│   ├── clickstream/2024-09-15/events_000.jsonl
│   ├── customer_uploads/profile_photos/
│   └── third_party/weather_data/station_readings_2024.csv
├── processed/
│   ├── transactions_cleaned/
│   └── customer_features/
└── ml_datasets/
    ├── churn_training_v3/
    └── recommendation_training/

Data lakes are cheap and flexible, but they create problems: data quality is inconsistent, there’s no governance about what’s in there, data scientists might work from stale or corrupted data, and query performance can be poor on raw JSON/CSV.

The Data Lakehouse

The data lakehouse architecture (popularized by Databricks and reflected in Delta Lake, Apache Iceberg, and Apache Hudi) combines the flexibility of a data lake with the reliability and performance of a data warehouse.

The key innovation is a transactional metadata layer on top of object storage that provides:

  • ACID transactions: Consistent reads and writes, even with concurrent users
  • Schema enforcement: Optional: define a schema and reject data that doesn’t conform
  • Time travel: Query historical versions of the data
  • Streaming + batch: Handle both real-time events and batch loads in the same table
Python
# Delta Lake example (Apache Spark + Delta)
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

# Write a Delta table
df.write.format("delta").save("s3://bucket/delta/transactions/")

# ACID merge (upsert) — not possible with plain Parquet
delta_table = DeltaTable.forPath(spark, "s3://bucket/delta/transactions/")
delta_table.alias("existing").merge(
    new_data.alias("updates"),
    "existing.transaction_id = updates.transaction_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel: query yesterday's version
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2024-09-14") \
    .load("s3://bucket/delta/transactions/")

# Query a specific version
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://bucket/delta/transactions/")

The Modern Data Stack

Today’s data infrastructure often combines:

Plaintext
Source Systems (OLTP)


Ingestion (Fivetran, Airbyte, Kafka)


Cloud Object Storage (S3, GCS) — the data lake


Cloud Data Warehouse (Snowflake, BigQuery, Redshift)
  OR Data Lakehouse (Databricks, Delta Lake, Apache Iceberg)


Transformation (dbt — SQL-based transformation tool)


Presentation Layer (dimensional models, marts)

        ├──► BI Tools (Tableau, Looker, Power BI)
        ├──► Analytics (SQL + Python notebooks)
        └──► ML (Feature stores, model training)

dbt (data build tool) deserves special mention: it is the most widely adopted transformation tool in the modern data stack, allowing data teams to write SQL SELECT statements that define data models, with version control, testing, documentation, and dependency management built in.

Working with Warehouse Data as a Data Scientist

Understanding warehouse architecture makes you more effective at extracting data for analysis and modeling.

Navigating a New Warehouse

Python
import pandas as pd
from sqlalchemy import create_engine, inspect, text

engine = create_engine("postgresql+psycopg2://user:pass@warehouse/analytics")

def explore_warehouse_schema(engine, schema: str = "public") -> pd.DataFrame:
    """
    List all tables in a schema with row counts and key columns.
    Helps understand available data in a new warehouse environment.
    """
    inspector = inspect(engine)
    tables = inspector.get_table_names(schema=schema)

    table_info = []
    for table in sorted(tables):
        try:
            with engine.connect() as conn:
                count = conn.execute(
                    text(f'SELECT COUNT(*) FROM "{schema}"."{table}"')
                ).scalar()
        except Exception:
            count = None

        columns = inspector.get_columns(table, schema=schema)
        col_names = [c["name"] for c in columns[:5]]  # First 5 columns

        table_info.append({
            "table":        table,
            "row_count":    count,
            "column_count": len(columns),
            "sample_columns": ", ".join(col_names)
        })

    return pd.DataFrame(table_info)


# Identify dimensional model tables by naming convention
schema_df = explore_warehouse_schema(engine, schema="analytics")

fact_tables = schema_df[schema_df["table"].str.startswith("fact_")]
dim_tables  = schema_df[schema_df["table"].str.startswith("dim_")]

print("Fact tables:")
print(fact_tables[["table", "row_count", "column_count"]])
print("\nDimension tables:")
print(dim_tables[["table", "row_count", "column_count"]])

Getting Current Records from SCD Type 2 Dimensions

Python
# When querying a SCD Type 2 dimension, always filter to current records
# unless you specifically need historical analysis

# WRONG: May get multiple rows per customer if SCD Type 2 is used
df = pd.read_sql("SELECT * FROM dim_customer", engine)
# Might have Jane appearing twice — once for Austin/Standard, once for SF/Premium

# CORRECT: Filter to current records for current-state analysis
df = pd.read_sql("""
    SELECT
        customer_key,
        customer_id,
        first_name || ' ' || last_name AS full_name,
        city, state, region,
        customer_segment
    FROM dim_customer
    WHERE is_current = TRUE
""", engine)

# HISTORICAL ANALYSIS: Use surrogate keys and date ranges
historical_query = """
    SELECT
        f.line_revenue,
        f.quantity_sold,
        c.customer_segment,  -- Segment at time of purchase
        c.city,              -- City at time of purchase
        d.year,
        d.month_name
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
        -- SCD Type 2: the fact's surrogate key points to the dimension
        -- record that was current when the transaction occurred
    JOIN dim_date d ON f.order_date_key = d.date_key
    WHERE d.year = 2024
"""
# This automatically gets the right customer attributes for each historical sale

Building ML Features from Warehouse Data

Python
# The typical data scientist's warehouse query — building a feature table
ML_FEATURES_QUERY = """
WITH customer_orders AS (
    SELECT
        c.customer_key,
        c.customer_id,
        c.customer_segment,
        c.region,
        COUNT(DISTINCT f.order_item_key)        AS n_line_items,
        COUNT(DISTINCT f.order_date_key)        AS n_order_days,
        SUM(f.line_revenue)                     AS lifetime_revenue,
        SUM(f.quantity_sold)                    AS total_units,
        AVG(f.line_revenue)                     AS avg_line_value,
        MIN(d.full_date)                        AS first_order_date,
        MAX(d.full_date)                        AS last_order_date,
        -- Recency
        DATEDIFF('day', MAX(d.full_date), CURRENT_DATE) AS recency_days,
        -- Category diversity
        COUNT(DISTINCT p.category)              AS distinct_categories
    FROM dim_customer c
    JOIN fact_sales f   ON c.customer_key   = f.customer_key
    JOIN dim_date   d   ON f.order_date_key = d.date_key
    JOIN dim_product p  ON f.product_key    = p.product_key
    WHERE c.is_current = TRUE
      AND d.full_date >= DATEADD('year', -2, CURRENT_DATE)
    GROUP BY
        c.customer_key, c.customer_id,
        c.customer_segment, c.region
)
SELECT
    *,
    lifetime_revenue / NULLIF(n_order_days, 0) AS avg_daily_spend,
    n_line_items     / NULLIF(n_order_days, 0) AS avg_items_per_order
FROM customer_orders
WHERE n_line_items >= 2  -- At least 2 purchases for meaningful features
"""

df_features = pd.read_sql(ML_FEATURES_QUERY, engine)
print(f"Feature table: {df_features.shape}")
print(df_features.dtypes)

Summary

Data warehouses exist because operational systems are optimized for transactions, not for analysis. The core architectural patterns — staging to presentation layers, fact tables recording events, dimension tables providing context, star schemas connecting them, and slowly changing dimensions preserving history — are the vocabulary of professional data work. Almost every large organization’s analytical infrastructure uses some version of these patterns.

The dimensional modeling concepts (grain, fact types, SCD Types 1-2-3, surrogate keys, date dimensions) developed by Ralph Kimball in the 1990s remain remarkably relevant today — they are visible in the table structures of Snowflake schemas, the naming conventions in dbt models, and the architecture discussions at any data engineering conference. The vocabulary has extended with data lakes (flexible storage for all data types) and lakehouses (adding warehouse reliability to lake storage), but the fundamental goal — a reliable, queryable, historical store of business events with rich dimensional context — remains unchanged.

As a data scientist, understanding these concepts means you can navigate warehouse schemas confidently, write correct SQL that accounts for SCD history, build accurate feature tables for machine learning, and have productive conversations with the data engineers who maintain the infrastructure you depend on.

Key Takeaways

  • OLTP systems (operational databases) are optimized for fast single-row reads and writes to run the business; OLAP systems (data warehouses) are optimized for large analytical queries scanning millions of rows to analyze the business — the two patterns require fundamentally different architectures
  • Fact tables record business events at a specific grain (what happened, how much, how many) and consist mostly of foreign keys and numeric measures; dimension tables provide the context (who, what, where, when) and consist of descriptive attributes
  • The star schema connects one fact table at the center to multiple dimension tables radiating outward — it is the dominant analytical data model because it produces simple, performant, self-documenting queries
  • Surrogate keys are warehouse-generated integers that uniquely identify dimension records; they are essential for SCD Type 2 history, where multiple rows for the same entity (different surrogate keys, same natural key) represent the entity’s state at different points in time
  • SCD Type 1 overwrites attribute values (no history); SCD Type 2 adds new rows with date ranges (full history — the most common approach); SCD Type 3 adds columns for previous values (one level of history)
  • The date dimension is pre-populated for all dates and pre-computes every temporal attribute (day name, fiscal quarter, is_weekend, etc.) so analytical queries can filter and group using simple equality conditions rather than date functions
  • Data lakes store any type of data cheaply in object storage with schema-on-read flexibility; data lakehouses add ACID transactions, schema enforcement, and time travel on top of lake storage — Delta Lake, Apache Iceberg, and Apache Hudi are the leading implementations
  • When querying SCD Type 2 dimensions, always filter to WHERE is_current = TRUE for current-state analysis; for historical analysis, join fact table surrogate keys to dimension records — the surrogate key automatically points to the correct dimensional state at the time of the event
Share:
Subscribe
Notify of
0 Comments

Discover More

NPN versus PNP Transistors: How They Differ and When to Use Each

NPN versus PNP Transistors: How They Differ and When to Use Each

Master the difference between NPN and PNP transistors—polarity, current flow, biasing, circuit configurations—and know exactly…

Java Control Flow: if, else, and switch Statements

Learn the fundamentals of Java control flow, including if-else statements, switch cases and loops. Optimize…

Color Theory for Data Visualization: Using Color Effectively in Charts

Learn how to use color effectively in data visualization. Explore color theory, best practices, and…

Understanding AC versus DC: Why Your Wall Outlet and Battery Work Differently

Discover the crucial differences between AC and DC electricity. Learn why batteries provide DC, wall…

Introduction to Data Warehousing Concepts

Introduction to Data Warehousing Concepts

Learn data warehousing fundamentals: OLTP vs OLAP, star schema, dimension and fact tables, slowly changing…

The Difference Between Analog and Digital Signals Explained Visually

Learn the fundamental differences between analog and digital signals through clear visual explanations. Understand continuous…

Click For More
0
Would love your thoughts, please comment.x
()
x