Introduction to Time Series Data

Learn the fundamentals of time series data in Python. Master datetime indexing, trend, seasonality, stationarity, autocorrelation, resampling, rolling statistics, and time series decomposition.

A time series is a sequence of data points recorded at successive points in time: stock prices, daily temperatures, monthly sales, server request rates, heart rate readings. Unlike standard tabular data, where rows are treated as independent observations, time series data has a crucial extra property: temporal ordering matters, and observations are correlated with their neighbors in time. This dependence on time creates unique analytical opportunities (forecasting, trend detection, seasonality analysis) and unique challenges (non-stationarity, autocorrelation, leakage) that require specialized tools and techniques beyond standard machine learning.

Introduction

Time series data is everywhere. Every metric that an organization tracks over time — revenue, user count, inventory levels, server latency, energy consumption, sensor readings — is a time series. Financial markets generate billions of time series data points every day. IoT sensors produce continuous streams of time-stamped readings. Web analytics platforms track pageviews hour by hour. Healthcare systems record patient vitals minute by minute.

For data scientists, time series analysis is both a domain with its own rich methodology and a set of problems that appear in virtually every industry. Learning to work with time series data opens access to forecasting (predicting future values), anomaly detection (identifying unusual events), and pattern recognition (understanding seasonal cycles and long-term trends) — all capabilities in high demand across business, science, and engineering.

This article introduces time series data from the ground up: what makes it different from standard tabular data, how to represent and index it in Python with pandas, the key structural components (trend, seasonality, cycles, noise), the fundamental concept of stationarity, and the essential analytical techniques — resampling, rolling statistics, lag features, autocorrelation, and decomposition — that every data scientist working with time data needs to know.

What Makes Time Series Data Different

Before exploring the techniques, it’s worth being precise about what distinguishes time series data from ordinary tabular data — because the differences have direct consequences for how you analyze it.

Temporal Order Is Meaningful

In a standard customer dataset, row 1 and row 2 are independent observations. The fact that Jane Smith appears before Bob Johnson in the DataFrame carries no information. You can shuffle the rows without changing anything meaningful.

In a time series, the order is the data. The stock price on Monday is meaningfully related to the price on Tuesday — you cannot shuffle the rows of a time series without destroying that relationship. Every time series technique must respect and exploit this ordering.

Observations Are Correlated in Time

In standard regression, we typically assume observations are independent. Time series data violates this assumption: today’s temperature is correlated with yesterday’s temperature, this month’s sales are correlated with last month’s sales. This serial correlation (called autocorrelation) is both a challenge (it violates standard statistical assumptions) and an opportunity (past values help predict future values).
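
To make the idea concrete, here is a minimal sketch using a made-up random-walk series (the name `temps` and all values are illustrative assumptions). It compares the lag-1 autocorrelation of the series in its original order with the same values after shuffling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical temperature-like series: a random walk, so each value
# stays close to the one before it.
temps = pd.Series(rng.normal(0, 1, 1000).cumsum())

# The same values in a random order.
shuffled = pd.Series(rng.permutation(temps.to_numpy()))

ordered_lag1 = temps.autocorr(lag=1)
shuffled_lag1 = shuffled.autocorr(lag=1)

print(f"Lag-1 autocorrelation, original order: {ordered_lag1:.3f}")
print(f"Lag-1 autocorrelation, shuffled:       {shuffled_lag1:.3f}")
```

The ordered series should show a lag-1 autocorrelation close to 1, while the shuffled copy should sit near 0: the values are identical, but the information carried by their order is gone.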

The Goal Is Often Forecasting

Standard ML asks: “Given these features, what is the target?” Time series analysis often asks: “Given the history of this variable, what will happen next?” Forecasting future values, detecting anomalies relative to expected behavior, and understanding the mechanisms driving change over time are central goals that standard ML doesn’t directly address.

Train/Test Split Must Respect Time

In standard ML, you randomly split data into train and test sets. In time series, you must never do this — randomly sampling test points from the middle of a time series leaks future information into your training set. The test set must always be the most recent period; the train set is everything before it.

Plaintext
Standard ML: random split (OK)
[●●○●●○●●○●○●●●○●] ← ○ = test, ● = train (randomly assigned)

Time series: temporal split (required)
[●●●●●●●●●●○○○○○] ← train on past, test on recent future
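
A temporal split like the one in the diagram can be sketched in a few lines of pandas. The series below is a placeholder and the 80/20 ratio is an assumption chosen for illustration, not a rule:

```python
import numpy as np
import pandas as pd

# Placeholder daily series with 100 observations.
dates = pd.date_range("2024-01-01", periods=100, freq="D")
series = pd.Series(np.arange(100, dtype=float), index=dates)

# Hold out the most recent 20% of observations as the test period.
split_point = int(len(series) * 0.8)
train = series.iloc[:split_point]
test = series.iloc[split_point:]

print(f"Train: {train.index.min().date()} to {train.index.max().date()} ({len(train)} rows)")
print(f"Test:  {test.index.min().date()} to {test.index.max().date()} ({len(test)} rows)")

# Sanity check: every training timestamp precedes every test timestamp.
assert train.index.max() < test.index.min()
```

Slicing by position on a sorted index guarantees that nothing from the test period can end up in the training data.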

Time Series Data in Python: The DatetimeIndex

pandas is the primary tool for time series work in Python. Its DatetimeIndex — a specialized index built for datetime values — provides the foundation for all time series operations.

Creating Time Series Data

Python
import pandas as pd
import numpy as np

# ── Method 1: Create from scratch with pd.date_range ──────────────
# Daily data for one year
dates = pd.date_range(start="2024-01-01", end="2024-12-31", freq="D")
print(f"Days: {len(dates)}")  # 366 (2024 is a leap year)

# Hourly data for one week
hourly = pd.date_range(start="2024-01-01", periods=168, freq="h")

# Business days only
biz_days = pd.date_range(start="2024-01-01", end="2024-03-31", freq="B")

# Monthly frequency (month start)
months = pd.date_range(start="2020-01", end="2024-12", freq="MS")

# Create a Series with DatetimeIndex
np.random.seed(42)
sales = pd.Series(
    data=np.random.normal(loc=1000, scale=150, size=len(dates)).cumsum() + 50000,
    index=dates,
    name="daily_sales"
)

print(sales.head())
# Output: a cumulative series that starts near 51,000 and rises by
# roughly 1,000 per day (exact values depend on the random draws)
print(sales.index.dtype)  # datetime64[ns]

# ── Method 2: Parse dates from a CSV ──────────────────────────────
df = pd.read_csv("data/sales.csv", parse_dates=["date"], index_col="date")
df.index = pd.to_datetime(df.index)   # Ensure DatetimeIndex
print(df.index.dtype)    # datetime64[ns]

# ── Method 3: Convert an existing column to DatetimeIndex ─────────
df = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-02", "2024-01-03"],
    "revenue": [45200.0, 48100.0, 43700.0]
})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
print(df.index)  # DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03'], dtype='datetime64[ns]', freq=None)

Selecting and Slicing by Time

One of the most powerful features of DatetimeIndex is natural time-based indexing:

Python
import pandas as pd
import numpy as np

np.random.seed(42)
dates = pd.date_range("2023-01-01", "2024-12-31", freq="D")
sales = pd.Series(np.random.normal(1000, 150, len(dates)), index=dates, name="sales")

# Select by string (partial datetime matching)
sales["2024"]                           # All of 2024
sales["2024-03"]                        # All of March 2024
sales["2024-03-15"]                     # Just March 15, 2024

# Slice ranges
sales["2024-01":"2024-06"]             # January through June 2024
sales["2023-07-01":"2023-09-30"]       # Exact date range

# Using .loc with datetime strings
sales.loc["2024-Q1"]                    # Q1 2024 via a quarter string (parsed as January–March)
sales.loc["2024-01-01":"2024-03-31"]   # Q1 2024 by date range

# Select specific attributes from the index
print(sales.index.year.unique())        # [2023, 2024]
print(sales.index.month.unique())       # [1, 2, ..., 12]
print(sales.index.dayofweek.unique())   # [0, 1, 2, 3, 4, 5, 6] (0=Mon)
print(sales.index.quarter.unique())     # [1, 2, 3, 4]

# Filter by index properties
weekdays_only = sales[sales.index.dayofweek < 5]  # Monday–Friday
q4_only       = sales[sales.index.quarter == 4]    # October–December

The .dt Accessor for Datetime Columns

When dates are in a column (not the index), use the .dt accessor:

Python
df = pd.DataFrame({
    "transaction_date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "amount": np.random.uniform(10, 500, 100)
})

# Extract temporal components from a datetime column
df["year"]        = df["transaction_date"].dt.year
df["month"]       = df["transaction_date"].dt.month
df["day"]         = df["transaction_date"].dt.day
df["day_of_week"] = df["transaction_date"].dt.dayofweek  # 0=Mon, 6=Sun
df["day_name"]    = df["transaction_date"].dt.day_name() # "Monday", etc.
df["week"]        = df["transaction_date"].dt.isocalendar().week
df["quarter"]     = df["transaction_date"].dt.quarter
df["is_weekend"]  = df["transaction_date"].dt.dayofweek >= 5
df["is_month_end"]= df["transaction_date"].dt.is_month_end

print(df.head())

The Four Components of a Time Series

Most real-world time series can be decomposed into four structural components. Understanding these components is the foundation of time series analysis.

1. Trend

The trend is the long-term direction of the series: is it generally increasing, decreasing, or flat over time? Trends can be linear (a constant rate of change) or non-linear (accelerating or decelerating growth).

Plaintext
Annual e-commerce revenue:  $1.2B → $1.8B → $2.7B → $3.9B → $5.6B
                            ↑ Strong upward trend (roughly exponential)

Landline phone subscriptions: 145M → 120M → 95M → 72M → 55M
                               ↑ Strong downward trend

2. Seasonality

Seasonality is a regular, predictable pattern that repeats at a known, fixed period. The period is what distinguishes seasonality from other cycles:

  • Annual seasonality: Retail sales peak in November-December every year; ice cream sales peak in summer every year
  • Weekly seasonality: Website traffic drops on weekends every week; restaurant orders spike on Friday evenings every week
  • Daily seasonality: Commuter traffic peaks at 8am and 5pm every day; electricity demand peaks in early evening every day
  • Quarterly seasonality: Business software sales spike at quarter-end when companies rush to spend budgets

Seasonality is the predictable, calendar-driven component. It’s perhaps the most practically valuable component because it enables concrete business planning.

3. Cycles

Cycles are recurring patterns that are not fixed in period — they can last years or decades and vary in length. Business cycles (expansion → peak → recession → trough → recovery) typically last 5-10 years but vary enormously. Real estate cycles, commodity price cycles, and technology adoption cycles are all examples. Cycles are difficult to predict because their timing is uncertain.

The key distinction:

  • Seasonality = repeats at a fixed, known period (weekly, annually)
  • Cycles = repeats irregularly over varying longer periods

4. Irregular (Noise/Residual)

The irregular component is what remains after removing trend, seasonality, and cycles — random variation that cannot be explained by systematic patterns. This includes genuine randomness, measurement error, and one-off events (a viral social media post causing a sales spike, a system outage causing a metric drop).

Additive vs. Multiplicative Decomposition

These four components combine in one of two ways:

Additive model: Y(t) = Trend(t) + Seasonality(t) + Cycle(t) + Noise(t)

The seasonal fluctuations are constant in absolute magnitude regardless of the trend level. If the holiday season adds $10M in sales whether revenue is $50M or $200M, it’s additive.

Multiplicative model: Y(t) = Trend(t) × Seasonality(t) × Cycle(t) × Noise(t)

The seasonal fluctuations are proportional to the trend level. If the holiday season adds 20% more sales whether revenue is $50M or $200M, it’s multiplicative. Many business and economic time series behave this way: a 20% holiday boost on $200M revenue produces a larger absolute swing than 20% on $50M.
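
The difference between the two models is easy to see numerically. The sketch below builds two synthetic monthly series (all numbers are illustrative assumptions), measures the peak-to-trough seasonal swing per year after removing the known trend, and then shows that taking logs turns the multiplicative series into an additive one:

```python
import numpy as np
import pandas as pd

months = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)

trend = 100 + 10 * t                          # steadily rising level
seasonal_shape = np.sin(2 * np.pi * t / 12)   # 12-month cycle between -1 and 1

additive = pd.Series(trend + 20 * seasonal_shape, index=months)
multiplicative = pd.Series(trend * (1 + 0.2 * seasonal_shape), index=months)

def yearly_swing(series: pd.Series) -> pd.Series:
    """Peak-to-trough range within each calendar year."""
    by_year = series.groupby(series.index.year)
    return by_year.max() - by_year.min()

# Remove the known trend so that only the seasonal part remains.
add_swing = yearly_swing(additive - trend)
mult_swing = yearly_swing(multiplicative - trend)

# log(trend * seasonal) = log(trend) + log(seasonal): additive on the log scale.
log_swing = yearly_swing(np.log(multiplicative) - np.log(trend))

print("Additive swing per year:\n", add_swing.round(1))
print("Multiplicative swing per year:\n", mult_swing.round(1))
print("Multiplicative swing on the log scale:\n", log_swing.round(3))
```

The additive swing stays constant from year to year, the multiplicative swing grows with the level, and on the log scale the multiplicative swing becomes constant again. This is why a log transform is a common first step for multiplicative series.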

Building a Realistic Time Series Dataset

Let’s create a synthetic dataset that exhibits real-world time series characteristics — trend, seasonality, and noise:

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Create 3 years of daily data
dates = pd.date_range("2022-01-01", "2024-12-31", freq="D")
n = len(dates)

# ── Trend component ────────────────────────────────────────────────
# Gradual upward drift: starts at ~5000, grows to ~8000 over 3 years
trend = np.linspace(5000, 8000, n)

# ── Seasonality: Annual (strong) ───────────────────────────────────
# Day of year (1–366) as an angle in radians, peaking in December
day_of_year = dates.dayofyear.to_numpy()
annual_season = 1200 * np.sin(2 * np.pi * (day_of_year - 254) / 365)
# The sine peaks a quarter period (~91 days) after the phase shift,
# so a shift of 254 puts the peak near day 345 (≈ Dec 11)

# ── Seasonality: Weekly (moderate) ────────────────────────────────
# Weekend dip: Monday–Thursday baseline, Friday spike, weekend drop
day_of_week = pd.Series(dates.dayofweek)
weekly_effect = day_of_week.map({
    0: 0,      # Monday: baseline
    1: 50,     # Tuesday: slight uptick
    2: 80,     # Wednesday: mid-week peak
    3: 100,    # Thursday: building to weekend
    4: 200,    # Friday: spike
    5: -400,   # Saturday: weekend drop
    6: -600    # Sunday: lowest day
}).values

# ── Noise ──────────────────────────────────────────────────────────
noise = np.random.normal(0, 250, n)

# ── Special events ────────────────────────────────────────────────
# Black Friday (day after US Thanksgiving ≈ day 330 each year)
events = np.zeros(n)
for year_offset in [0, 365, 730]:
    bf_idx = year_offset + 329  # ~November 25
    if bf_idx < n:
        events[bf_idx] = 3000   # One-day spike of +3000

# ── Combine additively ────────────────────────────────────────────
# Final series (stays positive: the trend level is much larger than the seasonal swings)
revenue = trend + annual_season + weekly_effect + noise + events
revenue = np.maximum(revenue, 100)  # Ensure no negative revenue

ts = pd.Series(revenue, index=dates, name="daily_revenue")

print(ts.describe())
print(f"\nDate range: {ts.index.min().date()} to {ts.index.max().date()}")
print(f"Total points: {len(ts):,}")

Resampling: Changing Time Frequency

Resampling changes the frequency of a time series — aggregating fine-grained data to a coarser frequency (downsampling) or interpolating to a finer frequency (upsampling).

Downsampling (Aggregating to Lower Frequency)

Python
# Daily → Weekly (sum of daily revenue each week)
weekly = ts.resample("W").sum()
print(f"Daily points: {len(ts)} → Weekly points: {len(weekly)}")

# Daily → Monthly
# (the "ME", "QE", and "YE" aliases need pandas 2.2+; older versions use "M", "Q", "Y")
monthly = ts.resample("ME").sum()         # ME = Month End
monthly_mean = ts.resample("ME").mean()
monthly_stats = ts.resample("ME").agg(["sum", "mean", "min", "max", "std"])

# Daily → Quarterly
quarterly = ts.resample("QE").sum()       # QE = Quarter End

# Daily → Annual
annual = ts.resample("YE").sum()          # YE = Year End
print(annual)
# Approximate output (exact values depend on the random noise):
# 2022-12-31    ~1.98 million
# 2023-12-31    ~2.35 million
# 2024-12-31    ~2.72 million

# Resampling a DataFrame
df_ts = pd.DataFrame({
    "revenue": ts,
    "transactions": (ts / np.random.uniform(50, 150, len(ts))).astype(int)
})

monthly_df = df_ts.resample("ME").agg({
    "revenue":      "sum",
    "transactions": "sum"
})
monthly_df["avg_transaction"] = monthly_df["revenue"] / monthly_df["transactions"]

# Weekly bins ending on Friday (Saturday and Sunday fall into the following week's bin)
biz_weekly = ts.resample("W-FRI").sum()   # Week ending Friday

Common Resample Frequency Strings

Alias     Meaning
"D"       Calendar day
"B"       Business day
"W"       Weekly (Sunday end)
"W-FRI"   Weekly (Friday end)
"ME"      Month end
"MS"      Month start
"QE"      Quarter end
"QS"      Quarter start
"YE"      Year end
"YS"      Year start
"h"       Hourly
"min"     Minute
"s"       Second

Upsampling and Interpolation

Python
# Monthly → Daily (upsample — creates NaN for missing dates)
monthly_upsampled = monthly.resample("D").asfreq()
print(monthly_upsampled.head(10))
# Many NaN values for days that weren't in the monthly series

# Fill with different strategies
monthly_ffill   = monthly.resample("D").ffill()        # Forward fill (hold last value)
monthly_bfill   = monthly.resample("D").bfill()        # Backward fill
monthly_interp  = monthly.resample("D").interpolate()  # Linear interpolation
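
To see how the fill strategies differ, here is a small sketch on a two-point monthly series (the values 100 and 130 are made up for illustration). It upsamples to daily frequency and compares forward fill with linear interpolation on a day in the middle of the month:

```python
import pandas as pd

# Two hypothetical month-start observations.
monthly = pd.Series(
    [100.0, 130.0],
    index=pd.to_datetime(["2024-01-01", "2024-02-01"]),
)

daily_ffill = monthly.resample("D").ffill()          # hold the last known value
daily_interp = monthly.resample("D").interpolate()   # straight line between points

mid_month = pd.Timestamp("2024-01-16")
print(f"Rows after upsampling: {len(daily_ffill)}")
print(f"Forward fill on {mid_month.date()}:  {daily_ffill.loc[mid_month]:.2f}")
print(f"Interpolated on {mid_month.date()}:  {daily_interp.loc[mid_month]:.2f}")
```

Forward fill keeps the January value until the February observation arrives, while interpolation moves gradually toward it. Which one is appropriate depends on whether the underlying quantity plausibly changes between observations.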

Rolling Statistics: Moving Windows

Rolling (moving window) statistics compute a metric over a sliding window of past observations. They’re the backbone of time series feature engineering and smoothing.

Rolling Mean: Smoothing Out Noise

Python
# Simple moving averages of different window sizes
ts_7d  = ts.rolling(window=7).mean()    # 7-day moving average
ts_30d = ts.rolling(window=30).mean()   # 30-day moving average
ts_90d = ts.rolling(window=90).mean()   # 90-day moving average

# Visualize the smoothing effect
fig, ax = plt.subplots(figsize=(14, 5))
ts.plot(ax=ax, alpha=0.3, color="gray",  label="Raw daily revenue")
ts_7d.plot(ax=ax, color="blue",          label="7-day MA")
ts_30d.plot(ax=ax, color="orange",       label="30-day MA")
ts_90d.plot(ax=ax, color="red",          label="90-day MA")
ax.set_title("Revenue with Moving Averages")
ax.set_ylabel("Revenue ($)")
ax.legend()
plt.tight_layout()
plt.savefig("plots/rolling_averages.png", dpi=150)
plt.show()

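The 7-day moving average removes both daily noise and the weekly pattern, because each window covers exactly one full week. The 30-day average smooths further and leaves the annual seasonality and the trend visible. The 90-day average shows the slowest movements: the underlying trend plus a damped, delayed version of the annual swing.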
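
A window that spans exactly one seasonal period averages that cycle out. The sketch below checks this on a synthetic series that contains nothing but a weekly pattern (the pattern values and the baseline of 1000 are illustrative assumptions):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=56, freq="D")  # 8 full weeks, starting on a Monday

# Hypothetical weekly pattern (Mon..Sun) on a flat baseline; no trend, no noise.
weekly_pattern = np.array([0, 50, 80, 100, 200, -400, -600])
series = pd.Series(1000 + np.tile(weekly_pattern, 8), index=dates, dtype=float)

ma_7 = series.rolling(window=7).mean().dropna()
ma_3 = series.rolling(window=3).mean().dropna()

print(f"Raw series range: {series.max() - series.min():.1f}")
print(f"7-day MA range:   {ma_7.max() - ma_7.min():.1f}")
print(f"3-day MA range:   {ma_3.max() - ma_3.min():.1f}")
```

Every 7-day window contains each weekday exactly once, so the 7-day average is flat. The 3-day average still swings with the week because its window does not match the seasonal period.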

Rolling Standard Deviation: Measuring Volatility

Python
# 30-day rolling volatility
rolling_std = ts.rolling(window=30).std()

# Bollinger Bands: mean ± 2 standard deviations
rolling_mean = ts.rolling(window=30).mean()
upper_band   = rolling_mean + 2 * rolling_std
lower_band   = rolling_mean - 2 * rolling_std

# Flag days where revenue falls outside the bands (anomalies)
anomalies = ts[(ts > upper_band) | (ts < lower_band)]
print(f"Anomalous days: {len(anomalies)}")
print(anomalies.head())

Rolling Aggregations for Feature Engineering

Python
# Multiple rolling statistics in one DataFrame — useful for ML features
features = pd.DataFrame(index=ts.index)
features["value"] = ts

# Lag features (past values as predictors)
features["lag_1d"]  = ts.shift(1)    # Yesterday's revenue
features["lag_7d"]  = ts.shift(7)    # Same day last week
features["lag_30d"] = ts.shift(30)   # 30 days ago (about one month)
features["lag_365d"]= ts.shift(365)  # 365 days ago (about one year)

# Rolling aggregations, shifted by one day so each row only sees past values
# (today's value is the prediction target and must not leak into its own features)
features["roll_7d_mean"]  = ts.rolling(7).mean().shift(1)
features["roll_7d_std"]   = ts.rolling(7).std().shift(1)
features["roll_7d_min"]   = ts.rolling(7).min().shift(1)
features["roll_7d_max"]   = ts.rolling(7).max().shift(1)
features["roll_30d_mean"] = ts.rolling(30).mean().shift(1)
features["roll_30d_std"]  = ts.rolling(30).std().shift(1)

# Percent change between past values (also shifted by one day)
features["pct_change_1d"]  = ts.pct_change(1).shift(1)    # Most recent daily return
features["pct_change_7d"]  = ts.pct_change(7).shift(1)    # Week-over-week growth
features["pct_change_30d"] = ts.pct_change(30).shift(1)   # Month-over-month growth

# Expanding window statistics (computed over all history up to the previous day)
features["cumulative_mean"] = ts.expanding().mean().shift(1)
features["cumulative_max"]  = ts.expanding().max().shift(1)

# Drop NaN rows created by lag/rolling features
features = features.dropna()
print(f"Features shape: {features.shape}")
print(features.head())

The min_periods Parameter

Rolling windows produce NaN at the start of the series (not enough data for the full window). Use min_periods to start computing with fewer observations:

Python
# Default: NaN until 30 observations accumulated
ts.rolling(30).mean().head(35).tail(10)

# With min_periods=1: compute mean from first observation
ts.rolling(30, min_periods=1).mean().head(35).tail(10)  # No NaN

Centered vs. Trailing Windows

By default, rolling windows are trailing — the window looks backward. Use center=True for centered windows (useful for visualization and decomposition, but note it uses future data — not valid for forecasting features):

Python
# Trailing window (default — appropriate for ML features)
trailing_ma = ts.rolling(7).mean()

# Centered window (future data leaks in — only for analysis, not features!)
centered_ma = ts.rolling(7, center=True).mean()
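
The leak is easy to demonstrate on a toy series. In this sketch (a made-up series of zeros with a single spike), the centered average reacts to the spike three days before it happens, while the trailing average does not:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series(np.zeros(10), index=dates)
series.iloc[6] = 70.0   # a single spike on 2024-01-07

# min_periods=1 so the trailing average is defined from the first day.
trailing = series.rolling(7, min_periods=1).mean()
centered = series.rolling(7, center=True).mean()

# On 2024-01-04 the spike has not happened yet.
day = pd.Timestamp("2024-01-04")
print(f"Trailing MA on {day.date()}: {trailing.loc[day]:.1f}")
print(f"Centered MA on {day.date()}: {centered.loc[day]:.1f}")
```

The centered window for January 4 covers January 1 through 7, so it already includes the spike. A model trained on such a feature would appear to anticipate events it could not have known about.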

Lag Features: Time as a Predictor

Lag features are shifted versions of the target variable used as predictors. They operationalize the intuition that “what happened yesterday (last week, last month) helps predict what will happen today.”

Python
# Create a clean feature DataFrame for ML
ts_df = ts.to_frame("revenue")

# Lag features at multiple scales
lags_to_create = [1, 2, 3, 7, 14, 30, 60, 90, 365]
for lag in lags_to_create:
    ts_df[f"revenue_lag_{lag}d"] = ts_df["revenue"].shift(lag)

# Calendar features (raw integer encoding)
ts_df["day_of_week"]  = ts_df.index.dayofweek
ts_df["day_of_month"] = ts_df.index.day
ts_df["month"]        = ts_df.index.month
ts_df["quarter"]      = ts_df.index.quarter
ts_df["week_of_year"] = ts_df.index.isocalendar().week.astype(int)
ts_df["is_weekend"]   = (ts_df.index.dayofweek >= 5).astype(int)

# Cyclical encoding for periodic features (handles the wrap-around problem)
# e.g., January (month=1) and December (month=12) are close in the calendar
# but far apart as raw numbers (1 vs. 12); sin/cos encoding fixes this
ts_df["month_sin"] = np.sin(2 * np.pi * ts_df["month"] / 12)
ts_df["month_cos"] = np.cos(2 * np.pi * ts_df["month"] / 12)
ts_df["dow_sin"]   = np.sin(2 * np.pi * ts_df["day_of_week"] / 7)
ts_df["dow_cos"]   = np.cos(2 * np.pi * ts_df["day_of_week"] / 7)

# Drop rows with NaN (from lags)
ts_df = ts_df.dropna()
print(f"Feature matrix shape: {ts_df.shape}")
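
As a quick check of the wrap-around argument, this sketch measures the distance between months in the sin/cos space. The helper names `encode_month` and `encoded_distance` are hypothetical, introduced only for this illustration:

```python
import numpy as np

def encode_month(month: int) -> np.ndarray:
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(a: int, b: int) -> float:
    """Euclidean distance between two months after sin/cos encoding."""
    return float(np.linalg.norm(encode_month(a) - encode_month(b)))

dec_jan = encoded_distance(12, 1)
jan_feb = encoded_distance(1, 2)
jan_jul = encoded_distance(1, 7)

print(f"Raw difference, Dec vs. Jan: {abs(12 - 1)}")
print(f"Encoded distance, Dec vs. Jan: {dec_jan:.3f}")
print(f"Encoded distance, Jan vs. Feb: {jan_feb:.3f}")
print(f"Encoded distance, Jan vs. Jul: {jan_jul:.3f}")
```

After encoding, December to January is exactly as far as January to February, and months half a year apart are the farthest from each other, which matches how the calendar actually behaves.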

Stationarity: The Foundational Concept

Stationarity is the single most important concept in classical time series analysis. A time series is stationary if its statistical properties — mean, variance, and autocorrelation structure — do not change over time.
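
An informal way to see the definition at work is to compare the mean of the first half of a series with the mean of the second half. This is only a rough sketch on synthetic data (both series and the helper `half_means` are assumptions made for illustration), not a substitute for a formal test:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

stationary = pd.Series(rng.normal(0, 1, n))                       # white noise
drifting = pd.Series(0.01 * np.arange(n) + rng.normal(0, 1, n))   # noise around a rising line

def half_means(series: pd.Series) -> tuple[float, float]:
    """Mean of the first half and of the second half of the series."""
    mid = len(series) // 2
    return float(series.iloc[:mid].mean()), float(series.iloc[mid:].mean())

for label, s in [("white noise", stationary), ("with trend", drifting)]:
    first, second = half_means(s)
    print(f"{label:12s} first-half mean = {first:6.2f}, second-half mean = {second:6.2f}")
```

For the white-noise series the two halves have nearly the same mean. For the trending series the mean shifts substantially, which is exactly the kind of change over time that stationarity rules out.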

Why Stationarity Matters

Many classical forecasting models assume stationarity or work by first transforming the series to achieve it: ARMA models require a stationary input, and ARIMA adds a differencing step (the “I”) for exactly that reason. Machine learning models also benefit from stationary inputs because the training distribution stays consistent. A non-stationary series has changing statistics that make it hard to learn from: the patterns that held in 2022 might not hold in 2024 if the series has drifted.

Visualizing Non-Stationarity

Python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# Original series — clearly non-stationary (has trend)
ts.plot(ax=axes[0], title="Original Series (Non-Stationary: has trend)", alpha=0.7)
axes[0].set_ylabel("Revenue ($)")

# First-order differencing — removes linear trend
ts_diff = ts.diff(1).dropna()
ts_diff.plot(ax=axes[1], title="First Difference (often stationary)", alpha=0.7, color="orange")
axes[1].set_ylabel("Day-over-day change ($)")
axes[1].axhline(y=0, color="black", linestyle="--", linewidth=0.8)

# Log transformation + differencing — for multiplicative series
ts_log_diff = np.log(ts).diff(1).dropna()
ts_log_diff.plot(ax=axes[2], title="Log-Difference (for multiplicative series)",
                  alpha=0.7, color="green")
axes[2].set_ylabel("Log return")
axes[2].axhline(y=0, color="black", linestyle="--", linewidth=0.8)

plt.tight_layout()
plt.savefig("plots/stationarity.png", dpi=150)
plt.show()

The Augmented Dickey-Fuller Test

The ADF test is the standard statistical test for stationarity. It tests the null hypothesis that the series has a unit root (is non-stationary):

  • p-value < 0.05: Reject the null hypothesis → evidence that the series is stationary (good)
  • p-value ≥ 0.05: Fail to reject → no evidence against a unit root; treat the series as non-stationary (needs transformation)

Python
from statsmodels.tsa.stattools import adfuller

def adf_test(series: pd.Series, series_name: str = "Series") -> dict:
    """
    Run the Augmented Dickey-Fuller test for stationarity.

    Parameters
    ----------
    series : pd.Series
        Time series to test. Must have no missing values.
    series_name : str
        Name for display in output.

    Returns
    -------
    dict
        Test results including statistic, p-value, and interpretation.
    """
    series = series.dropna()
    result = adfuller(series, autolag="AIC")

    adf_stat  = result[0]
    p_value   = result[1]
    n_lags    = result[2]
    n_obs     = result[3]
    crit_vals = result[4]

    is_stationary = p_value < 0.05

    print(f"\n{'='*50}")
    print(f"ADF Test: {series_name}")
    print(f"{'='*50}")
    print(f"Test Statistic:  {adf_stat:.4f}")
    print(f"p-value:         {p_value:.4f}")
    print(f"Lags Used:       {n_lags}")
    print(f"Observations:    {n_obs}")
    print(f"Critical Values:")
    for conf, val in crit_vals.items():
        # Flag each significance level at which the unit-root null is rejected
        marker = "  (reject H0)" if adf_stat < val else ""
        print(f"  {conf}: {val:.4f}{marker}")
    print(f"\nConclusion: {'STATIONARY ✓' if is_stationary else 'NON-STATIONARY ✗'}")
    print(f"  (p = {p_value:.4f} {'<' if is_stationary else '>='} 0.05)")

    return {
        "statistic": adf_stat, "p_value": p_value,
        "is_stationary": is_stationary, "n_lags": n_lags
    }

# Test original series
result_orig  = adf_test(ts, "Original Revenue")
# Likely p >> 0.05 (non-stationary — has trend)

# Test first difference
result_diff  = adf_test(ts.diff(1), "First Difference")
# Likely p < 0.05 (stationary — differencing removed trend)

# Test log difference
result_lndiff = adf_test(np.log(ts).diff(1), "Log-Difference")
# Likely p << 0.05 (strongly stationary)

Making a Series Stationary

Python
# Method 1: Differencing (subtract previous value)
ts_diff1 = ts.diff(1)           # First difference — removes linear trend
ts_diff2 = ts.diff(1).diff(1)  # Second difference — removes quadratic trend

# Method 2: Seasonal differencing (subtract same period last cycle)
ts_diff_7  = ts.diff(7)    # Remove weekly seasonality
ts_diff_365 = ts.diff(365) # Remove annual seasonality
# Combined: remove both trend and seasonality
ts_diff_combined = ts.diff(365).diff(1)

# Method 3: Log transformation (stabilizes variance for multiplicative series)
ts_log = np.log(ts)

# Method 4: Log + first difference (the most common combo)
ts_log_diff = np.log(ts).diff(1)

# Method 5: Percentage change (similar to log-difference, more interpretable)
ts_pct = ts.pct_change(1)    # Daily percentage change
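
The relationship between Methods 4 and 5 can be checked directly: the log-difference equals log(1 + r), where r is the percentage change, so the two agree closely for small moves and diverge for large ones. A minimal sketch with made-up prices:

```python
import numpy as np
import pandas as pd

# Hypothetical prices: three small moves followed by one large jump.
prices = pd.Series([100.0, 101.0, 99.5, 100.5, 150.0])

pct = prices.pct_change().dropna()
log_diff = np.log(prices).diff().dropna()

comparison = pd.DataFrame({"pct_change": pct, "log_diff": log_diff})
comparison["gap"] = comparison["pct_change"] - comparison["log_diff"]
print(comparison.round(4))

# The identity log(1 + r) = log-difference holds exactly.
assert np.allclose(np.log1p(pct), log_diff)
```

For the 1% move the two measures differ only in the fifth decimal place, while for the final jump of roughly 49% the gap is clearly visible. Log-differences also add up across periods, which percentage changes do not.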

Autocorrelation: How the Past Relates to the Present

Autocorrelation (or serial correlation) measures how correlated a time series is with a lagged version of itself. An autocorrelation of 0.8 at lag 7 means that the correlation between each value and the value from 7 days earlier is 0.8.
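
In pandas terms, the autocorrelation at lag k is the Pearson correlation between the series and a copy of itself shifted by k steps. The sketch below verifies this on a synthetic series with a weekly pattern (all values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.date_range("2024-01-01", periods=140, freq="D")

# Hypothetical series: a 7-day cycle plus noise.
values = 100 + 20 * np.sin(2 * np.pi * np.arange(140) / 7) + rng.normal(0, 5, 140)
series = pd.Series(values, index=dates)

lag = 7
manual = series.corr(series.shift(lag))   # correlation with the lagged copy
builtin = series.autocorr(lag=lag)        # pandas shortcut for the same quantity

print(f"Manual:   {manual:.4f}")
print(f"Built-in: {builtin:.4f}")
```

Both numbers are identical, and the value is high because the series repeats every 7 days.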

ACF and PACF Plots

The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots are the primary diagnostic tools for understanding a time series’ memory structure:

Python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# ACF and PACF of original series
plot_acf(ts.dropna(), lags=60, ax=axes[0, 0],
         title="ACF — Original Series")
plot_pacf(ts.dropna(), lags=60, ax=axes[0, 1],
          title="PACF — Original Series", method="ywm")

# ACF and PACF of differenced series
plot_acf(ts.diff(1).dropna(), lags=60, ax=axes[1, 0],
         title="ACF — First Difference")
plot_pacf(ts.diff(1).dropna(), lags=60, ax=axes[1, 1],
          title="PACF — First Difference", method="ywm")

plt.tight_layout()
plt.savefig("plots/acf_pacf.png", dpi=150)
plt.show()

How to read ACF plots:

  • Spikes at regular intervals (7, 14, 21…) → weekly seasonality
  • Slow decay → trend (non-stationary)
  • Sharp cutoff after lag k → MA(k) process
  • Gradual decay → AR process

How to read PACF plots:

  • Sharp cutoff after lag k → AR(k) process (the k most recent values are the direct predictors)
  • Spikes at lags 7, 14, 21… → seasonal AR component
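
These reading rules can be checked on a simulated AR(1) process, where the theory is known: the ACF decays geometrically as phi**k, and the partial autocorrelation cuts off after lag 1. The sketch below uses only NumPy and pandas; the coefficient 0.8 and the sample size are arbitrary choices, and the lag-2 partial autocorrelation is estimated as the coefficient on x[t-2] in a least-squares regression on the two previous values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
phi = 0.8      # assumed AR(1) coefficient
n = 5000

# Simulate x[t] = phi * x[t-1] + noise[t]
noise = rng.normal(0, 1, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]
ar1 = pd.Series(x)

# ACF: gradual geometric decay
for k in [1, 2, 3, 5]:
    print(f"lag {k}: sample ACF = {ar1.autocorr(lag=k):.3f}, theory = {phi ** k:.3f}")

# PACF at lag 2: effect of x[t-2] on x[t] after controlling for x[t-1]
X = np.column_stack([x[1:-1], x[:-2]])   # columns: x[t-1], x[t-2]
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"Coefficient on x[t-1]: {coef[0]:.3f}")
print(f"Coefficient on x[t-2] (lag-2 partial autocorrelation): {coef[1]:.3f}")
```

The sample ACF should track phi**k closely, while the coefficient on x[t-2] should be near zero: once yesterday's value is known, the day before adds almost nothing. That is the sharp PACF cutoff that identifies an AR(1) process.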

Computing Autocorrelation Values

Python
import pandas as pd

# Compute autocorrelation at specific lags
for lag in [1, 2, 7, 14, 30, 365]:
    corr = ts.autocorr(lag=lag)
    print(f"Autocorrelation at lag {lag:3d}: {corr:.4f}")

# Illustrative output (values show the format only; yours will differ):
# Autocorrelation at lag   1: 0.9823   (strong: yesterday predicts today)
# Autocorrelation at lag   2: 0.9649
# Autocorrelation at lag   7: 0.9112   (weekly seasonality)
# Autocorrelation at lag  14: 0.8734
# Autocorrelation at lag  30: 0.8021
# Autocorrelation at lag 365: 0.7634   (annual seasonality)

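# Cross-correlation between two series
# web_traffic_ts is a hypothetical second series; this stand-in is built from
# ts so that traffic leads sales by 3 days and the block runs on its own
web_traffic_ts = (ts.shift(-3) * 2).rename("web_traffic")
sales_corr_with_traffic = ts.corr(web_traffic_ts)  # contemporaneous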
sales_lag_corr = pd.Series([
    ts.corr(web_traffic_ts.shift(lag))
    for lag in range(-30, 31)
], index=range(-30, 31))
print("Lag where web traffic best predicts sales:",
      sales_lag_corr.idxmax(), "days")

Time Series Decomposition

Decomposition separates a time series into its constituent components — trend, seasonality, and residual — for cleaner analysis of each part.

Python
from statsmodels.tsa.seasonal import seasonal_decompose, STL
import matplotlib.pyplot as plt

# ── Classical Decomposition ────────────────────────────────────────
# Uses period=7 for weekly seasonality in daily data
decomposition = seasonal_decompose(
    ts,
    model="additive",   # or "multiplicative"
    period=7            # The seasonal period (7 = weekly for daily data)
)

# Access each component
trend_component    = decomposition.trend
seasonal_component = decomposition.seasonal
residual_component = decomposition.resid

# Plot decomposition
fig, axes = plt.subplots(4, 1, figsize=(14, 12))
ts.plot(ax=axes[0], title="Original")
decomposition.trend.plot(ax=axes[1], title="Trend")
decomposition.seasonal.plot(ax=axes[2], title="Seasonality (period=7)")
decomposition.resid.plot(ax=axes[3], title="Residual")
plt.tight_layout()
plt.savefig("plots/decomposition_classical.png", dpi=150)
plt.show()

# ── STL Decomposition (recommended — more robust) ──────────────────
# Seasonal-Trend decomposition using LOESS
# Handles changing seasonality and is robust to outliers
stl = STL(ts, period=7, robust=True)
stl_result = stl.fit()

fig = stl_result.plot()
fig.set_size_inches(14, 10)
plt.tight_layout()
plt.savefig("plots/decomposition_stl.png", dpi=150)
plt.show()

# Annual seasonality (period=365 for daily data)
stl_annual = STL(ts["2022":"2024"], period=365, robust=True)
stl_annual_result = stl_annual.fit()

# Examine the seasonality pattern
annual_seasonal = stl_annual_result.seasonal
print("Peak revenue month:", annual_seasonal.groupby(annual_seasonal.index.month).mean().idxmax())
print("Trough revenue month:", annual_seasonal.groupby(annual_seasonal.index.month).mean().idxmin())

Analyzing the Trend

Python
# Trend strength: how much variance is explained by trend vs. residual
# (clipped at 0, since the raw ratio can come out slightly negative)
trend_strength = max(0, 1 - stl_result.resid.var() / (stl_result.resid + stl_result.trend).var())
print(f"Trend strength: {trend_strength:.3f}")  # 0 = no trend, 1 = perfect trend

# Seasonality strength
seasonal_strength = max(0, 1 - stl_result.resid.var() / (stl_result.resid + stl_result.seasonal).var())
print(f"Seasonal strength: {seasonal_strength:.3f}")

# Year-over-year trend
annual_revenue = ts.resample("YE").sum()
yoy_growth = annual_revenue.pct_change() * 100
print("\nYear-over-year growth:")
print(yoy_growth.dropna().round(1))

Handling Missing Values in Time Series

Unlike standard tabular data where missing values in one row don’t affect others, gaps in a time series break the temporal continuity. Different strategies suit different situations:

Python
# Create a series with artificial gaps
ts_with_gaps = ts.copy()
ts_with_gaps.iloc[50:55] = np.nan    # 5-day outage
ts_with_gaps.iloc[200]   = np.nan    # Single missing day
ts_with_gaps.iloc[400:410] = np.nan  # 10-day gap

print(f"Missing values: {ts_with_gaps.isna().sum()}")

# Strategy 1: Forward fill — use last known value (good for slow-changing series)
ts_ffill = ts_with_gaps.ffill()

# Strategy 2: Backward fill — use next known value
ts_bfill = ts_with_gaps.bfill()

# Strategy 3: Linear interpolation — connect gap endpoints smoothly
ts_interp = ts_with_gaps.interpolate(method="linear")

# Strategy 4: Time-aware interpolation (accounts for uneven spacing)
ts_interp_time = ts_with_gaps.interpolate(method="time")

# Strategy 5: Seasonal interpolation — fill with same period from prior cycle
# Manual approach: fill from same day of week 7 days prior
def seasonal_fill(series: pd.Series, period: int = 7) -> pd.Series:
    """Fill gaps using values from the same period in the previous cycle."""
    filled = series.copy()
    missing_idx = filled[filled.isna()].index
    for idx in missing_idx:
        # Look one period back in `filled`, so values filled earlier in this
        # loop can cover gaps that are longer than one period
        prior_idx = idx - pd.Timedelta(days=period)
        if prior_idx in filled.index and not pd.isna(filled.at[prior_idx]):
            filled.at[idx] = filled.at[prior_idx]
    return filled

ts_seasonal_filled = seasonal_fill(ts_with_gaps, period=7)

# Check result
print(f"Remaining NaN after seasonal fill: {ts_seasonal_filled.isna().sum()}")
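
Because the gaps here were introduced artificially, the true values are known, which makes it possible to score each strategy. The sketch below rebuilds a small synthetic series (an assumption, standing in for `ts`), removes the same positions, and compares the mean absolute error of four fills on the removed points. The names `truth`, `gappy`, and `candidates` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: linear trend + weekly cycle + noise
rng = np.random.default_rng(42)
idx = pd.date_range("2023-01-01", periods=500, freq="D")
t = np.arange(len(idx))
truth = pd.Series(
    1000 + 0.5 * t + 100 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 10, len(idx)),
    index=idx,
)

# Remove the same positions as in the example above
gappy = truth.copy()
gappy.iloc[50:55] = np.nan
gappy.iloc[200] = np.nan
gappy.iloc[400:410] = np.nan
missing = gappy.isna()

candidates = {
    "ffill": gappy.ffill(),
    "bfill": gappy.bfill(),
    "linear": gappy.interpolate(method="linear"),
    "time": gappy.interpolate(method="time"),
}

# Mean absolute error, measured only on the positions that were removed
mae = {name: (filled[missing] - truth[missing]).abs().mean()
       for name, filled in candidates.items()}
for name, err in sorted(mae.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: MAE = {err:.2f}")
```

On an evenly spaced daily index, linear and time-aware interpolation give the same result; they differ only when timestamps are irregular. The ranking of the other strategies depends on the series, so treat the numbers as a check on this data rather than a general rule.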

A Complete Exploratory Analysis

Putting it all together — a reusable exploratory analysis function for any time series:

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import STL

def explore_time_series(series: pd.Series, name: str = "Series",
                         seasonal_period: int = 7) -> dict:
    """
    Complete exploratory analysis of a time series.

    Computes descriptive statistics, tests stationarity, reports
    autocorrelation and STL component strength at the given seasonal
    period, and produces summary plots.

    Parameters
    ----------
    series : pd.Series
        Time series with DatetimeIndex.
    name : str
        Human-readable name for the series.
    seasonal_period : int
        Expected seasonal period for decomposition (7=weekly, 12=monthly, 365=annual).

    Returns
    -------
    dict
        Summary statistics and test results.
    """
    n_missing = int(series.isna().sum())  # count gaps before dropping them
    series = series.dropna()
    print(f"\n{'='*60}")
    print(f"Time Series Analysis: {name}")
    print(f"{'='*60}")

    # ── Basic Info ─────────────────────────────────────────────────
    print(f"\nDate range:   {series.index.min().date()} → {series.index.max().date()}")
    print(f"Observations: {len(series):,}")
    print(f"Frequency:    {pd.infer_freq(series.index) or 'irregular'}")
    print(f"Missing:      {n_missing}")

    # ── Descriptive Stats ──────────────────────────────────────────
    print(f"\nDescriptive Statistics:")
    print(series.describe().round(2))

    # ── Growth ─────────────────────────────────────────────────────
    total_change_pct = (series.iloc[-1] / series.iloc[0] - 1) * 100
    print(f"\nFirst value:  {series.iloc[0]:,.2f}")
    print(f"Last value:   {series.iloc[-1]:,.2f}")
    print(f"Total change: {total_change_pct:+.1f}%")

    # ── Stationarity ───────────────────────────────────────────────
    adf_result = adfuller(series, autolag="AIC")
    is_stationary = adf_result[1] < 0.05
    print(f"\nADF Test p-value: {adf_result[1]:.4f} → "
          f"{'Stationary' if is_stationary else 'Non-Stationary'}")

    # ── Autocorrelation ────────────────────────────────────────────
    print(f"\nAutocorrelation at key lags:")
    key_lags = [1, seasonal_period, seasonal_period * 2, seasonal_period * 4]
    for lag in key_lags:
        if lag < len(series):
            corr = series.autocorr(lag=lag)
            print(f"  Lag {lag:4d}: {corr:.4f}")

    # ── Seasonal Decomposition ─────────────────────────────────────
    if len(series) >= 2 * seasonal_period:
        stl = STL(series, period=seasonal_period, robust=True)
        stl_fit = stl.fit()

        trend_strength   = max(0, 1 - stl_fit.resid.var() /
                               (stl_fit.resid + stl_fit.trend).var())
        seasonal_strength = max(0, 1 - stl_fit.resid.var() /
                                (stl_fit.resid + stl_fit.seasonal).var())

        print(f"\nDecomposition (STL, period={seasonal_period}):")
        print(f"  Trend strength:    {trend_strength:.3f} (0=none, 1=perfect)")
        print(f"  Seasonal strength: {seasonal_strength:.3f} (0=none, 1=perfect)")
        print(f"  Residual std:      {stl_fit.resid.std():.2f}")

    # ── Simple Plots ───────────────────────────────────────────────
    fig, axes = plt.subplots(3, 1, figsize=(14, 10))
    fig.suptitle(f"Time Series Analysis: {name}", fontsize=14, fontweight="bold")

    series.plot(ax=axes[0], title="Raw Series", alpha=0.6, color="steelblue")
    series.rolling(seasonal_period).mean().plot(ax=axes[0], color="red",
        label=f"{seasonal_period}-period MA", linewidth=2)
    axes[0].legend()

    # pct_change returns fractions; scale by 100 so the axis matches the "(%)" title
    (series.pct_change(1) * 100).plot(ax=axes[1], title="Period-over-Period Change (%)",
                                      color="darkorange", alpha=0.6)
    axes[1].axhline(y=0, color="black", linestyle="--", linewidth=0.8)

    # Monthly box plots for seasonality
    if hasattr(series.index, "month"):
        monthly_data = [series[series.index.month == m].values for m in range(1, 13)]
        axes[2].boxplot(monthly_data)
        # Set tick labels separately: boxplot's `labels` keyword is deprecated
        # in recent matplotlib releases
        axes[2].set_xticklabels(["Jan","Feb","Mar","Apr","May","Jun",
                                 "Jul","Aug","Sep","Oct","Nov","Dec"])
        axes[2].set_title("Monthly Distribution (Annual Seasonality)")
        axes[2].set_ylabel("Value")

    plt.tight_layout()
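    # savefig does not create missing directories, so make sure "plots/" exists
    import os
    os.makedirs("plots", exist_ok=True)
    plt.savefig(f"plots/{name.lower().replace(' ', '_')}_analysis.png", dpi=150)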
    plt.show()

    return {
        "n_obs": len(series),
        "start": series.index.min(),
        "end":   series.index.max(),
        "is_stationary": is_stationary,
        "adf_pvalue":    adf_result[1],
        "trend_strength":    trend_strength if len(series) >= 2 * seasonal_period else None,
        "seasonal_strength": seasonal_strength if len(series) >= 2 * seasonal_period else None,
    }

# Run on our synthetic revenue series
summary = explore_time_series(ts, name="Daily Revenue", seasonal_period=7)

Summary

Time series data is distinguished from ordinary tabular data by one fundamental property: temporal ordering matters. Observations are not independent — they are correlated with their neighbors in time, and this autocorrelation structure is both the central challenge and the primary analytical opportunity. The four structural components — trend, seasonality, cycles, and noise — provide a framework for understanding any time series, and decomposition (particularly STL) separates them for independent analysis.

The essential Python tools are pandas’ DatetimeIndex for natural time-based indexing and slicing, .resample() for frequency conversion, .rolling() for moving window statistics, .shift() for creating lag features, and statsmodels for ADF stationarity testing, ACF/PACF analysis, and seasonal decomposition. Stationarity — the requirement that statistical properties don’t change over time — is the foundational concept for classical forecasting models, and differencing is the primary tool for achieving it.
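
As a quick illustration of why differencing helps, the sketch below builds a hypothetical trended series (not the article's revenue data) and compares the mean of its first and second halves before and after `.diff(1)`. The helper `half_mean_gap` is an assumption introduced here; it is an informal sanity check, not a replacement for the ADF test.

```python
import numpy as np
import pandas as pd

# Hypothetical series: linear trend plus noise
rng = np.random.default_rng(7)
idx = pd.date_range("2023-01-01", periods=400, freq="D")
trended = pd.Series(50 + 0.3 * np.arange(400) + rng.normal(0, 2, 400), index=idx)

# First difference: each value minus the previous one
diffed = trended.diff(1).dropna()

def half_mean_gap(s: pd.Series) -> float:
    """Absolute difference between the means of the first and second half."""
    mid = len(s) // 2
    return abs(s.iloc[:mid].mean() - s.iloc[mid:].mean())

print(f"Mean shift between halves, original:    {half_mean_gap(trended):.2f}")
print(f"Mean shift between halves, differenced: {half_mean_gap(diffed):.2f}")
```

The original series has a mean that drifts upward with the trend, while the differenced series hovers around the constant slope, which is the behavior a stationary series is expected to show.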

These fundamentals — indexing, resampling, rolling statistics, lag features, stationarity, autocorrelation, and decomposition — are the building blocks for everything in time series analysis that comes next: ARIMA models, exponential smoothing, machine learning forecasting, and anomaly detection.

Key Takeaways

  • Time series data has a crucial property standard tabular data lacks: temporal order is meaningful — observations are autocorrelated with their neighbors, and past values help predict future ones
  • Always use a DatetimeIndex for time series in pandas — it enables natural string-based slicing (ts["2024-Q1"]), .resample() for frequency conversion, and the full suite of time series operations
  • Resampling (ts.resample("ME").sum()) aggregates fine-grained data to coarser frequencies — the aggregation function (sum, mean, max, min) must match the meaning of the metric
  • Rolling statistics (ts.rolling(30).mean()) compute metrics over a sliding window; lag features (ts.shift(7)) create past-value predictors — both are essential for time series ML feature engineering
  • The four components of a time series — trend (long-term direction), seasonality (fixed-period cycles), cycles (irregular longer-term patterns), and noise (random variation) — combine additively or multiplicatively and can be separated by STL decomposition
  • Stationarity means the series’ statistical properties (mean, variance, autocorrelation) don’t change over time; the ADF test checks it formally (p < 0.05 → stationary); differencing (ts.diff(1)) is the primary tool to achieve stationarity
  • ACF (Autocorrelation Function) plots reveal seasonal periods (regular spikes) and memory structure (slow decay = trend, sharp cutoff = MA process); PACF (Partial ACF) isolates direct lag relationships to identify AR order
  • Never randomly split time series data into train/test — always use a temporal split (train on past, test on the most recent period) to prevent future data from leaking into training
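
The last two points about lag features and temporal splits can be sketched together. The example below uses a hypothetical daily series and a 60-day holdout, both chosen only for illustration. Note that the rolling mean is computed on `y.shift(1)`, so the window at time t contains only earlier values.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series used only for illustration
rng = np.random.default_rng(1)
idx = pd.date_range("2023-01-01", periods=365, freq="D")
y = pd.Series(rng.normal(100, 10, len(idx)).cumsum(), index=idx, name="y")

# Features built only from past values: shift before rolling so that
# the window at time t never contains y[t] itself
features = pd.DataFrame({
    "y": y,
    "lag_1": y.shift(1),
    "lag_7": y.shift(7),
    "roll_mean_7": y.shift(1).rolling(7).mean(),
}).dropna()

# Temporal split: hold out the most recent 60 days, train on everything earlier
cutoff = features.index.max() - pd.Timedelta(days=60)
train = features.loc[features.index <= cutoff]
test = features.loc[features.index > cutoff]

print(f"Train: {train.index.min().date()} to {train.index.max().date()} ({len(train)} rows)")
print(f"Test:  {test.index.min().date()} to {test.index.max().date()} ({len(test)} rows)")
```

Every training timestamp precedes every test timestamp, so nothing from the evaluation period can leak into model fitting.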