Understanding Audio Data for Analysis

Learn to work with audio data in Python. Master waveforms, sample rate, spectrograms, MFCCs, Librosa, feature extraction, audio classification, and speech-to-text for data science.

Audio data is a time-series signal representing pressure variations in air over time, stored digitally as a sequence of amplitude samples at a fixed sample rate (typically 16,000–44,100 samples per second). Raw waveforms are rarely used directly for analysis — instead, audio is transformed into spectrograms (visual frequency-over-time representations) or MFCCs (Mel-Frequency Cepstral Coefficients), which capture the perceptually relevant features of sound. In Python, librosa is the primary library for audio loading, feature extraction, and visualization; soundfile and scipy.io.wavfile handle file I/O; and for speech-to-text, the Whisper model from OpenAI provides state-of-the-art transcription in a few lines of code.

Introduction

Audio data is everywhere — recorded speech, music, environmental sounds, industrial machinery noise, medical acoustic signals like heart sounds and breathing patterns. For data scientists, audio opens analytical possibilities that structured data cannot: analyzing call center conversations for sentiment and compliance, detecting equipment anomalies from vibration recordings, classifying environmental sounds from IoT sensors, transcribing meetings automatically, and building voice-activated interfaces.

Audio is fundamentally a time-series — a continuously varying signal sampled at high frequency — but it differs from typical time-series data in important ways. A 5-second audio clip sampled at 44,100 Hz contains 220,500 data points. The perceptually relevant information is not in the raw amplitudes but in the frequency content and how it changes over time. The human ear processes sound logarithmically, not linearly — small differences in low frequencies are more perceptible than the same differences in high frequencies. These characteristics drive the specialized feature extraction techniques (spectrograms, MFCCs, chroma features) that make audio analysis tractable.

This article introduces audio data for data scientists: the fundamental digital audio concepts, loading and visualizing audio in Python with Librosa, the key feature extraction techniques that transform raw waveforms into analyzable representations, audio classification using extracted features, and practical applications including speech transcription and anomaly detection.

Digital Audio Fundamentals

Before writing any code, a clear understanding of digital audio concepts prevents the common mistakes that arise from treating audio as just another time-series.

Sampling and Sample Rate

When a microphone captures sound, it measures air pressure variations continuously. To store this digitally, the analog signal is sampled at regular intervals. The sample rate (or sampling frequency) is how many samples are captured per second, measured in Hz.

| Sample Rate | Common Use |
|---|---|
| 8,000 Hz | Telephone audio (narrow band) |
| 16,000 Hz | Speech recognition, voice calls |
| 22,050 Hz | Mid-quality audio |
| 44,100 Hz | CD audio, professional music |
| 48,000 Hz | Video production standard |
| 192,000 Hz | High-resolution audio |
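
To make these rates concrete, here is a small sketch of the arithmetic that links duration, sample rate, and sample count. The helper names `n_samples` and `sample_to_time` are illustrative only and do not come from any library.

```python
def n_samples(duration_s: float, sample_rate: int) -> int:
    """Number of samples in a clip of the given duration."""
    return int(round(duration_s * sample_rate))


def sample_to_time(index: int, sample_rate: int) -> float:
    """Time in seconds at which sample `index` was captured."""
    return index / sample_rate


for sr in (8_000, 16_000, 44_100):
    print(f"{sr:>6} Hz: 5 s clip = {n_samples(5.0, sr):>7,} samples, "
          f"one sample every {1e6 / sr:.1f} microseconds")
```

The same relationship explains why resampling long recordings to a lower rate is often the first step in a speech pipeline: the array shrinks in direct proportion to the sample rate.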

The Nyquist theorem states that to represent a frequency accurately, the sample rate must be more than twice that frequency. At 44,100 Hz, the highest representable frequency is just under 22,050 Hz, slightly above the upper limit of human hearing (~20,000 Hz). This headroom is a key reason 44,100 Hz was adopted as the CD audio standard.
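
What happens to a tone above the Nyquist limit can be sketched with a few lines of arithmetic: it "folds back" and appears at a lower frequency, an effect known as aliasing. The function name `aliased_frequency` is an assumption made for this illustration.

```python
import math


def aliased_frequency(f_hz: float, sample_rate: float) -> float:
    """Frequency (Hz) at which a pure tone of f_hz appears after sampling."""
    f_mod = f_hz % sample_rate              # sampling cannot tell f from f + k * sample_rate
    return min(f_mod, sample_rate - f_mod)  # fold into the range [0, sample_rate / 2]


sr = 44_100
for f in (440, 20_000, 22_050, 30_000):
    print(f"{f:>6} Hz tone sampled at {sr} Hz appears at {aliased_frequency(f, sr):,.0f} Hz")

# A 30,000 Hz cosine and its 14,100 Hz alias produce the same sample values
n = 7                                        # any integer sample index works
a = math.cos(2 * math.pi * 30_000 * n / sr)
b = math.cos(2 * math.pi * 14_100 * n / sr)
print(f"Samples match: {abs(a - b) < 1e-9}")
```

This is why recordings are low-pass filtered before being sampled or downsampled: once the samples are taken, the aliased tone cannot be told apart from a genuine one.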

For speech recognition and voice analytics, 16,000 Hz is the usual choice: its 8,000 Hz Nyquist limit covers most speech-relevant frequencies (roughly 80 Hz to 8,000 Hz) at a modest file size.

Bit Depth and Dynamic Range

The bit depth determines how many distinct amplitude values each sample can represent:

  • 16-bit: 65,536 possible values (−32,768 to 32,767). CD standard. Sufficient for most data science work.
  • 24-bit: 16,777,216 values. Professional recording.
  • 32-bit float: Used internally by audio software; values in range [-1.0, 1.0].

Higher bit depth = greater dynamic range (ability to represent both very quiet and very loud sounds without distortion).
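
As a rough worked example, the theoretical dynamic range of integer PCM follows from the number of quantization levels: 20 · log10(2^bits), or about 6 dB per bit. The sketch below ignores practical factors such as dither and analog noise, and `dynamic_range_db` is an illustrative name.

```python
import math


def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of integer PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)


for bits in (8, 16, 24):
    levels = 2 ** bits
    print(f"{bits:>2}-bit: {levels:>10,} levels, "
          f"range [{-levels // 2:,}, {levels // 2 - 1:,}], "
          f"~{dynamic_range_db(bits):.1f} dB")
```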

Audio File Formats

| Format | Type | Quality | Common Use |
|---|---|---|---|
| WAV | Uncompressed | Lossless | Professional, ML training data |
| FLAC | Lossless compressed | Lossless | Archival, audiophile |
| MP3 | Lossy compressed | Lossy | Music distribution |
| OGG/Vorbis | Lossy compressed | Lossy | Web audio |
| M4A/AAC | Lossy compressed | Lossy | Apple ecosystem |
| OPUS | Lossy compressed | Lossy | Telephony, streaming |

For data science work: WAV is preferred for training data (no compression artifacts). MP3 is acceptable for downstream analytics where exact waveform fidelity is less critical.
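
Because WAV stores raw PCM, its size is predictable: frames × channels × bytes per sample, plus a small header. The sketch below checks this with Python's standard-library `wave` module by writing a one-second 16-bit mono tone to a scratch file (the path and the tone parameters are arbitrary choices for the illustration).

```python
import math
import os
import struct
import tempfile
import wave

sr, seconds, channels, sample_width = 16_000, 1.0, 1, 2   # 16-bit mono

# Synthesize a 440 Hz tone as 16-bit signed integers at half of full scale
n_frames = int(sr * seconds)
samples = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * n / sr))
           for n in range(n_frames)]
pcm_bytes = struct.pack(f"<{n_frames}h", *samples)        # little-endian int16

path = os.path.join(tempfile.mkdtemp(), "tone.wav")       # scratch file
with wave.open(path, "wb") as wf:
    wf.setnchannels(channels)
    wf.setsampwidth(sample_width)
    wf.setframerate(sr)
    wf.writeframes(pcm_bytes)

expected_payload = n_frames * channels * sample_width
file_size = os.path.getsize(path)
print(f"PCM payload: {expected_payload:,} bytes; file on disk: {file_size:,} bytes")

# Read the header back to confirm what was stored
with wave.open(path, "rb") as wf:
    params = (wf.getframerate(), wf.getnchannels(), wf.getsampwidth(), wf.getnframes())
print(params)
```

Compressed formats break this simple relationship, which is one practical way to tell that a file's contents are no longer a sample-for-sample copy of the recording.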

Setting Up the Environment

Bash
pip install librosa soundfile scipy matplotlib numpy
pip install openai-whisper    # For speech transcription
pip install torch torchaudio  # For deep learning audio tasks
Python
import librosa
import librosa.display
import soundfile as sf
import numpy as np
import matplotlib.pyplot as plt
import scipy.io.wavfile as wavfile

print(f"librosa: {librosa.__version__}")
print(f"soundfile: {sf.__version__}")

Loading and Inspecting Audio Files

Python
import librosa
import numpy as np
import soundfile as sf
import scipy.io.wavfile as wavfile   # needed for the scipy example below

# ── Load with librosa ─────────────────────────────────────────────
# Returns: waveform array (float32, range approx [-1, 1])
#          and sample rate
y, sr = librosa.load("data/audio/speech_sample.wav")

print(f"Sample rate:    {sr} Hz")                    # Default: resampled to 22050
print(f"Duration:       {librosa.get_duration(y=y, sr=sr):.2f} seconds")
print(f"Number of samples: {len(y):,}")              # duration × sr
print(f"Array shape:    {y.shape}")                  # (n_samples,) for mono
print(f"Data type:      {y.dtype}")                  # float32
print(f"Amplitude range: [{y.min():.4f}, {y.max():.4f}]")  # ~[-1, 1]

# Load at original sample rate (don't resample)
y_orig, sr_orig = librosa.load("data/audio/speech_sample.wav", sr=None)
print(f"\nOriginal sample rate: {sr_orig} Hz")

# Load at a specific sample rate (resample)
y_16k, sr_16k = librosa.load("data/audio/speech_sample.wav", sr=16000)

# Load only part of a file (efficient for long recordings)
y_first_10s, sr = librosa.load("data/audio/long_recording.wav",
                                 offset=0.0,      # Start at 0 seconds
                                 duration=10.0)   # Load 10 seconds only

# ── Load stereo audio ─────────────────────────────────────────────
# mono=True (default): convert stereo to mono by averaging channels
# mono=False: preserve stereo as (2, n_samples) array
y_stereo, sr = librosa.load("data/audio/stereo.wav", mono=False)
print(f"\nStereo shape: {y_stereo.shape}")   # (2, n_samples)
print(f"Left channel:  {y_stereo[0].shape}")
print(f"Right channel: {y_stereo[1].shape}")

# Convert stereo to mono manually
y_mono = y_stereo.mean(axis=0)

# ── Load with soundfile (direct read, no resampling) ──────────────
data_sf, samplerate = sf.read("data/audio/speech_sample.wav")
print(f"\nsoundfile shape: {data_sf.shape}")  # (n_samples,) or (n_samples, channels)
print(f"soundfile dtype: {data_sf.dtype}")   # float64 by default

# ── Load with scipy (raw, no resampling) ─────────────────────────
sr_scipy, data_scipy = wavfile.read("data/audio/speech_sample.wav")
print(f"\nscipy rate: {sr_scipy}, shape: {data_scipy.shape}, dtype: {data_scipy.dtype}")
# scipy returns the dtype stored in the file (int16 for standard 16-bit PCM):
# raw integer values, not normalized to [-1, 1]
# Normalize (assumes int16 input):
data_normalized = data_scipy.astype(np.float32) / 32768.0
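
The divisor 32768.0 above is only right for 16-bit data. A minimal sketch of a dtype-aware conversion follows, assuming the input is integer PCM as returned by a raw WAV reader; `pcm_to_float32` is a hypothetical helper, not a scipy or librosa function.

```python
import numpy as np


def pcm_to_float32(data: np.ndarray) -> np.ndarray:
    """Scale integer PCM samples to float32 in approximately [-1.0, 1.0]."""
    if np.issubdtype(data.dtype, np.floating):
        return data.astype(np.float32)                 # already floating point
    if data.dtype == np.uint8:                         # 8-bit WAV is unsigned, centered on 128
        return (data.astype(np.float32) - 128.0) / 128.0
    info = np.iinfo(data.dtype)                        # int16 -> 32768, int32 -> 2147483648
    return data.astype(np.float32) / float(-info.min)


int16_clip = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
uint8_clip = np.array([0, 128, 255], dtype=np.uint8)

print(pcm_to_float32(int16_clip))
print(pcm_to_float32(uint8_clip))
```

Checking `data.dtype` before scaling avoids the silent error of dividing 8-bit or 32-bit data by the 16-bit constant.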

Visualizing Audio

Python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Create a synthetic audio signal for illustration
sr = 22050
duration = 3.0  # seconds
t = np.arange(int(sr * duration)) / sr  # sample times spaced exactly 1/sr apart

# Combine a 440 Hz tone (A4) with a 880 Hz tone (A5) and noise
y_clean  = 0.5 * np.sin(2 * np.pi * 440 * t)
y_clean += 0.3 * np.sin(2 * np.pi * 880 * t)
y_noise  = 0.05 * np.random.randn(len(t))
y = (y_clean + y_noise).astype(np.float32)

fig, axes = plt.subplots(4, 1, figsize=(14, 12))
fig.suptitle("Audio Visualization Suite", fontsize=14, fontweight="bold")

# ── 1. Waveform ────────────────────────────────────────────────────
ax1 = axes[0]
librosa.display.waveshow(y, sr=sr, ax=ax1, color="steelblue", alpha=0.7)
ax1.set_title("Waveform (Amplitude vs. Time)")
ax1.set_xlabel("Time (seconds)")
ax1.set_ylabel("Amplitude")
ax1.axhline(y=0, color="black", linewidth=0.5)

# ── 2. Short-Time Fourier Transform Spectrogram ────────────────────
D = librosa.stft(y)
D_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

ax2 = axes[1]
img = librosa.display.specshow(D_db, sr=sr, x_axis="time",
                                 y_axis="hz", ax=ax2, cmap="magma")
plt.colorbar(img, ax=ax2, format="%+2.0f dB")
ax2.set_title("STFT Spectrogram (Frequency vs. Time)")
ax2.set_ylabel("Frequency (Hz)")

# ── 3. Mel Spectrogram ─────────────────────────────────────────────
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr,
                                            n_mels=128, fmax=8000)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

ax3 = axes[2]
img = librosa.display.specshow(mel_spec_db, sr=sr, x_axis="time",
                                 y_axis="mel", ax=ax3, cmap="magma",
                                 fmax=8000)
plt.colorbar(img, ax=ax3, format="%+2.0f dB")
ax3.set_title("Mel Spectrogram (Perceptual Frequency Scale)")
ax3.set_ylabel("Mel Frequency")

# ── 4. Power Spectral Density (average frequency content) ─────────
ax4 = axes[3]
from scipy.signal import welch
freqs, psd = welch(y, fs=sr, nperseg=1024)
ax4.semilogy(freqs, psd, color="darkorange")
ax4.set_title("Power Spectral Density (Average Frequency Distribution)")
ax4.set_xlabel("Frequency (Hz)")
ax4.set_ylabel("Power (log scale)")
ax4.set_xlim([0, 4000])
ax4.axvline(x=440,  color="blue",  linestyle="--", alpha=0.7, label="440 Hz (A4)")
ax4.axvline(x=880,  color="red",   linestyle="--", alpha=0.7, label="880 Hz (A5)")
ax4.legend()

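plt.tight_layout()
from pathlib import Path
Path("output").mkdir(exist_ok=True)  # savefig fails if the directory is missing
plt.savefig("output/audio_visualization.png", dpi=150, bbox_inches="tight")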
plt.show()
print("Audio visualization saved.")

Understanding the Spectrogram

The spectrogram is the most important visual and computational representation of audio. Instead of showing amplitude over time (waveform), it shows frequency content over time.

How a Spectrogram Is Computed

The process uses the Short-Time Fourier Transform (STFT):

  1. Window the signal: Divide the audio into short overlapping frames (for speech, typically a 25 ms window with a 10 ms hop)
  2. Apply FFT to each frame: Convert each time frame from time domain to frequency domain
  3. Stack the results: Each frame becomes one column in the spectrogram — time on x-axis, frequency on y-axis, intensity (color) shows magnitude
Python
import librosa
import numpy as np

# STFT parameters
n_fft    = 2048    # FFT window size (samples) — frequency resolution
hop_length = 512   # Hop between frames (samples) — time resolution
win_length = None  # Window length (None = same as n_fft)

# Compute STFT
D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

print(f"STFT shape: {D.shape}")
# (1 + n_fft//2, n_frames)
# = (1025, 130) for a 3-second clip at 22050 Hz (default center=True padding)
# Rows = frequency bins; Columns = time frames

print(f"Number of frequency bins: {D.shape[0]}")
print(f"Number of time frames: {D.shape[1]}")

# Frequency resolution: how many Hz per frequency bin
freq_resolution = sr / n_fft
print(f"Frequency resolution: {freq_resolution:.2f} Hz per bin")

# Time resolution: how many ms per hop
time_resolution = hop_length / sr * 1000
print(f"Time resolution: {time_resolution:.1f} ms per frame")

# The fundamental tradeoff: larger n_fft → better frequency resolution
# but worse time resolution. For speech: n_fft=512 (fast events);
# for music analysis: n_fft=2048+ (harmonic structure)
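
The three steps listed earlier (window, FFT, stack) can also be written out by hand in NumPy. This is a naive sketch for intuition only: unlike `librosa.stft` with its defaults, it does not pad the signal edges, so the frame count differs. `manual_stft` is an illustrative name.

```python
import numpy as np


def manual_stft(y: np.ndarray, n_fft: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Naive STFT: Hann-windowed frames -> real FFT -> columns of a matrix."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length    # no padding at the edges
    columns = []
    for i in range(n_frames):
        start = i * hop_length
        frame = y[start:start + n_fft] * window      # step 1: window the signal
        columns.append(np.fft.rfft(frame))           # step 2: FFT of each frame
    return np.stack(columns, axis=1)                 # step 3: stack frames as columns


sr = 22050
t = np.arange(sr) / sr                               # 1 second of audio
tone = np.sin(2 * np.pi * 440 * t)                   # 440 Hz test tone

D = manual_stft(tone)
peak_bin = int(np.abs(D).mean(axis=1).argmax())
print(f"STFT shape: {D.shape}")
print(f"Peak at bin {peak_bin}, about {peak_bin * sr / 2048:.1f} Hz")
```

The peak lands on the bin nearest 440 Hz rather than on 440 Hz exactly, which is the frequency resolution limit (sr / n_fft) in action.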

The Mel Scale: Matching Human Perception

The standard STFT uses linear frequency spacing, but human hearing is logarithmic — we’re much more sensitive to frequency differences at low frequencies than high frequencies. The Mel scale warps the frequency axis to match human perception:

Python
import librosa
import numpy as np

# Mel spectrogram: STFT projected onto perceptual Mel frequency scale
mel_spec = librosa.feature.melspectrogram(
    y=y,
    sr=sr,
    n_mels=128,       # Number of Mel frequency bins (typical: 64-128)
    n_fft=2048,       # FFT window size
    hop_length=512,   # Hop between frames
    fmin=20,          # Minimum frequency (Hz), roughly the lower limit of human hearing
    fmax=8000         # Maximum frequency — 8kHz captures most speech
)

print(f"Mel spectrogram shape: {mel_spec.shape}")
# (128, n_frames) — 128 Mel bands × time frames

# Convert to decibels (log scale for better visual contrast)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)

# The Mel spectrogram is typically used as input to CNNs for audio classification
# It's the "image" of the audio — CNNs treat it like any other 2D image
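
For intuition about the warping itself, one common definition of the Mel scale is the HTK-style formula mel = 2595 · log10(1 + f / 700). Note the assumption here: librosa uses a different (Slaney-style) variant by default and selects this formula only with `htk=True`, so the numbers below illustrate the shape of the curve rather than librosa's exact bin edges.

```python
import math


def hz_to_mel_htk(f_hz: float) -> float:
    """HTK-style mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers fewer and fewer mels as frequency rises
for f in (100, 1_000, 4_000, 8_000):
    step = hz_to_mel_htk(f + 100) - hz_to_mel_htk(f)
    print(f"{f:>5} Hz -> {hz_to_mel_htk(f):7.1f} mel; +100 Hz adds {step:5.1f} mel")
```

Spacing the filterbank evenly in mels therefore gives narrow bands at low frequencies and wide bands at high frequencies.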

MFCCs: The Standard Audio Feature

Mel-Frequency Cepstral Coefficients (MFCCs) are the most widely used feature representation for audio data science, particularly for speech. They compress the key information from the Mel spectrogram into a compact vector.
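
The compression step in a typical MFCC computation is a discrete cosine transform (DCT-II) applied to the log-Mel spectrogram, keeping only the first few coefficients. The sketch below builds that transform by hand in NumPy and applies it to random stand-in data; it assumes the orthonormal DCT-II convention and is meant for intuition, not as a replacement for `librosa.feature.mfcc`.

```python
import numpy as np


def dct2_orthonormal_basis(n_coeffs: int, n_inputs: int) -> np.ndarray:
    """First n_coeffs rows of the orthonormal DCT-II basis for length-n_inputs vectors."""
    n = np.arange(n_inputs)
    k = np.arange(n_coeffs)[:, None]
    basis = np.sqrt(2.0 / n_inputs) * np.cos(np.pi * (n + 0.5) * k / n_inputs)
    basis[0] /= np.sqrt(2.0)                  # orthonormal scaling for the k = 0 row
    return basis


rng = np.random.default_rng(0)
log_mel = rng.normal(size=(40, 100))          # stand-in for a (n_mels, n_frames) log-mel matrix

basis = dct2_orthonormal_basis(13, 40)
mfcc_like = basis @ log_mel                   # 40 mel bands -> 13 coefficients per frame
print(f"log-mel {log_mel.shape} -> MFCC-like {mfcc_like.shape}")

# A perfectly flat spectrum has energy only in coefficient 0
flat = basis @ np.ones(40)
print(np.round(flat, 6))
```

Low-order coefficients describe the broad shape of the spectrum (the spectral envelope), which is why a handful of them is often enough for speech tasks.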

Computing MFCCs

Python
import librosa
import numpy as np
import pandas as pd

# Compute MFCCs
mfccs = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=13,        # Number of MFCC coefficients (typically 13-40)
    n_fft=2048,
    hop_length=512
)

print(f"MFCC matrix shape: {mfccs.shape}")
# (n_mfcc, n_frames) = (13, 130) for the 3-second clip above

# Delta MFCCs: how MFCCs change over time (velocity)
delta_mfccs  = librosa.feature.delta(mfccs)

# Delta-delta MFCCs: acceleration
delta2_mfccs = librosa.feature.delta(mfccs, order=2)

print(f"Delta MFCC shape: {delta_mfccs.shape}")
print(f"Delta² MFCC shape: {delta2_mfccs.shape}")

# Standard feature vector: concatenate MFCCs + deltas + delta-deltas
# This gives 39 features (13 × 3) per time frame
all_features = np.vstack([mfccs, delta_mfccs, delta2_mfccs])
print(f"Full feature matrix: {all_features.shape}")  # (39, n_frames)


def extract_audio_features_summary(
    y: np.ndarray,
    sr: int,
    n_mfcc: int = 13
) -> np.ndarray:
    """
    Extract a fixed-length summary feature vector from an audio clip.

    Aggregates MFCC and other spectral features over time
    by computing statistics (mean, std, min, max) per coefficient.

    This converts variable-length audio into a fixed-length vector
    suitable for standard ML classifiers.

    Parameters
    ----------
    y : np.ndarray
        Audio waveform (mono, float32).
    sr : int
        Sample rate.
    n_mfcc : int
        Number of MFCC coefficients.

    Returns
    -------
    np.ndarray
        Fixed-length feature vector: 4 statistics × (n_mfcc × 3) MFCC rows,
        plus spectral, chroma, and mel statistics (270 values for n_mfcc=13).
    """
    features = {}

    # MFCCs + deltas + delta-deltas
    mfcc       = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta_mfcc  = librosa.feature.delta(mfcc)
    delta2_mfcc = librosa.feature.delta(mfcc, order=2)

    for name, coefs in [("mfcc", mfcc), ("dmfcc", delta_mfcc), ("d2mfcc", delta2_mfcc)]:
        features[f"{name}_mean"] = coefs.mean(axis=1)  # Mean over time
        features[f"{name}_std"]  = coefs.std(axis=1)   # Std over time
        features[f"{name}_max"]  = coefs.max(axis=1)
        features[f"{name}_min"]  = coefs.min(axis=1)

    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    spectral_bandwidth= librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]
    spectral_rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]
    zero_crossing_rate= librosa.feature.zero_crossing_rate(y)[0]
    rms_energy        = librosa.feature.rms(y=y)[0]

    for name, feat in [
        ("spec_centroid",   spectral_centroid),
        ("spec_bandwidth",  spectral_bandwidth),
        ("spec_rolloff",    spectral_rolloff),
        ("zcr",             zero_crossing_rate),
        ("rms",             rms_energy),
    ]:
        features[f"{name}_mean"] = np.array([feat.mean()])
        features[f"{name}_std"]  = np.array([feat.std()])

    # Chroma features (pitch class profiles)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=12)
    features["chroma_mean"] = chroma.mean(axis=1)
    features["chroma_std"]  = chroma.std(axis=1)

    # Mel spectrogram statistics
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    features["mel_mean"] = mel_db.mean(axis=1)
    features["mel_std"]  = mel_db.std(axis=1)

    # Concatenate all into one vector
    return np.concatenate([v.flatten() for v in features.values()])


# Extract features from a single clip
feature_vector = extract_audio_features_summary(y, sr)
print(f"\nFeature vector length: {len(feature_vector)}")
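
As a sanity check on the printed length, the size of the summary vector can be tallied from the function body: each block contributes (number of rows) × (number of statistics). The counts below are read directly from `extract_audio_features_summary` with its default `n_mfcc=13`.

```python
n_mfcc = 13

blocks = {
    "mfcc + delta + delta2 (mean, std, max, min)": 3 * n_mfcc * 4,
    "5 spectral/temporal features (mean, std)":    5 * 2,
    "chroma, 12 pitch classes (mean, std)":        12 * 2,
    "mel, 40 bands (mean, std)":                   40 * 2,
}

for name, size in blocks.items():
    print(f"{size:>4}  {name}")

total = sum(blocks.values())
print(f"{total:>4}  total")
```

If the printed feature vector length ever differs from this total, a feature block was added or removed without updating downstream code that depends on column positions.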

Audio Feature Reference

Python
# Complete reference of librosa features useful for data science

import librosa
import numpy as np

def extract_all_librosa_features(y: np.ndarray, sr: int) -> dict:
    """
    Extract all major librosa features with brief descriptions.
    Returns dict of {feature_name: array_or_value}.
    """
    features = {}

    # ── Temporal features ──────────────────────────────────────────
    # Zero Crossing Rate: how often the signal crosses zero
    # High ZCR → noisy, percussive sounds; Low ZCR → tonal sounds
    features["zero_crossing_rate"] = librosa.feature.zero_crossing_rate(y)[0]

    # RMS Energy: root mean square energy per frame
    # Loudness proxy — useful for detecting silence and loud events
    features["rms"] = librosa.feature.rms(y=y)[0]

    # ── Spectral features ──────────────────────────────────────────
    # Spectral Centroid: "center of mass" of the spectrum
    # Bright sounds (high freq) → high centroid; dark sounds → low centroid
    features["spectral_centroid"] = librosa.feature.spectral_centroid(y=y, sr=sr)[0]

    # Spectral Bandwidth: spread of frequencies around the centroid
    # Narrow bandwidth → pure tone; Wide bandwidth → noise or complex sound
    features["spectral_bandwidth"] = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]

    # Spectral Rolloff: frequency below which 85% of energy is concentrated
    # Distinguishes harmonic vs. percussive sounds
    features["spectral_rolloff"] = librosa.feature.spectral_rolloff(y=y, sr=sr)[0]

    # Spectral Contrast: energy difference between peaks and valleys
    # Useful for music/speech discrimination
    features["spectral_contrast"] = librosa.feature.spectral_contrast(y=y, sr=sr)

    # ── Cepstral features ──────────────────────────────────────────
    # MFCCs: compact perceptual representation of the spectrum
    # Most important feature for speech and audio classification
    features["mfcc"] = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # ── Rhythmic features ──────────────────────────────────────────
    # Tempo: estimated beats per minute
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    # Depending on the librosa version, tempo is a float or a 1-element array
    features["tempo"] = np.atleast_1d(tempo).astype(float)

    # ── Pitch/Harmonic features ────────────────────────────────────
    # Chroma Features: energy in each of 12 pitch classes
    # Represents harmonic and melodic content
    features["chroma_stft"] = librosa.feature.chroma_stft(y=y, sr=sr)

    # Chroma CENS: normalized, more robust to tempo/dynamics changes
    features["chroma_cens"] = librosa.feature.chroma_cens(y=y, sr=sr)

    # Harmonic/Percussive separation
    y_harmonic, y_percussive = librosa.effects.hpss(y)
    features["harmonic_ratio"] = np.array([
        np.mean(np.abs(y_harmonic)) / (np.mean(np.abs(y)) + 1e-10)
    ])

    # Tonnetz: tonal centroid features
    # (reuses the harmonic component from hpss above instead of recomputing it)
    features["tonnetz"] = librosa.feature.tonnetz(y=y_harmonic, sr=sr)

    return features


# Demonstrate on our synthetic tone
feature_dict = extract_all_librosa_features(y, sr)
print("Feature shapes:")
for name, feat in feature_dict.items():
    shape = feat.shape if hasattr(feat, "shape") else "scalar"
    summary = f"mean={feat.mean():.3f}" if hasattr(feat, "mean") else str(feat)
    print(f"  {name:25s}: {str(shape):15s} | {summary}")
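
Two of the descriptions above are simple enough to compute by hand, which helps in interpreting the values. The following single-frame sketch in NumPy computes zero crossing rate as the fraction of sign changes and spectral centroid as the magnitude-weighted mean frequency. librosa computes these per frame with its own framing and padding, so its exact numbers will differ; the function names are reused here purely for illustration.

```python
import numpy as np


def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))


def spectral_centroid(frame: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency of one frame, in Hz."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10))


sr = 22050
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 440 * t)                       # tonal: few crossings, low centroid
noise = np.random.default_rng(0).normal(size=2048)       # noisy: many crossings, high centroid

for name, frame in [("440 Hz tone", tone), ("white noise", noise)]:
    print(f"{name:12s} ZCR={zero_crossing_rate(frame):.3f}  "
          f"centroid={spectral_centroid(frame, sr):7.1f} Hz")
```

A pure tone of frequency f crosses zero about 2f times per second, so its ZCR is close to 2f / sr, while white noise changes sign on roughly half of all samples.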

Audio Classification Pipeline

Combining feature extraction with a standard sklearn classifier:

Python
import os
import numpy as np
import pandas as pd
import librosa
from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")

def load_audio_dataset(
    audio_dir: str,
    sr: int = 22050,
    duration: float = None,
    max_files_per_class: int = None
) -> tuple:
    """
    Load an audio dataset from a class-folder directory structure.

    audio_dir/
        class_a/  (e.g., 'dog_bark', 'car_horn', 'speech')
            sound001.wav
            sound002.wav
        class_b/
            sound003.wav

    Parameters
    ----------
    audio_dir : str
        Root directory with class subdirectories.
    sr : int
        Sample rate to load all audio at (resamples if needed).
    duration : float, optional
        Clip duration in seconds. Truncate/pad to this length.
    max_files_per_class : int, optional
        Maximum files to load per class (for balanced datasets).

    Returns
    -------
    tuple
        (features_matrix, labels_array, class_names)
    """
    audio_dir_path = Path(audio_dir)
    class_dirs     = sorted([d for d in audio_dir_path.iterdir() if d.is_dir()])
    le             = LabelEncoder()
    le.fit([d.name for d in class_dirs])
    class_names    = list(le.classes_)   # index i matches encoded label i

    all_features, all_labels = [], []
    audio_extensions = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}

    for class_dir in class_dirs:
        class_files = [
            f for f in class_dir.iterdir()
            if f.suffix.lower() in audio_extensions
        ]
        if max_files_per_class:
            class_files = class_files[:max_files_per_class]

        print(f"  {class_dir.name}: {len(class_files)} files")

        for filepath in class_files:
            try:
                # Load audio
                y_clip, _ = librosa.load(str(filepath), sr=sr, duration=duration)

                # Pad if shorter than target duration
                if duration:
                    target_samples = int(duration * sr)
                    if len(y_clip) < target_samples:
                        y_clip = np.pad(y_clip, (0, target_samples - len(y_clip)))
                    else:
                        y_clip = y_clip[:target_samples]

                # Extract features
                feat_vector = extract_audio_features_summary(y_clip, sr)
                all_features.append(feat_vector)
                all_labels.append(class_dir.name)

            except Exception as e:
                print(f"    Skipped {filepath.name}: {e}")

    features_matrix = np.array(all_features)
    labels_encoded  = le.transform(all_labels)

    print(f"\nDataset shape: {features_matrix.shape}")
    print(f"Classes: {class_names}")
    print(f"Samples per class: {pd.Series(all_labels).value_counts().to_dict()}")

    return features_matrix, labels_encoded, class_names


def train_audio_classifier(
    features: np.ndarray,
    labels: np.ndarray,
    class_names: list,
    classifier: str = "random_forest"
) -> dict:
    """Train and cross-validate an audio classifier."""

    classifiers = {
        "logistic":     LogisticRegression(C=1.0, max_iter=500, random_state=42),
        "random_forest":RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
        "gradient_boost":GradientBoostingClassifier(n_estimators=100, random_state=42),
    }

    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("clf",    classifiers[classifier])
    ])

    # Stratified cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(pipeline, features, labels, cv=cv,
                                  scoring="accuracy", n_jobs=-1)

    pipeline.fit(features, labels)  # Fit final model on all data

    print(f"\nAudio classifier: {classifier}")
    print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
    print(f"CV scores:   {np.round(cv_scores, 3)}")

    return {
        "model":    pipeline,
        "cv_mean":  cv_scores.mean(),
        "cv_std":   cv_scores.std(),
        "class_names": class_names
    }


def predict_audio_file(model_info: dict, filepath: str, sr: int = 22050) -> dict:
    """Predict the class of a single audio file."""
    model = model_info["model"]

    y, _ = librosa.load(filepath, sr=sr)
    features = extract_audio_features_summary(y, sr).reshape(1, -1)

    class_idx   = model.predict(features)[0]
    class_probs = model.predict_proba(features)[0]
    class_name  = model_info["class_names"][class_idx]

    result = {
        "predicted_class": class_name,
        "confidence":      float(class_probs.max()),
        "all_probs":       dict(zip(model_info["class_names"], class_probs.tolist()))
    }

    print(f"Prediction: {class_name} (confidence: {class_probs.max():.3f})")
    return result

Audio Preprocessing Techniques

Python
import librosa
import numpy as np

def normalize_audio(y: np.ndarray, method: str = "peak") -> np.ndarray:
    """
    Normalize audio amplitude.

    Parameters
    ----------
    method : str
        'peak': normalize so the max absolute value is 1.0
        'rms':  normalize to a target RMS level
    """
    if method == "peak":
        peak = np.max(np.abs(y))
        return y / (peak + 1e-10)
    elif method == "rms":
        target_rms = 0.1
        current_rms = np.sqrt(np.mean(y**2))
        return y * (target_rms / (current_rms + 1e-10))
    return y


def remove_silence(
    y: np.ndarray,
    sr: int,
    top_db: float = 20.0,
    pad_ms: int = 50
) -> np.ndarray:
    """
    Remove silent regions from audio.

    Parameters
    ----------
    y : np.ndarray
        Input audio waveform.
    sr : int
        Sample rate.
    top_db : float
        Silence threshold in dB below the loudest part.
        Lower values = more aggressive silence removal.
    pad_ms : int
        Milliseconds of audio to keep before/after voiced segments.
    """
    # Get intervals of non-silent audio
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals) == 0:
        return y  # All silence: return original

    pad_samples = int(pad_ms * sr / 1000)
    merged = []
    for start, end in intervals:
        # Add padding around each voiced segment
        padded_start = max(0, start - pad_samples)
        padded_end   = min(len(y), end + pad_samples)
        if merged and padded_start <= merged[-1][1]:
            # Padded segments overlap: extend the previous one instead of
            # duplicating the shared samples
            merged[-1][1] = max(merged[-1][1], padded_end)
        else:
            merged.append([padded_start, padded_end])

    return np.concatenate([y[s:e] for s, e in merged])


def apply_noise_reduction(y: np.ndarray, noise_factor: float = 0.1) -> np.ndarray:
    """
    Simple noise reduction using spectral subtraction.

    Estimates noise from the quietest parts of the recording
    and subtracts it from the full spectrum.
    """
    D = librosa.stft(y)
    magnitude = np.abs(D)
    phase = np.angle(D)

    # Estimate noise floor from lowest-energy frames
    frame_energy = magnitude.mean(axis=0)
    quiet_frames = frame_energy <= np.percentile(frame_energy, 15)  # <= keeps at least one frame
    noise_estimate = magnitude[:, quiet_frames].mean(axis=1, keepdims=True)

    # Spectral subtraction
    denoised_magnitude = np.maximum(
        magnitude - noise_factor * noise_estimate,
        0.01 * magnitude   # Keep small residual to avoid musical noise
    )

    D_denoised = denoised_magnitude * np.exp(1j * phase)
    return librosa.istft(D_denoised, length=len(y))


def pad_or_trim(y: np.ndarray, sr: int, target_duration: float) -> np.ndarray:
    """
    Ensure audio is exactly target_duration seconds long.
    Pads with zeros if too short, trims if too long.
    """
    target_samples = int(target_duration * sr)
    if len(y) > target_samples:
        return y[:target_samples]
    elif len(y) < target_samples:
        return np.pad(y, (0, target_samples - len(y)))
    return y
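
The `top_db` threshold in `remove_silence` is easier to tune once it is translated into an amplitude ratio. The sketch below assumes the threshold is applied to frame RMS relative to the loudest frame, which matches how `librosa.effects.split` behaves with its default reference; the helper names are illustrative.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio corresponding to a level difference in dB."""
    return 10.0 ** (db / 20.0)


def amplitude_ratio_to_db(ratio: float) -> float:
    """Level difference in dB for a given amplitude ratio."""
    return 20.0 * math.log10(ratio)


for top_db in (20, 40, 60):
    ratio = db_to_amplitude_ratio(-top_db)
    print(f"top_db={top_db}: frames below {ratio:.3%} of the peak RMS count as silence")
```

With `top_db=20`, anything quieter than one tenth of the loudest frame is dropped, which is aggressive for recordings with soft speech; larger values keep more of the signal.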

Speech-to-Text with Whisper

OpenAI’s Whisper model provides robust speech transcription across roughly 100 languages:

Python
import whisper
import json

def transcribe_audio(
    filepath: str,
    model_size: str = "base",      # 'tiny', 'base', 'small', 'medium', 'large'
    language: str = None,          # None = auto-detect
    task: str = "transcribe"       # 'transcribe' or 'translate' (to English)
) -> dict:
    """
    Transcribe an audio file using OpenAI Whisper.

    Parameters
    ----------
    filepath : str
        Path to audio file (supports WAV, MP3, M4A, FLAC, etc.)
    model_size : str
        Whisper model size: tiny (39M params, fastest) to large (1.5B, most accurate).
    language : str, optional
        Language code (e.g., 'en', 'es', 'fr'). None = auto-detect.
    task : str
        'transcribe' for same-language, 'translate' to translate to English.

    Returns
    -------
    dict
        Transcription result with text, segments, and language.
    """
    print(f"Loading Whisper {model_size} model...")
    model = whisper.load_model(model_size)

    print(f"Transcribing: {filepath}")
    result = model.transcribe(
        filepath,
        language=language,
        task=task,
        verbose=False
    )

    print(f"\nDetected language: {result['language']}")
    print(f"Full transcription:")
    print(result["text"])

    if result.get("segments"):
        print(f"\nSegment-level timestamps ({len(result['segments'])} segments):")
        for segment in result["segments"][:5]:   # Show first 5
            print(f"  [{segment['start']:5.1f}s - {segment['end']:5.1f}s] "
                  f"{segment['text'].strip()}")

    return result


def batch_transcribe_directory(
    audio_dir: str,
    output_jsonl: str,
    model_size: str = "base",
    extensions: tuple = (".wav", ".mp3", ".m4a", ".flac")
) -> list:
    """
    Batch transcribe all audio files in a directory.
    Saves results as JSONL for downstream processing.
    """
    from pathlib import Path
    import json

    audio_dir_path = Path(audio_dir)
    audio_files = [
        f for f in audio_dir_path.rglob("*")
        if f.suffix.lower() in extensions
    ]
    print(f"Found {len(audio_files)} audio files to transcribe")

    model = whisper.load_model(model_size)
    results = []

    with open(output_jsonl, "w") as out_f:
        for i, filepath in enumerate(audio_files):
            print(f"[{i+1}/{len(audio_files)}] {filepath.name}")
            try:
                result = model.transcribe(str(filepath), verbose=False)
                record = {
                    "filepath":    str(filepath),
                    "filename":    filepath.name,
                    "language":    result["language"],
                    "text":        result["text"].strip(),
                    "n_segments":  len(result.get("segments", [])),
                    "duration_s":  result["segments"][-1]["end"]
                               if result.get("segments") else None
                }
                out_f.write(json.dumps(record) + "\n")
                results.append(record)
            except Exception as e:
                print(f"  Error: {e}")

    print(f"\nTranscriptions saved to: {output_jsonl}")
    return results
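
The segment timestamps are often more useful than the flat text, for example to produce subtitles. Below is a minimal sketch that renders segments as SRT text, assuming each segment is a dict with `start`, `end`, and `text` keys as used in the code above. The example segments are made up, and the two helper functions are not part of Whisper.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render a list of {'start', 'end', 'text'} dicts as SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{format_srt_timestamp(seg['start'])} --> {format_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Made-up segments shaped like result["segments"]
example_segments = [
    {"start": 0.0, "end": 2.4, "text": " Hello and welcome."},
    {"start": 2.4, "end": 3725.5, "text": " Thanks for joining."},
]
print(segments_to_srt(example_segments))
```

Working in integer milliseconds avoids floating-point rounding surprises when the timestamp is split into hours, minutes, and seconds.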

Audio Anomaly Detection

Detecting unusual sounds in industrial or environmental monitoring:

Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
import librosa

def build_audio_anomaly_detector(
    normal_audio_files: list,
    sr: int = 22050,
    contamination: float = 0.05
) -> dict:
    """
    Build an anomaly detector from a set of 'normal' audio recordings.

    Trains on features from known-normal audio (e.g., healthy machinery)
    and flags deviations as anomalies (e.g., bearing failures, unusual sounds).

    Parameters
    ----------
    normal_audio_files : list
        Paths to audio files representing normal (non-anomalous) conditions.
    sr : int
        Sample rate.
    contamination : float
        Expected proportion of anomalies (used to set decision threshold).

    Returns
    -------
    dict
        Trained detector and preprocessing components.
    """
    print(f"Building anomaly detector from {len(normal_audio_files)} normal recordings...")

    features_list = []
    for filepath in normal_audio_files:
        try:
            y, _ = librosa.load(filepath, sr=sr)
            feat = extract_audio_features_summary(y, sr)
            features_list.append(feat)
        except Exception as e:
            print(f"  Skipped {filepath}: {e}")

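    # PCA and the scaler need at least two feature vectors to fit
    if len(features_list) < 2:
        raise ValueError("Need at least 2 readable recordings to fit the detector")

    features = np.array(features_list)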

    # Preprocess
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features)

    # Dimensionality reduction (optional, helps with high-dimensional features)
    n_components = min(50, X_scaled.shape[0] - 1, X_scaled.shape[1])
    pca = PCA(n_components=n_components, random_state=42)
    X_pca = pca.fit_transform(X_scaled)

    # Train anomaly detector
    detector = IsolationForest(
        n_estimators=100,
        contamination=contamination,
        random_state=42,
        n_jobs=-1
    )
    detector.fit(X_pca)

    print(f"Feature dim: {features.shape[1]} → PCA: {n_components}")
    print(f"Detector trained on {len(features)} normal recordings")

    return {
        "detector": detector,
        "scaler":   scaler,
        "pca":      pca,
        "sr":       sr
    }


def score_audio_anomaly(detector_info: dict, filepath: str) -> dict:
    """
    Score a single audio file for anomaly likelihood.

    Returns a score where lower = more anomalous.
    Score < 0 typically indicates an anomaly.
    """
    y, _ = librosa.load(filepath, sr=detector_info["sr"])
    feat = extract_audio_features_summary(y, detector_info["sr"])

    X = detector_info["scaler"].transform(feat.reshape(1, -1))
    X_pca = detector_info["pca"].transform(X)

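    # decision_function is centered on the fitted threshold, so score < 0
    # matches predict() == -1 (score_samples is negative for every sample)
    score = detector_info["detector"].decision_function(X_pca)[0]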
    is_anomaly = detector_info["detector"].predict(X_pca)[0] == -1

    return {
        "filepath":   filepath,
        "score":      float(score),
        "is_anomaly": bool(is_anomaly),
        "label":      "ANOMALY" if is_anomaly else "Normal"
    }
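
The two scoring methods on scikit-learn's IsolationForest use different zero points, which matters for the "score < 0" rule in the docstring above. The sketch below uses synthetic Gaussian vectors as a stand-in for real audio features (an assumption made purely for illustration). It shows that decision_function is score_samples shifted by the fitted offset_, and that its sign agrees with predict.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for audio feature vectors (illustration only)
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 8))
X_odd = rng.normal(loc=6.0, scale=1.0, size=(5, 8))   # Far from the training data

detector = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
detector.fit(X_normal)

X_test = np.vstack([X_normal[:5], X_odd])
raw = detector.score_samples(X_test)            # Always negative; lower = more anomalous
centered = detector.decision_function(X_test)   # Shifted so that 0 is the threshold
labels = detector.predict(X_test)               # -1 = anomaly, +1 = normal

for r, c, lab in zip(raw, centered, labels):
    print(f"score_samples={r:6.3f}  decision_function={c:6.3f}  predict={lab:+d}")
```

When a fixed cutoff at zero is wanted, decision_function is the value to report. score_samples is more useful for ranking recordings from most to least unusual.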

Audio Processing Best Practices

Common Pitfalls

Python
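import numpy as np
import librosa

# ── 1. Always check sample rate consistency ───────────────────────
# sr=None keeps each file's native rate; the default (sr=22050) resamples
# on load, so the comparison below would never trigger
y1, sr1 = librosa.load("file1.wav", sr=None)
y2, sr2 = librosa.load("file2.wav", sr=None)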

if sr1 != sr2:
    # Resample to a common rate before processing
    y2 = librosa.resample(y2, orig_sr=sr2, target_sr=sr1)
    print(f"Resampled file2 from {sr2}Hz to {sr1}Hz")

# ── 2. Handle stereo vs. mono consistently ───────────────────────
y, sr = librosa.load("audio.wav", mono=True)   # librosa downmixes to mono on load
# If the audio was loaded with mono=False, the shape is (channels, samples):
if y.ndim > 1:
    y = librosa.to_mono(y)  # Convert stereo to mono by averaging the channels

# ── 3. Be aware of very short files ──────────────────────────────
min_duration_s = 0.5  # Features need at least 0.5s of audio
if librosa.get_duration(y=y, sr=sr) < min_duration_s:
    print(f"Warning: audio too short ({librosa.get_duration(y=y, sr=sr):.2f}s)")
    y = np.pad(y, (0, int(min_duration_s * sr) - len(y)))

# ── 4. Normalize before feature extraction ───────────────────────
y_normalized = librosa.util.normalize(y)  # Peak normalization to [-1, 1]

# ── 5. Check for clipping (distortion) ───────────────────────────
clipped_fraction = (np.abs(y) >= 0.99).mean()
if clipped_fraction > 0.01:
    print(f"Warning: {clipped_fraction:.1%} of samples are clipped")
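
The checks above can be bundled into one reusable function that runs before feature extraction. The following is a minimal sketch using only NumPy. The function name check_audio and its default thresholds are hypothetical, chosen to mirror the values used above, and are not part of librosa.

```python
import numpy as np


def check_audio(y: np.ndarray, sr: int, expected_sr: int = 22050,
                min_duration_s: float = 0.5, clip_level: float = 0.99,
                max_clipped: float = 0.01) -> dict:
    """Hypothetical pre-flight check for a float waveform in [-1, 1].

    Expects shape (samples,) or (channels, samples), the layout that
    librosa.load returns. Reports problems without modifying the signal.
    """
    issues = []

    if sr != expected_sr:
        issues.append(f"sample rate {sr} Hz != expected {expected_sr} Hz")

    if y.ndim > 1:
        issues.append(f"{y.shape[0]} channels; convert to mono first")

    n_samples = y.shape[-1]
    duration_s = n_samples / sr
    if duration_s < min_duration_s:
        issues.append(f"too short ({duration_s:.2f}s < {min_duration_s}s)")

    # Fraction of samples at or near full scale
    clipped_fraction = float((np.abs(y) >= clip_level).mean()) if y.size else 0.0
    if clipped_fraction > max_clipped:
        issues.append(f"{clipped_fraction:.1%} of samples clipped")

    return {"ok": not issues, "duration_s": duration_s,
            "clipped_fraction": clipped_fraction, "issues": issues}


# Usage: a clean 1-second 440 Hz tone vs. an overdriven copy of it
sr = 22050
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
overdriven = np.clip(4.0 * clean, -1.0, 1.0)

print(check_audio(clean, sr)["issues"])
print(check_audio(overdriven, sr)["issues"])
```

Run the clipping check on the signal as loaded, before peak normalization. Normalizing first pushes the largest sample to full scale and would make a clean recording look clipped.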

Summary

Audio data opens a rich analytical dimension for data scientists, from call center analytics and industrial monitoring to music analysis and accessibility tools. The key conceptual shift from tabular or image data is that the useful information lies in how frequency content changes over time, not in the raw amplitude values. Raw waveforms are rarely analyzed directly; instead, the STFT and Mel spectrogram transform them into 2D time-frequency representations, and MFCCs condense those into a compact, perceptually motivated feature set.

The practical workflow: load with Librosa at a consistent sample rate, preprocess (normalize, remove silence, ensure consistent length), extract features (MFCCs + deltas + spectral features aggregated to statistics), and train a standard sklearn classifier. For speech, Whisper provides state-of-the-art transcription that opens the door to text analytics on spoken content. For industrial monitoring, an Isolation Forest trained on features from known-normal recordings flags deviations as anomalies; treating spectrograms as images and reusing CNN feature extraction is another option.

Audio data science is an area where relatively modest expertise yields valuable capabilities — few data scientists work with audio, which makes the skill genuinely differentiating.

Key Takeaways

  • Audio is a time-series of amplitude samples at a fixed sample rate (Hz): 16,000 Hz for speech recognition, 44,100 Hz for music — always resample to a consistent rate before analysis
  • The spectrogram (the magnitude of the STFT) transforms the waveform into a time-frequency representation that reveals which frequencies are active when; it is the basic visual representation of audio and can be treated as an image
  • The Mel spectrogram warps the frequency axis to match human perceptual sensitivity (logarithmic, more sensitive at low frequencies) — it is the standard input format for audio deep learning models
  • MFCCs (Mel-Frequency Cepstral Coefficients, typically 13-40) are the most widely used compact feature for audio classification, especially speech — always compute them alongside their delta (velocity) and delta-delta (acceleration) coefficients
  • The standard ML pipeline is: load with librosa.load() → preprocess (normalize, remove silence, pad/trim) → extract features with extract_audio_features_summary() → train StandardScaler + RandomForest or LogisticRegression → evaluate with stratified cross-validation
  • OpenAI Whisper provides state-of-the-art speech transcription in three lines: model = whisper.load_model("base"); result = model.transcribe("audio.wav"); print(result["text"]). It supports roughly 100 languages with automatic language detection
  • For industrial anomaly detection, train an Isolation Forest on MFCC-based features from known-normal recordings; any recording whose features deviate significantly from those of the normal recordings is flagged as anomalous
  • Critical pitfalls: always convert stereo to mono before feature extraction, always verify sample rates match when comparing recordings, and always check for audio clipping (saturated recordings), which corrupts MFCC features
