Images in Python are represented as NumPy arrays: a grayscale image is a 2D array of shape (height, width), an RGB color image is a 3D array of shape (height, width, 3), and a batch of images for deep learning is a 4D array of shape (batch_size, height, width, channels). The Pillow library handles loading, saving, and basic manipulation; OpenCV adds computer vision operations like edge detection and feature extraction; and deep learning frameworks (PyTorch, TensorFlow) use these arrays as model inputs. The most practical approach for most data scientists is to use a pre-trained CNN model from torchvision or Hugging Face as a feature extractor, then train a simple classifier on top — this avoids training from scratch and works well even with small datasets.
Introduction
Images are one of the most information-dense data types humans work with, and they’re increasingly central to data science. An e-commerce company analyzes product photos to automatically tag categories and detect defects. A healthcare system uses chest X-rays to assist radiologists. A manufacturing plant uses camera feeds to detect assembly line defects in real time. A retail chain counts in-store foot traffic from security camera footage. Satellite imagery tracks crop health, deforestation, and urban growth. Social media platforms classify billions of uploaded photos every day.
Unlike text (which has a clear unit — the word) or tabular data (which has a clear unit — the row), images are fundamentally spatial and continuous. The meaning of a pixel depends on its context — the surrounding pixels at the same scale and across multiple scales simultaneously. This is why deep learning, and convolutional neural networks in particular, has been so transformative for image analysis: CNNs are architecturally designed to exploit spatial structure.
This article introduces image data for data science practitioners: what images are as data, how to load and manipulate them in Python, essential preprocessing and augmentation techniques, feature extraction approaches from classical histograms to pre-trained deep networks, and a practical guide to when and how to use each tool.
What Is an Image as Data?
An image is a 2D grid of pixels (picture elements), where each pixel holds an intensity value. The key concepts:
Grayscale vs. Color Images
Grayscale: Each pixel holds a single intensity value from 0 (black) to 255 (white). Shape: (height, width).
RGB color: Each pixel holds three values — Red, Green, Blue channels — each from 0 to 255. Shape: (height, width, 3). The combination of the three channel values determines the perceived color.
RGBA: RGB plus an Alpha (transparency) channel. Shape: (height, width, 4). Common in PNG files.
BGR: OpenCV stores color images as Blue-Green-Red (not RGB). This is a common source of bugs when mixing PIL/Pillow (RGB) with OpenCV (BGR).
Data Types and Value Ranges
uint8 (0–255): Standard image format, 1 byte per channel
float32 (0.0–1.0): Common for neural network inputs
float32 (-1.0–1.0): Normalized range used by some pretrained models
uint16 (0–65535): Medical imaging (CT, MRI, X-ray) and raw camera sensor dataimport numpy as np
from PIL import Image
import matplotlib.pyplot as plt
# A 4×4 grayscale image (handcrafted as a numpy array)
gray_array = np.array([
[0, 64, 128, 192],
[32, 96, 160, 224],
[64, 128, 192, 255],
[0, 0, 0, 0 ]
], dtype=np.uint8)
print(f"Grayscale array shape: {gray_array.shape}") # (4, 4)
print(f"Dtype: {gray_array.dtype}") # uint8
print(f"Value range: [{gray_array.min()}, {gray_array.max()}]")
# A 4×4 RGB image (3 channels)
rgb_array = np.zeros((4, 4, 3), dtype=np.uint8)
rgb_array[:, :, 0] = 200 # Red channel
rgb_array[:, :, 1] = 50 # Green channel
rgb_array[:, :, 2] = 50 # Blue channel
rgb_array[2, 2] = [255, 255, 0] # One yellow pixel
print(f"\nRGB array shape: {rgb_array.shape}") # (4, 4, 3)
print(f"Channel 0 (red): {rgb_array[:,:,0]}")Loading and Saving Images
Pillow (PIL)
Pillow is the standard Python imaging library — the right tool for loading, saving, and basic manipulation.
from PIL import Image, ImageFilter, ImageEnhance, ImageDraw
import numpy as np
import matplotlib.pyplot as plt
# ── Load an image ─────────────────────────────────────────────────
img = Image.open("data/images/sample.jpg")
print(f"Format: {img.format}") # JPEG
print(f"Mode: {img.mode}") # RGB
print(f"Size: {img.size}") # (width, height) in pixels, e.g. (1920, 1080)
# Important: PIL uses (width, height), NumPy uses (height, width)
# Always be aware of this axis convention difference!
# ── Convert to numpy array ────────────────────────────────────────
img_array = np.array(img)
print(f"Array shape: {img_array.shape}") # (height, width, 3)
print(f"Dtype: {img_array.dtype}") # uint8
print(f"Value range: [{img_array.min()}, {img_array.max()}]")
# ── Convert between modes ─────────────────────────────────────────
img_gray = img.convert("L") # RGB → Grayscale
img_rgba = img.convert("RGBA") # RGB → RGBA (add alpha channel)
img_rgb = img_gray.convert("RGB") # Grayscale → RGB (3 identical channels)
print(f"\nGrayscale shape: {np.array(img_gray).shape}") # (height, width)
print(f"RGBA shape: {np.array(img_rgba).shape}") # (height, width, 4)
# ── Convert numpy array back to PIL ──────────────────────────────
# Note: ensure correct dtype and value range
array_uint8 = (img_array * 0.5).astype(np.uint8) # Darken: multiply by 0.5
img_from_array = Image.fromarray(array_uint8)
# ── Save images ───────────────────────────────────────────────────
img.save("output/sample_copy.jpg", quality=90) # JPEG with quality setting
img.save("output/sample_copy.png") # PNG (lossless)
img_gray.save("output/sample_gray.jpg")
# ── Display images ─────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(img)
axes[0].set_title(f"Original RGB {img.size}")
axes[0].axis("off")
axes[1].imshow(img_gray, cmap="gray")
axes[1].set_title("Grayscale")
axes[1].axis("off")
plt.tight_layout()
plt.savefig("output/original_vs_gray.png", dpi=100, bbox_inches="tight")
plt.show()OpenCV
OpenCV (cv2) is the industry-standard computer vision library — faster than Pillow for many operations, with a much broader range of image processing functions.
import cv2
import numpy as np
# ── Load an image ─────────────────────────────────────────────────
# WARNING: OpenCV loads images in BGR order, not RGB!
img_bgr = cv2.imread("data/images/sample.jpg") # Returns BGR array
img_bgr_gray = cv2.imread("data/images/sample.jpg", cv2.IMREAD_GRAYSCALE)
print(f"OpenCV shape: {img_bgr.shape}") # (height, width, 3) in BGR
# ── Convert between color spaces ─────────────────────────────────
# Always convert BGR → RGB when displaying with matplotlib or PIL
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV) # Hue-Saturation-Value
img_lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB) # L*a*b* perceptual color
# ── Save with OpenCV ──────────────────────────────────────────────
cv2.imwrite("output/sample_cv.jpg", img_bgr) # Saves in BGR (correct for CV files)
cv2.imwrite("output/sample_gray_cv.jpg", img_gray)
# ── A critical helper: always convert OpenCV images before matplotlib ──
def cv2_to_display(img_bgr: np.ndarray) -> np.ndarray:
"""Convert OpenCV BGR image to RGB for display with matplotlib."""
if len(img_bgr.shape) == 3 and img_bgr.shape[2] == 3:
return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
return img_bgr # Grayscale: no conversion needed
plt.imshow(cv2_to_display(img_bgr))
plt.axis("off")
plt.title("Correctly displayed OpenCV image")
plt.show()Essential Image Operations
Resizing
from PIL import Image
import cv2
import numpy as np
# Resizing is critical for ML — models expect fixed-size inputs
# ── Pillow resizing ────────────────────────────────────────────────
img = Image.open("data/images/sample.jpg")
original_size = img.size # (width, height)
# Resize to exact dimensions (may distort aspect ratio)
img_224 = img.resize((224, 224), Image.LANCZOS) # Best quality resampling
# Resize to fit within a max dimension (preserve aspect ratio)
def resize_with_aspect(img: Image.Image, max_dim: int) -> Image.Image:
"""Resize to fit within max_dim × max_dim, preserving aspect ratio."""
w, h = img.size
if w > h:
new_w, new_h = max_dim, int(max_dim * h / w)
else:
new_w, new_h = int(max_dim * w / h), max_dim
return img.resize((new_w, new_h), Image.LANCZOS)
img_maxed = resize_with_aspect(img, 512)
# Thumbnail: PIL's built-in aspect-ratio-preserving resize
img_thumb = img.copy()
img_thumb.thumbnail((128, 128)) # Modifies in-place
# ── Center crop: resize then crop to exact square ─────────────────
def resize_and_center_crop(img: Image.Image, target_size: int) -> Image.Image:
"""
ImageNet-style preprocessing: resize shortest edge to target_size,
then crop a centered square.
"""
w, h = img.size
# Resize so the shortest edge equals target_size
if w < h:
new_w, new_h = target_size, int(target_size * h / w)
else:
new_w, new_h = int(target_size * w / h), target_size
img_resized = img.resize((new_w, new_h), Image.LANCZOS)
# Center crop
left = (new_w - target_size) // 2
top = (new_h - target_size) // 2
return img_resized.crop((left, top, left + target_size, top + target_size))
img_224_cropped = resize_and_center_crop(img, 224)
print(f"After center crop: {img_224_cropped.size}") # (224, 224)
# ── OpenCV resizing ────────────────────────────────────────────────
img_cv = cv2.imread("data/images/sample.jpg")
img_cv_resized = cv2.resize(img_cv, (224, 224),
interpolation=cv2.INTER_LANCZOS4)Image Normalization for ML
import numpy as np
from PIL import Image
def normalize_image(
img_array: np.ndarray,
method: str = "imagenet"
) -> np.ndarray:
"""
Normalize an image array for neural network input.
Parameters
----------
img_array : np.ndarray
Image array of shape (H, W, 3), dtype uint8 (0-255).
method : str
'imagenet': subtract ImageNet mean, divide by std (standard for pretrained models)
'zero_one': scale to [0, 1]
'minus_one_one': scale to [-1, 1]
Returns
-------
np.ndarray
Normalized array of dtype float32.
"""
arr = img_array.astype(np.float32)
if method == "zero_one":
return arr / 255.0
elif method == "minus_one_one":
return (arr / 127.5) - 1.0
elif method == "imagenet":
# ImageNet statistics (computed over the full ImageNet dataset)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])
arr = arr / 255.0
arr = (arr - IMAGENET_MEAN) / IMAGENET_STD
return arr.astype(np.float32)
else:
raise ValueError(f"Unknown normalization method: {method}")
img = Image.open("data/images/sample.jpg").resize((224, 224))
img_array = np.array(img)
print(f"Original range: [{img_array.min()}, {img_array.max()}]")
img_01 = normalize_image(img_array, "zero_one")
print(f"[0,1] range: [{img_01.min():.3f}, {img_01.max():.3f}]")
img_imagenet = normalize_image(img_array, "imagenet")
print(f"ImageNet normalized range: [{img_imagenet.min():.3f}, {img_imagenet.max():.3f}]")Image Augmentation: Expanding Small Datasets
Data augmentation artificially expands a training dataset by applying random transformations that preserve the semantic label. A photo of a dog is still a dog when flipped horizontally.
from PIL import Image, ImageFilter, ImageEnhance
import numpy as np
import random
def augment_image(
img: Image.Image,
horizontal_flip_prob: float = 0.5,
brightness_range: tuple = (0.7, 1.3),
contrast_range: tuple = (0.8, 1.2),
rotation_range: tuple = (-15, 15),
noise_std: float = 0.02,
random_crop_frac: float = 0.9
) -> Image.Image:
"""
Apply random augmentations to a PIL image.
Each augmentation is applied with the given probability or
drawn from the specified range. Safe for training classifiers.
Parameters
----------
img : PIL.Image
Input image to augment.
horizontal_flip_prob : float
Probability of horizontal flip (0 = never, 1 = always).
brightness_range : tuple
(min, max) brightness multiplier.
contrast_range : tuple
(min, max) contrast multiplier.
rotation_range : tuple
(min_degrees, max_degrees) rotation.
noise_std : float
Standard deviation of Gaussian noise (0 = no noise).
random_crop_frac : float
Fraction of original size to crop to (then resize back).
Returns
-------
PIL.Image
Augmented image (same size as input).
"""
original_size = img.size
# Horizontal flip
if random.random() < horizontal_flip_prob:
img = img.transpose(Image.FLIP_LEFT_RIGHT)
# Brightness adjustment
brightness_factor = random.uniform(*brightness_range)
img = ImageEnhance.Brightness(img).enhance(brightness_factor)
# Contrast adjustment
contrast_factor = random.uniform(*contrast_range)
img = ImageEnhance.Contrast(img).enhance(contrast_factor)
# Random rotation
angle = random.uniform(*rotation_range)
img = img.rotate(angle, fillcolor=(128, 128, 128), expand=False)
# Random crop (then resize back to original)
w, h = img.size
crop_w = int(w * random_crop_frac)
crop_h = int(h * random_crop_frac)
left = random.randint(0, w - crop_w)
top = random.randint(0, h - crop_h)
img = img.crop((left, top, left + crop_w, top + crop_h))
img = img.resize(original_size, Image.LANCZOS)
# Gaussian noise
if noise_std > 0:
arr = np.array(img).astype(np.float32)
noise = np.random.normal(0, noise_std * 255, arr.shape)
arr = np.clip(arr + noise, 0, 255).astype(np.uint8)
img = Image.fromarray(arr)
return img
# Visualize augmentations
img = Image.open("data/images/sample.jpg").resize((224, 224))
fig, axes = plt.subplots(2, 5, figsize=(15, 7))
axes[0, 0].imshow(img)
axes[0, 0].set_title("Original")
axes[0, 0].axis("off")
for i in range(1, 10):
row, col = divmod(i, 5)
augmented = augment_image(img)
axes[row, col].imshow(augmented)
axes[row, col].set_title(f"Augmented #{i}")
axes[row, col].axis("off")
plt.suptitle("Data Augmentation Examples", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("output/augmentation_examples.png", dpi=100)
plt.show()Loading a Dataset of Images
For real projects, you need to load many images efficiently:
import os
import numpy as np
from PIL import Image
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
def load_image_dataset(
image_dir: str,
target_size: tuple = (224, 224),
max_images: int = None,
extensions: tuple = (".jpg", ".jpeg", ".png", ".bmp", ".webp")
) -> tuple:
"""
Load all images from a directory into a numpy array.
Assumes directory structure:
image_dir/
class_a/
img001.jpg
img002.jpg
class_b/
img003.jpg
Parameters
----------
image_dir : str
Root directory containing class subdirectories.
target_size : tuple
(width, height) to resize all images to.
max_images : int, optional
Maximum images to load per class (for balanced sampling).
extensions : tuple
Accepted file extensions.
Returns
-------
tuple
(images_array, labels_array, class_names, file_paths)
"""
image_dir = Path(image_dir)
class_dirs = sorted([d for d in image_dir.iterdir() if d.is_dir()])
class_names = [d.name for d in class_dirs]
print(f"Found {len(class_names)} classes: {class_names}")
images, labels, paths = [], [], []
for class_idx, class_dir in enumerate(class_dirs):
# Find all image files
class_files = [
f for f in class_dir.iterdir()
if f.suffix.lower() in extensions
]
if max_images:
class_files = class_files[:max_images]
print(f" {class_dir.name}: {len(class_files)} images")
for filepath in class_files:
try:
img = Image.open(filepath).convert("RGB")
img = img.resize(target_size, Image.LANCZOS)
images.append(np.array(img, dtype=np.uint8))
labels.append(class_idx)
paths.append(str(filepath))
except Exception as e:
print(f" Skipped {filepath.name}: {e}")
images_array = np.stack(images) if images else np.array([])
labels_array = np.array(labels)
print(f"\nDataset loaded: {len(images_array)} images")
print(f"Array shape: {images_array.shape}") # (N, H, W, 3)
print(f"Class distribution: {pd.Series(labels_array).value_counts().to_dict()}")
return images_array, labels_array, class_names, paths
# Example usage
# images, labels, class_names, paths = load_image_dataset(
# "data/images/product_categories/",
# target_size=(224, 224),
# max_images=500 # 500 per class
# )Classical Feature Extraction
Before deep learning, computer vision relied on hand-crafted features. These are still useful for small datasets, interpretability, and non-deep learning pipelines.
Color Histograms
import cv2
import numpy as np
from sklearn.preprocessing import normalize
def extract_color_histogram(
img_array: np.ndarray,
bins: int = 32,
color_space: str = "RGB"
) -> np.ndarray:
"""
Extract a normalized color histogram as a fixed-length feature vector.
For each color channel, compute a histogram of pixel values.
Concatenate all channel histograms into one vector.
Parameters
----------
img_array : np.ndarray
Image array of shape (H, W, 3), uint8.
bins : int
Number of histogram bins per channel.
color_space : str
'RGB', 'HSV', or 'LAB'.
Returns
-------
np.ndarray
Normalized feature vector of length (3 × bins).
"""
if color_space == "HSV":
img_cv = cv2.cvtColor(img_array, cv2.COLOR_RGB2HSV)
elif color_space == "LAB":
img_cv = cv2.cvtColor(img_array, cv2.COLOR_RGB2LAB)
else:
img_cv = img_array
features = []
for channel_idx in range(3):
hist = cv2.calcHist([img_cv], [channel_idx], None,
[bins], [0, 256])
features.append(hist.flatten())
feature_vector = np.concatenate(features)
# L2 normalize so magnitude doesn't dominate
return feature_vector / (feature_vector.sum() + 1e-10)
# Extract histograms from a batch of images
def batch_color_histograms(images: np.ndarray, bins: int = 32) -> np.ndarray:
features = np.array([
extract_color_histogram(img, bins=bins) for img in images
])
print(f"Color histogram features: {features.shape}") # (N, 3*bins)
return featuresEdge Detection Features
import cv2
import numpy as np
def extract_edge_features(img_array: np.ndarray) -> np.ndarray:
"""
Extract Canny edge detection features.
Returns edge density and directional statistics.
"""
# Convert to grayscale if color
if len(img_array.shape) == 3:
gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
else:
gray = img_array
# Apply Gaussian blur to reduce noise before edge detection
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
# Canny edge detector
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
# Feature: fraction of pixels that are edges
edge_density = edges.mean() / 255.0
# Gradient magnitudes and directions via Sobel operators
grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
magnitude = np.sqrt(grad_x**2 + grad_y**2)
return np.array([
edge_density,
magnitude.mean(),
magnitude.std(),
magnitude.max() / 255.0
])
def extract_texture_features(img_array: np.ndarray) -> np.ndarray:
"""Extract Local Binary Pattern (LBP) texture features."""
if len(img_array.shape) == 3:
gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
else:
gray = img_array
# GLCM-based texture: variance and entropy of pixel blocks
h, w = gray.shape
block_size = 16
variances = []
for i in range(0, h - block_size, block_size):
for j in range(0, w - block_size, block_size):
block = gray[i:i+block_size, j:j+block_size].astype(float)
variances.append(block.var())
return np.array([
np.mean(variances),
np.std(variances),
np.percentile(variances, 75),
np.percentile(variances, 25)
])Deep Learning Feature Extraction with Pre-Trained Models
The most powerful and practical approach for most data science image tasks is to extract features using a pre-trained CNN (trained on ImageNet’s 1.2M images) and then train a simple classifier on those features. This approach — called transfer learning — works exceptionally well even with small datasets (hundreds to a few thousand images).
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import numpy as np
import pandas as pd
from pathlib import Path
def build_feature_extractor(model_name: str = "resnet50") -> tuple:
"""
Build a pre-trained CNN feature extractor by removing the final classification layer.
Parameters
----------
model_name : str
Model architecture: 'resnet50', 'resnet18', 'efficientnet_b0', 'mobilenet_v3_small'.
Returns
-------
tuple
(model, transform, feature_dim)
"""
if model_name == "resnet50":
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Remove the final fully connected layer (classifier)
model = torch.nn.Sequential(*list(model.children())[:-1])
feat_dim = 2048
elif model_name == "resnet18":
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model = torch.nn.Sequential(*list(model.children())[:-1])
feat_dim = 512
elif model_name == "efficientnet_b0":
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier = torch.nn.Identity()
feat_dim = 1280
elif model_name == "mobilenet_v3_small":
model = models.mobilenet_v3_small(
weights=models.MobileNet_V3_Small_Weights.IMAGENET1K_V1
)
model.classifier = torch.nn.Identity()
feat_dim = 576
else:
raise ValueError(f"Unknown model: {model_name}")
model.eval() # Set to evaluation mode (disables dropout/batchnorm training)
# ImageNet preprocessing transform (must match what the model was trained on)
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(), # (H,W,C) uint8 → (C,H,W) float32 [0,1]
transforms.Normalize( # ImageNet normalization
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
print(f"Feature extractor: {model_name} | Feature dim: {feat_dim}")
return model, transform, feat_dim
def extract_features_batch(
image_paths: list,
model,
transform,
feat_dim: int,
batch_size: int = 32,
device: str = None
) -> np.ndarray:
"""
Extract deep features from a list of image file paths.
Parameters
----------
image_paths : list
Paths to image files.
model : torch.nn.Module
Pre-trained feature extractor (final layer removed).
transform : transforms.Compose
Preprocessing transform matching the model's training.
feat_dim : int
Output feature dimension.
batch_size : int
Images per batch for GPU efficiency.
device : str, optional
'cuda', 'mps', or 'cpu'. Auto-detected if None.
Returns
-------
np.ndarray
Feature matrix of shape (n_images, feat_dim).
"""
if device is None:
device = "cuda" if torch.cuda.is_available() else \
"mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)
all_features = []
n_batches = (len(image_paths) + batch_size - 1) // batch_size
for batch_idx in range(n_batches):
batch_paths = image_paths[batch_idx * batch_size :
(batch_idx + 1) * batch_size]
tensors = []
for path in batch_paths:
try:
img = Image.open(path).convert("RGB")
tensors.append(transform(img))
except Exception as e:
print(f" Warning: could not load {path}: {e}")
tensors.append(torch.zeros(3, 224, 224)) # Placeholder
batch_tensor = torch.stack(tensors).to(device)
with torch.no_grad(): # No gradients needed for inference
features = model(batch_tensor)
# Flatten to (batch_size, feat_dim)
features = features.squeeze().cpu().numpy()
if features.ndim == 1:
features = features.reshape(1, -1)
all_features.append(features)
if (batch_idx + 1) % 10 == 0 or batch_idx == n_batches - 1:
n_done = min((batch_idx + 1) * batch_size, len(image_paths))
print(f" Extracted features: {n_done:,}/{len(image_paths):,}")
return np.vstack(all_features)Training a Classifier on Extracted Features
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
def train_image_classifier(
features: np.ndarray,
labels: np.ndarray,
class_names: list,
classifier: str = "logistic",
cv_folds: int = 5
) -> dict:
"""
Train a classifier on pre-extracted deep features.
This is the transfer learning pattern: powerful pretrained
CNN extracts features, simple linear model classifies.
Parameters
----------
features : np.ndarray
Feature matrix (N, feat_dim) from extract_features_batch.
labels : np.ndarray
Class indices.
class_names : list
Human-readable class names.
classifier : str
'logistic', 'svm', or 'knn'.
cv_folds : int
Number of cross-validation folds.
Returns
-------
dict
Results including CV accuracy and trained model.
"""
print(f"Training {classifier} classifier on {len(features):,} images "
f"({len(class_names)} classes)...")
if classifier == "logistic":
clf = Pipeline([
("scaler", StandardScaler()),
("classifier", LogisticRegression(
C=1.0, max_iter=1000,
multi_class="auto",
random_state=42
))
])
elif classifier == "svm":
clf = Pipeline([
("scaler", StandardScaler()),
("classifier", SVC(
C=1.0, kernel="rbf",
probability=True,
random_state=42
))
])
cv_scores = cross_val_score(clf, features, labels, cv=cv_folds,
scoring="accuracy", n_jobs=-1)
clf.fit(features, labels) # Fit on all data after CV evaluation
print(f"\nCross-validation accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
print(f"CV scores by fold: {np.round(cv_scores, 3)}")
return {
"model": clf,
"cv_scores": cv_scores,
"cv_mean": cv_scores.mean(),
"cv_std": cv_scores.std()
}
# Complete pipeline for a new image classification task
def run_transfer_learning_pipeline(
image_dir: str,
model_name: str = "resnet50",
target_size: int = 224,
batch_size: int = 32,
max_images_per_class: int = None
) -> dict:
"""
Complete image classification pipeline using transfer learning.
1. Load images from directory (class folders)
2. Extract deep features with pre-trained CNN
3. Train logistic regression classifier on features
4. Evaluate with cross-validation
Returns trained model and evaluation metrics.
"""
# Step 1: Gather image paths and labels
image_dir_path = Path(image_dir)
class_dirs = sorted([d for d in image_dir_path.iterdir() if d.is_dir()])
class_names = [d.name for d in class_dirs]
all_paths, all_labels = [], []
for class_idx, class_dir in enumerate(class_dirs):
paths = list(class_dir.glob("*.jpg")) + list(class_dir.glob("*.png"))
if max_images_per_class:
paths = paths[:max_images_per_class]
all_paths.extend(str(p) for p in paths)
all_labels.extend([class_idx] * len(paths))
labels = np.array(all_labels)
print(f"Dataset: {len(all_paths)} images, {len(class_names)} classes")
# Step 2: Build feature extractor
model, transform, feat_dim = build_feature_extractor(model_name)
# Step 3: Extract features
print("\nExtracting CNN features...")
features = extract_features_batch(all_paths, model, transform,
feat_dim, batch_size=batch_size)
# Step 4: Train classifier
print("\nTraining classifier...")
results = train_image_classifier(features, labels, class_names)
print(f"\nPipeline complete!")
print(f" Feature dim: {feat_dim}")
print(f" CV accuracy: {results['cv_mean']:.3f}")
return {
"features": features,
"labels": labels,
"class_names": class_names,
"model": results["model"],
"cv_accuracy": results["cv_mean"]
}Image Analysis Without Classification
Not all image tasks are about classification. Common analytical tasks include:
Image Similarity Search
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def find_similar_images(
query_feature: np.ndarray,
database_features: np.ndarray,
image_paths: list,
top_k: int = 5
) -> list:
"""
Find the most similar images to a query image using feature cosine similarity.
Parameters
----------
query_feature : np.ndarray
Feature vector for the query image (1D or 1×D).
database_features : np.ndarray
Feature matrix for all database images (N×D).
image_paths : list
File paths corresponding to database_features rows.
top_k : int
Number of most similar images to return.
Returns
-------
list of (similarity_score, image_path) tuples
"""
if query_feature.ndim == 1:
query_feature = query_feature.reshape(1, -1)
similarities = cosine_similarity(query_feature, database_features)[0]
top_indices = np.argsort(similarities)[::-1][:top_k]
return [(float(similarities[i]), image_paths[i]) for i in top_indices]
# Usage: find products visually similar to a query product image
# query_feat = features[query_idx] # Feature vector from extract_features_batch
# similar = find_similar_images(query_feat, features, all_paths, top_k=10)
# print("Most similar images:")
# for score, path in similar:
# print(f" [{score:.3f}] {path}")Anomaly Detection in Images
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
def detect_image_anomalies(
features: np.ndarray,
image_paths: list,
method: str = "isolation_forest",
contamination: float = 0.05
) -> pd.DataFrame:
"""
Detect anomalous images (defective products, corrupted files, outliers)
using outlier detection on extracted CNN features.
Parameters
----------
features : np.ndarray
CNN feature matrix (N, feat_dim).
image_paths : list
File paths corresponding to rows.
method : str
'isolation_forest' or 'elliptic_envelope'.
contamination : float
Expected fraction of anomalies (0.05 = 5%).
Returns
-------
pd.DataFrame
Results with anomaly scores and labels.
"""
if method == "isolation_forest":
detector = IsolationForest(
contamination=contamination,
random_state=42,
n_jobs=-1
)
else:
detector = EllipticEnvelope(contamination=contamination)
anomaly_labels = detector.fit_predict(features) # -1=anomaly, 1=normal
anomaly_scores = detector.score_samples(features) # Higher = more normal
results = pd.DataFrame({
"path": image_paths,
"anomaly_score": anomaly_scores,
"is_anomaly": anomaly_labels == -1
}).sort_values("anomaly_score")
n_anomalies = results["is_anomaly"].sum()
print(f"Detected {n_anomalies} anomalous images "
f"({n_anomalies/len(results)*100:.1f}%)")
print("\nTop 5 most anomalous:")
print(results.head(5)[["path", "anomaly_score"]].to_string(index=False))
return resultsDealing with Image Datasets at Scale
import pandas as pd
import numpy as np
from pathlib import Path
from PIL import Image
import hashlib
import json
def audit_image_dataset(image_dir: str) -> pd.DataFrame:
"""
Audit an image dataset for quality issues:
- Missing / corrupted files
- Duplicates (by file hash)
- Extreme sizes (too small, too large)
- Wrong aspect ratios
- Corrupt files (can't be opened)
"""
image_dir = Path(image_dir)
all_files = list(image_dir.rglob("*.jpg")) + \
list(image_dir.rglob("*.jpeg")) + \
list(image_dir.rglob("*.png"))
print(f"Found {len(all_files):,} image files")
records = []
for filepath in all_files:
record = {
"path": str(filepath),
"filename": filepath.name,
"class": filepath.parent.name,
"size_bytes":filepath.stat().st_size,
"width": None,
"height": None,
"mode": None,
"file_hash": None,
"is_corrupt": False,
"issue": None
}
try:
img = Image.open(filepath)
record["width"] = img.width
record["height"] = img.height
record["mode"] = img.mode
# Detect issues
if img.width < 32 or img.height < 32:
record["issue"] = "too_small"
elif img.width > 4096 or img.height > 4096:
record["issue"] = "very_large"
elif record["size_bytes"] < 1000:
record["issue"] = "tiny_file"
# Hash for duplicate detection (hash of raw bytes)
record["file_hash"] = hashlib.md5(
filepath.read_bytes()
).hexdigest()
except Exception as e:
record["is_corrupt"] = True
record["issue"] = f"corrupt: {type(e).__name__}"
records.append(record)
df = pd.DataFrame(records)
# Flag duplicates
hash_counts = df["file_hash"].value_counts()
df["is_duplicate"] = df["file_hash"].map(hash_counts) > 1
# Summary
print(f"\n{'='*50}")
print(f"Image Dataset Audit: {image_dir}")
print(f"{'='*50}")
print(f" Total files: {len(df):,}")
print(f" Corrupt files: {df['is_corrupt'].sum():,}")
print(f" Duplicates: {df['is_duplicate'].sum():,}")
print(f" With issues: {df['issue'].notna().sum():,}")
if df["width"].notna().any():
print(f" Width range: [{df['width'].min():.0f}, {df['width'].max():.0f}]")
print(f" Height range: [{df['height'].min():.0f}, {df['height'].max():.0f}]")
issue_counts = df["issue"].value_counts()
if len(issue_counts) > 0:
print(f"\n Issues found:")
for issue, count in issue_counts.items():
print(f" {issue}: {count}")
return dfSummary
Image data in Python is fundamentally NumPy arrays — understanding the shape conventions (H, W) for grayscale and (H, W, 3) for RGB, the dtype conventions (uint8 for raw images, float32 for normalized inputs), and the BGR vs. RGB difference between OpenCV and PIL are the foundational concepts that prevent the majority of bugs.
The practical hierarchy for image data science tasks: use Pillow for loading, saving, and basic manipulation; OpenCV for more sophisticated image processing operations; and PyTorch with pre-trained models for feature extraction and classification. The transfer learning pattern — extract features with a pre-trained ResNet or EfficientNet, train a simple logistic regression classifier on those features — is the most important technique to internalize: it works with small datasets, trains in minutes, and achieves strong performance across a wide range of visual recognition tasks without requiring a GPU for training (only for feature extraction).
For applications beyond classification — image similarity search, anomaly detection in product photos, or dataset auditing — the same CNN feature extraction approach applies, combined with standard sklearn tools for the downstream task.
Key Takeaways
- Images are NumPy arrays: grayscale has shape
(H, W), color RGB has shape(H, W, 3), batches have shape(N, H, W, 3)— all values are uint8 (0–255) for raw images and float32 for neural network inputs - Pillow uses RGB; OpenCV uses BGR — always call
cv2.cvtColor(img, cv2.COLOR_BGR2RGB)before displaying or passing an OpenCV image to Pillow or matplotlib - Pillow uses (width, height); NumPy uses (height, width) — these flipped axis conventions cause silent bugs; always verify with
.shapeand.sizewhen converting - Normalize images before passing to neural networks: divide by 255 for
[0,1]range, then subtract ImageNet mean[0.485, 0.456, 0.406]and divide by std[0.229, 0.224, 0.225]for pre-trained models - Transfer learning (pre-trained CNN + linear classifier) is the most practical approach for image classification: build a feature extractor by removing the final layer from a pre-trained ResNet or EfficientNet, extract 512–2048 dimensional features from your images, then train a logistic regression or SVM on those features
- Data augmentation (random flips, rotations, brightness/contrast jitter, random crops) is essential for preventing overfitting when training on small image datasets — always apply augmentation only to the training set, never to validation or test
torch.no_grad()context manager is required during feature extraction and inference to disable gradient computation and reduce memory usage by 2–3×- For datasets of thousands of images, always audit before training: check for corrupted files, duplicates (by file hash), extreme sizes, and class imbalance — silent data quality problems in image datasets are as common as in tabular data








