SQL (Structured Query Language) is the standard language for querying and manipulating data in relational databases, and it is one of the most essential skills for any data scientist. With SQL, you retrieve exactly the data you need from databases containing millions of rows, filter records by any condition, group and aggregate data to compute summaries, join multiple tables together, and sort results — all with concise, readable code that runs directly in the database, where the data already lives.
Introduction
If Python is the hammer of data science, SQL is the foundation you build on. Almost every real-world data science project begins the same way: the data you need lives in a database, and to get it out you need SQL. The customer purchase histories, the server log events, the patient records, the financial transactions — they’re all waiting in tables, and SQL is the key.
SQL has been around since the 1970s and has barely changed in its fundamentals. What was true SQL syntax in 1985 still works in PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, and Redshift today. It’s one of the most stable and transferable skills in all of technology — learn it once and it works everywhere.
The Survey of Data Scientists consistently ranks SQL among the top two or three most-used tools, alongside Python. Many job descriptions for data scientists list SQL before any ML library. Teams that review resumes for data science roles report that weak SQL skills are a common and immediate disqualifier. And yet SQL is frequently undertaught in data science courses that focus heavily on machine learning, leaving practitioners who can build a neural network unable to write a decent JOIN.
This article fixes that. We’ll cover every foundational SQL concept you need — from writing your very first SELECT statement through filtering, aggregating, grouping, joining tables, and handling missing data — with clear explanations and practical examples using a realistic e-commerce dataset. By the end, you’ll be writing real queries against real databases with confidence.
Setting Up: Working with SQLite in Python
Before writing your first query, you need a database to query against. SQLite is the perfect learning environment — it’s built into Python, requires no server or installation, and stores the entire database in a single file.
Creating a Practice Database
Let’s create the e-commerce database we’ll use throughout this article:
import sqlite3
import pandas as pd
# Connect to a new SQLite data.base (creates the file if it doesn't exist)
conn = sqlite3.connect("your db")
cursor = conn.cursor()
# Create the customers table
cursor.execute("""
CREATE TABLE IF NOT EXISTS customers (
customer_id TEXT PRIMARY KEY,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
city TEXT,
country TEXT,
signup_date TEXT,
is_premium INTEGER DEFAULT 0
)
""")
# Create the products table
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
product_id TEXT PRIMARY KEY,
product_name TEXT NOT NULL,
category TEXT NOT NULL,
price REAL NOT NULL,
stock_qty INTEGER DEFAULT 0
)
""")
# Create the orders table
cursor.execute("""
CREATE TABLE IF NOT EXISTS orders (
order_id TEXT PRIMARY KEY,
customer_id TEXT NOT NULL,
order_date TEXT NOT NULL,
total_amount REAL NOT NULL,
status TEXT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
)
""")
# Create the order_items table
cursor.execute("""
CREATE TABLE IF NOT EXISTS order_items (
item_id INTEGER PRIMARY KEY AUTOINCREMENT,
order_id TEXT NOT NULL,
product_id TEXT NOT NULL,
quantity INTEGER NOT NULL,
unit_price REAL NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
)
""")
# Insert sample data
customers_data = [
('CUST_001', 'Jane', 'Smith', 'jane.smith@email.com', 'Austin', 'USA', '2022-03-14', 1),
('CUST_002', 'Bob', 'Johnson', 'bob.j@email.com', 'Seattle', 'USA', '2021-11-28', 0),
('CUST_003', 'Alice', 'Williams','alice.w@email.com', 'Chicago', 'USA', '2023-01-05', 1),
('CUST_004', 'Carlos', 'Garcia', 'c.garcia@email.com', 'Miami', 'USA', '2022-07-19', 0),
('CUST_005', 'Sophie', 'Brown', 's.brown@email.com', 'London', 'UK', '2023-05-30', 1),
('CUST_006', 'Wei', 'Zhang', 'w.zhang@email.com', 'Toronto', 'CA', '2021-09-01', 0),
('CUST_007', 'Priya', 'Patel', 'p.patel@email.com', 'Austin', 'USA', '2024-01-15', 0),
('CUST_008', 'Marcus', 'Lee', 'm.lee@email.com', 'Seattle', 'USA', '2022-12-03', 1),
]
products_data = [
('PROD_001', 'Wireless Headphones', 'Electronics', 149.99, 85),
('PROD_002', 'USB-C Hub', 'Electronics', 49.99, 200),
('PROD_003', 'Running Shoes', 'Apparel', 89.99, 150),
('PROD_004', 'Coffee Maker', 'Kitchen', 79.99, 62),
('PROD_005', 'Python Programming Book','Books', 34.99, 300),
('PROD_006', 'Yoga Mat', 'Sports', 29.99, 175),
('PROD_007', 'Laptop Stand', 'Electronics', 39.99, 120),
('PROD_008', 'Desk Lamp', 'Home', 24.99, 90),
]
orders_data = [
('ORD_1001', 'CUST_001', '2024-07-05', 199.98, 'delivered'),
('ORD_1002', 'CUST_002', '2024-07-12', 49.99, 'delivered'),
('ORD_1003', 'CUST_001', '2024-08-01', 149.99, 'delivered'),
('ORD_1004', 'CUST_003', '2024-08-15', 119.98, 'shipped'),
('ORD_1005', 'CUST_004', '2024-08-20', 79.99, 'delivered'),
('ORD_1006', 'CUST_002', '2024-09-01', 89.99, 'shipped'),
('ORD_1007', 'CUST_005', '2024-09-05', 184.98, 'processing'),
('ORD_1008', 'CUST_001', '2024-09-10', 34.99, 'processing'),
('ORD_1009', 'CUST_006', '2024-09-12', 69.98, 'shipped'),
('ORD_1010', 'CUST_003', '2024-09-15', 149.99, 'processing'),
]
order_items_data = [
('ORD_1001', 'PROD_001', 1, 149.99),
('ORD_1001', 'PROD_007', 1, 49.99), # Note: 149.99 + 49.99 = 199.98
('ORD_1002', 'PROD_002', 1, 49.99),
('ORD_1003', 'PROD_001', 1, 149.99),
('ORD_1004', 'PROD_003', 1, 89.99),
('ORD_1004', 'PROD_006', 1, 29.99),
('ORD_1005', 'PROD_004', 1, 79.99),
('ORD_1006', 'PROD_003', 1, 89.99),
('ORD_1007', 'PROD_001', 1, 149.99),
('ORD_1007', 'PROD_006', 1, 29.99), # Note: adjusted to match total
('ORD_1008', 'PROD_005', 1, 34.99),
('ORD_1009', 'PROD_006', 2, 29.99),
('ORD_1010', 'PROD_001', 1, 149.99),
]
cursor.executemany("INSERT OR IGNORE INTO customers VALUES (?,?,?,?,?,?,?,?)", customers_data)
cursor.executemany("INSERT OR IGNORE INTO products VALUES (?,?,?,?,?)", products_data)
cursor.executemany("INSERT OR IGNORE INTO orders VALUES (?,?,?,?,?)", orders_data)
cursor.executemany("INSERT OR IGNORE INTO order_items (order_id,product_id,quantity,unit_price) VALUES (?,?,?,?)", order_items_data)
conn.commit()
print("Database created and populated successfully.")The Helper Function
We’ll use this helper function to run queries and display results cleanly throughout the article:
def query(sql, conn=conn):
"""Execute a SQL query and return results as a pandas DataFrame."""
return pd.read_sql_query(sql, conn)The SELECT Statement: Retrieving Data
The SELECT statement is the foundation of SQL — it retrieves data from one or more tables. Every query you write as a data scientist starts with SELECT.
Basic Syntax
SELECT column1, column2, column3
FROM table_name;The semicolon (;) terminates the statement. SQL keywords are conventionally written in uppercase (SELECT, FROM, WHERE), though most databases accept lowercase. Column and table names are case-insensitive in most databases.
Selecting Specific Columns
query("""
SELECT first_name, last_name, city, country
FROM customers
""")| first_name | last_name | city | country |
|---|---|---|---|
| Jane | Smith | Austin | USA |
| Bob | Johnson | Seattle | USA |
| Alice | Williams | Chicago | USA |
| Carlos | Garcia | Miami | USA |
| Sophie | Brown | London | UK |
| Wei | Zhang | Toronto | CA |
| Priya | Patel | Austin | USA |
| Marcus | Lee | Seattle | USA |
Selecting All Columns
The asterisk * selects all columns:
query("SELECT * FROM products")Important note: In production queries and data science work, avoid SELECT * on large tables — always specify only the columns you actually need. This reduces data transfer, improves query speed, and makes your code more maintainable.
Column Aliases: Renaming Output Columns
The AS keyword renames a column in the output (the original table column is unchanged):
query("""
SELECT
product_name AS name,
category AS product_category,
price AS price_usd,
stock_qty AS units_in_stock
FROM products
""")Aliases are especially useful when column names are long, ambiguous (after a join), or when computed columns need a descriptive name.
Computed Columns
SQL can compute new values from existing columns directly in the SELECT:
query("""
SELECT
product_name,
price,
price * 1.10 AS price_with_tax,
price * 0.85 AS price_discounted_15pct,
stock_qty * price AS total_inventory_value
FROM products
ORDER BY total_inventory_value DESC
""")| product_name | price | price_with_tax | price_discounted_15pct | total_inventory_value |
|---|---|---|---|---|
| Wireless Headphones | 149.99 | 164.989 | 127.4915 | 12749.15 |
| Running Shoes | 89.99 | 98.989 | 76.4915 | 13498.50 |
| Python Programming Book | 34.99 | 38.489 | 29.7415 | 10497.00 |
| USB-C Hub | 49.99 | 54.989 | 42.4915 | 9998.00 |
The WHERE Clause: Filtering Rows
WHERE filters rows, keeping only those that meet your conditions. It comes after FROM and before GROUP BY or ORDER BY.
Comparison Operators
= Equal to
!= <> Not equal to
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to# Products priced above $50
query("""
SELECT product_name, category, price
FROM products
WHERE price > 50
""")# Only electronics
query("""
SELECT product_name, price, stock_qty
FROM products
WHERE category = 'Electronics'
""")Combining Conditions: AND, OR, NOT
# Electronics under $100
query("""
SELECT product_name, price
FROM products
WHERE category = 'Electronics'
AND price < 100
""")# Electronics OR products under $30
query("""
SELECT product_name, category, price
FROM products
WHERE category = 'Electronics'
OR price < 30
""")# Everything except Electronics
query("""
SELECT product_name, category, price
FROM products
WHERE NOT category = 'Electronics'
-- Equivalent: WHERE category != 'Electronics'
""")Operator precedence: AND is evaluated before OR. Use parentheses to make complex conditions explicit and correct:
# Products that are: (Electronics AND expensive) OR (Books)
query("""
SELECT product_name, category, price
FROM products
WHERE (category = 'Electronics' AND price > 100)
OR category = 'Books'
""")BETWEEN: Range Filtering
# Products priced between $30 and $80 (inclusive on both ends)
query("""
SELECT product_name, price
FROM products
WHERE price BETWEEN 30 AND 80
ORDER BY price
""")IN: Matching Multiple Values
# Multiple categories at once (cleaner than multiple OR conditions)
query("""
SELECT product_name, category, price
FROM products
WHERE category IN ('Electronics', 'Sports', 'Books')
ORDER BY category, price
""")
# NOT IN: exclude specific categories
query("""
SELECT product_name, category
FROM products
WHERE category NOT IN ('Electronics', 'Kitchen')
""")LIKE: Pattern Matching
LIKE matches strings against patterns. Two wildcards:
%— matches any sequence of zero or more characters_— matches exactly one character
# Customer emails ending in .com
query("""
SELECT first_name, email
FROM customers
WHERE email LIKE '%@email.com'
""")
# Products with 'USB' anywhere in the name
query("""
SELECT product_name, price
FROM products
WHERE product_name LIKE '%USB%'
""")
# Products starting with a specific letter
query("""
SELECT product_name
FROM products
WHERE product_name LIKE 'R%'
""")IS NULL and IS NOT NULL
NULL represents missing or unknown values. You cannot compare to NULL with = — you must use IS NULL or IS NOT NULL:
# Customers with no city on record
query("""
SELECT first_name, last_name, city
FROM customers
WHERE city IS NULL
""")
# Customers who do have a city
query("""
SELECT first_name, last_name, city
FROM customers
WHERE city IS NOT NULL
""")
# WRONG: This never returns results because NULL != NULL in SQL
-- WHERE city = NULL ← This is incorrect SQL logicORDER BY: Sorting Results
ORDER BY sorts query results. It comes last in the query (before LIMIT).
# Products sorted by price, cheapest first (default: ascending)
query("""
SELECT product_name, price
FROM products
ORDER BY price ASC
""")
# Products sorted by price, most expensive first
query("""
SELECT product_name, price
FROM products
ORDER BY price DESC
""")
# Sort by multiple columns: category alphabetically, then by price descending within each category
query("""
SELECT product_name, category, price
FROM products
ORDER BY category ASC, price DESC
""")| product_name | category | price |
|---|---|---|
| Python Programming Book | Books | 34.99 |
| Wireless Headphones | Electronics | 149.99 |
| USB-C Hub | Electronics | 49.99 |
| Laptop Stand | Electronics | 39.99 |
| Desk Lamp | Home | 24.99 |
| Coffee Maker | Kitchen | 79.99 |
| Running Shoes | Apparel | 89.99 |
| Yoga Mat | Sports | 29.99 |
LIMIT: Controlling Result Size
LIMIT (or TOP in SQL Server) restricts how many rows are returned. Essential for exploring large tables without pulling millions of rows:
# First 5 customers by signup date
query("""
SELECT first_name, last_name, signup_date
FROM customers
ORDER BY signup_date ASC
LIMIT 5
""")
# The 3 most expensive products
query("""
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 3
""")LIMIT with OFFSET for pagination — useful when building data pipelines that process data in chunks:
# Skip first 3 results, return next 3 (page 2 of 3-per-page results)
query("""
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 3 OFFSET 3
""")Aggregate Functions: Computing Summaries
Aggregate functions compute a single value from a set of rows — essential for summarizing data.
The Core Aggregate Functions
| Function | Description | Example |
|---|---|---|
COUNT(*) | Count all rows | Total number of orders |
COUNT(column) | Count non-NULL values in column | Number of orders with a known city |
COUNT(DISTINCT column) | Count unique values | Number of distinct customers who ordered |
SUM(column) | Sum all values | Total revenue |
AVG(column) | Average (mean) of values | Average order value |
MIN(column) | Smallest value | Cheapest product price |
MAX(column) | Largest value | Most expensive product |
# Summary statistics for all orders
query("""
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value,
MIN(total_amount) AS min_order_value,
MAX(total_amount) AS max_order_value
FROM orders
""")| total_orders | unique_customers | total_revenue | avg_order_value | min_order_value | max_order_value |
|---|---|---|---|---|---|
| 10 | 6 | 1029.84 | 102.984 | 34.99 | 199.98 |
# Product catalog statistics
query("""
SELECT
COUNT(*) AS total_products,
COUNT(DISTINCT category) AS num_categories,
AVG(price) AS avg_price,
MIN(price) AS cheapest,
MAX(price) AS most_expensive,
SUM(stock_qty) AS total_units_in_stock
FROM products
""")DISTINCT: Counting Unique Values
# How many unique cities are our customers from?
query("""
SELECT COUNT(DISTINCT city) AS unique_cities
FROM customers
""")
# What are the unique cities?
query("""
SELECT DISTINCT city, country
FROM customers
ORDER BY country, city
""")GROUP BY: Aggregating by Category
GROUP BY divides rows into groups based on one or more columns, and aggregate functions are applied to each group separately. This is where SQL becomes genuinely powerful for data analysis.
Basic GROUP BY
# Total revenue and order count by order status
query("""
SELECT
status,
COUNT(*) AS num_orders,
SUM(total_amount) AS total_revenue,
AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY status
ORDER BY total_revenue DESC
""")| status | num_orders | total_revenue | avg_order_value |
|---|---|---|---|
| delivered | 4 | 469.94 | 117.49 |
| processing | 3 | 334.96 | 111.65 |
| shipped | 3 | 224.94 | 74.98 |
# Revenue by product category
query("""
SELECT
p.category,
COUNT(DISTINCT oi.order_id) AS num_orders,
SUM(oi.quantity) AS total_units_sold,
SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category
ORDER BY total_revenue DESC
""")GROUP BY with Multiple Columns
# Orders per customer per month
query("""
SELECT
customer_id,
SUBSTR(order_date, 1, 7) AS year_month,
COUNT(*) AS orders_that_month,
SUM(total_amount) AS monthly_spend
FROM orders
GROUP BY customer_id, SUBSTR(order_date, 1, 7)
ORDER BY customer_id, year_month
""")The HAVING Clause: Filtering Groups
WHERE filters individual rows before grouping. HAVING filters groups after aggregation. This is a critical distinction:
# Customers with more than 1 order
query("""
SELECT
customer_id,
COUNT(*) AS num_orders,
SUM(total_amount) AS total_spend
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1 -- Filter GROUPS (not individual rows)
ORDER BY total_spend DESC
""")| customer_id | num_orders | total_spend |
|---|---|---|
| CUST_001 | 3 | 384.96 |
| CUST_002 | 2 | 139.98 |
| CUST_003 | 2 | 269.97 |
# Product categories with average price above $50
query("""
SELECT
category,
COUNT(*) AS num_products,
AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING AVG(price) > 50
ORDER BY avg_price DESC
""")The WHERE vs. HAVING rule: Use WHERE to filter rows before grouping; use HAVING to filter groups after aggregation.
# Combining WHERE and HAVING
# Customers who placed delivered orders and spent more than $100 total
query("""
SELECT
customer_id,
COUNT(*) AS num_delivered_orders,
SUM(total_amount) AS total_delivered_spend
FROM orders
WHERE status = 'delivered' -- Filter BEFORE grouping
GROUP BY customer_id
HAVING SUM(total_amount) > 100 -- Filter AFTER grouping
ORDER BY total_delivered_spend DESC
""")JOIN: Combining Data from Multiple Tables
JOINs are the most powerful and most important SQL concept for data scientists. They combine rows from two or more tables based on a related column — allowing you to assemble complete, denormalized datasets from normalized tables.
INNER JOIN: The Most Common Join
An INNER JOIN returns only rows where there is a matching value in both tables. Non-matching rows from either table are excluded.
# Join customers with their orders (only customers who have orders)
query("""
SELECT
c.first_name,
c.last_name,
c.city,
o.order_id,
o.order_date,
o.total_amount,
o.status
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date
""")| first_name | last_name | city | order_id | order_date | total_amount | status |
|---|---|---|---|---|---|---|
| Bob | Johnson | Seattle | ORD_1002 | 2024-07-12 | 49.99 | delivered |
| Bob | Johnson | Seattle | ORD_1006 | 2024-09-01 | 89.99 | shipped |
| Jane | Smith | Austin | ORD_1001 | 2024-07-05 | 199.98 | delivered |
| Jane | Smith | Austin | ORD_1003 | 2024-08-01 | 149.99 | delivered |
Notice: CUST_007 (Priya Patel) and CUST_008 (Marcus Lee) have no orders, so they don’t appear in this result.
The table alias convention: FROM customers c assigns the alias c to the customers table. c.customer_id then refers to the customer_id column from the customers table. Aliases are essential when joining tables that share column names.
LEFT JOIN: Include All Records from the Left Table
A LEFT JOIN returns all rows from the left (first) table, plus matching rows from the right table. Where there’s no match, right table columns appear as NULL.
# ALL customers, with their order info if they have any
query("""
SELECT
c.first_name,
c.last_name,
c.city,
COUNT(o.order_id) AS num_orders,
SUM(o.total_amount) AS total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.city
ORDER BY total_spend DESC
""")| first_name | last_name | city | num_orders | total_spend |
|---|---|---|---|---|
| Jane | Smith | Austin | 3 | 384.96 |
| Alice | Williams | Chicago | 2 | 269.97 |
| Sophie | Brown | London | 1 | 184.98 |
| Bob | Johnson | Seattle | 2 | 139.98 |
| Carlos | Garcia | Miami | 1 | 79.99 |
| Wei | Zhang | Toronto | 1 | 69.98 |
| Priya | Patel | Austin | 0 | NULL |
| Marcus | Lee | Seattle | 0 | NULL |
Priya and Marcus appear with NULL spend — they’re customers who haven’t ordered yet. LEFT JOIN is essential for identifying these “zero-activity” records.
Finding Records Without a Match (Anti-Join)
A common and useful pattern: find rows in the left table that have NO match in the right table:
# Customers who have NEVER placed an order
query("""
SELECT
c.first_name,
c.last_name,
c.signup_date,
c.city
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL -- Right table is NULL = no match found
ORDER BY c.signup_date
""")| first_name | last_name | signup_date | city |
|---|---|---|---|
| Marcus | Lee | 2022-12-03 | Seattle |
| Priya | Patel | 2024-01-15 | Austin |
Chaining Multiple JOINs
Real data science queries often join three or more tables:
# Complete order detail: customer name + order info + product info
query("""
SELECT
c.first_name || ' ' || c.last_name AS customer_name,
o.order_date,
p.product_name,
p.category,
oi.quantity,
oi.unit_price,
oi.quantity * oi.unit_price AS line_total
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id
ORDER BY o.order_date, customer_name
""")| customer_name | order_date | product_name | category | quantity | unit_price | line_total |
|---|---|---|---|---|---|---|
| Jane Smith | 2024-07-05 | Wireless Headphones | Electronics | 1 | 149.99 | 149.99 |
| Jane Smith | 2024-07-05 | Laptop Stand | Electronics | 1 | 39.99 | 39.99 |
| Bob Johnson | 2024-07-12 | USB-C Hub | Electronics | 1 | 49.99 | 49.99 |
RIGHT JOIN and FULL OUTER JOIN
- RIGHT JOIN: Like LEFT JOIN but keeps all rows from the right table, NULLing left table columns where there’s no match. Less commonly used than LEFT JOIN.
- FULL OUTER JOIN: Returns all rows from both tables, with NULLs where there’s no match on either side. Useful for comparing two datasets. (Not supported in SQLite, but works in PostgreSQL, MySQL 8+, SQL Server.)
The SQL Query Execution Order
A critical insight that confuses many beginners: the order you write SQL clauses is different from the order the database executes them.
Written order:
SELECT → FROM → JOIN → WHERE → GROUP BY → HAVING → ORDER BY → LIMITExecution order:
1. FROM + JOIN → Identify all rows from relevant tables
2. WHERE → Filter individual rows
3. GROUP BY → Group rows
4. HAVING → Filter groups
5. SELECT → Compute column expressions and aliases
6. ORDER BY → Sort results
7. LIMIT → Restrict number of rows returnedWhy this matters in practice:
You cannot use a SELECT alias in a WHERE clause, because WHERE executes before SELECT:
-- This FAILS: 'price_with_tax' doesn't exist yet when WHERE executes
SELECT price * 1.10 AS price_with_tax
FROM products
WHERE price_with_tax > 100 -- Error! Alias doesn't exist at WHERE time
-- This WORKS: repeat the expression in WHERE
SELECT price * 1.10 AS price_with_tax
FROM products
WHERE price * 1.10 > 100You CAN use a SELECT alias in ORDER BY, because ORDER BY executes after SELECT:
-- This WORKS fine
SELECT price * 1.10 AS price_with_tax
FROM products
ORDER BY price_with_tax DESC -- Alias works hereString Functions: Working with Text Data
SQL provides built-in functions for text manipulation — essential for cleaning and transforming data:
# UPPER, LOWER, LENGTH
query("""
SELECT
first_name,
UPPER(first_name) AS name_upper,
LOWER(email) AS email_lower,
LENGTH(email) AS email_length
FROM customers
LIMIT 4
""")
# SUBSTR: extract part of a string
# Extract year from date stored as text
query("""
SELECT
order_id,
order_date,
SUBSTR(order_date, 1, 4) AS order_year,
SUBSTR(order_date, 6, 2) AS order_month
FROM orders
LIMIT 5
""")
# String concatenation
query("""
SELECT
first_name || ' ' || last_name AS full_name,
city || ', ' || country AS location
FROM customers
""")
# TRIM: remove leading/trailing whitespace
query("""
SELECT TRIM(' hello world ') AS trimmed
""")
# REPLACE: substitute text
query("""
SELECT
product_name,
REPLACE(product_name, ' ', '_') AS slug_name
FROM products
""")Handling NULL Values in Queries
NULL values require special handling in SQL — they propagate through expressions and comparisons in potentially surprising ways:
# COALESCE: return the first non-NULL value
# Replace NULL city with 'Unknown'
query("""
SELECT
first_name,
COALESCE(city, 'Unknown') AS city_or_unknown
FROM customers
""")
# IFNULL (SQLite-specific, equivalent to COALESCE with 2 args)
query("""
SELECT
c.first_name,
IFNULL(SUM(o.total_amount), 0) AS total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name
""")
# NULL in arithmetic: any operation with NULL returns NULL
-- SELECT 5 + NULL → NULL
-- SELECT NULL * 100 → NULL
-- This is why COALESCE is important for computed columns
# NULL in aggregate functions: NULLs are ignored
-- COUNT(*) counts all rows including NULLs
-- COUNT(city) counts only non-NULL city values
-- AVG(amount) ignores NULL amounts (not treated as 0)Subqueries: Queries Within Queries
A subquery is a SELECT statement nested inside another query. They enable multi-step analytical logic:
# Find customers who have spent more than the average customer
query("""
SELECT
c.first_name,
c.last_name,
SUM(o.total_amount) AS total_spend
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING SUM(o.total_amount) > (
SELECT AVG(customer_spend)
FROM (
SELECT customer_id, SUM(total_amount) AS customer_spend
FROM orders
GROUP BY customer_id
) customer_totals
)
ORDER BY total_spend DESC
""")
# Products that have been ordered at least once
query("""
SELECT product_name, price
FROM products
WHERE product_id IN (
SELECT DISTINCT product_id
FROM order_items
)
""")
# Products that have NEVER been ordered (using NOT IN)
query("""
SELECT product_name, category, price
FROM products
WHERE product_id NOT IN (
SELECT DISTINCT product_id
FROM order_items
)
""")Common Table Expressions (CTEs): Readable Multi-Step Queries
CTEs (introduced with the WITH keyword) define named temporary result sets that can be referenced later in the query. They make complex queries readable by breaking them into logical steps — think of them as variable assignments for SQL:
# Customer spending analysis using CTEs
query("""
WITH customer_orders AS (
-- Step 1: Calculate spending per customer
SELECT
customer_id,
COUNT(order_id) AS num_orders,
SUM(total_amount) AS total_spend,
MAX(order_date) AS last_order_date
FROM orders
GROUP BY customer_id
),
customer_segments AS (
-- Step 2: Segment customers by spending level
SELECT
customer_id,
num_orders,
total_spend,
last_order_date,
CASE
WHEN total_spend >= 300 THEN 'High Value'
WHEN total_spend >= 100 THEN 'Mid Value'
ELSE 'Low Value'
END AS segment
FROM customer_orders
)
-- Step 3: Join back to get customer names
SELECT
c.first_name || ' ' || c.last_name AS customer_name,
cs.total_spend,
cs.num_orders,
cs.last_order_date,
cs.segment
FROM customer_segments cs
JOIN customers c ON cs.customer_id = c.customer_id
ORDER BY cs.total_spend DESC
""")| customer_name | total_spend | num_orders | last_order_date | segment |
|---|---|---|---|---|
| Jane Smith | 384.96 | 3 | 2024-09-10 | High Value |
| Alice Williams | 269.97 | 2 | 2024-09-15 | Mid Value |
| Sophie Brown | 184.98 | 1 | 2024-09-05 | Mid Value |
| Bob Johnson | 139.98 | 2 | 2024-09-01 | Mid Value |
| Carlos Garcia | 79.99 | 1 | 2024-08-20 | Low Value |
| Wei Zhang | 69.98 | 1 | 2024-09-12 | Low Value |
CTEs are far more readable than deeply nested subqueries for complex analytics.
CASE WHEN: Conditional Logic in SQL
CASE WHEN is SQL’s if-else expression — it creates conditional values within a query:
# Categorize products by price tier
query("""
SELECT
product_name,
price,
CASE
WHEN price >= 100 THEN 'Premium'
WHEN price >= 50 AND price < 100 THEN 'Mid-range'
WHEN price >= 25 AND price < 50 THEN 'Budget'
ELSE 'Value'
END AS price_tier
FROM products
ORDER BY price DESC
""")
# Count orders by status in a single row (pivot-style)
query("""
SELECT
COUNT(*) AS total_orders,
SUM(CASE WHEN status = 'delivered' THEN 1 ELSE 0 END) AS delivered,
SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped,
SUM(CASE WHEN status = 'processing' THEN 1 ELSE 0 END) AS processing
FROM orders
""")
# Flag premium customers in a combined report
query("""
SELECT
c.first_name,
c.last_name,
CASE WHEN c.is_premium = 1 THEN 'Premium' ELSE 'Standard' END AS membership,
COALESCE(SUM(o.total_amount), 0) AS lifetime_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.is_premium
ORDER BY lifetime_spend DESC
""")Putting It All Together: A Complete Analysis Query
Here’s a realistic data science query that combines everything we’ve learned — building a customer feature table for a churn prediction model:
query("""
WITH order_stats AS (
-- Aggregate order-level metrics per customer
SELECT
customer_id,
COUNT(order_id) AS total_orders,
SUM(total_amount) AS lifetime_value,
AVG(total_amount) AS avg_order_value,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
-- Days since last order (recency)
JULIANDAY('2024-09-30') - JULIANDAY(MAX(order_date)) AS recency_days
FROM orders
GROUP BY customer_id
),
category_diversity AS (
-- How many distinct categories each customer has bought from
SELECT
o.customer_id,
COUNT(DISTINCT p.category) AS categories_purchased
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY o.customer_id
)
-- Final feature table: one row per customer
SELECT
c.customer_id,
c.city,
c.country,
c.is_premium,
JULIANDAY('2024-09-30') - JULIANDAY(c.signup_date) AS days_since_signup,
-- Order statistics (NULL for customers with no orders)
COALESCE(os.total_orders, 0) AS total_orders,
COALESCE(os.lifetime_value, 0) AS lifetime_value,
COALESCE(os.avg_order_value, 0) AS avg_order_value,
os.recency_days,
-- Category diversity
COALESCE(cd.categories_purchased, 0) AS categories_purchased,
-- Segment
CASE
WHEN os.lifetime_value >= 300 THEN 'High Value'
WHEN os.lifetime_value >= 100 THEN 'Mid Value'
WHEN os.total_orders > 0 THEN 'Low Value'
ELSE 'No Purchase'
END AS customer_segment
FROM customers c
LEFT JOIN order_stats os ON c.customer_id = os.customer_id
LEFT JOIN category_diversity cd ON c.customer_id = cd.customer_id
ORDER BY lifetime_value DESC
""")This single query — using CTEs, JOINs, aggregations, CASE WHEN, COALESCE, and date math — produces a ready-to-use feature table for machine learning, directly from the raw database tables.
SQL Best Practices for Data Scientists
Format Your Queries for Readability
Unformatted SQL is hard to read and maintain. Use consistent formatting:
-- Well-formatted: easy to read and debug
SELECT
c.customer_id,
c.first_name,
COUNT(o.order_id) AS num_orders,
SUM(o.total_amount) AS total_spend
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
WHERE c.signup_date >= '2022-01-01'
GROUP BY
c.customer_id,
c.first_name
HAVING COUNT(o.order_id) >= 2
ORDER BY total_spend DESC
LIMIT 10;Use Aliases Consistently
Always alias tables in multi-table queries (FROM customers c), and always alias computed or aggregated columns (COUNT(*) AS num_orders). This makes queries self-documenting.
Comment Complex Queries
-- Customer churn features: one row per customer as of 2024-09-30
-- Includes customers with zero orders (for "never purchased" segment)
WITH order_stats AS (
-- Aggregate from orders table
...Test Incrementally
Build complex queries piece by piece. Get the FROM and JOIN right first, verify the joins produce expected row counts, then add WHERE, then GROUP BY, then HAVING. Testing incrementally catches join bugs before they’re buried under layers of aggregation.
Be Careful with Aggregation After Joins
A JOIN that creates duplicate rows before aggregation will inflate your aggregates:
-- If a customer has 3 orders and you join to customer_attributes table
-- (one row per customer), the join produces 3 rows per customer
-- SUM(amount) will be correct, but COUNT(customer_attribute) will be 3x inflated
-- Always verify row counts at each join stepSummary
SQL is the most direct path from a database to insights. With the eight core concepts covered in this article — SELECT for retrieval, WHERE for filtering, ORDER BY for sorting, aggregate functions for summary statistics, GROUP BY for group-level analysis, HAVING for filtering groups, JOIN for combining tables, and CTEs for readable multi-step logic — you can write the vast majority of data science queries you’ll encounter professionally.
SQL’s power lies in its ability to push computation to the database, where millions of rows are handled with optimized algorithms in milliseconds. The final query in this article — a complete customer feature table assembled from four tables using joins, aggregations, CTEs, and conditional logic — would be far more complex and slower to implement in Python from raw files. In a real database with millions of customers and billions of transactions, it’s the only practical approach.
The next step is connecting to real databases from Python using SQLAlchemy and pandas — the subject of the next article. But the SQL skills you’ve built here are the foundation: they work the same whether you’re querying a local SQLite file, a corporate PostgreSQL database, or a Snowflake data warehouse with petabytes of data.
Key Takeaways
SELECTretrieves columns,FROMspecifies the table,WHEREfilters rows,GROUP BYcreates groups,HAVINGfilters groups,ORDER BYsorts results, andLIMITcaps row count — these seven clauses cover the majority of data science queries- Aggregate functions (
COUNT,SUM,AVG,MIN,MAX) compute a single value from a group of rows — they are always used withGROUP BYor across the entire table INNER JOINreturns only matching rows from both tables;LEFT JOINreturns all rows from the left table plus matches from the right (NULLs for non-matches) — LEFT JOIN is essential for identifying records with no corresponding data in another table- The SQL execution order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT) differs from the written order — understanding this explains why you can’t use SELECT aliases in WHERE clauses
CASE WHENprovides conditional logic within queries, enabling pivoting, segmentation, and conditional aggregation without multiple query passes- CTEs (
WITHclauses) make complex multi-step queries readable by assigning names to intermediate result sets — prefer them over deeply nested subqueries for analytical queries NULLrequires special handling: useIS NULL/IS NOT NULLfor comparison,COALESCE()to substitute defaults, and remember that aggregate functions ignore NULLs (exceptCOUNT(*))- Always use
LIMITwhen first exploring a new table, always alias computed columns for readability, and push aggregation to the database rather than pulling raw data to Python when possible








