SQL Basics for Data Scientists: Your First Queries

Learn SQL basics for data science. Master SELECT, WHERE, GROUP BY, JOIN, ORDER BY, and aggregate functions with clear examples and real-world data science use cases.

SQL Basics for Data Scientists: Your First Queries

SQL (Structured Query Language) is the standard language for querying and manipulating data in relational databases, and it is one of the most essential skills for any data scientist. With SQL, you retrieve exactly the data you need from databases containing millions of rows, filter records by any condition, group and aggregate data to compute summaries, join multiple tables together, and sort results — all with concise, readable code that runs directly in the database, where the data already lives.

Introduction

If Python is the hammer of data science, SQL is the foundation you build on. Almost every real-world data science project begins the same way: the data you need lives in a database, and to get it out you need SQL. The customer purchase histories, the server log events, the patient records, the financial transactions — they’re all waiting in tables, and SQL is the key.

SQL has been around since the 1970s and has barely changed in its fundamentals. What was true SQL syntax in 1985 still works in PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, and Redshift today. It’s one of the most stable and transferable skills in all of technology — learn it once and it works everywhere.

The Survey of Data Scientists consistently ranks SQL among the top two or three most-used tools, alongside Python. Many job descriptions for data scientists list SQL before any ML library. Teams that review resumes for data science roles report that weak SQL skills are a common and immediate disqualifier. And yet SQL is frequently undertaught in data science courses that focus heavily on machine learning, leaving practitioners who can build a neural network unable to write a decent JOIN.

This article fixes that. We’ll cover every foundational SQL concept you need — from writing your very first SELECT statement through filtering, aggregating, grouping, joining tables, and handling missing data — with clear explanations and practical examples using a realistic e-commerce dataset. By the end, you’ll be writing real queries against real databases with confidence.

Setting Up: Working with SQLite in Python

Before writing your first query, you need a database to query against. SQLite is the perfect learning environment — it’s built into Python, requires no server or installation, and stores the entire database in a single file.

Creating a Practice Database

Let’s create the e-commerce database we’ll use throughout this article:

Python
import sqlite3
import pandas as pd

# Connect to a new SQLite data.base (creates the file if it doesn't exist)
conn = sqlite3.connect("your db")
cursor = conn.cursor()

# Create the customers table
cursor.execute("""
CREATE TABLE IF NOT EXISTS customers (
    customer_id TEXT PRIMARY KEY,
    first_name  TEXT NOT NULL,
    last_name   TEXT NOT NULL,
    email       TEXT UNIQUE NOT NULL,
    city        TEXT,
    country     TEXT,
    signup_date TEXT,
    is_premium  INTEGER DEFAULT 0
)
""")

# Create the products table
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
    product_id   TEXT PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL,
    price        REAL NOT NULL,
    stock_qty    INTEGER DEFAULT 0
)
""")

# Create the orders table
cursor.execute("""
CREATE TABLE IF NOT EXISTS orders (
    order_id    TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL,
    order_date  TEXT NOT NULL,
    total_amount REAL NOT NULL,
    status      TEXT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
)
""")

# Create the order_items table
cursor.execute("""
CREATE TABLE IF NOT EXISTS order_items (
    item_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id   TEXT NOT NULL,
    product_id TEXT NOT NULL,
    quantity   INTEGER NOT NULL,
    unit_price REAL NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(order_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
)
""")

# Insert sample data
customers_data = [
    ('CUST_001', 'Jane',    'Smith',   'jane.smith@email.com',   'Austin',   'USA', '2022-03-14', 1),
    ('CUST_002', 'Bob',     'Johnson', 'bob.j@email.com',        'Seattle',  'USA', '2021-11-28', 0),
    ('CUST_003', 'Alice',   'Williams','alice.w@email.com',      'Chicago',  'USA', '2023-01-05', 1),
    ('CUST_004', 'Carlos',  'Garcia',  'c.garcia@email.com',     'Miami',    'USA', '2022-07-19', 0),
    ('CUST_005', 'Sophie',  'Brown',   's.brown@email.com',      'London',   'UK',  '2023-05-30', 1),
    ('CUST_006', 'Wei',     'Zhang',   'w.zhang@email.com',      'Toronto',  'CA',  '2021-09-01', 0),
    ('CUST_007', 'Priya',   'Patel',   'p.patel@email.com',      'Austin',   'USA', '2024-01-15', 0),
    ('CUST_008', 'Marcus',  'Lee',     'm.lee@email.com',        'Seattle',  'USA', '2022-12-03', 1),
]

products_data = [
    ('PROD_001', 'Wireless Headphones',   'Electronics', 149.99, 85),
    ('PROD_002', 'USB-C Hub',             'Electronics',  49.99, 200),
    ('PROD_003', 'Running Shoes',         'Apparel',      89.99, 150),
    ('PROD_004', 'Coffee Maker',          'Kitchen',      79.99, 62),
    ('PROD_005', 'Python Programming Book','Books',        34.99, 300),
    ('PROD_006', 'Yoga Mat',              'Sports',       29.99, 175),
    ('PROD_007', 'Laptop Stand',          'Electronics',  39.99, 120),
    ('PROD_008', 'Desk Lamp',             'Home',         24.99, 90),
]

orders_data = [
    ('ORD_1001', 'CUST_001', '2024-07-05', 199.98, 'delivered'),
    ('ORD_1002', 'CUST_002', '2024-07-12', 49.99,  'delivered'),
    ('ORD_1003', 'CUST_001', '2024-08-01', 149.99, 'delivered'),
    ('ORD_1004', 'CUST_003', '2024-08-15', 119.98, 'shipped'),
    ('ORD_1005', 'CUST_004', '2024-08-20', 79.99,  'delivered'),
    ('ORD_1006', 'CUST_002', '2024-09-01', 89.99,  'shipped'),
    ('ORD_1007', 'CUST_005', '2024-09-05', 184.98, 'processing'),
    ('ORD_1008', 'CUST_001', '2024-09-10', 34.99,  'processing'),
    ('ORD_1009', 'CUST_006', '2024-09-12', 69.98,  'shipped'),
    ('ORD_1010', 'CUST_003', '2024-09-15', 149.99, 'processing'),
]

order_items_data = [
    ('ORD_1001', 'PROD_001', 1, 149.99),
    ('ORD_1001', 'PROD_007', 1, 49.99),   # Note: 149.99 + 49.99 = 199.98
    ('ORD_1002', 'PROD_002', 1, 49.99),
    ('ORD_1003', 'PROD_001', 1, 149.99),
    ('ORD_1004', 'PROD_003', 1, 89.99),
    ('ORD_1004', 'PROD_006', 1, 29.99),
    ('ORD_1005', 'PROD_004', 1, 79.99),
    ('ORD_1006', 'PROD_003', 1, 89.99),
    ('ORD_1007', 'PROD_001', 1, 149.99),
    ('ORD_1007', 'PROD_006', 1, 29.99),   # Note: adjusted to match total
    ('ORD_1008', 'PROD_005', 1, 34.99),
    ('ORD_1009', 'PROD_006', 2, 29.99),
    ('ORD_1010', 'PROD_001', 1, 149.99),
]

cursor.executemany("INSERT OR IGNORE INTO customers VALUES (?,?,?,?,?,?,?,?)", customers_data)
cursor.executemany("INSERT OR IGNORE INTO products VALUES (?,?,?,?,?)", products_data)
cursor.executemany("INSERT OR IGNORE INTO orders VALUES (?,?,?,?,?)", orders_data)
cursor.executemany("INSERT OR IGNORE INTO order_items (order_id,product_id,quantity,unit_price) VALUES (?,?,?,?)", order_items_data)

conn.commit()
print("Database created and populated successfully.")

The Helper Function

We’ll use this helper function to run queries and display results cleanly throughout the article:

Python
def query(sql, conn=conn):
    """Execute a SQL query and return results as a pandas DataFrame."""
    return pd.read_sql_query(sql, conn)

The SELECT Statement: Retrieving Data

The SELECT statement is the foundation of SQL — it retrieves data from one or more tables. Every query you write as a data scientist starts with SELECT.

Basic Syntax

SQL
SELECT column1, column2, column3
FROM table_name;

The semicolon (;) terminates the statement. SQL keywords are conventionally written in uppercase (SELECT, FROM, WHERE), though most databases accept lowercase. Column and table names are case-insensitive in most databases.

Selecting Specific Columns

SQL
query("""
SELECT first_name, last_name, city, country
FROM customers
""")
first_namelast_namecitycountry
JaneSmithAustinUSA
BobJohnsonSeattleUSA
AliceWilliamsChicagoUSA
CarlosGarciaMiamiUSA
SophieBrownLondonUK
WeiZhangTorontoCA
PriyaPatelAustinUSA
MarcusLeeSeattleUSA

Selecting All Columns

The asterisk * selects all columns:

SQL
query("SELECT * FROM products")

Important note: In production queries and data science work, avoid SELECT * on large tables — always specify only the columns you actually need. This reduces data transfer, improves query speed, and makes your code more maintainable.

Column Aliases: Renaming Output Columns

The AS keyword renames a column in the output (the original table column is unchanged):

SQL
query("""
SELECT 
    product_name AS name,
    category     AS product_category,
    price        AS price_usd,
    stock_qty    AS units_in_stock
FROM products
""")

Aliases are especially useful when column names are long, ambiguous (after a join), or when computed columns need a descriptive name.

Computed Columns

SQL can compute new values from existing columns directly in the SELECT:

SQL
query("""
SELECT
    product_name,
    price,
    price * 1.10  AS price_with_tax,
    price * 0.85  AS price_discounted_15pct,
    stock_qty * price AS total_inventory_value
FROM products
ORDER BY total_inventory_value DESC
""")
product_namepriceprice_with_taxprice_discounted_15pcttotal_inventory_value
Wireless Headphones149.99164.989127.491512749.15
Running Shoes89.9998.98976.491513498.50
Python Programming Book34.9938.48929.741510497.00
USB-C Hub49.9954.98942.49159998.00

The WHERE Clause: Filtering Rows

WHERE filters rows, keeping only those that meet your conditions. It comes after FROM and before GROUP BY or ORDER BY.

Comparison Operators

Plaintext
=       Equal to
!=  <>  Not equal to
>       Greater than
<       Less than
>=      Greater than or equal to
<=      Less than or equal to
SQL
# Products priced above $50
query("""
SELECT product_name, category, price
FROM products
WHERE price > 50
""")
SQL
# Only electronics
query("""
SELECT product_name, price, stock_qty
FROM products
WHERE category = 'Electronics'
""")

Combining Conditions: AND, OR, NOT

SQL
# Electronics under $100
query("""
SELECT product_name, price
FROM products
WHERE category = 'Electronics'
  AND price < 100
""")
SQL
# Electronics OR products under $30
query("""
SELECT product_name, category, price
FROM products
WHERE category = 'Electronics'
   OR price < 30
""")
SQL
# Everything except Electronics
query("""
SELECT product_name, category, price
FROM products
WHERE NOT category = 'Electronics'
-- Equivalent: WHERE category != 'Electronics'
""")

Operator precedence: AND is evaluated before OR. Use parentheses to make complex conditions explicit and correct:

SQL
# Products that are: (Electronics AND expensive) OR (Books)
query("""
SELECT product_name, category, price
FROM products
WHERE (category = 'Electronics' AND price > 100)
   OR category = 'Books'
""")

BETWEEN: Range Filtering

SQL
# Products priced between $30 and $80 (inclusive on both ends)
query("""
SELECT product_name, price
FROM products
WHERE price BETWEEN 30 AND 80
ORDER BY price
""")

IN: Matching Multiple Values

SQL
# Multiple categories at once (cleaner than multiple OR conditions)
query("""
SELECT product_name, category, price
FROM products
WHERE category IN ('Electronics', 'Sports', 'Books')
ORDER BY category, price
""")

# NOT IN: exclude specific categories
query("""
SELECT product_name, category
FROM products
WHERE category NOT IN ('Electronics', 'Kitchen')
""")

LIKE: Pattern Matching

LIKE matches strings against patterns. Two wildcards:

  • % — matches any sequence of zero or more characters
  • _ — matches exactly one character
SQL
# Customer emails ending in .com
query("""
SELECT first_name, email
FROM customers
WHERE email LIKE '%@email.com'
""")

# Products with 'USB' anywhere in the name
query("""
SELECT product_name, price
FROM products
WHERE product_name LIKE '%USB%'
""")

# Products starting with a specific letter
query("""
SELECT product_name
FROM products
WHERE product_name LIKE 'R%'
""")

IS NULL and IS NOT NULL

NULL represents missing or unknown values. You cannot compare to NULL with = — you must use IS NULL or IS NOT NULL:

SQL
# Customers with no city on record
query("""
SELECT first_name, last_name, city
FROM customers
WHERE city IS NULL
""")

# Customers who do have a city
query("""
SELECT first_name, last_name, city
FROM customers
WHERE city IS NOT NULL
""")

# WRONG: This never returns results because NULL != NULL in SQL
-- WHERE city = NULL  ← This is incorrect SQL logic

ORDER BY: Sorting Results

ORDER BY sorts query results. It comes last in the query (before LIMIT).

SQL
# Products sorted by price, cheapest first (default: ascending)
query("""
SELECT product_name, price
FROM products
ORDER BY price ASC
""")

# Products sorted by price, most expensive first
query("""
SELECT product_name, price
FROM products
ORDER BY price DESC
""")

# Sort by multiple columns: category alphabetically, then by price descending within each category
query("""
SELECT product_name, category, price
FROM products
ORDER BY category ASC, price DESC
""")
product_namecategoryprice
Python Programming BookBooks34.99
Wireless HeadphonesElectronics149.99
USB-C HubElectronics49.99
Laptop StandElectronics39.99
Desk LampHome24.99
Coffee MakerKitchen79.99
Running ShoesApparel89.99
Yoga MatSports29.99

LIMIT: Controlling Result Size

LIMIT (or TOP in SQL Server) restricts how many rows are returned. Essential for exploring large tables without pulling millions of rows:

SQL
# First 5 customers by signup date
query("""
SELECT first_name, last_name, signup_date
FROM customers
ORDER BY signup_date ASC
LIMIT 5
""")

# The 3 most expensive products
query("""
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 3
""")

LIMIT with OFFSET for pagination — useful when building data pipelines that process data in chunks:

SQL
# Skip first 3 results, return next 3 (page 2 of 3-per-page results)
query("""
SELECT product_name, price
FROM products
ORDER BY price DESC
LIMIT 3 OFFSET 3
""")

Aggregate Functions: Computing Summaries

Aggregate functions compute a single value from a set of rows — essential for summarizing data.

The Core Aggregate Functions

FunctionDescriptionExample
COUNT(*)Count all rowsTotal number of orders
COUNT(column)Count non-NULL values in columnNumber of orders with a known city
COUNT(DISTINCT column)Count unique valuesNumber of distinct customers who ordered
SUM(column)Sum all valuesTotal revenue
AVG(column)Average (mean) of valuesAverage order value
MIN(column)Smallest valueCheapest product price
MAX(column)Largest valueMost expensive product
SQL
# Summary statistics for all orders
query("""
SELECT
    COUNT(*)            AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(total_amount)   AS total_revenue,
    AVG(total_amount)   AS avg_order_value,
    MIN(total_amount)   AS min_order_value,
    MAX(total_amount)   AS max_order_value
FROM orders
""")
total_ordersunique_customerstotal_revenueavg_order_valuemin_order_valuemax_order_value
1061029.84102.98434.99199.98
SQL
# Product catalog statistics
query("""
SELECT
    COUNT(*)          AS total_products,
    COUNT(DISTINCT category) AS num_categories,
    AVG(price)        AS avg_price,
    MIN(price)        AS cheapest,
    MAX(price)        AS most_expensive,
    SUM(stock_qty)    AS total_units_in_stock
FROM products
""")

DISTINCT: Counting Unique Values

SQL
# How many unique cities are our customers from?
query("""
SELECT COUNT(DISTINCT city) AS unique_cities
FROM customers
""")

# What are the unique cities?
query("""
SELECT DISTINCT city, country
FROM customers
ORDER BY country, city
""")

GROUP BY: Aggregating by Category

GROUP BY divides rows into groups based on one or more columns, and aggregate functions are applied to each group separately. This is where SQL becomes genuinely powerful for data analysis.

Basic GROUP BY

SQL
# Total revenue and order count by order status
query("""
SELECT
    status,
    COUNT(*)          AS num_orders,
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY status
ORDER BY total_revenue DESC
""")
statusnum_orderstotal_revenueavg_order_value
delivered4469.94117.49
processing3334.96111.65
shipped3224.9474.98
SQL
# Revenue by product category
query("""
SELECT
    p.category,
    COUNT(DISTINCT oi.order_id)  AS num_orders,
    SUM(oi.quantity)             AS total_units_sold,
    SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category
ORDER BY total_revenue DESC
""")

GROUP BY with Multiple Columns

SQL
# Orders per customer per month
query("""
SELECT
    customer_id,
    SUBSTR(order_date, 1, 7)  AS year_month,
    COUNT(*)                  AS orders_that_month,
    SUM(total_amount)         AS monthly_spend
FROM orders
GROUP BY customer_id, SUBSTR(order_date, 1, 7)
ORDER BY customer_id, year_month
""")

The HAVING Clause: Filtering Groups

WHERE filters individual rows before grouping. HAVING filters groups after aggregation. This is a critical distinction:

SQL
# Customers with more than 1 order
query("""
SELECT
    customer_id,
    COUNT(*)          AS num_orders,
    SUM(total_amount) AS total_spend
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1           -- Filter GROUPS (not individual rows)
ORDER BY total_spend DESC
""")
customer_idnum_orderstotal_spend
CUST_0013384.96
CUST_0022139.98
CUST_0032269.97
SQL
# Product categories with average price above $50
query("""
SELECT
    category,
    COUNT(*)   AS num_products,
    AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING AVG(price) > 50
ORDER BY avg_price DESC
""")

The WHERE vs. HAVING rule: Use WHERE to filter rows before grouping; use HAVING to filter groups after aggregation.

SQL
# Combining WHERE and HAVING
# Customers who placed delivered orders and spent more than $100 total
query("""
SELECT
    customer_id,
    COUNT(*)          AS num_delivered_orders,
    SUM(total_amount) AS total_delivered_spend
FROM orders
WHERE status = 'delivered'          -- Filter BEFORE grouping
GROUP BY customer_id
HAVING SUM(total_amount) > 100      -- Filter AFTER grouping
ORDER BY total_delivered_spend DESC
""")

JOIN: Combining Data from Multiple Tables

JOINs are the most powerful and most important SQL concept for data scientists. They combine rows from two or more tables based on a related column — allowing you to assemble complete, denormalized datasets from normalized tables.

INNER JOIN: The Most Common Join

An INNER JOIN returns only rows where there is a matching value in both tables. Non-matching rows from either table are excluded.

SQL
# Join customers with their orders (only customers who have orders)
query("""
SELECT
    c.first_name,
    c.last_name,
    c.city,
    o.order_id,
    o.order_date,
    o.total_amount,
    o.status
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.last_name, o.order_date
""")
first_namelast_namecityorder_idorder_datetotal_amountstatus
BobJohnsonSeattleORD_10022024-07-1249.99delivered
BobJohnsonSeattleORD_10062024-09-0189.99shipped
JaneSmithAustinORD_10012024-07-05199.98delivered
JaneSmithAustinORD_10032024-08-01149.99delivered

Notice: CUST_007 (Priya Patel) and CUST_008 (Marcus Lee) have no orders, so they don’t appear in this result.

The table alias convention: FROM customers c assigns the alias c to the customers table. c.customer_id then refers to the customer_id column from the customers table. Aliases are essential when joining tables that share column names.

LEFT JOIN: Include All Records from the Left Table

A LEFT JOIN returns all rows from the left (first) table, plus matching rows from the right table. Where there’s no match, right table columns appear as NULL.

SQL
# ALL customers, with their order info if they have any
query("""
SELECT
    c.first_name,
    c.last_name,
    c.city,
    COUNT(o.order_id)   AS num_orders,
    SUM(o.total_amount) AS total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.city
ORDER BY total_spend DESC
""")
first_namelast_namecitynum_orderstotal_spend
JaneSmithAustin3384.96
AliceWilliamsChicago2269.97
SophieBrownLondon1184.98
BobJohnsonSeattle2139.98
CarlosGarciaMiami179.99
WeiZhangToronto169.98
PriyaPatelAustin0NULL
MarcusLeeSeattle0NULL

Priya and Marcus appear with NULL spend — they’re customers who haven’t ordered yet. LEFT JOIN is essential for identifying these “zero-activity” records.

Finding Records Without a Match (Anti-Join)

A common and useful pattern: find rows in the left table that have NO match in the right table:

SQL
# Customers who have NEVER placed an order
query("""
SELECT
    c.first_name,
    c.last_name,
    c.signup_date,
    c.city
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL          -- Right table is NULL = no match found
ORDER BY c.signup_date
""")
first_namelast_namesignup_datecity
MarcusLee2022-12-03Seattle
PriyaPatel2024-01-15Austin

Chaining Multiple JOINs

Real data science queries often join three or more tables:

SQL
# Complete order detail: customer name + order info + product info
query("""
SELECT
    c.first_name || ' ' || c.last_name AS customer_name,
    o.order_date,
    p.product_name,
    p.category,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price AS line_total
FROM orders o
INNER JOIN customers c   ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p    ON oi.product_id = p.product_id
ORDER BY o.order_date, customer_name
""")
customer_nameorder_dateproduct_namecategoryquantityunit_priceline_total
Jane Smith2024-07-05Wireless HeadphonesElectronics1149.99149.99
Jane Smith2024-07-05Laptop StandElectronics139.9939.99
Bob Johnson2024-07-12USB-C HubElectronics149.9949.99

RIGHT JOIN and FULL OUTER JOIN

  • RIGHT JOIN: Like LEFT JOIN but keeps all rows from the right table, NULLing left table columns where there’s no match. Less commonly used than LEFT JOIN.
  • FULL OUTER JOIN: Returns all rows from both tables, with NULLs where there’s no match on either side. Useful for comparing two datasets. (Not supported in SQLite, but works in PostgreSQL, MySQL 8+, SQL Server.)

The SQL Query Execution Order

A critical insight that confuses many beginners: the order you write SQL clauses is different from the order the database executes them.

Written order:

SQL
SELECTFROMJOINWHEREGROUP BYHAVINGORDER BYLIMIT

Execution order:

Plaintext
1. FROM + JOIN    → Identify all rows from relevant tables
2. WHERE          → Filter individual rows
3. GROUP BY       → Group rows
4. HAVING         → Filter groups
5. SELECT         → Compute column expressions and aliases
6. ORDER BY       → Sort results
7. LIMIT          → Restrict number of rows returned

Why this matters in practice:

You cannot use a SELECT alias in a WHERE clause, because WHERE executes before SELECT:

SQL
-- This FAILS: 'price_with_tax' doesn't exist yet when WHERE executes
SELECT price * 1.10 AS price_with_tax
FROM products
WHERE price_with_tax > 100     -- Error! Alias doesn't exist at WHERE time

-- This WORKS: repeat the expression in WHERE
SELECT price * 1.10 AS price_with_tax
FROM products
WHERE price * 1.10 > 100

You CAN use a SELECT alias in ORDER BY, because ORDER BY executes after SELECT:

SQL
-- This WORKS fine
SELECT price * 1.10 AS price_with_tax
FROM products
ORDER BY price_with_tax DESC   -- Alias works here

String Functions: Working with Text Data

SQL provides built-in functions for text manipulation — essential for cleaning and transforming data:

SQL
# UPPER, LOWER, LENGTH
query("""
SELECT
    first_name,
    UPPER(first_name)    AS name_upper,
    LOWER(email)         AS email_lower,
    LENGTH(email)        AS email_length
FROM customers
LIMIT 4
""")

# SUBSTR: extract part of a string
# Extract year from date stored as text
query("""
SELECT
    order_id,
    order_date,
    SUBSTR(order_date, 1, 4)  AS order_year,
    SUBSTR(order_date, 6, 2)  AS order_month
FROM orders
LIMIT 5
""")

# String concatenation
query("""
SELECT
    first_name || ' ' || last_name  AS full_name,
    city || ', ' || country          AS location
FROM customers
""")

# TRIM: remove leading/trailing whitespace
query("""
SELECT TRIM('  hello world  ') AS trimmed
""")

# REPLACE: substitute text
query("""
SELECT 
    product_name,
    REPLACE(product_name, ' ', '_')  AS slug_name
FROM products
""")

Handling NULL Values in Queries

NULL values require special handling in SQL — they propagate through expressions and comparisons in potentially surprising ways:

SQL
# COALESCE: return the first non-NULL value
# Replace NULL city with 'Unknown'
query("""
SELECT
    first_name,
    COALESCE(city, 'Unknown') AS city_or_unknown
FROM customers
""")

# IFNULL (SQLite-specific, equivalent to COALESCE with 2 args)
query("""
SELECT
    c.first_name,
    IFNULL(SUM(o.total_amount), 0) AS total_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name
""")

# NULL in arithmetic: any operation with NULL returns NULL
-- SELECT 5 + NULL  → NULL
-- SELECT NULL * 100  → NULL
-- This is why COALESCE is important for computed columns

# NULL in aggregate functions: NULLs are ignored
-- COUNT(*) counts all rows including NULLs
-- COUNT(city) counts only non-NULL city values
-- AVG(amount) ignores NULL amounts (not treated as 0)

Subqueries: Queries Within Queries

A subquery is a SELECT statement nested inside another query. They enable multi-step analytical logic:

SQL
# Find customers who have spent more than the average customer
query("""
SELECT
    c.first_name,
    c.last_name,
    SUM(o.total_amount) AS total_spend
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING SUM(o.total_amount) > (
    SELECT AVG(customer_spend)
    FROM (
        SELECT customer_id, SUM(total_amount) AS customer_spend
        FROM orders
        GROUP BY customer_id
    ) customer_totals
)
ORDER BY total_spend DESC
""")

# Products that have been ordered at least once
query("""
SELECT product_name, price
FROM products
WHERE product_id IN (
    SELECT DISTINCT product_id
    FROM order_items
)
""")

# Products that have NEVER been ordered (using NOT IN)
query("""
SELECT product_name, category, price
FROM products
WHERE product_id NOT IN (
    SELECT DISTINCT product_id
    FROM order_items
)
""")

Common Table Expressions (CTEs): Readable Multi-Step Queries

CTEs (introduced with the WITH keyword) define named temporary result sets that can be referenced later in the query. They make complex queries readable by breaking them into logical steps — think of them as variable assignments for SQL:

SQL
# Customer spending analysis using CTEs
query("""
WITH customer_orders AS (
    -- Step 1: Calculate spending per customer
    SELECT
        customer_id,
        COUNT(order_id)   AS num_orders,
        SUM(total_amount) AS total_spend,
        MAX(order_date)   AS last_order_date
    FROM orders
    GROUP BY customer_id
),
customer_segments AS (
    -- Step 2: Segment customers by spending level
    SELECT
        customer_id,
        num_orders,
        total_spend,
        last_order_date,
        CASE
            WHEN total_spend >= 300 THEN 'High Value'
            WHEN total_spend >= 100 THEN 'Mid Value'
            ELSE 'Low Value'
        END AS segment
    FROM customer_orders
)
-- Step 3: Join back to get customer names
SELECT
    c.first_name || ' ' || c.last_name AS customer_name,
    cs.total_spend,
    cs.num_orders,
    cs.last_order_date,
    cs.segment
FROM customer_segments cs
JOIN customers c ON cs.customer_id = c.customer_id
ORDER BY cs.total_spend DESC
""")
customer_nametotal_spendnum_orderslast_order_datesegment
Jane Smith384.9632024-09-10High Value
Alice Williams269.9722024-09-15Mid Value
Sophie Brown184.9812024-09-05Mid Value
Bob Johnson139.9822024-09-01Mid Value
Carlos Garcia79.9912024-08-20Low Value
Wei Zhang69.9812024-09-12Low Value

CTEs are far more readable than deeply nested subqueries for complex analytics.

CASE WHEN: Conditional Logic in SQL

CASE WHEN is SQL’s if-else expression — it creates conditional values within a query:

SQL
# Categorize products by price tier
query("""
SELECT
    product_name,
    price,
    CASE
        WHEN price >= 100            THEN 'Premium'
        WHEN price >= 50 AND price < 100 THEN 'Mid-range'
        WHEN price >= 25 AND price < 50  THEN 'Budget'
        ELSE                             'Value'
    END AS price_tier
FROM products
ORDER BY price DESC
""")

# Count orders by status in a single row (pivot-style)
query("""
SELECT
    COUNT(*) AS total_orders,
    SUM(CASE WHEN status = 'delivered'  THEN 1 ELSE 0 END) AS delivered,
    SUM(CASE WHEN status = 'shipped'    THEN 1 ELSE 0 END) AS shipped,
    SUM(CASE WHEN status = 'processing' THEN 1 ELSE 0 END) AS processing
FROM orders
""")

# Flag premium customers in a combined report
query("""
SELECT
    c.first_name,
    c.last_name,
    CASE WHEN c.is_premium = 1 THEN 'Premium' ELSE 'Standard' END AS membership,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_spend
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.is_premium
ORDER BY lifetime_spend DESC
""")

Putting It All Together: A Complete Analysis Query

Here’s a realistic data science query that combines everything we’ve learned — building a customer feature table for a churn prediction model:

SQL
query("""
WITH order_stats AS (
    -- Aggregate order-level metrics per customer
    SELECT
        customer_id,
        COUNT(order_id)                         AS total_orders,
        SUM(total_amount)                       AS lifetime_value,
        AVG(total_amount)                       AS avg_order_value,
        MIN(order_date)                         AS first_order_date,
        MAX(order_date)                         AS last_order_date,
        -- Days since last order (recency)
        JULIANDAY('2024-09-30') - JULIANDAY(MAX(order_date)) AS recency_days
    FROM orders
    GROUP BY customer_id
),
category_diversity AS (
    -- How many distinct categories each customer has bought from
    SELECT
        o.customer_id,
        COUNT(DISTINCT p.category) AS categories_purchased
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    GROUP BY o.customer_id
)
-- Final feature table: one row per customer
SELECT
    c.customer_id,
    c.city,
    c.country,
    c.is_premium,
    JULIANDAY('2024-09-30') - JULIANDAY(c.signup_date) AS days_since_signup,
    -- Order statistics (NULL for customers with no orders)
    COALESCE(os.total_orders, 0)          AS total_orders,
    COALESCE(os.lifetime_value, 0)        AS lifetime_value,
    COALESCE(os.avg_order_value, 0)       AS avg_order_value,
    os.recency_days,
    -- Category diversity
    COALESCE(cd.categories_purchased, 0) AS categories_purchased,
    -- Segment
    CASE
        WHEN os.lifetime_value >= 300  THEN 'High Value'
        WHEN os.lifetime_value >= 100  THEN 'Mid Value'
        WHEN os.total_orders > 0       THEN 'Low Value'
        ELSE 'No Purchase'
    END AS customer_segment
FROM customers c
LEFT JOIN order_stats       os ON c.customer_id = os.customer_id
LEFT JOIN category_diversity cd ON c.customer_id = cd.customer_id
ORDER BY lifetime_value DESC
""")

This single query — using CTEs, JOINs, aggregations, CASE WHEN, COALESCE, and date math — produces a ready-to-use feature table for machine learning, directly from the raw database tables.

SQL Best Practices for Data Scientists

Format Your Queries for Readability

Unformatted SQL is hard to read and maintain. Use consistent formatting:

SQL
-- Well-formatted: easy to read and debug
SELECT
    c.customer_id,
    c.first_name,
    COUNT(o.order_id)    AS num_orders,
    SUM(o.total_amount)  AS total_spend
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE c.signup_date >= '2022-01-01'
GROUP BY
    c.customer_id,
    c.first_name
HAVING COUNT(o.order_id) >= 2
ORDER BY total_spend DESC
LIMIT 10;

Use Aliases Consistently

Always alias tables in multi-table queries (FROM customers c), and always alias computed or aggregated columns (COUNT(*) AS num_orders). This makes queries self-documenting.

Comment Complex Queries

SQL
-- Customer churn features: one row per customer as of 2024-09-30
-- Includes customers with zero orders (for "never purchased" segment)
WITH order_stats AS (
    -- Aggregate from orders table
    ...

Test Incrementally

Build complex queries piece by piece. Get the FROM and JOIN right first, verify the joins produce expected row counts, then add WHERE, then GROUP BY, then HAVING. Testing incrementally catches join bugs before they’re buried under layers of aggregation.

Be Careful with Aggregation After Joins

A JOIN that creates duplicate rows before aggregation will inflate your aggregates:

SQL
-- If a customer has 3 orders and you join to customer_attributes table 
-- (one row per customer), the join produces 3 rows per customer
-- SUM(amount) will be correct, but COUNT(customer_attribute) will be 3x inflated
-- Always verify row counts at each join step

Summary

SQL is the most direct path from a database to insights. With the eight core concepts covered in this article — SELECT for retrieval, WHERE for filtering, ORDER BY for sorting, aggregate functions for summary statistics, GROUP BY for group-level analysis, HAVING for filtering groups, JOIN for combining tables, and CTEs for readable multi-step logic — you can write the vast majority of data science queries you’ll encounter professionally.

SQL’s power lies in its ability to push computation to the database, where millions of rows are handled with optimized algorithms in milliseconds. The final query in this article — a complete customer feature table assembled from four tables using joins, aggregations, CTEs, and conditional logic — would be far more complex and slower to implement in Python from raw files. In a real database with millions of customers and billions of transactions, it’s the only practical approach.

The next step is connecting to real databases from Python using SQLAlchemy and pandas — the subject of the next article. But the SQL skills you’ve built here are the foundation: they work the same whether you’re querying a local SQLite file, a corporate PostgreSQL database, or a Snowflake data warehouse with petabytes of data.

Key Takeaways

  • SELECT retrieves columns, FROM specifies the table, WHERE filters rows, GROUP BY creates groups, HAVING filters groups, ORDER BY sorts results, and LIMIT caps row count — these seven clauses cover the majority of data science queries
  • Aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute a single value from a group of rows — they are always used with GROUP BY or across the entire table
  • INNER JOIN returns only matching rows from both tables; LEFT JOIN returns all rows from the left table plus matches from the right (NULLs for non-matches) — LEFT JOIN is essential for identifying records with no corresponding data in another table
  • The SQL execution order (FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT) differs from the written order — understanding this explains why you can’t use SELECT aliases in WHERE clauses
  • CASE WHEN provides conditional logic within queries, enabling pivoting, segmentation, and conditional aggregation without multiple query passes
  • CTEs (WITH clauses) make complex multi-step queries readable by assigning names to intermediate result sets — prefer them over deeply nested subqueries for analytical queries
  • NULL requires special handling: use IS NULL / IS NOT NULL for comparison, COALESCE() to substitute defaults, and remember that aggregate functions ignore NULLs (except COUNT(*))
  • Always use LIMIT when first exploring a new table, always alias computed columns for readability, and push aggregation to the database rather than pulling raw data to Python when possible
Share:
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments

Discover More

Python Libraries for Data Science: NumPy and Pandas

Explore NumPy and Pandas, two essential Python libraries for data science. Learn their features, applications…

The Data Science Workflow: From Problem to Solution

Master the data science workflow with this comprehensive guide covering problem definition, data collection, cleaning,…

The Standard Template Library (STL): Your C++ Toolkit

The Standard Template Library (STL): Your C++ Toolkit

Master the C++ Standard Template Library (STL). Learn containers, iterators, and algorithms—vector, map, sort, find—with…

The Difference Between Analog and Digital Signals Explained Visually

Learn the fundamental differences between analog and digital signals through clear visual explanations. Understand continuous…

Connecting to Databases from Python

Connecting to Databases from Python

Learn how to connect to databases from Python. Master SQLite, PostgreSQL, SQLAlchemy, pandas read_sql, connection…

Meta Reports Strong Q4 Earnings, Plans Up to $135 Billion AI Spending

Meta reports $59.9B Q4 revenue beating estimates while unveiling $115-135 billion AI infrastructure spending plan…

Click For More
0
Would love your thoughts, please comment.x
()
x