Connecting to Databases from Python

Learn how to connect to databases from Python. Master SQLite, PostgreSQL, SQLAlchemy, pandas read_sql, connection pooling, security best practices, and writing results back.

By Techietory on May 24, 2026

Connecting to a database from Python requires three things: a database driver (a library that speaks the database’s protocol), a connection string (specifying the host, port, credentials, and database name), and a query interface (either direct cursor-based execution or a higher-level library like SQLAlchemy or pandas). The most common pattern in data science is using SQLAlchemy to create a database engine, then passing that engine to pandas’ read_sql_query() function to execute SQL and receive results directly as a DataFrame.

Introduction

Writing SQL queries is only half the skill. The other half — equally important but less often taught — is getting those queries into and out of Python smoothly. You need to know how to establish a connection, execute a query, handle the results, manage credentials securely, write data back to the database, and avoid the common pitfalls that lead to connection leaks, credential exposures, and security vulnerabilities.

In professional data science work, database connections are a daily activity. You start the morning by pulling fresh data from the warehouse, spend the day building features and training models, and end by writing predictions back to a production table. Everything flows through database connections.

This article covers the complete spectrum of database connectivity in Python: from the simplest possible SQLite connection to enterprise PostgreSQL with connection pooling, from raw cursor-based queries to the elegant pandas integration that makes database access feel like reading a CSV, and through all the security and best practices that separate professional from amateur database work.

By the end, connecting to any database, in any professional environment, will feel straightforward.

The Python Database Connectivity Landscape

Python’s database ecosystem is layered. Understanding the layers helps you choose the right tool for each situation.

The DB-API 2.0 Standard (PEP 249)

Python’s database connectivity is standardized by PEP 249, also called DB-API 2.0. This specification defines a common interface that all database drivers must implement, regardless of the underlying database. The core concepts:

Connection: An object representing the connection to the database
Cursor: An object for executing queries and fetching results
connect(): The function to establish a connection
execute(): The cursor method to run a query
fetchall() / fetchone() / fetchmany(n): Methods to retrieve results

Because all drivers implement the same interface, the code pattern is nearly identical whether you’re using SQLite, PostgreSQL, or MySQL — only the import and connection string differ.

The Layered Architecture

Plaintext

Your Python Code (data science notebook / script)
        │
        ▼
High-Level Libraries (pandas, SQLAlchemy ORM)
        │
        ▼
SQLAlchemy Core (data.base-agnostic SQL execution)
        │
        ▼
Data.base Drivers (psycopg2, mysql-connector, sqlite3)
        │
        ▼
Data.base Server (PostgreSQL, MySQL, SQLite, etc.)

Your Python Code (data science notebook / script)
        │
        ▼
High-Level Libraries (pandas, SQLAlchemy ORM)
        │
        ▼
SQLAlchemy Core (data.base-agnostic SQL execution)
        │
        ▼
Data.base Drivers (psycopg2, mysql-connector, sqlite3)
        │
        ▼
Data.base Server (PostgreSQL, MySQL, SQLite, etc.)

Layer 1 — Database drivers: Speak the wire protocol of a specific database. Low-level, fast, but verbose to use directly.

Layer 2 — SQLAlchemy Core: A database-agnostic SQL toolkit that wraps drivers, providing connection pooling, dialect translation, and a unified API across all database types.

Layer 3 — High-level libraries: pandas’ read_sql_query() accepts SQLAlchemy engines or DB-API connections and returns DataFrames directly. This is the layer data scientists use most.

SQLite: The Zero-Configuration Starting Point

SQLite is built into Python’s standard library — no installation required. The entire database is stored in a single file, making it perfect for development, testing, and small-scale projects. Remarkably powerful despite its simplicity: SQLite is the most widely deployed database engine in the world, used in browsers, smartphones, and embedded devices everywhere.

Basic SQLite Connection

Python

import sqlite3
import pandas as pd

# Connect to a file-based data.base (creates it if it doesn't exist)
conn = sqlite3.connect("my_project.db")

# Connect to an in-memory data.base (temporary, lost when connection closes)
# Useful for testing and one-off data manipulation
conn_memory = sqlite3.connect(":memory:")

# Verify connection is working
print(sqlite3.version)        # SQLite module version
print(sqlite3.sqlite_version)  # Underlying SQLite library version

import sqlite3
import pandas as pd

# Connect to a file-based data.base (creates it if it doesn't exist)
conn = sqlite3.connect("my_project.db")

# Connect to an in-memory data.base (temporary, lost when connection closes)
# Useful for testing and one-off data manipulation
conn_memory = sqlite3.connect(":memory:")

# Verify connection is working
print(sqlite3.version)        # SQLite module version
print(sqlite3.sqlite_version)  # Underlying SQLite library version

Executing Queries with a Cursor

The cursor is the SQL execution interface. DB-API 2.0 pattern:

Python

conn = sqlite3.connect("my_project.db")
cursor = conn.cursor()

# Execute a query
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        sale_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        product    TEXT NOT NULL,
        amount     REAL NOT NULL,
        sale_date  TEXT NOT NULL,
        region     TEXT
    )
""")

# Insert a single row
cursor.execute(
    "INSERT INTO sales (product, amount, sale_date, region) VALUES (?, ?, ?, ?)",
    ("Widget A", 149.99, "2024-09-01", "West")
)

# Insert multiple rows efficiently with executemany
sales_data = [
    ("Widget B",  89.99, "2024-09-02", "East"),
    ("Widget A", 149.99, "2024-09-02", "West"),
    ("Widget C",  34.99, "2024-09-03", "North"),
    ("Widget B",  89.99, "2024-09-03", "East"),
    ("Widget A", 149.99, "2024-09-04", "South"),
]
cursor.executemany(
    "INSERT INTO sales (product, amount, sale_date, region) VALUES (?, ?, ?, ?)",
    sales_data
)

# IMPORTANT: Commit the transaction to persist changes
conn.commit()
print("Data inserted successfully.")

conn = sqlite3.connect("my_project.db")
cursor = conn.cursor()

# Execute a query
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        sale_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        product    TEXT NOT NULL,
        amount     REAL NOT NULL,
        sale_date  TEXT NOT NULL,
        region     TEXT
    )
""")

# Insert a single row
cursor.execute(
    "INSERT INTO sales (product, amount, sale_date, region) VALUES (?, ?, ?, ?)",
    ("Widget A", 149.99, "2024-09-01", "West")
)

# Insert multiple rows efficiently with executemany
sales_data = [
    ("Widget B",  89.99, "2024-09-02", "East"),
    ("Widget A", 149.99, "2024-09-02", "West"),
    ("Widget C",  34.99, "2024-09-03", "North"),
    ("Widget B",  89.99, "2024-09-03", "East"),
    ("Widget A", 149.99, "2024-09-04", "South"),
]
cursor.executemany(
    "INSERT INTO sales (product, amount, sale_date, region) VALUES (?, ?, ?, ?)",
    sales_data
)

# IMPORTANT: Commit the transaction to persist changes
conn.commit()
print("Data inserted successfully.")

Fetching Query Results

Python

cursor = conn.cursor()
cursor.execute("SELECT product, amount, region FROM sales ORDER BY amount DESC")

# Fetch all rows at once
all_rows = cursor.fetchall()
print(all_rows)
# [('Widget A', 149.99, 'West'), ('Widget A', 149.99, 'West'), ...]

# Fetch one row at a time
cursor.execute("SELECT * FROM sales")
first_row = cursor.fetchone()   # Returns a tuple, or None if no more rows
print(first_row)

# Fetch in batches (memory-efficient for large result sets)
cursor.execute("SELECT * FROM sales")
while True:
    batch = cursor.fetchmany(100)  # Fetch 100 rows at a time
    if not batch:
        break
    for row in batch:
        process(row)  # Your processing function

# Column names from cursor description
cursor.execute("SELECT product, amount, region FROM sales")
col_names = [description[0] for description in cursor.description]
print(col_names)  # ['product', 'amount', 'region']

cursor = conn.cursor()
cursor.execute("SELECT product, amount, region FROM sales ORDER BY amount DESC")

# Fetch all rows at once
all_rows = cursor.fetchall()
print(all_rows)
# [('Widget A', 149.99, 'West'), ('Widget A', 149.99, 'West'), ...]

# Fetch one row at a time
cursor.execute("SELECT * FROM sales")
first_row = cursor.fetchone()   # Returns a tuple, or None if no more rows
print(first_row)

# Fetch in batches (memory-efficient for large result sets)
cursor.execute("SELECT * FROM sales")
while True:
    batch = cursor.fetchmany(100)  # Fetch 100 rows at a time
    if not batch:
        break
    for row in batch:
        process(row)  # Your processing function

# Column names from cursor description
cursor.execute("SELECT product, amount, region FROM sales")
col_names = [description[0] for description in cursor.description]
print(col_names)  # ['product', 'amount', 'region']

Reading into pandas Directly

For data science, pd.read_sql_query() is far more convenient than manually fetching rows:

Python

import pandas as pd
import sqlite3

conn = sqlite3.connect("my_project.db")

# Execute query and get results as a DataFrame — one line
df = pd.read_sql_query("SELECT * FROM sales", conn)
print(df.head())
print(df.dtypes)

# Pass parameters safely (no string concatenation!)
region = "West"
df = pd.read_sql_query(
    "SELECT * FROM sales WHERE region = ?",
    conn,
    params=(region,)
)

# Parse date columns automatically
df = pd.read_sql_query(
    "SELECT * FROM sales",
    conn,
    parse_dates=["sale_date"]
)
print(df["sale_date"].dtype)  # datetime64[ns]

import pandas as pd
import sqlite3

conn = sqlite3.connect("my_project.db")

# Execute query and get results as a DataFrame — one line
df = pd.read_sql_query("SELECT * FROM sales", conn)
print(df.head())
print(df.dtypes)

# Pass parameters safely (no string concatenation!)
region = "West"
df = pd.read_sql_query(
    "SELECT * FROM sales WHERE region = ?",
    conn,
    params=(region,)
)

# Parse date columns automatically
df = pd.read_sql_query(
    "SELECT * FROM sales",
    conn,
    parse_dates=["sale_date"]
)
print(df["sale_date"].dtype)  # datetime64[ns]

The Context Manager Pattern: Safe Connection Handling

Always use connections as context managers to ensure they’re properly closed — even if an error occurs:

Python

# Using a context manager — connection closes automatically
with sqlite3.connect("my_project.db") as conn:
    df = pd.read_sql_query("SELECT * FROM sales", conn)
    # Process df...
# conn is automatically closed (and committed) here

# For SQLite specifically, the 'with conn:' pattern auto-commits on success
# and auto-rolls back on exception

# Using a context manager — connection closes automatically
with sqlite3.connect("my_project.db") as conn:
    df = pd.read_sql_query("SELECT * FROM sales", conn)
    # Process df...
# conn is automatically closed (and committed) here

# For SQLite specifically, the 'with conn:' pattern auto-commits on success
# and auto-rolls back on exception

Writing DataFrames to SQLite

Python

import pandas as pd
import sqlite3

# Create a sample DataFrame
results_df = pd.DataFrame({
    "customer_id": ["CUST_001", "CUST_002", "CUST_003"],
    "churn_probability": [0.82, 0.23, 0.67],
    "predicted_segment": ["High Risk", "Low Risk", "Medium Risk"],
    "scored_at": pd.Timestamp.now()
})

with sqlite3.connect("my_project.db") as conn:
    results_df.to_sql(
        name="churn_predictions",
        con=conn,
        if_exists="replace",  # 'replace', 'append', or 'fail'
        index=False           # Don't write the pandas index as a column
    )
    print("Predictions written to data.base.")

import pandas as pd
import sqlite3

# Create a sample DataFrame
results_df = pd.DataFrame({
    "customer_id": ["CUST_001", "CUST_002", "CUST_003"],
    "churn_probability": [0.82, 0.23, 0.67],
    "predicted_segment": ["High Risk", "Low Risk", "Medium Risk"],
    "scored_at": pd.Timestamp.now()
})

with sqlite3.connect("my_project.db") as conn:
    results_df.to_sql(
        name="churn_predictions",
        con=conn,
        if_exists="replace",  # 'replace', 'append', or 'fail'
        index=False           # Don't write the pandas index as a column
    )
    print("Predictions written to data.base.")

The if_exists parameter controls behavior when the table already exists:

"fail": Raise an error (default, prevents accidental overwrites)
"replace": Drop and recreate the table, then insert
"append": Add rows to the existing table

PostgreSQL: The Professional Standard

PostgreSQL is the most popular open-source relational database for data science and production systems. When you join a data team, there’s a good chance their primary database is PostgreSQL. The standard Python driver is psycopg2.

Installing psycopg2

Python

pip install psycopg2-binary   # Includes compiled binary — easier installation
# Or for production environments (compiles from source, more control):
pip install psycopg2

pip install psycopg2-binary   # Includes compiled binary — easier installation
# Or for production environments (compiles from source, more control):
pip install psycopg2

Basic PostgreSQL Connection

Python

import psycopg2
import pandas as pd

# Connection parameters — in practice these come from environment variables
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    data.base="analytics",
    user="jane_smith",
    password="your_password_here"    # Never hardcode in real code!
)

# Verify connection
cursor = conn.cursor()
cursor.execute("SELECT version()")
version = cursor.fetchone()
print(f"Connected to: {version[0]}")

cursor.close()
conn.close()

import psycopg2
import pandas as pd

# Connection parameters — in practice these come from environment variables
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    data.base="analytics",
    user="jane_smith",
    password="your_password_here"    # Never hardcode in real code!
)

# Verify connection
cursor = conn.cursor()
cursor.execute("SELECT version()")
version = cursor.fetchone()
print(f"Connected to: {version[0]}")

cursor.close()
conn.close()

PostgreSQL with pandas

The same pd.read_sql_query() pattern works with psycopg2 connections:

Python

import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="warehouse.company.internal",
    port=5432,
    data.base="analytics_db",
    user="ds_user",
    password="..."
)

# Pull training data for churn model
query = """
    SELECT
        c.customer_id,
        c.signup_date,
        c.is_premium,
        COUNT(o.order_id)          AS total_orders,
        SUM(o.total_amount)        AS lifetime_value,
        MAX(o.order_date)          AS last_order_date
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.signup_date >= '2022-01-01'
    GROUP BY c.customer_id, c.signup_date, c.is_premium
    HAVING COUNT(o.order_id) >= 1
"""

df = pd.read_sql_query(query, conn, parse_dates=["signup_date", "last_order_date"])
print(f"Retrieved {len(df):,} customer records")
print(df.dtypes)

conn.close()

import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="warehouse.company.internal",
    port=5432,
    data.base="analytics_db",
    user="ds_user",
    password="..."
)

# Pull training data for churn model
query = """
    SELECT
        c.customer_id,
        c.signup_date,
        c.is_premium,
        COUNT(o.order_id)          AS total_orders,
        SUM(o.total_amount)        AS lifetime_value,
        MAX(o.order_date)          AS last_order_date
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.signup_date >= '2022-01-01'
    GROUP BY c.customer_id, c.signup_date, c.is_premium
    HAVING COUNT(o.order_id) >= 1
"""

df = pd.read_sql_query(query, conn, parse_dates=["signup_date", "last_order_date"])
print(f"Retrieved {len(df):,} customer records")
print(df.dtypes)

conn.close()

Parameterized Queries in psycopg2

PostgreSQL uses %s placeholders (not ? like SQLite):

Python

cursor = conn.cursor()

# Single parameter
cursor.execute(
    "SELECT * FROM customers WHERE city = %s",
    ("Austin",)    # Note: must be a tuple even for a single value
)

# Multiple parameters
cursor.execute(
    """SELECT * FROM orders 
       WHERE customer_id = %s 
         AND order_date >= %s
         AND total_amount > %s""",
    ("CUST_001", "2024-01-01", 100.0)
)

# Named parameters with %(name)s syntax
cursor.execute(
    "SELECT * FROM customers WHERE city = %(city)s AND country = %(country)s",
    {"city": "Austin", "country": "USA"}
)

cursor = conn.cursor()

# Single parameter
cursor.execute(
    "SELECT * FROM customers WHERE city = %s",
    ("Austin",)    # Note: must be a tuple even for a single value
)

# Multiple parameters
cursor.execute(
    """SELECT * FROM orders 
       WHERE customer_id = %s 
         AND order_date >= %s
         AND total_amount > %s""",
    ("CUST_001", "2024-01-01", 100.0)
)

# Named parameters with %(name)s syntax
cursor.execute(
    "SELECT * FROM customers WHERE city = %(city)s AND country = %(country)s",
    {"city": "Austin", "country": "USA"}
)

Transactions in PostgreSQL

PostgreSQL requires explicit transaction management. Every execute() runs inside a transaction — you must commit() to save changes or rollback() to undo them:

Python

conn = psycopg2.connect(...)

try:
    cursor = conn.cursor()
    
    # Multiple operations as one atomic transaction
    cursor.execute(
        "UPDATE accounts SET balance = balance - %s WHERE account_id = %s",
        (500.00, "ACC_001")
    )
    cursor.execute(
        "UPDATE accounts SET balance = balance + %s WHERE account_id = %s",
        (500.00, "ACC_002")
    )
    cursor.execute(
        "INSERT INTO transfers (from_account, to_account, amount, ts) "
        "VALUES (%s, %s, %s, NOW())",
        ("ACC_001", "ACC_002", 500.00)
    )
    
    conn.commit()   # All three succeed together
    print("Transfer complete.")

except Exception as e:
    conn.rollback()  # If anything fails, undo everything
    print(f"Transfer failed, rolled back: {e}")
    raise

finally:
    cursor.close()
    conn.close()

conn = psycopg2.connect(...)

try:
    cursor = conn.cursor()
    
    # Multiple operations as one atomic transaction
    cursor.execute(
        "UPDATE accounts SET balance = balance - %s WHERE account_id = %s",
        (500.00, "ACC_001")
    )
    cursor.execute(
        "UPDATE accounts SET balance = balance + %s WHERE account_id = %s",
        (500.00, "ACC_002")
    )
    cursor.execute(
        "INSERT INTO transfers (from_account, to_account, amount, ts) "
        "VALUES (%s, %s, %s, NOW())",
        ("ACC_001", "ACC_002", 500.00)
    )
    
    conn.commit()   # All three succeed together
    print("Transfer complete.")

except Exception as e:
    conn.rollback()  # If anything fails, undo everything
    print(f"Transfer failed, rolled back: {e}")
    raise

finally:
    cursor.close()
    conn.close()

SQLAlchemy: The Database Abstraction Layer

SQLAlchemy is the most important database library in the Python data ecosystem. Rather than working directly with database-specific drivers, SQLAlchemy provides a unified interface that works identically across PostgreSQL, MySQL, SQLite, Oracle, SQL Server, and more.

For data scientists, SQLAlchemy’s most important role is acting as the engine that powers pandas database operations — pd.read_sql_query() and DataFrame.to_sql() both accept SQLAlchemy engines for more powerful and reliable behavior than raw DB-API connections.

Python

pip install sqlalchemy
pip install psycopg2-binary   # Still needed as the underlying driver

pip install sqlalchemy
pip install psycopg2-binary   # Still needed as the underlying driver

Creating an Engine

The create_engine() function is the entry point to SQLAlchemy. It takes a connection URL (also called a connection string or DSN — Data Source Name):

Python

from sqlalchemy import create_engine

# Connection URL format:
# dialect+driver://username:password@host:port/data.base

# SQLite (file-based)
engine = create_engine("sqlite:///my_project.db")

# SQLite in-memory
engine = create_engine("sqlite:///:memory:")

# PostgreSQL with psycopg2 driver
engine = create_engine(
    "postgresql+psycopg2://username:password@localhost:5432/analytics"
)

# MySQL with mysql-connector driver
engine = create_engine(
    "mysql+mysqlconnector://username:password@localhost:3306/mydb"
)

# SQL Server with pyodbc driver
engine = create_engine(
    "mssql+pyodbc://username:password@server/data.base?driver=ODBC+Driver+17+for+SQL+Server"
)

# Amazon Redshift (via psycopg2 with Redshift dialect)
engine = create_engine(
    "redshift+psycopg2://username:password@cluster.region.redshift.amazonaws.com:5439/analytics"
)

from sqlalchemy import create_engine

# Connection URL format:
# dialect+driver://username:password@host:port/data.base

# SQLite (file-based)
engine = create_engine("sqlite:///my_project.db")

# SQLite in-memory
engine = create_engine("sqlite:///:memory:")

# PostgreSQL with psycopg2 driver
engine = create_engine(
    "postgresql+psycopg2://username:password@localhost:5432/analytics"
)

# MySQL with mysql-connector driver
engine = create_engine(
    "mysql+mysqlconnector://username:password@localhost:3306/mydb"
)

# SQL Server with pyodbc driver
engine = create_engine(
    "mssql+pyodbc://username:password@server/data.base?driver=ODBC+Driver+17+for+SQL+Server"
)

# Amazon Redshift (via psycopg2 with Redshift dialect)
engine = create_engine(
    "redshift+psycopg2://username:password@cluster.region.redshift.amazonaws.com:5439/analytics"
)

The engine doesn’t immediately connect — it’s a lazy object that creates connections only when needed, and manages a pool of reusable connections.

Using SQLAlchemy Engine with pandas

Python

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql+psycopg2://user:pass@localhost/analytics")

# Read — SQLAlchemy engine is more reliable than raw connections for pandas
df = pd.read_sql_query(
    "SELECT * FROM customers WHERE is_premium = true",
    engine,
    parse_dates=["signup_date"]
)

# Write — to_sql works much better with SQLAlchemy
df.to_sql(
    name="processed_customers",
    con=engine,
    if_exists="replace",
    index=False,
    dtype={                    # Optionally specify column types
        "customer_id": sqlalchemy.Text(),
        "signup_date": sqlalchemy.Date(),
        "lifetime_value": sqlalchemy.Numeric(10, 2)
    }
)

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql+psycopg2://user:pass@localhost/analytics")

# Read — SQLAlchemy engine is more reliable than raw connections for pandas
df = pd.read_sql_query(
    "SELECT * FROM customers WHERE is_premium = true",
    engine,
    parse_dates=["signup_date"]
)

# Write — to_sql works much better with SQLAlchemy
df.to_sql(
    name="processed_customers",
    con=engine,
    if_exists="replace",
    index=False,
    dtype={                    # Optionally specify column types
        "customer_id": sqlalchemy.Text(),
        "signup_date": sqlalchemy.Date(),
        "lifetime_value": sqlalchemy.Numeric(10, 2)
    }
)

Executing SQL with SQLAlchemy Core

For non-pandas SQL execution, SQLAlchemy provides a clean connect() context manager:

Python

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/analytics")

# Read data using connection context manager
with engine.connect() as conn:
    result = conn.execute(text("SELECT COUNT(*) FROM customers"))
    count = result.scalar()  # Fetch single value
    print(f"Total customers: {count:,}")

# Execute multiple statements in a transaction
with engine.begin() as conn:   # engine.begin() auto-commits on success
    conn.execute(
        text("UPDATE customers SET is_premium = true WHERE lifetime_value > :threshold"),
        {"threshold": 500.0}
    )
    conn.execute(
        text("INSERT INTO audit_log (action, ts) VALUES (:action, NOW())"),
        {"action": "premium_upgrade_batch"}
    )
# Automatically committed here (or rolled back if exception occurred)

# Run a parameterized query safely
with engine.connect() as conn:
    df = pd.read_sql_query(
        text("SELECT * FROM orders WHERE status = :status AND total_amount > :min_amount"),
        conn,
        params={"status": "shipped", "min_amount": 100.0}
    )

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/analytics")

# Read data using connection context manager
with engine.connect() as conn:
    result = conn.execute(text("SELECT COUNT(*) FROM customers"))
    count = result.scalar()  # Fetch single value
    print(f"Total customers: {count:,}")

# Execute multiple statements in a transaction
with engine.begin() as conn:   # engine.begin() auto-commits on success
    conn.execute(
        text("UPDATE customers SET is_premium = true WHERE lifetime_value > :threshold"),
        {"threshold": 500.0}
    )
    conn.execute(
        text("INSERT INTO audit_log (action, ts) VALUES (:action, NOW())"),
        {"action": "premium_upgrade_batch"}
    )
# Automatically committed here (or rolled back if exception occurred)

# Run a parameterized query safely
with engine.connect() as conn:
    df = pd.read_sql_query(
        text("SELECT * FROM orders WHERE status = :status AND total_amount > :min_amount"),
        conn,
        params={"status": "shipped", "min_amount": 100.0}
    )

Connection Pooling

One of SQLAlchemy’s most important features for production environments is connection pooling — maintaining a pool of open database connections that can be reused across requests, rather than opening a new connection for every query.

Opening a database connection is expensive: it involves network round-trips, authentication, and session setup. For a data pipeline that runs 1,000 queries, opening 1,000 connections would be enormously wasteful. A connection pool opens a fixed number of connections and lends them out on demand:

Python

from sqlalchemy import create_engine

# Configure connection pool
engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/analytics",
    pool_size=5,          # Number of connections to maintain in pool
    max_overflow=10,      # Additional connections beyond pool_size if pool is full
    pool_timeout=30,      # Seconds to wait for a connection from pool
    pool_recycle=3600,    # Recycle connections older than 1 hour (avoids stale conn errors)
    pool_pre_ping=True,   # Test connections before handing them out (detects dropped conns)
    echo=False            # Set True to log all SQL executed (useful for debugging)
)

from sqlalchemy import create_engine

# Configure connection pool
engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/analytics",
    pool_size=5,          # Number of connections to maintain in pool
    max_overflow=10,      # Additional connections beyond pool_size if pool is full
    pool_timeout=30,      # Seconds to wait for a connection from pool
    pool_recycle=3600,    # Recycle connections older than 1 hour (avoids stale conn errors)
    pool_pre_ping=True,   # Test connections before handing them out (detects dropped conns)
    echo=False            # Set True to log all SQL executed (useful for debugging)
)

For data science notebooks and scripts that run once and exit, the default pool settings are fine. For web applications or data pipelines that run continuously, pool configuration matters significantly.

Secure Credential Management

Never hardcode database credentials in your Python code. This is the most important security principle in database connectivity. Hardcoded credentials in code get committed to Git, accidentally exposed in screenshots, and found in log files — leading to data breaches that end careers and companies.

Method 1: Environment Variables (Recommended)

Store credentials in environment variables — not in code:

Python

# Set in your shell or .env file (never commit .env to Git)
export DB_HOST="warehouse.company.internal"
export DB_PORT="5432"
export DB_NAME="analytics"
export DB_USER="jane_smith"
export DB_PASSWORD="s3cur3_p@ssw0rd"

# Set in your shell or .env file (never commit .env to Git)
export DB_HOST="warehouse.company.internal"
export DB_PORT="5432"
export DB_NAME="analytics"
export DB_USER="jane_smith"
export DB_PASSWORD="s3cur3_p@ssw0rd"

Python

import os
from sqlalchemy import create_engine

# Read from environment variables — never from code
host = os.environ["DB_HOST"]
port = os.environ["DB_PORT"]
db   = os.environ["DB_NAME"]
user = os.environ["DB_USER"]
pwd  = os.environ["DB_PASSWORD"]

engine = create_engine(f"postgresql+psycopg2://{user}:{pwd}@{host}:{port}/{db}")

import os
from sqlalchemy import create_engine

# Read from environment variables — never from code
host = os.environ["DB_HOST"]
port = os.environ["DB_PORT"]
db   = os.environ["DB_NAME"]
user = os.environ["DB_USER"]
pwd  = os.environ["DB_PASSWORD"]

engine = create_engine(f"postgresql+psycopg2://{user}:{pwd}@{host}:{port}/{db}")

Method 2: python-dotenv for Local Development

The python-dotenv library loads environment variables from a .env file automatically — perfect for local development:

Python

pip install python-dotenv

pip install python-dotenv

Python

# .env file (add to .gitignore — NEVER commit this file)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=analytics
DB_USER=jane_smith
DB_PASSWORD=local_dev_password

# .env file (add to .gitignore — NEVER commit this file)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=analytics
DB_USER=jane_smith
DB_PASSWORD=local_dev_password

Python

from dotenv import load_dotenv
import os
from sqlalchemy import create_engine

# Load variables from .env file into os.environ
load_dotenv()

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
)

from dotenv import load_dotenv
import os
from sqlalchemy import create_engine

# Load variables from .env file into os.environ
load_dotenv()

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
)

Python

# .gitignore — ALWAYS include this
.env
*.env
.env.local
.env.production

# .gitignore — ALWAYS include this
.env
*.env
.env.local
.env.production

Provide a .env.example file with placeholder values to document required variables:

Python

# .env.example (safe to commit — no real credentials)
DB_HOST=your_data.base_host
DB_PORT=5432
DB_NAME=your_data.base_name
DB_USER=your_username
DB_PASSWORD=your_password

# .env.example (safe to commit — no real credentials)
DB_HOST=your_data.base_host
DB_PORT=5432
DB_NAME=your_data.base_name
DB_USER=your_username
DB_PASSWORD=your_password

Method 3: Secret Management Services (Production)

For production systems, use dedicated secret management:

Python

# AWS Secrets Manager
import boto3
import json
from sqlalchemy import create_engine

def get_db_credentials(secret_name: str, region: str = "us-east-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    secret = client.get_secret_value(SecretId=secret_name)
    return json.loads(secret["SecretString"])

creds = get_db_credentials("prod/analytics/db-credentials")
engine = create_engine(
    f"postgresql+psycopg2://{creds['username']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['dbname']}"
)

# AWS Secrets Manager
import boto3
import json
from sqlalchemy import create_engine

def get_db_credentials(secret_name: str, region: str = "us-east-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    secret = client.get_secret_value(SecretId=secret_name)
    return json.loads(secret["SecretString"])

creds = get_db_credentials("prod/analytics/db-credentials")
engine = create_engine(
    f"postgresql+psycopg2://{creds['username']}:{creds['password']}"
    f"@{creds['host']}:{creds['port']}/{creds['dbname']}"
)

Other services: HashiCorp Vault, GCP Secret Manager, Azure Key Vault.

Building a Reusable Connection Factory

A clean pattern: centralize all connection logic in a single function:

Python

# src/data.base.py

import os
from functools import lru_cache
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from dotenv import load_dotenv

load_dotenv()

@lru_cache(maxsize=None)
def get_engine(data.base: str = "analytics") -> Engine:
    """
    Create and cache a SQLAlchemy engine for the specified data.base.
    
    Reads credentials from environment variables. Caches the engine
    so only one is created per data.base per process lifetime.
    
    Parameters
    ----------
    data.base : str
        The data.base name to connect to. Defaults to 'analytics'.
    
    Returns
    -------
    Engine
        Configured SQLAlchemy engine with connection pooling.
    
    Raises
    ------
    KeyError
        If required environment variables are not set.
    """
    return create_engine(
        f"postgresql+psycopg2://"
        f"{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{data.base}",
        pool_size=5,
        max_overflow=10,
        pool_recycle=3600,
        pool_pre_ping=True
    )

# Usage throughout the codebase:
# from src.data.base import get_engine
# df = pd.read_sql_query(query, get_engine())

# src/data.base.py

import os
from functools import lru_cache
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from dotenv import load_dotenv

load_dotenv()

@lru_cache(maxsize=None)
def get_engine(data.base: str = "analytics") -> Engine:
    """
    Create and cache a SQLAlchemy engine for the specified data.base.
    
    Reads credentials from environment variables. Caches the engine
    so only one is created per data.base per process lifetime.
    
    Parameters
    ----------
    data.base : str
        The data.base name to connect to. Defaults to 'analytics'.
    
    Returns
    -------
    Engine
        Configured SQLAlchemy engine with connection pooling.
    
    Raises
    ------
    KeyError
        If required environment variables are not set.
    """
    return create_engine(
        f"postgresql+psycopg2://"
        f"{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{data.base}",
        pool_size=5,
        max_overflow=10,
        pool_recycle=3600,
        pool_pre_ping=True
    )

# Usage throughout the codebase:
# from src.data.base import get_engine
# df = pd.read_sql_query(query, get_engine())

The @lru_cache decorator ensures the engine is created once and reused — you don’t accidentally create dozens of separate connection pools.

Reading Data Efficiently from Databases

Chunked Reading for Large Datasets

When a query returns millions of rows, loading everything into memory at once can exhaust RAM. Both pandas and SQLAlchemy support reading in chunks:

Python

import pandas as pd
from sqlalchemy import create_engine

engine = get_engine()

# pandas chunksize parameter — returns an iterator of DataFrames
chunk_iter = pd.read_sql_query(
    "SELECT * FROM transactions WHERE transaction_date >= '2024-01-01'",
    engine,
    chunksize=50_000   # Process 50,000 rows at a time
)

results = []
for i, chunk in enumerate(chunk_iter):
    # Process each chunk
    chunk_processed = engineer_features(chunk)
    results.append(chunk_processed)
    print(f"Processed chunk {i+1}: {len(chunk):,} rows")

df_final = pd.concat(results, ignore_index=True)
print(f"Total rows processed: {len(df_final):,}")

import pandas as pd
from sqlalchemy import create_engine

engine = get_engine()

# pandas chunksize parameter — returns an iterator of DataFrames
chunk_iter = pd.read_sql_query(
    "SELECT * FROM transactions WHERE transaction_date >= '2024-01-01'",
    engine,
    chunksize=50_000   # Process 50,000 rows at a time
)

results = []
for i, chunk in enumerate(chunk_iter):
    # Process each chunk
    chunk_processed = engineer_features(chunk)
    results.append(chunk_processed)
    print(f"Processed chunk {i+1}: {len(chunk):,} rows")

df_final = pd.concat(results, ignore_index=True)
print(f"Total rows processed: {len(df_final):,}")

Pushing Computation to the Database

Pull only the aggregated data you need, not the raw rows:

Python

# Inefficient: Pull 10 million rows to Python, then aggregate
df = pd.read_sql_query("SELECT * FROM transactions", engine)
result = df.groupby("customer_id")["amount"].sum()

# Efficient: Aggregate in the data.base, pull 100K summary rows
result = pd.read_sql_query("""
    SELECT 
        customer_id,
        COUNT(*)          AS num_transactions,
        SUM(amount)       AS total_spend,
        AVG(amount)       AS avg_transaction,
        MAX(transaction_date) AS last_transaction_date
    FROM transactions
    WHERE transaction_date >= '2024-01-01'
    GROUP BY customer_id
""", engine)

# Inefficient: Pull 10 million rows to Python, then aggregate
df = pd.read_sql_query("SELECT * FROM transactions", engine)
result = df.groupby("customer_id")["amount"].sum()

# Efficient: Aggregate in the data.base, pull 100K summary rows
result = pd.read_sql_query("""
    SELECT 
        customer_id,
        COUNT(*)          AS num_transactions,
        SUM(amount)       AS total_spend,
        AVG(amount)       AS avg_transaction,
        MAX(transaction_date) AS last_transaction_date
    FROM transactions
    WHERE transaction_date >= '2024-01-01'
    GROUP BY customer_id
""", engine)

Reading from Multiple Databases in One Script

Python

from src.data.base import get_engine
import pandas as pd

# Connect to multiple data.bases
crm_engine = get_engine("crm_db")
warehouse_engine = get_engine("analytics_db")

# Pull data from different sources
customers = pd.read_sql_query(
    "SELECT customer_id, email, plan_type FROM customers",
    crm_engine
)

orders = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS total_spend FROM orders GROUP BY customer_id",
    warehouse_engine
)

# Join in Python
combined = customers.merge(orders, on="customer_id", how="left")
combined["total_spend"].fillna(0, inplace=True)

from src.data.base import get_engine
import pandas as pd

# Connect to multiple data.bases
crm_engine = get_engine("crm_db")
warehouse_engine = get_engine("analytics_db")

# Pull data from different sources
customers = pd.read_sql_query(
    "SELECT customer_id, email, plan_type FROM customers",
    crm_engine
)

orders = pd.read_sql_query(
    "SELECT customer_id, SUM(amount) AS total_spend FROM orders GROUP BY customer_id",
    warehouse_engine
)

# Join in Python
combined = customers.merge(orders, on="customer_id", how="left")
combined["total_spend"].fillna(0, inplace=True)

Writing Data Back to Databases

Data scientists don’t just read from databases — they write back too: model predictions, feature tables, analysis results, monitoring metrics.

DataFrame.to_sql(): The Standard Approach

Python

import pandas as pd
from sqlalchemy import create_engine
import sqlalchemy

engine = get_engine()

# Model predictions
predictions_df = pd.DataFrame({
    "customer_id":       X_test["customer_id"].values,
    "churn_probability": model.predict_proba(X_test_features)[:, 1],
    "predicted_label":   model.predict(X_test_features),
    "model_version":     "xgboost_v3",
    "scored_at":         pd.Timestamp.now()
})

predictions_df.to_sql(
    name="model_predictions",
    con=engine,
    schema="ml",             # Write to specific sc.hema (optional)
    if_exists="append",      # Append to existing table
    index=False,
    method="multi",          # Use multi-row INSERT (faster than row-by-row)
    chunksize=1000,          # Write in batches of 1000 rows
    dtype={
        "customer_id":       sqlalchemy.Text(),
        "churn_probability": sqlalchemy.Float(),
        "predicted_label":   sqlalchemy.Integer(),
        "model_version":     sqlalchemy.Text(),
        "scored_at":         sqlalchemy.DateTime()
    }
)
print(f"Wrote {len(predictions_df):,} predictions to data.base.")

import pandas as pd
from sqlalchemy import create_engine
import sqlalchemy

engine = get_engine()

# Model predictions
predictions_df = pd.DataFrame({
    "customer_id":       X_test["customer_id"].values,
    "churn_probability": model.predict_proba(X_test_features)[:, 1],
    "predicted_label":   model.predict(X_test_features),
    "model_version":     "xgboost_v3",
    "scored_at":         pd.Timestamp.now()
})

predictions_df.to_sql(
    name="model_predictions",
    con=engine,
    schema="ml",             # Write to specific sc.hema (optional)
    if_exists="append",      # Append to existing table
    index=False,
    method="multi",          # Use multi-row INSERT (faster than row-by-row)
    chunksize=1000,          # Write in batches of 1000 rows
    dtype={
        "customer_id":       sqlalchemy.Text(),
        "churn_probability": sqlalchemy.Float(),
        "predicted_label":   sqlalchemy.Integer(),
        "model_version":     sqlalchemy.Text(),
        "scored_at":         sqlalchemy.DateTime()
    }
)
print(f"Wrote {len(predictions_df):,} predictions to data.base.")

Upsert: Insert or Update

Standard to_sql() doesn’t support upserts (insert if new, update if exists). For upserts, use SQLAlchemy with database-specific syntax:

Python

from sqlalchemy import create_engine, text

engine = get_engine()

# PostgreSQL upsert (ON CONFLICT DO UPDATE)
upsert_query = text("""
    INSERT INTO customer_features 
        (customer_id, churn_score, segment, updated_at)
    VALUES 
        (:customer_id, :churn_score, :segment, NOW())
    ON CONFLICT (customer_id) 
    DO UPDATE SET
        churn_score = EXCLUDED.churn_score,
        segment     = EXCLUDED.segment,
        updated_at  = NOW()
""")

with engine.begin() as conn:
    for _, row in predictions_df.iterrows():
        conn.execute(upsert_query, {
            "customer_id": row["customer_id"],
            "churn_score": row["churn_probability"],
            "segment":     row["predicted_segment"]
        })

from sqlalchemy import create_engine, text

engine = get_engine()

# PostgreSQL upsert (ON CONFLICT DO UPDATE)
upsert_query = text("""
    INSERT INTO customer_features 
        (customer_id, churn_score, segment, updated_at)
    VALUES 
        (:customer_id, :churn_score, :segment, NOW())
    ON CONFLICT (customer_id) 
    DO UPDATE SET
        churn_score = EXCLUDED.churn_score,
        segment     = EXCLUDED.segment,
        updated_at  = NOW()
""")

with engine.begin() as conn:
    for _, row in predictions_df.iterrows():
        conn.execute(upsert_query, {
            "customer_id": row["customer_id"],
            "churn_score": row["churn_probability"],
            "segment":     row["predicted_segment"]
        })

For bulk upserts, use the execute_many pattern or database-specific bulk tools.

Connecting to Common Data Warehouse Systems

Google BigQuery

Python

pip install google-cloud-bigquery pandas-gbq sqlalchemy-bigquery

pip install google-cloud-bigquery pandas-gbq sqlalchemy-bigquery

Python

from google.cloud import bigquery
import pandas as pd

# BigQuery uses project.dataset.table format
# Authenticate via: gcloud auth application-default login
# Or: GOOGLE_APPLICATION_CREDENTIALS env var pointing to service account JSON

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT
        customer_id,
        SUM(transaction_amount) AS total_spend,
        COUNT(DISTINCT transaction_id) AS num_transactions
    FROM `my-gcp-project.analytics.transactions`
    WHERE DATE(transaction_timestamp) >= '2024-01-01'
    GROUP BY customer_id
"""

# Method 1: BigQuery client directly
df = client.query(query).to_dataframe()

# Method 2: pandas-gbq (simpler for exploratory work)
df = pd.read_gbq(query, project_id="my-gcp-project")

# Method 3: SQLAlchemy engine
from sqlalchemy import create_engine
engine = create_engine("bigquery://my-gcp-project/analytics")
df = pd.read_sql_query(query, engine)

print(f"Retrieved {len(df):,} rows from BigQuery")

from google.cloud import bigquery
import pandas as pd

# BigQuery uses project.dataset.table format
# Authenticate via: gcloud auth application-default login
# Or: GOOGLE_APPLICATION_CREDENTIALS env var pointing to service account JSON

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT
        customer_id,
        SUM(transaction_amount) AS total_spend,
        COUNT(DISTINCT transaction_id) AS num_transactions
    FROM `my-gcp-project.analytics.transactions`
    WHERE DATE(transaction_timestamp) >= '2024-01-01'
    GROUP BY customer_id
"""

# Method 1: BigQuery client directly
df = client.query(query).to_dataframe()

# Method 2: pandas-gbq (simpler for exploratory work)
df = pd.read_gbq(query, project_id="my-gcp-project")

# Method 3: SQLAlchemy engine
from sqlalchemy import create_engine
engine = create_engine("bigquery://my-gcp-project/analytics")
df = pd.read_sql_query(query, engine)

print(f"Retrieved {len(df):,} rows from BigQuery")

Snowflake

Python

pip install snowflake-sqlalchemy snowflake-connector-python

pip install snowflake-sqlalchemy snowflake-connector-python

Python

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    f"snowflake://{os.environ['SNOWFLAKE_USER']}:{os.environ['SNOWFLAKE_PASSWORD']}"
    f"@{os.environ['SNOWFLAKE_ACCOUNT']}/{os.environ['SNOWFLAKE_DATA.BASE']}"
    f"/{os.environ['SNOWFLAKE_SCHEMA']}"
    f"?warehouse={os.environ['SNOWFLAKE_WAREHOUSE']}&role={os.environ['SNOWFLAKE_ROLE']}"
)

# All standard SQLAlchemy / pandas patterns work identically
df = pd.read_sql_query("""
    SELECT * FROM customer_features
    WHERE CREATED_AT >= DATEADD(day, -30, CURRENT_DATE())
""", engine)

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    f"snowflake://{os.environ['SNOWFLAKE_USER']}:{os.environ['SNOWFLAKE_PASSWORD']}"
    f"@{os.environ['SNOWFLAKE_ACCOUNT']}/{os.environ['SNOWFLAKE_DATA.BASE']}"
    f"/{os.environ['SNOWFLAKE_SCHEMA']}"
    f"?warehouse={os.environ['SNOWFLAKE_WAREHOUSE']}&role={os.environ['SNOWFLAKE_ROLE']}"
)

# All standard SQLAlchemy / pandas patterns work identically
df = pd.read_sql_query("""
    SELECT * FROM customer_features
    WHERE CREATED_AT >= DATEADD(day, -30, CURRENT_DATE())
""", engine)

Snowflake

Python

pip install snowflake-sqlalchemy snowflake-connector-python

pip install snowflake-sqlalchemy snowflake-connector-python

Python

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    f"snowflake://{os.environ['SNOWFLAKE_USER']}:{os.environ['SNOWFLAKE_PASSWORD']}"
    f"@{os.environ['SNOWFLAKE_ACCOUNT']}/{os.environ['SNOWFLAKE_DATA.BASE']}"
    f"/{os.environ['SNOWFLAKE_SCHEMA']}"
    f"?warehouse={os.environ['SNOWFLAKE_WAREHOUSE']}&role={os.environ['SNOWFLAKE_ROLE']}"
)

# All standard SQLAlchemy / pandas patterns work identically
df = pd.read_sql_query("""
    SELECT * FROM customer_features
    WHERE CREATED_AT >= DATEADD(day, -30, CURRENT_DATE())
""", engine)

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine(
    f"snowflake://{os.environ['SNOWFLAKE_USER']}:{os.environ['SNOWFLAKE_PASSWORD']}"
    f"@{os.environ['SNOWFLAKE_ACCOUNT']}/{os.environ['SNOWFLAKE_DATA.BASE']}"
    f"/{os.environ['SNOWFLAKE_SCHEMA']}"
    f"?warehouse={os.environ['SNOWFLAKE_WAREHOUSE']}&role={os.environ['SNOWFLAKE_ROLE']}"
)

# All standard SQLAlchemy / pandas patterns work identically
df = pd.read_sql_query("""
    SELECT * FROM customer_features
    WHERE CREATED_AT >= DATEADD(day, -30, CURRENT_DATE())
""", engine)

Amazon Redshift

Python

pip install redshift-connector sqlalchemy-redshift

pip install redshift-connector sqlalchemy-redshift

Python

import redshift_connector
import pandas as pd

# Direct connection
conn = redshift_connector.connect(
    host=os.environ["REDSHIFT_HOST"],
    data.base=os.environ["REDSHIFT_DB"],
    port=5439,
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"]
)

df = pd.read_sql_query("SELECT * FROM analytics.customer_metrics LIMIT 1000", conn)

# Or via SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine(
    f"redshift+redshift_connector://{os.environ['REDSHIFT_USER']}:{os.environ['REDSHIFT_PASSWORD']}"
    f"@{os.environ['REDSHIFT_HOST']}:5439/{os.environ['REDSHIFT_DB']}"
)

import redshift_connector
import pandas as pd

# Direct connection
conn = redshift_connector.connect(
    host=os.environ["REDSHIFT_HOST"],
    data.base=os.environ["REDSHIFT_DB"],
    port=5439,
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"]
)

df = pd.read_sql_query("SELECT * FROM analytics.customer_metrics LIMIT 1000", conn)

# Or via SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine(
    f"redshift+redshift_connector://{os.environ['REDSHIFT_USER']}:{os.environ['REDSHIFT_PASSWORD']}"
    f"@{os.environ['REDSHIFT_HOST']}:5439/{os.environ['REDSHIFT_DB']}"
)

DuckDB: The Data Scientist’s Local Analytical Database

DuckDB deserves special mention. It’s an in-process OLAP database — like SQLite for analytical queries — that can query pandas DataFrames, Parquet files, and CSV files directly with SQL, without loading them into a database first.

Python

pip install duckdb

pip install duckdb

Python

import duckdb
import pandas as pd

# Query a pandas DataFrame directly with SQL
customers_df = pd.read_parquet("data/processed/customers.parquet")
orders_df    = pd.read_parquet("data/processed/orders.parquet")

result = duckdb.query("""
    SELECT
        c.customer_id,
        c.city,
        COUNT(o.order_id)   AS num_orders,
        SUM(o.total_amount) AS lifetime_value
    FROM customers_df c
    LEFT JOIN orders_df o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.city
    ORDER BY lifetime_value DESC
""").df()   # Returns a pandas DataFrame

# Query CSV and Parquet files directly — no loading needed
result = duckdb.query("""
    SELECT 
        product_category,
        SUM(amount) AS total_revenue
    FROM read_csv_auto('data/raw/transactions.csv')
    GROUP BY product_category
    ORDER BY total_revenue DESC
""").df()

# Query multiple Parquet files as one table (glob pattern)
result = duckdb.query("""
    SELECT COUNT(*) AS total_rows
    FROM read_parquet('data/raw/transactions_2024_*.parquet')
""").df()

import duckdb
import pandas as pd

# Query a pandas DataFrame directly with SQL
customers_df = pd.read_parquet("data/processed/customers.parquet")
orders_df    = pd.read_parquet("data/processed/orders.parquet")

result = duckdb.query("""
    SELECT
        c.customer_id,
        c.city,
        COUNT(o.order_id)   AS num_orders,
        SUM(o.total_amount) AS lifetime_value
    FROM customers_df c
    LEFT JOIN orders_df o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.city
    ORDER BY lifetime_value DESC
""").df()   # Returns a pandas DataFrame

# Query CSV and Parquet files directly — no loading needed
result = duckdb.query("""
    SELECT 
        product_category,
        SUM(amount) AS total_revenue
    FROM read_csv_auto('data/raw/transactions.csv')
    GROUP BY product_category
    ORDER BY total_revenue DESC
""").df()

# Query multiple Parquet files as one table (glob pattern)
result = duckdb.query("""
    SELECT COUNT(*) AS total_rows
    FROM read_parquet('data/raw/transactions_2024_*.parquet')
""").df()

DuckDB is remarkably fast for analytical queries on local data — often faster than equivalent pandas operations — and its SQL-on-DataFrames capability eliminates the need to load data into a separate database for exploration.

Error Handling and Robust Connection Code

Production database code needs to handle failures gracefully:

Python

import logging
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError, OperationalError
import pandas as pd
import time

logger = logging.getLogger(__name__)

def execute_query_with_retry(
    query: str,
    engine,
    params: dict = None,
    max_retries: int = 3,
    retry_delay: float = 2.0
) -> pd.DataFrame:
    """
    Execute a SQL query with automatic retry on transient errors.
    
    Parameters
    ----------
    query : str
        SQL query to execute.
    engine : sqlalchemy.engine.Engine
        Data.base engine to use.
    params : dict, optional
        Query parameters for parameterized queries.
    max_retries : int, optional
        Maximum number of retry attempts. By default 3.
    retry_delay : float, optional
        Seconds to wait between retries. By default 2.0.
    
    Returns
    -------
    pd.DataFrame
        Query results.
    
    Raises
    ------
    SQLAlchemyError
        If all retries are exhausted or a non-transient error occurs.
    """
    last_error = None
    
    for attempt in range(max_retries):
        try:
            with engine.connect() as conn:
                if params:
                    df = pd.read_sql_query(text(query), conn, params=params)
                else:
                    df = pd.read_sql_query(query, conn)
                logger.info(f"Query returned {len(df):,} rows")
                return df
                
        except OperationalError as e:
            # Network/connectivity errors — worth retrying
            last_error = e
            if attempt < max_retries - 1:
                logger.warning(
                    f"Data.base connection error (attempt {attempt + 1}/{max_retries}): {e}. "
                    f"Retrying in {retry_delay}s..."
                )
                time.sleep(retry_delay)
            else:
                logger.error(f"All {max_retries} retry attempts failed.")
                
        except SQLAlchemyError as e:
            # Logic/syntax errors — don't retry, fail fast
            logger.error(f"SQL error (not retrying): {e}")
            raise
    
    raise last_error


def safe_write_dataframe(
    df: pd.DataFrame,
    table_name: str,
    engine,
    if_exists: str = "append",
    chunksize: int = 1000
) -> int:
    """Write a DataFrame to a data.base table with error handling."""
    try:
        df.to_sql(
            name=table_name,
            con=engine,
            if_exists=if_exists,
            index=False,
            method="multi",
            chunksize=chunksize
        )
        logger.info(f"Successfully wrote {len(df):,} rows to {table_name}")
        return len(df)
        
    except SQLAlchemyError as e:
        logger.error(f"Failed to write to {table_name}: {e}")
        raise

import logging
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError, OperationalError
import pandas as pd
import time

logger = logging.getLogger(__name__)

def execute_query_with_retry(
    query: str,
    engine,
    params: dict = None,
    max_retries: int = 3,
    retry_delay: float = 2.0
) -> pd.DataFrame:
    """
    Execute a SQL query with automatic retry on transient errors.
    
    Parameters
    ----------
    query : str
        SQL query to execute.
    engine : sqlalchemy.engine.Engine
        Data.base engine to use.
    params : dict, optional
        Query parameters for parameterized queries.
    max_retries : int, optional
        Maximum number of retry attempts. By default 3.
    retry_delay : float, optional
        Seconds to wait between retries. By default 2.0.
    
    Returns
    -------
    pd.DataFrame
        Query results.
    
    Raises
    ------
    SQLAlchemyError
        If all retries are exhausted or a non-transient error occurs.
    """
    last_error = None
    
    for attempt in range(max_retries):
        try:
            with engine.connect() as conn:
                if params:
                    df = pd.read_sql_query(text(query), conn, params=params)
                else:
                    df = pd.read_sql_query(query, conn)
                logger.info(f"Query returned {len(df):,} rows")
                return df
                
        except OperationalError as e:
            # Network/connectivity errors — worth retrying
            last_error = e
            if attempt < max_retries - 1:
                logger.warning(
                    f"Data.base connection error (attempt {attempt + 1}/{max_retries}): {e}. "
                    f"Retrying in {retry_delay}s..."
                )
                time.sleep(retry_delay)
            else:
                logger.error(f"All {max_retries} retry attempts failed.")
                
        except SQLAlchemyError as e:
            # Logic/syntax errors — don't retry, fail fast
            logger.error(f"SQL error (not retrying): {e}")
            raise
    
    raise last_error


def safe_write_dataframe(
    df: pd.DataFrame,
    table_name: str,
    engine,
    if_exists: str = "append",
    chunksize: int = 1000
) -> int:
    """Write a DataFrame to a data.base table with error handling."""
    try:
        df.to_sql(
            name=table_name,
            con=engine,
            if_exists=if_exists,
            index=False,
            method="multi",
            chunksize=chunksize
        )
        logger.info(f"Successfully wrote {len(df):,} rows to {table_name}")
        return len(df)
        
    except SQLAlchemyError as e:
        logger.error(f"Failed to write to {table_name}: {e}")
        raise

Complete Working Example: A Data Science Pipeline

Putting it all together — a complete pipeline that extracts data, processes it, and writes results back:

Python

"""
churn_scoring_pipeline.py

Extract customer data → score with churn model → write predictions back to data.base.
"""

import os
import logging
import pandas as pd
import joblib
from dotenv import load_dotenv
from sqlalchemy import create_engine
from functools import lru_cache

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Load credentials from .env
load_dotenv()


@lru_cache(maxsize=None)
def get_engine():
    return create_engine(
        f"postgresql+psycopg2://"
        f"{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}",
        pool_pre_ping=True
    )


FEATURE_QUERY = """
    WITH order_stats AS (
        SELECT
            customer_id,
            COUNT(order_id)                    AS total_orders,
            SUM(total_amount)                  AS lifetime_value,
            AVG(total_amount)                  AS avg_order_value,
            MAX(order_date::date)              AS last_order_date,
            CURRENT_DATE - MAX(order_date::date) AS recency_days
        FROM orders
        WHERE order_date >= CURRENT_DATE - INTERVAL '365 days'
        GROUP BY customer_id
    )
    SELECT
        c.customer_id,
        c.is_premium::int            AS is_premium,
        c.country,
        COALESCE(os.total_orders, 0)    AS total_orders,
        COALESCE(os.lifetime_value, 0)  AS lifetime_value,
        COALESCE(os.avg_order_value, 0) AS avg_order_value,
        COALESCE(os.recency_days, 999)  AS recency_days
    FROM customers c
    LEFT JOIN order_stats os ON c.customer_id = os.customer_id
    WHERE c.signup_date >= '2020-01-01'
"""

FEATURE_COLUMNS = [
    "is_premium", "total_orders", "lifetime_value",
    "avg_order_value", "recency_days"
]


def extract_features() -> pd.DataFrame:
    logger.info("Extracting customer features from data.base...")
    engine = get_engine()
    df = pd.read_sql_query(FEATURE_QUERY, engine)
    logger.info(f"Extracted {len(df):,} customer records")
    return df


def score_customers(df: pd.DataFrame) -> pd.DataFrame:
    logger.info("Loading model and scoring customers...")
    model = joblib.load("models/churn_xgboost_v3.pkl")
    X = df[FEATURE_COLUMNS]
    df["churn_probability"] = model.predict_proba(X)[:, 1]
    df["model_version"] = "xgboost_v3"
    df["scored_at"] = pd.Timestamp.now()
    logger.info(f"Scored {len(df):,} customers. "
                f"Mean churn probability: {df['churn_probability'].mean():.3f}")
    return df


def write_predictions(df: pd.DataFrame) -> None:
    logger.info("Writing predictions to data.base...")
    output_df = df[["customer_id", "churn_probability", "model_version", "scored_at"]]
    engine = get_engine()
    output_df.to_sql(
        name="churn_predictions",
        con=engine,
        schema="ml",
        if_exists="replace",
        index=False,
        method="multi",
        chunksize=2000
    )
    logger.info(f"Wrote {len(output_df):,} predictions to ml.churn_predictions")


def main():
    logger.info("=== Churn Scoring Pipeline Starting ===")
    df = extract_features()
    df = score_customers(df)
    write_predictions(df)
    logger.info("=== Pipeline Complete ===")


if __name__ == "__main__":
    main()

"""
churn_scoring_pipeline.py

Extract customer data → score with churn model → write predictions back to data.base.
"""

import os
import logging
import pandas as pd
import joblib
from dotenv import load_dotenv
from sqlalchemy import create_engine
from functools import lru_cache

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Load credentials from .env
load_dotenv()


@lru_cache(maxsize=None)
def get_engine():
    return create_engine(
        f"postgresql+psycopg2://"
        f"{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
        f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}",
        pool_pre_ping=True
    )


FEATURE_QUERY = """
    WITH order_stats AS (
        SELECT
            customer_id,
            COUNT(order_id)                    AS total_orders,
            SUM(total_amount)                  AS lifetime_value,
            AVG(total_amount)                  AS avg_order_value,
            MAX(order_date::date)              AS last_order_date,
            CURRENT_DATE - MAX(order_date::date) AS recency_days
        FROM orders
        WHERE order_date >= CURRENT_DATE - INTERVAL '365 days'
        GROUP BY customer_id
    )
    SELECT
        c.customer_id,
        c.is_premium::int            AS is_premium,
        c.country,
        COALESCE(os.total_orders, 0)    AS total_orders,
        COALESCE(os.lifetime_value, 0)  AS lifetime_value,
        COALESCE(os.avg_order_value, 0) AS avg_order_value,
        COALESCE(os.recency_days, 999)  AS recency_days
    FROM customers c
    LEFT JOIN order_stats os ON c.customer_id = os.customer_id
    WHERE c.signup_date >= '2020-01-01'
"""

FEATURE_COLUMNS = [
    "is_premium", "total_orders", "lifetime_value",
    "avg_order_value", "recency_days"
]


def extract_features() -> pd.DataFrame:
    logger.info("Extracting customer features from data.base...")
    engine = get_engine()
    df = pd.read_sql_query(FEATURE_QUERY, engine)
    logger.info(f"Extracted {len(df):,} customer records")
    return df


def score_customers(df: pd.DataFrame) -> pd.DataFrame:
    logger.info("Loading model and scoring customers...")
    model = joblib.load("models/churn_xgboost_v3.pkl")
    X = df[FEATURE_COLUMNS]
    df["churn_probability"] = model.predict_proba(X)[:, 1]
    df["model_version"] = "xgboost_v3"
    df["scored_at"] = pd.Timestamp.now()
    logger.info(f"Scored {len(df):,} customers. "
                f"Mean churn probability: {df['churn_probability'].mean():.3f}")
    return df


def write_predictions(df: pd.DataFrame) -> None:
    logger.info("Writing predictions to data.base...")
    output_df = df[["customer_id", "churn_probability", "model_version", "scored_at"]]
    engine = get_engine()
    output_df.to_sql(
        name="churn_predictions",
        con=engine,
        schema="ml",
        if_exists="replace",
        index=False,
        method="multi",
        chunksize=2000
    )
    logger.info(f"Wrote {len(output_df):,} predictions to ml.churn_predictions")


def main():
    logger.info("=== Churn Scoring Pipeline Starting ===")
    df = extract_features()
    df = score_customers(df)
    write_predictions(df)
    logger.info("=== Pipeline Complete ===")


if __name__ == "__main__":
    main()

Common Mistakes and How to Avoid Them

Mistake	Problem	Fix
Hardcoding credentials	Security breach, Git exposure	Use environment variables + `.env`
Not closing connections	Connection leaks, database locks	Use context managers (`with`)
String concatenation for parameters	SQL injection vulnerability	Use parameterized queries (`?` or `%s`)
`SELECT *` on large tables	Pulling unnecessary data over network	Specify only needed columns
No LIMIT when exploring	Accidentally pulling millions of rows	Always use LIMIT when exploring
Opening connection per query	Performance: authentication overhead per call	Use SQLAlchemy engine with connection pool
`if_exists='replace'` carelessly	Drops and recreates table, loses indexes/permissions	Use `'append'` or only `'replace'` when intentional
Not handling NULL return from `fetchone()`	NoneType errors when query returns no rows	Check `if result is not None` before using
Forgetting `conn.commit()`	Changes not persisted — silently lost	Always commit after writes; use `engine.begin()`
Loading full dataset then aggregating in Python	Slow and memory-intensive for large data	Aggregate in SQL before pulling to Python

Summary

Connecting to databases from Python is a foundational data science skill that involves choosing the right driver, constructing connections safely and efficiently, executing queries using parameterized syntax, and managing the full lifecycle of connections — from opening through committing and closing. The standard professional pattern is SQLAlchemy engines paired with pandas’ read_sql_query() and to_sql() — a combination that covers 90% of data science database needs with clean, consistent code that works across all major databases.

The security principle — credentials always come from environment variables, never from code — is non-negotiable and must become reflex. The efficiency principle — push computation to the database, pull only what you need — keeps pipelines fast and memory-efficient. And the robustness principle — use context managers, parameterize queries, handle errors explicitly — keeps production code reliable.

With these skills, you can connect to SQLite for local development, PostgreSQL or MySQL for production systems, BigQuery or Snowflake for cloud analytics, and DuckDB for fast local analytical queries — all using the same familiar patterns, all in service of the data science work that depends on smooth, reliable data access.

Key Takeaways

Python database connectivity is standardized by PEP 249 (DB-API 2.0), giving all database drivers a common interface — learn the pattern once and it applies to SQLite, PostgreSQL, MySQL, and others
SQLAlchemy engines are the standard interface for pandas database operations — pd.read_sql_query(query, engine) to read and df.to_sql(name, engine) to write covers the majority of data science database needs
Connection strings follow the format dialect+driver://user:password@host:port/data.base — never hardcode credentials in this string; read them from environment variables using os.environ or python-dotenv
Always use context managers (with engine.connect() as conn:) or engine.begin() to ensure connections are properly committed and closed — even if an exception occurs
Parameterized queries (%s for psycopg2, :param for SQLAlchemy text) are mandatory for any query that includes variable values — string concatenation creates SQL injection vulnerabilities
Connection pooling (built into SQLAlchemy) eliminates the overhead of opening a new database connection for every query — configure it once in a shared get_engine() factory function used throughout the codebase
pd.read_sql_query() with chunksize enables memory-efficient processing of query results too large to fit in RAM at once — process in batches rather than loading everything at once
DuckDB enables SQL queries directly on pandas DataFrames and Parquet files without a separate database server — a powerful tool for local analytical work that doesn’t require loading data into a database

0 Comments

Discover More

K-Nearest Neighbors: The Simplest Classification Algorithm

Click For More

Connecting to Databases from Python

Introduction

The Python Database Connectivity Landscape

The DB-API 2.0 Standard (PEP 249)

The Layered Architecture

SQLite: The Zero-Configuration Starting Point

Basic SQLite Connection

Executing Queries with a Cursor

Fetching Query Results

Reading into pandas Directly

The Context Manager Pattern: Safe Connection Handling

Writing DataFrames to SQLite

PostgreSQL: The Professional Standard

Installing psycopg2

Basic PostgreSQL Connection

PostgreSQL with pandas

Parameterized Queries in psycopg2

Transactions in PostgreSQL

SQLAlchemy: The Database Abstraction Layer

Creating an Engine

Using SQLAlchemy Engine with pandas

Executing SQL with SQLAlchemy Core

Connection Pooling

Secure Credential Management

Method 1: Environment Variables (Recommended)

Method 2: python-dotenv for Local Development

Method 3: Secret Management Services (Production)

Building a Reusable Connection Factory

Reading Data Efficiently from Databases

Chunked Reading for Large Datasets

Pushing Computation to the Database

Reading from Multiple Databases in One Script

Writing Data Back to Databases

DataFrame.to_sql(): The Standard Approach

Upsert: Insert or Update

Connecting to Common Data Warehouse Systems

Google BigQuery

Snowflake

Snowflake

Amazon Redshift

DuckDB: The Data Scientist’s Local Analytical Database

Error Handling and Robust Connection Code

Complete Working Example: A Data Science Pipeline

Common Mistakes and How to Avoid Them

Summary

Key Takeaways

Discover More

K-Nearest Neighbors: The Simplest Classification Algorithm

Version Control for AI Projects: Git and GitHub Essentials

Support Vector Machines: Finding the Optimal Boundary

Gruve Raises $50M Series A for Distributed AI Data Center Infrastructure

Chata Technologies Raises $10M for Deterministic AI in Finance

How to Find Files in Linux Using the find Command