Reading and Writing JSON Files in Python

Learn to read and write JSON files in Python. Master the json module, pandas JSON methods, nested data flattening, JSON Lines, large file handling, and schema validation.

Reading and Writing JSON Files in Python

JSON (JavaScript Object Notation) is the most common data interchange format on the web, and Python’s built-in json module makes reading and writing it straightforward: json.load(file) reads a JSON file into a Python dictionary or list, and json.dump(data, file) writes a Python object to a JSON file. For tabular data science work, pd.read_json() and df.to_json() provide direct conversion between JSON and pandas DataFrames, while pd.json_normalize() flattens nested JSON structures into tabular columns.

Introduction

JSON is everywhere. It is the default format for web API responses, configuration files, event log streams, NoSQL database exports, and data interchange between systems. If you work with APIs (covered in Article 75), you’re already working with JSON constantly — response.json() returns a Python dictionary parsed directly from a JSON response body. But JSON also lives in files: downloaded datasets, configuration files, database exports, streamed event logs saved to disk.

Understanding JSON beyond just response.json() — knowing how to read it from files, write it back, handle complex nested structures, work with large JSON files memory-efficiently, validate schemas, and avoid common pitfalls — is a practical skill that data scientists use daily.

This article covers the full spectrum of JSON work in Python: the json standard library module for precise control, pandas integration for data science workflows, flattening and restructuring nested data, working with the JSON Lines format popular in streaming data, handling large files without loading everything into memory, and validating JSON schemas before processing.

JSON Fundamentals: Structure and Data Types

Before working with JSON in Python, it’s worth understanding the format itself. JSON is built on two structures that map almost perfectly to Python objects:

JSON objects (dictionaries in Python):

JSON
{
  "customer_id": "CUST_001",
  "name": "Jane Smith",
  "is_premium": true,
  "signup_date": "2022-03-14",
  "lifetime_value": 3847.50,
  "preferences": null
}

JSON arrays (lists in Python):

JSON
["electronics", "home", "books"]

JSON-to-Python Type Mapping

JSON TypePython TypeExample
objectdict{"key": "value"}
arraylist[1, 2, 3]
stringstr"hello"
number (integer)int42
number (float)float3.14
trueTruetrueTrue
falseFalsefalseFalse
nullNonenullNone

Two important things JSON does not support that Python does:

  • Date/datetime objects — dates must be stored as strings (e.g., "2024-09-15") or timestamps (e.g., 1726358400)
  • Tuple — JSON arrays map to Python lists on both read and write

The json Module: Full Control Over JSON

Python’s json module is in the standard library — no installation required. It provides four core functions:

FunctionDirectionTarget
json.load(file)JSON → PythonFile object
json.loads(string)JSON → PythonString
json.dump(obj, file)Python → JSONFile object
json.dumps(obj)Python → JSONReturns string

The pattern: load/loads = deserialize (JSON → Python); dump/dumps = serialize (Python → JSON). The s suffix means “string” rather than file.

Reading JSON Files

Python
import json

# Read a JSON file — standard pattern
with open("data/customers.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# data is now a Python dict or list, depending on the JSON root element
print(type(data))    # <class 'dict'> or <class 'list'>

# If the JSON root is an object (dict)
print(data.keys())
print(data["customer_id"])

# If the JSON root is an array (list of dicts)
print(f"Records: {len(data)}")
print(data[0])       # First record
print(data[0]["name"])

Always use encoding="utf-8" when opening JSON files — JSON is specified as UTF-8 by default, and omitting the encoding can cause errors on Windows where the default encoding may be different.

Writing JSON Files

Python
import json

customers = [
    {
        "customer_id": "CUST_001",
        "name": "Jane Smith",
        "email": "jane.smith@email.com",
        "city": "Austin",
        "is_premium": True,
        "lifetime_value": 3847.50,
        "signup_date": "2022-03-14",
        "tags": ["loyal", "high-value"],
        "preferences": None
    },
    {
        "customer_id": "CUST_002",
        "name": "Bob Johnson",
        "email": "bob.j@email.com",
        "city": "Seattle",
        "is_premium": False,
        "lifetime_value": 892.00,
        "signup_date": "2021-11-28",
        "tags": ["new"],
        "preferences": {"theme": "dark", "notifications": True}
    }
]

# Write JSON to file
with open("data/customers_export.json", "w", encoding="utf-8") as f:
    json.dump(customers, f)

# Human-readable (indented) JSON — use indent parameter
with open("data/customers_pretty.json", "w", encoding="utf-8") as f:
    json.dump(customers, f, indent=2)

# With additional formatting options
with open("data/customers_formatted.json", "w", encoding="utf-8") as f:
    json.dump(
        customers,
        f,
        indent=2,
        ensure_ascii=False,   # Allow non-ASCII characters (important for international data)
        sort_keys=True        # Sort dict keys alphabetically for consistent output
    )

The indent parameter transforms compact single-line JSON into human-readable multi-line format. For files that will be read by machines only, omit indent to produce compact output. For config files, logs, or anything humans might read, use indent=2.

Reading JSON from Strings

json.loads() (note the s — string) parses a JSON string that’s already in memory:

Python
import json

# Parse a JSON string
json_string = '{"name": "Jane", "score": 0.95, "passed": true}'
data = json.loads(json_string)
print(data)          # {'name': 'Jane', 'score': 0.95, 'passed': True}
print(data["score"]) # 0.95

# Useful when JSON comes from a source other than a file:
# - API response text: json.loads(response.text)
# - Database JSON column: json.loads(row["metadata"])
# - Environment variable: json.loads(os.environ["CONFIG_JSON"])
# - Kafka message: json.loads(message.value.decode("utf-8"))

Writing JSON to Strings

json.dumps() serializes to a string instead of writing to a file:

Python
data = {"model": "xgboost_v3", "auc": 0.9142, "threshold": 0.42}

# Serialize to string (for storing in database, sending via API, etc.)
json_string = json.dumps(data)
print(json_string)   # '{"model": "xgboost_v3", "auc": 0.9142, "threshold": 0.42}'

# Pretty-printed string
pretty_string = json.dumps(data, indent=2)
print(pretty_string)
# {
#   "model": "xgboost_v3",
#   "auc": 0.9142,
#   "threshold": 0.42
# }

# Storing JSON in a database column
cursor.execute(
    "INSERT INTO model_runs (run_id, metadata) VALUES (?, ?)",
    ("run_001", json.dumps(data))
)

Handling Data Types JSON Doesn’t Support

JSON has no native date, datetime, or numpy type. Attempting to serialize these directly raises a TypeError. There are two clean solutions.

Solution 1: Custom JSON Encoder

Python
import json
import datetime
import numpy as np
import pandas as pd

class DataScienceEncoder(json.JSONEncoder):
    """
    Custom JSON encoder that handles common data science types
    not natively supported by JSON.
    """
    def default(self, obj):
        # datetime.datetime and datetime.date → ISO format string
        if isinstance(obj, (datetime.datetime, datetime.date)):
            return obj.isoformat()

        # numpy integer types → Python int
        if isinstance(obj, (np.integer, np.int64, np.int32)):
            return int(obj)

        # numpy float types → Python float
        if isinstance(obj, (np.floating, np.float64, np.float32)):
            return float(obj)

        # numpy arrays → Python list
        if isinstance(obj, np.ndarray):
            return obj.tolist()

        # pandas Timestamp → ISO string
        if isinstance(obj, pd.Timestamp):
            return obj.isoformat()

        # pandas NA/NaT → None (becomes JSON null)
        if pd.isna(obj):
            return None

        # Let the base class handle unknown types (raises TypeError)
        return super().default(obj)


# Usage
metrics = {
    "model": "xgboost_v3",
    "trained_at": datetime.datetime(2024, 9, 15, 14, 30, 0),
    "auc_roc": np.float64(0.9142),
    "n_trees": np.int64(500),
    "feature_importances": np.array([0.23, 0.18, 0.15, 0.44]),
    "evaluation_date": datetime.date(2024, 9, 15)
}

# Serialize with custom encoder
json_string = json.dumps(metrics, cls=DataScienceEncoder, indent=2)
print(json_string)

# Write to file
with open("models/xgboost_v3_metrics.json", "w", encoding="utf-8") as f:
    json.dump(metrics, f, cls=DataScienceEncoder, indent=2)

Output:

Python
{
  "model": "xgboost_v3",
  "trained_at": "2024-09-15T14:30:00",
  "auc_roc": 0.9142,
  "n_trees": 500,
  "feature_importances": [0.23, 0.18, 0.15, 0.44],
  "evaluation_date": "2024-09-15"
}

Solution 2: Convert Before Serializing

For one-off cases, convert problem types before passing to json.dump:

Python
import json
import datetime
import numpy as np

def make_json_serializable(obj):
    """Recursively convert a nested dict/list to JSON-serializable types."""
    if isinstance(obj, dict):
        return {k: make_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [make_json_serializable(item) for item in obj]
    elif isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    elif isinstance(obj, (np.integer,)):
        return int(obj)
    elif isinstance(obj, (np.floating,)):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    elif obj != obj:  # NaN check (NaN != NaN is True)
        return None
    return obj

# Convert entire data structure before serializing
serializable_data = make_json_serializable(metrics)
json_string = json.dumps(serializable_data, indent=2)

Working with Nested JSON: Flattening and Restructuring

Real-world JSON data is almost always nested — dictionaries within dictionaries, lists of objects, arrays of arrays. Most data science tools work best with flat tabular data, so a critical skill is flattening nested JSON into a DataFrame.

The Nested JSON Problem

Python
import json

# Load a realistic customer JSON with deep nesting
customer_json = """
[
  {
    "customer_id": "CUST_001",
    "name": {"first": "Jane", "last": "Smith"},
    "contact": {
      "email": "jane.smith@email.com",
      "phone": "512-555-0101",
      "address": {
        "street": "123 Oak Ave",
        "city": "Austin",
        "state": "TX",
        "zip": "78701"
      }
    },
    "account": {
      "is_premium": true,
      "signup_date": "2022-03-14",
      "lifetime_value": 3847.50
    },
    "orders": [
      {"order_id": "ORD_1001", "date": "2024-07-05", "amount": 199.98},
      {"order_id": "ORD_1003", "date": "2024-08-01", "amount": 149.99},
      {"order_id": "ORD_1008", "date": "2024-09-10", "amount": 34.99}
    ],
    "tags": ["loyal", "high-value", "electronics-buyer"]
  },
  {
    "customer_id": "CUST_002",
    "name": {"first": "Bob", "last": "Johnson"},
    "contact": {
      "email": "bob.j@email.com",
      "phone": null,
      "address": {
        "street": "456 Pine St",
        "city": "Seattle",
        "state": "WA",
        "zip": "98101"
      }
    },
    "account": {
      "is_premium": false,
      "signup_date": "2021-11-28",
      "lifetime_value": 892.00
    },
    "orders": [
      {"order_id": "ORD_1002", "date": "2024-07-12", "amount": 49.99}
    ],
    "tags": ["new"]
  }
]
"""

customers = json.loads(customer_json)
print(f"Records: {len(customers)}")

Approach 1: pd.json_normalize() — Automatic Flattening

Python
import pandas as pd
from pandas import json_normalize

# Basic normalization — flattens one level of nesting
df_basic = json_normalize(customers)
print(df_basic.columns.tolist())
# ['customer_id', 'tags', 'name.first', 'name.last',
#  'contact.email', 'contact.phone', 'contact.address.street',
#  'contact.address.city', 'contact.address.state', 'contact.address.zip',
#  'account.is_premium', 'account.signup_date', 'account.lifetime_value',
#  'orders']  ← 'orders' stays as a list column

# Customize separator (default is ".")
df_underscore = json_normalize(customers, sep="_")
# Columns: customer_id, name_first, name_last, contact_email, etc.

print(df_underscore[["customer_id", "name_first", "name_last",
                      "contact_address_city", "account_lifetime_value"]].head())
customer_idname_firstname_lastcontact_address_cityaccount_lifetime_value
CUST_001JaneSmithAustin3847.50
CUST_002BobJohnsonSeattle892.00

Approach 2: Expanding Nested Arrays with record_path

When one of the nested fields is a list of objects (like orders), use record_path to expand it — producing one row per array element:

Python
# Expand the 'orders' array — one row per order
# 'meta' specifies fields from the parent object to include in each row
df_orders = json_normalize(
    customers,
    record_path="orders",                    # Expand this array
    meta=[
        "customer_id",                       # Include these parent fields
        ["name", "first"],                   # Nested parent fields use list notation
        ["name", "last"],
        ["account", "is_premium"],
        ["account", "lifetime_value"]
    ],
    meta_prefix="customer_",                 # Prefix parent fields to avoid name clashes
    errors="ignore"                          # Ignore missing meta fields
)

print(df_orders)
order_iddateamountcustomer_customer_idcustomer_name_firstcustomer_name_last
ORD_10012024-07-05199.98CUST_001JaneSmith
ORD_10032024-08-01149.99CUST_001JaneSmith
ORD_10082024-09-1034.99CUST_001JaneSmith
ORD_10022024-07-1249.99CUST_002BobJohnson

Approach 3: Manual Extraction for Full Control

When json_normalize doesn’t produce exactly what you need, extract fields manually:

Python
def extract_customer_records(customers: list) -> pd.DataFrame:
    """
    Manually extract and flatten customer records with full control
    over field selection, naming, and transformation.
    """
    records = []
    for c in customers:
        # Safe extraction with .get() and nested .get() chains
        record = {
            # Flat fields
            "customer_id":    c.get("customer_id"),

            # Nested name fields
            "first_name":     c.get("name", {}).get("first"),
            "last_name":      c.get("name", {}).get("last"),

            # Contact fields
            "email":          c.get("contact", {}).get("email"),
            "phone":          c.get("contact", {}).get("phone"),

            # Address fields (double nesting)
            "city":           c.get("contact", {}).get("address", {}).get("city"),
            "state":          c.get("contact", {}).get("address", {}).get("state"),

            # Account fields
            "is_premium":     c.get("account", {}).get("is_premium"),
            "signup_date":    c.get("account", {}).get("signup_date"),
            "lifetime_value": c.get("account", {}).get("lifetime_value"),

            # Derived from array: aggregate order stats
            "n_orders":       len(c.get("orders", [])),
            "total_orders":   sum(o.get("amount", 0) for o in c.get("orders", [])),
            "last_order_date": max(
                (o.get("date") for o in c.get("orders", [])),
                default=None
            ),

            # Array field: join list to string for storage
            "tags":           "|".join(c.get("tags", []))
        }
        records.append(record)

    df = pd.DataFrame(records)
    df["signup_date"]    = pd.to_datetime(df["signup_date"])
    df["last_order_date"] = pd.to_datetime(df["last_order_date"])
    return df

df_customers = extract_customer_records(customers)
print(df_customers.dtypes)
print(df_customers.head())

This approach is more verbose but gives complete control over field names, types, and derivations.

pandas JSON Integration

pandas has built-in JSON support through pd.read_json() and df.to_json().

pd.read_json(): Reading JSON Directly into DataFrames

Python
import pandas as pd

# Read a JSON file containing a list of flat objects
# (the most common case for datasets)
df = pd.read_json("data/transactions.json")

# Read with type inference and date parsing
df = pd.read_json(
    "data/transactions.json",
    convert_dates=["transaction_date", "created_at"],
    dtype={"amount": float, "customer_id": str}
)

# Read from a URL (directly from a web API or remote file)
df = pd.read_json("https://api.example.com/data/export.json")

# Different orientations — how the JSON is structured
# 'records': [{col: val, ...}] ← most common, one dict per row
df = pd.read_json("data.json", orient="records")

# 'index': {index: {col: val}} ← indexed by row label
df = pd.read_json("data.json", orient="index")

# 'columns': {col: {index: val}} ← column-indexed
df = pd.read_json("data.json", orient="columns")

# 'values': [[val, val, ...]] ← array of arrays
df = pd.read_json("data.json", orient="values")

# 'split': {index:[...], columns:[...], data:[[...]]} ← split format
df = pd.read_json("data.json", orient="split")

df.to_json(): Writing DataFrames to JSON

Python
import pandas as pd

df = pd.DataFrame({
    "customer_id":  ["CUST_001", "CUST_002", "CUST_003"],
    "name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "total_spend":  [384.96, 139.98, 269.97],
    "signup_date":  pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"])
})

# Default: records orientation — list of dicts
df.to_json("output/customers.json", orient="records", indent=2)

# Result:
# [
#   {"customer_id": "CUST_001", "name": "Jane Smith", ...},
#   ...
# ]

# With date formatting
df.to_json(
    "output/customers.json",
    orient="records",
    indent=2,
    date_format="iso",     # ISO 8601 strings: "2022-03-14T00:00:00.000Z"
    force_ascii=False      # Allow non-ASCII characters in output
)

# Index orientation — rows indexed by row label
df.to_json("output/customers_indexed.json", orient="index", indent=2)

# Get JSON as string (not write to file)
json_str = df.to_json(orient="records")

JSON Lines (JSONL): The Streaming Data Format

JSON Lines (also called JSONL or newline-delimited JSON) is a format where each line of a file is a separate, complete JSON object. It’s the standard format for streaming event data, log files, Kafka exports, and large datasets that need to be processed incrementally.

JSON
{"event_id": "EVT_001", "user_id": "U_001", "action": "purchase", "ts": "2024-09-15T10:23:11", "amount": 149.99}
{"event_id": "EVT_002", "user_id": "U_042", "action": "view", "ts": "2024-09-15T10:23:15", "product_id": "PROD_007"}
{"event_id": "EVT_003", "user_id": "U_001", "action": "review", "ts": "2024-09-15T10:24:02", "rating": 5}

JSONL is superior to standard JSON for large datasets because:

  • Appendable: New records are added by appending lines — no need to parse and rewrite the entire file
  • Streamable: Process one record at a time without loading the whole file
  • Fault-tolerant: A corrupt line affects only that line, not the whole file

Reading JSON Lines

Python
import json
import pandas as pd

# Method 1: pandas — simplest approach
df = pd.read_json("data/events.jsonl", lines=True)
print(df.head())

# Method 2: Manual reading — useful for large files or complex filtering
records = []
with open("data/events.jsonl", "r", encoding="utf-8") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue   # Skip blank lines
        try:
            record = json.loads(line)
            records.append(record)
        except json.JSONDecodeError as e:
            print(f"Warning: Invalid JSON on line {line_num}: {e}")
            continue   # Skip malformed lines, continue processing

df = pd.DataFrame(records)

# Method 3: Generator — memory-efficient for very large files
def read_jsonl(filepath: str):
    """Generator that yields one parsed JSON record per line."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Line {line_num} skipped: {e}")

# Process 10 million events without loading all into memory
total_revenue = 0.0
purchase_count = 0

for event in read_jsonl("data/events_2024.jsonl"):
    if event.get("action") == "purchase":
        total_revenue += event.get("amount", 0)
        purchase_count += 1

print(f"Total revenue from {purchase_count:,} purchases: ${total_revenue:,.2f}")

Writing JSON Lines

Python
import json

events = [
    {"event_id": "EVT_001", "user_id": "U_001", "action": "purchase", "amount": 149.99},
    {"event_id": "EVT_002", "user_id": "U_042", "action": "view", "product_id": "PROD_007"},
    {"event_id": "EVT_003", "user_id": "U_001", "action": "review", "rating": 5}
]

# Write JSONL — one JSON object per line, no trailing comma
with open("data/events.jsonl", "w", encoding="utf-8") as f:
    for event in events:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")

# Append to an existing JSONL file (the key advantage of JSONL)
new_event = {"event_id": "EVT_004", "user_id": "U_007", "action": "signup"}
with open("data/events.jsonl", "a", encoding="utf-8") as f:   # "a" = append mode
    f.write(json.dumps(new_event) + "\n")

# pandas DataFrame → JSONL
df.to_json("output/predictions.jsonl", orient="records", lines=True)

Processing Large JSON Files Without Loading Everything

When JSON files are too large to fit in memory (multi-gigabyte files), load them in chunks or use streaming parsers.

Chunked Reading with pandas

Python
import pandas as pd

# Read JSONL in chunks — memory-efficient
chunk_size = 50_000
results = []

for chunk in pd.read_json("data/large_events.jsonl", lines=True, chunksize=chunk_size):
    # Filter and aggregate each chunk before accumulating
    purchase_chunk = chunk[chunk["action"] == "purchase"]
    monthly_revenue = (
        purchase_chunk
        .assign(month=pd.to_datetime(chunk["ts"]).dt.to_period("M"))
        .groupby("month")["amount"]
        .sum()
    )
    results.append(monthly_revenue)
    print(f"Processed chunk: {len(chunk):,} events")

# Combine chunk results
final_revenue = pd.concat(results).groupby(level=0).sum()
print(final_revenue)

ijson: Streaming Parser for Large Standard JSON

For very large standard JSON files (not JSONL), ijson parses incrementally without loading the whole file:

Python
pip install ijson
Python
import ijson
import pandas as pd

# Parse a large JSON file containing an array of records
# without loading the entire file into memory
def stream_large_json_array(filepath: str, array_prefix: str = "item"):
    """
    Stream records from a large JSON file containing a top-level array.

    Parameters
    ----------
    filepath : str
        Path to the JSON file.
    array_prefix : str
        ijson prefix for the array elements. 'item' for root-level arrays.

    Yields
    ------
    dict
        One parsed record at a time.
    """
    with open(filepath, "rb") as f:   # Binary mode required for ijson
        for record in ijson.items(f, array_prefix):
            yield record

# Process a 2GB JSON file with millions of records
records_processed = 0
total_spend = 0.0

for record in stream_large_json_array("data/massive_transactions.json"):
    total_spend += record.get("amount", 0)
    records_processed += 1

    if records_processed % 100_000 == 0:
        print(f"Processed {records_processed:,} records...")

print(f"Processed {records_processed:,} records, total spend: ${total_spend:,.2f}")

# Or collect into a DataFrame in batches
BATCH_SIZE = 10_000
batch = []

for record in stream_large_json_array("data/massive_transactions.json"):
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        df_batch = pd.DataFrame(batch)
        process_batch(df_batch)  # Your processing function
        batch = []

# Process remaining records
if batch:
    df_batch = pd.DataFrame(batch)
    process_batch(df_batch)

JSON Schema Validation

Before processing JSON data, validating that it conforms to an expected schema catches errors early and provides clear, actionable error messages instead of cryptic KeyErrors deep in your pipeline.

Python
pip install jsonschema
Python
from jsonschema import validate, ValidationError, Draft7Validator
import json

# Define the expected schema
CUSTOMER_SCHEMA = {
    "type": "object",
    "required": ["customer_id", "name", "email"],
    "properties": {
        "customer_id": {
            "type": "string",
            "pattern": "^CUST_[0-9]+$"      # Must match this regex
        },
        "name": {
            "type": "object",
            "required": ["first", "last"],
            "properties": {
                "first": {"type": "string", "minLength": 1},
                "last":  {"type": "string", "minLength": 1}
            }
        },
        "email": {
            "type": "string",
            "format": "email"
        },
        "lifetime_value": {
            "type": "number",
            "minimum": 0
        },
        "is_premium": {
            "type": "boolean"
        },
        "tags": {
            "type": "array",
            "items": {"type": "string"},
            "uniqueItems": True
        }
    },
    "additionalProperties": True    # Allow extra fields
}

# Validate a single record
def validate_customer(customer: dict) -> list[str]:
    """
    Validate a customer record against the schema.

    Returns a list of validation error messages (empty if valid).
    """
    validator = Draft7Validator(CUSTOMER_SCHEMA)
    errors = [
        f"{'.'.join(str(p) for p in error.absolute_path)}: {error.message}"
        if error.absolute_path
        else error.message
        for error in validator.iter_errors(customer)
    ]
    return errors

# Test with valid record
valid_customer = {
    "customer_id": "CUST_001",
    "name": {"first": "Jane", "last": "Smith"},
    "email": "jane@email.com",
    "lifetime_value": 3847.50,
    "is_premium": True,
    "tags": ["loyal", "high-value"]
}
errors = validate_customer(valid_customer)
print(f"Valid: {len(errors) == 0}")  # Valid: True

# Test with invalid record
invalid_customer = {
    "customer_id": "CUSTOMER_001",  # Wrong format — should be CUST_XXX
    "name": {"first": "Bob"},       # Missing 'last' field
    "email": "not-an-email",        # Invalid email format
    "lifetime_value": -50.0,        # Below minimum (0)
    "is_premium": "yes"             # Wrong type — should be boolean
}
errors = validate_customer(invalid_customer)
print("Validation errors:")
for error in errors:
    print(f"  - {error}")

# Validate all records in a file, collecting errors
def validate_json_file(filepath: str, schema: dict) -> tuple[list, list]:
    """
    Validate all records in a JSONL file against a schema.
    Returns (valid_records, error_records).
    """
    valid_records = []
    error_records = []
    validator = Draft7Validator(schema)

    with open(filepath, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            try:
                record = json.loads(line.strip())
            except json.JSONDecodeError as e:
                error_records.append({"line": line_num, "error": f"JSON parse error: {e}"})
                continue

            errors = list(validator.iter_errors(record))
            if errors:
                error_records.append({
                    "line": line_num,
                    "record_id": record.get("customer_id", "unknown"),
                    "errors": [e.message for e in errors]
                })
            else:
                valid_records.append(record)

    print(f"Valid: {len(valid_records):,} | Invalid: {len(error_records):,}")
    return valid_records, error_records

Common JSON Patterns in Data Science Workflows

Storing and Loading Model Metrics

Python
import json
import datetime
import numpy as np

class DataScienceEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (datetime.datetime, datetime.date)):
            return obj.isoformat()
        if isinstance(obj, (np.integer,)):
            return int(obj)
        if isinstance(obj, (np.floating,)):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

# Save model run results
metrics = {
    "experiment": "xgboost_v3_tuned",
    "timestamp":  datetime.datetime.now(),
    "config": {
        "n_estimators": 500,
        "learning_rate": 0.05,
        "random_seed": 42
    },
    "metrics": {
        "roc_auc":  np.float64(0.9142),
        "f1_score": np.float64(0.847),
        "precision": np.float64(0.856),
        "recall":   np.float64(0.839)
    },
    "feature_importance": {
        "recency_days":   np.float64(0.231),
        "lifetime_value": np.float64(0.184),
        "total_orders":   np.float64(0.147)
    }
}

with open("models/run_metrics.json", "w") as f:
    json.dump(metrics, f, cls=DataScienceEncoder, indent=2)

# Load and compare runs
def load_run_metrics(filepath: str) -> dict:
    with open(filepath, "r") as f:
        return json.load(f)

run_metrics = load_run_metrics("models/run_metrics.json")
print(f"AUC-ROC: {run_metrics['metrics']['roc_auc']}")

Config Files

Python
import json
from pathlib import Path

# Load a JSON config file
def load_config(config_path: str = "configs/config.json") -> dict:
    """Load configuration from a JSON file."""
    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)

    return config

# configs/config.json
config_template = {
    "data": {
        "input_path": "data/raw/transactions.csv",
        "output_path": "data/processed/features.parquet",
        "target_column": "churned"
    },
    "model": {
        "algorithm": "xgboost",
        "n_estimators": 500,
        "learning_rate": 0.05,
        "random_seed": 42
    },
    "evaluation": {
        "test_size": 0.2,
        "primary_metric": "roc_auc"
    }
}

with open("configs/config.json", "w") as f:
    json.dump(config_template, f, indent=2)

config = load_config("configs/config.json")
print(f"Algorithm: {config['model']['algorithm']}")
print(f"Random seed: {config['model']['random_seed']}")

Accumulating Results to JSONL During a Pipeline

Python
import json
import datetime

# Write results incrementally as a pipeline processes records
def score_customers_with_logging(customers: list, model, output_path: str):
    """Score customers and write results to JSONL as we go."""
    with open(output_path, "w", encoding="utf-8") as f:
        for customer in customers:
            score = model.predict_proba([customer["features"]])[0][1]
            result = {
                "customer_id":     customer["customer_id"],
                "churn_score":     round(float(score), 4),
                "predicted_label": int(score >= 0.5),
                "scored_at":       datetime.datetime.now().isoformat()
            }
            f.write(json.dumps(result) + "\n")

    print(f"Scored {len(customers):,} customers → {output_path}")

JSON Best Practices for Data Scientists

Always Use UTF-8 Encoding

Python
# Always specify encoding explicitly
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)  # ensure_ascii=False allows non-ASCII

Use .get() for All Optional Fields

Python
# FRAGILE: KeyError if field is missing
city = record["contact"]["address"]["city"]

# ROBUST: Returns None (or default) if any level is missing
city = record.get("contact", {}).get("address", {}).get("city")

# Or with explicit defaults:
city = record.get("contact", {}).get("address", {}).get("city", "Unknown")

Validate Before Processing

Python
# Fail early with a clear message rather than failing deep in your pipeline
def load_and_validate(filepath: str, schema: dict) -> list:
    with open(filepath) as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected JSON array at root, got {type(data).__name__}")

    valid, invalid = validate_json_file(filepath, schema)

    if len(invalid) > 0:
        raise ValueError(
            f"{len(invalid)} invalid records found. "
            f"First error: {invalid[0]['errors'][0]}"
        )
    return valid

Prefer JSONL for Large Datasets

Use standard JSON for small datasets and configs (human-readable, easy to inspect). Use JSONL for large datasets (streamable, appendable, line-by-line processing).

Summary

JSON is the universal data interchange format of the web era, and fluency with it is a daily requirement for data scientists. Python’s json module provides precise, low-level control for reading and writing any JSON structure. pandas’ pd.read_json(), df.to_json(), and pd.json_normalize() handle the most common data science use cases — loading flat or moderately nested JSON into DataFrames and writing DataFrames back to JSON files.

The critical advanced skills that separate professional JSON handling from beginner usage are: handling non-serializable types (dates, numpy values) with custom encoders; flattening deeply nested structures with json_normalize() or manual extraction; working with the JSON Lines format for large and streaming datasets; processing gigabyte-scale files without loading everything into memory using generators or ijson; and validating data schemas before processing to catch errors at the boundary rather than deep in the pipeline.

With these skills, JSON becomes not just a format you can read when you see it, but a format you can work with systematically and professionally — reading from files, APIs, and streams; writing clean, reproducible outputs; and handling the full messiness of real-world nested and inconsistently structured data.

Key Takeaways

  • json.load(file) reads a JSON file into Python (dict or list); json.dump(obj, file) writes Python to a JSON file; add indent=2 for human-readable output and always specify encoding="utf-8"
  • json.loads(string) and json.dumps(obj) work with strings instead of files — useful for API responses, database columns, and environment variable configs
  • JSON doesn’t support dates, numpy types, or tuples — use a custom JSONEncoder subclass or pre-convert these types before serializing
  • pd.json_normalize() flattens nested JSON dicts into DataFrame columns using dot notation (or a custom sep); use record_path to expand nested arrays into one-row-per-array-element format
  • Always use .get() with fallback defaults for nested field access — record.get("a", {}).get("b", {}).get("c") — never chain [] indexing on JSON from external sources
  • JSON Lines (JSONL) format — one JSON object per line — is superior to standard JSON for large datasets and event streams: it’s appendable, streamable, and one corrupt record doesn’t break the whole file
  • Use a generator function with yield json.loads(line) to process large JSONL files one record at a time without loading everything into memory; use ijson for streaming very large standard JSON files
  • jsonschema.Draft7Validator validates JSON data against a schema before processing — catch malformed data at the input boundary rather than deep in the pipeline
Share:
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments

Discover More

Introduction to Model Evaluation Metrics

Master machine learning evaluation metrics including accuracy, precision, recall, F1-score, ROC-AUC, RMSE, and more with…

Probability Theory Fundamentals for Machine Learning

Master probability theory fundamentals essential for machine learning. Learn probability distributions, conditional probability, Bayes’ theorem,…

Parallel Circuits Demystified: Creating Multiple Paths for Current

Master parallel circuits with this comprehensive guide. Learn how components connect in parallel, calculate current…

Samsung Announces Massive AI Expansion Targeting 800 Million Mobile Devices in 2026

Samsung announces aggressive AI strategy to double Galaxy AI-enabled devices to 800 million by 2026.…

JavaScript Functions: Declaration, Invocation and Parameters

Learn about JavaScript functions, including declarations, invocations, parameters, and handling asynchronous tasks with callbacks, promises…

What is Continuity Testing and Why is it Your Best Debugging Friend?

Learn what continuity testing is, how to use it with a multimeter, and why it’s…

Click For More
0
Would love your thoughts, please comment.x
()
x