JSON (JavaScript Object Notation) is the most common data interchange format on the web, and Python’s built-in json module makes reading and writing it straightforward: json.load(file) reads a JSON file into a Python dictionary or list, and json.dump(data, file) writes a Python object to a JSON file. For tabular data science work, pd.read_json() and df.to_json() provide direct conversion between JSON and pandas DataFrames, while pd.json_normalize() flattens nested JSON structures into tabular columns.
Introduction
JSON is everywhere. It is the default format for web API responses, configuration files, event log streams, NoSQL database exports, and data interchange between systems. If you work with APIs (covered in Article 75), you’re already working with JSON constantly — response.json() returns a Python dictionary parsed directly from a JSON response body. But JSON also lives in files: downloaded datasets, configuration files, database exports, streamed event logs saved to disk.
Understanding JSON beyond just response.json() — knowing how to read it from files, write it back, handle complex nested structures, work with large JSON files memory-efficiently, validate schemas, and avoid common pitfalls — is a practical skill that data scientists use daily.
This article covers the full spectrum of JSON work in Python: the json standard library module for precise control, pandas integration for data science workflows, flattening and restructuring nested data, working with the JSON Lines format popular in streaming data, handling large files without loading everything into memory, and validating JSON schemas before processing.
JSON Fundamentals: Structure and Data Types
Before working with JSON in Python, it’s worth understanding the format itself. JSON is built on two structures that map almost perfectly to Python objects:
JSON objects (dictionaries in Python):
{
"customer_id": "CUST_001",
"name": "Jane Smith",
"is_premium": true,
"signup_date": "2022-03-14",
"lifetime_value": 3847.50,
"preferences": null
}JSON arrays (lists in Python):
["electronics", "home", "books"]JSON-to-Python Type Mapping
| JSON Type | Python Type | Example |
|---|---|---|
object | dict | {"key": "value"} |
array | list | [1, 2, 3] |
string | str | "hello" |
number (integer) | int | 42 |
number (float) | float | 3.14 |
true | True | true → True |
false | False | false → False |
null | None | null → None |
Two important things JSON does not support that Python does:
- Date/datetime objects — dates must be stored as strings (e.g.,
"2024-09-15") or timestamps (e.g.,1726358400) - Tuple — JSON arrays map to Python lists on both read and write
The json Module: Full Control Over JSON
Python’s json module is in the standard library — no installation required. It provides four core functions:
| Function | Direction | Target |
|---|---|---|
json.load(file) | JSON → Python | File object |
json.loads(string) | JSON → Python | String |
json.dump(obj, file) | Python → JSON | File object |
json.dumps(obj) | Python → JSON | Returns string |
The pattern: load/loads = deserialize (JSON → Python); dump/dumps = serialize (Python → JSON). The s suffix means “string” rather than file.
Reading JSON Files
import json
# Read a JSON file — standard pattern
with open("data/customers.json", "r", encoding="utf-8") as f:
data = json.load(f)
# data is now a Python dict or list, depending on the JSON root element
print(type(data)) # <class 'dict'> or <class 'list'>
# If the JSON root is an object (dict)
print(data.keys())
print(data["customer_id"])
# If the JSON root is an array (list of dicts)
print(f"Records: {len(data)}")
print(data[0]) # First record
print(data[0]["name"])Always use encoding="utf-8" when opening JSON files — JSON is specified as UTF-8 by default, and omitting the encoding can cause errors on Windows where the default encoding may be different.
Writing JSON Files
import json
customers = [
{
"customer_id": "CUST_001",
"name": "Jane Smith",
"email": "jane.smith@email.com",
"city": "Austin",
"is_premium": True,
"lifetime_value": 3847.50,
"signup_date": "2022-03-14",
"tags": ["loyal", "high-value"],
"preferences": None
},
{
"customer_id": "CUST_002",
"name": "Bob Johnson",
"email": "bob.j@email.com",
"city": "Seattle",
"is_premium": False,
"lifetime_value": 892.00,
"signup_date": "2021-11-28",
"tags": ["new"],
"preferences": {"theme": "dark", "notifications": True}
}
]
# Write JSON to file
with open("data/customers_export.json", "w", encoding="utf-8") as f:
json.dump(customers, f)
# Human-readable (indented) JSON — use indent parameter
with open("data/customers_pretty.json", "w", encoding="utf-8") as f:
json.dump(customers, f, indent=2)
# With additional formatting options
with open("data/customers_formatted.json", "w", encoding="utf-8") as f:
json.dump(
customers,
f,
indent=2,
ensure_ascii=False, # Allow non-ASCII characters (important for international data)
sort_keys=True # Sort dict keys alphabetically for consistent output
)The indent parameter transforms compact single-line JSON into human-readable multi-line format. For files that will be read by machines only, omit indent to produce compact output. For config files, logs, or anything humans might read, use indent=2.
Reading JSON from Strings
json.loads() (note the s — string) parses a JSON string that’s already in memory:
import json
# Parse a JSON string
json_string = '{"name": "Jane", "score": 0.95, "passed": true}'
data = json.loads(json_string)
print(data) # {'name': 'Jane', 'score': 0.95, 'passed': True}
print(data["score"]) # 0.95
# Useful when JSON comes from a source other than a file:
# - API response text: json.loads(response.text)
# - Database JSON column: json.loads(row["metadata"])
# - Environment variable: json.loads(os.environ["CONFIG_JSON"])
# - Kafka message: json.loads(message.value.decode("utf-8"))Writing JSON to Strings
json.dumps() serializes to a string instead of writing to a file:
data = {"model": "xgboost_v3", "auc": 0.9142, "threshold": 0.42}
# Serialize to string (for storing in database, sending via API, etc.)
json_string = json.dumps(data)
print(json_string) # '{"model": "xgboost_v3", "auc": 0.9142, "threshold": 0.42}'
# Pretty-printed string
pretty_string = json.dumps(data, indent=2)
print(pretty_string)
# {
# "model": "xgboost_v3",
# "auc": 0.9142,
# "threshold": 0.42
# }
# Storing JSON in a database column
cursor.execute(
"INSERT INTO model_runs (run_id, metadata) VALUES (?, ?)",
("run_001", json.dumps(data))
)Handling Data Types JSON Doesn’t Support
JSON has no native date, datetime, or numpy type. Attempting to serialize these directly raises a TypeError. There are two clean solutions.
Solution 1: Custom JSON Encoder
import json
import datetime
import numpy as np
import pandas as pd
class DataScienceEncoder(json.JSONEncoder):
"""
Custom JSON encoder that handles common data science types
not natively supported by JSON.
"""
def default(self, obj):
# datetime.datetime and datetime.date → ISO format string
if isinstance(obj, (datetime.datetime, datetime.date)):
return obj.isoformat()
# numpy integer types → Python int
if isinstance(obj, (np.integer, np.int64, np.int32)):
return int(obj)
# numpy float types → Python float
if isinstance(obj, (np.floating, np.float64, np.float32)):
return float(obj)
# numpy arrays → Python list
if isinstance(obj, np.ndarray):
return obj.tolist()
# pandas Timestamp → ISO string
if isinstance(obj, pd.Timestamp):
return obj.isoformat()
# pandas NA/NaT → None (becomes JSON null)
if pd.isna(obj):
return None
# Let the base class handle unknown types (raises TypeError)
return super().default(obj)
# Usage
metrics = {
"model": "xgboost_v3",
"trained_at": datetime.datetime(2024, 9, 15, 14, 30, 0),
"auc_roc": np.float64(0.9142),
"n_trees": np.int64(500),
"feature_importances": np.array([0.23, 0.18, 0.15, 0.44]),
"evaluation_date": datetime.date(2024, 9, 15)
}
# Serialize with custom encoder
json_string = json.dumps(metrics, cls=DataScienceEncoder, indent=2)
print(json_string)
# Write to file
with open("models/xgboost_v3_metrics.json", "w", encoding="utf-8") as f:
json.dump(metrics, f, cls=DataScienceEncoder, indent=2)Output:
{
"model": "xgboost_v3",
"trained_at": "2024-09-15T14:30:00",
"auc_roc": 0.9142,
"n_trees": 500,
"feature_importances": [0.23, 0.18, 0.15, 0.44],
"evaluation_date": "2024-09-15"
}Solution 2: Convert Before Serializing
For one-off cases, convert problem types before passing to json.dump:
import json
import datetime
import numpy as np
def make_json_serializable(obj):
"""Recursively convert a nested dict/list to JSON-serializable types."""
if isinstance(obj, dict):
return {k: make_json_serializable(v) for k, v in obj.items()}
elif isinstance(obj, (list, tuple)):
return [make_json_serializable(item) for item in obj]
elif isinstance(obj, (datetime.datetime, datetime.date)):
return obj.isoformat()
elif isinstance(obj, (np.integer,)):
return int(obj)
elif isinstance(obj, (np.floating,)):
return float(obj)
elif isinstance(obj, np.ndarray):
return obj.tolist()
elif obj != obj: # NaN check (NaN != NaN is True)
return None
return obj
# Convert entire data structure before serializing
serializable_data = make_json_serializable(metrics)
json_string = json.dumps(serializable_data, indent=2)Working with Nested JSON: Flattening and Restructuring
Real-world JSON data is almost always nested — dictionaries within dictionaries, lists of objects, arrays of arrays. Most data science tools work best with flat tabular data, so a critical skill is flattening nested JSON into a DataFrame.
The Nested JSON Problem
import json
# Load a realistic customer JSON with deep nesting
customer_json = """
[
{
"customer_id": "CUST_001",
"name": {"first": "Jane", "last": "Smith"},
"contact": {
"email": "jane.smith@email.com",
"phone": "512-555-0101",
"address": {
"street": "123 Oak Ave",
"city": "Austin",
"state": "TX",
"zip": "78701"
}
},
"account": {
"is_premium": true,
"signup_date": "2022-03-14",
"lifetime_value": 3847.50
},
"orders": [
{"order_id": "ORD_1001", "date": "2024-07-05", "amount": 199.98},
{"order_id": "ORD_1003", "date": "2024-08-01", "amount": 149.99},
{"order_id": "ORD_1008", "date": "2024-09-10", "amount": 34.99}
],
"tags": ["loyal", "high-value", "electronics-buyer"]
},
{
"customer_id": "CUST_002",
"name": {"first": "Bob", "last": "Johnson"},
"contact": {
"email": "bob.j@email.com",
"phone": null,
"address": {
"street": "456 Pine St",
"city": "Seattle",
"state": "WA",
"zip": "98101"
}
},
"account": {
"is_premium": false,
"signup_date": "2021-11-28",
"lifetime_value": 892.00
},
"orders": [
{"order_id": "ORD_1002", "date": "2024-07-12", "amount": 49.99}
],
"tags": ["new"]
}
]
"""
customers = json.loads(customer_json)
print(f"Records: {len(customers)}")Approach 1: pd.json_normalize() — Automatic Flattening
import pandas as pd
from pandas import json_normalize
# Basic normalization — flattens one level of nesting
df_basic = json_normalize(customers)
print(df_basic.columns.tolist())
# ['customer_id', 'tags', 'name.first', 'name.last',
# 'contact.email', 'contact.phone', 'contact.address.street',
# 'contact.address.city', 'contact.address.state', 'contact.address.zip',
# 'account.is_premium', 'account.signup_date', 'account.lifetime_value',
# 'orders'] ← 'orders' stays as a list column
# Customize separator (default is ".")
df_underscore = json_normalize(customers, sep="_")
# Columns: customer_id, name_first, name_last, contact_email, etc.
print(df_underscore[["customer_id", "name_first", "name_last",
"contact_address_city", "account_lifetime_value"]].head())| customer_id | name_first | name_last | contact_address_city | account_lifetime_value |
|---|---|---|---|---|
| CUST_001 | Jane | Smith | Austin | 3847.50 |
| CUST_002 | Bob | Johnson | Seattle | 892.00 |
Approach 2: Expanding Nested Arrays with record_path
When one of the nested fields is a list of objects (like orders), use record_path to expand it — producing one row per array element:
# Expand the 'orders' array — one row per order
# 'meta' specifies fields from the parent object to include in each row
df_orders = json_normalize(
customers,
record_path="orders", # Expand this array
meta=[
"customer_id", # Include these parent fields
["name", "first"], # Nested parent fields use list notation
["name", "last"],
["account", "is_premium"],
["account", "lifetime_value"]
],
meta_prefix="customer_", # Prefix parent fields to avoid name clashes
errors="ignore" # Ignore missing meta fields
)
print(df_orders)| order_id | date | amount | customer_customer_id | customer_name_first | customer_name_last |
|---|---|---|---|---|---|
| ORD_1001 | 2024-07-05 | 199.98 | CUST_001 | Jane | Smith |
| ORD_1003 | 2024-08-01 | 149.99 | CUST_001 | Jane | Smith |
| ORD_1008 | 2024-09-10 | 34.99 | CUST_001 | Jane | Smith |
| ORD_1002 | 2024-07-12 | 49.99 | CUST_002 | Bob | Johnson |
Approach 3: Manual Extraction for Full Control
When json_normalize doesn’t produce exactly what you need, extract fields manually:
def extract_customer_records(customers: list) -> pd.DataFrame:
"""
Manually extract and flatten customer records with full control
over field selection, naming, and transformation.
"""
records = []
for c in customers:
# Safe extraction with .get() and nested .get() chains
record = {
# Flat fields
"customer_id": c.get("customer_id"),
# Nested name fields
"first_name": c.get("name", {}).get("first"),
"last_name": c.get("name", {}).get("last"),
# Contact fields
"email": c.get("contact", {}).get("email"),
"phone": c.get("contact", {}).get("phone"),
# Address fields (double nesting)
"city": c.get("contact", {}).get("address", {}).get("city"),
"state": c.get("contact", {}).get("address", {}).get("state"),
# Account fields
"is_premium": c.get("account", {}).get("is_premium"),
"signup_date": c.get("account", {}).get("signup_date"),
"lifetime_value": c.get("account", {}).get("lifetime_value"),
# Derived from array: aggregate order stats
"n_orders": len(c.get("orders", [])),
"total_orders": sum(o.get("amount", 0) for o in c.get("orders", [])),
"last_order_date": max(
(o.get("date") for o in c.get("orders", [])),
default=None
),
# Array field: join list to string for storage
"tags": "|".join(c.get("tags", []))
}
records.append(record)
df = pd.DataFrame(records)
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["last_order_date"] = pd.to_datetime(df["last_order_date"])
return df
df_customers = extract_customer_records(customers)
print(df_customers.dtypes)
print(df_customers.head())This approach is more verbose but gives complete control over field names, types, and derivations.
pandas JSON Integration
pandas has built-in JSON support through pd.read_json() and df.to_json().
pd.read_json(): Reading JSON Directly into DataFrames
import pandas as pd
# Read a JSON file containing a list of flat objects
# (the most common case for datasets)
df = pd.read_json("data/transactions.json")
# Read with type inference and date parsing
df = pd.read_json(
"data/transactions.json",
convert_dates=["transaction_date", "created_at"],
dtype={"amount": float, "customer_id": str}
)
# Read from a URL (directly from a web API or remote file)
df = pd.read_json("https://api.example.com/data/export.json")
# Different orientations — how the JSON is structured
# 'records': [{col: val, ...}] ← most common, one dict per row
df = pd.read_json("data.json", orient="records")
# 'index': {index: {col: val}} ← indexed by row label
df = pd.read_json("data.json", orient="index")
# 'columns': {col: {index: val}} ← column-indexed
df = pd.read_json("data.json", orient="columns")
# 'values': [[val, val, ...]] ← array of arrays
df = pd.read_json("data.json", orient="values")
# 'split': {index:[...], columns:[...], data:[[...]]} ← split format
df = pd.read_json("data.json", orient="split")df.to_json(): Writing DataFrames to JSON
import pandas as pd
df = pd.DataFrame({
"customer_id": ["CUST_001", "CUST_002", "CUST_003"],
"name": ["Jane Smith", "Bob Johnson", "Alice Williams"],
"total_spend": [384.96, 139.98, 269.97],
"signup_date": pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"])
})
# Default: records orientation — list of dicts
df.to_json("output/customers.json", orient="records", indent=2)
# Result:
# [
# {"customer_id": "CUST_001", "name": "Jane Smith", ...},
# ...
# ]
# With date formatting
df.to_json(
"output/customers.json",
orient="records",
indent=2,
date_format="iso", # ISO 8601 strings: "2022-03-14T00:00:00.000Z"
force_ascii=False # Allow non-ASCII characters in output
)
# Index orientation — rows indexed by row label
df.to_json("output/customers_indexed.json", orient="index", indent=2)
# Get JSON as string (not write to file)
json_str = df.to_json(orient="records")JSON Lines (JSONL): The Streaming Data Format
JSON Lines (also called JSONL or newline-delimited JSON) is a format where each line of a file is a separate, complete JSON object. It’s the standard format for streaming event data, log files, Kafka exports, and large datasets that need to be processed incrementally.
{"event_id": "EVT_001", "user_id": "U_001", "action": "purchase", "ts": "2024-09-15T10:23:11", "amount": 149.99}
{"event_id": "EVT_002", "user_id": "U_042", "action": "view", "ts": "2024-09-15T10:23:15", "product_id": "PROD_007"}
{"event_id": "EVT_003", "user_id": "U_001", "action": "review", "ts": "2024-09-15T10:24:02", "rating": 5}JSONL is superior to standard JSON for large datasets because:
- Appendable: New records are added by appending lines — no need to parse and rewrite the entire file
- Streamable: Process one record at a time without loading the whole file
- Fault-tolerant: A corrupt line affects only that line, not the whole file
Reading JSON Lines
import json
import pandas as pd
# Method 1: pandas — simplest approach
df = pd.read_json("data/events.jsonl", lines=True)
print(df.head())
# Method 2: Manual reading — useful for large files or complex filtering
records = []
with open("data/events.jsonl", "r", encoding="utf-8") as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue # Skip blank lines
try:
record = json.loads(line)
records.append(record)
except json.JSONDecodeError as e:
print(f"Warning: Invalid JSON on line {line_num}: {e}")
continue # Skip malformed lines, continue processing
df = pd.DataFrame(records)
# Method 3: Generator — memory-efficient for very large files
def read_jsonl(filepath: str):
"""Generator that yields one parsed JSON record per line."""
with open(filepath, "r", encoding="utf-8") as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
yield json.loads(line)
except json.JSONDecodeError as e:
print(f"Line {line_num} skipped: {e}")
# Process 10 million events without loading all into memory
total_revenue = 0.0
purchase_count = 0
for event in read_jsonl("data/events_2024.jsonl"):
if event.get("action") == "purchase":
total_revenue += event.get("amount", 0)
purchase_count += 1
print(f"Total revenue from {purchase_count:,} purchases: ${total_revenue:,.2f}")Writing JSON Lines
import json
events = [
{"event_id": "EVT_001", "user_id": "U_001", "action": "purchase", "amount": 149.99},
{"event_id": "EVT_002", "user_id": "U_042", "action": "view", "product_id": "PROD_007"},
{"event_id": "EVT_003", "user_id": "U_001", "action": "review", "rating": 5}
]
# Write JSONL — one JSON object per line, no trailing comma
with open("data/events.jsonl", "w", encoding="utf-8") as f:
for event in events:
f.write(json.dumps(event, ensure_ascii=False) + "\n")
# Append to an existing JSONL file (the key advantage of JSONL)
new_event = {"event_id": "EVT_004", "user_id": "U_007", "action": "signup"}
with open("data/events.jsonl", "a", encoding="utf-8") as f: # "a" = append mode
f.write(json.dumps(new_event) + "\n")
# pandas DataFrame → JSONL
df.to_json("output/predictions.jsonl", orient="records", lines=True)Processing Large JSON Files Without Loading Everything
When JSON files are too large to fit in memory (multi-gigabyte files), load them in chunks or use streaming parsers.
Chunked Reading with pandas
import pandas as pd
# Read JSONL in chunks — memory-efficient
chunk_size = 50_000
results = []
for chunk in pd.read_json("data/large_events.jsonl", lines=True, chunksize=chunk_size):
# Filter and aggregate each chunk before accumulating
purchase_chunk = chunk[chunk["action"] == "purchase"]
monthly_revenue = (
purchase_chunk
.assign(month=pd.to_datetime(chunk["ts"]).dt.to_period("M"))
.groupby("month")["amount"]
.sum()
)
results.append(monthly_revenue)
print(f"Processed chunk: {len(chunk):,} events")
# Combine chunk results
final_revenue = pd.concat(results).groupby(level=0).sum()
print(final_revenue)ijson: Streaming Parser for Large Standard JSON
For very large standard JSON files (not JSONL), ijson parses incrementally without loading the whole file:
pip install ijsonimport ijson
import pandas as pd
# Parse a large JSON file containing an array of records
# without loading the entire file into memory
def stream_large_json_array(filepath: str, array_prefix: str = "item"):
"""
Stream records from a large JSON file containing a top-level array.
Parameters
----------
filepath : str
Path to the JSON file.
array_prefix : str
ijson prefix for the array elements. 'item' for root-level arrays.
Yields
------
dict
One parsed record at a time.
"""
with open(filepath, "rb") as f: # Binary mode required for ijson
for record in ijson.items(f, array_prefix):
yield record
# Process a 2GB JSON file with millions of records
records_processed = 0
total_spend = 0.0
for record in stream_large_json_array("data/massive_transactions.json"):
total_spend += record.get("amount", 0)
records_processed += 1
if records_processed % 100_000 == 0:
print(f"Processed {records_processed:,} records...")
print(f"Processed {records_processed:,} records, total spend: ${total_spend:,.2f}")
# Or collect into a DataFrame in batches
BATCH_SIZE = 10_000
batch = []
for record in stream_large_json_array("data/massive_transactions.json"):
batch.append(record)
if len(batch) >= BATCH_SIZE:
df_batch = pd.DataFrame(batch)
process_batch(df_batch) # Your processing function
batch = []
# Process remaining records
if batch:
df_batch = pd.DataFrame(batch)
process_batch(df_batch)JSON Schema Validation
Before processing JSON data, validating that it conforms to an expected schema catches errors early and provides clear, actionable error messages instead of cryptic KeyErrors deep in your pipeline.
pip install jsonschemafrom jsonschema import validate, ValidationError, Draft7Validator
import json
# Define the expected schema
CUSTOMER_SCHEMA = {
"type": "object",
"required": ["customer_id", "name", "email"],
"properties": {
"customer_id": {
"type": "string",
"pattern": "^CUST_[0-9]+$" # Must match this regex
},
"name": {
"type": "object",
"required": ["first", "last"],
"properties": {
"first": {"type": "string", "minLength": 1},
"last": {"type": "string", "minLength": 1}
}
},
"email": {
"type": "string",
"format": "email"
},
"lifetime_value": {
"type": "number",
"minimum": 0
},
"is_premium": {
"type": "boolean"
},
"tags": {
"type": "array",
"items": {"type": "string"},
"uniqueItems": True
}
},
"additionalProperties": True # Allow extra fields
}
# Validate a single record
def validate_customer(customer: dict) -> list[str]:
"""
Validate a customer record against the schema.
Returns a list of validation error messages (empty if valid).
"""
validator = Draft7Validator(CUSTOMER_SCHEMA)
errors = [
f"{'.'.join(str(p) for p in error.absolute_path)}: {error.message}"
if error.absolute_path
else error.message
for error in validator.iter_errors(customer)
]
return errors
# Test with valid record
valid_customer = {
"customer_id": "CUST_001",
"name": {"first": "Jane", "last": "Smith"},
"email": "jane@email.com",
"lifetime_value": 3847.50,
"is_premium": True,
"tags": ["loyal", "high-value"]
}
errors = validate_customer(valid_customer)
print(f"Valid: {len(errors) == 0}") # Valid: True
# Test with invalid record
invalid_customer = {
"customer_id": "CUSTOMER_001", # Wrong format — should be CUST_XXX
"name": {"first": "Bob"}, # Missing 'last' field
"email": "not-an-email", # Invalid email format
"lifetime_value": -50.0, # Below minimum (0)
"is_premium": "yes" # Wrong type — should be boolean
}
errors = validate_customer(invalid_customer)
print("Validation errors:")
for error in errors:
print(f" - {error}")
# Validate all records in a file, collecting errors
def validate_json_file(filepath: str, schema: dict) -> tuple[list, list]:
"""
Validate all records in a JSONL file against a schema.
Returns (valid_records, error_records).
"""
valid_records = []
error_records = []
validator = Draft7Validator(schema)
with open(filepath, "r", encoding="utf-8") as f:
for line_num, line in enumerate(f, 1):
try:
record = json.loads(line.strip())
except json.JSONDecodeError as e:
error_records.append({"line": line_num, "error": f"JSON parse error: {e}"})
continue
errors = list(validator.iter_errors(record))
if errors:
error_records.append({
"line": line_num,
"record_id": record.get("customer_id", "unknown"),
"errors": [e.message for e in errors]
})
else:
valid_records.append(record)
print(f"Valid: {len(valid_records):,} | Invalid: {len(error_records):,}")
return valid_records, error_recordsCommon JSON Patterns in Data Science Workflows
Storing and Loading Model Metrics
import json
import datetime
import numpy as np
class DataScienceEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, (datetime.datetime, datetime.date)):
return obj.isoformat()
if isinstance(obj, (np.integer,)):
return int(obj)
if isinstance(obj, (np.floating,)):
return float(obj)
if isinstance(obj, np.ndarray):
return obj.tolist()
return super().default(obj)
# Save model run results
metrics = {
"experiment": "xgboost_v3_tuned",
"timestamp": datetime.datetime.now(),
"config": {
"n_estimators": 500,
"learning_rate": 0.05,
"random_seed": 42
},
"metrics": {
"roc_auc": np.float64(0.9142),
"f1_score": np.float64(0.847),
"precision": np.float64(0.856),
"recall": np.float64(0.839)
},
"feature_importance": {
"recency_days": np.float64(0.231),
"lifetime_value": np.float64(0.184),
"total_orders": np.float64(0.147)
}
}
with open("models/run_metrics.json", "w") as f:
json.dump(metrics, f, cls=DataScienceEncoder, indent=2)
# Load and compare runs
def load_run_metrics(filepath: str) -> dict:
with open(filepath, "r") as f:
return json.load(f)
run_metrics = load_run_metrics("models/run_metrics.json")
print(f"AUC-ROC: {run_metrics['metrics']['roc_auc']}")Config Files
import json
from pathlib import Path
# Load a JSON config file
def load_config(config_path: str = "configs/config.json") -> dict:
"""Load configuration from a JSON file."""
path = Path(config_path)
if not path.exists():
raise FileNotFoundError(f"Config file not found: {config_path}")
with open(path, "r", encoding="utf-8") as f:
config = json.load(f)
return config
# configs/config.json
config_template = {
"data": {
"input_path": "data/raw/transactions.csv",
"output_path": "data/processed/features.parquet",
"target_column": "churned"
},
"model": {
"algorithm": "xgboost",
"n_estimators": 500,
"learning_rate": 0.05,
"random_seed": 42
},
"evaluation": {
"test_size": 0.2,
"primary_metric": "roc_auc"
}
}
with open("configs/config.json", "w") as f:
json.dump(config_template, f, indent=2)
config = load_config("configs/config.json")
print(f"Algorithm: {config['model']['algorithm']}")
print(f"Random seed: {config['model']['random_seed']}")Accumulating Results to JSONL During a Pipeline
import json
import datetime
# Write results incrementally as a pipeline processes records
def score_customers_with_logging(customers: list, model, output_path: str):
"""Score customers and write results to JSONL as we go."""
with open(output_path, "w", encoding="utf-8") as f:
for customer in customers:
score = model.predict_proba([customer["features"]])[0][1]
result = {
"customer_id": customer["customer_id"],
"churn_score": round(float(score), 4),
"predicted_label": int(score >= 0.5),
"scored_at": datetime.datetime.now().isoformat()
}
f.write(json.dumps(result) + "\n")
print(f"Scored {len(customers):,} customers → {output_path}")JSON Best Practices for Data Scientists
Always Use UTF-8 Encoding
# Always specify encoding explicitly
with open("data.json", "r", encoding="utf-8") as f:
data = json.load(f)
with open("output.json", "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False) # ensure_ascii=False allows non-ASCIIUse .get() for All Optional Fields
# FRAGILE: KeyError if field is missing
city = record["contact"]["address"]["city"]
# ROBUST: Returns None (or default) if any level is missing
city = record.get("contact", {}).get("address", {}).get("city")
# Or with explicit defaults:
city = record.get("contact", {}).get("address", {}).get("city", "Unknown")Validate Before Processing
# Fail early with a clear message rather than failing deep in your pipeline
def load_and_validate(filepath: str, schema: dict) -> list:
with open(filepath) as f:
data = json.load(f)
if not isinstance(data, list):
raise ValueError(f"Expected JSON array at root, got {type(data).__name__}")
valid, invalid = validate_json_file(filepath, schema)
if len(invalid) > 0:
raise ValueError(
f"{len(invalid)} invalid records found. "
f"First error: {invalid[0]['errors'][0]}"
)
return validPrefer JSONL for Large Datasets
Use standard JSON for small datasets and configs (human-readable, easy to inspect). Use JSONL for large datasets (streamable, appendable, line-by-line processing).
Summary
JSON is the universal data interchange format of the web era, and fluency with it is a daily requirement for data scientists. Python’s json module provides precise, low-level control for reading and writing any JSON structure. pandas’ pd.read_json(), df.to_json(), and pd.json_normalize() handle the most common data science use cases — loading flat or moderately nested JSON into DataFrames and writing DataFrames back to JSON files.
The critical advanced skills that separate professional JSON handling from beginner usage are: handling non-serializable types (dates, numpy values) with custom encoders; flattening deeply nested structures with json_normalize() or manual extraction; working with the JSON Lines format for large and streaming datasets; processing gigabyte-scale files without loading everything into memory using generators or ijson; and validating data schemas before processing to catch errors at the boundary rather than deep in the pipeline.
With these skills, JSON becomes not just a format you can read when you see it, but a format you can work with systematically and professionally — reading from files, APIs, and streams; writing clean, reproducible outputs; and handling the full messiness of real-world nested and inconsistently structured data.
Key Takeaways
json.load(file)reads a JSON file into Python (dict or list);json.dump(obj, file)writes Python to a JSON file; addindent=2for human-readable output and always specifyencoding="utf-8"json.loads(string)andjson.dumps(obj)work with strings instead of files — useful for API responses, database columns, and environment variable configs- JSON doesn’t support dates, numpy types, or tuples — use a custom
JSONEncodersubclass or pre-convert these types before serializing pd.json_normalize()flattens nested JSON dicts into DataFrame columns using dot notation (or a customsep); userecord_pathto expand nested arrays into one-row-per-array-element format- Always use
.get()with fallback defaults for nested field access —record.get("a", {}).get("b", {}).get("c")— never chain[]indexing on JSON from external sources - JSON Lines (JSONL) format — one JSON object per line — is superior to standard JSON for large datasets and event streams: it’s appendable, streamable, and one corrupt record doesn’t break the whole file
- Use a generator function with
yield json.loads(line)to process large JSONL files one record at a time without loading everything into memory; useijsonfor streaming very large standard JSON files jsonschema.Draft7Validatorvalidates JSON data against a schema before processing — catch malformed data at the input boundary rather than deep in the pipeline








