Data formats are the containers that store and transport data — and choosing the right one dramatically affects storage size, read/write speed, and analytical performance. CSV is universal and human-readable but slow and large. JSON is flexible and web-native but wastes space with repeated keys. Parquet is a columnar binary format that compresses data 5–10× smaller than CSV and reads analytical queries 10–100× faster by loading only the columns you need. For production data pipelines and data science projects, Parquet has become the default format for large tabular datasets, while CSV remains the standard for small data exchange and interoperability.
Introduction
Every dataset you work with lives in a file — or came from one. The format of that file determines how fast it reads, how much disk space it uses, whether it preserves data types, how easily other tools can read it, and whether it can store complex nested data or only flat tables.
As a data scientist, you’ll encounter a surprisingly wide variety of data formats. A stakeholder sends you a CSV. An API returns JSON. The data warehouse exports to Parquet. A legacy system produces XML. A deep learning researcher shares data in HDF5. The ML engineer asks you to save your model in pickle or joblib format. A Kafka stream writes Avro. Each format exists for a reason, and understanding those reasons helps you make better decisions about how to store and exchange your own data.
This article provides the conceptual foundation and practical guidance you need: what each major format is, how it works under the hood, when to use it, and how to read and write it in Python. By the end, you’ll be able to look at a data exchange situation and know instantly which format fits.
Two Fundamental Distinctions: Text vs. Binary and Row vs. Column
Before diving into specific formats, two axes help organize the entire format landscape.
Text Formats vs. Binary Formats
Text formats (CSV, JSON, XML) store data as human-readable characters. You can open them in any text editor and read the contents. They’re universally compatible — any tool, language, or operating system can work with plain text. But they’re inefficient: the number 1234567.89 takes 10 characters (bytes) in a text file, versus 8 bytes as a double-precision float. Repetition in keys (JSON) or structure (XML) wastes enormous space.
Binary formats (Parquet, Avro, HDF5, Feather, Pickle) encode data as compact binary data. You can’t read them in a text editor. They require specific libraries to read. But they’re far more efficient: the same number might take 4 bytes as an int32 or 4 bytes as a float32, and with compression they might take 1–2 bytes. And binary formats preserve exact data types — a float is a float, not a string that needs to be parsed back.
Row-Oriented vs. Column-Oriented Storage
This is the most important structural distinction for data science performance.
Row-oriented formats (CSV, JSON, Avro) store all fields of one record together, then all fields of the next record, and so on:
Record 1: [id=CUST_001, name="Jane", city="Austin", spend=384.96]
Record 2: [id=CUST_002, name="Bob", city="Seattle", spend=139.98]
Record 3: [id=CUST_003, name="Alice", city="Chicago", spend=269.97]To read just the spend column from a 10-million-row file, a row-oriented reader must scan every single record — reading and discarding id, name, and city for each row just to get to spend.
Column-oriented formats (Parquet, ORC, Feather) store all values of one column together, then all values of the next column:
id column: [CUST_001, CUST_002, CUST_003, ...]
name column: [Jane, Bob, Alice, ...]
city column: [Austin, Seattle, Chicago, ...]
spend column: [384.96, 139.98, 269.97, ...]To read just spend, the columnar reader jumps directly to the spend column data and reads only that — skipping the other columns entirely. A query that reads 3 columns out of 100 reads about 3% of the file instead of 100%.
Column storage also enables dramatic compression because similar values are stored together — a column of country codes like ["USA", "USA", "UK", "USA", "CA", "USA"] compresses far better when all the USA values are adjacent than when they’re scattered across millions of rows.
CSV: The Universal Standard
CSV (Comma-Separated Values) is the simplest and most universal data format. It stores tabular data as plain text with commas separating fields and newlines separating records:
customer_id,name,city,country,total_spend,is_premium
CUST_001,Jane Smith,Austin,USA,384.96,True
CUST_002,Bob Johnson,Seattle,USA,139.98,False
CUST_003,Alice Williams,Chicago,USA,269.97,TrueWhat Makes CSV Good
- Universal compatibility: Every tool, language, and operating system reads CSV — Excel, pandas, R, Spark, databases, text editors,
cat, everything - Human-readable: Open any CSV in a text editor and immediately understand the data
- Simple to generate: Any system can write CSV by appending comma-separated lines
- Good for small data exchange: When sharing small datasets with non-technical stakeholders, CSV is the right choice
What Makes CSV Poor
- No type information: Every value is a string. Is
"True"a boolean? Is"123"an integer or a string? Is"2024-09-15"a date? The reader must guess or be explicitly told. - No schema: No formal definition of column names, types, or constraints. Two CSVs of the “same” dataset can have different column names, different column orders, different null representations.
- Slow for analytical queries: Row-oriented, text-based. Reading one column from a 10-million-row CSV requires parsing every line.
- No native support for nested data: A customer’s list of orders can’t be represented natively — you either join them into separate CSVs or serialize the nested structure into a string.
- Encoding ambiguity: UTF-8? Latin-1? Windows-1252? CSVs don’t declare their encoding.
- Compression is external: CSV can be gzip-compressed (
.csv.gz) but the compression is not aware of the data structure.
CSV in Python
import pandas as pd
# Read
df = pd.read_csv("data/customers.csv")
# Read with explicit types and options
df = pd.read_csv(
"data/customers.csv",
dtype={"customer_id": str, "zip_code": str},
parse_dates=["signup_date"],
na_values=["N/A", "-", "null"],
encoding="utf-8"
)
# Write
df.to_csv("output/customers.csv", index=False, encoding="utf-8")
# Write compressed
df.to_csv("output/customers.csv.gz", index=False, compression="gzip")When to Use CSV
Use CSV when: the data is small, you need maximum compatibility, you’re sharing data with non-technical users or tools that can’t read other formats, or the data has a simple flat structure. Avoid CSV for: large datasets (> a few hundred MB), performance-sensitive pipelines, data with complex types or nested structures, or anything you’ll read many times.
JSON: Flexibility and Web-Native Data
JSON (JavaScript Object Notation) stores data as nested key-value structures:
[
{
"customer_id": "CUST_001",
"name": "Jane Smith",
"is_premium": true,
"total_spend": 384.96,
"signup_date": "2022-03-14",
"orders": [
{"order_id": "ORD_1001", "amount": 199.98},
{"order_id": "ORD_1003", "amount": 149.99}
]
}
]What Makes JSON Good
- Native nested/hierarchical data: Represents the actual structure of data — a customer with a list of orders, an order with a list of items — without artificial flattening
- Web-native: The universal format for APIs; every REST API returns JSON
- Flexible schema: Each record can have different fields; missing fields are simply absent
- Readable types:
true/falsefor booleans, numbers without quotes,nullfor missing — better than CSV’s everything-is-a-string
What Makes JSON Poor
- Verbose with repetition: Every record repeats all the field names. A 1-million-row dataset with 10 fields repeats every field name 1 million times — potentially more bytes than the actual data.
- Row-oriented: Like CSV, reading one field from a large JSON array requires parsing every record.
- No date type: Dates are strings — ISO format by convention but not enforced.
- Large file size: Generally 2-5× larger than equivalent Parquet for tabular data.
- Slow parsing at scale: Parsing JSON at 100 million records is slow compared to binary formats.
JSON Lines (JSONL) for Large Data
For large datasets, JSONL (one JSON object per line) is better than a single JSON array — it’s appendable, streamable, and one corrupt line doesn’t break the whole file:
{"customer_id": "CUST_001", "name": "Jane Smith", "spend": 384.96}
{"customer_id": "CUST_002", "name": "Bob Johnson", "spend": 139.98}
{"customer_id": "CUST_003", "name": "Alice Williams", "spend": 269.97}import pandas as pd
# Read JSON
df = pd.read_json("data/customers.json")
df = pd.read_json("data/events.jsonl", lines=True) # JSONL
# Write JSON
df.to_json("output/customers.json", orient="records", indent=2)
df.to_json("output/events.jsonl", orient="records", lines=True)When to Use JSON
Use JSON when: the data is hierarchical or nested, the data comes from or goes to a web API, you need schema flexibility (different records with different fields), or the data is small to medium sized. Use JSONL for streaming/event data. Avoid JSON for: large flat tabular datasets where Parquet is dramatically better.
Parquet: The Modern Standard for Data Science
Apache Parquet is a columnar binary format designed for analytical workloads. It was developed by Twitter and Cloudera and has become the standard format for data lakes, cloud data warehouses (BigQuery, Redshift, Snowflake use it internally), and data science pipelines.
How Parquet Works
Parquet stores data in three tiers:
- File: Contains one or more row groups
- Row group: A horizontal partition of the data (e.g., 128MB chunks)
- Column chunk: Within each row group, each column’s data is stored together
Within each column chunk, data is:
- Dictionary-encoded: Repeated values (like country codes, category names) are stored as integers pointing to a lookup table
- Compressed: Using Snappy (fast), Gzip (small), or Zstandard (balanced) compression
- Type-preserved: Each column has an explicit data type — int32, float64, string, date, timestamp, etc.
The file footer stores metadata about every column: min/max values, null counts, and byte offsets. This enables predicate pushdown — when you filter WHERE country = 'USA', Parquet skips entire row groups where the min/max metadata proves there are no USA records, without even reading that data.
Parquet Performance in Practice
The numbers are striking:
| Operation | CSV | Parquet | Speedup |
|---|---|---|---|
| File size (1M rows, 20 cols) | ~200 MB | ~25 MB | 8× smaller |
| Read all columns | 12 seconds | 1.8 seconds | 7× faster |
| Read 3 of 20 columns | 12 seconds | 0.4 seconds | 30× faster |
| Filter to 5% of rows | 12 seconds | 0.3 seconds | 40× faster |
| Write | 8 seconds | 2 seconds | 4× faster |
(Approximate figures; actual results vary by data characteristics and hardware.)
Parquet in Python
import pandas as pd
# Read Parquet (requires pyarrow or fastparquet)
df = pd.read_parquet("data/customers.parquet")
# Read only specific columns (key Parquet advantage)
df = pd.read_parquet("data/customers.parquet",
columns=["customer_id", "total_spend", "signup_date"])
# Read with filters (predicate pushdown — skips irrelevant row groups)
df = pd.read_parquet(
"data/transactions.parquet",
filters=[
("transaction_date", ">=", "2024-01-01"),
("country", "==", "USA"),
("amount", ">", 100.0)
]
)
# Write Parquet
df.to_parquet("output/customers.parquet", index=False)
# Write with explicit compression (default is snappy)
df.to_parquet("output/customers.parquet",
engine="pyarrow",
compression="gzip", # "snappy", "gzip", "brotli", "zstd", "none"
index=False)
# Partition data by a column (creates a directory structure)
# data/transactions/year=2024/month=9/part-0.parquet
df.to_parquet("data/transactions/",
partition_cols=["year", "month"],
engine="pyarrow")
# Read a partitioned dataset
df = pd.read_parquet("data/transactions/") # Reads all partitions
df = pd.read_parquet("data/transactions/",
filters=[("year", "==", 2024), ("month", "==", 9)])Parquet with pyarrow Directly
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# Read a Parquet file's sc.hema without loading data
schema = pq.read_schema("data/customers.parquet")
print(schema)
# customer_id: string
# name: string
# city: string
# total_spend: double
# signup_date: date32[day]
# is_premium: bool
# Read metadata
metadata = pq.read_metadata("data/customers.parquet")
print(f"Row groups: {metadata.num_row_groups}")
print(f"Total rows: {metadata.num_rows}")
print(f"Serialized size: {metadata.serialized_size:,} bytes")
# Read specific row group
table = pq.read_table("data/customers.parquet",
columns=["customer_id", "total_spend"])
df = table.to_pandas()
# Write with explicit sc.hema
schema = pa.sc.hema([
pa.field("customer_id", pa.string()),
pa.field("name", pa.string()),
pa.field("total_spend", pa.float64()),
pa.field("signup_date", pa.date32()),
pa.field("is_premium", pa.bool_())
])
table = pa.Table.from_pandas(df, sc.hema=sc.hema)
pq.write_table(table, "output/customers_typed.parquet",
compression="snappy")When to Use Parquet
Use Parquet when: the data is large (> a few hundred MB), you’ll run analytical queries (groupby, filter, aggregation), you need to preserve exact data types, you’re building a data pipeline or data lake, or the data will be read many times (the write cost is paid once, read benefits are repeated). Parquet is essentially the default format for all large tabular data science work today.
Avro: Row-Oriented Binary with Schema Evolution
Apache Avro is a row-oriented binary serialization format developed for Apache Hadoop. Unlike Parquet’s analytical focus, Avro’s design priorities are: compact row serialization, schema evolution (reading data written with an old schema using a new schema), and streaming compatibility.
Avro’s Key Characteristics
- Schema included in file: The JSON schema is embedded in the file header — every Avro file is self-describing
- Schema evolution: Fields can be added or removed between schema versions; old data can be read with new schema using default values
- Row-oriented: Optimized for writing individual records (ideal for streaming/Kafka)
- Compact binary: Efficient encoding, especially for records with many null fields
Avro is the standard format for Apache Kafka messages and streaming data pipelines. When data flows through Kafka, each message is typically Avro-encoded with the schema stored in a Schema Registry.
# pip install fastavro
import fastavro
import io
schema = {
"type": "record",
"name": "Customer",
"fields": [
{"name": "customer_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "total_spend", "type": "double"},
{"name": "is_premium", "type": "boolean"},
{"name": "signup_date", "type": ["null", "string"], "default": None}
]
}
records = [
{"customer_id": "CUST_001", "name": "Jane Smith",
"total_spend": 384.96, "is_premium": True, "signup_date": "2022-03-14"},
{"customer_id": "CUST_002", "name": "Bob Johnson",
"total_spend": 139.98, "is_premium": False, "signup_date": None}
]
# Write Avro
with open("output/customers.avro", "wb") as f:
fastavro.writer(f, fastavro.parse_schema(schema), records)
# Read Avro into list of dicts
with open("output/customers.avro", "rb") as f:
records_back = list(fastavro.reader(f))
import pandas as pd
df = pd.DataFrame(records_back)When to Use Avro
Use Avro when: data flows through Kafka or other streaming systems, you need schema evolution (the schema will change over time), you’re building write-heavy pipelines where row-at-a-time appending is needed, or you need the schema embedded in the data itself for long-term archival. Avro is not ideal for analytical queries — use Parquet for that.
ORC: Columnar Binary for Hive and Spark
ORC (Optimized Row Columnar) is a columnar binary format developed by Hortonworks for Apache Hive, the SQL-on-Hadoop query engine. ORC and Parquet fill similar niches — columnar, binary, compressed, type-aware — but ORC has stronger integration with Hive and ORC’s bloom filters and lighter-weight compression can be advantageous in Spark workloads.
import pandas as pd
# Requires pyarrow
df = pd.read_orc("data/transactions.orc")
df.to_orc("output/transactions.orc")When to use ORC: Primarily in Hive-based data warehouses and Spark SQL workloads where your team or infrastructure has already standardized on ORC. For new projects, Parquet has broader ecosystem support and is equally performant.
Feather: Ultra-Fast Local Data Exchange
Feather (also called Arrow IPC format) is a binary columnar format based on Apache Arrow’s in-memory data representation. Its design goal is a single thing: maximum I/O speed for reading and writing between Python, R, and other Arrow-compatible tools.
import pandas as pd
# Read (requires pyarrow)
df = pd.read_feather("data/features.feather")
# Write
df.to_feather("output/features.feather")Feather trades compression for speed. It writes data with minimal transformation from Arrow’s in-memory format — essentially writing the memory layout directly to disk. This makes Feather reads and writes faster than any other format, often several times faster than Parquet.
| Format | Write Speed | Read Speed | File Size |
|---|---|---|---|
| CSV | Slow | Slow | Large |
| Parquet (snappy) | Fast | Fast | Small |
| Feather | Fastest | Fastest | Medium (uncompressed) |
| Pickle | Fast | Fast | Medium |
When to use Feather: For saving intermediate results during iterative development — saving the output of an expensive feature engineering step so you can quickly reload it without re-running all the computation. Not recommended for long-term storage or data sharing (it’s a transient format).
HDF5: Scientific and Hierarchical Data
HDF5 (Hierarchical Data Format, version 5) is a binary format from the scientific computing world, capable of storing multiple datasets, arrays of any shape and type, and rich metadata in a single file. Think of it as a filesystem inside a file — directories (groups) containing datasets (arrays or tables).
import pandas as pd
# Write multiple DataFrames to one HDF5 file
with pd.HDFStore("data/project_data.h5", mode="w") as store:
store["customers"] = df_customers
store["orders"] = df_orders
store["features"] = df_features
# Read back
with pd.HDFStore("data/project_data.h5", mode="r") as store:
print(store.keys()) # ['/customers', '/orders', '/features']
df = store["customers"]
df_filtered = store.select("orders", where="status == 'delivered'")
# Direct read/write
df.to_hdf("data/customers.h5", key="customers", mode="w")
df = pd.read_hdf("data/customers.h5", key="customers")
# With PyTables for efficient querying
df.to_hdf("data/transactions.h5", key="transactions",
mode="w", format="table", # "table" enables querying; "fixed" is faster but no queries
data_columns=True)
# Query without loading everything
df_large = pd.read_hdf(
"data/transactions.h5", key="transactions",
where="amount > 100 and country == 'USA'"
)HDF5 is the standard format for deep learning dataset storage (NumPy arrays, image tensors) and scientific simulation output. Its ability to store hierarchical arrays of arbitrary shape makes it ideal for scenarios where data isn’t naturally tabular.
When to use HDF5: Multi-dimensional arrays (images, audio, time series with multiple sensors), deep learning datasets, scientific computing, storing multiple related datasets in one file. For simple 2D tabular data, Parquet is simpler and faster.
Pickle: Python-Specific Object Serialization
Pickle is Python’s built-in object serialization format. It can serialize virtually any Python object — not just DataFrames, but trained scikit-learn models, custom classes, nested data structures, anything Python can represent.
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# Save a trained model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
with open("models/rf_churn_model.pkl", "wb") as f:
pickle.dump(model, f)
# Load the model back
with open("models/rf_churn_model.pkl", "rb") as f:
model_loaded = pickle.load(f)
predictions = model_loaded.predict(X_test)
# Pickle DataFrames (works but Parquet/Feather are usually better)
df.to_pickle("data/features.pkl")
df_loaded = pd.read_pickle("data/features.pkl")
# joblib is better than pickle for large numpy arrays (models)
import joblib
joblib.dump(model, "models/rf_churn_model.joblib")
model_loaded = joblib.load("models/rf_churn_model.joblib")Important Security Warning
Never unpickle data from untrusted sources. Pickle can execute arbitrary Python code during deserialization — a malicious .pkl file can compromise your system. Only pickle/unpickle objects from sources you control.
When to use Pickle: Saving trained ML models for deployment (alongside documentation of the model’s expected input schema), caching intermediate Python objects during development. Never for long-term data storage, cross-language compatibility, or data received from external sources.
Format Comparison: A Comprehensive Reference
| Format | Type | Structure | Size | Read Speed | Write Speed | Type Safety | Nested Data | Best For |
|---|---|---|---|---|---|---|---|---|
| CSV | Text | Row | Large | Slow | Slow | None | No | Small data exchange |
| JSON | Text | Row | Large | Slow | Slow | Partial | Yes | APIs, nested data |
| JSONL | Text | Row | Large | Slow | Fast | Partial | Yes | Streaming, event logs |
| Parquet | Binary | Column | Small | Fast | Fast | Full | Limited | Large analytical data |
| Avro | Binary | Row | Small | Medium | Fast | Full | Yes | Kafka, streaming |
| ORC | Binary | Column | Small | Fast | Fast | Full | Limited | Hive/Spark |
| Feather | Binary | Column | Medium | Fastest | Fastest | Full | No | Dev/intermediate |
| HDF5 | Binary | Hierarchical | Small | Fast | Fast | Full | Yes | Arrays, science |
| Pickle | Binary | Object | Medium | Fast | Fast | Full | Yes | ML models, Python objs |
Choosing the Right Format: A Decision Guide
The right format depends on five factors: data size, usage frequency, structural complexity, sharing requirements, and performance needs.
By Data Size
Small data (< 10 MB): CSV or JSON — simplicity and compatibility outweigh performance. The difference between formats at this scale is milliseconds.
Medium data (10 MB – 1 GB): Parquet is the clear winner — meaningful speed and size improvements over CSV, and all tools support it. HDF5 if the data is multi-dimensional.
Large data (> 1 GB): Parquet is essential. With partitioning and predicate pushdown, analytical queries on 100 GB Parquet datasets complete in seconds where equivalent CSV queries might take hours.
By Use Case
Sharing data with non-technical stakeholders
→ CSV (they can open it in Excel)
Web API response or data from an API
→ JSON (web-native, handles nested structure)
Streaming/Kafka pipeline
→ Avro (schema evolution, row-at-a-time writing)
Large analytical dataset in a data pipeline
→ Parquet (columnar, compressed, fast queries)
Intermediate results during ML development
→ Feather (fastest reload) or Parquet (if you'll share it)
Trained ML model artifact
→ Pickle or joblib (Python-specific) or ONNX (cross-platform)
Multi-dimensional arrays (images, tensors, time series)
→ HDF5 or NumPy .npy / .npz
Long-term archival of production data
→ Parquet (well-supported, open standard, compressed)A Practical Decision Tree
Does the data have nested/hierarchical structure?
├── Yes → JSON (small/medium) or Avro (streaming)
└── No → Is it going to a non-technical user or legacy system?
├── Yes → CSV
└── No → Is it large (> 100MB) or queried repeatedly?
├── Yes → Parquet
└── No → Feather (dev) or Parquet (production)Real-World Format Benchmarks
Here’s a concrete comparison on a representative dataset — 5 million rows, 15 columns (mix of strings, integers, floats, dates):
import pandas as pd
import time
import os
# Generate synthetic dataset
n = 5_000_000
df = pd.DataFrame({
"customer_id": [f"CUST_{i:07d}" for i in range(n)],
"transaction_date": pd.date_range("2020-01-01", periods=n, freq="1min"),
"amount": pd.Series(range(n), dtype="float64") * 0.01 + 10.0,
"country": pd.Categorical(["USA"] * (n//2) + ["UK"] * (n//4) + ["CA"] * (n//4)),
"status": pd.Categorical(["completed"] * int(n*0.8) + ["pending"] * int(n*0.2)),
"product_id": [f"PROD_{i % 1000:04d}" for i in range(n)],
"quantity": pd.Series(range(1, n+1), dtype="int32") % 10 + 1,
})
def benchmark_format(df, write_path, write_fn, read_fn, format_name):
# Write
t0 = time.time()
write_fn(df, write_path)
write_time = time.time() - t0
# File size
size_mb = os.path.getsize(write_path) / (1024 ** 2)
# Read
t0 = time.time()
df_back = read_fn(write_path)
read_time = time.time() - t0
print(f"{format_name:12s} | Size: {size_mb:6.1f} MB | "
f"Write: {write_time:.2f}s | Read: {read_time:.2f}s")
return size_mb, write_time, read_time
print(f"{'Format':12s} | {'Size':^12s} | {'Write':^10s} | {'Read':^10s}")
print("-" * 55)
benchmark_format(df, "/tmp/test.csv",
lambda d, p: d.to_csv(p, index=False),
lambda p: pd.read_csv(p),
"CSV")
benchmark_format(df, "/tmp/test.parquet",
lambda d, p: d.to_parquet(p, index=False),
lambda p: pd.read_parquet(p),
"Parquet(snappy)")
benchmark_format(df, "/tmp/test.parquet.gz",
lambda d, p: d.to_parquet(p, index=False, compression="gzip"),
lambda p: pd.read_parquet(p),
"Parquet(gzip)")
benchmark_format(df, "/tmp/test.feather",
lambda d, p: d.to_feather(p),
lambda p: pd.read_feather(p),
"Feather")Typical results (approximate):
| Format | Size | Write | Read | Notes |
|---|---|---|---|---|
| CSV | 420 MB | 38s | 25s | Baseline |
| Parquet (snappy) | 48 MB | 4.2s | 1.8s | 9× smaller, 14× faster read |
| Parquet (gzip) | 28 MB | 7.1s | 2.2s | 15× smaller, 11× faster read |
| Feather | 95 MB | 1.1s | 0.9s | Fastest I/O, larger than Parquet |
Converting Between Formats
import pandas as pd
# CSV → Parquet (the most common conversion)
def csv_to_parquet(csv_path: str, parquet_path: str,
dtype: dict = None, parse_dates: list = None):
"""Convert a CSV file to Parquet with proper type handling."""
df = pd.read_csv(csv_path, dtype=dtype, parse_dates=parse_dates)
df.to_parquet(parquet_path, index=False, compression="snappy")
csv_mb = os.path.getsize(csv_path) / (1024**2)
parquet_mb = os.path.getsize(parquet_path) / (1024**2)
print(f"Converted: {csv_mb:.1f}MB CSV → {parquet_mb:.1f}MB Parquet "
f"({csv_mb/parquet_mb:.1f}× compression)")
# JSON/JSONL → Parquet
def jsonl_to_parquet(jsonl_path: str, parquet_path: str):
df = pd.read_json(jsonl_path, lines=True)
df.to_parquet(parquet_path, index=False)
# Parquet → CSV (for sharing with stakeholders)
def parquet_to_csv(parquet_path: str, csv_path: str,
columns: list = None):
df = pd.read_parquet(parquet_path, columns=columns)
df.to_csv(csv_path, index=False)
# Multiple CSVs → Single Parquet (common data lake ingestion pattern)
def combine_csvs_to_parquet(csv_dir: str, parquet_path: str,
pattern: str = "*.csv"):
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq
writer = None
total_rows = 0
for csv_file in sorted(Path(csv_dir).glob(pattern)):
df = pd.read_csv(csv_file)
table = pa.Table.from_pandas(df, preserve_index=False)
if writer is None:
writer = pq.ParquetWriter(parquet_path, table.schema)
writer.write_table(table)
total_rows += len(df)
print(f" Added {csv_file.name}: {len(df):,} rows")
if writer:
writer.close()
print(f"Combined {total_rows:,} total rows → {parquet_path}")
combine_csvs_to_parquet("data/raw/monthly_exports/", "data/processed/all_transactions.parquet")Format Best Practices for Data Science Projects
Standardize on Parquet for All Large Tabular Data
data/
├── raw/ ← Original data as received (preserve original format)
│ ├── customers_2024.csv ← Keep original for audit trail
│ └── orders_api.jsonl ← Keep original for audit trail
├── processed/ ← Converted and cleaned (always Parquet)
│ ├── customers.parquet
│ └── orders.parquet
└── features/ ← Feature engineered data (Parquet or Feather)
└── training_features.parquetAlways Preserve the Original
Never overwrite raw data. Convert it to Parquet for performance, but keep the original for debugging and auditing.
Write Schema Documentation
Parquet preserves types but not business semantics. Document what each column means:
# Write schema as a sidecar JSON file
import json
schema_doc = {
"description": "Customer features for churn model",
"created": "2024-09-15",
"source": "analytics.customers + orders tables",
"columns": {
"customer_id": {"type": "string", "description": "Unique customer identifier"},
"recency_days": {"type": "int32", "description": "Days since last order"},
"lifetime_value": {"type": "float64","description": "Total spend in USD, all time"},
"is_premium": {"type": "bool", "description": "Premium tier membership flag"}
}
}
with open("data/features/training_features_schema.json", "w") as f:
json.dump(schema_doc, f, indent=2)Use Compression Strategically
- Snappy (default for Parquet): Fastest decompression, good compression ratio — best for frequently read data
- Gzip: Best compression ratio, slower decompression — best for archival or infrequently read data
- Zstandard (zstd): Best balance of compression and speed — best overall for production pipelines
Summary
Data formats are not an afterthought — they are a core technical decision that affects every part of a data science workflow. The choice between CSV and Parquet for a 10-million-row dataset can mean the difference between a pipeline that takes 5 minutes and one that takes 5 seconds. The choice between JSON and Avro for a Kafka stream determines whether your pipeline handles schema changes gracefully or breaks.
The practical hierarchy for data scientists is: receive data in whatever format it arrives (preserve the original), convert to Parquet as the canonical format for all large tabular data, use Feather for fast local intermediate results, use JSON for API data and nested structures, use Pickle or joblib for ML model artifacts, and use HDF5 for multi-dimensional arrays.
The most impactful format decision most data scientists can make right now is simply replacing CSV with Parquet as their default format for large datasets. The combination of smaller file size, faster reads, column pruning, and type preservation makes it strictly better for large analytical data — with no real downsides except needing Python to read it rather than a text editor.
Key Takeaways
- Row-oriented formats (CSV, JSON, Avro) store all fields of a record together — efficient for writing one record at a time and reading complete records, but slow for analytical queries that read a few columns from many rows
- Column-oriented formats (Parquet, ORC, Feather) store all values of one column together — dramatically faster for analytical queries, better compression, and able to skip irrelevant columns entirely
- CSV is universal and human-readable but has no type information, is slow for large data, and wastes space — use it only for small data exchange with non-technical users or legacy systems
- Parquet is the default choice for large tabular data in data science: 5–15× smaller than CSV, 10–40× faster for analytical queries, preserves exact data types, and is supported by every major data tool
- JSON excels at hierarchical/nested data and web APIs; JSONL (one object per line) is better than a large JSON array for large datasets because it’s appendable and streamable
- Avro is the standard for Kafka and streaming pipelines — row-oriented binary with schema evolution capability; Feather is the fastest format for local intermediate results but not intended for long-term storage
- Parquet compression: Snappy (fast decompression, good compression — default), Gzip (best compression, slower), Zstandard (best balance for production) — always compress Parquet files
- The practical workflow: preserve original data in whatever format it arrives, convert to Parquet for all large processing, convert back to CSV only when sharing with non-technical recipients








