Understanding Data Formats: CSV, JSON, Parquet, and More

Learn the key data formats used in data science: CSV, JSON, Parquet, Avro, ORC, HDF5, Feather, and Pickle. Understand when to use each and how they affect performance and storage.

By Techietory on May 24, 2026

Understanding Data Formats: CSV, JSON, Parquet, and More

Data formats are the containers that store and transport data — and choosing the right one dramatically affects storage size, read/write speed, and analytical performance. CSV is universal and human-readable but slow and large. JSON is flexible and web-native but wastes space with repeated keys. Parquet is a columnar binary format that compresses data 5–10× smaller than CSV and reads analytical queries 10–100× faster by loading only the columns you need. For production data pipelines and data science projects, Parquet has become the default format for large tabular datasets, while CSV remains the standard for small data exchange and interoperability.

Introduction

Every dataset you work with lives in a file — or came from one. The format of that file determines how fast it reads, how much disk space it uses, whether it preserves data types, how easily other tools can read it, and whether it can store complex nested data or only flat tables.

As a data scientist, you’ll encounter a surprisingly wide variety of data formats. A stakeholder sends you a CSV. An API returns JSON. The data warehouse exports to Parquet. A legacy system produces XML. A deep learning researcher shares data in HDF5. The ML engineer asks you to save your model in pickle or joblib format. A Kafka stream writes Avro. Each format exists for a reason, and understanding those reasons helps you make better decisions about how to store and exchange your own data.

This article provides the conceptual foundation and practical guidance you need: what each major format is, how it works under the hood, when to use it, and how to read and write it in Python. By the end, you’ll be able to look at a data exchange situation and know instantly which format fits.

Two Fundamental Distinctions: Text vs. Binary and Row vs. Column

Before diving into specific formats, two axes help organize the entire format landscape.

Text Formats vs. Binary Formats

Text formats (CSV, JSON, XML) store data as human-readable characters. You can open them in any text editor and read the contents. They’re universally compatible — any tool, language, or operating system can work with plain text. But they’re inefficient: the number 1234567.89 takes 10 characters (bytes) in a text file, versus 8 bytes as a double-precision float. Repetition in keys (JSON) or structure (XML) wastes enormous space.

Binary formats (Parquet, Avro, HDF5, Feather, Pickle) encode data as compact binary data. You can’t read them in a text editor. They require specific libraries to read. But they’re far more efficient: the same number might take 4 bytes as an int32 or 4 bytes as a float32, and with compression they might take 1–2 bytes. And binary formats preserve exact data types — a float is a float, not a string that needs to be parsed back.

Row-Oriented vs. Column-Oriented Storage

This is the most important structural distinction for data science performance.

Row-oriented formats (CSV, JSON, Avro) store all fields of one record together, then all fields of the next record, and so on:

Plaintext

Record 1: [id=CUST_001, name="Jane", city="Austin", spend=384.96]
Record 2: [id=CUST_002, name="Bob",  city="Seattle", spend=139.98]
Record 3: [id=CUST_003, name="Alice", city="Chicago", spend=269.97]

Record 1: [id=CUST_001, name="Jane", city="Austin", spend=384.96]
Record 2: [id=CUST_002, name="Bob",  city="Seattle", spend=139.98]
Record 3: [id=CUST_003, name="Alice", city="Chicago", spend=269.97]

To read just the spend column from a 10-million-row file, a row-oriented reader must scan every single record — reading and discarding id, name, and city for each row just to get to spend.

Column-oriented formats (Parquet, ORC, Feather) store all values of one column together, then all values of the next column:

Plaintext

id column:    [CUST_001, CUST_002, CUST_003, ...]
name column:  [Jane, Bob, Alice, ...]
city column:  [Austin, Seattle, Chicago, ...]
spend column: [384.96, 139.98, 269.97, ...]

id column:    [CUST_001, CUST_002, CUST_003, ...]
name column:  [Jane, Bob, Alice, ...]
city column:  [Austin, Seattle, Chicago, ...]
spend column: [384.96, 139.98, 269.97, ...]

To read just spend, the columnar reader jumps directly to the spend column data and reads only that — skipping the other columns entirely. A query that reads 3 columns out of 100 reads about 3% of the file instead of 100%.

Column storage also enables dramatic compression because similar values are stored together — a column of country codes like ["USA", "USA", "UK", "USA", "CA", "USA"] compresses far better when all the USA values are adjacent than when they’re scattered across millions of rows.

CSV: The Universal Standard

CSV (Comma-Separated Values) is the simplest and most universal data format. It stores tabular data as plain text with commas separating fields and newlines separating records:

Plaintext

customer_id,name,city,country,total_spend,is_premium
CUST_001,Jane Smith,Austin,USA,384.96,True
CUST_002,Bob Johnson,Seattle,USA,139.98,False
CUST_003,Alice Williams,Chicago,USA,269.97,True

customer_id,name,city,country,total_spend,is_premium
CUST_001,Jane Smith,Austin,USA,384.96,True
CUST_002,Bob Johnson,Seattle,USA,139.98,False
CUST_003,Alice Williams,Chicago,USA,269.97,True

What Makes CSV Good

Universal compatibility: Every tool, language, and operating system reads CSV — Excel, pandas, R, Spark, databases, text editors, cat, everything
Human-readable: Open any CSV in a text editor and immediately understand the data
Simple to generate: Any system can write CSV by appending comma-separated lines
Good for small data exchange: When sharing small datasets with non-technical stakeholders, CSV is the right choice

What Makes CSV Poor

No type information: Every value is a string. Is "True" a boolean? Is "123" an integer or a string? Is "2024-09-15" a date? The reader must guess or be explicitly told.
No schema: No formal definition of column names, types, or constraints. Two CSVs of the “same” dataset can have different column names, different column orders, different null representations.
Slow for analytical queries: Row-oriented, text-based. Reading one column from a 10-million-row CSV requires parsing every line.
No native support for nested data: A customer’s list of orders can’t be represented natively — you either join them into separate CSVs or serialize the nested structure into a string.
Encoding ambiguity: UTF-8? Latin-1? Windows-1252? CSVs don’t declare their encoding.
Compression is external: CSV can be gzip-compressed (.csv.gz) but the compression is not aware of the data structure.

CSV in Python

Python

import pandas as pd

# Read
df = pd.read_csv("data/customers.csv")

# Read with explicit types and options
df = pd.read_csv(
    "data/customers.csv",
    dtype={"customer_id": str, "zip_code": str},
    parse_dates=["signup_date"],
    na_values=["N/A", "-", "null"],
    encoding="utf-8"
)

# Write
df.to_csv("output/customers.csv", index=False, encoding="utf-8")

# Write compressed
df.to_csv("output/customers.csv.gz", index=False, compression="gzip")

import pandas as pd

# Read
df = pd.read_csv("data/customers.csv")

# Read with explicit types and options
df = pd.read_csv(
    "data/customers.csv",
    dtype={"customer_id": str, "zip_code": str},
    parse_dates=["signup_date"],
    na_values=["N/A", "-", "null"],
    encoding="utf-8"
)

# Write
df.to_csv("output/customers.csv", index=False, encoding="utf-8")

# Write compressed
df.to_csv("output/customers.csv.gz", index=False, compression="gzip")

When to Use CSV

Use CSV when: the data is small, you need maximum compatibility, you’re sharing data with non-technical users or tools that can’t read other formats, or the data has a simple flat structure. Avoid CSV for: large datasets (> a few hundred MB), performance-sensitive pipelines, data with complex types or nested structures, or anything you’ll read many times.

JSON: Flexibility and Web-Native Data

JSON (JavaScript Object Notation) stores data as nested key-value structures:

JSON

[
  {
    "customer_id": "CUST_001",
    "name": "Jane Smith",
    "is_premium": true,
    "total_spend": 384.96,
    "signup_date": "2022-03-14",
    "orders": [
      {"order_id": "ORD_1001", "amount": 199.98},
      {"order_id": "ORD_1003", "amount": 149.99}
    ]
  }
]

[
  {
    "customer_id": "CUST_001",
    "name": "Jane Smith",
    "is_premium": true,
    "total_spend": 384.96,
    "signup_date": "2022-03-14",
    "orders": [
      {"order_id": "ORD_1001", "amount": 199.98},
      {"order_id": "ORD_1003", "amount": 149.99}
    ]
  }
]

What Makes JSON Good

Native nested/hierarchical data: Represents the actual structure of data — a customer with a list of orders, an order with a list of items — without artificial flattening
Web-native: The universal format for APIs; every REST API returns JSON
Flexible schema: Each record can have different fields; missing fields are simply absent
Readable types: true/false for booleans, numbers without quotes, null for missing — better than CSV’s everything-is-a-string

What Makes JSON Poor

Verbose with repetition: Every record repeats all the field names. A 1-million-row dataset with 10 fields repeats every field name 1 million times — potentially more bytes than the actual data.
Row-oriented: Like CSV, reading one field from a large JSON array requires parsing every record.
No date type: Dates are strings — ISO format by convention but not enforced.
Large file size: Generally 2-5× larger than equivalent Parquet for tabular data.
Slow parsing at scale: Parsing JSON at 100 million records is slow compared to binary formats.

JSON Lines (JSONL) for Large Data

For large datasets, JSONL (one JSON object per line) is better than a single JSON array — it’s appendable, streamable, and one corrupt line doesn’t break the whole file:

JSON

{"customer_id": "CUST_001", "name": "Jane Smith", "spend": 384.96}
{"customer_id": "CUST_002", "name": "Bob Johnson", "spend": 139.98}
{"customer_id": "CUST_003", "name": "Alice Williams", "spend": 269.97}

{"customer_id": "CUST_001", "name": "Jane Smith", "spend": 384.96}
{"customer_id": "CUST_002", "name": "Bob Johnson", "spend": 139.98}
{"customer_id": "CUST_003", "name": "Alice Williams", "spend": 269.97}

Python

import pandas as pd

# Read JSON
df = pd.read_json("data/customers.json")
df = pd.read_json("data/events.jsonl", lines=True)  # JSONL

# Write JSON
df.to_json("output/customers.json", orient="records", indent=2)
df.to_json("output/events.jsonl", orient="records", lines=True)

import pandas as pd

# Read JSON
df = pd.read_json("data/customers.json")
df = pd.read_json("data/events.jsonl", lines=True)  # JSONL

# Write JSON
df.to_json("output/customers.json", orient="records", indent=2)
df.to_json("output/events.jsonl", orient="records", lines=True)

When to Use JSON

Use JSON when: the data is hierarchical or nested, the data comes from or goes to a web API, you need schema flexibility (different records with different fields), or the data is small to medium sized. Use JSONL for streaming/event data. Avoid JSON for: large flat tabular datasets where Parquet is dramatically better.

Parquet: The Modern Standard for Data Science

Apache Parquet is a columnar binary format designed for analytical workloads. It was developed by Twitter and Cloudera and has become the standard format for data lakes, cloud data warehouses (BigQuery, Redshift, Snowflake use it internally), and data science pipelines.

How Parquet Works

Parquet stores data in three tiers:

File: Contains one or more row groups
Row group: A horizontal partition of the data (e.g., 128MB chunks)
Column chunk: Within each row group, each column’s data is stored together

Within each column chunk, data is:

Dictionary-encoded: Repeated values (like country codes, category names) are stored as integers pointing to a lookup table
Compressed: Using Snappy (fast), Gzip (small), or Zstandard (balanced) compression
Type-preserved: Each column has an explicit data type — int32, float64, string, date, timestamp, etc.

The file footer stores metadata about every column: min/max values, null counts, and byte offsets. This enables predicate pushdown — when you filter WHERE country = 'USA', Parquet skips entire row groups where the min/max metadata proves there are no USA records, without even reading that data.

Parquet Performance in Practice

The numbers are striking:

Operation	CSV	Parquet	Speedup
File size (1M rows, 20 cols)	~200 MB	~25 MB	8× smaller
Read all columns	12 seconds	1.8 seconds	7× faster
Read 3 of 20 columns	12 seconds	0.4 seconds	30× faster
Filter to 5% of rows	12 seconds	0.3 seconds	40× faster
Write	8 seconds	2 seconds	4× faster

(Approximate figures; actual results vary by data characteristics and hardware.)

Parquet in Python

Python

import pandas as pd

# Read Parquet (requires pyarrow or fastparquet)
df = pd.read_parquet("data/customers.parquet")

# Read only specific columns (key Parquet advantage)
df = pd.read_parquet("data/customers.parquet",
                     columns=["customer_id", "total_spend", "signup_date"])

# Read with filters (predicate pushdown — skips irrelevant row groups)
df = pd.read_parquet(
    "data/transactions.parquet",
    filters=[
        ("transaction_date", ">=", "2024-01-01"),
        ("country", "==", "USA"),
        ("amount", ">", 100.0)
    ]
)

# Write Parquet
df.to_parquet("output/customers.parquet", index=False)

# Write with explicit compression (default is snappy)
df.to_parquet("output/customers.parquet",
              engine="pyarrow",
              compression="gzip",      # "snappy", "gzip", "brotli", "zstd", "none"
              index=False)

# Partition data by a column (creates a directory structure)
# data/transactions/year=2024/month=9/part-0.parquet
df.to_parquet("data/transactions/",
              partition_cols=["year", "month"],
              engine="pyarrow")

# Read a partitioned dataset
df = pd.read_parquet("data/transactions/")   # Reads all partitions
df = pd.read_parquet("data/transactions/",
                     filters=[("year", "==", 2024), ("month", "==", 9)])

import pandas as pd

# Read Parquet (requires pyarrow or fastparquet)
df = pd.read_parquet("data/customers.parquet")

# Read only specific columns (key Parquet advantage)
df = pd.read_parquet("data/customers.parquet",
                     columns=["customer_id", "total_spend", "signup_date"])

# Read with filters (predicate pushdown — skips irrelevant row groups)
df = pd.read_parquet(
    "data/transactions.parquet",
    filters=[
        ("transaction_date", ">=", "2024-01-01"),
        ("country", "==", "USA"),
        ("amount", ">", 100.0)
    ]
)

# Write Parquet
df.to_parquet("output/customers.parquet", index=False)

# Write with explicit compression (default is snappy)
df.to_parquet("output/customers.parquet",
              engine="pyarrow",
              compression="gzip",      # "snappy", "gzip", "brotli", "zstd", "none"
              index=False)

# Partition data by a column (creates a directory structure)
# data/transactions/year=2024/month=9/part-0.parquet
df.to_parquet("data/transactions/",
              partition_cols=["year", "month"],
              engine="pyarrow")

# Read a partitioned dataset
df = pd.read_parquet("data/transactions/")   # Reads all partitions
df = pd.read_parquet("data/transactions/",
                     filters=[("year", "==", 2024), ("month", "==", 9)])

Parquet with pyarrow Directly

Python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Read a Parquet file's sc.hema without loading data
schema = pq.read_schema("data/customers.parquet")
print(schema)
# customer_id: string
# name: string
# city: string
# total_spend: double
# signup_date: date32[day]
# is_premium: bool

# Read metadata
metadata = pq.read_metadata("data/customers.parquet")
print(f"Row groups: {metadata.num_row_groups}")
print(f"Total rows: {metadata.num_rows}")
print(f"Serialized size: {metadata.serialized_size:,} bytes")

# Read specific row group
table = pq.read_table("data/customers.parquet",
                       columns=["customer_id", "total_spend"])
df = table.to_pandas()

# Write with explicit sc.hema
schema = pa.sc.hema([
    pa.field("customer_id",  pa.string()),
    pa.field("name",         pa.string()),
    pa.field("total_spend",  pa.float64()),
    pa.field("signup_date",  pa.date32()),
    pa.field("is_premium",   pa.bool_())
])

table = pa.Table.from_pandas(df, sc.hema=sc.hema)
pq.write_table(table, "output/customers_typed.parquet",
               compression="snappy")

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Read a Parquet file's sc.hema without loading data
schema = pq.read_schema("data/customers.parquet")
print(schema)
# customer_id: string
# name: string
# city: string
# total_spend: double
# signup_date: date32[day]
# is_premium: bool

# Read metadata
metadata = pq.read_metadata("data/customers.parquet")
print(f"Row groups: {metadata.num_row_groups}")
print(f"Total rows: {metadata.num_rows}")
print(f"Serialized size: {metadata.serialized_size:,} bytes")

# Read specific row group
table = pq.read_table("data/customers.parquet",
                       columns=["customer_id", "total_spend"])
df = table.to_pandas()

# Write with explicit sc.hema
schema = pa.sc.hema([
    pa.field("customer_id",  pa.string()),
    pa.field("name",         pa.string()),
    pa.field("total_spend",  pa.float64()),
    pa.field("signup_date",  pa.date32()),
    pa.field("is_premium",   pa.bool_())
])

table = pa.Table.from_pandas(df, sc.hema=sc.hema)
pq.write_table(table, "output/customers_typed.parquet",
               compression="snappy")

When to Use Parquet

Use Parquet when: the data is large (> a few hundred MB), you’ll run analytical queries (groupby, filter, aggregation), you need to preserve exact data types, you’re building a data pipeline or data lake, or the data will be read many times (the write cost is paid once, read benefits are repeated). Parquet is essentially the default format for all large tabular data science work today.

Avro: Row-Oriented Binary with Schema Evolution

Apache Avro is a row-oriented binary serialization format developed for Apache Hadoop. Unlike Parquet’s analytical focus, Avro’s design priorities are: compact row serialization, schema evolution (reading data written with an old schema using a new schema), and streaming compatibility.

Avro’s Key Characteristics

Schema included in file: The JSON schema is embedded in the file header — every Avro file is self-describing
Schema evolution: Fields can be added or removed between schema versions; old data can be read with new schema using default values
Row-oriented: Optimized for writing individual records (ideal for streaming/Kafka)
Compact binary: Efficient encoding, especially for records with many null fields

Avro is the standard format for Apache Kafka messages and streaming data pipelines. When data flows through Kafka, each message is typically Avro-encoded with the schema stored in a Schema Registry.

Python

# pip install fastavro
import fastavro
import io

schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "customer_id",  "type": "string"},
        {"name": "name",         "type": "string"},
        {"name": "total_spend",  "type": "double"},
        {"name": "is_premium",   "type": "boolean"},
        {"name": "signup_date",  "type": ["null", "string"], "default": None}
    ]
}

records = [
    {"customer_id": "CUST_001", "name": "Jane Smith",
     "total_spend": 384.96, "is_premium": True, "signup_date": "2022-03-14"},
    {"customer_id": "CUST_002", "name": "Bob Johnson",
     "total_spend": 139.98, "is_premium": False, "signup_date": None}
]

# Write Avro
with open("output/customers.avro", "wb") as f:
    fastavro.writer(f, fastavro.parse_schema(schema), records)

# Read Avro into list of dicts
with open("output/customers.avro", "rb") as f:
    records_back = list(fastavro.reader(f))

import pandas as pd
df = pd.DataFrame(records_back)

# pip install fastavro
import fastavro
import io

schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "customer_id",  "type": "string"},
        {"name": "name",         "type": "string"},
        {"name": "total_spend",  "type": "double"},
        {"name": "is_premium",   "type": "boolean"},
        {"name": "signup_date",  "type": ["null", "string"], "default": None}
    ]
}

records = [
    {"customer_id": "CUST_001", "name": "Jane Smith",
     "total_spend": 384.96, "is_premium": True, "signup_date": "2022-03-14"},
    {"customer_id": "CUST_002", "name": "Bob Johnson",
     "total_spend": 139.98, "is_premium": False, "signup_date": None}
]

# Write Avro
with open("output/customers.avro", "wb") as f:
    fastavro.writer(f, fastavro.parse_schema(schema), records)

# Read Avro into list of dicts
with open("output/customers.avro", "rb") as f:
    records_back = list(fastavro.reader(f))

import pandas as pd
df = pd.DataFrame(records_back)

When to Use Avro

Use Avro when: data flows through Kafka or other streaming systems, you need schema evolution (the schema will change over time), you’re building write-heavy pipelines where row-at-a-time appending is needed, or you need the schema embedded in the data itself for long-term archival. Avro is not ideal for analytical queries — use Parquet for that.

ORC: Columnar Binary for Hive and Spark

ORC (Optimized Row Columnar) is a columnar binary format developed by Hortonworks for Apache Hive, the SQL-on-Hadoop query engine. ORC and Parquet fill similar niches — columnar, binary, compressed, type-aware — but ORC has stronger integration with Hive and ORC’s bloom filters and lighter-weight compression can be advantageous in Spark workloads.

Python

import pandas as pd

# Requires pyarrow
df = pd.read_orc("data/transactions.orc")
df.to_orc("output/transactions.orc")

import pandas as pd

# Requires pyarrow
df = pd.read_orc("data/transactions.orc")
df.to_orc("output/transactions.orc")

When to use ORC: Primarily in Hive-based data warehouses and Spark SQL workloads where your team or infrastructure has already standardized on ORC. For new projects, Parquet has broader ecosystem support and is equally performant.

Feather: Ultra-Fast Local Data Exchange

Feather (also called Arrow IPC format) is a binary columnar format based on Apache Arrow’s in-memory data representation. Its design goal is a single thing: maximum I/O speed for reading and writing between Python, R, and other Arrow-compatible tools.

Python

import pandas as pd

# Read (requires pyarrow)
df = pd.read_feather("data/features.feather")

# Write
df.to_feather("output/features.feather")

import pandas as pd

# Read (requires pyarrow)
df = pd.read_feather("data/features.feather")

# Write
df.to_feather("output/features.feather")

Feather trades compression for speed. It writes data with minimal transformation from Arrow’s in-memory format — essentially writing the memory layout directly to disk. This makes Feather reads and writes faster than any other format, often several times faster than Parquet.

Format	Write Speed	Read Speed	File Size
CSV	Slow	Slow	Large
Parquet (snappy)	Fast	Fast	Small
Feather	Fastest	Fastest	Medium (uncompressed)
Pickle	Fast	Fast	Medium

When to use Feather: For saving intermediate results during iterative development — saving the output of an expensive feature engineering step so you can quickly reload it without re-running all the computation. Not recommended for long-term storage or data sharing (it’s a transient format).

HDF5: Scientific and Hierarchical Data

HDF5 (Hierarchical Data Format, version 5) is a binary format from the scientific computing world, capable of storing multiple datasets, arrays of any shape and type, and rich metadata in a single file. Think of it as a filesystem inside a file — directories (groups) containing datasets (arrays or tables).

Python

import pandas as pd

# Write multiple DataFrames to one HDF5 file
with pd.HDFStore("data/project_data.h5", mode="w") as store:
    store["customers"]    = df_customers
    store["orders"]       = df_orders
    store["features"]     = df_features

# Read back
with pd.HDFStore("data/project_data.h5", mode="r") as store:
    print(store.keys())               # ['/customers', '/orders', '/features']
    df = store["customers"]
    df_filtered = store.select("orders", where="status == 'delivered'")

# Direct read/write
df.to_hdf("data/customers.h5", key="customers", mode="w")
df = pd.read_hdf("data/customers.h5", key="customers")

# With PyTables for efficient querying
df.to_hdf("data/transactions.h5", key="transactions",
          mode="w", format="table",   # "table" enables querying; "fixed" is faster but no queries
          data_columns=True)

# Query without loading everything
df_large = pd.read_hdf(
    "data/transactions.h5", key="transactions",
    where="amount > 100 and country == 'USA'"
)

import pandas as pd

# Write multiple DataFrames to one HDF5 file
with pd.HDFStore("data/project_data.h5", mode="w") as store:
    store["customers"]    = df_customers
    store["orders"]       = df_orders
    store["features"]     = df_features

# Read back
with pd.HDFStore("data/project_data.h5", mode="r") as store:
    print(store.keys())               # ['/customers', '/orders', '/features']
    df = store["customers"]
    df_filtered = store.select("orders", where="status == 'delivered'")

# Direct read/write
df.to_hdf("data/customers.h5", key="customers", mode="w")
df = pd.read_hdf("data/customers.h5", key="customers")

# With PyTables for efficient querying
df.to_hdf("data/transactions.h5", key="transactions",
          mode="w", format="table",   # "table" enables querying; "fixed" is faster but no queries
          data_columns=True)

# Query without loading everything
df_large = pd.read_hdf(
    "data/transactions.h5", key="transactions",
    where="amount > 100 and country == 'USA'"
)

HDF5 is the standard format for deep learning dataset storage (NumPy arrays, image tensors) and scientific simulation output. Its ability to store hierarchical arrays of arbitrary shape makes it ideal for scenarios where data isn’t naturally tabular.

When to use HDF5: Multi-dimensional arrays (images, audio, time series with multiple sensors), deep learning datasets, scientific computing, storing multiple related datasets in one file. For simple 2D tabular data, Parquet is simpler and faster.

Pickle: Python-Specific Object Serialization

Pickle is Python’s built-in object serialization format. It can serialize virtually any Python object — not just DataFrames, but trained scikit-learn models, custom classes, nested data structures, anything Python can represent.

Python

import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Save a trained model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

with open("models/rf_churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load the model back
with open("models/rf_churn_model.pkl", "rb") as f:
    model_loaded = pickle.load(f)

predictions = model_loaded.predict(X_test)

# Pickle DataFrames (works but Parquet/Feather are usually better)
df.to_pickle("data/features.pkl")
df_loaded = pd.read_pickle("data/features.pkl")

# joblib is better than pickle for large numpy arrays (models)
import joblib

joblib.dump(model, "models/rf_churn_model.joblib")
model_loaded = joblib.load("models/rf_churn_model.joblib")

import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Save a trained model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

with open("models/rf_churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load the model back
with open("models/rf_churn_model.pkl", "rb") as f:
    model_loaded = pickle.load(f)

predictions = model_loaded.predict(X_test)

# Pickle DataFrames (works but Parquet/Feather are usually better)
df.to_pickle("data/features.pkl")
df_loaded = pd.read_pickle("data/features.pkl")

# joblib is better than pickle for large numpy arrays (models)
import joblib

joblib.dump(model, "models/rf_churn_model.joblib")
model_loaded = joblib.load("models/rf_churn_model.joblib")

Important Security Warning

Never unpickle data from untrusted sources. Pickle can execute arbitrary Python code during deserialization — a malicious .pkl file can compromise your system. Only pickle/unpickle objects from sources you control.

When to use Pickle: Saving trained ML models for deployment (alongside documentation of the model’s expected input schema), caching intermediate Python objects during development. Never for long-term data storage, cross-language compatibility, or data received from external sources.

Format Comparison: A Comprehensive Reference

Format	Type	Structure	Size	Read Speed	Write Speed	Type Safety	Nested Data	Best For
CSV	Text	Row	Large	Slow	Slow	None	No	Small data exchange
JSON	Text	Row	Large	Slow	Slow	Partial	Yes	APIs, nested data
JSONL	Text	Row	Large	Slow	Fast	Partial	Yes	Streaming, event logs
Parquet	Binary	Column	Small	Fast	Fast	Full	Limited	Large analytical data
Avro	Binary	Row	Small	Medium	Fast	Full	Yes	Kafka, streaming
ORC	Binary	Column	Small	Fast	Fast	Full	Limited	Hive/Spark
Feather	Binary	Column	Medium	Fastest	Fastest	Full	No	Dev/intermediate
HDF5	Binary	Hierarchical	Small	Fast	Fast	Full	Yes	Arrays, science
Pickle	Binary	Object	Medium	Fast	Fast	Full	Yes	ML models, Python objs

Choosing the Right Format: A Decision Guide

The right format depends on five factors: data size, usage frequency, structural complexity, sharing requirements, and performance needs.

By Data Size

Small data (< 10 MB): CSV or JSON — simplicity and compatibility outweigh performance. The difference between formats at this scale is milliseconds.

Medium data (10 MB – 1 GB): Parquet is the clear winner — meaningful speed and size improvements over CSV, and all tools support it. HDF5 if the data is multi-dimensional.

Large data (> 1 GB): Parquet is essential. With partitioning and predicate pushdown, analytical queries on 100 GB Parquet datasets complete in seconds where equivalent CSV queries might take hours.

By Use Case

Plaintext

Sharing data with non-technical stakeholders
    → CSV (they can open it in Excel)

Web API response or data from an API
    → JSON (web-native, handles nested structure)

Streaming/Kafka pipeline
    → Avro (schema evolution, row-at-a-time writing)

Large analytical dataset in a data pipeline
    → Parquet (columnar, compressed, fast queries)

Intermediate results during ML development
    → Feather (fastest reload) or Parquet (if you'll share it)

Trained ML model artifact
    → Pickle or joblib (Python-specific) or ONNX (cross-platform)

Multi-dimensional arrays (images, tensors, time series)
    → HDF5 or NumPy .npy / .npz

Long-term archival of production data
    → Parquet (well-supported, open standard, compressed)

Sharing data with non-technical stakeholders
    → CSV (they can open it in Excel)

Web API response or data from an API
    → JSON (web-native, handles nested structure)

Streaming/Kafka pipeline
    → Avro (schema evolution, row-at-a-time writing)

Large analytical dataset in a data pipeline
    → Parquet (columnar, compressed, fast queries)

Intermediate results during ML development
    → Feather (fastest reload) or Parquet (if you'll share it)

Trained ML model artifact
    → Pickle or joblib (Python-specific) or ONNX (cross-platform)

Multi-dimensional arrays (images, tensors, time series)
    → HDF5 or NumPy .npy / .npz

Long-term archival of production data
    → Parquet (well-supported, open standard, compressed)

A Practical Decision Tree

Plaintext

Does the data have nested/hierarchical structure?
├── Yes → JSON (small/medium) or Avro (streaming)
└── No  → Is it going to a non-technical user or legacy system?
          ├── Yes → CSV
          └── No  → Is it large (> 100MB) or queried repeatedly?
                    ├── Yes → Parquet
                    └── No  → Feather (dev) or Parquet (production)

Does the data have nested/hierarchical structure?
├── Yes → JSON (small/medium) or Avro (streaming)
└── No  → Is it going to a non-technical user or legacy system?
          ├── Yes → CSV
          └── No  → Is it large (> 100MB) or queried repeatedly?
                    ├── Yes → Parquet
                    └── No  → Feather (dev) or Parquet (production)

Real-World Format Benchmarks

Here’s a concrete comparison on a representative dataset — 5 million rows, 15 columns (mix of strings, integers, floats, dates):

Python

import pandas as pd
import time
import os

# Generate synthetic dataset
n = 5_000_000
df = pd.DataFrame({
    "customer_id":  [f"CUST_{i:07d}" for i in range(n)],
    "transaction_date": pd.date_range("2020-01-01", periods=n, freq="1min"),
    "amount":       pd.Series(range(n), dtype="float64") * 0.01 + 10.0,
    "country":      pd.Categorical(["USA"] * (n//2) + ["UK"] * (n//4) + ["CA"] * (n//4)),
    "status":       pd.Categorical(["completed"] * int(n*0.8) + ["pending"] * int(n*0.2)),
    "product_id":   [f"PROD_{i % 1000:04d}" for i in range(n)],
    "quantity":     pd.Series(range(1, n+1), dtype="int32") % 10 + 1,
})

def benchmark_format(df, write_path, write_fn, read_fn, format_name):
    # Write
    t0 = time.time()
    write_fn(df, write_path)
    write_time = time.time() - t0

    # File size
    size_mb = os.path.getsize(write_path) / (1024 ** 2)

    # Read
    t0 = time.time()
    df_back = read_fn(write_path)
    read_time = time.time() - t0

    print(f"{format_name:12s} | Size: {size_mb:6.1f} MB | "
          f"Write: {write_time:.2f}s | Read: {read_time:.2f}s")
    return size_mb, write_time, read_time

print(f"{'Format':12s} | {'Size':^12s} | {'Write':^10s} | {'Read':^10s}")
print("-" * 55)

benchmark_format(df, "/tmp/test.csv",
    lambda d, p: d.to_csv(p, index=False),
    lambda p: pd.read_csv(p),
    "CSV")

benchmark_format(df, "/tmp/test.parquet",
    lambda d, p: d.to_parquet(p, index=False),
    lambda p: pd.read_parquet(p),
    "Parquet(snappy)")

benchmark_format(df, "/tmp/test.parquet.gz",
    lambda d, p: d.to_parquet(p, index=False, compression="gzip"),
    lambda p: pd.read_parquet(p),
    "Parquet(gzip)")

benchmark_format(df, "/tmp/test.feather",
    lambda d, p: d.to_feather(p),
    lambda p: pd.read_feather(p),
    "Feather")

import pandas as pd
import time
import os

# Generate synthetic dataset
n = 5_000_000
df = pd.DataFrame({
    "customer_id":  [f"CUST_{i:07d}" for i in range(n)],
    "transaction_date": pd.date_range("2020-01-01", periods=n, freq="1min"),
    "amount":       pd.Series(range(n), dtype="float64") * 0.01 + 10.0,
    "country":      pd.Categorical(["USA"] * (n//2) + ["UK"] * (n//4) + ["CA"] * (n//4)),
    "status":       pd.Categorical(["completed"] * int(n*0.8) + ["pending"] * int(n*0.2)),
    "product_id":   [f"PROD_{i % 1000:04d}" for i in range(n)],
    "quantity":     pd.Series(range(1, n+1), dtype="int32") % 10 + 1,
})

def benchmark_format(df, write_path, write_fn, read_fn, format_name):
    # Write
    t0 = time.time()
    write_fn(df, write_path)
    write_time = time.time() - t0

    # File size
    size_mb = os.path.getsize(write_path) / (1024 ** 2)

    # Read
    t0 = time.time()
    df_back = read_fn(write_path)
    read_time = time.time() - t0

    print(f"{format_name:12s} | Size: {size_mb:6.1f} MB | "
          f"Write: {write_time:.2f}s | Read: {read_time:.2f}s")
    return size_mb, write_time, read_time

print(f"{'Format':12s} | {'Size':^12s} | {'Write':^10s} | {'Read':^10s}")
print("-" * 55)

benchmark_format(df, "/tmp/test.csv",
    lambda d, p: d.to_csv(p, index=False),
    lambda p: pd.read_csv(p),
    "CSV")

benchmark_format(df, "/tmp/test.parquet",
    lambda d, p: d.to_parquet(p, index=False),
    lambda p: pd.read_parquet(p),
    "Parquet(snappy)")

benchmark_format(df, "/tmp/test.parquet.gz",
    lambda d, p: d.to_parquet(p, index=False, compression="gzip"),
    lambda p: pd.read_parquet(p),
    "Parquet(gzip)")

benchmark_format(df, "/tmp/test.feather",
    lambda d, p: d.to_feather(p),
    lambda p: pd.read_feather(p),
    "Feather")

Typical results (approximate):

Format	Size	Write	Read	Notes
CSV	420 MB	38s	25s	Baseline
Parquet (snappy)	48 MB	4.2s	1.8s	9× smaller, 14× faster read
Parquet (gzip)	28 MB	7.1s	2.2s	15× smaller, 11× faster read
Feather	95 MB	1.1s	0.9s	Fastest I/O, larger than Parquet

Converting Between Formats

Python

import pandas as pd

# CSV → Parquet (the most common conversion)
def csv_to_parquet(csv_path: str, parquet_path: str,
                   dtype: dict = None, parse_dates: list = None):
    """Convert a CSV file to Parquet with proper type handling."""
    df = pd.read_csv(csv_path, dtype=dtype, parse_dates=parse_dates)
    df.to_parquet(parquet_path, index=False, compression="snappy")
    csv_mb     = os.path.getsize(csv_path) / (1024**2)
    parquet_mb = os.path.getsize(parquet_path) / (1024**2)
    print(f"Converted: {csv_mb:.1f}MB CSV → {parquet_mb:.1f}MB Parquet "
          f"({csv_mb/parquet_mb:.1f}× compression)")

# JSON/JSONL → Parquet
def jsonl_to_parquet(jsonl_path: str, parquet_path: str):
    df = pd.read_json(jsonl_path, lines=True)
    df.to_parquet(parquet_path, index=False)

# Parquet → CSV (for sharing with stakeholders)
def parquet_to_csv(parquet_path: str, csv_path: str,
                   columns: list = None):
    df = pd.read_parquet(parquet_path, columns=columns)
    df.to_csv(csv_path, index=False)

# Multiple CSVs → Single Parquet (common data lake ingestion pattern)
def combine_csvs_to_parquet(csv_dir: str, parquet_path: str,
                             pattern: str = "*.csv"):
    from pathlib import Path
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    total_rows = 0

    for csv_file in sorted(Path(csv_dir).glob(pattern)):
        df = pd.read_csv(csv_file)
        table = pa.Table.from_pandas(df, preserve_index=False)

        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)

        writer.write_table(table)
        total_rows += len(df)
        print(f"  Added {csv_file.name}: {len(df):,} rows")

    if writer:
        writer.close()

    print(f"Combined {total_rows:,} total rows → {parquet_path}")

combine_csvs_to_parquet("data/raw/monthly_exports/", "data/processed/all_transactions.parquet")

import pandas as pd

# CSV → Parquet (the most common conversion)
def csv_to_parquet(csv_path: str, parquet_path: str,
                   dtype: dict = None, parse_dates: list = None):
    """Convert a CSV file to Parquet with proper type handling."""
    df = pd.read_csv(csv_path, dtype=dtype, parse_dates=parse_dates)
    df.to_parquet(parquet_path, index=False, compression="snappy")
    csv_mb     = os.path.getsize(csv_path) / (1024**2)
    parquet_mb = os.path.getsize(parquet_path) / (1024**2)
    print(f"Converted: {csv_mb:.1f}MB CSV → {parquet_mb:.1f}MB Parquet "
          f"({csv_mb/parquet_mb:.1f}× compression)")

# JSON/JSONL → Parquet
def jsonl_to_parquet(jsonl_path: str, parquet_path: str):
    df = pd.read_json(jsonl_path, lines=True)
    df.to_parquet(parquet_path, index=False)

# Parquet → CSV (for sharing with stakeholders)
def parquet_to_csv(parquet_path: str, csv_path: str,
                   columns: list = None):
    df = pd.read_parquet(parquet_path, columns=columns)
    df.to_csv(csv_path, index=False)

# Multiple CSVs → Single Parquet (common data lake ingestion pattern)
def combine_csvs_to_parquet(csv_dir: str, parquet_path: str,
                             pattern: str = "*.csv"):
    from pathlib import Path
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    total_rows = 0

    for csv_file in sorted(Path(csv_dir).glob(pattern)):
        df = pd.read_csv(csv_file)
        table = pa.Table.from_pandas(df, preserve_index=False)

        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)

        writer.write_table(table)
        total_rows += len(df)
        print(f"  Added {csv_file.name}: {len(df):,} rows")

    if writer:
        writer.close()

    print(f"Combined {total_rows:,} total rows → {parquet_path}")

combine_csvs_to_parquet("data/raw/monthly_exports/", "data/processed/all_transactions.parquet")

Format Best Practices for Data Science Projects

Standardize on Parquet for All Large Tabular Data

Plaintext

data/
├── raw/                    ← Original data as received (preserve original format)
│   ├── customers_2024.csv  ← Keep original for audit trail
│   └── orders_api.jsonl    ← Keep original for audit trail
├── processed/              ← Converted and cleaned (always Parquet)
│   ├── customers.parquet
│   └── orders.parquet
└── features/               ← Feature engineered data (Parquet or Feather)
    └── training_features.parquet

data/
├── raw/                    ← Original data as received (preserve original format)
│   ├── customers_2024.csv  ← Keep original for audit trail
│   └── orders_api.jsonl    ← Keep original for audit trail
├── processed/              ← Converted and cleaned (always Parquet)
│   ├── customers.parquet
│   └── orders.parquet
└── features/               ← Feature engineered data (Parquet or Feather)
    └── training_features.parquet

Always Preserve the Original

Never overwrite raw data. Convert it to Parquet for performance, but keep the original for debugging and auditing.

Write Schema Documentation

Parquet preserves types but not business semantics. Document what each column means:

Python

# Write schema as a sidecar JSON file
import json

schema_doc = {
    "description": "Customer features for churn model",
    "created": "2024-09-15",
    "source": "analytics.customers + orders tables",
    "columns": {
        "customer_id":    {"type": "string", "description": "Unique customer identifier"},
        "recency_days":   {"type": "int32",  "description": "Days since last order"},
        "lifetime_value": {"type": "float64","description": "Total spend in USD, all time"},
        "is_premium":     {"type": "bool",   "description": "Premium tier membership flag"}
    }
}

with open("data/features/training_features_schema.json", "w") as f:
    json.dump(schema_doc, f, indent=2)

# Write schema as a sidecar JSON file
import json

schema_doc = {
    "description": "Customer features for churn model",
    "created": "2024-09-15",
    "source": "analytics.customers + orders tables",
    "columns": {
        "customer_id":    {"type": "string", "description": "Unique customer identifier"},
        "recency_days":   {"type": "int32",  "description": "Days since last order"},
        "lifetime_value": {"type": "float64","description": "Total spend in USD, all time"},
        "is_premium":     {"type": "bool",   "description": "Premium tier membership flag"}
    }
}

with open("data/features/training_features_schema.json", "w") as f:
    json.dump(schema_doc, f, indent=2)

Use Compression Strategically

Snappy (default for Parquet): Fastest decompression, good compression ratio — best for frequently read data
Gzip: Best compression ratio, slower decompression — best for archival or infrequently read data
Zstandard (zstd): Best balance of compression and speed — best overall for production pipelines

Summary

Data formats are not an afterthought — they are a core technical decision that affects every part of a data science workflow. The choice between CSV and Parquet for a 10-million-row dataset can mean the difference between a pipeline that takes 5 minutes and one that takes 5 seconds. The choice between JSON and Avro for a Kafka stream determines whether your pipeline handles schema changes gracefully or breaks.

The practical hierarchy for data scientists is: receive data in whatever format it arrives (preserve the original), convert to Parquet as the canonical format for all large tabular data, use Feather for fast local intermediate results, use JSON for API data and nested structures, use Pickle or joblib for ML model artifacts, and use HDF5 for multi-dimensional arrays.

The most impactful format decision most data scientists can make right now is simply replacing CSV with Parquet as their default format for large datasets. The combination of smaller file size, faster reads, column pruning, and type preservation makes it strictly better for large analytical data — with no real downsides except needing Python to read it rather than a text editor.

Key Takeaways

Row-oriented formats (CSV, JSON, Avro) store all fields of a record together — efficient for writing one record at a time and reading complete records, but slow for analytical queries that read a few columns from many rows
Column-oriented formats (Parquet, ORC, Feather) store all values of one column together — dramatically faster for analytical queries, better compression, and able to skip irrelevant columns entirely
CSV is universal and human-readable but has no type information, is slow for large data, and wastes space — use it only for small data exchange with non-technical users or legacy systems
Parquet is the default choice for large tabular data in data science: 5–15× smaller than CSV, 10–40× faster for analytical queries, preserves exact data types, and is supported by every major data tool
JSON excels at hierarchical/nested data and web APIs; JSONL (one object per line) is better than a large JSON array for large datasets because it’s appendable and streamable
Avro is the standard for Kafka and streaming pipelines — row-oriented binary with schema evolution capability; Feather is the fastest format for local intermediate results but not intended for long-term storage
Parquet compression: Snappy (fast decompression, good compression — default), Gzip (best compression, slower), Zstandard (best balance for production) — always compress Parquet files
The practical workflow: preserve original data in whatever format it arrives, convert to Parquet for all large processing, convert back to CSV only when sharing with non-technical recipients

0 Comments

Discover More

Click For More

Understanding Data Formats: CSV, JSON, Parquet, and More

Introduction

Two Fundamental Distinctions: Text vs. Binary and Row vs. Column

Text Formats vs. Binary Formats

Row-Oriented vs. Column-Oriented Storage

CSV: The Universal Standard

What Makes CSV Good

What Makes CSV Poor

CSV in Python

When to Use CSV

JSON: Flexibility and Web-Native Data

What Makes JSON Good

What Makes JSON Poor

JSON Lines (JSONL) for Large Data

When to Use JSON

Parquet: The Modern Standard for Data Science

How Parquet Works

Parquet Performance in Practice

Parquet in Python

Parquet with pyarrow Directly

When to Use Parquet

Avro: Row-Oriented Binary with Schema Evolution

Avro’s Key Characteristics

When to Use Avro

ORC: Columnar Binary for Hive and Spark

Feather: Ultra-Fast Local Data Exchange

HDF5: Scientific and Hierarchical Data

Pickle: Python-Specific Object Serialization

Important Security Warning

Format Comparison: A Comprehensive Reference

Choosing the Right Format: A Decision Guide

By Data Size

By Use Case

A Practical Decision Tree

Real-World Format Benchmarks

Converting Between Formats

Format Best Practices for Data Science Projects

Standardize on Parquet for All Large Tabular Data

Always Preserve the Original

Write Schema Documentation

Use Compression Strategically

Summary

Key Takeaways

Discover More

Why Machine Learning?

What is Data Science? A Complete Beginner’s Guide

ETL vs ELT: Understanding Data Pipelines

Exploring GPIO Pins on Raspberry Pi: Basics of General-Purpose Input/Output

Working with Date and Time in Python

10 Essential Jupyter Notebook Tips for Beginners