std::atomic<T> in C++ is a template that wraps a value and guarantees that all operations on it — reads, writes, increments, and compare-exchanges — are indivisible (atomic). No other thread can observe the value in a partially-updated state, and no separate mutex is needed. For simple types like int, bool, and pointers, atomic operations compile down to single machine instructions (lock-prefixed on x86), making them significantly faster than mutex-based synchronization for fine-grained shared state.
Introduction
Mutexes and condition variables solve concurrency problems correctly and are the right tool for most situations — but they come with overhead. Each lock()/unlock() pair involves memory barriers and cache-line traffic, and under contention, OS system calls and thread blocking. For high-throughput systems — a web server handling thousands of requests per second, a game engine updating thousands of entities per frame, a financial trading system processing millions of transactions — that overhead accumulates.
For many common concurrency patterns, there is a faster alternative: atomic operations. Instead of protecting shared data with a lock, atomic operations provide hardware-level guarantees that a single read, write, or read-modify-write operation on a variable is indivisible — it either completes fully or not at all, with no possibility of another thread observing an intermediate state.
std::atomic<T> is the C++ abstraction over these hardware primitives. For integer types, it provides atomic increment, decrement, fetch-and-add, and bitwise operations. For all types, it provides atomic load, store, and compare-and-exchange. For pointers, it provides atomic pointer arithmetic.
Beyond the mechanics of std::atomic itself, this article explores the deeper topic of memory ordering — the rules that govern how memory operations in one thread are visible to other threads. Getting memory ordering right is what separates correct lock-free code from subtly broken code that works on x86 but fails on ARM.
This article builds your understanding from simple atomic counters through memory ordering to lock-free data structures. You will understand when atomics are the right tool, how to use them correctly, and where the sharp edges are.
The Problem: Counter Without Atomics
Let’s start with a clear demonstration of why a simple shared counter breaks without synchronization, and why std::atomic fixes it cleanly and efficiently.
#include <iostream>
#include <thread>
#include <atomic>
#include <vector>
#include <chrono>
using namespace std;
// Non-atomic: data race, undefined behavior
int unsafeCounter = 0;
// Atomic: correct and fast
atomic<int> safeCounter{0};
void incrementUnsafe(int n) {
for (int i = 0; i < n; i++) unsafeCounter++;
}
void incrementSafe(int n) {
for (int i = 0; i < n; i++) safeCounter++; // atomic increment
}
template<typename Func>
pair<long long, int> benchmark(Func f, int threads, int iterations) {
auto start = chrono::high_resolution_clock::now();
vector<thread> ts;
for (int i = 0; i < threads; i++)
ts.emplace_back(f, iterations);
for (auto& t : ts) t.join();
auto end = chrono::high_resolution_clock::now();
auto ms = chrono::duration_cast<chrono::milliseconds>(end - start).count();
return {ms, 0};
}
int main() {
const int THREADS = 8;
const int ITERS = 1'000'000;
const int EXPECTED = THREADS * ITERS;
// Unsafe
unsafeCounter = 0;
benchmark(incrementUnsafe, THREADS, ITERS);
cout << "Unsafe counter: " << unsafeCounter
<< " (expected " << EXPECTED << ", lost "
<< (EXPECTED - unsafeCounter) << ")" << endl;
// Safe with atomic
safeCounter = 0;
benchmark(incrementSafe, THREADS, ITERS);
cout << "Atomic counter: " << safeCounter.load()
<< " (expected " << EXPECTED << ")" << endl;
return 0;
}

Typical output (the unsafe count varies from run to run):
Unsafe counter: 5432187 (expected 8000000, lost 2567813)
Atomic counter: 8000000 (expected 8000000)

Step-by-step explanation:
- unsafeCounter++ is three machine instructions: load, increment, store. Two threads can execute these concurrently with interleaved steps, causing lost updates, so the result is almost always less than expected.
- safeCounter++ on an atomic<int> compiles to a single atomic XADD (fetch-and-add) instruction on x86: one indivisible read-modify-write. No other thread can see an intermediate state, so the result is always exactly THREADS * ITERS.
- The atomic<int> version requires no mutex, no lock_guard, no condition variable — just a different variable type. For this pattern (many threads incrementing a counter), atomics are faster than a mutex because there is no lock contention management, no OS scheduling, and the operation is a single hardware instruction.
- safeCounter.load() explicitly reads the atomic value. You can also use the implicit conversion (cout << safeCounter), but load() makes the atomic operation visible in the code, which is good practice for clarity.
std::atomic Basics: Load, Store, and Exchange
#include <iostream>
#include <atomic>
#include <thread>
using namespace std;
int main() {
atomic<int> counter{0};
atomic<bool> flag{false};
atomic<int*> ptr{nullptr};
// --- Basic operations ---
// Store: write a value atomically
counter.store(42);
cout << "After store: " << counter.load() << endl; // 42
// Load: read a value atomically
int val = counter.load();
cout << "Loaded: " << val << endl;
// Exchange: atomically set new value, return old
int old = counter.exchange(100);
cout << "exchange(100): old=" << old << ", new=" << counter.load() << endl;
// old=42, new=100
// Fetch-and-add: atomically add, return OLD value before addition
int before = counter.fetch_add(5);
cout << "fetch_add(5): before=" << before << ", now=" << counter.load() << endl;
// before=100, now=105
// Fetch-and-sub
before = counter.fetch_sub(3);
cout << "fetch_sub(3): before=" << before << ", now=" << counter.load() << endl;
// before=105, now=102
// Compound assignment operators (syntactic sugar for fetch_add etc.)
counter += 10; // Equivalent to counter.fetch_add(10)
counter -= 5; // Equivalent to counter.fetch_sub(5)
counter++; // Equivalent to counter.fetch_add(1)
++counter; // Also atomic increment
cout << "After +=10, -=5, ++, ++: " << counter.load() << endl; // 109
// Bitwise operations for integer atomics
atomic<unsigned int> bits{0b1111'0000};
bits.fetch_and(0b1010'1010); // AND
cout << "After AND: " << bits.load() << endl; // 0b1010'0000 = 160
bits.fetch_or(0b0000'0101); // OR
cout << "After OR: " << bits.load() << endl; // 0b1010'0101 = 165
bits.fetch_xor(0b1111'1111); // XOR
cout << "After XOR: " << bits.load() << endl; // 0b0101'1010 = 90
return 0;
}

Output:
After store: 42
Loaded: 42
exchange(100): old=42, new=100
fetch_add(5): before=100, now=105
fetch_sub(3): before=105, now=102
After +=10, -=5, ++, ++: 109
After AND: 160
After OR: 165
After XOR: 90

Step-by-step explanation:
- store(value) and load() are the atomic write and read primitives. For simple types, these compile to single machine instructions, and they are always atomic: no other thread can see a partial write.
- exchange(new_val) atomically sets the value to new_val and returns the old value. This is useful for "swap-and-check" patterns: you can atomically replace a value and know what you replaced.
- fetch_add(n) atomically adds n and returns the value before the addition (the "fetch" part). This is subtly different from += n, which does not return the old value. fetch_add is essential in patterns where you need to know the before-value (e.g., claiming a slot in an array).
- The compound assignment operators (+=, -=, ++, --) are convenience wrappers around fetch_add/fetch_sub. They are fully atomic but do not return the old value.
- Bitwise operations (fetch_and, fetch_or, fetch_xor) atomically modify individual bits. These are useful for atomic flag sets, bitmask operations, and managing sets of boolean flags as a single atomic word.
Compare-and-Exchange: The Foundation of Lock-Free Algorithms
compare_exchange_weak and compare_exchange_strong are the most powerful atomic operations. They are the fundamental building block of virtually all lock-free algorithms.
The semantics: “Atomically, if the current value equals expected, set it to desired and return true. Otherwise, load the current value into expected and return false.”
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#include <optional>
using namespace std;
// Lock-free max update: atomically ensure the stored value
// is the maximum of the current stored value and a new value
void atomicMax(atomic<int>& maxVal, int candidate) {
int current = maxVal.load();
// Keep trying until either:
// (a) current >= candidate (no update needed), or
// (b) we successfully update from current to candidate
while (candidate > current) {
// Try to swap: "if maxVal still equals current, set it to candidate"
if (maxVal.compare_exchange_weak(current, candidate)) {
// Success: we updated the value
break;
}
// Failure: another thread changed maxVal between our load and CAS
// 'current' has been updated to the actual current value by CAS
// Loop and try again with the new current value
}
}
// Lock-free stack (simplified Treiber stack)
template<typename T>
class LockFreeStack {
struct Node {
T value;
Node* next;
Node(T v) : value(v), next(nullptr) {}
};
atomic<Node*> head_{nullptr};
public:
void push(T value) {
Node* newNode = new Node(value);
// Atomically set head to newNode, making newNode->next the old head
// Retry if another thread changed head between our load and CAS
newNode->next = head_.load();
while (!head_.compare_exchange_weak(newNode->next, newNode)) {
// CAS failed: head changed; compare_exchange_weak updated
// newNode->next to the current head automatically — just retry
}
}
optional<T> pop() {
Node* oldHead = head_.load();
while (oldHead != nullptr) {
// Try to atomically advance head past oldHead
if (head_.compare_exchange_weak(oldHead, oldHead->next)) {
T value = oldHead->value;
delete oldHead;
return value;
}
// CAS failed: oldHead updated to current head — retry
}
return nullopt; // Stack was empty
}
~LockFreeStack() {
while (pop().has_value()) {}
}
};
int main() {
// --- atomicMax demo ---
cout << "=== Atomic Max ===" << endl;
atomic<int> maxValue{0};
vector<thread> threads;
vector<int> candidates = {5, 12, 3, 19, 7, 11, 9, 15};
for (int c : candidates) {
threads.emplace_back(atomicMax, ref(maxValue), c);
}
for (auto& t : threads) t.join();
cout << "Max of {5,12,3,19,7,11,9,15} = " << maxValue.load() << endl;
// --- Lock-free stack demo ---
cout << "\n=== Lock-Free Stack ===" << endl;
LockFreeStack<int> stack;
// Push from multiple threads
vector<thread> pushers;
for (int i = 0; i < 5; i++) {
pushers.emplace_back([&stack, i]() {
stack.push(i * 10);
cout << "Pushed: " << i * 10 << endl;
});
}
for (auto& t : pushers) t.join();
// Pop all items
cout << "Popping: ";
auto item = stack.pop();
while (item.has_value()) {
cout << *item << " ";
item = stack.pop();
}
cout << endl;
return 0;
}

Output (the order of the "Pushed" lines may vary):
=== Atomic Max ===
Max of {5,12,3,19,7,11,9,15} = 19
=== Lock-Free Stack ===
Pushed: 0
Pushed: 10
Pushed: 20
Pushed: 30
Pushed: 40
Popping: 40 30 20 10 0

Step-by-step explanation:
- compare_exchange_weak(expected, desired) is the CAS (compare-and-swap) operation. On success, it atomically changes the value from expected to desired and returns true. On failure, the value was not equal to expected, so nothing changes; expected is updated to the actual current value and it returns false.
- In atomicMax, the pattern is: load the current value, check whether the candidate is larger, attempt the CAS. If another thread changed maxVal between the load and the CAS, the CAS fails and we loop with the updated current. This retry loop is the core pattern of all lock-free algorithms.
- compare_exchange_weak vs compare_exchange_strong: weak may fail spuriously (return false even when the value equals expected) on some architectures (like ARM, where it maps to an LL/SC instruction pair), but is cheaper in loops. Strong never fails spuriously but may be slower. Rule: use weak in a retry loop, use strong when you only want to try once.
- In LockFreeStack::push(), newNode->next = head_.load() reads the current head. The CAS attempts to swing head_ from that value to newNode. If another thread pushed between our load and CAS, head_ changed — the CAS fails, newNode->next is updated to the new head_, and we retry.
- Both push and pop are lock-free, not wait-free: an individual thread may retry under contention, but some thread always makes progress, and no thread ever blocks another. Note that this simplified stack is not production-ready: pop reads oldHead->next after another thread may already have popped and deleted oldHead (a memory-reclamation hazard), and it is vulnerable to the ABA problem discussed later. Real implementations use hazard pointers or epoch-based reclamation.
Memory Ordering: The Critical Detail
The most subtle aspect of atomic operations is memory ordering — specifying how atomic operations interact with the visibility of non-atomic memory operations in other threads. Getting this wrong produces code that is correct on x86 (which has a strong memory model) but fails silently on ARM, PowerPC, or RISC-V (which have weaker models).
C++ provides six memory ordering options:
#include <atomic>
#include <thread>
#include <iostream>
#include <vector>
using namespace std;
atomic<int> data{0};
atomic<bool> ready{false};
// Example 1: Acquire-Release ordering (the most common and correct ordering)
// for producer-consumer patterns
void producer_acqrel() {
data.store(42, memory_order_relaxed); // Non-synchronizing store
// The release store: all writes before this point are visible
// to any thread that performs an acquire load of 'ready'
ready.store(true, memory_order_release);
}
void consumer_acqrel() {
// The acquire load: all writes visible to the producer before its
// release store are now visible to this thread
while (!ready.load(memory_order_acquire)) {
// Spin until ready — in practice, use a condition variable for this
}
// Guaranteed: data == 42 here, because of acquire-release pairing
cout << "data = " << data.load(memory_order_relaxed) << endl; // Always 42
}
// Example 2: Relaxed ordering — no synchronization guarantees
atomic<int> relaxedCounter{0};
void relaxedIncrement(int n) {
for (int i = 0; i < n; i++) {
// Relaxed: only guarantees the operation itself is atomic
// No ordering guarantees relative to other memory operations
relaxedCounter.fetch_add(1, memory_order_relaxed);
}
}
// Suitable for: counters where only the final total matters,
// not the order of increments relative to other operations
// Example 3: Sequential consistency (the default, and safest)
atomic<int> seqCst{0};
void sequentialOps() {
seqCst.store(1); // Default: memory_order_seq_cst
int v = seqCst.load(); // Default: memory_order_seq_cst
// All seq_cst operations form a single total order seen by ALL threads
}
int main() {
cout << "--- Acquire-Release demo ---" << endl;
data = 0;
ready = false;
thread prod(producer_acqrel);
thread cons(consumer_acqrel);
prod.join(); cons.join();
cout << "--- Relaxed counter ---" << endl;
relaxedCounter = 0;
vector<thread> threads;
for (int i = 0; i < 4; i++)
threads.emplace_back(relaxedIncrement, 1000000);
for (auto& t : threads) t.join();
cout << "Relaxed counter: " << relaxedCounter.load() << " (expected 4000000)" << endl;
return 0;
}

Output:
--- Acquire-Release demo ---
data = 42
--- Relaxed counter ---
Relaxed counter: 4000000 (expected 4000000)

The Six Memory Orders Explained
Understanding memory ordering requires understanding that modern CPUs and compilers reorder memory operations for performance. The memory ordering parameters control how much reordering is permitted.
memory_order_relaxed — The weakest ordering. Only guarantees the individual atomic operation is atomic — no ordering relative to other memory operations in the same thread or in other threads. Use for counters and statistics where only the final value matters, not intermediate ordering.
memory_order_acquire — Used on loads (reads). No reads or writes in the current thread can be moved before this load. When combined with a release store in another thread, it establishes a synchronization point: all writes that happened before the release store in the other thread are now visible.
memory_order_release — Used on stores (writes). No reads or writes in the current thread can be moved after this store. Paired with an acquire load, it makes all preceding writes visible to the thread doing the acquire.
memory_order_acq_rel — For read-modify-write operations (like fetch_add). Combines acquire and release: acts as an acquire for the load part and a release for the store part.
memory_order_consume — A weaker variant of acquire, only for data-dependent operations. Complex and rarely used correctly in practice — prefer acquire instead.
memory_order_seq_cst — The strongest ordering. All seq_cst operations form a single, globally consistent total order across all threads. This is the default when no ordering is specified. Safest but potentially slowest on architectures with weak memory models.
// Memory ordering quick reference:
// For a flag that signals "data is ready" (producer-consumer):
// Producer:
data.store(value, memory_order_relaxed); // The data write
flag.store(true, memory_order_release); // Signal: everything before here is visible
// Consumer:
while (!flag.load(memory_order_acquire)); // Wait for signal
int v = data.load(memory_order_relaxed); // Safe to read — guaranteed visible
// For simple counters (order doesn't matter):
counter.fetch_add(1, memory_order_relaxed); // Maximum performance
// When in doubt — use the default (seq_cst):
counter.fetch_add(1); // Safe, correct, slightly slower on weak-memory archs

Practical Patterns with std::atomic
Pattern 1: Atomic Flags for One-Time Events
#include <iostream>
#include <atomic>
#include <thread>
#include <chrono>
using namespace std;
atomic<bool> shutdownRequested{false};
atomic<bool> emergencyStop{false};
void worker(int id) {
int count = 0;
while (!shutdownRequested.load(memory_order_relaxed)) {
// Do work
count++;
if (count % 100000 == 0) {
cout << "Worker " << id << ": " << count << " iterations" << endl;
}
if (emergencyStop.load(memory_order_acquire)) {
cout << "Worker " << id << ": emergency stop!" << endl;
return;
}
this_thread::yield();
}
cout << "Worker " << id << ": normal shutdown after " << count << " iters" << endl;
}
int main() {
thread w1(worker, 1);
thread w2(worker, 2);
this_thread::sleep_for(chrono::milliseconds(10));
shutdownRequested.store(true, memory_order_relaxed);
w1.join();
w2.join();
cout << "All workers shut down" << endl;
return 0;
}

shutdownRequested uses relaxed ordering for the inner loop check — we only care that the flag is eventually seen, not that it establishes a happens-before relationship with specific data. For emergencyStop, acquire is used because it might need to synchronize with data written before the emergency condition was set.
Pattern 2: Reference Counting with atomics
std::shared_ptr uses atomic reference counting internally. You can build similar patterns:
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#include <string>
#include <chrono>
using namespace std;
class RefCounted {
atomic<int> refCount_{1};
string name_;
public:
RefCounted(string name) : name_(name) {
cout << "Created: " << name_ << endl;
}
~RefCounted() {
cout << "Destroyed: " << name_ << endl;
}
void addRef() {
refCount_.fetch_add(1, memory_order_relaxed);
}
void release() {
// fetch_sub with release: ensures all accesses to the object
// happen-before the decrement that triggers destruction
if (refCount_.fetch_sub(1, memory_order_release) == 1) {
// We decremented from 1 to 0: we're the last owner
// acquire fence to sync with other releases
atomic_thread_fence(memory_order_acquire);
delete this;
}
}
void use() const { cout << "Using: " << name_ << endl; }
};
int main() {
RefCounted* obj = new RefCounted("SharedResource");
// Multiple threads "share" the object
vector<thread> threads;
for (int i = 0; i < 3; i++) {
obj->addRef();
threads.emplace_back([obj, i]() {
this_thread::sleep_for(chrono::milliseconds(i * 20));
obj->use();
obj->release(); // Will delete when last reference released
});
}
obj->release(); // Main thread releases its initial reference
for (auto& t : threads) t.join();
return 0;
}

Output:
Created: SharedResource
Using: SharedResource
Using: SharedResource
Using: SharedResource
Destroyed: SharedResource

The release ordering on fetch_sub ensures all uses of the object in the current thread are visible to whichever thread performs the final decrement. The acquire fence on the final decrement ensures those uses are visible before destruction — this is exactly how std::shared_ptr's reference counting works.
Pattern 3: Atomic Pointer for Lock-Free Reads
#include <iostream>
#include <atomic>
#include <thread>
#include <string>
#include <chrono>
using namespace std;
struct Config {
int timeout;
int maxConnections;
string endpoint;
Config(int t, int m, string e)
: timeout(t), maxConnections(m), endpoint(move(e)) {}
};
// Atomic pointer: readers get a consistent snapshot without locking
atomic<Config*> currentConfig{
new Config(30, 100, "localhost:8080")
};
// Readers: many threads, lock-free
void readConfig(int id) {
// Acquire load: ensures we see all writes that happened before
// the store of this pointer
const Config* cfg = currentConfig.load(memory_order_acquire);
cout << "Thread " << id << ": timeout=" << cfg->timeout
<< " endpoint=" << cfg->endpoint << endl;
}
// Writer: updates the entire config atomically
void updateConfig() {
// Release the new config: all writes to new Config happen-before
// any acquire load that sees this pointer
Config* newCfg = new Config(60, 200, "prod.example.com:443");
Config* old = currentConfig.exchange(newCfg, memory_order_release);
// old config: in a real system, defer deletion (hazard pointers, RCU)
// For simplicity here, we just delete after a delay
this_thread::sleep_for(chrono::milliseconds(10));
delete old;
}
int main() {
// Readers run concurrently with writer
thread r1(readConfig, 1);
thread r2(readConfig, 2);
thread w(updateConfig);
thread r3(readConfig, 3);
r1.join(); r2.join(); w.join(); r3.join();
delete currentConfig.load(); // Clean up final config
return 0;
}

This pattern — an atomic pointer to an immutable config object, updated by replacing the whole pointer — is used in systems that need frequent lock-free reads with rare updates. Each reader gets a consistent snapshot; the writer atomically publishes a completely new version.
is_lock_free(): Checking Hardware Support
Not all types support truly lock-free atomic operations on all hardware. For types larger than the machine word size, the compiler may implement atomics using a hidden internal mutex.
#include <iostream>
#include <atomic>
using namespace std;
struct SmallStruct { int x; };
struct LargeStruct { int x, y, z, w; double d1, d2; }; // 32+ bytes
int main() {
atomic<int> ai;
atomic<long> al;
atomic<double> ad;
atomic<int*> ap;
atomic<SmallStruct> as;
atomic<LargeStruct> aL;
cout << "atomic<int>: lock_free=" << ai.is_lock_free() << endl;
cout << "atomic<long>: lock_free=" << al.is_lock_free() << endl;
cout << "atomic<double>: lock_free=" << ad.is_lock_free() << endl;
cout << "atomic<int*>: lock_free=" << ap.is_lock_free() << endl;
cout << "atomic<SmallStruct>: lock_free=" << as.is_lock_free() << endl;
cout << "atomic<LargeStruct>: lock_free=" << aL.is_lock_free() << endl;
// Compile-time check using ATOMIC_INT_LOCK_FREE
cout << "\nAtomic lock-free constants:" << endl;
cout << "ATOMIC_INT_LOCK_FREE: " << ATOMIC_INT_LOCK_FREE << endl;
cout << "ATOMIC_LONG_LOCK_FREE: " << ATOMIC_LONG_LOCK_FREE << endl;
cout << "ATOMIC_POINTER_LOCK_FREE: " << ATOMIC_POINTER_LOCK_FREE << endl;
// 0 = never lock-free, 1 = sometimes (implementation-defined), 2 = always
return 0;
}

Typical output on a 64-bit x86 system:
atomic<int>: lock_free=1
atomic<long>: lock_free=1
atomic<double>: lock_free=1
atomic<int*>: lock_free=1
atomic<SmallStruct>: lock_free=1
atomic<LargeStruct>: lock_free=0

LargeStruct is not lock-free because it is larger than what a single CAS instruction can handle on this architecture. The compiler falls back to an internal mutex. In this case, atomic<LargeStruct> provides atomicity but not the performance benefits of true lock-free operation.
For performance-critical code, ensure your atomic types are always lock-free by keeping them small (pointer-sized or smaller), or use a pointer to an immutable larger struct (as shown in the config pattern above).
Atomics vs. Mutexes: When to Use Which
Choosing between atomics and mutexes depends on the nature of the shared state and the operations performed on it.
| Scenario | Best Choice | Reason |
|---|---|---|
| Simple counter (increments only) | atomic<int> | Single instruction, no blocking |
| Boolean flag (set once, read many) | atomic<bool> | No blocking, simple semantics |
| Pointer swap (publish new version) | atomic<T*> | Lock-free pointer exchange |
| Multiple variables updated together | mutex | Atomics can’t span multiple variables |
| Complex data structure update | mutex | CAS only works on one value at a time |
| Condition-based waiting | mutex + condition_variable | Atomics have no wait/notify |
| Reference counting | atomic<int> | Natural fit — one value, one operation |
| Statistics and metrics | atomic<int> with relaxed | Maximum performance, no sync needed |
| Read-heavy shared config | shared_mutex or atomic pointer | Depends on update complexity |
| Queue between threads | condition_variable + mutex | Need wait-on-empty semantics |
Key insight: Atomics can only make a single variable’s operations atomic. If you need to atomically update two or more related variables together — or if a thread needs to wait until some condition becomes true — you need a mutex (and possibly a condition variable).
std::atomic_flag: The Simplest Atomic
std::atomic_flag is the only atomic type guaranteed to be lock-free on all platforms. It provides the most primitive atomic: a boolean flag with only test_and_set() and clear() operations (C++20 adds test(), wait(), and notify_one()). It is the classic building block for spin locks.
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
using namespace std;
// Spin lock using atomic_flag
class SpinLock {
atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
void lock() {
// Spin until we acquire the flag (test_and_set returns false)
while (flag_.test_and_set(memory_order_acquire)) {
// Yield so other threads (including the lock holder) can run.
// A hotter spin loop would instead issue a CPU pause hint,
// e.g., _mm_pause() on x86, rather than yielding to the OS.
this_thread::yield();
}
}
void unlock() {
flag_.clear(memory_order_release);
}
};
SpinLock spinLock;
int protectedCounter = 0;
void incrementWithSpinLock(int n) {
for (int i = 0; i < n; i++) {
spinLock.lock();
protectedCounter++;
spinLock.unlock();
}
}
int main() {
const int THREADS = 4;
const int ITERS = 500000;
vector<thread> threads;
for (int i = 0; i < THREADS; i++)
threads.emplace_back(incrementWithSpinLock, ITERS);
for (auto& t : threads) t.join();
cout << "SpinLock counter: " << protectedCounter
<< " (expected " << THREADS * ITERS << ")" << endl;
return 0;
}

Output:
SpinLock counter: 2000000 (expected 2000000)

When to use a spin lock: Only when the critical section is extremely short (nanoseconds) and the lock is almost always uncontended. Spin locks waste CPU when threads must wait longer. For most production code, use std::mutex instead — it yields the CPU when blocked, letting other threads run.
Common Mistakes with Atomics
Mistake 1: Thinking atomics make compound operations atomic.
atomic<int> x{0}, y{0};
// NOT atomic together! Another thread can observe x=1, y=0
x.store(1);
y.store(1);
// For atomic compound operations: use a mutex or pack both into one atomic

Mistake 2: Using relaxed ordering when synchronization is needed.
atomic<bool> ready{false};
int data = 0;
// Producer:
data = 42;
ready.store(true, memory_order_relaxed); // WRONG: no happens-before with data
// Consumer:
while (!ready.load(memory_order_relaxed)); // WRONG: may see ready=true but data=0!
cout << data; // Undefined behavior on weak-memory architectures
// Fix: use release/acquire pair

Mistake 3: Using seq_cst blindly without understanding the cost. seq_cst is correct but may add full memory fences on ARM/PowerPC. For counters and flags that don't need cross-thread ordering guarantees, relaxed is correct and faster. Profile before assuming seq_cst is a bottleneck.
Mistake 4: ABA problem in CAS loops.
// Thread reads A at address X
// Another thread changes X: A -> B -> A
// First thread's CAS succeeds (sees A again) — but the state changed!
// Solution: pack a version counter alongside the value in a single atomic
// word (a tagged pointer or index+generation pair), or use schemes that
// handle ABA safely (hazard pointers, epoch-based reclamation)

Mistake 5: Lock-free != wait-free != fast. Lock-free means at least one thread makes progress at any time. Wait-free means every thread makes progress. Neither automatically means faster than a mutex — at high contention, CAS retry loops can be slower than a well-designed mutex. Measure before optimizing.
Conclusion
std::atomic provides the C++ interface to hardware atomic operations — the fundamental building blocks of lock-free concurrent programming. For simple patterns (counters, flags, pointer publishing), atomics are faster than mutexes because they map to single hardware instructions with no OS involvement, no blocking, and no scheduling overhead.
The key operations — load, store, exchange, fetch_add, and the powerful compare_exchange — cover virtually all atomic patterns. The compare-and-exchange operation, with its retry loop idiom, is the foundation of all lock-free data structures: lock-free stacks, queues, hash maps, and reference counting all use CAS at their core.
Memory ordering is the most subtle aspect of atomic programming. The acquire-release pairing establishes happens-before relationships across threads, ensuring that data written before a release store is visible after the corresponding acquire load. Relaxed ordering maximizes performance when only atomicity (not ordering) is needed. Sequential consistency provides the strongest guarantees and is the safe default when performance is not critical.
Knowing when to use atomics versus mutexes is essential: atomics for single-variable patterns, mutexes for multi-variable invariants or when waiting is needed. And always verify with is_lock_free() that your atomic type truly has lock-free hardware support — an atomic that falls back to an internal mutex gives you safety but not performance.
Lock-free programming is powerful but difficult to get right. For most concurrent code, mutexes and condition variables are the correct and maintainable choice. Use atomics where profiling shows mutex overhead is a genuine bottleneck, where the pattern naturally fits a single atomic variable, or where the lock-free guarantees (no blocking, no priority inversion) are a hard requirement.