std::atomic<T> in C++ is a template that wraps a value and guarantees that all operations on it — reads, writes, increments, and compare-exchanges — are indivisible (atomic). No other thread can observe the value in a partially-updated state, and no separate mutex is needed. For simple types like int, bool, and pointers, atomic operations compile down to single machine instructions (lock-prefixed on x86), making them significantly faster than mutex-based synchronization for fine-grained shared state.
Introduction
Mutexes and condition variables solve concurrency problems correctly and are the right tool for most situations — but they come with overhead. Each lock()/unlock() pair involves memory barriers and cache-line traffic, and under contention, OS system calls and thread blocking. For high-throughput systems — a web server handling thousands of requests per second, a game engine updating thousands of entities per frame, a financial trading system processing millions of transactions — that overhead accumulates.
For many common concurrency patterns, there is a faster alternative: atomic operations. Instead of protecting shared data with a lock, atomic operations provide hardware-level guarantees that a single read, write, or read-modify-write operation on a variable is indivisible — it either completes fully or not at all, with no possibility of another thread observing an intermediate state.
std::atomic<T> is the C++ abstraction over these hardware primitives. For integer types, it provides atomic increment, decrement, fetch-and-add, and bitwise operations. For all types, it provides atomic load, store, and compare-and-exchange. For pointers, it provides atomic pointer arithmetic.
Beyond the mechanics of std::atomic itself, this article explores the deeper topic of memory ordering — the rules that govern how memory operations in one thread are visible to other threads. Getting memory ordering right is what separates correct lock-free code from subtly broken code that works on x86 but fails on ARM.
This article builds your understanding from simple atomic counters through memory ordering to lock-free data structures. You will understand when atomics are the right tool, how to use them correctly, and where the sharp edges are.
The Problem: Counter Without Atomics
Let’s start with a clear demonstration of why a simple shared counter breaks without synchronization, and why std::atomic fixes it cleanly and efficiently.
#include <iostream>
#include <thread>
#include <atomic>
#include <vector>
#include <chrono>
using namespace std;
// Non-atomic: data race, undefined behavior
int unsafeCounter = 0;
// Atomic: correct and fast
atomic<int> safeCounter{0};
void incrementUnsafe(int n) {
for (int i = 0; i < n; i++) unsafeCounter++;
}
void incrementSafe(int n) {
for (int i = 0; i < n; i++) safeCounter++; // atomic increment
}
template<typename Func>
pair<long long, int> benchmark(Func f, int threads, int iterations) {
auto start = chrono::high_resolution_clock::now();
vector<thread> ts;
for (int i = 0; i < threads; i++)
ts.emplace_back(f, iterations);
for (auto& t : ts) t.join();
auto end = chrono::high_resolution_clock::now();
auto ms = chrono::duration_cast<chrono::milliseconds>(end - start).count();
return {ms, 0};
}
int main() {
const int THREADS = 8;
const int ITERS = 1'000'000;
const int EXPECTED = THREADS * ITERS;
// Unsafe
unsafeCounter = 0;
benchmark(incrementUnsafe, THREADS, ITERS);
cout << "Unsafe counter: " << unsafeCounter
<< " (expected " << EXPECTED << ", lost "
<< (EXPECTED - unsafeCounter) << ")" << endl;
// Safe with atomic
safeCounter = 0;
benchmark(incrementSafe, THREADS, ITERS);
cout << "Atomic counter: " << safeCounter.load()
<< " (expected " << EXPECTED << ")" << endl;
return 0;
}

Typical output (the unsafe count varies from run to run):
Unsafe counter: 5432187 (expected 8000000, lost 2567813)
Atomic counter: 8000000 (expected 8000000)

Step-by-step explanation:
- unsafeCounter++ is three machine instructions: load, increment, store. Two threads can execute these concurrently with interleaved steps, causing lost updates, so the result is almost always less than expected.
- safeCounter++ on an atomic<int> compiles to a single atomic XADD (fetch-and-add) instruction on x86: one indivisible read-modify-write. No other thread can see an intermediate state, so the result is always exactly THREADS * ITERS.
- The atomic<int> version requires no mutex, no lock_guard, no condition variable — just a different variable type. For this pattern (many threads incrementing a counter), atomics are faster than a mutex because there is no lock contention management, no OS scheduling, and the operation is a single hardware instruction.
- safeCounter.load() explicitly reads the atomic value. You can also use the implicit conversion (cout << safeCounter), but load() makes the atomic operation visible in the code, which is good practice for clarity.
std::atomic Basics: Load, Store, and Exchange
#include <iostream>
#include <atomic>
#include <thread>
using namespace std;
int main() {
atomic<int> counter{0};
atomic<bool> flag{false};
atomic<int*> ptr{nullptr};
// --- Basic operations ---
// Store: write a value atomically
counter.store(42);
cout << "After store: " << counter.load() << endl; // 42
// Load: read a value atomically
int val = counter.load();
cout << "Loaded: " << val << endl;
// Exchange: atomically set new value, return old
int old = counter.exchange(100);
cout << "exchange(100): old=" << old << ", new=" << counter.load() << endl;
// old=42, new=100
// Fetch-and-add: atomically add, return OLD value before addition
int before = counter.fetch_add(5);
cout << "fetch_add(5): before=" << before << ", now=" << counter.load() << endl;
// before=100, now=105
// Fetch-and-sub
before = counter.fetch_sub(3);
cout << "fetch_sub(3): before=" << before << ", now=" << counter.load() << endl;
// before=105, now=102
// Compound assignment operators (syntactic sugar for fetch_add etc.)
counter += 10; // Equivalent to counter.fetch_add(10)
counter -= 5; // Equivalent to counter.fetch_sub(5)
counter++; // Equivalent to counter.fetch_add(1)
++counter; // Also atomic increment
cout << "After +=10, -=5, ++, ++: " << counter.load() << endl; // 109
// Bitwise operations for integer atomics
atomic<unsigned int> bits{0b1111'0000};
bits.fetch_and(0b1010'1010); // AND
cout << "After AND: " << bits.load() << endl; // 0b1010'0000 = 160
bits.fetch_or(0b0000'0101); // OR
cout << "After OR: " << bits.load() << endl; // 0b1010'0101 = 165
bits.fetch_xor(0b1111'1111); // XOR
cout << "After XOR: " << bits.load() << endl; // 0b0101'1010 = 90
return 0;
}

Output:
After store: 42
Loaded: 42
exchange(100): old=42, new=100
fetch_add(5): before=100, now=105
fetch_sub(3): before=105, now=102
After +=10, -=5, ++, ++: 109
After AND: 160
After OR: 165
After XOR: 90

Step-by-step explanation:
- store(value) and load() are the atomic write and read primitives. For simple types, these compile to single machine instructions, and they are always atomic: no other thread can see a partial write.
- exchange(new_val) atomically sets the value to new_val and returns the old value. This is useful for "swap-and-check" patterns: you can atomically replace a value and know what you replaced.
- fetch_add(n) atomically adds n and returns the value before the addition (the "fetch" part). This is subtly different from += n, which does not return the old value. fetch_add is essential in patterns where you need to know the before-value (e.g., claiming a slot in an array).
- The compound assignment operators (+=, -=, ++, --) are convenience wrappers around fetch_add/fetch_sub. They are fully atomic but do not return the old value.
- Bitwise operations (fetch_and, fetch_or, fetch_xor) atomically modify individual bits. These are useful for atomic flag sets, bitmask operations, and managing sets of boolean flags as a single atomic word.
Compare-and-Exchange: The Foundation of Lock-Free Algorithms
compare_exchange_weak and compare_exchange_strong are the most powerful atomic operations. They are the fundamental building block of virtually all lock-free algorithms.
The semantics: “Atomically, if the current value equals expected, set it to desired and return true. Otherwise, load the current value into expected and return false.”
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#include <optional>
using namespace std;
// Lock-free max update: atomically ensure the stored value
// is the maximum of the current stored value and a new value
void atomicMax(atomic<int>& maxVal, int candidate) {
int current = maxVal.load();
// Keep trying until either:
// (a) current >= candidate (no update needed), or
// (b) we successfully update from current to candidate
while (candidate > current) {
// Try to swap: "if maxVal still equals current, set it to candidate"
if (maxVal.compare_exchange_weak(current, candidate)) {
// Success: we updated the value
break;
}
// Failure: another thread changed maxVal between our load and CAS
// 'current' has been updated to the actual current value by CAS
// Loop and try again with the new current value
}
}
// Lock-free stack (simplified Treiber stack)
template<typename T>
class LockFreeStack {
struct Node {
T value;
Node* next;
Node(T v) : value(v), next(nullptr) {}
};
atomic<Node*> head_{nullptr};
public:
void push(T value) {
Node* newNode = new Node(value);
// Atomically set head to newNode, making newNode->next the old head
// Retry if another thread changed head between our load and CAS
newNode->next = head_.load();
while (!head_.compare_exchange_weak(newNode->next, newNode)) {
// CAS failed: head changed; compare_exchange_weak updated
// newNode->next to the current head automatically — just retry
}
}
optional<T> pop() {
Node* oldHead = head_.load();
while (oldHead != nullptr) {
// Try to atomically advance head past oldHead
if (head_.compare_exchange_weak(oldHead, oldHead->next)) {
T value = oldHead->value;
delete oldHead;
return value;
}
// CAS failed: oldHead updated to current head — retry
}
return nullopt; // Stack was empty
}
~LockFreeStack() {
while (pop().has_value()) {}
}
};
int main() {
// --- atomicMax demo ---
cout << "=== Atomic Max ===" << endl;
atomic<int> maxValue{0};
vector<thread> threads;
vector<int> candidates = {5, 12, 3, 19, 7, 11, 9, 15};
for (int c : candidates) {
threads.emplace_back(atomicMax, ref(maxValue), c);
}
for (auto& t : threads) t.join();
cout << "Max of {5,12,3,19,7,11,9,15} = " << maxValue.load() << endl;
// --- Lock-free stack demo ---
cout << "\n=== Lock-Free Stack ===" << endl;
LockFreeStack<int> stack;
// Push from multiple threads
vector<thread> pushers;
for (int i = 0; i < 5; i++) {
pushers.emplace_back([&stack, i]() {
stack.push(i * 10);
cout << "Pushed: " << i * 10 << endl;
});
}
for (auto& t : pushers) t.join();
// Pop all items
cout << "Popping: ";
auto item = stack.pop();
while (item.has_value()) {
cout << *item << " ";
item = stack.pop();
}
cout << endl;
return 0;
}

Output (the order of the "Pushed" lines may vary):
=== Atomic Max ===
Max of {5,12,3,19,7,11,9,15} = 19
=== Lock-Free Stack ===
Pushed: 0
Pushed: 10
Pushed: 20
Pushed: 30
Pushed: 40
Popping: 40 30 20 10 0

Step-by-step explanation:
- compare_exchange_weak(expected, desired) is the CAS (compare-and-swap) operation. On success, it atomically changes the value from expected to desired and returns true. On failure, the value was not equal to expected, so nothing changes; expected is updated to the actual current value and it returns false.
- In atomicMax, the pattern is: load the current value, check whether the candidate is larger, attempt the CAS. If another thread changed maxVal between the load and the CAS, the CAS fails and we loop with the updated current. This retry loop is the core pattern of all lock-free algorithms.
- compare_exchange_weak vs compare_exchange_strong: weak may fail spuriously (return false even when the value equals expected) on some architectures (like ARM, where it maps to an LL/SC instruction pair), but is cheaper in loops. Strong never fails spuriously but may be slower. Rule: use weak in a retry loop, use strong when you only want to try once.
- In LockFreeStack::push(), newNode->next = head_.load() reads the current head. The CAS attempts to swing head_ from that value to newNode. If another thread pushed between our load and CAS, head_ changed — the CAS fails, newNode->next is updated to the new head_, and we retry.
- Both push and pop are lock-free, not wait-free: an individual thread may retry under contention, but some thread always makes progress, and no thread ever blocks another. Note that this simplified stack is not production-ready: pop reads oldHead->next after another thread may already have popped and deleted oldHead (a memory-reclamation hazard), and it is vulnerable to the ABA problem discussed later. Real implementations use hazard pointers or epoch-based reclamation.
Memory Ordering: The Critical Detail
The most subtle aspect of atomic operations is memory ordering — specifying how atomic operations interact with the visibility of non-atomic memory operations in other threads. Getting this wrong produces code that is correct on x86 (which has a strong memory model) but fails silently on ARM, PowerPC, or RISC-V (which have weaker models).
C++ provides six memory ordering options:
#include <atomic>
#include <thread>
#include <iostream>
#include <vector>
using namespace std;
atomic<int> data{0};
atomic<bool> ready{false};
// Example 1: Acquire-Release ordering (the most common and correct ordering)
// for producer-consumer patterns
void producer_acqrel() {
data.store(42, memory_order_relaxed); // Non-synchronizing store
// The release store: all writes before this point are visible
// to any thread that performs an acquire load of 'ready'
ready.store(true, memory_order_release);
}
void consumer_acqrel() {
// The acquire load: all writes visible to the producer before its
// release store are now visible to this thread
while (!ready.load(memory_order_acquire)) {
// Spin until ready — in practice, use a condition variable for this
}
// Guaranteed: data == 42 here, because of acquire-release pairing
cout << "data = " << data.load(memory_order_relaxed) << endl; // Always 42
}
// Example 2: Relaxed ordering — no synchronization guarantees
atomic<int> relaxedCounter{0};
void relaxedIncrement(int n) {
for (int i = 0; i < n; i++) {
// Relaxed: only guarantees the operation itself is atomic
// No ordering guarantees relative to other memory operations
relaxedCounter.fetch_add(1, memory_order_relaxed);
}
}
// Suitable for: counters where only the final total matters,
// not the order of increments relative to other operations
// Example 3: Sequential consistency (the default, and safest)
atomic<int> seqCst{0};
void sequentialOps() {
seqCst.store(1); // Default: memory_order_seq_cst
int v = seqCst.load(); // Default: memory_order_seq_cst
// All seq_cst operations form a single total order seen by ALL threads
}
int main() {
cout << "--- Acquire-Release demo ---" << endl;
data = 0;
ready = false;
thread prod(producer_acqrel);
thread cons(consumer_acqrel);
prod.join(); cons.join();
cout << "--- Relaxed counter ---" << endl;
relaxedCounter = 0;
vector<thread> threads;
for (int i = 0; i < 4; i++)
threads.emplace_back(relaxedIncrement, 1000000);
for (auto& t : threads) t.join();
cout << "Relaxed counter: " << relaxedCounter.load() << " (expected 4000000)" << endl;
return 0;
}

Output:
--- Acquire-Release demo ---
data = 42
--- Relaxed counter ---
Relaxed counter: 4000000 (expected 4000000)

The Six Memory Orders Explained
Understanding memory ordering requires understanding that modern CPUs and compilers reorder memory operations for performance. The memory ordering parameters control how much reordering is permitted.
memory_order_relaxed — The weakest ordering. Only guarantees the individual atomic operation is atomic — no ordering relative to other memory operations in the same thread or in other threads. Use for counters and statistics where only the final value matters, not intermediate ordering.
memory_order_acquire — Used on loads (reads). No reads or writes in the current thread can be moved before this load. When combined with a release store in another thread, it establishes a synchronization point: all writes that happened before the release store in the other thread are now visible.
memory_order_release — Used on stores (writes). No reads or writes in the current thread can be moved after this store. Paired with an acquire load, it makes all preceding writes visible to the thread doing the acquire.
memory_order_acq_rel — For read-modify-write operations (like fetch_add). Combines acquire and release: acts as an acquire for the load part and a release for the store part.
memory_order_consume — A weaker variant of acquire, only for data-dependent operations. Complex and rarely used correctly in practice — prefer acquire instead.
memory_order_seq_cst — The strongest ordering. All seq_cst operations form a single, globally consistent total order across all threads. This is the default when no ordering is specified. Safest but potentially slowest on architectures with weak memory models.
// Memory ordering quick reference:
// For a flag that signals "data is ready" (producer-consumer):
// Producer:
data.store(value, memory_order_relaxed); // The data write
flag.store(true, memory_order_release); // Signal: everything before here is visible
// Consumer:
while (!flag.load(memory_order_acquire)); // Wait for signal
int v = data.load(memory_order_relaxed); // Safe to read — guaranteed visible
// For simple counters (order doesn't matter):
counter.fetch_add(1, memory_order_relaxed); // Maximum performance
// When in doubt — use the default (seq_cst):
counter.fetch_add(1); // Safe, correct, slightly slower on weak-memory archs

Practical Patterns with std::atomic
Pattern 1: Atomic Flags for One-Time Events
#include <iostream>
#include <atomic>
#include <thread>
#include <chrono>
using namespace std;
atomic<bool> shutdownRequested{false};
atomic<bool> emergencyStop{false};
void worker(int id) {
int count = 0;
while (!shutdownRequested.load(memory_order_relaxed)) {
// Do work
count++;
if (count % 100000 == 0) {
cout << "Worker " << id << ": " << count << " iterations" << endl;
}
if (emergencyStop.load(memory_order_acquire)) {
cout << "Worker " << id << ": emergency stop!" << endl;
return;
}
this_thread::yield();
}
cout << "Worker " << id << ": normal shutdown after " << count << " iters" << endl;
}
int main() {
thread w1(worker, 1);
thread w2(worker, 2);
this_thread::sleep_for(chrono::milliseconds(10));
shutdownRequested.store(true, memory_order_relaxed);
w1.join();
w2.join();
cout << "All workers shut down" << endl;
return 0;
}

shutdownRequested uses relaxed ordering for the inner loop check — we only care that the flag is eventually seen, not that it establishes a happens-before relationship with specific data. For emergencyStop, acquire is used because it might need to synchronize with data written before the emergency condition was set.
Pattern 2: Reference Counting with atomics
std::shared_ptr uses atomic reference counting internally. You can build similar patterns:
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#include <string>
#include <chrono>
using namespace std;
class RefCounted {
atomic<int> refCount_{1};
string name_;
public:
RefCounted(string name) : name_(name) {
cout << "Created: " << name_ << endl;
}
~RefCounted() {
cout << "Destroyed: " << name_ << endl;
}
void addRef() {
refCount_.fetch_add(1, memory_order_relaxed);
}
void release() {
// fetch_sub with release: ensures all accesses to the object
// happen-before the decrement that triggers destruction
if (refCount_.fetch_sub(1, memory_order_release) == 1) {
// We decremented from 1 to 0: we're the last owner
// acquire fence to sync with other releases
atomic_thread_fence(memory_order_acquire);
delete this;
}
}
void use() const { cout << "Using: " << name_ << endl; }
};
int main() {
RefCounted* obj = new RefCounted("SharedResource");
// Multiple threads "share" the object
vector<thread> threads;
for (int i = 0; i < 3; i++) {
obj->addRef();
threads.emplace_back([obj, i]() {
this_thread::sleep_for(chrono::milliseconds(i * 20));
obj->use();
obj->release(); // Will delete when last reference released
});
}
obj->release(); // Main thread releases its initial reference
for (auto& t : threads) t.join();
return 0;
}

Output:
Created: SharedResource
Using: SharedResource
Using: SharedResource
Using: SharedResource
Destroyed: SharedResource

The release ordering on fetch_sub ensures all uses of the object in the current thread are visible to whichever thread performs the final decrement. The acquire fence on the final decrement ensures those uses are visible before destruction — this is exactly how std::shared_ptr's reference counting works.
Pattern 3: Atomic Pointer for Lock-Free Reads
#include <iostream>
#include <atomic>
#include <thread>
#include <string>
#include <chrono>
using namespace std;
struct Config {
int timeout;
int maxConnections;
string endpoint;
Config(int t, int m, string e)
: timeout(t), maxConnections(m), endpoint(move(e)) {}
};
// Atomic pointer: readers get a consistent snapshot without locking
atomic<Config*> currentConfig{
new Config(30, 100, "localhost:8080")
};
// Readers: many threads, lock-free
void readConfig(int id) {
// Acquire load: ensures we see all writes that happened before
// the store of this pointer
const Config* cfg = currentConfig.load(memory_order_acquire);
cout << "Thread " << id << ": timeout=" << cfg->timeout
<< " endpoint=" << cfg->endpoint << endl;
}
// Writer: updates the entire config atomically
void updateConfig() {
// Release the new config: all writes to new Config happen-before
// any acquire load that sees this pointer
Config* newCfg = new Config(60, 200, "prod.example.com:443");
Config* old = currentConfig.exchange(newCfg, memory_order_release);
// old config: in a real system, defer deletion (hazard pointers, RCU)
// For simplicity here, we just delete after a delay
this_thread::sleep_for(chrono::milliseconds(10));
delete old;
}
int main() {
// Readers run concurrently with writer
thread r1(readConfig, 1);
thread r2(readConfig, 2);
thread w(updateConfig);
thread r3(readConfig, 3);
r1.join(); r2.join(); w.join(); r3.join();
delete currentConfig.load(); // Clean up final config
return 0;
}

This pattern — an atomic pointer to an immutable config object, updated by replacing the whole pointer — is used in systems that need frequent lock-free reads with rare updates. Each reader gets a consistent snapshot; the writer atomically publishes a completely new version.
is_lock_free(): Checking Hardware Support
Not all types support truly lock-free atomic operations on all hardware. For types larger than the machine word size, the compiler may implement atomics using a hidden internal mutex.
#include <iostream>
#include <atomic>
using namespace std;
struct SmallStruct { int x; };
struct LargeStruct { int x, y, z, w; double d1, d2; }; // 32+ bytes
int main() {
atomic<int> ai;
atomic<long> al;
atomic<double> ad;
atomic<int*> ap;
atomic<SmallStruct> as;
atomic<LargeStruct> aL;
cout << "atomic<int>: lock_free=" << ai.is_lock_free() << endl;
cout << "atomic<long>: lock_free=" << al.is_lock_free() << endl;
cout << "atomic<double>: lock_free=" << ad.is_lock_free() << endl;
cout << "atomic<int*>: lock_free=" << ap.is_lock_free() << endl;
cout << "atomic<SmallStruct>: lock_free=" << as.is_lock_free() << endl;
cout << "atomic<LargeStruct>: lock_free=" << aL.is_lock_free() << endl;
// Compile-time check using ATOMIC_INT_LOCK_FREE
cout << "\nAtomic lock-free constants:" << endl;
cout << "ATOMIC_INT_LOCK_FREE: " << ATOMIC_INT_LOCK_FREE << endl;
cout << "ATOMIC_LONG_LOCK_FREE: " << ATOMIC_LONG_LOCK_FREE << endl;
cout << "ATOMIC_POINTER_LOCK_FREE: " << ATOMIC_POINTER_LOCK_FREE << endl;
// 0 = never lock-free, 1 = sometimes (implementation-defined), 2 = always
return 0;
}

Typical output on a 64-bit x86 system:
atomic<int>: lock_free=1
atomic<long>: lock_free=1
atomic<double>: lock_free=1
atomic<int*>: lock_free=1
atomic<SmallStruct>: lock_free=1
atomic<LargeStruct>: lock_free=0

LargeStruct is not lock-free because it is larger than what a single CAS instruction can handle on this architecture. The compiler falls back to an internal mutex. In this case, atomic<LargeStruct> provides atomicity but not the performance benefits of true lock-free operation.
For performance-critical code, ensure your atomic types are always lock-free by keeping them small (pointer-sized or smaller), or use a pointer to an immutable larger struct (as shown in the config pattern above).
Atomics vs. Mutexes: When to Use Which
Choosing between atomics and mutexes depends on the nature of the shared state and the operations performed on it.
| Scenario | Best Choice | Reason |
|---|---|---|
| Simple counter (increments only) | atomic<int> | Single instruction, no blocking |
| Boolean flag (set once, read many) | atomic<bool> | No blocking, simple semantics |
| Pointer swap (publish new version) | atomic<T*> | Lock-free pointer exchange |
| Multiple variables updated together | mutex | Atomics can’t span multiple variables |
| Complex data structure update | mutex | CAS only works on one value at a time |
| Condition-based waiting | mutex + condition_variable | Atomics have no wait/notify |
| Reference counting | atomic<int> | Natural fit — one value, one operation |
| Statistics and metrics | atomic<int> with relaxed | Maximum performance, no sync needed |
| Read-heavy shared config | shared_mutex or atomic pointer | Depends on update complexity |
| Queue between threads | condition_variable + mutex | Need wait-on-empty semantics |
Key insight: Atomics can only make a single variable’s operations atomic. If you need to atomically update two or more related variables together — or if a thread needs to wait until some condition becomes true — you need a mutex (and possibly a condition variable).
std::atomic_flag: The Simplest Atomic
std::atomic_flag is the only atomic type guaranteed to be lock-free on all platforms. It provides the most primitive atomic: a boolean flag with only test_and_set() and clear() operations (C++20 adds test(), wait(), and notify_one()). It is the classic building block for spin locks.
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
using namespace std;
// Spin lock using atomic_flag
class SpinLock {
atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
void lock() {
// Spin until we acquire the flag (test_and_set returns false)
while (flag_.test_and_set(memory_order_acquire)) {
// Yield so other threads (including the lock holder) can run.
// A hotter spin loop would instead issue a CPU pause hint,
// e.g., _mm_pause() on x86, rather than yielding to the OS.
this_thread::yield();
}
}
void unlock() {
flag_.clear(memory_order_release);
}
};
SpinLock spinLock;
int protectedCounter = 0;
void incrementWithSpinLock(int n) {
for (int i = 0; i < n; i++) {
spinLock.lock();
protectedCounter++;
spinLock.unlock();
}
}
int main() {
const int THREADS = 4;
const int ITERS = 500000;
vector<thread> threads;
for (int i = 0; i < THREADS; i++)
threads.emplace_back(incrementWithSpinLock, ITERS);
for (auto& t : threads) t.join();
cout << "SpinLock counter: " << protectedCounter
<< " (expected " << THREADS * ITERS << ")" << endl;
return 0;
}

Output:
SpinLock counter: 2000000 (expected 2000000)

When to use a spin lock: Only when the critical section is extremely short (nanoseconds) and the lock is almost always uncontended. Spin locks waste CPU when threads must wait longer. For most production code, use std::mutex instead — it yields the CPU when blocked, letting other threads run.
Common Mistakes with Atomics
Mistake 1: Thinking atomics make compound operations atomic.
atomic<int> x{0}, y{0};
// NOT atomic together! Another thread can observe x=1, y=0
x.store(1);
y.store(1);
// For atomic compound operations: use a mutex or pack both into one atomic

Mistake 2: Using relaxed ordering when synchronization is needed.
atomic<bool> ready{false};
int data = 0;
// Producer:
data = 42;
ready.store(true, memory_order_relaxed); // WRONG: no happens-before with data
// Consumer:
while (!ready.load(memory_order_relaxed)); // WRONG: may see ready=true but data=0!
cout << data; // Undefined behavior on weak-memory architectures
// Fix: use release/acquire pair

Mistake 3: Using seq_cst blindly without understanding the cost. seq_cst is correct but may add full memory fences on ARM/PowerPC. For counters and flags that don't need cross-thread ordering guarantees, relaxed is correct and faster. Profile before assuming seq_cst is a bottleneck.
Mistake 4: ABA problem in CAS loops.
// Thread reads A at address X
// Another thread changes X: A -> B -> A
// First thread's CAS succeeds (sees A again) — but the state changed!
// Solution: pack a version counter alongside the value in a single atomic
// word (a tagged pointer or index+generation pair), or use schemes that
// handle ABA safely (hazard pointers, epoch-based reclamation)

Mistake 5: Lock-free != wait-free != fast. Lock-free means at least one thread makes progress at any time. Wait-free means every thread makes progress. Neither automatically means faster than a mutex — at high contention, CAS retry loops can be slower than a well-designed mutex. Measure before optimizing.
Conclusion
std::atomic provides the C++ interface to hardware atomic operations — the fundamental building blocks of lock-free concurrent programming. For simple patterns (counters, flags, pointer publishing), atomics are faster than mutexes because they map to single hardware instructions with no OS involvement, no blocking, and no scheduling overhead.
The key operations — load, store, exchange, fetch_add, and the powerful compare_exchange — cover virtually all atomic patterns. The compare-and-exchange operation, with its retry loop idiom, is the foundation of all lock-free data structures: lock-free stacks, queues, hash maps, and reference counting all use CAS at their core.
Memory ordering is the most subtle aspect of atomic programming. The acquire-release pairing establishes happens-before relationships across threads, ensuring that data written before a release store is visible after the corresponding acquire load. Relaxed ordering maximizes performance when only atomicity (not ordering) is needed. Sequential consistency provides the strongest guarantees and is the safe default when performance is not critical.
Knowing when to use atomics versus mutexes is essential: atomics for single-variable patterns, mutexes for multi-variable invariants or when waiting is needed. And always verify with is_lock_free() that your atomic type truly has lock-free hardware support — an atomic that falls back to an internal mutex gives you safety but not performance.
Lock-free programming is powerful but difficult to get right. For most concurrent code, mutexes and condition variables are the correct and maintainable choice. Use atomics where profiling shows mutex overhead is a genuine bottleneck, where the pattern naturally fits a single atomic variable, or where the lock-free guarantees (no blocking, no priority inversion) are a hard requirement.