C++ Atomic Ordering - An x86-TSO's Perspective

本文是笔者对 C++ memory model 中 atomic operations 和 memory order 的理解。

Introduction

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it’s running. There’s a standard way to control how different threads talk to the processor’s memory.

https://www.theregister.com/2011/06/11/herb_sutter_next_c_plus_plus

在 C++11 之前,C++ 是没有明确定义 memory model 的,编写正确、可靠、高效的多线程程序是非常困难的,需要考虑不同的编译器的支持,不同处理器的 memory model 和 cache coherence 等。可以说 C++11 标准的一个最重要的特性就是引入了 memory model,使得开发者不再需要关心编写的程序是使用什么编译器编译、运行在何种处理器上,在 C++ abstract machine 层面上为多线程编程提供了标准化的支持,确保多线程程序的正确性和可移植性。

本文主要包括以下内容:

std::memory_order

std::memory_order 的作用:

std::memory_order specifies how memory accesses, including regular, non-atomic memory accesses, are to be ordered around an atomic operation.

https://en.cppreference.com/w/cpp/atomic/memory_order


C++ 共有 6 种 memory order:

enum class memory_order : /* unspecified */ {
    relaxed, consume, acquire, release, acq_rel, seq_cst
};
inline constexpr auto memory_order_relaxed = memory_order::relaxed;
inline constexpr auto memory_order_consume = memory_order::consume;
inline constexpr auto memory_order_acquire = memory_order::acquire;
inline constexpr auto memory_order_release = memory_order::release;
inline constexpr auto memory_order_acq_rel = memory_order::acq_rel;
inline constexpr auto memory_order_seq_cst = memory_order::seq_cst;

这 6 种 memory order 之间的关系如下图所示:

relaxed-->release--------------->acq_rel-->seq_cst
         \-->consume-->acquire--/

x->y means y is stricter/stronger than x

注:当前 C++ 标准明确不鼓励使用 memory_order_consume。并且目前编译器都没有实现 memory_order_consume,只是简单地将 memory_order_consume 转换为 memory_order_acquire。参见:

  1. std::memory_order - cppreference.com “The specification of release-consume ordering is being revised, and the use of memory_order_consume is temporarily discouraged.”

  2. C++ Support in Clang “memory_order_consume is lowered to memory_order_acquire”

  3. LLVM Atomic Instructions and Concurrency Guide — LLVM 19.0.0git documentation “This(Acquire) corresponds to the C++/C memory_order_acquire. It should also be used for C++/C memory_order_consume.

modification order

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M.

[Note 3: There is a separate order for each atomic object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads can observe modifications to different objects in inconsistent orders. — end note]

https://eel.is/c++draft/intro.races#4

The following four requirements are guaranteed for all atomic operations:

  1. write-write coherence: If an operation A that modifies an atomic object M happens before an operation B that modifies M, then A is earlier than B in the modification order of M.

  2. read-read coherence: If a value computation A of an atomic object M happens before a value computation B of M, and A takes its value from a side effect X on M, then the value computed by B is either the value stored by X or the value stored by a side effect Y on M, where Y follows X in the modification order of M.

  3. read-write coherence: If a value computation A of an atomic object M happens before an operation B that modifies M, then A takes its value from a side effect X on M, where X precedes B in the modification order of M.

  4. write-read coherence: If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B takes its value from X or from a side effect Y that follows X in the modification order of M.

relaxed ordering

Atomic operations tagged memory_order_relaxed are not synchronization operations; they do not impose an order among concurrent memory accesses. They only guarantee atomicity and modification order consistency.

memory_order_relaxed 的 load/store 与普通的 load/store 的唯一区别就是增加了对 load/store 操作的 atomicity 和 modification order consistency 的保证。

结合 atomic operations 保证 read/write coherence 和 relaxed atomic operations 保证 modification order consistency 这两个性质,可以得出结论:同一线程内对同一个 object 的 atomic operations 之间不可以进行 reorder。即便两个 atomic operations 是对同一个 object 的 relaxed atomic loads,也不可以对这两个 relaxed atomic loads 进行 reorder。

但是,同一线程内对某个 object 的 relaxed atomic operation 是可以与其他 object 的 memory accesses 进行 reorder 的。

下面通过几个例子来理解 relaxed ordering:

  1. 考虑如下代码:

    +-----------------------------------------+------------------------------------------+
    | Thread 1                                | Thread 2                                 |
    +=========================================+==========================================+
    | r1 = y.load(memory_order_relaxed); // A | r2 = x.load(memory_order_relaxed); // C  |
    | x.store(r1, memory_order_relaxed); // B | y.store(42, memory_order_relaxed); // D  |
    +-----------------------------------------+------------------------------------------+
    | std::atomic<int> x{0}, y{0};                                                       |
    | r1 = r2 = 42 is allowed                                                            |
    +------------------------------------------------------------------------------------+
    

    因为 thread 2 中的 C 和 D 是对不同 objects 的 relaxed atomic loads,所以 thread2 中的 D 是允许被 reorder(经 compile-time reordering 或 run-time reordering) 至 C 之前的。当 thread 1 在执行 A 时 thread 2 的 D 对 y 的修改对 thread 1 可能已经是可见的,使得 thread 1 读取到的 y 的值 r1 = 42,然后 thread 1 执行 B 使得 x 的值被修改为 r1 的值 42,之后 thread 2 执行 C 时 thread 1 的 B 对 x 的修改对 thread2 同样可能已经是可见的,所以 thread2 执行 D 读取到的 x 的值就可能为 r2 = 42。因此 r1 = r2 = 42 是可能的。

  2. 考虑如下代码:

    +----------------------------------------+-----------------------------------------+
    | Thread 1                               | Thread 2                                |
    +========================================+=========================================+
    | x.store(1, memory_order_relaxed); // A | y = x.load(memory_order_relaxed); // C  |
    | x.store(2, memory_order_relaxed); // B | z = x.load(memory_order_relaxed); // D  |
    |                                        | assert (y <= z); // never fires         |
    +----------------------------------------+-----------------------------------------+
    | std::atomic<int> x{0};                                                           |
    +----------------------------------------------------------------------------------+
    

    当 y = 1, z = 0 或 y = 2, z = 0 或 y = 2, z = 1 时,assert(y <= z); 会失败。

release-acquire ordering

memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load.

memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store.

memory_order_acq_rel: A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before the load, nor after the store.

memory_order_acquirememory_order_release 是比 memory_order_relaxed 更“强”的 memory order,不仅有与 memory_order_relaxed 一样的 atomicity 和 modification order consistency 的保证,还有如下的保证:

注意:

下面通过几个例子来理解 release-acquire ordering:

  1. 考虑如下代码:

    +------------------------------------------+-----------------------------------------------+
    | Thread 1                                 | Thread 2                                      |
    +==========================================+===============================================+
    |                                          | std::string* p2;                              |
    | auto *p = new std::string("Hello"); // A | while(!(p2 = ptr.load(memory_order_acquire))) |
    | data = 42; // B                          |     ;                                         |
    | ptr.store(p, memory_order_release); // C | assert(*p2 == "Hello"); // never fires        |
    |                                          | assert(data == 42); // never fires            |
    +------------------------------------------+-----------------------------------------------+
    | int data{0};                                                                             |
    | std::atomic<std::string*> ptr{nullptr};                                                  |
    +------------------------------------------------------------------------------------------+
    

    初始时 std::atomic<std::string*> ptr 为 nullptr,如果 thread 2 读取到的 std::atomic<std::string*> ptr 的值不为 nullptr 说明 thread 2 读取到的 std::atomic<std::string*> ptr 的值是 thread 1 执行 C 对 ptr 修改的值,所以此时 thread 1 中的 A 和 B 写入内存的内容一定是对 thread 2 可见的,因此断言 assert(*p2 == "Hello")assert(data == 42) 一定不会失败。

  2. 考虑如下代码:

    +--------------------------------------------+-------------------------------------------+
    | Thread 1                                   | Thread 2                                  |
    +============================================+===========================================+
    | y = 0;                                     | while(a.load(memory_order_acquire) != 10) |
    | x = 0;                                     |     ;                                     |
    | a.store(10, memory_order_release);         | assert(x == 0); // never fires            |
    | y = 1;                                     | r1 = x;                                   |
    | while(b.load(memory_order_acquire) != 20)) | b.store(20, memory_order_release);        |
    |     ;                                      | while(c.load(emory_order_acquire) != 30)  |
    | x = 1;                                     |     ;                                     |
    | c.store(30, memory_order_release);         | assert(y == 1); // never fires            |
    |                                            | r2 = y;                                   |
    +--------------------------------------------+-------------------------------------------+
    | int x, y;                                                                              |
    | std::atomic<int> a, b, c;                                                              |
    +----------------------------------------------------------------------------------------+
    

    与上例类似,断言 assert(x == 0)assert(y == 1) 一定不会失败。

    如前文所述,如果某个 atomic store 使用的是 memory_order_release,那么位于该 release atomic store 之前的 memory accesses 不允许被 reorder 至该 release atomic store 之后,但是位于该 release atomic store 之后的 memory accesses 是允许被 reorder 至该 release atomic store 之前的。

    因此编译器可以将 thread 1 的 y = 1; 这一语句 reorder 至 a.store (10, std::memory_order_release); 语句之前,然后编译器就可以通过 Dead Store Elimination 优化来消除 y = 0; 这一语句,即优化为如下所示代码:

    x = 0;
    y = 1;
    a.store(10, memory_order_release);
    while(b.load(memory_order_acquire) != 20))
        ;
    x = 1;
    c.store(30, memory_order_release);
    

sequentially-consistent ordering

memory_order_seq_cst: A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.

There is a single total order S on all memory_order_seq_cst operations, including fences…

single total order means that all threads are guaranteed to see the same order of memory operations on atomic variables.

memory_order_seq_cst 是比 memory_order_acquirememory_order_release 更“强”的 memory order。使用 memory_order_seq_cst 的 atomic load 有着与 memory_order_acquire 一样的语义,使用 memory_order_seq_cst 的 atomic store 有着与 memory_order_release 一样的语义,并且所有使用 memory_order_seq_cst 的 operations 之间(不仅包括 atomic load, store 还包括 atomic fence)存在一 single total order

注:不要混淆 memory_order_seq_cst 保证的 single total ordermemory_order_relaxed 保证的 modification order consistency

下面通过几个例子来理解 sequentially-consistent ordering:

  1. 考虑如下代码(std::atomic<T>::load, std::atomic<T>::store 的默认 memory_order 参数就是 memory_order_seq_cst,本例中省略了 memory_order 参数):

    +----------------+----------------+-------------------+-------------------+
    | Thread 1       | Thread 2       | Thread 3          | Thread 4          |
    +================+================+===================+===================+
    |                |                | while (!x.load()) | while (!y.load()) |
    | x.store(true); | y.store(true); |     ;             |     ;             |
    |                |                | if (y.load())     | if (x.load())     |
    |                |                |     ++z;          |     ++z;          |
    +----------------+----------------+-------------------+-------------------+
    | std::atomic<bool> x{false}, y{false};                                   |
    | std::atomic<int> z{0};                                                  |
    | When the four threads finish their execution, z must not be 0.          |
    +-------------------------------------------------------------------------+
    

    初始时 x = y = false, z = 0,当这 4 个 threads 执行结束后,z 值一定不为 0。这是因为 memory_order_seq_cst 保证所有 memory_order_seq_cst 的 atomic operations 之间存在一 single total order。在本例中,只有 x 和 y 两个 atomic object,正好 4 个 threads 中对 x 和 y 的 atomic operations 使用的都是 memory_order_seq_cst,所以 x.store(true)y.store(true) 在 single total order 中只存在两种可能的顺序:

    1. X-Y 即在 single total order 中对 x 的修改先于对 y 的修改。这样,当 thread 4 读取到的 y 的值为 true 时,此时 thread 4 读取到的 x 的值也一定为 true,因此 ++z 一定会执行,所以 z 一定不为 0。

    2. Y-X 即在 single total order 中对 y 的修改先于对 x 的修改。这样,当 thread 3 读取到的 x 的值为 true 时,此时 thread 3 读取到的 y 的值也一定为 true,因此 ++z 一定会执行,所以 z 一定不为 0。

    问题:如果将本例中对 x 和 y 的 load/store 使用的 memory order 修改为 memory_order_acquire/memory_order_release,那么当 4 个 threads 执行结束后,z 的值是否还能保证一定不为 0 ?

    答案:不能保证,即 z 的值可能为 0。因为 release-acquire ordering 不能保证 single total order。对于 thread 3 可能是先观测到 x 先被修改为 true 然后再观测到 y 被修改为 true,当 thread 3 执行 y.load() 时还没观测到 y 被修改为 true,所以 thread 3 的 ++z 不会被执行;对于 thread 4 可能是先观测到 y 先被修改为 true 然后再观测到 x 被修改为 true,当 thread 4 执行 x.load() 时还没观测到 x 被修改为 true,所以 thread 4 的 ++z 不会被执行。因此当 4 个 threads 执行结束后,z 的值可能为 0。

  2. 考虑如下代码:

    +--------------------------------------+--------------------------------------+
    | Thread 1                             | Thread 2                             |
    +======================================+======================================+
    | x.store(true, memory_order_release); | y.store(true, memory_order_release); |
    | if (!y.load(memory_order_seq_cst))   | if (!x.load(memory_order_seq_cst))   |
    |     ++z;                             |     ++z;                             |
    +--------------------------------------+--------------------------------------+
    | int z{0};                                                                   |
    | std::atomic<bool> x{false}, y{false};                                       |
    | When the 2 threads finish their execution, z = 2 is allowed                 |
    +-----------------------------------------------------------------------------+
    

    因为位于 memory_order_acquire atomic load 之前的 memory accesses 允许被 reorder 至该 acquire atomic load 之后,位于 memory_order_release atomic store 之后的 memory accesses 允许被 reorder 至该 release atomic store 之前,又因为使用 memory_order_seq_cst 的 atomic load 有着与 memory_order_acquire 一样的语义,所以 thread 1 的 y.load(std::memory_order_seq_cst) 是可以被 reorder 至 x.store(true, std::memory_order_release) 之前的。同理 thread 2 的 x.load(std::memory_order_seq_cst)是可以被 reorder 至 y.store(true, memory_order_release) 之前的。

    初始时 x 和 y 的值都为 false,所以可能出现的一种执行情况是,thread 1 读取到的 y 的值为 false,thread 2 读取到的 x 的值为 false,因此 thread 1 和 thread 2 会都会执行 z++,使得 thread 1 和 thread 2 执行结束后 z 的值为 2。

    问题:如果将本例中 x.store(true, std::memory_order_release) 修改为 x.store(true, std::memory_order_seq_cst)y.store(true, std::memory_order_release) 修改为 y.store(true, std::memory_order_seq_cst),那么当 thread 1 和 thread 2 执行结束后,z 的值是否还可能为 2 ?

    答案:z 的值不可能为 2。注意:memory_order_seq_cst atomic operations 之间是不能 reorder 的。

mapping C++ atomic operations to x86

本节首先给出 memory_order_relaxed, memory_order_acquire, memory_order_release, memory_order_seq_cst 的 atomic load/store 在 x86 上分别是使用什么汇编指令实现的,然后根据 “Intel® 64 and IA-32 Architectures Software Developer’s Manual” 的相关内容来理解为什么这些汇编指令可以用于实现对应的 memory_order 的 atomic load/store。

本节使用的例子,可以通过 https://godbolt.org/z/zaa7TzjbT 在线查看。

relaxed ordering

int ld_relaxed(std::atomic<int> &i){
    return i.load(std::memory_order_relaxed);
}

void st_relaxed(std::atomic<int> &i) {
    i.store(1, std::memory_order_relaxed);
}
ld_relaxed(std::atomic<int>&):
  movl (%rdi), %eax
  ret

st_relaxed(std::atomic<int>&):
  movl $1, (%rdi)
  ret

GCC 和 Clang 为 memory_order_relaxed 的 atomic load/store 生成的汇编指令是普通的 MOV 指令。

对于 x86,为什么普通的 MOV 指令就能保证 atomicitymodification order consistency 呢?

release-acquire ordering

int ld_acquire(std::atomic<int> &i){
    return i.load(std::memory_order_acquire);
}

void st_release(std::atomic<int> &i) {
    i.store(1, std::memory_order_release);
}
ld_acquire(std::atomic<int>&):
  movl (%rdi), %eax
  ret

st_release(std::atomic<int>&):
  movl $1, (%rdi)
  ret

GCC 和 Clang 为 memory_order_acquire atomic load 和 memory_order_release atomic store 生成的汇编指令是普通的 MOV 指令。

对于 x86,为什么普通的 MOV 指令就能保证 release-acquire 语义呢?

回顾下 memory_order_acquire, memory_order_release 的语义:

在 “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3” 的 “9.2.2 Memory Ordering in P6 and More Recent Processor Families” 有如下内容:

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles:

In a multiple-processor system, the following ordering principles apply:

首先讨论 reordering:

然后讨论 synchronizes-with 关系:

因此,x86 普通的 MOV 指令就能保证 release-acquire 语义。

sequentially-consistent ordering

int ld_seq_cst(std::atomic<int> &i){
    return i.load(std::memory_order_seq_cst);
}

void st_seq_cst(std::atomic<int> &i) {
    i.store(1, std::memory_order_seq_cst);
}
ld_seq_cst(std::atomic<int>&):
  movl (%rdi), %eax
  ret

st_seq_cst(std::atomic<int>&):
  movl $1, %eax
  xchgl (%rdi), %eax
  ret

GCC 和 Clang 为 memory_order_seq_cst atomic load 生成的汇编指令是普通的 MOV 指令,为 memory_order_seq_cst atomic store 生成的汇编指令则是 XCHG 指令。

对于 x86,为什么使用 MOV 指令实现 memory_order_seq_cst atomic load 配合 XCHG 指令实现 memory_order_seq_cst atomic store 就能保证 sequentially-consistent 语义呢?

回顾下 memory_order_seq_cst 的语义:

在上一节已经说明了 MOV 指令就能保证 release-acquire 语义。所以问题的关键就是如何保证 single total order

在 “Intel® 64 and IA-32 Architectures Software Developer’s Manual” 中有如下相关内容:

根据上述内容可知:XCHG 指令是隐式的带 LOCK 前缀的指令,有着 LOCK 语义,所有带 LOCK 前缀的指令存在 single execution order。

还是通过一个例子来理解:

+-------------------------------------------+-------------------------------------------+
| Thread 1                                  | Thread 2                                  |
+===========================================+===========================================+
| x.store(true, std::memory_order_seq_cst); | y.store(true, std::memory_order_seq_cst); |
| r1 = x.load(std::memory_order_seq_cst);   | r3 = y.load(std::memory_order_seq_cst);   |
| r2 = y.load(std::memory_order_seq_cst);   | r4 = x.load(std::memory_order_seq_cst);   |
+-------------------------------------------+-------------------------------------------+
| std::atomic<bool> x{false}, y{false};                                                 |
| r1 = true, r2 = false, r3 = true and r4 = false is not allowed                        |
+---------------------------------------------------------------------------------------+

如果 r1 = true, r2 = false, r3 = true, r4 = false,说明:

但是实际上 r1 = true, r2 = false, r3 = true, r4 = false 是不可能发生的,memory_order_seq_cst atomic store 生成的汇编指令是 XCHG 指令,XCHG 指令是隐式的带 LOCK 前缀的指令,“all processors agree on a single execution order of all locked instructions”,所以 thread 1 和 thread 2 观测到的 x.store(true, std::memory_order_seq_cst);y.store(true, std::memory_order_seq_cst); 的执行顺序是一样的,即 thread 1 和 thread 2 观测到的 x 和 y 的修改的顺序是一样的。

基于这个例子应该就对“为什么使用 MOV 指令实现 memory_order_seq_cst atomic load 配合 XCHG 指令实现 memory_order_seq_cst atomic store 就能保证 sequentially-consistent 语义”能有一定的理解了。但是可能还是有一些混沌,继续阅读下一节 “x86-TSO programmer’s model” 可以消除这些疑惑和不解。

x86-TSO(Total Store Ordering) programmer’s model

x86-TSO 由 Peter Sewell 在论文 x86-TSO: a rigorous and usable programmer’s model for x86 multiprocessors 中提出。有了 x86-TSO,我们就不再需要对 Intel® 64 and IA-32 Architectures Software Developer’s Manual 做阅读理解,基于 x86-TSO 就可以很容易地 reason about 多线程程序的行为。

需要强调的是:x86-TSO 是一个 programmer’s model,x86-TSO 并不是真实的处理器内部实现。

x86-TSO programmer’s model 如上图所示:

  1. 有若干 hardware threads,每一个 hardware thread 对应 single in-order stream of instruction execution。

  2. Storage subsystem 是图中通过虚线圈起来的部分:

    1. Shared memory。不解释。

    2. Lock 是一个全局锁。当某个 hardware thread 需要对 shared memory 进行独占访问时,需要持有该 lock。

    3. Store buffer。每一个 hardware thread 都对应一个 FIFO 的 store buffer。

  3. Hardware thread 与 storage subsystem 之间的交互用实线表示:

    1. 如果 storage subsystem 的 lock 被某个 hardware thread 持有,那么除该 hardware thread 以外的其他 hardware threads 就是 blocked 状态。

    2. $R_p[a]=v$:如果 hardware thread $p$ 没有被 blocked,那么 hardware thread $p$ 可以读内存地址为 $a$ 处的值,读取到的值记作 $v$。如果 hardware thread $p$ 的 store buffer 中存在对内存地址 $a$ 的写,那么读到的值 $v$ 就是 store buffer 中最新的向内存地址 $a$ 写入的值(称为 store forwarding);如果 hardware thread $p$ 的 store buffer 中不存在对内存地址 $a$ 的写,那么读到的值 $v$ 就是 shared memory 中内存地址 $a$ 保存的值。

    3. $W_p[a]=v$:在任意时刻,不论 hardware thread $p$ 是否被 blocked,hardware thread $p$ 都可以执行向内存地址 $a$ 写入值 $v$ 的写操作,写操作总是先保存在该 hardware thread 的 store buffer 中。

    4. $\tau_p$:如果 hardware thread $p$ 没有被 blocked,它可以默默地将保存在 store buffer 中的写操作生效至 shared memory 中,并且无需与任何其他的 hardware thread 协调。

    5. $F_p$:Hardware thread $p$ 执行 MFENCE 指令会 flush 该 hardware thread $p$ 的 store buffer,即执行多次 $\tau_p$ 操作将保存在 store buffer 中的写操作生效至 shared memory,直至 store buffer 为空。

    6. $L_p$:Hardware thread $p$ 执行带 LOCK 前缀的指令,首先会申请 storage subsystem 的 lock,然后执行指令,在指令执行结束后会 flush store buffer,最后释放 storage subsystem 的 lock。 注意:当某个 hardware thread 持有 storage subsystem 的 lock 时,其他 hardware threads 是 blocked,无法执行读操作。

litmus tests

在 “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3” 的 “9.2 MEMORY ORDERING” 这一节和 “AMD64 Architecture Programmer’s Manual, Volume 2” 的 “7.2 Multiprocessor Memory Access Ordering” 这一节中,给出了一些示例来帮助理解 memory ordering,这些示例称为 litmus tests。

litmus tests 中使用的符号约定如下:

本节接下来从 x86-TSO 的角度解读这些 litmus tests:

conclusion

一句话总结 x86-TSO 就是 sequentially-consistent ordering + a store buffer with store forwarding

所以在 x86 上实现 sequentially-consistent ordering 只需要避免 store forwarding,因此有以下几种实现:

  1. memory_order_seq_cst atomic load 用 MOV 实现,memory_order_seq_cst atomic store 用 XCHG 实现

  2. memory_order_seq_cst atomic load 用 MOV 实现,memory_order_seq_cst atomic store 用 MOV + MFENCE 实现

  3. memory_order_seq_cst atomic load 用 LOCK XADD(0) 实现,memory_order_seq_cst atomic store 用 MOV 实现

  4. memory_order_seq_cst atomic load 用 MFENCE + MOV 实现,memory_order_seq_cst atomic store 用 MOV 实现

因为通常情况下程序中的 load 比 store 多,所以编译器选择的是前两种实现方案。目前 GCC 13.2 和 Clang 18.1.0 选择的都是方案 1,一些老版本的 GCC 如 8.3 则选择的是方案 2。

感兴趣可以阅读 llvm/lib/Target/X86/X86ISelLowering.cpp 的代码,LowerATOMIC_STORE() 函数实现了将 atomic store 转换为对应的 x86 汇编指令。

memory barrier

前面提到,reordering 包括 compile-time reordering 和 run-time reordering,memory barrier 的作用就是限制编译器和处理器进行相应的 reordering,所以 memory barrier 可以分为 compile-time barrier 和 run-time barrier。

注意,run-time barrier 也会起到 compile-time barrier 的作用, 限制编译器进行 reordering。

在 C++11 引入 memory model 之前,常见的 compile-time barrier 是 asm volatile ("" ::: "memory"),常见的 run-time barrier 则是 GCC 提供的 built-in function __sync_synchronize()

在 C++11 中引入了 std::atomic_signal_fencestd::atomic_thread_fenceatomic_signal_fence 是 compile-time barrier,atomic_thread_fence 是 run-time barrier。

implement your own atomic_load/store on x86-64

#if defined(__x86_64__)

typedef unsigned char u8;
typedef unsigned short u16;
typedef unsigned int u32;
typedef unsigned long long u64;
typedef signed char s8;
typedef signed short s16;
typedef signed int s32;
typedef signed long long s64;
#if __LP64__
typedef unsigned long uptr;
typedef signed long sptr;
#else
typedef unsigned int uptr;
typedef signed int sptr;
#endif

template<typename T>
struct atomic {
  typedef T Type;
  volatile Type __attribute__((aligned(sizeof(Type)))) val_dont_use;
};

typedef atomic<s8> atomic_int8_t;
typedef atomic<u8> atomic_uint8_t;
typedef atomic<s16> atomic_int16_t;
typedef atomic<u16> atomic_uint16_t;
typedef atomic<s32> atomic_int32_t;
typedef atomic<u32> atomic_uint32_t;
typedef atomic<s64> atomic_int64_t;
typedef atomic<u64> atomic_uint64_t;

template<typename T>
inline typename T::Type atomic_load(
    const volatile T *a, std::memory_order mo) {
  assert(mo == std::memory_order_relaxed ||
         mo == std::memory_order_consume ||
         mo == std::memory_order_acquire ||
         mo == std::memory_order_seq_cst);
  assert(!((uptr)a % sizeof(*a)));

  typename T::Type v;
  if (mo == std::memory_order_relaxed) {
    // On x86 naturally-aligned loads are atomic.
    v = a->val_dont_use;
  } else if (mo == std::memory_order_consume ||
             mo == std::memory_order_acquire) {
    // On x86 loads are implicitly acquire.
    __asm__ __volatile__("" ::: "memory");
    v = a->val_dont_use;
    __asm__ __volatile__("" ::: "memory");
  } else {  // seq_cst
    // On x86 plain MOV is enough for seq_cst load.
    __asm__ __volatile__("" ::: "memory");
    v = a->val_dont_use;
    __asm__ __volatile__("" ::: "memory");
  }
  return v;
}

template<typename T>
inline void atomic_store(volatile T *a, typename T::Type v, std::memory_order mo) {
  assert(mo == std::memory_order_relaxed ||
         mo == std::memory_order_release ||
         mo == std::memory_order_seq_cst);
  assert(!((uptr)a % sizeof(*a)));

  if (mo == std::memory_order_relaxed) {
    // On x86 naturally-aligned stores are atomic.
    a->val_dont_use = v;
  } else if (mo == std::memory_order_release) {
    // On x86 stores are implicitly release.
    __asm__ __volatile__("" ::: "memory");
    a->val_dont_use = v;
    __asm__ __volatile__("" ::: "memory");
  } else {  // seq_cst
    // On x86 stores are implicitly release.
    // call __sync_synchronize() to flush store buffer.
    __asm__ __volatile__("" ::: "memory");
    a->val_dont_use = v;
    __sync_synchronize();
  }
}

#endif

quiz time

quiz-1

-- Initially --
std::atomic<int>x{0};
std::atomic<int>y{0};
-- Thread 1 --
x.store(1, std::memory_order_release);
-- Thread 2 --
y.store(2, std::memory_order_release);
-- Thread 3 --
int r1 = x.load(std::memory_order_acquire);   // x first
int r2 = y.load(std::memory_order_acquire);
-- Thread 4 --
int r3 = y.load(std::memory_order_acquire);   // y first
int r4 = x.load(std::memory_order_acquire);

分别从 C++ 标准的角度和 x86 的角度讨论是否可能出现 r1==1, r2==0 且 r3==2, r4==0 的情况。

quiz-2

-- Initially --
std::atomic<bool> x,y;
std::atomic<int> z;
x=false;
y=false;
z=0;
-- Thread 1 --
void write_x_then_y(){
  x.store(true,std::memory_order_relaxed);  // 1
  y.store(true,std::memory_order_relaxed);  // 2
}
-- Thread 2 --
void read_y_then_x(){
  while(!y.load(std::memory_order_relaxed));  // 3
  if(x.load(std::memory_order_relaxed))  // 4
    ++z;
}

分别从 C++ 标准的角度和 x86 的角度讨论 z 是否一定不为 0。

quiz-3

atomic<int> x;
atomic<int> y;

y.store(1, memory_order_relaxed);            //(1)
atomic_thread_fence(memory_order_seq_cst);   //(2)
x.load(memory_order_relaxed);                //(3)

上述代码是否可能被 reorder 为如下情况:

x.load(memory_order_relaxed);                //(3)
y.store(1, memory_order_relaxed);            //(1)
atomic_thread_fence(memory_order_seq_cst);   //(2)

see also