### Shared Memory Consistency Models

Dimitris Nikolopoulos

```
Shared-memory abstraction

Programmers expect latest data written by P1

Initially all pointers = null, all integers = 0.

P1

While (there are more tasks) {

Task = GetFromFreeList();

Task — Data = ...;

insert Task in task queue
}

Head = head of task queue;

P2, F3, ..., Pn

while (there are more tasks) {

Begin Critical Section

if (Head != null) {

MyTask = Head;

Head = Head — Next;

End Critical Section

}

End Critical Section

... = MyTask — Data;
```

### Memory consistency

- Formal specification of how updates of memory locations appear to the programmer
- Read returns value of last write
  - Easy in uniprocessors: program order
  - Harder on multiprocessors: what is the last write?
- Sequential consistency
  - All memory operations atomic (although they are not in reality)
  - All operations from the same processor in program order

### Memory consistency model

- Enables programmers to reason about correctness in their programs
- Affects performance since it enables (or disables) several compiler and hardware optimizations
- Affects portability, since it defines what changes (or not) need to be performed to preserve correctness across platforms

### Sequential consistency

- · Close to programmer intuition
- Restricts many compiler and hardware optimizations
- · Relaxed consistency models
  - Allow reordering of certain sequences of memory operations
  - Improve performance compared to sequential consistency

### Uniprocessor consistency

- Memory operation ordering should preserve true data and control dependencies
- Enables optimizations such as:
  - Register allocation
  - Code motion
  - Loop transformations
  - Pipelining, multiple instruction issue
  - Write buffer bypassing, forwarding
  - Lockup-free (non-blocking) caches

## Memory seen as a global memory with a single switch enabling access to processors Figure 3: Programmer's view of sequential consistency.



### 

### Violations of SC (no caches) • Bypassing in interconnection network PI General Interconnect Read Data 13 Read Data 12 PI Data = 2000 while (Head == 0) {:} Head = 1 ... = Data

### 

### Violation of SC (with caches)

- Similar violations with systems without caches
- Read by a processor that hits in the cache
  - May not allow completion to prevent bypassing!
- Other issues
  - Cache coherence protocol required to propagate writes
  - Detecting when a write is complete requires more transactions in the presence of caching
  - Propagating changes to multiple copies of data is inherently non-atomic

### Cache consistency = SC?

- Cache consistency requires that a write is eventually seen by all processors
- Cache consistency requires that writes by all processors to the same memory location are seen in the same order by all processors
- Sequential consistency requires that writes to all memory locations are seen in the same order by all processors



# Writes to A may arrive out of order to P3 and P4, therefore violating sequential consistency Initially A = B = C = 0 P1 P2 P3 P4 A = 1 A = 2 while (B != 1) {;} while (B != 1) {;} register 1 = A Figure 6: Example for serialization of writes. Need to ensure that writes to the same memory location are strictly serialized (write atomicity)



### Illusion of write atomicity

- Prevent read to return a newly written value before all invalidations have been acknowledged
- Easy in invalidate-based protocols
- Harder in update-based protocols
  - Two-phase update
  - Send updates to all caching processors
  - Receive acknowledgment
  - Send acknowledgement to all caching processors

### Similarity between compilers and hardware

- Consistency prevents reordering of instructions
- Register allocation can be hazardous (equivalent to caching)
- Software pipelining can be hazardous (recall midterm!)
- Code motion can be hazardous (recall dynamic scheduling processors)

### 

### SC explained

- Program order
  - A processor must ensure that a memory operation is complete before proceeding to the next. Detecting completion for writes requires ack from memory and invalidations or updates in cache-based systems
- · Write atomicity
  - Writes to a single location are serialized, no read returns a new value of a write until all invalidations/ updates are finished and acknowledged

### Hardware optimizations for SC

- Prefetching ownership on a write
  - Prefetch exclusive requests for any writes pending in the write buffer
  - Applicable only in invalidation-based systems
- Speculative service of reads
  - Rollback if SC is violated due to a write
  - Good for superscalar processors that already have rollback machinery (e.g. to handle branch mispredictions)

### Relaxed consistency

- Characterization
  - How do we relax program order?
    - · From write to read
    - · From write to write
    - · From read to read or write
  - How do we relax write atomicity?
    - Do we allow reads to return another processor's write before all updates/invalidations are sent?
    - Processor allowed to see its own write before the write is made visible to other processors
- Mechanisms for overriding relaxations to programmers

### **RC** summary

| Relaxation      | $W \rightarrow R$ | $W \rightarrow W$ | $R \rightarrow RW$ | Read Others' | Read Own    | Safety net                      |
|-----------------|-------------------|-------------------|--------------------|--------------|-------------|---------------------------------|
|                 | Order             | Order             | Order              | Write Early  | Write Early |                                 |
| SC [16]         |                   |                   |                    |              | V           |                                 |
| IBM 370 [14]    | V                 |                   |                    |              |             | serialization instructions      |
| TSO [20]        | √                 |                   |                    |              | V           | RMW                             |
| PC [13, 12]     | ✓                 |                   |                    | V            | V           | RMW                             |
| PSO [20]        | √                 | V                 |                    |              | √           | RMW, STBAR                      |
| WO [5]          | √                 | V                 | V                  |              | ✓           | synchronization                 |
| RCsc [13, 12]   | <b>√</b>          | ✓                 | ✓                  |              | <b>√</b>    | release, acquire, nsync,<br>RMW |
| RCpe [13, 12]   | V                 | V                 | V                  | V            | V           | release, acquire, nsync,<br>RMW |
| Alpha [19]      | √                 | V                 | V                  |              | V           | MB, WMB                         |
| RMO [21]        | √                 |                   | ✓                  |              | ✓           | various MEMBAR's                |
| PowerPC [17, 4] | V                 | V                 | V                  | √            | V           | SYNC                            |

Figure 8: Simple categorization of relaxed models. A  $\surd$  indicates that the corresponding relaxation is allowed by straightforward implementations of the corresponding model. It also indicates that the relaxation can be detected by the programmer (by affecting the results of the program) except for the following cases. The 'Read Own Write Early' relaxation is not detectable with the SC, WO, Alpha, and PowerPC models. The "Read Others' Write Early" relaxation is possible and detectable with complex implementations of RCsc.

### RC in commercial systems

|                        | Relaxation               | Example Commercial Systems Providing the Relaxation                    |  |  |  |  |
|------------------------|--------------------------|------------------------------------------------------------------------|--|--|--|--|
|                        | $W \rightarrow R$ Order  | AlphaServer 8200/8400, Cray T3D, Sequent Balance, SparcCenter1000/2000 |  |  |  |  |
|                        | W → W Order              | AlphaServer 8200/8400, Cray T3D                                        |  |  |  |  |
|                        | R RW Order               | AlphaServer 8200/8400, Cray T3D                                        |  |  |  |  |
|                        | Read Others' Write Early | Cray T3D                                                               |  |  |  |  |
| Read Own Write Early A |                          | AlphaServer 8200/8400, Cray T3D, SparcCenter1000/2000                  |  |  |  |  |

Figure 9: Some commercial systems that relax sequential consistency.

### Relaxing Write-to-Read

- Relax ordering of write followed by read to different memory location (performance)
- IBM 370, SPARC V8



### Relaxing Write-to-Read No violation of SC while (Head == 0) {;} (b) overlapped writes

### Relaxing Write-to-Read No violation of SC P1 while (Head == 0) {;} Data = 2000 ... = Data Figure 5: Canonical optimizations that may violate sequential consistency.

### Relaxing W→R implementations

- IBM 370
  - Prohibits a read to return before a write to the same location is visible to all processors
- TSO
  - Allows a read to return before a write from the same processor, before the write from the processor becomes visible to other processors
- PC
  - Relaxes both conditions on read

### Example

$$\begin{aligned} & \text{Initially A = Flag 1 = Flag 2 = 0} & & \text{Initially A = B = 0} \\ & \text{Pl} & \text{P2} & \text{Pl} & \text{P2} & \text{P3} \\ & \text{Flag I = I} & \text{Flag 2 = I} & \text{A = I} \\ & \text{A = I} & \text{A = 2} & \text{register I = A} \\ & \text{register 2 = Flag 2} & \text{register 3 = A} & \text{B = I} \\ & \text{register 4 = Flag I} & \text{if } (A = 1) \\ & \text{register 2 = Flag 2} & \text{register 3 = 2}, \\ & \text{Result: register 1 = 1, register 3 = 2}, \\ & \text{register 2 = register 4 = 0} & \text{Result: B = 1, register 1 = 0} \\ \end{aligned}$$

Figure 10: Differences between 370, TSO, and PC. The result for the program in part (a) is possible with TSO and PC because both models allow the reads of the flags to occur before the writes of the flags on each processor. The result is not possible with IBM 370 because the read of A on each processor is not issued until the write of A on that processor is done. Consequently, the read of the flag on each processor is not issued until the write of the flag on that processor is done. Consequently, the read of the flag on each processor is not insected until the write of the flag on that processor is done. The program in part (b) is the same as in Figure 4(b). The result shown is possible with PC because it allows P2 to return the value of P1's write before the write is visible to P3. The result is not possible with IBM 370 or TSO.

### Safety nets

- IBM 370, none
- TSO
  - Safety net required from write to read to the same location on the same processor
  - Solved with an atomic RMW (read-modify-write) instruction (at extra cost, can be significant)

### Performance implications

- Write-to-read relaxation improves substantially the performance of hardware implementations
- Little benefit for compilers, typically because writes are closely followed by reads to the same memory location, therefore no huge opportunities for reorderings
- Compilers typically benefit from full independence in large windows of instructions

### Write-to-write, write-to-read

- Eliminate ordering between writes to *different* locations
- SPARV V8, PSO (partial store ordering)
  - Writes from the same processor can be pipelined, overlapped and reach destination caches out-oforder





### Write-to-write relaxation

- Safety net
  - Store barrier
  - Inserted in the write buffer
  - Delays writes following the STBAR in the write buffer



### Relaxing all orders

- Any read or write may be reordered with respect to any following read or write
- Memory operations following a read may be reordered with respect to the read (including writes)
- Hardware can hide latency of read operations (true non-blocking read implementation)
- All relaxed models allow processor to read its write early

### Weak Ordering

- Data operations distinguished from synchronization operations
- To enforce ordering between two operations programmer needs to specify at least one as a synchronization operation
- Intuition: reordering operations outside synchronization regions typically does not affect performance (think mutual exclusion)

### Weak Ordering

- · Hardware implementation using counters
  - Processor increments counter when issuing operation, decrements when operation completes
  - Synchronization operation not issued until counter reaches zero (all operations complete)
  - Processor does not issue operations until after synchronization operation completes
  - Synchronization instructions serve as "fences" or "barriers"

## shared special ordinary sync nsync acquire release Figure 11: Distinguishing operations for release consistency.

### **Release Consistency**

- Ordinary and Special operations: roughly correspond to data and synchronization operations in WO
- Special operations are either sync or nsync
  - Sync operations correspond to "real" synchronizations
  - "nsync" operations correspond to asynchronous operations that may be subject to races
  - Sync operations are further classified into acquire and release

### **Release Consistency**

- Ordinary and Special operations: roughly correspond to data and synchronization operations in WO
- Special operations are either sync or nsync
  - Sync operations correspond to "real" synchronizations
  - "nsync" operations correspond to asynchronous operations that may be subject to races
  - Sync operations are further classified into acquire and release
    - Acquire and release correspond to locks used to grant exclusive access to a set of shared memory locations

### Release consistency

- First implementation (RCsc): sequential consistency between special operations
- Second implementation (RCpc): processor consistency between special operations
- RCsc constraints:
  - Acquire to all, all to release, special to special
- RCpc constraints:
  - Acquire to all, all to release, special to special except from special write followed by special read

### RC: imposing program order

- From a write to a read, we use an RMW instruction
- Write in RMW needs to be a release if the write being ordered is ordinary, otherwise it can be any special write
- Other primitives: write barrier, memory barrier