# HY425 Lecture 02: Pipelining

#### Dimitrios S. Nikolopoulos

University of Crete and FORTH-ICS

October 13, 2011

# **Review from last lecture**

### Important technological implications

- Latency lags bandwidth
- Shrinking transistors do not necessarily improve performance
- Power wall, deteriorating reliability
- Measuring and summarizing performance
  - Wall-clock time
  - Geometric mean of execution time ratios
  - No single averaging metric is perfect
- Quantitative principles of design
  - Parallelism, locality, common case fast, Amdahl's law

# Update on assignments

#### Homework

Homework 1 is posted today and is due in one week.


### **Processor review**

#### Datapath

- Storage elements (registers, caches, memory)
- Functional (execution) units (ALU, adders)
- Operated by control signals

#### Control

State machine producing control signals

Datapath Control

# **Multi-cycle datapath**

#### Five-stage instruction execution sequence



Reg[IR<sub>rd</sub>] <= Reg[IR<sub>rs</sub>] op<sub>IRop</sub> Reg[IR<sub>rt</sub>]


### Instruction operation control




# **Data stationary control**




# Simplified visualization of pipelines



# Limits of pipelining

#### Hazards

- A hazard is a condition that prevents an instruction from executing during a pipeline stage
- Structural hazards occur when the hardware does not have enough resources in a pipeline stage to accommodate an instruction
  - Older instructions occupy resources in same stage
- Data hazards occur when an instruction needs input from a prior instruction and the input is not ready
- Control hazards occur when execution of an instruction depends on a branch and branch outcome is not known yet

# Example of structural hazard

#### Single memory port for instructions and data
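To make the single-port conflict concrete, here is a minimal Python sketch. It assumes an ideal 5-stage pipeline in which instruction `i` (0-based) enters IF in cycle `i + 1` with no stalls. The function name `memory_port_conflicts` and the opcode labels `"load"`, `"store"`, and `"alu"` are illustrative and not part of the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_port_conflicts(program):
    """Return (cycle, fetching_index, accessing_index) for every cycle in
    which an instruction fetch coincides with a load/store in MEM.

    Assumption: instruction i enters IF in cycle i + 1 and never stalls,
    so it reaches MEM in cycle i + 4.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op not in ("load", "store"):
            continue
        mem_cycle = i + 1 + STAGES.index("MEM")
        fetch_index = mem_cycle - 1  # instruction whose IF falls in mem_cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, fetch_index, i))
    return conflicts


# A load followed by four ALU instructions: the load's MEM stage (cycle 4)
# collides with the fetch of the fourth instruction (index 3).
print(memory_port_conflicts(["load", "alu", "alu", "alu", "alu"]))  # [(4, 3, 0)]
```

With a single memory port, each reported conflict costs one stall cycle unless instruction and data accesses are separated.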



# Resolving structural hazards

#### Bubbles



# **Bubbles**

#### Software vs. hardware bubbles

- Software (compiler) inserts bubbles by inserting nop instructions in the pipeline
- Hardware uses hazard detection unit in the control logic
  - Detection unit evaluates conditions for hazards
  - Stalls the pipeline briefly (one cycle) to resolve the hazard
  - Stalling the pipeline at any stage amounts to zeroing output control signals

# **Pipeline control**

#### Recap

- Control signals in EX stage (ALUOp, RegDst, ALUSrc)
- Control signals in MEM stage (Branch, MemRead, MemWrite)
- Control signals in WB stage (MemtoReg, RegWrite)
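The signals listed above can be grouped into one record, which makes it easy to see why a bubble is just "all control signals zero". The sketch below is illustrative only. The class name `ControlSignals`, the helper names, and the specific signal values chosen for a load are assumptions, based on the usual textbook MIPS encoding, and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlSignals:
    # EX stage
    ALUOp: int = 0
    RegDst: int = 0
    ALUSrc: int = 0
    # MEM stage
    Branch: int = 0
    MemRead: int = 0
    MemWrite: int = 0
    # WB stage
    MemtoReg: int = 0
    RegWrite: int = 0


def bubble():
    """A bubble carries all-zero control signals through the pipeline."""
    return ControlSignals()


def updates_state(signals):
    """True if the instruction would write a register or memory."""
    return bool(signals.RegWrite or signals.MemWrite)


# Assumed settings for a load word instruction (textbook MIPS values).
lw = ControlSignals(ALUOp=0, RegDst=0, ALUSrc=1, MemRead=1, MemtoReg=1, RegWrite=1)

print(updates_state(lw))        # True
print(updates_state(bubble()))  # False
```

Because a bubble never asserts `RegWrite` or `MemWrite`, it can flow through EX, MEM, and WB without changing architectural state.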

# Impact of pipeline stalls on performance

#### Speedup of pipelining

$$\begin{aligned}
Speedup_{pipelined} &= \frac{Avg \ instruction \ time_{unpipelined}}{Avg \ instruction \ time_{pipelined}} \\
&= \frac{CPI_{unpipelined} \times Clock \ cycle_{unpipelined}}{CPI_{pipelined} \times Clock \ cycle_{pipelined}} \\
&= \frac{CPI_{unpipelined}}{CPI_{pipelined}} \times \frac{Clock \ cycle_{unpipelined}}{Clock \ cycle_{pipelined}}
\end{aligned}$$

$$\begin{aligned}
CPI_{pipelined} &= Ideal \ CPI + Pipeline \ stall \ cycles \ per \ instruction \\
&= 1 + Pipeline \ stall \ cycles \ per \ instruction
\end{aligned}$$

$$Speedup_{pipelined} = \frac{CPI_{unpipelined}}{1 + Pipeline \ stall \ cycles \ per \ instruction} \times \frac{Clock \ cycle_{unpipelined}}{Clock \ cycle_{pipelined}}$$

### Impact of pipeline stalls on performance

#### Speedup of pipelining

Taking $CPI_{unpipelined} = 1$, so that pipelining only shortens the clock cycle:

$$Speedup_{pipelined} = \frac{1}{1 + Pipeline \ stall \ cycles \ per \ instruction} \times \frac{Clock \ cycle_{unpipelined}}{Clock \ cycle_{pipelined}}$$

For an ideal, balanced pipeline:

$$Clock \ cycle_{pipelined} = \frac{Clock \ cycle_{unpipelined}}{Pipeline \ depth}$$

$$Speedup_{pipelined} = \frac{1}{1 + Pipeline \ stall \ cycles \ per \ instruction} \times Pipeline \ depth$$
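As a quick numeric check of the balanced-pipeline formula, the following Python sketch evaluates it for a few stall rates. The function name `pipeline_speedup` and the sample stall values are illustrative assumptions.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup of an ideal balanced pipeline of the given depth, degraded
    by the average number of stall cycles per instruction."""
    return depth / (1.0 + stall_cycles_per_instruction)


print(pipeline_speedup(5, 0.0))   # 5.0  (no stalls: speedup equals depth)
print(pipeline_speedup(5, 0.25))  # 4.0
print(pipeline_speedup(5, 1.0))   # 2.5  (one stall per instruction halves it)
```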

## **Data Hazards**

#### Read-after-write data hazard through registers



# Read-after-write (RAW) hazard

#### **Details**

Inst I: add r1, r2, r3

Inst J: sub r4, r1, r3

- Instruction I precedes instruction J in program order
- Instruction I produces result used by instruction J
- Instruction J is data-dependent on instruction I
- Result is not actually committed in register r1 in simple 5-stage pipeline until instruction I finishes the WB stage
- Value of result actually produced earlier, i.e. during the EX stage of instruction I

# Write-after-read (WAR) hazard

#### **Details**

Inst I: sub r4, r1, r3

Inst J: add r1, r2, r3

- Instruction I precedes instruction J in program order
- Instruction I reads the register written by instruction J
- If instruction I reads r1 in cycle C, instruction J writes r1 in cycle C+4
- No hazard in simple 5-stage pipeline
- Hazard may occur if we attempt to reorder the two instructions. Will see examples later in the course ...
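The C versus C+4 timing can be checked with a small Python sketch. It assumes one instruction issues per cycle starting at cycle 1, with no stalls, and that registers are read in ID and written in WB. The helper name `stage_cycle` is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycle(index, stage):
    """Cycle in which instruction `index` (0-based, program order) occupies
    `stage`, assuming one issue per cycle from cycle 1 and no stalls."""
    return index + 1 + STAGES.index(stage)


read_cycle = stage_cycle(0, "ID")   # instruction I reads r1
write_cycle = stage_cycle(1, "WB")  # instruction J writes r1

print(read_cycle, write_cycle, write_cycle - read_cycle)  # 2 6 4
```

Since the write always trails the read by four cycles in this in-order pipeline, the read can never see the later value.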

# Write-after-write (WAW) hazard

#### **Details**

Inst I: add r1, r2, r3

Inst J: add r1, r1, r4

- Instruction I precedes instruction J in program order
- Instruction I writes in the same register as instruction J
- If instruction I writes r1 in cycle C, instruction J writes r1 in cycle C+1
- No actual hazard in simple 5-stage pipeline
- Hazard may occur if we attempt to reorder the two instructions.
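The three dependence classes can be detected by comparing register names. The Python sketch below is a minimal illustration. The `Instr` tuple and the function `dependences` are hypothetical names, and the check only reports potential hazards, not whether a given pipeline actually stalls.

```python
from collections import namedtuple

# dest is the register written; srcs are the registers read.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def dependences(i, j):
    """Dependence classes between instruction i and a later instruction j."""
    found = []
    if i.dest in j.srcs:
        found.append("RAW")  # j reads what i writes
    if j.dest in i.srcs:
        found.append("WAR")  # j writes what i reads
    if i.dest == j.dest:
        found.append("WAW")  # both write the same register
    return found


add1 = Instr("add", "r1", ("r2", "r3"))
sub = Instr("sub", "r4", ("r1", "r3"))
add2 = Instr("add", "r1", ("r1", "r4"))

print(dependences(add1, sub))   # ['RAW']
print(dependences(sub, add1))   # ['WAR']
print(dependences(add1, add2))  # ['RAW', 'WAW']
```

Note that the WAW example pair also carries a RAW dependence on r1, because instruction J reads r1 as well as writing it.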

# Forwarding

### Exploit early production of results in pipeline



# **Forwarding control**

#### Additional multiplexers select ALU input



### RAW hazards through memory

#### Store following load to same memory location



# Forwarding cannot resolve all hazards

#### Load-use hazard



# Resolving load-use hazard

#### Bubble to provide chance for forwarding



# **Resolving load-use hazard**

#### **Control logic**

    // Dependent instruction must stall for one cycle before entering the EX stage
    if (ID/EX.MemRead and                           // a load instruction was issued a cycle ago
        ((ID/EX.RegisterRd = IF/ID.RegisterRs) or   // load destination is source A
         (ID/EX.RegisterRd = IF/ID.RegisterRt)))    // or load destination is source B
      stall pipeline   // zero out all control signals, so the bubble does nothing in EX, MEM, WB
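The same condition can be written as a runnable predicate. This Python sketch mirrors the control logic above. The function name `must_stall` and its parameter names are illustrative, and register numbers are passed as plain integers.

```python
def must_stall(id_ex_mem_read, id_ex_rd, if_id_rs, if_id_rt):
    """True if the instruction in ID reads the register that a load
    currently in EX is about to produce (load-use hazard)."""
    return bool(id_ex_mem_read) and id_ex_rd in (if_id_rs, if_id_rt)


# lw r1, 0(r2) followed by sub r4, r1, r5: the sub must wait one cycle.
print(must_stall(True, 1, 1, 5))   # True
# Same registers, but the instruction in EX is not a load: forwarding suffices.
print(must_stall(False, 1, 1, 5))  # False
# A load whose destination is not read by the next instruction.
print(must_stall(True, 1, 2, 3))   # False
```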

# Pipeline forwarding data logic summary

#### Forwarding from ALU (EX/MEM) and memory (MEM/WB)

| Pipeline register containing source instruction | Opcode of source instruction | Pipeline register containing destination instruction | Opcode of destination instruction | Destination of the forwarded result | Comparison (if equal then forward) |
|---|---|---|---|---|---|
| EX/MEM | Register-register ALU | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR<sub>16..20</sub> = ID/EX.IR<sub>6..10</sub> |
| EX/MEM | Register-register ALU | ID/EX | Register-register ALU | Bottom ALU input | EX/MEM.IR<sub>16..20</sub> = ID/EX.IR<sub>11..15</sub> |
| MEM/WB | Register-register ALU | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR<sub>16..20</sub> = ID/EX.IR<sub>6..10</sub> |
| MEM/WB | Register-register ALU | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR<sub>16..20</sub> = ID/EX.IR<sub>11..15</sub> |
| EX/MEM | ALU immediate | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR<sub>11..15</sub> = ID/EX.IR<sub>6..10</sub> |
| EX/MEM | ALU immediate | ID/EX | Register-register ALU | Bottom ALU input | EX/MEM.IR<sub>11..15</sub> = ID/EX.IR<sub>11..15</sub> |

# Pipeline forwarding data logic summary (cont.)

#### Forwarding from ALU (EX/MEM) and memory (MEM/WB)

| Pipeline register containing source instruction | Opcode of source instruction | Pipeline register containing destination instruction | Opcode of destination instruction | Destination of the forwarded result | Comparison (if equal then forward) |
|---|---|---|---|---|---|
| MEM/WB | ALU immediate | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR<sub>11..15</sub> = ID/EX.IR<sub>6..10</sub> |
| MEM/WB | ALU immediate | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR<sub>11..15</sub> = ID/EX.IR<sub>11..15</sub> |
| MEM/WB | Load | ID/EX | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR<sub>11..15</sub> = ID/EX.IR<sub>6..10</sub> |
| MEM/WB | Load | ID/EX | Register-register ALU | Bottom ALU input | MEM/WB.IR<sub>11..15</sub> = ID/EX.IR<sub>11..15</sub> |
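The forwarding rules in the tables can be condensed into one selection function per ALU input. The Python sketch below is a simplified illustration: the `Latch` tuple, the instruction kinds `"alu"`, `"alu_imm"`, `"load"`, and `"other"`, and the function `forward_source` are all hypothetical names. It assumes the destination register has already been extracted from the appropriate instruction field, and it gives priority to EX/MEM because that register holds the most recent result.

```python
from collections import namedtuple

# kind: "alu", "alu_imm", "load" or "other"; dest: destination register number
Latch = namedtuple("Latch", ["kind", "dest"])


def forward_source(src_reg, ex_mem, mem_wb):
    """Choose where an ALU input comes from: 'EX/MEM', 'MEM/WB' or 'REGFILE'."""
    # ALU results are available in EX/MEM; load data is not there yet.
    if ex_mem.kind in ("alu", "alu_imm") and ex_mem.dest == src_reg:
        return "EX/MEM"
    # ALU results and loaded data are both available in MEM/WB.
    if mem_wb.kind in ("alu", "alu_imm", "load") and mem_wb.dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


print(forward_source(1, Latch("alu", 1), Latch("alu", 1)))         # EX/MEM
print(forward_source(1, Latch("other", None), Latch("load", 1)))   # MEM/WB
print(forward_source(2, Latch("alu", 1), Latch("alu_imm", 3)))     # REGFILE
```

The sketch does not cover a load that is still in EX/MEM. That case is the load-use hazard, which needs a stall before forwarding from MEM/WB can help.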

# **Resolving branches in pipeline**

#### **Control-dependent instructions**



# Understanding control hazards

### **MIPS datapath**



#### **Branch execution**

- Comparison with zero and target address calculation at EX stage
- 2 stall cycles

# Understanding control hazards

### **MIPS datapath**



#### **Branch execution**

- Branch taken decision plus potential branch target out of EX stage
- Next PC forwarded from MEM stage through multiplexer
- 1 more stall cycle for a total of 3

# **Reducing branch stall impact**

#### Impact of branches on performance

- Branch frequency (conditional, unconditional)
  - ca. 20% for integer programs
  - ca. 10% for floating point programs

$$\begin{aligned}
\textit{Stall CPI from branches} &= \textit{branch frequency} \times \textit{branch penalty} \\
\textit{Speedup}_{\textit{pipeline}} &= \frac{\textit{Pipeline depth}}{1 + \textit{branch frequency} \times \textit{branch penalty}}
\end{aligned}$$

With a 3-cycle branch penalty in a 5-stage pipeline, max speedup drops from 5.0 to 3.1 (int) or 3.8 (fp).
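These speedup figures can be reproduced with a short Python sketch, assuming a 5-stage pipeline, a 3-cycle branch penalty, and the branch frequencies listed above. The function name `speedup_with_branches` is illustrative.

```python
def speedup_with_branches(depth, branch_frequency, branch_penalty):
    """Pipeline speedup when branches are the only source of stalls."""
    return depth / (1.0 + branch_frequency * branch_penalty)


for label, frequency in (("int", 0.20), ("fp", 0.10)):
    print(label, round(speedup_with_branches(5, frequency, 3), 1))
# int 3.1
# fp 3.8
```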

# Reducing branch stall impact

### HW solution



#### **Explanation**

- Comparison with zero happens at EX stage
- Move comparison to ID stage
- May increase cycle time!

# **Reducing branch stall impact**

### HW/SW solution

- Delayed branches always execute the instruction in the slot following the branch (PC+4)
- Instruction two slots down (PC+8) affected by the branch
- Software (compiler) tasked with filling delay slots
- Choices are from before the branch, from the target (branch taken), or from the fall through path (branch not taken)

# Options for filling delay slot

#### Three paths to look for instructions



# **Options for filling delay slot**

#### A: reduces instructions and improves performance



## Options for filling delay slot

#### B: may require copying the instruction if the branch is taken



# Options for filling delay slot

#### C: conditionally dependent instruction should not execute



# **Exception handling in pipelines**

### **Exception difficulties in pipelining**

- Exceptions may occur in different stages (e.g. overflows at EX, page faults at MEM, I/O device requests anywhere)
- Some exceptions are restartable

#### Instruction flush and restart

- Flush instructions following instruction causing the exception
- Start execution of exception handler from new address
  - Instruction flush is done using nop or trap (IF) or zeroing of control signals (ID, EX, MEM)
- Save address of offending instruction plus 4, if restartable

## Precise vs. imprecise exceptions

- Instructions before offending instruction have committed results to registers/memory
- Offending and following instructions execute from the beginning
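- Exceptions may happen out of program order
- Example: LW followed by an ADD, where the ADD faults in IF (instruction page fault) before the LW faults in MEM (data page fault)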
- HW maintains exception status vector for "early" exceptions
- Exceptions are "processed" in WB stage
- Status vector is read to cancel register or memory update
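The status-vector idea can be illustrated with a small Python sketch. It assumes each in-flight instruction carries the first exception recorded by any stage, and that WB examines instructions strictly in program order. The function name `commit_in_order` and the exception labels are hypothetical.

```python
def commit_in_order(instructions):
    """instructions: list of (name, exception_or_None) in program order,
    where the exception was recorded in the instruction's status vector by
    whichever stage detected it.

    Returns (committed_names, handled), where handled is the first
    (name, exception) found in program order, or None.
    """
    committed = []
    for name, exception in instructions:  # WB sees program order
        if exception is not None:
            # Cancel this and all later updates, then run the handler.
            return committed, (name, exception)
        committed.append(name)
    return committed, None


# The ADD's fault is raised earlier in time (IF), but the LW comes first in
# program order, so the LW's fault is the one that gets handled.
program = [("LW", "data page fault (MEM)"), ("ADD", "instruction page fault (IF)")]
print(commit_in_order(program))  # ([], ('LW', 'data page fault (MEM)'))
```

Handling exceptions only at WB keeps them precise: everything before the offending instruction has committed, and nothing after it has.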