## **Computer Architecture**


### **Lecture 3: Pipelining**

**Iakovos Mavroidis**

Computer Science Department, University of Crete

Measurements and metrics: Performance, Cost, Dependability, Power

Guidelines and principles in the design of computers

- Monday 8/10 → Friday 12/10
- Monday 15/10 → Friday 19/10
- Wednesday 17/10 → TBD

### Outline

- Processor review
- Hazards
  - Structural
  - Data
  - Control
- Performance
- Exceptions

### **Clock Cycle**



- Old days: 10 levels of gates
- Today: determined by numerous time-of-flight issues + gate delays
  - clock propagation, wire lengths, drivers


## **Datapath vs Control**



Datapath: Storage, FU, interconnect sufficient to perform the desired functions

- Inputs are Control Points
- Outputs are signals

Controller: State machine to orchestrate operation on the data path

Based on desired function and signals

- 32-bit fixed-format instructions (3 formats)
- 32 32-bit GPRs (R0 contains zero; double-precision values take a register pair)
- 3-address, reg-reg arithmetic instructions
- Single addressing mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch

See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3


### Example: 32bit MIPS

#### **Register-Register**

| Bits  | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0 |
|-------|-------|-------|-------|-------|------|-----|
| Field | Op    | Rs1   | Rs2   | Rd    |      | Opx |

#### **Register-Immediate**

| Bits  | 31-26 | 25-21 | 20-16 | 15-0      |
|-------|-------|-------|-------|-----------|
| Field | Op    | Rs1   | Rd    | immediate |

#### Branch

| Bits  | 31-26 | 25-21 | 20-16   | 15-0      |
|-------|-------|-------|---------|-----------|
| Field | Op    | Rs1   | Rs2/Opx | immediate |

#### Jump / Call



# **Example Execution Steps**



## Pipelining: Latency vs Throughput



Pipelining doesn't help **latency** of single task, it helps **throughput** of entire workload
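The latency/throughput distinction can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes every stage takes the same time and that no hazards cause stalls.

```python
def unpipelined_time(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """Each task runs all of its stages before the next task starts."""
    return n_tasks * n_stages * stage_time


def pipelined_time(n_tasks: int, n_stages: int, stage_time: float) -> float:
    """The first task needs n_stages steps; every later task finishes one
    stage_time after its predecessor (assumes equal stages, no stalls)."""
    return (n_stages + n_tasks - 1) * stage_time


if __name__ == "__main__":
    stages, t = 5, 1.0
    # Latency of a single task is unchanged by pipelining.
    print(unpipelined_time(1, stages, t), pipelined_time(1, stages, t))
    # Throughput of a long workload approaches one task per stage_time.
    n = 1000
    speedup = unpipelined_time(n, stages, t) / pipelined_time(n, stages, t)
    print(f"speedup for {n} tasks: {speedup:.2f}")
```

For a single task both times are equal, while for a long workload the speedup approaches the number of stages.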

### 5-stage Instruction Execution - Datapath



# **Visualizing Pipelining**



### 5-stage Instruction Execution - Control



### Pipeline Registers: IR, A, B, r, WB

Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

- <u>Structural hazards</u>: HW cannot support this combination of instructions (single person to fold and put clothes away)
- <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
- <u>Control hazards</u>: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

### **Example of Structural Hazard**






# Speed Up Equation of Pipelining

Speedup =  $\frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}$ =  $\frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$ 

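For a simple RISC pipeline, ideal CPI = 1:

$$\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall cycles per instruction} = 1 + \text{Pipeline stall cycles per instruction}$$

If the unpipelined machine also has CPI = 1, then:

$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}$$

With perfectly balanced stages, the unpipelined clock cycle is Pipeline depth times the pipelined one, so:

$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$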
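To get a feel for how quickly stalls erode the ideal speedup, here is a minimal Python sketch of the last formula. The function name `pipeline_speedup` and its parameters are illustrative, not from the lecture; it assumes ideal CPI = 1 and perfectly balanced stages.

```python
def pipeline_speedup(depth: int, stall_cycles_per_instr: float) -> float:
    """Speedup over an unpipelined machine, assuming ideal CPI = 1 and a
    clock-cycle ratio equal to the pipeline depth (balanced stages)."""
    if depth < 1 or stall_cycles_per_instr < 0:
        raise ValueError("depth must be >= 1 and stalls must be >= 0")
    return depth / (1.0 + stall_cycles_per_instr)


if __name__ == "__main__":
    for stalls in (0.0, 0.25, 1.0):
        s = pipeline_speedup(5, stalls)
        print(f"stalls/instr = {stalls:.2f} -> speedup = {s:.2f}")
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction on average halves it.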

## Example: Dual-port vs Single-port

- Machine A: Dual read ported memory ("Harvard Architecture")
- Machine B: Single read ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Suppose that loads are 40% of instructions executed, and each load costs Machine B one stall cycle

For Machine B:

Average instruction time =  $CPI \times Clock \text{ cycle time}$ =  $(1 + 0.4 \times 1) \times \frac{Clock \text{ cycle time}_{ideal}}{1.05}$ $\approx 1.33 \times Clock \text{ cycle time}_{ideal}$ 

Machine A has no structural stalls, so its average instruction time is $1 \times Clock \text{ cycle time}_{ideal}$. Machine A is therefore 1.33 times faster.
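The arithmetic of this example can be reproduced as follows. This is a sketch with hypothetical names; the clock cycle is expressed in arbitrary units, and each load on Machine B is assumed to cost exactly one stall cycle.

```python
def avg_instruction_time(ideal_cpi: float, stall_fraction: float,
                         stall_cycles: float, clock_cycle: float) -> float:
    """Average time per instruction = (ideal CPI + stall cycles per instruction) x clock cycle."""
    return (ideal_cpi + stall_fraction * stall_cycles) * clock_cycle


clock_ideal = 1.0  # arbitrary time unit

# Machine A: dual-ported memory, no structural stalls.
time_a = avg_instruction_time(1.0, 0.0, 0.0, clock_ideal)

# Machine B: 40% of instructions are loads, each stalling 1 cycle,
# but the clock is 1.05 times faster.
time_b = avg_instruction_time(1.0, 0.4, 1.0, clock_ideal / 1.05)

ratio = time_b / time_a
print(f"Machine A is {ratio:.2f} times faster than Machine B")
```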

### Data Hazard

Time (clock cycles)



#### Read After Write (RAW)

Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it

> *I:* add r1,r2,r3<br>*J:* sub r4,r1,r3

Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

#### Write After Read (WAR)

Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

> *I:* sub r4,r1,r3<br>*J:* add r1,r2,r3<br>*K:* mul r6,r1,r7

Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".

Can't happen in MIPS 5 stage pipeline because:

- All instructions take 5 stages, and
- Reads are always in stage 2, and
- Writes are always in stage 5

#### Write After Write (WAW)

Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.

Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".

Can't happen in MIPS 5 stage pipeline because:

- All instructions take 5 stages, and
- Writes are always in stage 5

We will see WAR and WAW hazards in more complicated pipelines.
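The three definitions above reduce to simple register comparisons. The sketch below uses a hypothetical `Instr` record and a `dependences` helper (neither is from the lecture) to name the dependences between an earlier instruction I and a later instruction J. Whether a dependence becomes an actual hazard depends on the pipeline, as noted above for the MIPS 5-stage case.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Name the dependences from earlier instruction i to later instruction j."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")   # j reads what i writes (true dependence)
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")   # j writes what i reads (anti-dependence)
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")   # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    # RAW example from the slide: add r1,r2,r3 then sub r4,r1,r3
    print(dependences(Instr("add", "r1", ("r2", "r3")),
                      Instr("sub", "r4", ("r1", "r3"))))
    # WAR example from the slide: sub r4,r1,r3 then add r1,r2,r3
    print(dependences(Instr("sub", "r4", ("r1", "r3")),
                      Instr("add", "r1", ("r2", "r3"))))
```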

### Forwarding to avoid data hazards



## HW Change for Forwarding



What circuit detects and resolves this hazard? Why do we need forwarding lines for both inputs of the ALU?

Fall 2012 – Lecture 1

## Forwarding to Avoid LW-SW Data Hazard



### Data Hazard Even with Forwarding

Time (clock cycles)






Try producing fast code for

> a = b + c;<br>d = e - f;

assuming a, b, c, d, e, and f in memory.

| Slow code      | Fast code      |
|----------------|----------------|
| LW  Rb,b       | LW  Rb,b       |
| LW  Rc,c       | LW  Rc,c       |
| ADD Ra,Rb,Rc   | LW  Re,e       |
| SW  a,Ra       | ADD Ra,Rb,Rc   |
| LW  Re,e       | LW  Rf,f       |
| LW  Rf,f       | SW  a,Ra       |
| SUB Rd,Re,Rf   | SUB Rd,Re,Rf   |
| SW  d,Rd       | SW  d,Rd       |
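The benefit of the reordering can be counted with a small model. The sketch below is an assumption-laden illustration, not the lecture's method: it assumes full forwarding, so the only stall is one cycle when an instruction uses a register loaded by the instruction directly before it, and it treats store data as forwardable from a preceding load without a stall. The tuple encoding and the function name are hypothetical.

```python
from typing import List, Optional, Tuple

# (opcode, destination register or None, source registers)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]


def load_use_stalls(program: List[Instr]) -> int:
    """Count one-cycle load-use stalls under the assumptions stated above."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        cur_op, _, cur_srcs = cur
        # Assumption: a store's only register source is its data, which can be
        # forwarded from the load in time, so stores never stall here.
        if prev_op == "LW" and cur_op != "SW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

if __name__ == "__main__":
    print("slow code stalls:", load_use_stalls(slow))
    print("fast code stalls:", load_use_stalls(fast))
```

Under this model the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf), while the scheduled sequence does not stall at all.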


### Control Hazard on Branches: Three-Stage Stall



What do you do with the 3 instructions in between? How do you do it? Where is the "commit"?

- If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles => new CPI = 1 + 0.3 × 3 = 1.9!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS solution:
  - Move zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3
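The effect of cutting the branch penalty from 3 cycles to 1 can be sketched numerically. The function name below is hypothetical, and the model assumes branches are the only source of stalls.

```python
def cpi_with_branch_stalls(base_cpi: float, branch_fraction: float,
                           branch_penalty: float) -> float:
    """CPI when every branch pays a fixed stall penalty."""
    return base_cpi + branch_fraction * branch_penalty


if __name__ == "__main__":
    for penalty in (3, 1):
        cpi = cpi_with_branch_stalls(1.0, 0.30, penalty)
        print(f"penalty = {penalty} cycle(s) -> CPI = {cpi:.2f}")
```

With the figures from the slide, the 3-cycle stall gives CPI 1.90 and the 1-cycle penalty gives CPI 1.30.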

## **Pipelined MIPS Datapath**



## Four Branch Hazard Alternatives

- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction
- #3: Predict Branch Taken
  - 53% MIPS branches taken on average
  - But haven't calculated branch target address in MIPS
    - MIPS still incurs 1 cycle branch penalty
    - Other machines: branch target known before outcome

### Four Branch Hazard Alternatives

#### #4: Delayed Branch

Define branch to take place AFTER a following instruction

```
branch instruction
sequential successor_1    \
sequential successor_2     |  branch delay of length n
...                        |
sequential successor_n    /
branch target if taken
```

- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this

# Scheduling Branch Delay Slots



- A is the best choice: it fills the delay slot and reduces instruction count (IC)
- In B, the sub instruction may need to be copied, increasing IC
- In B and C, it must be okay to execute sub when the branch fails

## **Delayed Branch**

Compiler effectiveness for single branch delay slot:

- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots useful in computation
- About 50% (60% × 80% = 48%) of slots usefully filled
- Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
  - Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper

### **Evaluating Branch Alternatives**

Pipeline speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$ 

| Branch type                 | Frequency |
|-----------------------------|-----------|
| Unconditional branch        | 4%        |
| Conditional branch, untaken | 6%        |
| Conditional branch, taken   | 10%       |

#### Deep pipeline in this example

| Branch scheme     | Penalty unconditional | Penalty untaken | Penalty taken |
|-------------------|-----------------------|-----------------|---------------|
| Flush pipeline    | 2                     | 3               | 3             |
| Predicted taken   | 2                     | 3               | 2             |
| Predicted untaken | 2                     | 0               | 3             |

Additions to the CPI from each branch type (frequency × penalty):

| Branch scheme      | Unconditional branches | Untaken conditional branches | Taken conditional branches | All branches |
|--------------------|------------------------|------------------------------|----------------------------|--------------|
| Frequency of event | 4%                     | 6%                           | 10%                        | 20%          |
| Stall pipeline     | 0.08                   | 0.18                         | 0.30                       | 0.56         |
| Predicted taken    | 0.08                   | 0.18                         | 0.20                       | 0.46         |
| Predicted untaken  | 0.08                   | 0.00                         | 0.30                       | 0.38         |
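The CPI additions in the table follow from multiplying each branch frequency by its penalty and summing. A minimal sketch of that calculation is below; the dictionary layout and the function name are illustrative, while the numbers are the ones from the tables above.

```python
# Branch frequencies as fractions of all instructions.
FREQ = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}

# Penalty in cycles per branch type for each scheme.
PENALTY = {
    "Stall pipeline":    {"unconditional": 2, "untaken": 3, "taken": 3},
    "Predicted taken":   {"unconditional": 2, "untaken": 3, "taken": 2},
    "Predicted untaken": {"unconditional": 2, "untaken": 0, "taken": 3},
}


def branch_cpi_overhead(penalties: dict, freq: dict = FREQ) -> float:
    """Sum of frequency x penalty over all branch types."""
    return sum(freq[kind] * penalties[kind] for kind in freq)


if __name__ == "__main__":
    for scheme, penalties in PENALTY.items():
        extra = branch_cpi_overhead(penalties)
        print(f"{scheme:18s} adds {extra:.2f} -> CPI = {1 + extra:.2f}")
```

Assuming an ideal CPI of 1 and no other stalls, the resulting CPIs are 1.56, 1.46, and 1.38.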

## **Problems with Pipelining**

Exception: An unusual event happens to an instruction during its execution

- Examples: divide by zero, undefined opcode
- Interrupt: Hardware signal to switch the processor to a new instruction stream
  - Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
- Problem: It must appear that the exception or interrupt occurs between 2 instructions (I<sub>i</sub> and I<sub>i+1</sub>)
  - The effect of all instructions up to and including I<sub>i</sub> is totally complete
  - No effect of any instruction after I<sub>i</sub> can take place
- The interrupt (exception) handler either aborts program or restarts at instruction I<sub>i+1</sub>

### Precise Exceptions in Static Pipelines



Key observation: architected state changes only in the memory and register write stages.

# Summary: Pipelining

- Next time: Read Appendix A
- Control via state machines and microprogramming
- Just overlap tasks; easy if tasks are independent
- Speedup $\leq$ Pipeline depth; if ideal CPI is 1, then:

Speedup = 
$$\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- Exceptions and interrupts add complexity