# 4.4 Internal Speedup & Related Topics

- Inefficiencies of Crossbar Scheduling
- Combined Input-Output Queueing (CIOQ) Internal Speedup
- Variable-Size Packets Segmentation and Reassembly
- Packet-Mode Scheduling
- Per-QoS-class output queues in CIOQ: outp. sched. for QoS
- Crossbars with Reconfiguration Overhead: Enveloppe Sched.
- Theoretical Questions:
  - can a CIOQ switch emulate Output Queueing?
  - with what internal speedup?

#### Unbalanced Traffic: a simple example of "hard" traffic pattern

- Each input has a "favored" output
  - "favored" input-output pairs are disjoint (they form a permutation)
- Each input sends traffic @ total rate = *load* as follows:
  - $(u \times load)$  to its favored output (u is the "unbalance factor), plus
  - $((1-u) \times load)$  to all outputs, uniformly distributed
  - $\Rightarrow$  each output receives traffic @ total rate = *load*
- *"u"* is the "Unbalance Factor":
  - u = 0 %  $\Rightarrow$  totally uniform traffic (usually easy)
  - u = 100 %  $\Rightarrow$  totally directional traffic (permutation) (often easy)
  - u = intermediate  $\Rightarrow$  ... usually hard traffic ...





## Internal Speedup

## Combined Input-Output Queueing (CIOQ)

- Make the crossbar faster than the external lines, in order to:
  - compensate for the inefficiencies of the scheduler (e.g. unbal. traffic)
  - compensate for the segmentation overhead of variable-size packets
  - allow for separate (output) queues per QoS class
- Typical Speedup Factor values are between 2 and 3:
  - speedup of about 2 needed for variable-size packets (see exer. 4)
  - theoretical results: speedup of 2 suffices to emulate output queueing (using complex schedulers though – hard to totally unrealistic)
- The cost of Internal Speedup:
  - buffers at outputs too, increased throughput for crossbar & buffers
  - for a given peak achievable internal throughput, the highest achievable line rate decreases in proportion to the speedup factor

### Combined Input-Output Queueing (CIOQ)





### Variable-Size Packets in Fixed-Size-Cell Crossbars

- Crossbars operate on a fixed-cell-time periodicity
  - need periodic time intervals at which to make scheduling decisions
  - need to reconfigure all input-output pairs at once (but see about "packet mode" scheduling, below)
  - we refer to *"unbuffered" crossbars* here "buffered" crossbars, which contain small buffers at their crosspoints, can operate asynchronously (to be discussed later)
- Variable-size packets can be switched only after segmentation into fixed-size cells
  - Segmentation Overhead: partially filled last cell of packet
  - Reassemble packets after the crossbar buffers & queues needed
  - Must store-and-forward no cut-through possible any more

Segmentation and Reassembly (SAR): Reassembly Queues and Buffer Space needed

- One reassembly queue per input needed at each output
- Reassembly space per output, when this space is shared among all reassembly queues at that output:
  - a max-size packet per input may be almost complete at the output
  - after the first completed packet starts being forwarded to the output, outgoing rate ≈ incoming rate ⇒ fixed total buffer occupancy
- Reassembly space, when partitioned per reassembly queue:
  - much more than with shared space:
  - after a first max-size packet, A, gets completed, and while it is being forwarded to the output (time ~ max.pck.size), one of the other reassembly queues (already containing almost one max. packet) may be receiving cells of total size ~ 1 max. packet, and so on...



Packet-Mode Crossbar Scheduling

- Maintain input-output pairings until full packet is forwarded
- During each cell-time, scheduler only considers candidacies for those inputs and outputs that are not in the middle of a packet-mode "connection"
- No reassembly queues needed
- Cut-through forwarding is feasible
- *Reference:* Ajmone Marsan e.a.: "Packet-Mode Scheduling in Input-Queued Cell-Based Switches", IEEE/ACM Tr. on Networking, Oct. 2002.





#### Packet-Mode Scheduling: Danger of Pathological Locking in one Configuration



 during the cell-time when a packet transmission is completed at an input-output pair, if this is the only input-output pair that gets released and there are more packets in that VOQ, then the same pair will get re-connected ⇒ other flows are locked out <u>Switches with Reconfiguration Overhead:</u> <u>large, multi-packet "Enveloppes" in lieu of cells</u>

- Reconfiguration Overhead
  - crossbars without central clock  $\Rightarrow$  rcvrs' resynchronization delay
  - optical crossbars
- Reduce reconfiguration frequency
- $\Rightarrow$  Increase cell time, i.e. increase cell size
  - call the large cells "Enveloppes"
  - Allow multiple segments from multiple packets in an enveloppe
  - enveloppes are formed in the VOQ's

### Switches with Reconfiguration Overhead: Enveloppe Scheduling

- When to "release" an Enveloppe to the Switch?
  - partially filled enveloppes waste switch throughput
    ⇒ wait for enveloppes to fill up
  - but if excessive waiting ⇒ too much delay introduced by the switch
    ⇒ use some kind of timeout mechanism, or use a portion of the
    switch throughput to periodically "scan" all partially-filled VOQ's
- *Reference:* Kar e.a.: "Reduced Complexity Input Buffered Switches", Hot Interconnects 2000.

Can a CIOQ Switch with Internal Speedup Emulate on Ontext Queued Switch?

Recent theoretical results: IEEE JSAC, June 1999:

(1) Chuang, Goel, Mckeown, Prabhakar (2) Krishna, Patel, Charny, Simcoe "CIOQ" = Combined Input & output Queueing

<u>Full Emulation</u>: consider a CIOQ switch and an OQ switch, with precisely the same cells entering into both switches, at precisely the same time. Full Emulation: precisely the same cells will depart from the CIOQ switch, as from the OQ switch, at precisely the same time.
 <u>Work Conserving Operation</u>: whenever cells destined to an output port exist in the switch, that output port will be busy transmitting one of them, i.e. an output port is never left idle, in the presence of traffic destined to it.

CS-534, Copyright Univ. of Crete



Fig. 4. Lower bound input traffic pattern for a  $4 \times 4$  switch.

| Phase | PA  | PB  | PC  | PD  | Phase | PA  | PB  | PC  | PD  |
|-------|-----|-----|-----|-----|-------|-----|-----|-----|-----|
| 1     | A-1 |     |     |     | 1     | A-1 |     |     |     |
| 2     | -   |     |     |     | 2     | A-2 | -   |     |     |
| 3     |     | B-2 |     |     | 3     |     | B-2 |     |     |
| 4     |     |     |     |     | 4     |     | B-3 |     |     |
| 5     |     |     | C-3 |     | 5     |     |     | C-3 |     |
| 6     |     |     | _   |     | 6     |     |     | C-4 |     |
| 7     |     |     |     | D-4 | 7     |     |     |     | D-4 |

17

| Phase | PA  | PB  | PC  | PD  | Phase | PA  | PB  | PC  | PD  |
|-------|-----|-----|-----|-----|-------|-----|-----|-----|-----|
| 1     | A-1 |     |     |     | 1     | A-1 |     |     |     |
| 2     | A-2 |     |     |     | 2     | A-2 |     |     |     |
| 3     | A-3 | B-2 |     |     | 3     | A-3 | B-2 |     | 1   |
| 4     |     | B-3 |     |     | 4     | A-4 | B-3 |     |     |
| 5     |     | B-4 | C-3 |     | 5     |     | B-4 | C-3 |     |
| 6     |     |     | C-4 |     | 6     |     |     | C-4 |     |
| 7     |     |     |     | D-4 | 7     |     |     |     | D-4 |

(c)

(d)

Fig. 5. Scheduling order for the lower bound input traffic pattern in Fig. 4.

Chuang, Goel, Mckeown, Prabhakar: JSAC, June 1999

| Architecture                       | maximum<br>single-mem<br>thoughput | number<br>of<br>memories | Total<br>memory<br>throughput | Menory<br>Space<br>Utilitation     | Pertorman         | Complexity<br>Cost               |
|------------------------------------|------------------------------------|--------------------------|-------------------------------|------------------------------------|-------------------|----------------------------------|
| Shared<br>Buffer                   | m+n                                | 1                        | m+n                           | BEST                               | BEST              | high-thrup-<br>singlen mem       |
| Queveing                           | m+1                                | n                        | (m+s).n                       | medium                             | BEST              | large total mem. cost            |
| Crosspoint<br>Queueing             | 2                                  | พ-บ                      | 2·m·n                         | WORST                              | BEST              | memoriel                         |
| Internal<br>Speci-ul<br>of 5 (5>1) | S+1                                | m+1                      | (++1)(m++)                    | melium                             | very good<br>(if) | expensive<br>switching<br>fabric |
| Imp. Buffering<br>(Adv. Imp. Q.)   | 2                                  | m                        | 2 <b>m</b>                    | medium (sh.sp.)<br>worst (part.sp) | very good<br>(if) | scheduling algorithm             |
| Input Queueing                     | 2                                  | m                        | 2m                            | nedium                             | WORST             | (ox ?)                           |