# Queue and Flow Control Architectures for Interconnection Switches

- 1. Basic Concepts and Queueing Architectures
- 2. High-Throughput Multi-Queue Memories
- 3. Crossbars, Scheduling, & Combination Queueing
- 4. Flow and Congestion Control in Switching Fabrics

#### Manolis Katevenis

FORTH and Univ. of Crete, Greece

http://archvlsi.ics.forth.gr/~kateveni/534

### 2. High-Throughput Multi-Queue Memories

#### Table of Contents:

- 2.1 Wide Memories for High Throughput
  - wide, pipelined, interleaved memories, esp. for shared buffers
- 2.2 Variable Size Packet Segmentation Overhead
  - mem. accesses or crossbar cell-times vs. pck-time on the line
- 2.3 Multiple Queues within a Buffer Memory
  - partitioned queue space: circular-buffer queue
  - shared queue space: linked-list queues
  - DRAM optimizations, free-list bypass / free-block cache
- 2.4 Queueing for Multicast Traffic
  - each segment allowed in single queue
  - each segment allowed in multiple queues
  - decoupled linked-list node from data-block addresses

M. Katevenis, FORTH and U.Crete



Note hidden input and output crossbars; note required cell time alignment
 1.1 - M. Katevenis, FORTH and U.Crete, Greece



# 2.1 Optimized Implementation: Pipelined Memory

- Monolithic wide memory replaced by multiple banks
- concurrent rd/wr access to entire width replaced by a "wave" of same-address accesses moving from left to right



- Benefit 1: no double buffering needed at inputs, single latch at outputs
- Benefit 2: cut-through occurs automatically no paths, no control needed

2.1 - M. Katevenis, FORTH and U.Crete, Greece

5

# Pipelined Memory Operation Illustrated



<u>Ref:</u> Katevenis, Vatsolaki, Efthymiou: "Pipelined Memory Shared Buffer for VLSI Switches", ACM SIGCOMM Conf. 1995, http://archvlsi.ics.forth.gr/sw\_arch/pipeMem.html

2.1 - M. Katevenis, FORTH and U.Crete, Greece



#### Interleaved Mem. Sh.Buf. – PPS, Birkhoff-vonNeumann

- Wide memory width versus packet size
  - Wide (or pipelined) memory owes its simplicity to the controller generating a single address per cycle for all "banks" (entire width)
  - To have multiple packets per line (e.g. red & blue in figure): need independent address controllers per bank (per effective xbar port)
  - Note: infeasible to pack together into the same line packets that both arrive and depart consecutively in time
  - Scheduling interleaved memory accesses against bank conflicts is equivalent to crossbar scheduling under input queueing...
- Scalable Shared-Buffer Switches through Interleaved Banks:
  - Complexity: maintain distributed queues, forward cells in-order
  - The Parallel Packet Switch (PPS) Khotimsky, Krishnan: IEEE Int.
     Conf. Commun. (ICC) 2001; Iyer, McKeown: IEEE/ACM ToN, Apr. 2003
  - Birkhoff-vonNeumann C. Chang, W. Chen, H. Huang: Infocom 2000

2.1 - M. Katevenis, FORTH and U.Crete, Greece

# 2.2 Variable Size Packet Segmentation Overhead

· most buffer memories, queueing structures, & crossbars operate on fixed-size "segments", "blocks", or "cells"

· most appl'ns

use variable



- size packets Packets segmented into blocks/cells upon entry from line to switch
- Throughput of buffer memories & crossbars, measured in terms of cells, must be higher than line thruput, to compensate for segment'n overhead

2.2 - M. Katevenis, FORTH and U.Crete, Greece

2.2 - M. Katevenis, FORTH and U.Crete, Greece

10

#### Overspeed reg'd to compensate for Segmentation · Example: 144 160 176 - min. pck. size = 40 B third cell Buffer/Crossbar Cost (Bytes) - segment (cell) = 64 B - line overhead = 16 B · Rules of thumb: – For line overhead = $0 \Rightarrow$ second cell overspeed = often 2.0 - ... but check min.pck.sz.: if minimum packet is too 80 small (smaller than about 8 half a cell), then higher overspeed is required Sell – for line overhead > $0 \Rightarrow$ one overspeed < 2.0, but minimum 9 pck size check minimum packets - for very large line ovrhd 16 32 48 64 80 96 112 128 144 ⇒ check larger pck sizes one cell second cell line overhead Line Cost (Bytes)





# 2.3 Multiple Queues - 1 of 2 Statically Partitioned Space

- Multiple queues within a same SRAM block
- Each queue: circular array implementation
- Control overhead: two pointer words per queue (head, tail), incrementor, comparator
- Queue space bounds (partitions) can be hardwired, or off-line configurable (when queues are empty); in the latter case, also need bounds pointers.
- + Advantage: simplicity.
- Disadvantage: <u>partitioned</u> memory space leads to <u>underutilization</u> – one queue may overflow while lots of empty space exists in other memory space partitions.

2.3 - M. Katevenis, FORTH and U.Crete





#### <u>Data vs. Pointer Access Rate – Free List Bypass</u>



- Data memory throughput = 2 cells/cell-time (1 write + 1 read)
   ⇒ data memory access rate = 2 addresses/cell-time
- Both Queue & Free-List operations touch the Next-Pointers, once per op
   naïve implementation would require 4 addresses/cell-time to nxtPtr
- <u>Free List Bypass</u>: put incoming cell into just freed block of departing cell
   next -pointer memory access rate = 2 addresses/cell-time
- When no arrival or no departure, other side can use full 2 acc/cl-time rate
- Multicast: departure not always frees the block ⇒ use *Free Block Cache*

2.3 - M. Katevenis, FORTH and U.Crete, Greece

15

#### *nxtPtr* in DRAM – Free Block Preallocation nxtPtr nxtPtr Data Data head head pck1 old pck1 old tail pck2 old pck2 old pck3 new pck3 new write ptr here write w free allocated ptr here write data here DRAM burst access DRAM burst access data here Eng. w. Free-Block Preallocation Conventional Enqueue

- To economize on nxtPtr memory, place these pointers inside data DRAM
   conventional enq costs twice the number of DRAM row activate's
- · Preallocate one free block per queue, at tail, to remedy this
- Reference: Nikologiannis, Katevenis: "Efficient per-flow queueing in DRAM at OC-192 line rate using out-of-order execution...", IEEE Int. Conf. Commun. (ICC) 2001.

2.3 - M. Katevenis, FORTH and U.Crete, Greece

## 2.4 Queueing for Multicast Traffic

- Multicast traffic is expected to become very important in the future
   but so has it been for many years in the past...
- · Supporting multicast traffic usually increases complexity and cost
- Queueing for Multicast Traffic:
  - Each segment (block) allowed in only one gueue ⇒ HOL blocking
  - Each segment allowed in multiple queues ⇒ need many nxtPtr's
  - Enqueue throughput and nxtPtr space: static vs. dynamic sharing

#### References:

- F. Chiussi, Y. Xia, V. Kumar: "Performance of Shared-Memory Switches under Multicast Bursty Traffic", IEEE Jour. Sel. Areas in Communications (JSAC), vol. 15, no. 3, April 1997, pp. 473-487.
- D. Stiliadis: "Efficient Multicast Algorithms for High-Speed Routers",
   Proc. IEEE Workshop on High Performance Switching and Routing (HPSR 2003), Torino, Italy, June 2003, pp. 117-122.

2.4 - M. Katevenis, FORTH and U.Crete, Greece









