# 3.1 The need for multiple queues within a same buffer memory: Multi-Queue Data Strucutres

- Buffers memories for time switching: what data structures?
- Single queue feeding multiple destinations/classes  $\Rightarrow$  Head-of-Line (HOL) Blocking  $\Rightarrow$  poor performance
- Multiple Queues in one buffer:
  - Partitioned Space (underutilized)  $\Rightarrow$  circular queues
  - Shared Space (efficient)  $\Rightarrow$  linked-list queues





CS-534, Copyright Univ. of Crete





CS-534, Copyright Univ. of Crete

Shared Queue  $\Rightarrow$  Head-of-Line (HOL) Blocking



- Queue shared among multiple destinations/classes (A and B)
- $C_{line} < C_{buf}$  or  $C_{eff} < C_{buf}$  due to other traffic
- $\Rightarrow$  HOL packet can block packets behind it to other dest./classes

#### **Reminder: Circular Array Implementation of FIFO Queue**



B.2 Multiple FIFO Queues with Statically Partitioned Space for each of them:

- · One RAM for all queues; Two counters/pointers per queue.
- Limits between queue areas (partitions)
  com be "hardwired", or "configurable"
  off-line (when queues are empty);
  limit registers needed in the latter case
- Under utilization of memory space (one queue may overflow, while space exists on others)
- · Variable-size packets OK



534

### Multiple Queues Sharing a Buffer Memory Space

- Often need many queues where most have low occupancy (or empty) but few have high occupancy
- Departures create holes ⇒ empty space is fragmented
  ⇒ neighboring packets physically non-contiguous
  - $\Rightarrow$  linked-list data structure
    - Reference: Y. Tamir, G. Frazier: "High Performance Multi-Queue Buffers for VLSI Communication Switches", ISCA 1988
- "malloc()" works on fixed-size blocks
  - block size is a tradeoff between fragmentation cost (for very large blocks) and memory address rate (for very small blocks): see exercise set 4
  - ... unless multiple packet fragments in a same block ex. 6.1



CS-534, Copyright Univ. of Crete





| Cost-Performance Tradeoffs:                                   |                   |                        |                        |                        | Enqueue Dequeue   |                            |                                |                   |
|---------------------------------------------------------------|-------------------|------------------------|------------------------|------------------------|-------------------|----------------------------|--------------------------------|-------------------|
| Assumptions:<br>• "off-chip" means                            | E                 | Hd                     | TL                     | NxtPtr                 | Latary<br>(clock) | Thruput<br>( <u>op's</u> ) | Latency<br>(clect)<br>(cycles) | Thrupit<br>(op's) |
| ang/deg controller FSM                                        |                   | in a t                 | single, s<br>hip me    | 1-port                 | n con             | 1                          | 5                              | 1 504             |
| • "on-chip" means all together                                | 1-root<br>on-chip | in a off.              | single<br>chip m       | 1-port<br>iemors       | 4                 | 1/4                        | 5                              | 1/4               |
| · 1 memace./cycle/port<br>· off-chip dependent                | S-tort<br>on-chip | s-port,<br>log_N       | off-chip<br>wide       | 1-port<br>off-chit     | 3*                | 1/3                        | 5                              | 1/3               |
| accesses (read, then<br>use data as next<br>address) cost one | 1-Port<br>on-chip | J-port<br>off-<br>chir | 1-part<br>off-<br>chip | 1-port<br>off-chip     | 3*                | 1/2                        | 5                              | 1/2               |
| extra cycle of latency between them                           | ou-chip           | DM-chil                | J-port<br>ON-chip      | s - port<br>off- dut   | 2*                | 1/2                        | 3                              | 1/2               |
| Note:<br>* enqueue latency                                    | 2-port<br>on-chir | 2-port<br>on-chit      | 2. port<br>m.chip      | 1-part<br>off-<br>chip | 2*                | 1                          | 3                              | 1                 |
| may increase<br>when mixing enqueue                           | and de            | queue o                | peratio                | ns in the              | same              | pipe lin                   | e.                             |                   |

#### Notes on Enqueue/Dequeue Performance Table

- Table assumes we always need to perform both normal and exceptional accesses see ex. 5.1
- Table assumes fall-through timing for off-chip SRAM: one-cycle latency per access, plus one extra cycle between dependent off-chip accesses – see ex. 5.2
- For peak throughput: overlap successive operations (latency of individual operations increases)

For a highly pipelined implementation, refer to: Kornaros, Kozyrakis, Vatsolaki, Katevenis: ``Pipelined Multi-Queue Management in a VLSI ATM Switch Chip with Credit-Based Flow Control'', 17<sup>th</sup> Conf. on Advanced Research in VLSI, 1997



Nikologiannis, Katevenis: "Efficient Per-Flow Queueing in DRAM at OC-192 Line Rate using Out-of-Order Execution Techniques", IEEE ICC, Helsinki, Jun. 2001 15

Free List Rate. Free List Bypass: Te, MLoming Traffic Outgoing Traffic Data Memory Rout Queues dequeue free block freeblock tor each block bypass of each incoming dequare Free List lengueue or outgoing packet: 100 10P => Ho/Tl mem. (anound) op Rate ~ Rin + Rout Free List Queues but NxtPtr mem. Op Rate ~ 2× (Rin+Ront) 108 2 Hola NxtPtr 田口 However, Free List Bypass can reduce this to 1. reg. MEM WERMA C P. Andersson, C. Svensson (Lund Univ., Swedon): "AVIST Architecture for an 80 Gb/s ATM Switch Care", Assumptions: · each incoming block/pactet is enqueued into a single queue IEEE MADOR HA Susteens in Silicon Conference, Oct. 96. (e.g. multicast packets not enqueued into multiple queues) · for freelist bypass, each outgoing (transmitted) block is also freed (e.g. multicost packets depart a single time, simultaneously to all links) Also considera Free Block Cache: Keep in a small on-chip memory the pointers Also considera Free Block Cache: to the top few blocks of the Free List.





## **Queueing for Multicast Traffic**

- Multicast traffic is expected to become very important in the future –but so has it been for many years in the past…
- Supporting multicast traffic usually increases complexity and cost
- Queueing for Multicast Traffic:
  - Each segment (block) allowed in only one queue  $\Rightarrow$  HOL blocking
  - Each segment allowed in multiple queues  $\Rightarrow$  need many nxtPtr's
  - Enqueue throughput and nxtPtr space: static vs. dynamic sharing
- References:
  - F. Chiussi, Y. Xia, V. Kumar: "Performance of Shared-Memory Switches under Multicast Bursty Traffic", IEEE Jour. Sel. Areas in Communications (JSAC), vol. 15, no. 3, April 1997, pp. 473-487.
  - D. Stiliadis: "Efficient Multicast Algorithms for High-Speed Routers", Proc. IEEE Workshop on High Performance Switching and Routing (HPSR 2003), Torino, Italy, June 2003, pp. 117-122.







CS-534, Copyright Univ. of Crete





CS-534, Copyright Univ. of Crete