# 3.1 The need for multiple queues within a same buffer memory: Multi-Queue Data Strucutres

- Buffers memories for time switching: what data structures?
- Single queue feeding multiple destinations/classes
  - ⇒ Head-of-Line (HOL) Blocking ⇒ poor performance
- Multiple Queues in one buffer:
  - Partitioned Space (underutilized) ⇒ circular queues
  - Shared Space (efficient) ⇒ linked-list queues



CS-534, Copyright Univ. of Crete



CS-534, Copyright Univ. of Crete





#### Shared Queue ⇒ Head-of-Line (HOL) Blocking



- Queue shared among multiple destinations/classes (A and B)
- $C_{line} < C_{buf}$  or  $C_{eff} < C_{buf}$  due to other traffic
- ⇒ HOL packet can block packets behind it to other dest./classes

#### Reminder: Circular Array Implementation of FIFO Queue



## B.2 Multiple FIFO Queues with Statically Partitioned Space

#### for each of them:

- · One RAM for all queues; Two counters/pointers per queue.
- · Limits between queue areas (partitions)
  can be "hardwired", or "configurable"
  off-line (when queues are empty);
  limit registers needed in the latter case
- · Underutilization of memory space (one queue may overflow, while space exists on others)
- · Variable-size packets OK



534

12.6

#### Multiple Queues Sharing a Buffer Memory Space

- Often need many queues where most have low occupancy (or empty) but few have high occupancy
- Departures create holes ⇒ empty space is fragmented
  - ⇒ neighboring packets physically non-contiguous
  - ⇒ linked-list data structure

Reference: Y. Tamir, G. Frazier: "High Performance Multi-Queue Buffers for VLSI Communication Switches", ISCA 1988

- "malloc()" works on fixed-size blocks
  - block size is a tradeoff between fragmentation cost (for very large blocks) and memory address rate (for very small blocks): see exercise set 4
  - unless multiple packet fragments in a same block ex. 6.1





CS-534, Copyright Univ. of Crete



| . 0                                             |                   |                   |                        |                    | (clock) (op's) (clock) (op's) (cycles) (op's) |        |         |        |
|-------------------------------------------------|-------------------|-------------------|------------------------|--------------------|-----------------------------------------------|--------|---------|--------|
| Assumptions:<br>"off-chip" means                | E                 | Hd                | Tl                     | NxtPtr             | (clock)                                       | (OP'S) | (clock) | ( op's |
| ang/deg controller FSM<br>Imemories are located |                   |                   | single,<br>hip me      | 1-port             | 40,5                                          | 1      | 5       | 5 0 4  |
| "on-chip" means all together                    | 1-Port<br>on-thip | Tin a             | single,<br>chip m      | 1-port             | 4                                             | 1/4    | 5       | 1/4    |
| 1 memace/cycle/port off-chip dependent          | 1-tort<br>on-chip | 1-post,           | off-chip<br>wide       | 1-port<br>off-chi) | 3*                                            | 1/3    | 5       | 1/3    |
| use data as next<br>address) cost one           | 1-port<br>on-chip |                   | 1-part<br>off-<br>chip |                    | 3*                                            | 1/2    | 5       | 1/2    |
| extra cycle of latency between them             | 1-Port<br>Ou-Chip | 1-port            | 1-port                 | off-dip            | 2*                                            | 1/2    | 3       | 1/2    |
| Note:<br>* enqueue latency                      | 2-port<br>on-chip | 2-port<br>on-chip | 2. port<br>on. chip    | off-<br>chip       | 2*                                            | 1      | 3       | 1      |

#### Notes on Enqueue/Dequeue Performance Table

- Table assumes we always need to perform both normal and exceptional accesses – see ex. 5.1
- Table assumes fall-through timing for off-chip SRAM: one-cycle latency per access, plus one extra cycle between dependent off-chip accesses – see ex. 5.2
- For peak throughput: overlap successive operations (latency of individual operations increases)

For a highly pipelined implementation, refer to: Kornaros, Kozyrakis, Vatsolaki, Katevenis: ``Pipelined Multi-Queue Management in a VLSI ATM Switch Chip with Credit-Based Flow Control", 17<sup>th</sup> Conf. on Advanced Research in VLSI, 1997







CS-534, Copyright Univ. of Crete

### Dropping Entire Packet (multi-segment) or moving to other Queue:

## 1) In O(pckSize) Time:

- Drop or move one segment at a time...
- ⇒takes time proportional to the size of the packet

Howally, this consumes
the same resources
(time, memory ports)
as if the packet
were transmitted
to the output port



· Need two next-pointers per segment:



(the utilization of the second set of pointers is only: (average number of segments per packet)

- alternative: separate "packet descriptor"

balternative: separate "packet descriptor data structures, like for multicoust segment descriptors—see below).

#### **Queueing for Multicast Traffic**

- Multicast traffic is expected to become very important in the future
   –but so has it been for many years in the past…
- Supporting multicast traffic usually increases complexity and cost
- Queueing for Multicast Traffic:
  - Each segment (block) allowed in only one queue ⇒ HOL blocking
  - Each segment allowed in multiple queues ⇒ need many nxtPtr's
  - Enqueue throughput and nxtPtr space: static vs. dynamic sharing

#### References:

- F. Chiussi, Y. Xia, V. Kumar: "Performance of Shared-Memory Switches under Multicast Bursty Traffic", IEEE Jour. Sel. Areas in Communications (JSAC), vol. 15, no. 3, April 1997, pp. 473-487.
- D. Stiliadis: "Efficient Multicast Algorithms for High-Speed Routers", Proc. IEEE
   Workshop on High Performance Switching and Routing (HPSR 2003), Torino, Italy,
   June 2003, pp. 117-122.



#### Case(2): Each segment is allowed to belong to multiple queues: Reference Data Buffer: per-output queues Counts: blocks: addr: 40 33 1 (unicoust) 34 0 2 (multicost) 1 36 (unicout) · Solves all QoS problems! One copy of "37" but ... (multicost) has already · Increases the worst-case BD deported and has decremented 1 queue-operation rate 38 (unicoust) the corresponding by a factor of n reference count. 39 Ø (n = number of output ports)!

(unitast)

1





## Enqueue Operation Rate for multicust segments onto multiple per-output queues



- · F. Chiussi, Y. Xia, V. Kumar: IEEE JSAC, April 1997, pp. 473-487
- D. Stiliadis: HPSR 2003, pr. 117-122