CS-534: Packet Switch Architecture
Department of Computer Science
© copyright: University of Crete, Greece
[Lecture: 4.4 CIOQ]
To simplify our task, let us assume that the SRAM chips to be used: (i) have a shared DQ (read/write) data bus --so that we do not have to worry about the read-to-write access ratio-- and (ii) use ZBT timing without any lost cycles when the bus direction changes between reads and writes (not a realistic assumption at clock rates as high as the ones we have here). Further assume that each chip has a 32-bit (4-Byte) wide data bus and uses double clocking (DDR timing) with a burst of 2. Hence, we can have one access (to an arbitrary address) per clock cycle, and each access reads or writes 8 Bytes of data. The maximum allowable clock frequency for these SRAM chips is 200 MHz.
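The per-chip numbers above follow from simple arithmetic, sketched below. The constant and function names are illustrative only; the values are the ones assumed in the paragraph above.

```python
# Peak throughput of one SRAM chip under the assumptions stated above.
BUS_WIDTH_BYTES = 4      # 32-bit data bus
BURST_LENGTH = 2         # DDR timing, burst of 2: two bus transfers per clock
MAX_CLOCK_HZ = 200e6     # maximum allowable clock frequency


def bytes_per_access(bus_width_bytes=BUS_WIDTH_BYTES, burst_length=BURST_LENGTH):
    """Bytes moved by one access (one access is issued per clock cycle)."""
    return bus_width_bytes * burst_length


def chip_throughput_bps(clock_hz=MAX_CLOCK_HZ):
    """Peak data throughput of one chip, in bits per second."""
    return bytes_per_access() * 8 * clock_hz


print(bytes_per_access())            # 8
print(chip_throughput_bps() / 1e9)   # 12.8 (Gbit/s, reads plus writes)
```

Because the DQ bus is shared, this 12.8 Gbit/s figure is the total for reads and writes together.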
The "10-Gigabit Ethernet" standard for high-speed links is 10 times faster than Gigabit Ethernet. Assume that the interframe gap and the preamble have the same size in both standards (I am not sure if this is true in reality); the Ethernet header, CRC, and payload sizes should also be the same. Thus, assuming the same sizes as in exercise 3.3(e), the total per-packet overhead is 38 Bytes, and the packet size is 46 to 1500 Bytes, where by "packet" we mean the information that has to be stored in our buffer memories. In our switch queueing architectures, we will have two different kinds of buffer memories, each with a different packet size as its limiting factor:
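The packet sizes and overhead above translate into a time on the wire and a packet rate per link. The sketch below is a rough aid, not part of the exercise statement; it assumes a line rate of exactly 10 Gbit/s, and the function names are made up for illustration.

```python
LINK_RATE_BPS = 10e9      # 10-Gigabit Ethernet line rate (assumed exact)
OVERHEAD_BYTES = 38       # per-packet overhead, as stated above
MIN_PACKET_BYTES = 46
MAX_PACKET_BYTES = 1500


def wire_time_ns(packet_bytes):
    """Time one packet plus its overhead occupies the link, in nanoseconds."""
    return (packet_bytes + OVERHEAD_BYTES) * 8 / LINK_RATE_BPS * 1e9


def packets_per_second(packet_bytes):
    """Back-to-back packet rate on one link for a fixed packet size."""
    return LINK_RATE_BPS / ((packet_bytes + OVERHEAD_BYTES) * 8)


print(f"{wire_time_ns(MIN_PACKET_BYTES):.1f} ns")                  # 67.2 ns
print(f"{packets_per_second(MIN_PACKET_BYTES) / 1e6:.2f} Mpkt/s")  # 14.88 Mpkt/s
print(f"{packets_per_second(MAX_PACKET_BYTES) / 1e6:.2f} Mpkt/s")  # 0.81 Mpkt/s
```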
(a) Output Queueing:
Using this architecture, our 12x12 switch will need 12 output buffer memories. Each of these memories must provide a very high throughput, due to its fan-in of 12 links, hence it must follow organization (i) above. What is the peak segment rate that each of these memories must support? What SRAM clock frequency must we use to achieve that? How many SRAM chips do we need to use in parallel, in order to build each such memory? (Hint: just one memory access must suffice to read or write an entire 128-Byte segment). Given that our switch needs 12 output buffer memories, what is the total number of SRAM chips needed for the entire switch? If each SRAM chip consumes 2 Watts of power, how much power does the entire buffer memory consume? Fast SRAM chips are expensive; if each chip costs 20 Euro, what is the cost of buying all SRAM chips for one switch? Assume that the remaining components of the switch cost 3 times as much as the SRAM chips (a conservative estimate); if the average selling price (ASP) of the switch is 5 times its components cost (see Hennessy and Patterson, Computer Architecture, chapter 1), what would be the ASP of this switch?
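The chip-count, power, and price chain asked for above can be sketched as a small helper. It assumes that a 128-Byte segment is covered by chips of 8 Bytes per access working in parallel, as the hint suggests; the function names and the dictionary layout are illustrative, not from the course material.

```python
SEGMENT_BYTES = 128            # one access must cover a whole segment (hint)
BYTES_PER_ACCESS_PER_CHIP = 8  # 32-bit DDR bus, burst of 2
WATTS_PER_CHIP = 2
EURO_PER_CHIP = 20
OTHER_COMPONENTS_FACTOR = 3    # rest of the switch costs 3x the SRAM chips
ASP_FACTOR = 5                 # selling price is 5x the components cost


def chips_per_memory(segment_bytes=SEGMENT_BYTES):
    """Chips needed in parallel so one access covers a full segment."""
    return -(-segment_bytes // BYTES_PER_ACCESS_PER_CHIP)  # ceiling division


def switch_budget(num_memories, chips_each):
    """Total chips, power (W), SRAM cost and ASP (Euro) for the switch."""
    chips = num_memories * chips_each
    sram_cost = chips * EURO_PER_CHIP
    components_cost = sram_cost * (1 + OTHER_COMPONENTS_FACTOR)
    return {
        "chips": chips,
        "power_w": chips * WATTS_PER_CHIP,
        "sram_cost_eur": sram_cost,
        "asp_eur": ASP_FACTOR * components_cost,
    }


print(chips_per_memory())                     # 16
print(switch_budget(12, chips_per_memory()))
# {'chips': 192, 'power_w': 384, 'sram_cost_eur': 3840, 'asp_eur': 76800}
```

The same `switch_budget` call can be reused for the other architectures by changing the number of memories and the chips per memory.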
(b) Block-Crosspoint (Block-Shared) Queueing:
According to this architecture, our 12x12 switch is made of a 2x2 array of buffer memories, where each buffer memory serves 6 inputs and 6 outputs, i.e. each buffer memory forms a small 6x6 shared-buffer switch. Each of these memories must again provide a very high throughput, so it must again follow organization (i) above. What is the peak segment rate that each of these memories must support? What SRAM clock frequency must we use to achieve that? How many SRAM chips do we need to use in parallel, in order to build each such memory? What is the total number of SRAM chips needed for the entire switch, in this case? How much power does the entire buffer memory consume, now? What is the cost of buying all these SRAM chips, and what would be the ASP of this switch?
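One way to approach the peak segment rate is to add up the links that can write into, or read from, one memory at the same time. The sketch below assumes the worst-case figure of 14.9 Msegments/second per port quoted for this exercise, and assumes that every attached input and output can be active on the same memory simultaneously; the function name is illustrative.

```python
PER_PORT_SEGMENT_RATE = 14.9e6   # worst-case segments/second per 10 GbE port
MAX_CLOCK_HZ = 200e6             # SRAM limit; one access per clock cycle


def peak_segment_rate(writers, readers, per_port=PER_PORT_SEGMENT_RATE):
    """Peak accesses/second when all attached links are active at once."""
    return (writers + readers) * per_port


# 6x6 shared-buffer block: 6 inputs writing and 6 outputs reading.
rate = peak_segment_rate(writers=6, readers=6)
print(rate / 1e6)               # 178.8 (Maccesses/s, i.e. MHz of SRAM clock)
print(rate <= MAX_CLOCK_HZ)     # True
```

Calling `peak_segment_rate(writers=12, readers=1)` gives the corresponding figure for a memory with a fan-in of 12 links and a single outgoing link.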
(c) Input (Virtual-Output) Queueing:
For this architecture, we need 12 input buffer memories for our 12x12 switch. (Within each buffer memory, multiple logical queues (per-output, per-priority, etc.) should be implemented; this does not affect our throughput calculation here, provided of course that the (separate!) queue-pointer memories can keep up with the required operation rate.) Each of these memories must now provide a throughput of just twice the link throughput, hence it will follow organization (ii) above. What SRAM clock frequency must we use to achieve that, for the worst-case (for this buffer) packet size? How many SRAM chips do we need for the entire switch? How much power do all the buffer memories consume? What is the cost of buying all these SRAM chips, and what would be the ASP of this switch?
(d) Internal Speedup with Input and Output Queues:
By running each input buffer of question (c) with a faster clock, its throughput is increased. Using the fastest allowable clock (200 MHz), the aggregate memory access rate can reach 200 Maccesses/second. How much of that access rate can be consumed by the incoming link, for worst-case (for this buffer) packet sizes? How many accesses per second remain available for the crossbar to use? What speedup factor does that represent? Besides this worst-case analysis, also make an average-case calculation for the speedup factor, assuming 320-Byte packets (320 = 2.5 * 128), which require 3 segment accesses each. This switch also includes output buffer memories: how many are there, what aggregate access rate must each provide, and what clock frequency and how many SRAM chips do they need? How many SRAM chips do we need for the entire switch, how much power do they consume, what is the cost of buying them, and what would be the ASP of this switch?
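The average-case part can be sketched as follows. This is only one possible reading: it assumes 128-Byte segment accesses, a line rate of exactly 10 Gbit/s, and takes the speedup factor to be the ratio of accesses left for the crossbar to accesses consumed by the incoming link. That definition is an assumption made here, not something stated in the exercise.

```python
import math

LINK_RATE_BPS = 10e9
OVERHEAD_BYTES = 38
SEGMENT_BYTES = 128
TOTAL_ACCESS_RATE = 200e6     # accesses/second at the 200 MHz clock limit


def link_access_rate(packet_bytes):
    """Segment accesses/second the incoming link needs for this packet size."""
    segments = math.ceil(packet_bytes / SEGMENT_BYTES)
    packets_per_s = LINK_RATE_BPS / ((packet_bytes + OVERHEAD_BYTES) * 8)
    return segments * packets_per_s


link = link_access_rate(320)              # 3 accesses per 320-Byte packet
remaining = TOTAL_ACCESS_RATE - link      # left over for the crossbar side
speedup = remaining / link                # assumed definition of speedup

print(f"{link / 1e6:.2f}")        # 10.47
print(f"{remaining / 1e6:.2f}")   # 189.53
print(f"{speedup:.2f}")           # 18.09
```

The worst-case analysis has the same shape, with the link's worst-case access rate for this buffer substituted for `link`.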
Assume that the chip will be implemented in the 45 nm CMOS technology that we saw in class (section 2.2), and that the shared buffer will be built out of 1K x 64 two-port SRAM blocks operating at 600 MHz (their maximum rate, with a small margin, as we saw).
(a) Number of Ports:
What is the maximum access rate that a shared buffer can offer, when implemented using the above SRAM blocks? Given the worst-case segment rate analysis for 10-Gigabit Ethernet that was presented in the above exercise 7.1 (14.9 Msegments/second/port), and given that our shared-buffer architecture uses the "very wide memory" organization above, how many ports, N, can this buffer support?
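The port count comes down to dividing the access rate the buffer can offer by the rate each port needs. In the sketch below, "two-port" is taken to mean one write and one read per clock cycle, which is an assumption; the `dual_port` flag shows how the result changes if reads and writes had to share a single port. Names are illustrative.

```python
BLOCK_CLOCK_HZ = 600e6           # SRAM block clock
PER_PORT_SEGMENT_RATE = 14.9e6   # worst-case segments/second per port


def max_ports(clock_hz=BLOCK_CLOCK_HZ,
              per_port=PER_PORT_SEGMENT_RATE,
              dual_port=True):
    """Largest N whose combined write (and read) segment rate fits the clock.

    dual_port=True: one write AND one read per cycle (assumed meaning of
    'two-port'), so writes and reads each get the full clock rate.
    dual_port=False: writes and reads share one access per cycle.
    """
    budget_per_direction = clock_hz if dual_port else clock_hz / 2
    return int(budget_per_direction // per_port)


print(max_ports())                  # 40
print(max_ports(dual_port=False))   # 20
```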
(b) Number of Blocks, SRAM Capacity, Area, and Power Consumption:
How many SRAM blocks are needed in parallel to yield the performance required by (a), i.e. in order for an entire segment to be accessible in just one clock period? What is the capacity of the resulting shared buffer in Kbits, KBytes, and Ksegments? How much silicon area will these SRAM blocks occupy on the chip? How much power will they consume at the clock frequency at which they need to operate?
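The block count and capacity can be sketched as below, assuming the 128-Byte segment size carries over from exercise 7.1 and that K means 1024. Area and power depend on the per-block figures from section 2.2, which are not reproduced here, so they are left out.

```python
SEGMENT_BYTES = 128        # assumed: same segment size as in exercise 7.1
BLOCK_WORDS = 1024         # "1K" words per SRAM block (K = 1024 assumed)
BLOCK_WIDTH_BITS = 64      # word width of each block


def blocks_in_parallel(segment_bytes=SEGMENT_BYTES):
    """Blocks needed side by side so one access covers a whole segment."""
    return -(-(segment_bytes * 8) // BLOCK_WIDTH_BITS)  # ceiling division


def capacity(num_blocks):
    """Capacity of the shared buffer in Kbits, KBytes and Ksegments."""
    bits = num_blocks * BLOCK_WORDS * BLOCK_WIDTH_BITS
    return {
        "kbits": bits // 1024,
        "kbytes": bits // 8 // 1024,
        "ksegments": bits // 8 // SEGMENT_BYTES // 1024,
    }


n = blocks_in_parallel()
print(n)             # 16
print(capacity(n))   # {'kbits': 1024, 'kbytes': 128, 'ksegments': 1}
```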
--- Optional Part: ---
(c) Evaluation - Adaptation:
University of Crete, Greece.
Last updated: 3 Apr. 2013, by M. Katevenis.