# 2. Link and Memory Architectures and Technologies

2.1 Links, Thruput/Buffering, Multi-Access Ovrhds
<u>2.2 Memories: On-chip / Off-chip SRAM, DRAM</u>
2.A Appendix: Elastic Buffers for Cross-Clock Commun.

Manolis Katevenis

CS-534 – Univ. of Crete and FORTH, Greece

http://archvlsi.ics.forth.gr/~kateveni/534

2.2 Memories: On-chip / Off-chip SRAM, DRAM

#### Table of Contents:

#### • 2.2.1 On-Chip SRAM blocks

- Area, Power Consumption, Access Cycle Time; 1 or 2 ports
- Power cons. per unit throughput: SRAM, pin transceivers

#### • 2.2.2 Off-Chip SRAM technologies

- Address-Read-Data Pipelining
- Separate Unidirectional versus Unified Bidirectional Data Lines

#### • 2.2.3 DRAM Chips and their Pin Interface

- Row Access versus Column Access
- Interleaved accesses to the internal DRAM banks

#### 2.2.1 On-Chip SRAM

Read Cycle Includes:

- Precharge bit lines
- Decode row address
- Activate word line
  - faster when narrow
- Discharge bit lines
  - faster when short
- Sense amplifiers
  - don't wait for full discharge before telling the result
- Column multiplexors
  - use column address



#### Sense Amplifiers: Role, Consequences

- Sense amplifiers significantly speed up read access time – sense 0-contents soon after bit-line discharge has started
- Sense amplifiers (SA) are large in size
  - can fit only one SA per 4 or 8 (typically) columns
  - analog multiplexors before SA select columns to be read
  - digital multiplexors after SA for narrow port widths
- Sense amplifiers consume significant energy when activated
  - only activate the block when read data are actually needed
  - power consumption is proportional to access frequency
  - power consumption is proportional to number of sense amp's (increases with port width, or with bit capacity of SRAM)



#### Area per (Kilo/Mega-) bit: Comments

- Older generation (~2005); values are  $(\mu m)^2/bit = (mm)^2/Mbit$
- Area efficiency increases with block capacity
  - peripheral overhead (address decoders, column multiplexors, sense amplifiers) grows slower than core
- Port width costs a lot for small memories
  - more sense amplifiers, possibly non-square aspect ratio
  - (large memories may already have more SA's than port width)
- 1 sense amplifier per 8 columns, usually
- Two-port area  $\approx 2 \times \text{one-port}$  area
- Power ring is not included in the quoted area figures
  - add 25 $\mu m$ on each side of the block that is given in the above chart: width and height increase by 50 $\mu m$ each
- Quoted blocks have per-Byte write-enable signals
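
The power-ring overhead can be put into numbers with a short sketch. The 25 µm ring width comes from the comments above; the block dimensions used below are hypothetical, picked only to show that the ring weighs more heavily on small blocks.

```python
def area_with_power_ring_mm2(width_um, height_um, ring_um=25.0):
    """Block area in mm^2 after adding a power ring of ring_um on every side."""
    # A ring on each side adds 2 * ring_um to both the width and the height.
    return (width_um + 2 * ring_um) * (height_um + 2 * ring_um) / 1e6


def ring_overhead(width_um, height_um, ring_um=25.0):
    """Fractional area increase caused by the power ring."""
    core_mm2 = width_um * height_um / 1e6
    return area_with_power_ring_mm2(width_um, height_um, ring_um) / core_mm2 - 1.0


# Hypothetical block sizes (not taken from the chart): one small, one large.
for w, h in [(200.0, 200.0), (1000.0, 1000.0)]:
    print(f"{w:.0f} x {h:.0f} um: +{100 * ring_overhead(w, h):.1f}% area")
```

Under these assumed dimensions the ring adds roughly half again to the small block but only about a tenth to the large one.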



#### Power Consumption (mW/MHz): Comments

- Slightly old generation (~2005): 130 nm
- Worst-case consumption quoted;  $V_{DD} = 1.2$  Volts
- Consumption is proportional to access frequency: *mW / MHz*
- Consumption is dominated by port-width, esp. for small blocks
  - actually by the number of SA's, which may be larger than needed for narrow-port memories
- Consumption increases with block size due to increasing word-line and bit-line capacitance
  - also increases when size is such that it requires more SA's
- Two-port memory consumption is *per-port*
- Two-port total consumption ≈ 2 × one-port consumption



#### Cycle Time (1/AccessRate): Comments

- Slightly old generation (~2005): 130 nm
- Worst-case cycle-time quoted;  $V_{DD} = 1.2$  Volts
  - Blocks compiled for performance
- Large SRAM's are slower than small SRAM's (small is fast)
  - bit-line (and word-line) capacitance increases with length
  - beyond the "knee" of the curve, it is advantageous to use smaller
     SRAM's + external data mux than to use single large SRAM
     (tree of read-multiplexors becomes faster than single large mux)
- For large blocks, narrow ports increase the read latency, due to extra multiplexors after the sense amplifiers
  - looks like this also increases the cycle time
- Two-port speed ≈ speed of 1-port block with 2 × num. of bits








#### On-Chip SRAM Buffer Example 1 of 2: 40-Byte wide

- <u>Width</u> = 1 min-size IP packet = 40 Bytes = 320 bits = 5 blocks × 64 bits/block
- <u>One-Port</u>, 2048 packets × 40 B/pck = 80 KB = <u>640 Kb</u>
- 130 nm CMOS, 1.2 Volts
- <u>Area</u> = 5 banks × 128 Kb/bank × 3 mm<sup>2</sup>/Mb = 0.64 Mb × 3 mm<sup>2</sup>/Mb ≈ <u>2 mm<sup>2</sup></u>
- <u>Throughput</u> = 320 bits × 300 Maccesses/s ≈ <u>100 Gb/s</u>
- <u>Power Consumption</u> = 5 banks × 0.11 mW/MHz × 300 MHz = 165 mW

#### On-Chip SRAM Buffer Example 2 of 2: 256-Byte wide

- Width ≈ 1 average-size IP packet = 256 Bytes = 2048 bits = 64 blocks × 32 bits/block
- <u>Two-Port (1rd+1wr)</u>, 2048 packets × 256 B = 512 KB = <u>4 Mb</u>
- 130 nm CMOS, 1.2 Volts
- <u>Area</u> = 64 banks × 64 Kb/bank × 6.1 mm<sup>2</sup>/Mb = 4 Mb × 6.1 mm<sup>2</sup>/Mb ≈ <u>25 mm<sup>2</sup></u>
- <u>Throughput</u> = 2 ports × 2048 b/port × 240 MHz ≈ <u>1 Tb/s</u> (500 Gb/s writes + 500 Gb/s reads)
- <u>Power Consumption</u> = 64 banks × 2 ports × 0.08 mW/MHz × 240 MHz ≈ <u>2.4 W</u>
- **Conclusion:** "no problem" on-chip, except for short packets
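
The arithmetic of the two buffer examples can be reproduced with a small sketch. The function name and parameter names are illustrative; the input figures are the ones quoted in the examples, and capacity is converted with 1 Mb = 1024 Kb (an assumption, since the examples round freely).

```python
def sram_buffer(banks, bank_kbits, bank_width_bits, mm2_per_mbit,
                mw_per_mhz_per_port, clock_mhz, ports=1):
    """Return (capacity in Mb, area in mm^2, throughput in Gb/s, power in W)."""
    capacity_mbit = banks * bank_kbits / 1024.0          # assumes 1 Mb = 1024 Kb
    area_mm2 = capacity_mbit * mm2_per_mbit
    # Every bank moves bank_width_bits per port on every clock cycle.
    throughput_gbps = ports * banks * bank_width_bits * clock_mhz / 1000.0
    # Consumption is quoted per bank and per port, in mW per MHz of access rate.
    power_w = banks * ports * mw_per_mhz_per_port * clock_mhz / 1000.0
    return capacity_mbit, area_mm2, throughput_gbps, power_w


# Example 1: one-port, 5 banks of 128 Kb, 64 bits wide each, 300 MHz.
ex1 = sram_buffer(5, 128, 64, 3.0, 0.11, 300, ports=1)
# Example 2: two-port, 64 banks of 64 Kb, 32 bits wide each, 240 MHz.
ex2 = sram_buffer(64, 64, 32, 6.1, 0.08, 240, ports=2)

for name, (cap, area, thr, pwr) in (("40-Byte wide", ex1), ("256-Byte wide", ex2)):
    print(f"{name}: {cap:.2f} Mb, {area:.1f} mm^2, {thr:.0f} Gb/s, {pwr:.2f} W")
```

The unrounded results (about 1.9 mm², 96 Gb/s and 0.165 W for the first example; about 24 mm², 983 Gb/s and 2.46 W for the second) agree with the rounded figures quoted above.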

#### Power Cons./Throughput (1 of 2): on-chip SRAM

- Consider some "usual, medium-size" SRAM blocks (130 nm):
  - 1-port, ×16: ≈ 0.03 mW/MHz = 0.03 mW / 16 Mbps ≈ 2.0 mW/Gbps
  - 1-port, ×32: ≈ 0.05 mW/MHz = 0.05 mW / 32 Mbps ≈ 1.6 mW/Gbps
  - 1-port, ×64: ≈ 0.10 mW/MHz = 0.10 mW / 64 Mbps ≈ 1.6 mW/Gbps
  - 2-port, ×8: ≈ 0.02 mW/MHz = 0.02 mW / 8 Mbps ≈ 2.5 mW/Gbps
  - 2-port, ×32: ≈ 0.06 mW/MHz = 0.06 mW / 32 Mbps ≈ 2.0 mW/Gbps

Conclusion: <u>**1.5 to 2.0 mW/Gbps**</u> power consumption for on-chip buffer memories
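
The conversion from mW/MHz to mW/Gbps used in the list above is sketched below. The function name is illustrative; the (mW/MHz, port width) pairs are the ones quoted for the 130 nm blocks.

```python
def mw_per_gbps(mw_per_mhz, port_width_bits):
    """Power per unit throughput of an SRAM block, in mW/Gbps."""
    # At f MHz the block draws mw_per_mhz * f mW and moves port_width_bits * f Mb/s,
    # so the frequency cancels; the factor 1000 converts Mb/s to Gb/s.
    return mw_per_mhz / port_width_bits * 1000.0


blocks = [
    ("1-port, x16", 0.03, 16),
    ("1-port, x32", 0.05, 32),
    ("1-port, x64", 0.10, 64),
    ("2-port, x8", 0.02, 8),
    ("2-port, x32", 0.06, 32),
]
for name, p, width in blocks:
    print(f"{name}: {mw_per_gbps(p, width):.2f} mW/Gbps")
```

Because the access frequency cancels out, the figure depends only on the block's mW/MHz rating and its port width.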

#### Power Cons./Throughput (2 of 2): Chip I/O

- High-speed serial off-chip transceiver ≈ <u>10 to 25 mW/Gbps</u>
  - e.g. differential pair, 3.125 Gbaud (8b/10b encoding) = 2.5 Gb/s
  - 130 nm CMOS, both transmitter and receiver power considered
  - assume no pre-emphasis at the transmitter for line equalization purposes – such pre-emphasis would consume considerably more power
  - copper cable consumption is very small, compared to others
- $\Rightarrow$ <u>Conclusion</u>: <u>chip-to-chip</u> communication costs <u>an order of magnitude more</u> than on-chip buffering, in terms of power consumption
- Total chip power consumption (limited to ≈ 10 to 30 Watts) limits total chip throughput to <u>about 1 Tbps/chip</u> or less
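
A rough sketch of the two calculations behind this slide: the payload rate of an 8b/10b-coded link, and the throughput ceiling implied by a power budget. It assumes, optimistically, that the whole chip power budget goes to the transceivers; the function names are illustrative.

```python
def payload_gbps(gbaud, data_bits=8, code_bits=10):
    """Useful bit rate of a line-coded serial link (8b/10b by default)."""
    return gbaud * data_bits / code_bits


def max_throughput_tbps(power_budget_w, mw_per_gbps):
    """Throughput ceiling if the entire power budget fed the transceivers."""
    gbps = power_budget_w * 1000.0 / mw_per_gbps
    return gbps / 1000.0


print(payload_gbps(3.125))               # 3.125 Gbaud with 8b/10b coding
for budget_w in (10.0, 30.0):
    for cost in (10.0, 25.0):
        print(f"{budget_w:.0f} W at {cost:.0f} mW/Gbps: "
              f"{max_throughput_tbps(budget_w, cost):.1f} Tbps")
```

Since the rest of the chip also consumes power, the ceiling computed this way is an upper bound on the I/O throughput, in line with the "about 1 Tbps/chip or less" estimate.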

#### 2.2.2 Off-Chip SRAM Technologies

- Large on-chip throughput, owing to parallelism of accesses
- Gradual improvements in pin-interface protocols (late 90's):
  1. Clock-synchronous, pipelined address/data communication
  2. Double-Data Rate (DDR) data-pin timing (see §2.1)
  3. Source-synchronous clocking
     - clock signal propagating in the same direction as data (or address) signals – normally implies two separate clocks
  4. Separate, unidirectional Write-Data and Read-Data buses
     - avoids bus turn-around overhead, but
     - requires 50% writes 50% reads for full utilization
  5. Write-data timing similar to read-data timing
     - first send the address, later send the data, so that the address-bus to data-bus time offset stays fixed for reads & writes
#### Clock-Synchronous RAM: Pipelined Communication

- "Flow Through": old timing
  - no overlapping between SRAM operation and communication
- "Synchronous" Registered Interface
  - pipelined SRAM operation and chip-to-chip communication



2.2 - U.Crete - M. Katevenis - CS-534





#### Example QDR SRAM (2007): CY7C1545V18

- 72 Mbits = 4 M × 18 bits (width = 2 Bytes + parity/ECC)
- $\leq$ 375 MHz clock $\Rightarrow$ cycle = 2.67 ns; bit-time = 1.33 ns (DDR)
- Burst-of-4 words ↔ simple (non-DDR) address timing
- Peak Write Throughput:
   375 MHz × 2 (DDR) × 16 bits = 12 Gb/s/chip = 1.5 GB/s
- Peak Read Throughput = (similarly) 12 Gb/s
- Peak Total throughput for balanced (50%-50%) read-write:
   12 + 12 = <u>24 Gb/s</u> = 3 GB/s
- Power consumption ≈ 2.4 W (typical) @ 375 MHz, 1.8 Volts
   ⇒ Power per throughput ≈ 2.4 W / 24 Gbps ≈ 100 mW/Gbps
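
The peak-throughput and power-per-throughput figures of this QDR example follow from a few multiplications, sketched here. The helper name is illustrative; the inputs (375 MHz, 16 data bits per direction, 2.4 W) are the ones quoted above.

```python
def peak_gbps(clock_mhz, data_bits, ddr=True):
    """Peak throughput of one data bus in Gb/s."""
    transfers_per_clock = 2 if ddr else 1   # DDR moves data on both clock edges
    return clock_mhz * transfers_per_clock * data_bits / 1000.0


write_gbps = peak_gbps(375, 16)             # dedicated write-data bus
read_gbps = peak_gbps(375, 16)              # dedicated read-data bus
total_gbps = write_gbps + read_gbps         # balanced 50%-50% read/write traffic
power_mw = 2400.0                           # typical consumption quoted above

print(f"write {write_gbps:.0f} Gb/s, read {read_gbps:.0f} Gb/s, total {total_gbps:.0f} Gb/s")
print(f"{power_mw / total_gbps:.0f} mW/Gbps")
```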



#### Example Shared-Bus SRAM (2007): CY7C1550V18

- 72 Mbits = 2 M × 36 bits (width = 4 Bytes + parity/ECC)
- $\leq$ 375 MHz clock $\Rightarrow$ cycle = 2.67 ns; bit-time = 1.33 ns (DDR)
- Peak Throughput = 375 MHz × 2 (DDR) × 32 bits = 24 Gb/s
- "NoBL" (No Bus Latency) = "ZBT" (Zero Bus Turn-Around, à la Micron)
- Although NoBL/ZBT, one clock cycle is lost every time the bus direction changes from read to write (bus turn-around)
   ⇒ throughput with alternating read/writes ≈ 2/3 × peak throughput ≈ <u>16 Gb/s</u>
- Power consumption ≈ 2.4 W (typical) @ 375 MHz, 1.8 Volts
   ⇒ Power per throughput ≈ 2.4 W / 24 Gbps ≈ <u>100 mW/Gbps</u>
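
The 2/3 figure for alternating reads and writes can be reproduced with a small sketch. It assumes that each access occupies the shared data bus for one clock cycle and that traffic repeats as a run of reads followed by a run of writes, with one read-to-write turn-around per repetition; the function name is illustrative.

```python
def shared_bus_efficiency(reads_per_run, writes_per_run, turnaround_cycles=1):
    """Fraction of clock cycles that carry data on a shared bidirectional bus."""
    useful = reads_per_run + writes_per_run
    # One read-to-write direction change, costing turnaround_cycles, per repetition.
    return useful / (useful + turnaround_cycles)


PEAK_GBPS = 24.0                             # peak throughput quoted above
for run in (1, 4, 16):
    eff = shared_bus_efficiency(run, run)
    print(f"runs of {run}: {100 * eff:.0f}% of peak, {PEAK_GBPS * eff:.1f} Gb/s")
```

Under this model, grouping accesses of the same direction into longer runs recovers most of the lost throughput, at the cost of reordering or buffering the requests.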

#### 2.2.3 Dynamic RAM Chips and their Pin Interface

- Highest density and longest internal latency RAM chips
- Huge internal parallelism, when addresses are *favorable*:
  - multiple banks: memory interleaving
  - per-bank: entire row (hundreds of bits) accessed in parallel
- Pin Interface: advanced techniques to increase throughput
  - pins synchronized to a high-speed clock (Synchronous DRAM)
  - 100's of bits piped thru 10's of data pins during several clocks
  - internal RAM access time is independent of the clock and spans multiple cycles
- Three-step internal accesses, each bank independently
  - row access: activate a row in a bank, copy into sense amp's
  - column access: read/write multiple bits in selected row
  - precharge: get this bank ready for activating another row
- Address pins time-shared: row / column addr; multiple banks
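
The three-step access sequence can be sketched as a toy model of a single bank. This is a simplification, not a datasheet-accurate controller: the class name is illustrative, the default 14 ns step times are borrowed from the DDR3 example in these notes, and the minimum bank-cycle time is ignored.

```python
class DramBank:
    """Toy model of one DRAM bank that tracks which row is open."""

    def __init__(self, t_row_ns=14.0, t_col_ns=14.0, t_pre_ns=14.0):
        self.open_row = None          # no row activated yet
        self.t_row_ns = t_row_ns      # row access (activate) time
        self.t_col_ns = t_col_ns      # column access time
        self.t_pre_ns = t_pre_ns      # precharge time

    def access(self, row):
        """Return the latency in ns of reading from `row`, updating bank state."""
        if self.open_row == row:
            return self.t_col_ns                      # row already open: column access only
        latency = self.t_row_ns + self.t_col_ns       # activate the row, then column access
        if self.open_row is not None:
            latency += self.t_pre_ns                  # another row is open: precharge first
        self.open_row = row
        return latency


bank = DramBank()
for row in (5, 5, 9):
    print(f"row {row}: {bank.access(row):.0f} ns")
```

"Favorable" addresses are those that keep hitting the open row or that fall in other banks, so that the long row-access and precharge steps are avoided or overlapped.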

#### Example DDR3 SDRAM (2007): MT41J64M16

- 1 Gbit = 64 M × 16 bits = 8 banks × 8 Mw/bank × 16 b/w
- $\leq$  800 MHz clock
- Bidirectional data pins, DDR timing  $\Rightarrow$  up to 1.6 Gbps/pin
- Internal latencies specified as absolute times:
  - row-addr. to column-addr.  $\geq$  14 ns
  - column-addr. to read-data  $\geq$  14 ns
  - bank-cycle time  $\geq$  48 ns; precharge time  $\geq$  14 ns
- Translated to # of clock cycles by user @ boot time
  - e.g. at 800 MHz: row-acc  $\geq$  11~, col-acc  $\geq$  11~, bnk-cycle  $\geq$  38~
- (Remaining slides are for a much older chip (~2001)...)



#### Fast DRAM Example (2001): Micron MT46V2M32 DDR SDRAM

- <u>200 MHz</u> max. clock frequency
- <u>64 Mbits</u> = <u>2M × 32 bits</u> = 512K × 32 bits × 4 banks (Synchronous DRAM)
- 32-bit (shared DQ) data bus, DDR timing ⇒ 2 words × 32 bits each per clock cycle
- Row Address to Column Address:
- Column Address to Read Data (CAS latency): CL ≥ 15 ns (@200 MHz: 3~)
- Write Recovery Time (write data to precharge): t<sub>WR</sub> ≥ 20 ns
- Precharge Time: t<sub>RP</sub> ≥ 20 ns
- Cycle Time (same bank): t<sub>RC</sub> ≥ 60 ns
- Read-to-Write bus turn-around lost cycles: 3~
- Lost cycles due to write recovery time: 2~
- Write-to-Read other bank lost cycles:









#### Multi-Bank Operation: Memory Interleaving



- Burst length set to 8; each successive READ command interrupts the preceding burst, resulting in net bursts of 6.
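
A back-of-the-envelope sketch of why interleaving across banks can keep the data bus busy. It assumes round-robin accesses over the banks, each to a new row, and ignores every timing constraint other than the same-bank cycle time; the function name is illustrative, and the inputs are those of the 2001 DDR SDRAM example (200 MHz, 4 banks, 60 ns bank cycle, net bursts of 6 words).

```python
def interleaved_bus_utilization(num_banks, words_per_access, clock_mhz,
                                bank_cycle_ns, ddr=True):
    """Data-bus utilization when cycling round-robin over the banks."""
    words_per_clock = 2 if ddr else 1
    clock_ns = 1000.0 / clock_mhz
    burst_ns = words_per_access / words_per_clock * clock_ns   # bus time per access
    # Each bank can start a new access only once per bank_cycle_ns; meanwhile
    # the other banks take their turns on the shared data bus.
    return min(1.0, num_banks * burst_ns / bank_cycle_ns)


print(interleaved_bus_utilization(4, 6, 200, 60.0))   # all four banks interleaved
print(interleaved_bus_utilization(2, 6, 200, 60.0))   # only two banks in use
print(interleaved_bus_utilization(1, 6, 200, 60.0))   # a single bank, no interleaving
```

With these numbers, four net bursts of 6 words (15 ns each) exactly fill one 60 ns bank cycle, which is consistent with the net burst length shown on the slide; fewer banks leave the bus idle part of the time.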