During the last two decades, the operational speed of microprocessors has grown exponentially, and RAM capacities have been improving at more than fifty percent per year. Memory speed and access time, however, have been improving at a much slower rate. To keep pace with processor technology in performance and reliability, considerable improvements in memory access time are necessary.
The Rambus founders emerged with a new memory technology: RDRAM. RDRAM memory provides the highest bandwidth (2.1 GB/sec) from the fewest pins, at five times the speed of industry-standard DRAM. The RDRAM memory channel achieves its high-speed operation through several innovative techniques, including separate control and address buses, a highly efficient protocol, low-voltage signaling, and precise clocking to minimize skew between clock and data lines. A single RDRAM device is capable of transferring data at 1066 Mb/sec per pin to Rambus-compatible ICs, and per-pin data rates will increase beyond 1066 Mb/sec in the future.
One of the constants in computer technology is the continuing advancement in operational speed. A few years ago, a 66 MHz PC was considered lightning fast; today's common desktop machine operates at many times that frequency. All this speed is the foundation of a trend toward visual computing, in which the PC becomes ever more graphical, animated, and three-dimensional.
In this quest for speed, most of the attention is focused on the microprocessor. But a PC's memory is equally important in supporting the new capabilities of visual computing. And commodity Dynamic RAMs (DRAMs), the mainstay of PC memory architecture, have fallen behind the microprocessor in their ability to handle data in the volume necessary to support complex graphics. While device densities have increased by nearly six orders of magnitude, DRAM access times have only improved by a factor of 10; over the same period, microprocessor performance has jumped by a factor of 100. In other words, while bus frequency has evolved from 33 MHz for EDO to the current standard of 100 MHz for SDRAM and up to 133 MHz for the latest PC-133 specification, memory speed has been outpaced by the operating frequency of the microprocessor, which reached 600 MHz and beyond by the turn of the century. The memory subsystem has thus risked becoming a bottleneck for overall system performance, creating a significant performance gap between computing elements and their associated memory devices.
Traditionally, this gap has been filled by application-specific memories such as SRAM caches and VRAMs. To broaden usage, what is needed is a high-density, low-cost, high-bandwidth DRAM.
This technology is based on a very high-speed, chip-to-chip interface and has been incorporated into DRAM architectures called Rambus DRAM, or RDRAM. It can be used with conventional processors and controllers to achieve performance up to 100 times that of conventional DRAMs. At the heart of the Rambus Channel memory architecture are ordinary DRAM cells that store information; but the access to those cells, and the physical, electrical, and logical construction of a Rambus memory system, is entirely new and much, much faster than conventional DRAMs. The Rambus Channel transfers data on each edge of a 400 MHz differential clock to achieve an 800 MB/s data rate. It uses a very small number of very high-speed signals to carry all the address, data, and control information, greatly reducing the pin count (and hence cost) while maintaining high performance. The data and control lines have 800 mV logic levels that operate in a strictly controlled impedance environment and meet specific high-speed timing requirements. This memory performance satisfies the requirements of the next generation of processors in PCs, servers, and workstations, as well as communications and consumer applications.
2. RDRAM - WITH A DIFFERENCE
2.1 The Memory Landscape
Currently, there are three major groups of memory technology widely available in the market: Synchronous DRAM (SDRAM), Double Data Rate Synchronous DRAM (DDR SDRAM) and RDRAM memory. SDRAM and DDR SDRAM share many architectural and signaling features: both use a parallel data bus, mainly available in component widths of x8 or x16, and both have a single address/command bus that must be shared to transmit row and column addresses.
DDR SDRAM increases data bandwidth over conventional SDRAM by transmitting data on both edges of the synchronous clock signal using SSTL-2 signaling, in theory doubling the data rate of the memory. It does not, however, double the address/command bandwidth of the system by using both edges of the clock on the command bus, a factor that ultimately limits the benefit of DDR signaling on the data bus.
RDRAM memory takes a totally different approach. It combines a conventional DRAM core with a high-speed interface called the RDRAM Channel. The Channel uses 16 pins (2 bytes) for a data path operating at an effective data rate of 800 MHz per pin by transmitting data on both edges of the clock. To maximize performance, the RDRAM Channel utilizes double-data-rate signaling on non-multiplexed row and column address/command buses. Since each RDRAM device's data path is as wide as the Channel, a single device can service an entire memory request, unlike SDRAM, which uses multiple devices in parallel to satisfy a request. Up to 32 RDRAM devices can be placed on each Channel without a buffer. The Channel is common to all devices and incorporates the command bus, data bus, and a serial control bus for initialization, and it is uniformly loaded as new devices are added. The RDRAM protocol supports many features that optimize the bandwidth and efficiency of the overall system.
A brief comparison of the Rambus (RDRAM) memory subsystem with PC133 SDRAM is instructive. The maximum theoretical data bandwidth of SDRAM (PC133 specification) is 1064 MBytes/sec, calculated from 64 data lines (8 bytes) working at 133 MHz. In practice this theoretical maximum, or peak, throughput can be sustained for only short bursts of time; for extended data transfers, only about 65 percent of this value (around 692 MBytes/sec) can be maintained. The 800 MHz Rambus Channel offers a maximum bandwidth of 1600 MBytes/sec per channel.
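The figures above follow directly from bus width and effective clock rate. A quick sketch in Python (all numbers are taken from the text; the 65% sustained-transfer factor for SDRAM is the estimate cited there):

```python
def peak_bandwidth_mb_s(bus_width_bits: int, effective_clock_mhz: float) -> float:
    """Peak bandwidth in MBytes/sec = (bus width in bytes) * effective clock."""
    return (bus_width_bits / 8) * effective_clock_mhz

# PC133 SDRAM: 64 data lines at 133 MHz
sdram_peak = peak_bandwidth_mb_s(64, 133)        # 1064 MB/s
sdram_sustained = sdram_peak * 0.65              # ~692 MB/s for long transfers

# RDRAM: 16-bit Channel, 400 MHz clock, data on both edges -> 800 MHz effective
rdram_peak = peak_bandwidth_mb_s(16, 800)        # 1600 MB/s

print(sdram_peak, round(sdram_sustained), rdram_peak)
```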
The following diagram compares the different data bandwidths within the PC using an Intel Pentium® III 800 MHz processor and a 133 MHz system bus:
Figure: Comparison of data bandwidths for PC systems with SDRAM and Rambus memory
As can be seen, the memory data bandwidth is similar to the graphics bandwidth, which helps to 'balance' the different data channels (between processor, chipset, memory, and graphics subsystem) and ensures that no single bus creates a bottleneck. The result is better overall system performance.
2.2 Comparing SDRAM and Rambus architectures
The following diagram highlights the major architectural differences between SDRAM memory and the Rambus memory subsystem:
The Rambus memory subsystem uses its own separate bus, called a 'channel', operating at a clock speed independent of the PC system bus. As mentioned earlier, the data path is only 16 bits wide but the clock speed is much higher. The maximum bandwidth for Rambus is 1600 MBytes/sec per channel; with two Rambus channels (as found in the Kayak XU800), the maximum bandwidth doubles to 3200 MBytes/sec.
Memory performance is usually summarized by two measures: bandwidth and latency. Peak system bandwidth can be defined as the maximum transfer rate of a particular memory technology given ideal conditions. Effects particular to each memory system degrade this peak bandwidth to a lower number called effective bandwidth. Effective bandwidth can be expressed as the product of peak bandwidth and efficiency, a measure of the ability of the memory system to deliver on its peak bandwidth given real world memory transaction patterns. In addition to examining the pitfalls of bandwidth measures, it is also important to look at latency. Devices are usually characterized by component latency, which is the time between row strobe and the delivery of the first bit of data on the output pin for an individual component. When evaluating system performance, it is useful to look at system latency, or the time from when a memory controller drives a requested address to when it samples the last of the read data.
2.3.1 Real World Performance: Bandwidth
a) SDRAM AND DDR
SDRAM and DDR SDRAM share some architectural issues that impact their effective bandwidth. Each quotes a peak bandwidth on a 64-bit bus, but these peak bandwidths decrease to substantially lower numbers when it comes to real world analysis. There are two components largely responsible for this decrease in bandwidth for both memory types: bank conflict among adjacent transactions and constraints on the address command bus.
The greatest loss in performance comes from bank conflicts among adjacent transactions. DRAM cells share sense amplifiers, to save cost, by being organized into banks. Banking restricts the ability of the DRAM to service simultaneous requests to addresses that lie in the same bank, since a sense amplifier can only service one address at a time. Thus, when two requests collide on a bank of memory that is already in use, a delay must be introduced until the previous transaction completes, and during this time no new memory transactions can be initiated. Each SDRAM or DDR device is organized into 4 internal banks, and these banks span the module because each device contributes only part of the 64-bit interface. SDRAM/DDR controllers pipeline operations to attempt to achieve the highest performance. Since service time is greater than issue time, the first memory operation will not complete before the second one is issued. Thus, the probability of sequential memory accesses hitting the same bank is 25%, and it increases with each additional outstanding transaction. The performance degradation due to bank conflicts is large: approximately 85% of SDRAM's efficiency loss and 80% of DDR SDRAM's efficiency loss can be attributed to bank conflicts. Ranked modules can be used in SDRAM and DDR SDRAM systems to provide additional banks, at the expense of more pins for the chip-enable signals that select which rank to activate, as well as poorer upgrade granularity: modules must be added in pairs.
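The 25% figure can be checked with a toy model: if back-to-back accesses land on uniformly random banks, the chance that the second access hits the bank the first is still occupying is 1/N. A quick Monte Carlo sketch (the uniform-random access pattern is a simplifying assumption of ours, not a claim from the text):

```python
import random

def bank_conflict_rate(num_banks: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the probability that two back-to-back random accesses
    land in the same bank (and therefore conflict)."""
    rng = random.Random(seed)
    conflicts = sum(
        rng.randrange(num_banks) == rng.randrange(num_banks)
        for _ in range(trials)
    )
    return conflicts / trials

# 4 banks per SDRAM/DDR module -> ~25% conflict chance, as the text notes.
# Pooling banks across many RDRAM devices drives the rate far lower.
print(f"4 banks:  {bank_conflict_rate(4):.1%}")
print(f"32 banks: {bank_conflict_rate(32):.1%}")
```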
DDR SDRAM faces additional performance limitations because the address/command bandwidth has not been scaled to match the data bus bandwidth. The limited bandwidth of the command bus restricts the ability of the memory system to concurrently process pipelined transactions. These constraints exhibit themselves in delays that may be introduced between sequential operations and when transitioning between read and write operations, due to the unequal lengths of these operations.
b) RDRAM
RDRAM memory uses a narrow, high-speed bus architecture to reduce pin-out while maximizing bandwidth. Commands and data are sent along the RDRAM Channel, which is uniformly routed to each RDRAM device. This results in a system that has equal load and fan-out on all signals, optimizing signal integrity and reducing performance loss due to signal stabilization time.
RDRAM memory uses separate buses to transmit row and column address information on both edges of the clock. This enables higher efficiency by allowing a column packet to be issued while a row packet is simultaneously being issued to another device, a feature not present in SDRAM or DDR. The bandwidth of the command and data buses is precisely matched to minimize delays between subsequent operations. Equalization of write and read data packet lengths prevents added delay when transitioning from a read to a subsequent write operation and minimizes delay when transitioning from a write to a subsequent read. On-chip write posting enables an RDRAM to let a pending read operation go around a slower write operation, ensuring delivery of critical read data first with minimal controller overhead.
A single RDRAM device is capable of servicing an entire memory transaction. One benefit of this architecture is that it allows the banks of the individual devices to be additive, pooling them to decrease the probability of bank conflict and thus minimize its effects. In the current 128 and 256 Mb RDRAM generation, 32 paired internal banks are supplied per RDRAM device. Future generations of the 256 Mb device, and all devices at the 512/576 Mb generation and beyond, will incorporate a reduced-cost 4-independent-bank core (4i). Since banks are additive, the effect of the reduced device bank count on performance is diminished, and ranking is not needed to increase the number of banks.
RDRAM memory utilizes a 400 MHz clock that facilitates fast data transfers and allows fine timing granularity for tiling or pipelining memory operations in 1.25 ns increments. RDRAM memory is capable of servicing a memory request every 10 ns. These architectural characteristics result in a high-efficiency memory subsystem that delivers very high effective bandwidth in real-world conditions: 1301 MB/s effective bandwidth from 1600 MB/s peak bandwidth, an efficiency of 81%. The efficiency and effective bandwidth of RDRAM increase rapidly as the transfer size grows beyond 32 bytes.
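These timing numbers are internally consistent: at 800 MHz effective (both edges of a 400 MHz clock) one bit time is 1.25 ns, a 10 ns request slot spans eight bit times, and the quoted effective bandwidth implies the stated efficiency. A quick check, using only figures from the text:

```python
clock_mhz = 400
effective_rate_mhz = 2 * clock_mhz          # data on both clock edges
bit_time_ns = 1000 / effective_rate_mhz     # 1.25 ns timing granularity
request_interval_ns = 10                    # one memory request per 10 ns
ticks_per_request = request_interval_ns / bit_time_ns   # 8 transfer slots

peak_mb_s = 1600
effective_mb_s = 1301
efficiency = effective_mb_s / peak_mb_s     # ~0.81

print(bit_time_ns, ticks_per_request, f"{efficiency:.0%}")
```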
2.3.2 Real World Performance: Latency
Component latency is defined as the time between the row strobe on the command pin and the delivery of the first bit of data on the output pin. From the controller's perspective, system latency is the total time from when a controller drives an address to when it samples the last of the read data. The latency of the memory core (tRAC) is the largest contributor to component latency. Despite relatively slow memory-core offerings from vendors (tRAC of approximately 45 ns for RDRAM memory, versus 39 ns for DDR), RDRAM maintains a very respectable system latency for transfers of 16 and 32 bytes of data. As faster memory cores are mated with the RDRAM interface, RDRAM latency will continue to decrease, resulting in higher performance and efficiency for an RDRAM system over competing memory technologies.
2.4 System Cost
2.4.1 System Cost: Memory Granularity
An RDRAM system has a very compelling memory-granularity advantage over competing SDRAM- or DDR-based systems. Since a single RDRAM device can act as an entire memory subsystem, the granularity of the memory footprint is a single device: 32 MB for today's 256 Mbit densities of RDRAM and 64 MB for tomorrow's 512 Mbit generation. 64-bit DDR systems, on the other hand, have a minimum granularity of 128 MB using x16 devices in 256 Mbit technology. The granularity problem is illustrated in the table. Applications such as networking, communications and consumer electronics can take advantage of RDRAM memory's single-device performance to optimize the total memory footprint while providing the highest system performance.
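The granularity gap follows from how many devices must be populated to fill the data bus. A small sketch (device parameters from the text; the helper function is ours):

```python
def min_increment_mb(device_mbits: int, device_width_bits: int, bus_width_bits: int) -> int:
    """Smallest upgrade step = enough devices to span the data bus."""
    devices_needed = max(1, bus_width_bits // device_width_bits)
    return devices_needed * device_mbits // 8   # Mbits -> MBytes

# RDRAM: one device spans the entire 16-bit Channel
print(min_increment_mb(256, 16, 16))   # 32 MB
# DDR: four x16 devices are needed to fill a 64-bit bus
print(min_increment_mb(256, 16, 64))   # 128 MB
```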
2.4.2 System Cost: Pin-count and Bandwidth
A single RDRAM memory system delivers 1.3 GB/s effective bandwidth (1.6 GB/s peak) in only 76 pins. Contrast this with a standard 64-bit SDRAM system that requires approximately 150 pins, or a 64-bit DDR system that requires approximately 180 pins, each at approximately half the effective bandwidth of RDRAM. 128-bit SDRAM or DDR systems can require 300-400 pins to match the effective bandwidth of a 76-pin RDRAM system.
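Put in bandwidth-per-pin terms, the comparison can be sketched as follows (the ~650 MB/s entries reflect the text's "approximately half of RDRAM's effective bandwidth" for 64-bit SDRAM/DDR systems; they are estimates, not quoted figures):

```python
systems = {
    "RDRAM, 16-bit Channel": (1300, 76),   # (effective MB/s, signal pins)
    "SDRAM, 64-bit":         (650, 150),
    "DDR SDRAM, 64-bit":     (650, 180),
}
for name, (bw, pins) in systems.items():
    print(f"{name:22s}: {bw / pins:5.1f} MB/s per pin")
```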
Pin count impacts designs in many areas. The memory controller is probably the most affected, due to the increased pin count required to support SDRAM or DDR SDRAM systems. For systems targeted at lower-cost market segments, the ability to save 75-110 pins can significantly reduce controller cost by allowing a smaller package while delivering superior performance. For systems incorporating a memory interface integrated with a CPU, interface pin counts become even more critical. Pin count also impacts die area, and thus chip power, due to the increased number of I/O pads and the receiver, driver, and protection components associated with them.
2.4.3 System Cost: I/Os and Support Components
Interface pin count translates directly into signals that must be routed across motherboards, connectors, and modules. SDRAM and DDR must deal with wide parallel buses that become unevenly loaded as system capacity increases: every DRAM device must attach to the command bus, while only a portion of the data bus connects to each device. The resulting system may have to cope with decreased signal integrity and smaller timing windows by using multiple address-bus copies, 2-cycle addressing, extra system components, or limitations on expansion to compensate for the losses. DDR SDRAM may require further support components in an implementation, including registers on memory modules to buffer signals, clock-recovery components, and many decoupling capacitors and termination resistors to tune the high-frequency signals.
In addition, the increase in I/Os may translate into additional PCB layers necessary to route signals escaping from a controller package and through a motherboard. Inclusion of a termination resistor for each signal further complicates the routing problem. Motherboard cost (and thus system cost) increases as the number of layers grows. Recent press coverage of the next-generation DDR-2 working group states that interface pin count is expected to exceed 200 signal pins for a 64-bit interface, primarily due to signal-integrity issues.
A single- or dual-Channel RDRAM memory system can be routed on a cost-effective 4-layer motherboard, and its modular channel layout ensures maximum signal integrity while minimizing support components, thus reducing overall system cost.
2.5 Power Dissipation
System power budgets are shrinking due to reduced airflow, fanless systems, rack-mounted systems and shrinking form factors. Simultaneously, performance requirements are increasing.
RDRAM memory was architected to meet this challenge with its 1.8 V native signaling, low signal swing of 0.8 V, and fine-grained power-management modes, which include standby, nap and power-down. Each mode successively reduces system power while maintaining a different level of functionality.
Due to these features, an RDRAM memory system consumes approximately equivalent or less power than a comparably built SDRAM or DDR SDRAM system, and the advantage widens for RDRAM memory as sustained bandwidth increases.
2.6 RDRAM Advantages: A Summary
An RDRAM memory system thus offers a designer highly compelling advantages to address a broad array of design needs:
Performance Headroom: twice the effective bandwidth of competing technologies in half the pins, with a roadmap to 9.6 GB/s from a single module in 2004.
Proven Technology: system and infrastructure component manufacturers have been in mass production since 1999.
Granularity: full-bandwidth performance (1.6 GB/s) from a single device; devices can be added one at a time.
Simplicity: Lowest overall system component count, cost-effective board and modular implementation.
Other benefits of Rambus include:
Reduced interference: demands are made simultaneously on the memory by the processor, graphics controller, network controller, I/O, and other bus-mastering devices; increased memory bandwidth reduces the interference between them.
Up to 95% efficiency due to superior pipelining and command handling. This means that the usable bandwidth is close to the theoretical bandwidth. SDRAM and DDR SDRAM only have around 65% efficiency for sustained data transfers, meaning the usable bandwidth is considerably lower than the theoretical bandwidth.
Scalable: in the future it will be possible to expand Rambus memory bandwidth by adding more Rambus channels. Each Rambus channel requires only 33 signal conductors, compared to 132 for SDRAM, making additional channels physically easier to add.
3. THE RAMBUS MEMORY CHANNEL
3.1 An Overview
Rambus is a complete memory subsystem that uses a dedicated bus called a channel. A Rambus memory subsystem includes these components:
Controller (memory interface)
Channel (three-byte-wide data and command/address bus: two bytes for data and one byte for commands and addresses)
Rambus DRAMs (RDRAMs)
RIMM modules (in-line memory modules for RDRAM chips)
Continuity modules (used in empty connectors)
Repeater chips (optional)
The figure below shows how these components interconnect.
The memory interface, the channel, and the connectors are all integrated onto the system board. A single Rambus channel can support up to 32 RDRAM chips on up to 2 RIMM modules. Initially, RIMM modules will be available with 8 or 16 RDRAM chips. Any connector that does not have a RIMM module installed must have a continuity module installed to maintain continuity of the channel (unused connectors are not permitted in a channel). RDRAM chips will be available in capacities of 64 and 128 megabits (Mb) when Rambus memory first ships in PCs in the fall of 1999. RDRAMs with capacities of 256 Mb, 512 Mb, and 1 gigabit (Gb) are planned.
3.2 Channel Operation
A Rambus Channel contains controlled-impedance, matched transmission lines for data, clock, and control. As shown in the figure below, the Channel has a bus topology with the master device (microprocessor or ASIC controller) at one end, terminators at the other end, and slaves (RDRAMs) in between.
Rambus channel signals
Several key features of the system can be noted from the figure. The first is the master, which is located at one end of the bus. In the Rambus protocol, direct data transfers occur only between master and slave devices, never between slaves. This allows signals to be terminated at only one end of the Channel, resulting in considerable savings in I/O power dissipation. In operation, data driven by the master propagates past all the slaves, allowing all slaves to correctly sense the data; the matched terminator prevents any reflections. Data driven by a slave initially propagates in both directions along the bus at half the voltage swing. The wavefront traveling toward the terminator stops when it reaches the terminating resistance. The signal traveling toward the master, on the other hand, encounters an open circuit; this causes the wavefront to double in amplitude, supplying a full logic swing to the master.
Output drivers on the Channel are current source drivers, rather than the more common voltage source drivers. Current source drivers present a high source impedance to the waveform reflecting from the master end of the bus. This prevents secondary reflections from occurring due to the active slave driver. Thus, the worst-case bus settling time is 2 T-f (T-f is the time of flight on the bus) when the slave nearest the terminator is transmitting. The worst-case data delivery time however, between any master-slave pair is only 1 T-f.
4. RAMBUS PROTOCOL
Transactions between the devices on the Channel are completely different from those of existing RAMs. Data and commands are routed over the Rambus Channel in packets. Each packet is 10 ns long (four 2.5 ns clock cycles) and contains eight items (data or commands). ROW packets consist of 24 bits (8 bursts of 3 bits); COLUMN packets are made up of 40 bits of information (8 bursts of 5 bits). DATA packets consist of 144 bits (8 bursts of 18 bits, including parity) that follow COLUMN packets for certain operations. ROW, COLUMN and DATA packets are largely independent; individual bus operations employ different combinations of these packet types. Subsequent packets can begin an arbitrary number of clock cycles after a prior packet completes.
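The packet geometry works out from the line counts and the eight transfer slots per 10 ns packet; a small sanity check over the figures quoted above:

```python
# Each packet spans four 2.5 ns clock cycles (10 ns); with transfers on
# both clock edges that gives eight transfer slots per packet.
TICKS_PER_PACKET = 8

LINES = {          # signal lines carrying each packet type
    "ROW": 3,      # three row lines
    "COLUMN": 5,   # five column lines
    "DATA": 18,    # 16 data bits plus 2 parity bits per slot
}
packet_bits = {name: TICKS_PER_PACKET * w for name, w in LINES.items()}
for name, bits in packet_bits.items():
    print(f"{name:6s}: {bits:3d} bits per packet")
# Matches the 24-, 40- and 144-bit packet sizes quoted above.
```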
Data transfers occur on both edges of the 400 MHz clock for an effective clock rate of 800 MHz. All packets begin on the falling edge of the clock. All data, row, and column packets are referenced to the high-speed differential clocks. Signals traveling toward the memory interface (read data) are referenced to the clock that travels toward the memory interface from the clock source. This clock passes through the interface and back out through the channel, where it is used as the reference for ROW, COLUMN, and write DATA packets.
Each ROW and COLUMN packet includes a device address. DATA packets require no device address because the source or destination is always the memory interface. Data transfers occur only between the memory interface and the RDRAMs, never directly between RDRAMs. ROW packets, transferred over the three row lines, can include activate, precharge, refresh, and power-state control commands along with explicit device and bank addresses. COLUMN packets, delivered across the five column lines, include two fields. The first field contains the command and address (read or write); the second field can contain masks (for writes) or extended operation (XOP) commands. DATA packets always include 16 bytes of data regardless of how many bytes the microprocessor may request. Therefore, the column packets include byte mask bits that permit a packet to write as little as one byte of data to memory. Because the column bus operates at the same speed as the data bus, burst commands are not necessary. Instead, the controller issues multiple column read or write commands in any order, to any RDRAM, and to any open page.
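The byte-mask mechanism described above can be sketched as a merge of a 16-byte DATA packet under a 16-bit mask. The function and its mask encoding are illustrative only, not the actual controller interface:

```python
def apply_write_mask(old: bytes, new: bytes, mask: int) -> bytes:
    """Merge a 16-byte DATA packet into memory under a byte mask.
    Bit i of `mask` set -> byte i of `new` is written; otherwise the
    existing byte is kept. (The mask encoding here is illustrative.)"""
    assert len(old) == len(new) == 16
    return bytes(new[i] if (mask >> i) & 1 else old[i] for i in range(16))

old = bytes(16)              # existing 16-byte memory contents (all zero)
new = bytes(range(1, 17))    # incoming DATA packet payload
# Mask 0b1 writes only byte 0 -- the single-byte write the text mentions
merged = apply_write_mask(old, new, 0b1)
print(merged.hex())  # 01000000000000000000000000000000
```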
The master initiates Channel transactions with a Request Packet, which contains the address of the RDRAM and the memory location to be accessed, as well as byte-count and op-code fields. For a WRITE request, the write data immediately follows the request. For READs, the access time from request to the first data word is 28 ns. Up to 256 bytes of data can be streamed in a single transaction.
Each RDRAM is divided into two independent banks of memory. Each bank has a 2 KByte cache line associated with it, built out of large sense-amplifier arrays. These caches work by holding the last-accessed row of their associated bank in the sense amplifiers, allowing further accesses to the same row of memory to result in cache hits. With the row already stored in the cache, data can be accessed with very low latency. Each RDRAM added to a system adds two cache lines to the memory system, helping to increase cache hit rates.
A cache miss results when a row is accessed that is not currently stored in one of the cache lines. When this happens, the requesting master is sent a Negative Acknowledgement packet indicating the requested row is not yet available. The RDRAM then loads the requested row into the cache line and waits for the master to submit a retry of the previous request. Address mapping hardware is provided to increase cache hit rates by allowing system designers to easily perform n-way RDRAM interleaving.
READ and WRITE transactions to RDRAMs are not limited to simple sequential operations. Non-contiguous blocks of memory can be accessed through the use of the read and write non-sequential operations. With these commands, multiple eight-byte blocks of data within a DRAM cache line can be accessed in a non-sequential fashion. The address for the next data block is transmitted as a serial address packet on one of the control signals. Successive serial address packets continue to specify new addresses within the cache while data is continuously transferred until the access is complete. Non-sequential accesses are useful in applications such as graphics, where data is often accessed in a localized but non-linear fashion, or in main memory applications when performing functions such as write gathering.
Concurrent transactions can be used to optimize RDRAM utilization in high performance applications by taking advantage of available Channel bandwidth during cache miss latency periods. When a miss in one RDRAM takes place, that device will be busy loading a new row into one of its cache lines. Other than that, the Channel and all other RDRAMs will be available for use. Instead of waiting for the first RDRAM to finish loading its cache, a transaction to another RDRAM can be initiated.
5.1 The High-Speed Memory Interface
The Direct Rambus™ ASIC Cell (Direct RAC) is a library macrocell used in ASIC designs to interface the core logic of a CMOS ASIC to a high-speed Direct Rambus Channel. The Direct RAC incorporates all of the high-speed interface circuitry and logic necessary to give the ASIC designer full access to and control over the Direct Rambus Channel without forcing any particular design implementation. The Direct RAC is flexible enough to be used for implementations ranging from simple memory controllers to complex multi-port memory controllers, or as a communication path for a high-speed chip-to-chip interface.
The Direct RAC typically resides in a portion of the ASIC's I/O pad ring and converts the high-speed Rambus Signaling Level (RSL) signals on the Rambus Channel into lower-speed CMOS-level signals usable by the ASIC designer. The Direct RAC functions as a high-performance parallel-to-serial and serial-to-parallel converter, performing the packing and unpacking of high-frequency data packets into wider, synchronous 144-bit (Direct Rambus) data words. Use of RSL technology over the Channel permits 600 MHz or 800 MHz transfer rates. The Direct Rambus Channel is capable of sustained data transfers at 1.25 ns per two bytes (10 ns per sixteen bytes). Separate control and data buses with independent row and column control yield over 95% bus efficiency. RACs are available across a wide selection of processes, vendors, and design rules, and RACs based on leading-edge processes are made available for integration into controller designs on a regular basis.
5.2 The D-RAC Core at a Glance
Figure 2 is a block diagram of the Direct RAC. The RSL signals of the Direct Rambus Channel enter and exit at the top of the figure, and the CMOS signals which connect the RAC to the rest of the ASIC enter and exit at the bottom.
The primary function of the RAC is to perform the parallel-to-serial conversions of slow, wide CMOS buses to/from fast, narrow RSL buses. There are three bidirectional RSL buses (RQ, DQA, and DQB) each eight or nine bits wide, and six corresponding unidirectional CMOS buses (RDataQ, TDataQ, RDataA, TDataA, RDataB, and TDataB) each 64 or 72 bits wide. This conversion is accomplished with the four 1:8 Demux blocks and the four 8:1 Mux blocks in the figure.
Each one-bit slice of a 1:8 Demux block is responsible for receiving a serial RSL signal and converting it into a parallel eight-bit CMOS bus. It does this using an RClk clock (derived from the CTM/CTMN RSL signals) and a four bit control bus (RQ1Sel[3:0], RQ0Sel[3:0], or RDSel[3:0]). This control bus selects the relative phase of the drive points for the CMOS bus.
Similarly, each one-bit slice of an 8:1 Mux block is responsible for converting a parallel eight-bit CMOS bus into a serial RSL signal and transmitting it on the Channel. It does this using a TClk clock (derived from the CFM/CFMN RSL signals) and a four bit control bus (TQ1Sel[3:0], TQ0Sel[3:0], or TDSel[3:0]). This control bus selects the relative phase of the sample points for the CMOS bus.
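Behaviourally, each Demux slice gathers eight serial ticks into one parallel byte, and each Mux slice does the inverse. A minimal functional sketch (the bit ordering is our assumption; real slices are clocked hardware, not software):

```python
def demux_1_to_8(serial_bits):
    """Gather groups of 8 consecutive serial ticks from one RSL pin into
    8-bit parallel CMOS words (behavioural model of a 1:8 Demux slice)."""
    assert len(serial_bits) % 8 == 0
    words = []
    for i in range(0, len(serial_bits), 8):
        word = 0
        for bit in serial_bits[i:i + 8]:
            word = (word << 1) | bit   # first tick ends up as the MSB
        words.append(word)
    return words

def mux_8_to_1(words):
    """Inverse: serialize 8-bit words back onto the pin, MSB first."""
    return [(w >> (7 - k)) & 1 for w in words for k in range(8)]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
assert mux_8_to_1(demux_1_to_8(bits)) == bits   # round-trip is lossless
print(demux_1_to_8(bits))
```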
In addition to the four-bit control buses, the clock for each Demux and Mux block can be halted with a set of control signals (StopRQ, StopRDA, StopRDB, StopTQ, StopTDA, and StopTDB). Note that the RQ bus has been further subdivided into a three-bit ROW bus and a five-bit COL bus, and the RDataQ and TDataQ buses have been subdivided into 24- and 40-bit buses. This permits a finer degree of control of the RDRAMs on the Channel with the four-bit control buses. For example, the RDRAM's tRCD interval (the offset between ROW and COL packets) can be fine-tuned with tCYCLE granularity using this feature. Also, note that the four-bit control buses (RDSel[3:0] and TDSel[3:0]) for the DQA and DQB Mux and Demux blocks are shared.
The RClk and TClk blocks are shown broken into two pieces each in Figure 2. In addition to the RCLK and TCLK drivers shown at the top of the figure (for receiving the CTM/CTMN/CFM/CFMN RSL signals), there are two additional blocks in the lower left corner. These blocks receive and drive CMOS signals used by the ASIC. The RClk block receives the StopRQ, StopRDA, StopRDB, PwrUp, Nap, and Hold signals for managing transitions between the Direct RAC power states. In addition, the Reset signal initializes the block to a known state. Likewise, the TClk block receives the StopTQ, StopTDA, StopTDB, PwrUp, Nap, and Hold signals for managing transitions between the Direct RAC power states. In addition, the Reset signal initializes the block to a known state, and the ClkLock output indicates when stable clocks are available.
The RClk block is also responsible for generating the SynClk clock output. This is needed by the ASIC in order to communicate synchronously with the CMOS buses that connect to the Mux and Demux blocks. The RClk block also uses the SynClkIn input and SynClkFd output to synchronize two or more Direct RACs on a single ASIC. The PhStall input provides a second mechanism for synchronization. The Current Control block is responsible for keeping the output sink current (IOL) and slew rate adjusted to optimal values. It contains hardware for compensating for temperature, voltage, and process variations within the memory subsystem. The CCtlAuto, CCtlEn, CCtlLd, and SRCtrl signals control when the hardware compensation takes place. The CCtlIn[6:0] bus allows the automatic compensation to be overridden manually.
5.3 The Rambus Memory Controller
The RMC handles all RDRAM protocol and housekeeping functions. It provides optimized support for 16-byte transfers as well as for variable burst length requests. It interfaces directly to the RAC.
RMC2 is referred to as a constraint-based memory controller. That is, it explicitly models all the logical and physical constraints on the operation of a Rambus memory system. To simplify the RMC2 design, logical constraints (e.g. a bank can't be activated unless all neighbor banks are precharged) are considered separately from timing constraints (e.g. a bank can't be activated until tRP after it is precharged) and retire constraints (e.g. following a write, that word can't be read until the write data has been transferred to the DRAM core).
It has four main units or blocks. They are:
Bus Interface Unit (BIU).
Protocol Module (PM).
Constraint Module (CM).
Maintenance Module (MM).
Logical constraints are tracked by the Protocol Module (PM), which receives transaction requests from the Bus Interface Unit and requests all the Row and Column packets necessary to implement the requested transaction, in the correct logical order. Timing and retire constraints are tracked by the Constraint Module (CM), which receives the packet requests from the PM and outputs the formatted packets to the RAC when all timing and retire constraints are satisfied.
Within the PM there is a separate Service Protocol Unit (SPU) for each outstanding transaction, and within the CM there is a separate Constraint Timer for each packet whose timing constraints have not all been satisfied.
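The Constraint Timer idea can be illustrated with a minimal sketch: a packet may only issue once all of its timing constraints (such as tRP after a precharge) have expired. The class name and the cycle values below are illustrative, not taken from the RMC2 implementation:

```python
class ConstraintTimer:
    """Sketch of a CM constraint timer: holds the earliest cycle at
    which its packet may legally be issued to the RAC."""

    def __init__(self, ready_cycle):
        self.ready_cycle = ready_cycle

    def satisfied(self, now):
        """True once the timing constraint has expired."""
        return now >= self.ready_cycle

# Example: a bank precharged at cycle 100 with an assumed tRP of
# 8 cycles cannot be activated before cycle 108.
tRP = 8
activate = ConstraintTimer(ready_cycle=100 + tRP)
assert not activate.satisfied(107)
assert activate.satisfied(108)
```

The CM would hold one such timer per in-flight packet and release a packet only when every timer guarding it reports satisfied.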
6. PCB AND PACKAGING
6.1 The Difficulty
Because the Rambus Channel operates at data rates up to 800 million transfers/second, it exhibits all the properties of an RF signal. Phenomena like reflections and crosstalk take on unprecedented importance in the Rambus environment. The key to a successful design implementation is step-by-step adherence to the Rambus design rules, starting with the all-important circuit board impedance specification.
It's no secret that high-speed signals travelling along a transmission line tend to reflect energy backward (toward their source) when they encounter a change in impedance. The amount of reflected energy depends on both the energy of the original transmission and the magnitude of the impedance change. Due to the innately high speed of Rambus circuits, reflected energy can make the difference between a circuit that works and one that doesn't. Consider this startling fact: at a given instant in time, approximately three bits of information are in transit between a source and a destination (for example, between the memory controller IC and an RDRAM). In conventional dynamic RAM circuits, a single bit (that is, the energy representing a pulse) travels down the transmission line alone, reaches its destination, and dissipates any reflections before the next bit is launched. Not so in the Rambus world.
At Rambus speeds, the second and third bits in transit encounter any reflections that occur when that first bit hits the impedance mismatch. A common mismatch point is the node at which the signal enters a connector or IC pin. The reflection can degrade signal margins and cause timing errors, potentially causing failures.
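The size of such a reflection follows the standard transmission-line relation Γ = (ZL − Z0)/(ZL + Z0), and the several-bits-in-flight situation falls out of simple arithmetic. The sketch below uses illustrative, assumed values (the 40Ω stub impedance, FR4 propagation delay, and trace length are not from the Rambus specification):

```python
def reflection_coefficient(z_load, z0):
    """Voltage reflection coefficient at an impedance discontinuity:
    Gamma = (ZL - Z0) / (ZL + Z0)."""
    return (z_load - z0) / (z_load + z0)

# Illustrative mismatch: a 28-ohm Channel meeting a 40-ohm stub
# reflects about 18% of the incident voltage amplitude.
gamma = reflection_coefficient(40.0, 28.0)

# Bits in flight at 800 MT/s: each bit occupies 1.25 ns on the wire.
bit_time_s = 1 / 800e6
prop_delay_s_per_m = 6.7e-9   # ~6.7 ns/m on FR4 (assumed)
channel_len_m = 0.5           # illustrative long-Channel trace length
bits_in_flight = channel_len_m * prop_delay_s_per_m / bit_time_s
# bits_in_flight comes out near 3: several bits share the wire at once
```

With roughly three bit-times of signal on the wire simultaneously, a reflection launched by the first bit arrives while later bits are still propagating, which is exactly the interference mechanism described above.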
6.2 Packaging- An Overview
Rambus systems can be classified either as short Channel systems or as long Channel systems depending on the electrical length of the Rambus Channel. Long Channel systems typically have eight or more RDRAMs per Channel and consist of a motherboard and one or more memory modules. Such an arrangement allows the user to expand the memory size either by increasing the number of modules plugged into the motherboard or by using higher capacity modules. The impedance of long Channel systems is generally constrained to 28Ω because of the need to conform to the impedance of available memory modules. Short Channel systems typically have fewer than five RDRAMs per Channel and are implemented on a common motherboard. An example of a short Channel system is a graphics controller which uses RDRAMs for the frame buffer. Such systems usually have a fixed memory size which cannot be changed by the user.
The slave packaging is crucial to maintaining a uniform transmission line environment for the Channel. Since there can be many RDRAMs in a system, the stub introduced by the leads of the device must be kept as small as possible. RDRAMs are designed such that the internal bonding pads of the die are pitch matched to the Channel traces on the printed circuit board (PCB). This allows the lead frame to have a uniform length of about 2 mm, which in turn enables pin parasitics of less than 2 pF and 3 nH.
All RSL data signals, the SCK and CMD CMOS signals, and the ClockFromMaster differential clock pair should be terminated at the end of the Channel. Each of the signal classes has a different termination scheme, as shown in Figure
6.2.2 RSL Data and Address
Each RSL data and address signal must be terminated by a resistor, RTERM, to the termination voltage VTERM. RTERM should have a resistance 1Ω less than the characteristic impedance of the Channel (e.g. a 40Ω Channel should have 39Ω RTERM resistors and a 34Ω Channel should have 33Ω RTERM resistors). RTERM resistors should have 2% tolerance.
The ClockFromMaster differential clock pair should be terminated differentially by connecting a resistance, equal to twice the Channel characteristic impedance, between the two lines. This is accomplished by connecting each line of the clock pair to a resistor, RCHANNEL, whose value equals the Channel impedance. The other ends of the two resistors are then tied together to achieve the desired termination value. RDRAM pin-to-pin variations in the input capacitance may result in unequal capacitive loading on the CTM/ CFM and CTMN/CFMN lines, and consequently, the signal amplitudes on the two lines of the differential pair may be slightly different. A capacitor between the junction of the two RCHANNEL resistors and ground is recommended to compensate for the unequal loading.
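The two termination rules above translate directly into a small calculation. The sketch below encodes the 1Ω-below-Z0 rule for RTERM and the 2×Z0 differential clock termination; the function names are invented for the example:

```python
def rsl_rterm(z0):
    """RSL data/address termination: RTERM sits 1 ohm below the
    Channel characteristic impedance, per the rule above."""
    return z0 - 1

def clock_termination(z0):
    """ClockFromMaster termination: each clock line gets a resistor
    RCHANNEL equal to Z0; the two in series give 2*Z0 line-to-line."""
    r_channel = z0
    differential = 2 * r_channel
    return r_channel, differential

# The worked values from the text:
assert rsl_rterm(40) == 39 and rsl_rterm(34) == 33
# A 28-ohm long-Channel system: 28-ohm resistors, 56 ohms differential.
assert clock_termination(28) == (28, 56)
```

The compensation capacitor at the junction of the two RCHANNEL resistors is not modeled here; its value depends on the actual pin-to-pin capacitance mismatch.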
7. CONCLUSION

RDRAM memory ensures that the computing experience is unbounded by memory and increases the lifecycle of systems by providing:
Balance: 3.2GB/s 2-Channel RDRAM memory matches today's Pentium 4 Processor front side bus.
Performance: Twice the Effective Bandwidth of competing technologies in half the pins.
Value: System performance unencumbered by memory allows RDRAM systems with slower CPUs to outperform faster systems on competing memory technologies.
Proven Technology: RDRAM systems in mass production since 1999 in over 250 PC systems shipping tens of millions of units.
Granularity & Simplicity: Single-device upgrades, cost-effective board and module solutions.
Evolutionary Roadmap: Faster devices plus wider modules enable 9.6GB/s performance from a single module in 2004 with no device changes.
8. REFERENCES

1. Mike Sobleman, Rambus Technology Basics, Rambus Developers Forum, October 2001.
2. Abhijit Mahajan, Board Layout, Rambus Developers Forum, October 2001.
3. Frank Fox, RDRAM Device, Rambus Developers Forum, October 2001.
4. James A. Gasbarro, The Rambus Memory System, IEEE, February 1997.
5. Rambus, IEEE Spectrum, May 2001.
I would like to express my gratitude to our principal,
Prof. K. Achuthan, for providing the facilities required for the completion of the seminar.
Next, I would like to thank the Head of the Computer Department, Mr. Agni Sarman Namboodiri. I would also like to thank my seminar conductor Mr. Zaheer and Ms. Deepa for their excellent guidance in the preparation and presentation of the topic.
And finally, to the most important person, the God Almighty, for without his blessings, all this wouldn't have been possible.
1. INTRODUCTION
2. RDRAM-WITH A DIFFERENCE
2.1 The Memory Landscape
2.2 Comparing SDRAM & RDRAM Architectures
2.3 Performance
2.4 System Cost
2.5 Power Dissipation
2.6 RDRAM Advantages
3. THE RAMBUS MEMORY CHANNEL
3.1 An Overview
3.2 Channel Operation
4. THE RAMBUS PROTOCOL
4.1 Packets
4.2 Transactions
4.3 Concurrency
5. INTERFACING
5.1 The High-Speed Memory Interface
5.2 The D-RAC Core At A Glance
5.3 The Rambus Memory Controller
6. PCB AND PACKAGING
6.1 The Difficulty
6.2 Packaging- An Overview
7. CONCLUSION
8. REFERENCES