High Bandwidth Memory (HBM): Everything You Need to Know
https://www.rambus.com/blogs/hbm3-everything-you-need-to-know/

[Updated on October 30, 2025] In an era where data-intensive applications, from AI and machine learning to high-performance computing (HPC) and gaming, are pushing the limits of traditional memory architectures, High Bandwidth Memory (HBM) has emerged as a high-performance, power-efficient solution. As industries demand faster, higher throughput processing, understanding HBM’s architecture, benefits, and evolving role in next-gen systems is essential.

In this blog, we’ll explore how HBM works, how it compares to previous generations, and why it’s becoming the cornerstone of next-generation computing.


What is High Bandwidth Memory (HBM) and How is it Reshaping the Future of Computing?


As computing races toward higher speeds and greater efficiency, memory bandwidth has emerged as a major bottleneck for workloads like AI, high-performance computing, and data analytics. This is where High Bandwidth Memory (HBM) comes in. HBM is a cutting-edge 2.5D and 3D memory architecture designed with an exceptionally wide data path, enabling massive throughput and performance gains. Unlike traditional memory architectures that rely on horizontal layouts and narrow interfaces, HBM takes a vertical approach: stacking memory dies atop one another and connecting them with through-silicon vias (TSVs). This 3D-stacked design drastically shortens data travel paths, enabling higher bandwidth and lower power consumption in a compact footprint.

HBM operates at multi-gigabit-per-second data rates. When you combine that speed with a very wide data path, the result is staggering bandwidth, often measured in hundreds of gigabytes per second (GB/s) and even reaching into the terabytes per second (TB/s) range.

To put this into perspective, an HBM4 device running at a data rate of 8 Gb/s per pin delivers 2.048 TB/s of bandwidth. That level of performance is what makes HBM4 a leading choice for AI training hardware.
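The arithmetic behind that figure is simple: multiply the interface width by the per-pin data rate and divide by eight to convert bits to bytes. A minimal Python sketch using the HBM4 numbers above:

    def hbm_peak_bw_gbs(interface_width_bits, data_rate_gbps):
        """Peak bandwidth in GB/s = interface width (bits) x per-pin data rate (Gb/s) / 8."""
        return interface_width_bits * data_rate_gbps / 8

    print(hbm_peak_bw_gbs(2048, 8.0))  # 2048.0 GB/s, i.e. ~2.048 TB/s per HBM4 device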

What is a 2.5D/3D Architecture?

2.5D and 3D architectures refer to advanced integration techniques that improve performance, bandwidth, and power efficiency by bringing components closer together—literally.

HBM4 Uses a 2.5D/3D Architecture

3D Architecture
The “3D” part is easy to see. In 3D architecture, chips are stacked vertically and connected through TSVs (vertical electrical connections that pass through the silicon dies). An HBM memory device is a packaged 3D stack of DRAM, forming a compact, high-performance memory module. Think of it as a high-rise building of chips with elevators (TSVs) connecting the floors.

2.5D Architecture
In a 2.5D setup, multiple chips (such as a CPU or GPU and, in our case, HBM device stacks) are placed side-by-side on a silicon interposer, a thin substrate of silicon that acts as a high-speed communication bridge. The interposer contains the fine-pitch wiring that enables fast, low-latency connections between the chips.
Why do we need a silicon interposer? The data path between each HBM4 memory device and the processor requires 2,048 “wires” or traces. With the addition of command/address signals, clocks, etc., the number of traces necessary grows to about 3,000.

Thousands of traces are far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect memory device(s) and processor. As with an integrated circuit, finely spaced traces can be etched in the silicon interposer enabling the desired number of wires needed for the HBM interface. The HBM device(s) and the processor are mounted atop the interposer in what is referred to as a 2.5D architecture.

HBM uses both 2.5D and 3D architectures described above, so it’s a 2.5D/3D architecture memory solution.

How is HBM4 Different from HBM3E, HBM3, HBM2, or HBM (Gen 1)?

HBM4 represents a significant leap forward from its predecessors—HBM3E, HBM3 and earlier generations—in terms of bandwidth, capacity, efficiency and architectural innovation. With each generation, we’ve seen an upward trend in data rate, 3D-stack height, and DRAM chip density. That translates to higher bandwidth and greater device capacity with each upgrade of the specification.

When HBM launched, it started with a 1 Gb/s data rate and a 1024-bit wide interface. HBM delivered 128 GB/s of bandwidth, a huge step forward at the time.
Since then, every generation has pushed the limits a little further. HBM2, HBM3, and now HBM3E have all scaled bandwidth primarily by increasing the data rate. For example, HBM3E runs at 9.6 Gb/s, enabling 1229 GB/s of bandwidth per stack. That’s impressive, but HBM4 takes things to an entirely new level. HBM4 doesn’t just tweak the speed; it doubles the interface width from 1024 bits to 2048 bits. This architectural shift means that even at a modest 8 Gb/s data rate, HBM4 can deliver 2.048 TB/s of bandwidth per stack, roughly two-thirds more than what HBM3E offers.

Chip architects aren’t stopping at one stack. In fact, they’re designing systems with higher attach rates to feed the insatiable appetite of AI accelerators and next-gen GPUs. Imagine a configuration with eight HBM4 stacks, each running at 8 Gb/s. The result? A staggering 16.384 TB/s of memory bandwidth. That’s the kind of throughput needed for massive AI models and high-performance computing workloads.
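The same arithmetic, extended to a multi-stack design, shows where that system-level number comes from (a sketch assuming the eight-stack, 8 Gb/s configuration described above):

    stacks = 8
    per_stack_gbs = 2048 * 8.0 / 8          # 2048 GB/s per HBM4 stack
    print(stacks * per_stack_gbs / 1000)    # 16.384 TB/s of aggregate memory bandwidth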

The table below shows the key differences between HBM4 and its earlier generations.

Generation Data Rate (Gb/s) Interface Width (b) Bandwidth per Device (GB/s) Stack Height Max. DRAM Capacity (Gb) Max. Device Capacity (GB)
HBM 1.0 1024 128 8 16 16
HBM2 2.0 1024 256 8 16 16
HBM2E 3.6 1024 461 12 24 36
HBM3 6.4 1024 819 16 32 64
HBM3E 9.6 1024 1229 16 32 64
HBM4 8.0 2048 2048 16 32 64

What are the Additional Features of HBM4?

But that’s not all. HBM4 also introduces enhancements in power, memory access and RAS over HBM3E.

    • Double the Memory Channels: HBM4 doubles the number of independent channels per stack to 32 with 2 pseudo-channels per channel. This provides designers more flexibility in accessing the DRAM devices in the stack.
    • Improved Power Efficiency: HBM4 supports VDDQ options of 0.7V, 0.75V, 0.8V or 0.9V and VDDC of 1.0V or 1.05V. The lower voltage levels improve power efficiency.
    • Compatibility and Flexibility: The HBM4 interface standard ensures backwards compatibility with existing HBM3 controllers, allowing for seamless integration and flexibility in various applications.
    • Directed Refresh Management (DRFM): HBM4 incorporates Directed Refresh Management (DRFM) for improved Reliability, Availability, and Serviceability (RAS) including improved row-hammer mitigation.
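To see how the 2,048-bit HBM4 interface maps onto those channels, here is a quick sketch of the per-channel widths, derived purely from the figures above:

    interface_bits = 2048
    channels = 32                 # independent channels per HBM4 stack
    pseudo_per_channel = 2

    per_channel = interface_bits // channels          # 64 bits per channel
    per_pseudo = per_channel // pseudo_per_channel    # 32 bits per pseudo-channel
    print(per_channel, per_pseudo)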

Rambus HBM Memory Controller Cores for AI and High-Performance Workloads

Rambus delivers a comprehensive portfolio of HBM controller cores engineered for maximum speed and efficiency. Designed for high bandwidth and ultra-low latency, these controllers enable cutting-edge performance for AI training, machine learning, and advanced computing applications.

The lineup includes our industry-leading HBM4 memory controller, supporting data rates up to 10 Gb/s and offering exceptional flexibility for next-generation workloads. With Rambus HBM controllers, designers can achieve superior throughput, scalability, and reliability for demanding AI and HPC environments.

Summary

As computing demands continue to skyrocket, HBM stands out as a transformative technology that addresses the critical bottleneck of memory bandwidth. By leveraging advanced 2.5D and 3D architectures, HBM delivers massive throughput, exceptional power efficiency, and scalability for next-generation workloads. With HBM4 doubling interface width and introducing new features for flexibility and reliability, it is poised to become the backbone of AI, HPC, and data-intensive applications. Understanding this evolution is key to achieving the performance required for tomorrow’s most demanding systems.

Explore more resources:
HBM4 Memory: Break Through to Greater Bandwidth
Unleashing the Performance of AI Training with HBM4
Ask the Experts: HBM3E Memory Interface IP

MIPI: Powering the Future of Connected Devices
https://www.rambus.com/blogs/mipi-powering-the-future-of-connected-devices/

From the first monochrome mobile displays to today’s ultra-high-definition automotive dashboards and immersive AR/VR headsets, MIPI technology has quietly become the backbone of modern data connectivity. Let’s explore how MIPI standards have evolved, the markets they serve, and why Rambus is at the forefront of this transformation.


What does MIPI stand for?

MIPI stands for Mobile Industry Processor Interface. The MIPI Alliance’s primary mission is to develop interface specifications that standardize communication between components in mobile and mobile-influenced devices.

At the time of its creation, the mobile industry was rapidly evolving, but it lacked standardized interfaces for connecting components like cameras, displays, and processors. Each manufacturer often used proprietary solutions, which led to:

  • Increased development costs
  • Longer time-to-market
  • Compatibility issues
  • Limited scalability and innovation

Over the years, its scope has expanded dramatically: MIPI now covers a wide range of physical and protocol layers, enabling high-speed, low-power, and low-latency data transfer for everything from smartphones to smart cars.

The MIPI Protocols

At its core, the MIPI protocol defines how data moves between components inside a device. This includes both the physical layer—how bits are transmitted electrically—and higher-level rules for organizing and managing that data. The most widely used MIPI protocols are:

  • CSI-2 (Camera Serial Interface): Handles high-speed image sensor data, crucial for modern cameras in phones and cars.
  • DSI-2 (Display Serial Interface): Transmits video data from the processor to the display.
  • D-PHY, C-PHY, A-PHY: These are physical layer standards that support different speeds, cable lengths, and use cases.

A common point of confusion is the difference between MIPI and DSI-2. In fact, DSI is a specific protocol within the MIPI umbrella, focused on display data. Similarly, CSI-2 is MIPI’s protocol for camera data. Both use the same underlying physical layers but serve different functions: CSI-2 brings images from sensors to the processor, while DSI-2 sends processed video to the display.
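To get a sense of the data rates these links carry, here is a rough, illustrative calculation of the raw bandwidth of a 4K/60 RGB888 display stream and the number of high-speed lanes it implies; the 9 Gbps-per-lane figure is an assumption for illustration, and protocol overhead and blanking are ignored:

    import math

    width, height, fps, bpp = 3840, 2160, 60, 24    # 4K at 60 Hz, 24 bits per pixel
    raw_gbps = width * height * fps * bpp / 1e9     # ~11.9 Gb/s of raw pixel data
    lane_rate_gbps = 9.0                            # assumed per-lane D-PHY rate
    lanes = math.ceil(raw_gbps / lane_rate_gbps)    # -> 2 lanes before overhead
    print(f"{raw_gbps:.1f} Gb/s raw, {lanes} lanes at {lane_rate_gbps} Gb/s")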

MIPI vs. Other Interfaces: SPI and LVDS

To understand MIPI’s advantages, it helps to compare it with other well-known protocols.

SPI (Serial Peripheral Interface) is a simple, widely used protocol for connecting basic peripherals such as sensors and low-resolution displays. While SPI is easy to implement and cost-effective, it’s limited in speed and scalability. In contrast, MIPI interfaces—such as CSI (Camera Serial Interface) and DSI (Display Serial Interface)—support much higher data rates, are optimized for low power consumption, and use differential signaling for better noise immunity. This makes MIPI ideal for high-resolution cameras and displays where large amounts of data need to be transferred quickly and efficiently.

LVDS (Low-Voltage Differential Signaling) was once the standard for connecting displays and other high-speed peripherals. While LVDS also uses differential signaling, it lacks the advanced protocol features and scalability of MIPI. MIPI’s packetized, high-speed data transfer and support for virtual channels allow for more efficient and flexible system designs, especially as devices become more complex.

Target Markets: Where MIPI Shines

MIPI’s versatility is reflected in the diversity of its target markets:

  • Mobile & Tablets: MIPI DSI-2 is the leading display interface, powering the crisp visuals and fast refresh rates of today’s smartphones and tablets.
  • Automotive: Modern vehicles rely on MIPI for everything from ADAS cameras and driver monitoring systems to high-resolution infotainment and digital cockpits. The robust, scalable, and low-latency nature of MIPI protocols—especially with the introduction of long-reach A-PHY—makes them ideal for the demanding requirements of automotive environments.
  • AR/VR: MIPI’s high bandwidth and low power consumption enable the ultra-high pixel density and rapid frame rates needed for immersive AR/VR experiences.
  • IoT & Wearables: Low power, small form factor, and low EMI make MIPI a natural fit for battery-powered IoT devices and wearables, where efficiency and reliability are paramount.

MIPI and Automotive: A Transformational Use Case

Perhaps the most compelling story for MIPI today is its rapid adoption in the automotive sector. Modern vehicles are evolving into sophisticated sensor platforms, with advanced driver-assistance systems (ADAS) and autonomous driving features relying on a fusion of data from cameras, radar, LIDAR, and ultrasound sensors. Each of these sensors generates massive amounts of data that must be transmitted quickly, reliably, and securely to electronic control units (ECUs) for real-time processing.

MIPI CSI-2 has emerged as the protocol of choice for this sensor data transport. Its high throughput and low latency are essential for applications like emergency braking or lane-keeping, where every millisecond counts. MIPI’s flexible physical layers—D-PHY and C-PHY for short-reach connections, A-PHY for long-reach links—allow automakers to design zonal architectures that reduce wiring complexity and weight, improving both reliability and efficiency.

Security and functional safety are paramount in automotive applications. The latest MIPI specifications, such as the Camera Service Extension (CSE), add robust authentication, encryption, and error detection capabilities, ensuring that sensor data remains trustworthy from the edge to the processor. This is critical not only for passenger safety but also for protecting vehicles from cyber threats.

IP providers like Rambus are at the forefront of implementing these advanced MIPI features. Rambus MIPI controller IP supports the latest CSI-2 versions, enabling sensor aggregation, advanced compression, and seamless integration with both short- and long-reach physical layers. Rambus next-generation CSI-2 controllers will incorporate CSE for end-to-end security and functional safety, helping automakers meet the stringent requirements of next-generation vehicles.

Keep on reading:
MIPI DSI-2 & VESA Video Compression Drive Performance for Next-Generation Displays

The Rambus Offering: Integrated, High-Performance MIPI Solutions

Rambus, a leader in interface IP, offers a comprehensive portfolio of MIPI solutions designed for next-generation applications:

MIPI CSI-2 Controller

  • Fully CSI-2 standard compliant
  • 32-bit, 64-bit and now 128-bit core widths available
  • Transmit and Receive versions
  • Supports 1-8, 9.0+ Gbps D-PHY data lanes
  • Supports 1-4, 6.0+ Gsym/s C-PHY lane (trio)
  • Supports all data types
  • Easy-to-use pixel-based interface
  • Optional video interface
  • Delivered fully integrated and verified with target MIPI PHY
  • Delivered with a CSI-2 Testbench
  • Support optional FPGA-based system validation
  • Optional ASIL-B Ready safety deliverables

MIPI DSI-2 Controller

  • Fully DSI-2/DSI standard compliant
  • 32-bit or 64-bit core widths available
  • Host (Tx) and Peripheral (Rx) versions
  • Supports 1-4, 9.0+ Gbps D-PHY data lanes
  • Supports 1-4, 6.0+ Gsym/s C-PHY lane (trio)
  • Supports all data types
  • Easy-to-use native interface
  • Optional video interface
  • Delivered fully integrated and verified with target MIPI PHY
  • Delivered with a DSI-2 Testbench
  • Support optional FPGA-based system validation
  • Optional ASIL-B Ready safety deliverables

Advanced Video Compression: VESA DSC and VDC-M

As display resolutions increase and bandwidth demands grow, efficient high image quality compression becomes essential. Rambus leads the industry with its implementation of VESA’s advanced compression technologies:

  • VESA DSC (Display Stream Compression): Delivers visually lossless compression at an appreciable 3:1 ratio, reducing a standard 24 bpp image to just 8 bpp. For HDR content at 30 bpp, DSC achieves an even more remarkable 3.75:1 compression ratio.
  • VESA VDC-M (Video Display Compression – Mobile): Takes compression further with sophisticated encoding tools, achieving up to 5:1 compression ratio. VDC-M can reduce a 30 bpp (bits per pixel) uncompressed image to just 6 bpp while maintaining visually lossless quality in many scenarios, and can even reach 6:1 compression in specific use cases like instrument cluster displays (for automotive).

These compression technologies are game-changers for bandwidth-constrained applications, enabling higher resolutions, faster refresh rates, and reduced power consumption without sacrificing visual quality. The Rambus implementation of these compression codecs offers the best-in-class performance. The IP easily integrates with their MIPI DSI-2 controllers and your choice of C/D-PHY, creating a complete, optimized display solution.
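To illustrate what a 3:1 ratio means in link-bandwidth terms (not how the codec itself works), here is a simple before/after calculation for a 4K/60 stream:

    width, height, fps = 3840, 2160, 60
    uncompressed_bpp, dsc_bpp = 24, 8               # DSC's visually lossless 3:1 example
    raw = width * height * fps * uncompressed_bpp / 1e9
    compressed = width * height * fps * dsc_bpp / 1e9
    print(f"{raw:.1f} Gb/s -> {compressed:.1f} Gb/s ({raw / compressed:.0f}:1)")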

Keep on reading:
VESA Display Stream Compression (DSC): The Complete Guide

Ready to Power Your Next Innovation?

As the demand for higher resolution, richer visuals, and smarter connectivity continues to grow, MIPI standards—and Rambus’s industry-leading solutions—are paving the way for the devices and vehicles of tomorrow. Whether you’re building the next breakthrough smartphone, an immersive AR headset, or a safer, smarter car, Rambus has the MIPI IP you need to succeed.

Discover how Rambus can accelerate your design. Visit www.rambus.com/interface-ip to learn more and get in touch with our experts today!

Explore more resources:
MIPI DSI-2 & VESA Video Compression Drive Performance for Next-Generation Displays
The Ultimate Guide to Secure Silicon: Certified Silicon IP
Leveraging VESA Video Compression & MIPI DSI-2 for High-Performance Displays

All You Need to Know About GDDR7
https://www.rambus.com/blogs/all-you-need-to-know-about-gddr7/

In this blog post, we explore everything you need to know about Graphics Double Data Rate, most commonly known as GDDR. Since its introduction in 2000, GDDR has become the primary memory technology for graphics cards, evolving through several generations—from GDDR2 up to the latest GDDR7—to provide ever-increasing speed and efficiency for advanced visual and computational applications.

Let’s dive right in to everything you need to know about GDDR in the blog below.


What Does GDDR Stand For?

GDDR stands for Graphics Double Data Rate. It is a specialized type of memory designed specifically for graphics processing units (GPUs) and is engineered to deliver high bandwidth for the demanding data transfer needs of modern graphics rendering and computation. Unlike standard DDR (Double Data Rate) memory, which is used for general system tasks and CPUs, GDDR is optimized for the parallel processing and rapid data throughput required by tasks such as gaming, 3D rendering, and AI workloads.

Today, GDDR has evolved into a state-of-the-art memory solution, with the latest GDDR7 specification offering speeds of up to 48 Gbps per pin and a bandwidth of 192 GB/s per device. Beyond gaming, GDDR has become a solution for AI accelerators and GPUs requiring high memory bandwidth to handle demanding workloads such as AI inference. The latest generation of GPUs and AI systems now leverage GDDR7 to meet the performance needs of these advanced applications.

Which is Faster DDR or GDDR?

GDDR memory is faster than DDR memory when it comes to bandwidth and data transfer rates. GDDR is specifically engineered for graphics cards and GPUs, prioritizing high bandwidth to handle large volumes of graphical data, such as high-resolution textures and complex 3D models. In contrast, DDR memory is optimized for general-purpose computing tasks managed by the CPU, focusing on lower latency rather than raw bandwidth.

For example, the latest GDDR7 memory can achieve per-pin speeds up to 48 Gbps and overall memory subsystem bandwidths reaching 1.5 terabytes per second, while DDR5, the fastest mainstream DDR standard, typically operates at data rates between 4.8 and 8.4 Gbps per pin. This makes GDDR significantly faster for GPU workloads, though DDR retains an advantage in latency and energy efficiency for CPU and multitasking environments.
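As a rough check on those figures, peak bandwidth is just bus width times per-pin data rate. A minimal sketch, assuming a 32-bit GDDR7 device, an eight-device (256-bit) graphics subsystem, and a 64-bit DDR5-8400 DIMM for comparison:

    def peak_bw_gbs(bus_bits, gbps_per_pin):
        """Peak bandwidth in GB/s = bus width (bits) x per-pin rate (Gb/s) / 8."""
        return bus_bits * gbps_per_pin / 8

    print(peak_bw_gbs(32, 48))      # 192.0 GB/s -- one 32-bit GDDR7 device at 48 Gb/s
    print(peak_bw_gbs(8 * 32, 48))  # 1536.0 GB/s (~1.5 TB/s) -- eight devices on a 256-bit bus
    print(peak_bw_gbs(64, 8.4))     # 67.2 GB/s -- a 64-bit DDR5-8400 DIMM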

When Did GDDR7 Launch?

JEDEC published the GDDR7 standard in March 2024, with memory vendors reaching mass production in 2025.

Key Features of GDDR7

  • Ultra-High Speed: GDDR7 supports data rates up to 32 Gbps per pin in its initial rollout, with a roadmap extending to 48 Gbps. At 48 Gbps, that is more than double the practical speed of previous generations like GDDR6X, which tops out at 21 Gbps.
  • Exceptional Bandwidth: Each GDDR7 device can deliver 128 GB/s of bandwidth at 32 Gbps (and 192 GB/s at 48 Gbps), providing the throughput needed for data-intensive workloads such as AI inference and next-generation graphics.
  • Advanced Signaling (PAM3): GDDR7 introduces three-level pulse amplitude modulation (PAM3) signaling, which transmits 50% more data per clock cycle compared to the NRZ (PAM2) used in previous generations. This innovation enables higher data rates without requiring higher clock speeds, improving efficiency and reducing signal integrity challenges.
  • Lower Voltage and Improved Efficiency: Operating at 1.2V, GDDR7 is more power-efficient than GDDR6X (1.35V), helping to manage overall system power consumption while delivering higher performance.
  • Enhanced Reliability and RAS Features: GDDR7 incorporates advanced data integrity features such as on-die ECC with real-time reporting, data poison detection, error check and scrub, and command address parity with command blocking. These features improve reliability, availability, and serviceability (RAS), which are critical for mission-critical AI and graphics applications.
  • Increased Channel Parallelism: GDDR7 moves from two 16-bit channels (in GDDR6) to four 10-bit channels (8 bits data, 2 bits error reporting), enabling greater parallelism and more efficient data handling.
  • JEDEC Standardization: GDDR7 is a JEDEC-approved open standard, ensuring broad industry support and interoperability.

These features make GDDR7 a state-of-the-art memory solution, delivering the high bandwidth, efficiency, and reliability required for the latest AI, gaming, and graphics workloads.

Jump to: GDDR Solutions »

What is the Difference Between GDDR6 and GDDR7?

GDDR7 represents a significant upgrade over GDDR6, offering higher performance, improved efficiency, and advanced features. The most notable difference is its speed: GDDR7 delivers data rates of up to 48 Gbps per pin, compared to GDDR6’s maximum of 24 Gbps. This results in up to 2x the bandwidth, enabling faster data access and processing for demanding applications like AI inference, gaming, and high-resolution rendering. Additionally, GDDR7 utilizes PAM3 signaling, which transmits 50% more data per clock cycle than GDDR6’s NRZ encoding. It also operates at a lower voltage (1.1–1.2V vs. GDDR6’s 1.35V), improving energy efficiency per bit. Furthermore, GDDR7 features four 8-bit channels per chip (compared to GDDR6’s two 16-bit channels), enhancing parallelism and reducing latency for real-time workloads.

Features GDDR6 GDDR7
Data Rate Up to 24 Gbps Up to 48 Gbps
Bandwidth per Device Up to 96 GB/s Up to 192 GB/s
Voltage 1.35V 1.1–1.2V
Signaling NRZ (PAM2) PAM3
Channels/Chip Two 16-bit channels Four 8-bit channels
Use Cases High-end gaming, VR AI workloads, 8K+ gaming

What is the Difference Between GDDR6X and GDDR7?

To understand the difference between GDDR6X and GDDR7, we first need to learn more about GDDR6X.

What is GDDR6X?

GDDR6X is a high-performance graphics memory standard designed to deliver faster data transfer rates and greater memory bandwidth (vs. GDDR6) for demanding GPU applications. The key innovation in GDDR6X is its use of PAM4 (Pulse Amplitude Modulation 4-level) signaling, which allows it to transmit two bits of data per clock cycle—double what traditional NRZ (Non-Return to Zero) signaling achieves in GDDR6. This enables GDDR6X to reach data rates of up to 21 Gbps per pin, significantly increasing overall memory bandwidth, which can reach up to 672 GB/s on a 256-bit bus.
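The difference between these signaling schemes comes down to how many bits each transmitted symbol can carry. A small sketch of the idealized and practical figures (the 3-bits-per-2-symbols mapping for PAM3 is the commonly cited GDDR7 encoding):

    import math

    # Idealized information content per symbol: log2(number of levels)
    for name, levels in [("NRZ/PAM2", 2), ("PAM3", 3), ("PAM4", 4)]:
        print(f"{name}: {math.log2(levels):.2f} bits/symbol (ideal)")

    # In practice, GDDR7 maps 3 bits onto 2 PAM3 symbols = 1.5 bits/symbol,
    # i.e. 50% more data per cycle than NRZ's 1 bit/symbol.
    print(3 / 2)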

GDDR7 and GDDR6X are both high-performance graphics memory standards, but they differ in several key technical aspects:

  • Data Rate: GDDR7 offers a much higher maximum speed, reaching up to 48 Gbps per pin in the future, compared to GDDR6X’s maximum of 21 Gbps per pin.
  • Signaling Technology: GDDR7 uses PAM3 (Pulse Amplitude Modulation with 3 levels), while GDDR6X relies on PAM4 (4 levels).
  • Bandwidth per Device: At 32 Gbps, a GDDR7 device can deliver 128 GB/s of bandwidth (rising to 192 GB/s at 48 Gbps), whereas GDDR6X at 21 Gbps achieves 84 GB/s per device.
  • Voltage and Efficiency: GDDR7 operates at a lower voltage (1.2V) compared to GDDR6X (1.35V), resulting in better power efficiency.
  • Standardization: GDDR7 is a JEDEC-approved open standard, ensuring broad industry support, while GDDR6X is a proprietary technology developed by Micron and NVIDIA.

Parameter GDDR7 GDDR6X
Max Speed Up to 48 Gbps 21 Gbps
Signaling PAM3 PAM4
Bandwidth per Device 192 GB/s (at 48 Gbps) 84 GB/s (at 21 Gbps)
Voltage 1.1-1.2V 1.35V

What is the Difference Between GDDR7 and HBM3?

GDDR7 and High Bandwidth Memory 3 (HBM3) are advanced memory technologies designed for GPUs and AI accelerators, but they serve distinct purposes and excel in different scenarios. GDDR7, the latest iteration of Graphics Double Data Rate memory, is optimized for high-speed, cost-effective applications, such as gaming and edge AI. On the other hand, HBM3 is tailored for ultra-high-bandwidth workloads in data centers, HPC (High-Performance Computing), and AI training, where efficiency and scalability are critical.

Feature GDDR7 HBM3
Bandwidth per Device 192 GB/s (at 48 Gb/s) 819 GB/s (at 6.4 Gb/s)
Bus Width 32-bit 1024-bit
Memory Configuration Soldered onto PCB Stacked DRAM in package
Use Cases Gaming, edge AI AI training, HPC, data center GPUs
Voltage 1.1-1.2V 1.1V core voltage
Cost Cost-effective Expensive due to silicon interposers and stacking technology
Scalability Flexible Limited configurability

Use Case Differences:

  • GDDR7 is designed for consumer-grade GPUs used in gaming PCs and edge devices where cost-effectiveness and high-speed performance are priorities.
  • HBM3 is reserved for flagship GPUs in data centers or HPC environments where bandwidth requirements far exceed those of mainstream applications.

While GDDR7 excels in delivering high-speed performance at a lower cost for gaming and edge AI applications, HBM3 dominates in scenarios demanding extreme bandwidth and efficiency, such as AI training or HPC workloads. Choosing between these memory types depends on the specific requirements of the application, balancing performance needs against cost considerations.

Jump to: HBM Solutions »

What is the Difference Between GDDR7 and LPDDR5?

GDDR7 and LPDDR5 are both advanced memory technologies, but they are designed for very different use cases and have distinct technical characteristics.

GDDR7 is the latest generation of graphics memory, primarily used in GPUs for high-performance computing, AI inference, and gaming. It is engineered to deliver extremely high bandwidth and data rates, making it ideal for applications that require rapid movement of large amounts of data, such as real-time graphics rendering and AI model inference. GDDR7 achieves this by employing advanced PAM3 signaling, supporting data rates up to 32 Gbps per pin initially (with a roadmap to 48 Gbps), and offering per-chip bandwidths as high as 128–192 GB/s. Its interface and architecture are optimized for speed and throughput, with moderate power efficiency.

LPDDR5, on the other hand, stands for Low Power DDR5 and is optimized for energy efficiency and compactness, making it the memory of choice for mobile devices, laptops, and other battery-powered systems. LPDDR5 typically supports data rates of 6.4–8.5 Gbps per pin, with a focus on minimizing power consumption through features like Dynamic Voltage Scaling and multiple low-power modes. While LPDDR5 is highly efficient and supports reasonable bandwidth for mobile and embedded applications, it cannot match the raw speed and throughput of GDDR7.

Key Differences Table:

Feature GDDR7 LPDDR5
Primary Use GPUs, AI accelerators Smartphones, laptops
Max Data Rate (per pin) 32–48 Gbps Up to 6.4–8.5 Gbps
Bandwidth per Device 128–192 GB/s ~34 GB/s
Signaling PAM3 NRZ (PAM2)
Voltage 1.1-1.2V 1.05V/0.9V (core), 0.5V/0.3V (I/O)
Power Efficiency Moderate High
Prefetch 32n 16n
Interface Width 32 bits 32 bits
Typical Application Graphics cards, AI edge servers Mobile devices, ultrabooks

GDDR7 excels in scenarios where maximum bandwidth and low latency are critical, such as AI inference and high-end gaming, but it consumes more power and is less suited for compact, battery-powered devices.

LPDDR5 is optimized for energy efficiency and space, making it ideal for mobile and portable applications, but it cannot deliver the same level of bandwidth as GDDR7.

Jump to: LPDDR Solutions »

How GDDR7 Supercharges AI Inference Performance

GDDR7 memory delivers transformative improvements for AI inference workloads through groundbreaking advancements in bandwidth, efficiency, and signaling technology.

Here’s how it achieves this:

  1.  Unmatched Bandwidth for Data-Intensive Models
    • Speed: GDDR7 operates at 32–48 Gbps per pin, doubling GDDR6X’s 21 Gbps limit. At 48 Gbps, each GDDR7 device provides 192 GB/s of bandwidth, enabling AI accelerators to process trillion-parameter models (e.g., LLMs) without data bottlenecks.
    • Scalability: A system requiring 500 GB/s bandwidth needs just 3 GDDR7 chips—compared to 13 LPDDR5X modules—reducing latency and complexity for edge AI deployments.
  2. Power Efficiency for Sustainable AI
    • Voltage: GDDR7 runs at 1.2V (vs. GDDR6X’s 1.35V), reducing power consumption by >10% per bit.
    • Dynamic Voltage Scaling: Adjusts power based on workload demands, critical for energy-constrained edge devices running continuous inference tasks.
  3. Advanced Signaling for Lower Latency
    • PAM3 Encoding: Transmits 50% more data per cycle than GDDR6’s NRZ, enabling faster throughput without higher clock speeds. This reduces inference latency for real-time applications like autonomous driving.
  4. Reliability for Mission-Critical Inference
    • On-Die ECC: Corrects errors in real time, ensuring data integrity for sensitive applications like medical diagnostics.
    • Standardization: As a JEDEC-approved technology, GDDR7 ensures broad compatibility and optimization across AI hardware ecosystems.
  5. Real-World Impact
    • Generative AI: GDDR7’s bandwidth handles large language models (LLMs) like GPT-4, enabling faster text/image generation.
    • Autonomous Systems: Low latency ensures rapid sensor data processing for real-time decision-making.
    • Edge Servers: Compact GDDR7-based systems deliver data center-level performance in retail, healthcare, and IoT.

By combining raw speed with intelligent power management, GDDR7 is redefining what’s possible for AI inference at the edge and beyond.

Keep on reading:
GDDR7 Memory Supercharges AI Inference

Conclusion

As AI inference models grow in size and complexity, the need for memory solutions that deliver both high bandwidth and low latency has never been greater. GDDR7 rises to this challenge, offering a leap in performance over previous memory technologies. With data rates starting at 32 Gbps per pin and a roadmap to 48 Gbps, GDDR7 provides up to 192 GB/s of bandwidth per device—more than double that of its predecessors and well ahead of alternatives like LPDDR5X. Its adoption of PAM3 signaling, enhanced reliability features, and improved power efficiency at 1.2V make it uniquely suited for the demands of next-generation GPUs and AI accelerators.

Compared to other memory types, GDDR7 stands out for its ability to efficiently feed data-hungry AI inference engines, enabling faster processing of large language models, real-time analytics, and advanced edge applications. Its balance of performance, scalability, and reliability ensures that designers can meet the requirements of both today’s and tomorrow’s AI workloads without compromise. As the industry moves forward, GDDR7 is set to become a cornerstone of high-performance computing, powering innovations in AI, gaming, and beyond.

Explore more resources:
GDDR Memory for High-Performance AI Inference
Supercharging AI Inference with GDDR7
From Training to Inference: HBM, GDDR & LPDDR Memory

Post-quantum Cryptography (PQC): New Algorithms for a New Era
https://www.rambus.com/blogs/post-quantum-cryptography-pqc-new-algorithms-for-a-new-era/

[Updated April 14, 2025] Post-Quantum Cryptography (PQC), also known as Quantum Safe Cryptography (QSC), refers to cryptographic algorithms designed to withstand attacks by quantum computers.

Quantum computers will eventually become powerful enough to break public key-based cryptography, also known as asymmetric cryptography. Public key-based cryptography is used to protect everything from your online communications to your financial transactions.

Quantum computing represents a major security threat and action is needed now to secure applications and infrastructure using Post-Quantum/Quantum Safe Cryptography.

This blog explains everything you need to know about the new algorithms designed to protect against quantum computer attacks.


What is quantum computing?

Quantum computing utilizes quantum mechanics to solve certain classes of complex problems faster than is possible on classical computers. Problems that currently take the most powerful supercomputers several years could potentially be solved in days.

Source: Quantum Could Solve Countless Problems —And Create New Ones | Time, February 2023

As such, quantum computers have the potential to deliver the computational power that could take applications like AI to a whole new level. Powerful quantum computers will become a reality in the not-so-distant future, and while they offer many benefits, they also present a major security threat.

Why are quantum computers a security threat?

Once sufficiently powerful quantum computers exist, traditional asymmetric cryptographic methods for key exchange and digital signatures will be broken. Leveraging Shor’s algorithm, quantum computers will be capable of reducing the security of discrete logarithm-based schemes like Elliptic Curve Cryptography (ECC) and factorization-based schemes like RSA (Rivest-Shamir-Adleman) so much that no reasonable key size would suffice to keep data secure. ECC and RSA are the algorithms used to protect everything from our bank accounts to our medical records.

Governments, researchers, and tech leaders the world over have recognized this quantum threat and the difficulty in securing critical infrastructure against attacks from quantum computers.

“A quantum computer of sufficient size and sophistication — also known as a cryptanalytically relevant quantum computer (CRQC) — will be capable of breaking much of the public-key cryptography used on digital systems across the United States and around the world.

When it becomes available, a CRQC could jeopardize civilian and military communications, undermine supervisory and control systems for critical infrastructure, and defeat security protocols for most Internet-based financial transactions.”

National Security Memorandum on Promoting United States Leadership in Quantum Computing While Mitigating Risks to Vulnerable Cryptographic Systems, May 2022

What is Post-Quantum Cryptography (PQC)?

New digital signatures and key encapsulation mechanisms (KEMs) are needed to protect data and hardware from quantum attacks. Many initiatives have been launched throughout the world to develop and deploy new cryptographic algorithms that can replace RSA and ECC while being highly resistant to both classic and quantum attacks. Post-Quantum Cryptography (PQC) refers to these cryptographic algorithms designed to withstand attacks by quantum computers.
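Conceptually, a key encapsulation mechanism exposes three operations: key generation, encapsulation against a public key, and decapsulation with the matching secret key. The sketch below shows only that generic interface shape; the class and method names are hypothetical, and real deployments should use a vetted, standards-compliant library rather than anything hand-rolled:

    from abc import ABC, abstractmethod
    from typing import Tuple

    class KEM(ABC):
        """Generic KEM interface shape, as implemented by schemes like ML-KEM (FIPS 203)."""

        @abstractmethod
        def keygen(self) -> Tuple[bytes, bytes]:
            """Return (public_key, secret_key)."""

        @abstractmethod
        def encapsulate(self, public_key: bytes) -> Tuple[bytes, bytes]:
            """Return (ciphertext, shared_secret) for the holder of public_key."""

        @abstractmethod
        def decapsulate(self, secret_key: bytes, ciphertext: bytes) -> bytes:
            """Recover the same shared_secret from ciphertext using secret_key."""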

Is Quantum Safe Cryptography the same as Post-Quantum Cryptography (PQC)?

Yes, Quantum Safe Cryptography is another term for Post-Quantum Cryptography. Both refer to cryptographic algorithms designed to withstand attacks by quantum computers. Other terms that you may come across include Quantum Proof Cryptography or Quantum Resistant Cryptography.

Why do we need to act now if quantum computers are still a way off?

While quantum computers powerful enough to break public key encryption may still be a way off, data harvesting is happening now. Malicious actors are already said to be collecting encrypted data and storing it for the time when future quantum computers will be capable of breaking our current encryption methods. This is known as a “harvest now, decrypt later” strategy.

Further, because the shelf life of confidential or private information can span years or decades, there is a rapidly growing need to protect such data today to future-proof it against quantum attack. Additionally, for many devices such as chips, the development cycle is a long one. Given that it can take years for security testing, certification and deployment into the existing infrastructure, the earlier the transition to Quantum Safe Cryptography begins, the better.

What progress has been made to develop new PQC algorithms?

The biggest public initiative to develop and standardize new PQC algorithms was launched by The U.S. Department of Commerce’s National Institute of Standards and Technology (NIST). International teams of cryptographers submitted algorithm proposals, reviewed the proposals, broke some, and gained confidence in the security of others.

After multiple rounds of evaluations, on July 5th, 2022, NIST announced the first PQC algorithms selected for standardization. CRYSTALS-Kyber was selected as a Key Encapsulation Mechanism (KEM) and CRYSTALS-Dilithium, FALCON, and SPHINCS+ were selected as digital signature algorithms.

On August 24th, 2023, NIST announced the first three draft standards for general-purpose Quantum Safe Cryptography.

These draft standards are:

  • FIPS 203 ML-KEM: Module-Lattice-Based Key Encapsulation Mechanism Standard, which is based on the previously selected CRYSTALS-Kyber mechanism
  • FIPS 204 ML-DSA: Module-Lattice-Based Digital Signature Standard, which is based on the previously selected CRYSTALS-Dilithium signature scheme
  • FIPS 205 SLH-DSA: Stateless Hash-Based Digital Signature Standard, which is based on the previously selected SPHINCS+ signature scheme

What recommendations does CNSA 2.0 make for transitioning to PQC algorithms?

The National Security Agency (NSA) published an update to its Commercial National Security Algorithm Suite (CNSA) in September 2022, CNSA 2.0.

National Security Systems (NSS) will need to fully transition to PQC algorithms by 2033 and some use cases will be required to complete the transition as early as 2030. CNSA 2.0 specifies that CRYSTALS-Kyber and CRYSTALS-Dilithium should be used as quantum-resistant algorithms, along with stateful hash-based signature schemes XMSS (eXtended Merkle Signature Scheme) and LMS (Leighton-Micali Signatures).

CNSA 2.0 sets out an ambitious timeline for PQC algorithm adoption – other organizations across the globe are set to follow suit with their own guidelines.

Source: NSA Commercial National Security Algorithm Suite 2.0, September 2022

How can companies get ready for the Quantum Computing Era?

  • Understand where vulnerable cryptography like RSA or ECC is deployed in your products.
  • Investigate what performance impact a PQC transition will have on your products and what makes sense for your product roadmap.
  • Establish what transition timelines your products must observe.
  • Speak with your customers and suppliers to ensure that expectations and plans align.
  • Understand where vulnerable cryptography like RSA or ECC is deployed in your business infrastructure and business processes.
  • Talk to security experts like Rambus to understand how you can begin to transition to Quantum Safe Cryptography

What Quantum Safe IP solutions are available from Rambus?

Rambus Quantum Safe IP solutions offer a hardware-level security solution to protect data and hardware against quantum computer attacks using NIST and CNSA selected algorithms.

Rambus Quantum Safe IP products are compliant with FIPS 203 ML-KEM and FIPS 204 ML-DSA draft standards. Products are firmware programmable to allow for updates with evolving quantum-resistant standards.

The products can be deployed in ASIC, SoC and FPGA implementations for a wide range of applications including data center, AI/ML, defense and other highly secure applications.

Solution Applications
QSE-IP-86 Standalone engine providing Quantum Safe Cryptography acceleration
QSE-IP-86 DPA Standalone engine providing Quantum Safe Cryptography acceleration and DPA-resistant cryptographic accelerators
RT-634 Programmable Root of Trust with Quantum Safe Cryptography acceleration
RT-654 Programmable Root of Trust with Quantum Safe Cryptography acceleration and DPA-resistant cryptographic accelerators
RT-664 Programmable Root of Trust with Quantum Safe Cryptography acceleration and FIA-protected cryptographic accelerators
Quantum Safe IPsec Toolkit Quantum Safe complete IPsec implementation. Fast, scalable and fully compliant IPsec implementation. Used in cloud and virtual deployments, high traffic gateways, and embedded devices.
Quantum Safe Library Quantum Safe Cryptographic library offering future-proof cryptography by providing new quantum resistant algorithms and classic algorithms in a single package.

Keep Reading:
Bringing IPsec into the Quantum Safe Era
Rambus Expands Quantum Safe Solutions with Quantum Safe Engine IP
Rambus CryptoManager Root of Trust Solutions Tailor Security Capabilities to Specific Customer Needs with New Three-Tier Architecture

Summary

Quantum computing is being pursued across industry, government and academia with tremendous energy and is set to become a reality in the not-so-distant future. For many years, Rambus has been a leading voice in the PQC movement and now offers a portfolio of Quantum Safe IP solutions designed to offer hardware-level security using NIST and CNSA selected algorithms.

Explore more resources:
Hardware Root of Trust: Everything you need to know
Protecting Data and Devices Now and in the Quantum Computing Era
Quantum Safe Cryptography: Protecting Devices and Data in the Quantum Era

Hardware Root of Trust: Everything you need to know
https://www.rambus.com/blogs/hardware-root-of-trust/

[Last updated on April 8, 2025] A root of trust is the security foundation for an SoC, other semiconductor device or electronic system. However, its meaning differs depending on who you ask. From our perspective, the hardware root of trust contains the keys for cryptographic functions and is usually a part of the secure boot process providing the foundation for the software chain of trust.


What is hardware root of trust?

A hardware root of trust is the foundation on which all secure operations of a computing system depend. It contains the keys used for cryptographic functions and enables a secure boot process. It is inherently trusted and therefore must be secure by design. The most secure implementation of a root of trust is in hardware, making it immune to malware attacks. As such, it can be a stand-alone security module or implemented as a security module within a processor or system on chip (SoC).
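The chain of trust is easiest to picture as a sequence of verify-then-execute steps, with each boot stage checked against reference values anchored in the root of trust before control is handed over. The sketch below is purely conceptual and uses plain hash comparison; production secure boot verifies digital signatures with keys protected by the root of trust, and all names here are illustrative:

    import hashlib

    def verify_stage(image: bytes, expected_digest: str) -> bool:
        """Measure a boot image and compare it against a trusted reference digest."""
        return hashlib.sha256(image).hexdigest() == expected_digest

    def secure_boot(stages):
        """stages: iterable of (name, image_bytes, expected_digest) from trusted storage."""
        for name, image, digest in stages:
            if not verify_stage(image, digest):
                raise RuntimeError(f"Secure boot halted: {name} failed verification")
            print(f"{name} verified; passing control to next stage")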

What are the types of a silicon-based hardware root of trust?

A silicon-based hardware root of trust falls into two categories: fixed function and programmable. Essentially, a fixed-function root of trust is firmware controlled. These are typically compact and designed to perform a specific set of functions like data encryption, certificate validation and key management. These compact, firmware-controlled root of trust solutions are particularly well suited for Internet of Things (IoT) devices.

In contrast, a hardware-based programmable root of trust is built around a CPU. Performing all the functions of a firmware-controlled solution, a programmable root of trust can also execute a more complex set of security functions. A programmable root of trust is versatile and upgradable, enabling it to run entirely new cryptographic algorithms and secure applications to meet evolving attack vectors.

What are the benefits of a programmable hardware root of trust?

The cybersecurity threat landscape is dynamic and rapidly evolving. Indeed, attackers are constantly finding new ways to exploit critical vulnerabilities across a wide range of applications and devices. Fortunately, a programmable hardware-based root of trust can be continuously updated to contend with an ever-increasing range of threats.

A programmable hardware-based root of trust is a key component to protect against a number of security threats, including:

  • Host processor compromise
  • Non-volatile memory (NVM) key extraction
  • Tearing and other attacks against NVM writes
  • Corruption of non-volatile memory or fuses
  • Test and debug interface attacks
  • Side-channel and perturbation attacks including Simple Power Analysis (SPA), Differential Power Analysis (DPA) and Fault Injection Attacks (FIA)
  • Manufacturing/personalization facility compromise (insider attack)
  • Man-in-the-middle and replay attacks
  • Probing of external buses

What features should a programmable hardware root of trust offer?

A programmable hardware root of trust should be purpose-built; specifically designed from the ground up to provide a robust level of security. Since the root of trust is a logical target for an attacker, it should be made as secure as possible to safeguard it from compromise. Capabilities should include:

    • Siloed Execution:

      Ensures that sensitive security functions are only performed within a dedicated security domain that is physically separated from the general-purpose processor. This paradigm allows the primary CPU to be optimized for architectural complexity and performance – with security functionality safely isolated in a physically separated root of trust.

    • Comprehensive Anti-Tamper and Side-Channel Resistance:

      Protects against multiple fault injection and side-channel attacks.

    • Layered Security:

      Provides multiple layers of robust defense to avoid a single point of failure. Access to cryptographic hardware modules and other sensitive security resources are enforced in hardware, while critical keys are only available to hardware. Software security can be layered on top of a hardware-based root of trust, thereby providing additional flexibility and security.

    • Multiple Root of Trust Instances:

      Ensures isolation of resources, keys and security assets. In real-world terms, this means each entity – such as a chip vendor, OEM or service provider – has access to its own ‘virtual’ security core and performs secure functions without having to ‘trust’ other entities. This allows individual entities to possess unique root and derived keys, as well as access only to specified features and resources such as OTP, debug and control bits. Moreover, support for multiple root of trust instances enables the security core to assign or delegate permissions to other entities at any point in the device life cycle, while isolating (in hardware) unique signed apps that are siloed away from other programs.

What is the Rambus Root of Trust?

Rambus offers a catalog of robust Root of Trust solutions, ranging from richly featured military-grade co-processors to highly compact firmware-controlled cores. With a breadth of solutions applicable from the data center to IoT devices, Rambus has a Root of Trust solution for almost every application.

Rambus’ Parvez Shaik explains the importance of addressing supply chain vulnerabilities, the advantages of a hardware root of trust, and the new features of the third-generation CryptoManager Root of Trust products in this episode of Ask the Experts.

Jump to: Root of Trust solutions »

How is the Rambus Root of Trust architected for security?

The CryptoManager RT-6xx Root of Trust family from Rambus is the latest generation of fully programmable FIPS 140-3 compliant hardware security cores offering Quantum Safe security by design for data center and other highly secure applications. The CryptoManager RT-6xx family protects against a wide range of hardware and software attacks through state-of-the-art side channel attack countermeasures and anti-tamper and security techniques.

CryptoManager RT-6xx Series Root of Trust Block Diagram

The diagram above illustrates the basic architecture of the Rambus RT-600 series Root of Trust.

The CryptoManager RT-6xx Root of Trust is a siloed hardware security IP core for integration into semiconductors, offering secure execution of authenticated user applications, tamper detection and protection, secure storage and handling of keys and security assets, and optional resistance to side-channel attacks. The Root of Trust is easily integrated with industry-standard interfaces and system architectures and includes standard hardware cryptographic cores. Access to crypto modules, keys, memory ranges, I/O, and other resources is enforced in hardware. Critical operations, including key derivation and storage, are performed in hardware with no access by software. The Root of Trust is based on a custom 32-bit processor designed specifically to provide a trusted foundation for secure processing on chip and in the system.

The Root of Trust supports all common host processor architectures including ARM, RISC-V, x86 and others. The multi-threaded secure processor runs customer developed signed code either as a monolithic supervisor or as loadable security applications which include permissions and security-related metadata. It can implement standard security functionality provided by Rambus, or complete customer-specific security applications, including key and data provisioning, security protocols, biometric applications, secure boot, secure firmware update, and many more.

Keep on reading:
Rambus CryptoManager Root of Trust Solutions Tailor Security Capabilities to Specific Customer Needs with New Three-Tier Architecture

What is Quantum Safe Cryptography?

The CryptoManager RT-6xx Root of Trust series is at the forefront of a new category of programmable hardware-based security cores with its new Quantum Safe Cryptography features.

Once sufficiently powerful quantum computers exist, traditional asymmetric cryptographic methods for key exchange and digital signatures will be easily broken. New cryptographic algorithms known as quantum safe cryptography (QSC) or post-quantum cryptography (PQC) are needed to protect against quantum computer attacks.

The latest generation of Rambus Root of Trust IP offers a state-of-the-art programmable security solution to protect hardware and data with NIST and CNSA quantum-resistant algorithms. The Quantum Safe Engine operates with the CRYSTALS-Kyber and CRYSTALS-Dilithium algorithms, as well as the stateful hash-based signature schemes XMSS (eXtended Merkle Signature Scheme) or LMS (Leighton-Micali Signatures).

Learn more about Quantum Safe Cryptography:
Post-quantum Cryptography (PQC): New Algorithms for a New Era
Rambus Expands Quantum Safe Solutions with Quantum Safe Engine IP

Is there a Rambus Root of Trust configured for my application?

There are Rambus Root of Trust solutions tailored to address the specific security requirements and certification standards of nearly every application:

    • The RT-1xx series of Root of Trust solutions are designed for use in power and space-constrained applications as in IoT devices. Featuring a firmware-controlled architecture with dedicated secure memories, the RT-1xx hardware Root of Trust cores provide a variety of cryptographic accelerators including AES, SHA-2, RSA and ECC. There are versions which include SM2, SM3 and SM4 accelerators for the China market.
    • The CryptoManager RT-6xx is a fully programmable, FIPS 140-3 compliant, hardware security core offering security-by-design for data center cloud, AI/ML, as well as general purpose semiconductor applications. It protects against a wide range of hardware and software attacks through state-of-the-art anti-tamper and security techniques.
    • The CryptoManager RT-7xx is tailored for the automotive market offering ISO 26262 and ISO 21434 compliant hardware security. It supports vehicle-to-vehicle and vehicle-to-infrastructure (V2X), advanced driver-assistance systems (ADAS) and infotainment uses.
    • CryptoCell Root of Trust solutions are programmable, FIPS 140-3 certifiable hardware security modules. They are designed to be integrated into Arm TrustZone-based SoCs or FPGAs where power and space are a consideration.

Find out more: See all Rambus Root of Trust IP Solutions »

What should I keep in mind when selecting a Root of Trust IP?

Root of Trust product designs vary greatly in architecture and capabilities. When selecting a Root of Trust solution, it’s important to ask the right questions to ensure the best level of protection for your specific security needs.

Some questions to consider include:

  • What is the end use of the chip?
  • Who and what are you protecting against?
  • What is the risk of a compromised device?
  • What certifications are required?

It’s also worth noting that Root of Trust products can be tailored to match an application’s security threat model, use case, industry segment, lifetime, cost, and geography. Some examples of the different criteria that can be selected include the crypto algorithms, security/anti-tamper mechanisms, and provisioning methods used.

Next steps?

If you have any questions about how to select a Root of Trust for your next project, contact us here.

Explore more resources:
The Ultimate Guide to Secure Silicon: Root of Trust
Ask the Experts: PUF-based Security
Implementing State-of-the-Art Digital Protection with Rambus CryptoManager Security IP

Download our white paper: CryptoManager RT-6xx Root of Trust Family: A New Generation of Security Anchored in Hardware

 

DDR5 vs DDR4 DRAM – All the Advantages & Design Challenges
https://www.rambus.com/blogs/get-ready-for-ddr5-dimm-chipsets/

[Last updated on: July 29, 2024] On July 14th, 2021, JEDEC announced the publication of the JESD79-5 DDR5 SDRAM standard signaling the industry transition to DDR5 server and client dual-inline memory modules (DIMMs). DDR5 memory brings a number of key performance gains to the table, as well as new design challenges. Computing system architects, designers, and purchasers want to know what’s new in DDR5 vs DDR4 and how they can get the most from this new generation of memory.

In this article:

Performance: what changes in DDR5 vs DDR4 DRAM?

The top seven most significant specification advances made in the transition from DDR4 to DDR5 DIMMs are shown in the table below.

DDR5 vs DDR4 Comparison Table
DDR5 changes and advantages over DDR4 DIMMs

1. DDR5 Scales to 8.4 GT/s

You can never have enough memory bandwidth, and DDR5 helps feed that insatiable need for speed. While DDR4 DIMMs top out at 3.2 gigatransfers per second (GT/s) at a clock rate of 1.6 gigahertz (GHz), initial DDR5 DIMMs delivered a 50% bandwidth increase to 4.8 GT/s. DDR5 memory will ultimately scale to a data rate of 8.4 GT/s. New features, such as Decision Feedback Equalization (DFE), were incorporated in DDR5 enabling the higher IO speeds and data rates.
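To put those data rates in perspective, here is a minimal back-of-envelope sketch in Python of the theoretical peak throughput, assuming the standard 64-bit (8-byte) DIMM data path and ignoring protocol overhead and efficiency effects:

    # Peak throughput = data rate (GT/s) x 8 bytes per transfer (64-bit data bus)
    for name, gts in [("DDR4-3200", 3.2), ("DDR5-4800", 4.8), ("DDR5-8400", 8.4)]:
        print(f"{name}: {gts * 8:.1f} GB/s peak")
    # DDR4-3200: 25.6 GB/s, DDR5-4800: 38.4 GB/s, DDR5-8400: 67.2 GB/s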

2. Lower Voltage Keeps Power Manageable

A second major change is a reduction in operating voltage (VDD), which helps offset the power increase that comes with running at higher speed. With DDR5, the voltage for the DRAM and the registering clock driver (RCD) drops from 1.2 V down to 1.1 V. Command/Address (CA) signaling changes from SSTL to PODL, which has the advantage of burning no static power when the pins are parked in the high state.

3. New Power Architecture for DDR5 DIMMs

A third change, and a major one, is power architecture. With DDR5 DIMMs, power management moves from the motherboard to the DIMM itself.  DDR5 DIMMs will have a 12-V power management IC (PMIC) on DIMM allowing for better granularity of system power loading. The PMIC distributes the 1.1 V VDD supply, helping with signal integrity and noise with better on-DIMM control of the power supply.

4. DDR5 vs DDR4 Channel Architecture

Another major change with DDR5, number four on our list, is a new DIMM channel architecture. DDR4 DIMMs have a 72-bit bus, comprised of 64 data bits plus eight ECC bits. With DDR5, each DIMM will have two channels. Each of these channels will be 40-bits wide: 32 data bits with eight ECC bits. While the data width is the same (64 bits total), having two smaller independent channels improves memory access efficiency. So not only do you get the benefit of the speed bump with DDR5, but the benefit of that higher data rate is also amplified by greater efficiency.

In the DDR5 DIMM architecture, the left and right side of the DIMM, each served by an independent 40-bit wide channel, share the RCD. In DDR4, the RCD provides two output clocks per side. In DDR5, the RCD provides four output clocks per side. In the highest density DIMMs with x4 DRAMs, this allows each group of 5 DRAMs (single rank, half-channel) to receive its own independent clock. Giving each rank and half-channel an independent clock improves signal integrity, helping to address the lower noise margin issue raised by lowering the VDD (from change #2 above).

5. Longer Burst Length

The fifth major change is burst length. DDR4 burst chop length is four and burst length is eight. For DDR5, burst chop and burst length are extended to eight and sixteen, respectively, to increase burst payload. A burst length of sixteen (BL16) allows a single burst to access 64 bytes of data, which is the typical CPU cache line size. It can do this using only one of the two independent channels. This provides a significant improvement in concurrency and, with two channels, greater memory efficiency.
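As a quick sanity check of the numbers above, the short Python sketch below (illustrative only) shows how a BL16 burst on one 32-bit sub-channel delivers a full 64-byte cache line:

    # One DDR5 burst on a single 32-bit sub-channel (data bits only, ECC excluded)
    data_bits_per_channel = 32
    burst_length = 16                                   # BL16
    print(data_bits_per_channel * burst_length // 8)    # 64 bytes = one CPU cache line
    # DDR4 needs its full 64-bit channel with BL8 to deliver the same 64 bytes
    print(64 * 8 // 8)                                  # 64 bytes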

6. DDR5 Supports Higher Capacity DRAM

A sixth change to highlight is DDR5’s support for higher capacity DRAM devices. With DDR5 buffer chip DIMMs, the server or system designer can use densities of up to 64 Gb DRAMs in a single-die package. DDR4 maxes out at 16 Gb DRAM in a single-die package (SDP). DDR5 supports features like on-die ECC, error transparency mode, post-package repair, and read and write CRC modes to support higher-capacity DRAMs. The impact of higher capacity devices obviously translates to higher capacity DIMMs. So, while DDR4 DIMMs can have capacities of up to 64 GB (using SDP), DDR5 SDP-based DIMMs quadruple that to 256 GB.
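The capacity scaling follows directly from the die density, as this rough Python calculation illustrates; it assumes the same number of single-die packages per DIMM:

    # Max SDP die density: DDR4 = 16 Gb, DDR5 = 64 Gb
    ddr4_dimm_gb = 64                      # max DDR4 SDP-based DIMM capacity (GB)
    density_ratio = 64 // 16               # DDR5 DRAM dies are 4x denser
    print(ddr4_dimm_gb * density_ratio)    # 256 GB DDR5 SDP-based DIMM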

7. A Smarter DIMM with DDR5

The DDR5 server DIMM chipset replaces the DDR4 SPD IC with an SPD Hub IC and adds two temperature sensor (TS) ICs. The SPD Hub has an integrated TS, which in conjunction with the two discrete TS ICs, provides three points of thermal telemetry from the RDIMM.

With DDR5, the communication bus between chips gets an upgrade to I3C running 10X faster than the I2C bus used in DDR4. The DDR5 SPD Hub handles communication from the module to the Baseboard Management Controller (BMC). Using the faster I3C protocol, the DDR5 SPD Hub reduces initialization time and supports a higher rate of polling and real-time control.

Thermal information, communicated from the SPD Hub to the BMC, can be used to manage cooling fan speed. DRAM refresh rate can now be more finely managed to provide for higher performance or higher retention, and if the RDIMM is running too hot, bandwidth can be throttled as needed to reduce the thermal load.

 

What are the DDR5 Design Challenges?

DDR5 DIMM Chipset
DDR5 RDIMMs Showing Rambus Memory Interface Chips

These changes in DDR5 introduce a number of design considerations dealing with higher speeds and lower voltages – raising a new round of signal integrity challenges. Designers will need to ensure that motherboards and DIMMs can handle the higher signal speeds. When performing system-level simulations, signal integrity at all DRAM locations needs to be checked.

For DDR4 designs, the primary signal integrity challenges were on the dual-data-rate DQ bus, with less attention paid to the lower-speed command address (CA) bus. For DDR5 designs, even the CA bus will require special attention for signal integrity. In DDR4, there was consideration for using decision feedback equalization (DFE) to improve the DQ data channel. But for DDR5, the RCD’s CA bus receivers will also require DFE options to ensure good signal reception.

The power delivery network (PDN) on the motherboard is another consideration, including up to the DIMM with the PMIC. Considering the higher clock and data rates, you will want to make sure that the PDN can handle the load of running at higher speed, with good signal integrity, and with good clean power supplies to the DIMMs.

The DIMM connectors from the motherboard to the DIMM will also have to handle the new clock and data rates. For the system designer, at the higher clock speeds and data rates around the printed circuit board (PCB), more emphasis must be placed on system design for electromagnetic interference and compatibility (EMI and EMC).

How do DDR5 memory interface chipsets harness the advantages of DDR5 for DIMMs?

The good news is that DDR5 memory interface chips improve signal integrity for the command and address signals sent from the host memory controller to the DIMMs. The bus for each of the two channels goes to the RCD and then fans out to the two halves of the DIMM. The RCD effectively reduces the loading on the CA bus that the host memory controller sees.

The expanded chipset, including the PMIC, SPD Hub and TS, enables a smarter DIMM that can operate at the higher data rates of DDR5 while remaining within the desired power and thermal envelope.

Rambus offers a full DDR5 memory interface chipset that helps designers harness the full advantages of DDR5 while dealing with the signal integrity challenges of higher data, CA and clock speeds. Rambus was the first in the industry to deliver a DDR5 RCD to 5600 MT/s and is continually advancing the performance of its DDR5 solutions to meet growing market needs. The Rambus DDR5 RCD has now reached performance levels of 7200 MT/s.

As DDR5 evolves and makes its way to the client space, the Rambus DDR5 client memory interface chipset enables client DIMMs (CSODIMMs and CUDIMMs) to deliver new levels of memory performance for demanding gaming, content creation and AI workloads on PCs. The DDR5 Client DIMM Chipset includes a DDR5 Client Clock Driver (CKD) and Serial Presence Detect Hubs (SPD Hub).

As a renowned leader in signal integrity (SI) and power integrity (PI), Rambus has over 30 years’ experience in enabling the highest performance systems in the market.

Additional resources on DDR5:
What’s Next for DDR5 Memory?
Data Center Evolution: DDR5 DIMMs Advance Server Performance

Compute Express Link (CXL): All you need to know https://www.rambus.com/blogs/compute-express-link/ https://www.rambus.com/blogs/compute-express-link/#respond Tue, 23 Jan 2024 18:00:04 +0000 https://www.rambus.com/?post_type=blogs&p=60992 [Last updated on: January 23, 2024] In this blog post, we take an in-depth look at Compute Express Link® (CXL®), an open standard cache-coherent interconnect between processors and accelerators, smart NICs, and memory devices.

  • We explore how CXL can help data centers more efficiently handle the tremendous memory performance demands of generative AI and other advanced workloads.
  • We discuss how CXL technology maintains memory coherency between the CPU memory space and memory on attached devices to enable resource sharing (or pooling).
  • We also detail how CXL builds upon the physical and electrical interfaces of PCI Express® (PCIe®) with protocols that establish coherency, simplify the software stack, and maintain compatibility with existing standards.
  • Lastly, we review Rambus CXL solutions, which include the Rambus CXL 3.1 Controller. This IP comes with integrated Integrity and Data Encryption (IDE) modules to monitor and protect against cyber and physical attacks on CXL and PCIe links.

Table of Contents

1. Industry Landscape: Why is CXL needed?

Data centers face three major memory challenges as roadblocks to greater performance and lower total cost of ownership (TCO). The first of these is the limitation of the current server memory hierarchy: a three-order-of-magnitude latency gap exists between direct-attached DRAM and Solid-State Drive (SSD) storage. When a processor runs out of capacity in direct-attached memory, it must go to SSD, which leaves the processor waiting. That waiting, or latency, has a dramatic negative impact on computing performance.

Secondly, core counts in multi-core processors are scaling far faster than main memory channels. This translates to processor cores beyond a certain number being starved for memory bandwidth, sub-optimizing the benefit of additional cores.

Finally, with the increasing move to accelerated computing, wherein accelerators have their own direct-attached memory, there is the growing problem of underutilized or stranded memory resources.

Keep on reading:
PCIe 6.1 – All you need to know
CXL Memory Initiative: Enabling a New Era of Data Center Architecture

The solution to these data center memory challenges is a complementary, pin-efficient memory technology that can provide more bandwidth and capacity to processors in a flexible manner. Compute Express Link (CXL) is the broadly supported industry standard solution that has been developed to provide low-latency, memory cache coherent links between processors, accelerators and memory devices.

2. An Introduction to CXL: What is Compute Express Link?

CXL is an open standard industry-supported cache-coherent interconnect for processors, memory expansion, and accelerators. Essentially, CXL technology maintains memory coherency between the CPU memory space and memory on attached devices. This enables resource sharing (or pooling) for higher performance, reduces software stack complexity, and lowers overall system cost. The CXL Consortium has identified three primary classes of devices that will employ the new interconnect:

      • Type 1 Devices: Accelerators such as smart NICs typically lack local memory. Via CXL, these devices can communicate with the host processor’s DDR memory.
      • Type 2 Devices: GPUs, ASICs, and FPGAs are all equipped with DDR or HBM memory and can use CXL to make the host processor’s memory locally available to the accelerator—and the accelerator’s memory locally available to the CPU. They are also co-located in the same cache coherent domain and help boost heterogeneous workloads.
      • Type 3 Devices: Memory devices can be attached via CXL to provide additional bandwidth and capacity to host processors. The type of memory is independent of the host’s main memory.

3. What Is the CXL Consortium?

The CXL Consortium is an open industry standard group formed to develop technical specifications that facilitate breakthrough performance for emerging usage models while supporting an open ecosystem for data center accelerators and other high-speed enhancements.​

4. CXL Protocols & Standards

The CXL standard supports a variety of use cases via three protocols: CXL.io, CXL.cache, and CXL.memory.

      • CXL.io: This protocol is functionally equivalent to the PCIe protocol—and utilizes the broad industry adoption and familiarity of PCIe. As the foundational communication protocol, CXL.io is versatile and addresses a wide range of use cases.
      • CXL.cache: This protocol, which is designed for more specific applications, enables accelerators to efficiently access and cache host memory for optimized performance.
      • CXL.memory: This protocol enables a host, such as a processor, to access device-attached memory using load/store commands.

Together, these three protocols facilitate the coherent sharing of memory resources between computing devices, e.g., a CPU host and an AI accelerator. Essentially, this simplifies programming by enabling communication through shared memory. The protocols used to interconnect devices and hosts are as follows:

  • Type 1 Devices: CXL.io + CXL.cache
  • Type 2 Devices: CXL.io + CXL.cache + CXL.memory
  • Type 3 Devices: CXL.io + CXL.memory

5. Compute Express Link vs PCIe: How Are They Related?

CXL builds upon the physical and electrical interfaces of PCIe with protocols that establish coherency, simplify the software stack, and maintain compatibility with existing standards. Specifically, CXL leverages a PCIe 5 feature that allows alternate protocols to use the physical PCIe layer. When a CXL-enabled accelerator is plugged into a x16 slot, the device negotiates with the host processor’s port at default PCI Express 1.0 transfer rates of 2.5 gigatransfers per second (GT/s). CXL transaction protocols are activated only if both sides support CXL. Otherwise, they operate as PCIe devices.

CXL 1.1 and 2.0 use the PCIe 5.0 physical layer, allowing data transfers at 32 GT/s, or up to 64 gigabytes per second (GB/s) in each direction over a 16-lane link.

CXL 3.1 uses the PCIe 6.1 physical layer to scale data transfers to 64 GT/s supporting up to 128 GB/s bi-directional communication over a x16 link.
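The bandwidth figures above follow directly from the lane count and data rate. A minimal Python sketch, ignoring encoding and protocol overhead:

    # Peak unidirectional bandwidth of a x16 link: GT/s x 16 lanes / 8 bits-per-byte
    def x16_bandwidth_gbps(gt_per_s, lanes=16):
        return gt_per_s * lanes / 8
    print(x16_bandwidth_gbps(32))    # CXL 1.1/2.0 over PCIe 5.0: 64 GB/s per direction
    print(x16_bandwidth_gbps(64))    # CXL 3.1 over PCIe 6.1: 128 GB/s per direction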

6. CXL Features and Benefits

Streamlining and improving low-latency connectivity and memory coherency significantly bolsters computing performance and efficiency while lowering TCO. Moreover, CXL memory expansion capabilities enable additional capacity and bandwidth above and beyond the direct-attach DIMM slots in today’s servers. CXL makes it possible to add more memory to a CPU host processor through a CXL-attached device. When paired with persistent memory, the low-latency CXL link allows the CPU host to use this additional memory in conjunction with DRAM memory. High-capacity workloads such as AI depend on large memory capacities. Considering that these are the types of workloads most businesses and data-center operators are investing in, the advantages of CXL are clear.

7. CXL 2.0 and 3.1 Features

Diagram of CXL Memory Pooling Through Direct Connect
CXL Memory Pooling Through Direct Connect


Memory Pooling

CXL 2.0 supports switching to enable memory pooling. With a CXL 2.0 switch, a host can access one or more devices from the pool. Although the hosts must be CXL 2.0-enabled to leverage this capability, the memory devices can be a mix of CXL 1.0, 1.1, and 2.0-enabled hardware. At 1.0/1.1, a device is limited to behaving as a single logical device accessible by only one host at a time. However, a 2.0 level device can be partitioned as multiple logical devices, allowing up to 16 hosts to simultaneously access different portions of the memory.

As an example, host 1 (H1) can use half the memory in device 1 (D1) and a quarter of the memory in device 2 (D2) to finely match the memory requirements of its workload to the available capacity in the memory pool. The remaining capacity in devices D1 and D2 can be used by one or more of the other hosts, up to a maximum of 16. Devices D3 and D4, CXL 1.0 and 1.1-enabled respectively, can be used by only one host at a time.
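The example above can be pictured with a small conceptual model. The Python sketch below is purely illustrative and is not a real CXL API; the device names, capacities and the allocate helper are hypothetical. It simply encodes the rules that a CXL 2.0 device can be partitioned among up to 16 hosts while a 1.0/1.1 device binds to a single host:

    pool = {
        "D1": {"spec": "2.0", "capacity_gb": 256, "allocations": {}},
        "D2": {"spec": "2.0", "capacity_gb": 256, "allocations": {}},
        "D3": {"spec": "1.0", "capacity_gb": 128, "allocations": {}},
    }

    def allocate(pool, device, host, fraction):
        d = pool[device]
        if d["spec"].startswith("1.") and d["allocations"]:
            raise ValueError("CXL 1.x device: one host at a time")
        if len(d["allocations"]) >= 16:
            raise ValueError("CXL 2.0 device: at most 16 hosts")
        d["allocations"][host] = fraction * d["capacity_gb"]   # GB granted to this host

    allocate(pool, "D1", "H1", 0.50)   # H1 takes half of D1
    allocate(pool, "D2", "H1", 0.25)   # ...and a quarter of D2
    allocate(pool, "D3", "H2", 1.00)   # the CXL 1.0 device serves a single host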

CXL 3.1 introduces peer-to-peer direct memory access and enhancements to memory pooling where multiple hosts can coherently share a memory space on a CXL 3.1 device. These features enable new use models and increased flexibility in data center architectures.

Switching

By moving to a CXL 2.0 direct-connect architecture, data centers can achieve the performance benefits of main memory expansion—and the efficiency and total cost of ownership (TCO) benefits of pooled memory. Assuming all hosts and devices are CXL 2.0 (and above)-enabled, “switching” is incorporated into the memory devices via a crossbar in the CXL memory pooling chip. This keeps latency low but requires a more powerful chip since it is now responsible for the control plane functionality performed by the switch. With low-latency direct connections, attached memory devices can employ DDR DRAM to provide expansion of host main memory. This can be done on a very flexible basis, as a host is able to access all—or portions of—the capacity of as many devices as needed to tackle a specific workload.

CXL 3.1 introduces multi-tiered switching which enables the implementation of switch fabrics. CXL 2.0 enabled a single layer of switching. With CXL 3.1, switch fabrics are enabled, where switches can connect to other switches, vastly increasing the scaling possibilities.

The “As Needed” Memory Paradigm

Analogous to ridesharing, CXL 2.0 and 3.1 allocate memory to hosts on an “as needed” basis, thereby delivering greater utilization and efficiency of memory. With CXL 3.1, memory pooling can be reconfigured dynamically without the need for a server (host) reboot. This architecture provides the option to provision server main memory for nominal workloads (rather than worst case), with the ability to access the pool when needed for high-capacity workloads and offering further benefits for TCO. Ultimately, the CXL memory pooling models can support the fundamental shift to server disaggregation and composability. In this paradigm, discrete units of compute, memory and storage can be composed on-demand to efficiently meet the needs of any workload.

Integrity and Data Encryption (IDE)

Disaggregation—or separating the components of server architectures—increases the attack surface. This is precisely why CXL includes a secure by design approach. Specifically, all three CXL protocols are secured via Integrity and Data Encryption (IDE) which provides confidentiality, integrity, and replay protection. IDE is implemented in hardware-level secure protocol engines instantiated in the CXL host and device chips to meet the high-speed data rate requirements of CXL without introducing additional latency. It should be noted that CXL chips and systems themselves require safeguards against tampering and cyberattacks. A hardware root of trust implemented in the CXL chips can provide this basis for security and support requirements for secure boot and secure firmware download.

Scaling Signaling to 64 GT/s

CXL 3.1 brings a step-function increase in the data rate of the standard. As mentioned earlier, CXL 1.1 and 2.0 use the PCIe 5.0 electricals for their physical layer: NRZ signaling at 32 GT/s. CXL 3.1 keeps that same philosophy of building on broadly adopted PCIe technology and extends it to the latest 6.x generation of the PCIe standard, first released in early 2022. That boosts CXL 3.1 data rates to 64 GT/s using PAM4 signaling. We cover the details of PAM4 signaling in PCIe 6 – All you need to know.

8. Rambus CXL Solutions

Rambus CXL 3.1 Controller

The Rambus CXL 3.1 Controller leverages the Rambus PCIe 6.1 Controller (https://www.rambus.com/interface-ip/pci-express/pcie6-controller/) architecture for the CXL.io protocol and adds the CXL.cache and CXL.mem protocols specific to CXL. The controller exposes a native Tx/Rx user interface for CXL.io traffic as well as an Intel CXL-cache/mem Protocol Interface (CPI) for CXL.mem and CXL.cache. There is also a CXL 3.1 Controller with AXI version of the core that is compliant with the AMBA AXI Protocol Specification (AXI3, AXI4 and AXI4-Lite).

Read on:
Rambus CXL Memory Initiative
Rambus CXL & PCI Express Controllers

Zero-Latency IDE

The Rambus CXL 3.1 and PCIe 6.1 controllers are available with integrated Integrity and Data Encryption (IDE) modules. IDE monitors and protects against physical attacks on CXL and PCIe links. CXL requires extremely low latency to enable load-store memory architectures and cache-coherent links for its targeted use cases. This breakthrough controller with a zero-latency IDE delivers state-of-the-art security and performance at the full line rate of the link.

The built-in IDE modules employ a 256-bit AES-GCM (Advanced Encryption Standard, Galois/Counter Mode) symmetric-key cryptographic block cipher, helping chip designers and security architects to ensure confidentiality, integrity, and replay protection for traffic that travels over CXL and PCIe links. This secure functionality is especially imperative for data center computing applications including AI/ML and high-performance computing (HPC).

Key features include:

      • IDE security with zero latency for CXL.mem and CXL.cache
      • Robust protection from physical security attacks, minimizing the safety, financial, and brand reputation risks of a security breach
      • IDE modules pre-integrated in Rambus CXL 3.1 and PCIe 6.1 controllers reduce implementation risks and speed time-to-market

Final Thoughts

CXL is a once-in-a-decade technological force that will transform data center architectures. Supported by a who’s who of industry players including hyperscalers, system OEMs, platform and module makers, chip makers and IP providers, its rapid development is a reflection of the tremendous value it can deliver.

This is why Rambus launched the CXL Memory Initiative—to research and develop solutions that enable a new era of data center performance and efficiency. Current Rambus CXL solutions include the Rambus CXL 3.1 Controller with integrated IDE.

PCIe 6.1 – All you need to know about PCI Express Gen6 https://www.rambus.com/blogs/pcie-6/ https://www.rambus.com/blogs/pcie-6/#respond Tue, 23 Jan 2024 15:00:37 +0000 https://www.rambus.com/?post_type=blogs&p=61054 [Updated January 23, 2024] The PCI Express® 6.0 (PCIe® 6.0) specification was released by PCI-SIG® in January 2022. This new generation of the ubiquitous PCIe standard brought with it many exciting new features designed to boost performance for compute-intensive workloads including data center, AI/ML and HPC applications. PCIe 6.0 has now evolved to version 6.1 of the standard.

Find out all about PCIe 6.1 in the article below.

Contents

What is PCIe 6.1?

Since PCIe 3, each new generation of the standard has seen a doubling in the data rate. PCIe 6.1 boosts the data rate to 64 gigatransfers per second (GT/s), twice that of PCIe 5.0. For a x16 link, which is typical of graphics and network cards, the bandwidth of the link reaches 128 gigabytes per second (GB/s). As in previous generations, the PCIe 6.1 link is full duplex, so it can deliver that 128 GB/s bandwidth in both directions simultaneously for a total bandwidth capacity of 256 GB/s.

PCIe has proliferated widely beyond servers and PCs, with its economies of scale making it attractive for data-centric applications in IoT, automotive, medical and elsewhere. That being said, the initial deployments of PCIe 6.1 will target applications requiring the highest bandwidth possible and those can be found in the heart of the data center: AI/ML, HPC, networking and cloud graphics.

The following chart shows the evolution of the PCIe specification over time:

PCIe Specification | Data Rate per Lane (GT/s) | Encoding  | x16 Unidirectional Bandwidth (GB/s) | Specification Ratification Year
1.x                | 2.5                       | 8b/10b    | 4                                   | 2003
2.x                | 5                         | 8b/10b    | 8                                   | 2007
3.x                | 8                         | 128b/130b | 15.75                               | 2010
4.0                | 16                        | 128b/130b | 31.5                                | 2017
5.0                | 32                        | 128b/130b | 63                                  | 2019
6.x                | 64                        | PAM4/FLIT | 128                                 | 2022
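The bandwidth column can be reproduced from the data rate and the encoding overhead. A minimal Python sketch:

    # x16 unidirectional GB/s = GT/s x 16 lanes / 8 bits-per-byte x encoding efficiency
    gens = [
        ("1.x", 2.5, 8 / 10), ("2.x", 5, 8 / 10),
        ("3.x", 8, 128 / 130), ("4.0", 16, 128 / 130), ("5.0", 32, 128 / 130),
        ("6.x", 64, 1.0),   # PAM4/FLIT: no 128b/130b serial encoding overhead
    ]
    for gen, gts, eff in gens:
        print(f"PCIe {gen}: {gts * 16 / 8 * eff:.2f} GB/s")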

What’s new with PCIe 6.1?

To achieve the 64 GT/s, PCIe 6.1 introduces new features and innovations:

1. PAM4 Signaling:

On the electrical layer, PCIe 6.1 uses PAM4 signaling (“Pulse Amplitude Modulation with four levels”), which encodes 2 bits per clock cycle across 4 amplitude levels (00, 01, 10, 11), whereas PCIe 5.0 and earlier generations used NRZ modulation with 1 bit per clock cycle and two amplitude levels (0, 1).

nrz-pam4
Comparison of NRZ modulation and PAM4 modulation
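To make the comparison concrete, here is a small illustrative Python sketch (the bit-to-level mapping is hypothetical) showing why doubling the bits per symbol doubles the lane data rate at the same symbol rate:

    pam4_levels = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: 3}   # illustrative bit-to-level map
    symbol_rate_gbaud = 32            # same symbol rate as a 32 GT/s NRZ (PCIe 5.0) lane
    print(symbol_rate_gbaud * 1)      # NRZ:  1 bit/symbol  -> 32 Gb/s per lane
    print(symbol_rate_gbaud * 2)      # PAM4: 2 bits/symbol -> 64 Gb/s per lane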

 

 

2. Forward Error Correction (FEC)

There are always tradeoffs, and the transition to PAM4 signal encoding introduces a significantly higher Bit Error Rate (BER) vs. NRZ. This prompted the adoption of a Forward Error Correction (FEC) mechanism to mitigate the higher error rate. Fortunately, the PCIe 6.1 FEC is sufficiently lightweight to have minimal impact on latency. It works in conjunction with strong CRC (Cyclic Redundancy Check) to keep Link Retry probability under 5×10⁻⁶. This new FEC feature targets an added latency under 2 ns.

While PAM4 signaling is more susceptible to errors, channel loss is not affected compared to PCIe 5.0 due to the nature of the modulation technique, so the reach of PCIe 6.1 signals on a PCB will be the same as that of a PCIe 5.0.

3. FLIT Mode:

PCIe 6.1 introduces FLIT mode, where packets are organized in Flow Control Units of fixed sizes, as opposed to variable sizes in past PCIe generations. The initial reason for introducing FLIT mode was that error correction requires working with fixed size packets; however, FLIT mode also simplifies data management at the controller level and results in higher bandwidth efficiency, lower latency, and smaller controller footprint. Let’s address bandwidth efficiency for a minute: with fixed-size packets, the framing of packets at the Physical Layer is no longer needed, that’s a 4-byte savings for every packet. FLIT encoding also does away with 128B/130B encoding and DLLP (Data Link Layer Packets) overhead from previous PCIe specifications, resulting in a significantly higher TLP (Transaction Layer Packet) efficiency, especially for smaller packets.

4. Other changes in PCIe 6:

  • L0p mode – enabling traffic to run on a reduced number of lanes to save power
  • A new PIPE specification – for the PHY to Controller interface

PCIe 6.1 Fun Fact: the x32 and x12 interface widths from earlier generations are dropped. While available in PCIe 5.0 and earlier specifications, these widths were never implemented in the market.

 

Why PCIe 6.1 now?

Before 2015, the PCIe specification stayed well ahead of the bandwidth that mainstream use cases required. After 2015, global data traffic exploded. Data centers transitioned to 100G Ethernet (and up), pushing the bottleneck to the PCIe interconnects in servers and network devices.

The PCIe 6.1 specification fully supports the transition to 800G Ethernet in data centers: 800 gigabit per second (Gb/s) requires 100 GB/s of unidirectional bandwidth, which falls within the 128 GB/s envelope of a x16 PCIe 6.1 link; 800G Ethernet, like PCIe, is full duplex. Further, data center general compute and networking are not the sole driving forces behind PCIe 6.1. AI/ML accelerators have an insatiable need for more bandwidth. Processing AI/ML training models is all about speed, and the faster accelerators can move data in and out, the more efficiently and cost-effectively the training can be executed.

Conclusion

PCIe is everywhere in modern computing architectures, and we expect PCIe 6.1 will gain quick adoption in performance-critical applications in AI/ML, HPC, cloud computing and networking.

Rambus offers PCIe 6.1 controller IP, featuring an Integrity and Data Encryption (IDE) engine which provides state-of-the-art security for the PCIe links and the valuable data transferred over them.

PCI Express 5 vs. 4: What’s New? [Everything You Need to Know] https://www.rambus.com/blogs/pci-express-5-vs-4/ https://www.rambus.com/blogs/pci-express-5-vs-4/#respond Thu, 07 Sep 2023 13:05:44 +0000 https://www.rambus.com/?post_type=blogs&p=24219 Introduction

What’s new about PCI Express 5 (PCIe 5)? The latest PCI Express standard, PCIe 5, represents a doubling of speed over the PCIe 4.0 specifications.

We’re talking about 32 Gigatransfers per second (GT/s) vs. 16GT/s, with an aggregate x16 link duplex bandwidth of almost 128 Gigabytes per second (GB/s).

This speed boost is needed to support a new generation of artificial intelligence (AI) and machine learning (ML) applications as well as cloud-based workloads.

Both are significantly increasing network traffic. In turn, this is accelerating the implementation of higher speed networking protocols which are seeing a doubling in speed approximately every two years.

You can find much more about PCIe 5 in the article below.

Table of contents

1. PCI Express: Frequently Asked Questions (FAQ)
2. PCIe 5 – A New Era
3. PCIe 5 vs. PCIe 4 (+Comparison table included)
4. PCIe 5: Applications and Market Adoption
5. Complete PCIe 5 Interface Solutions from Rambus
6. Conclusion

 

PCI Express: Frequently Asked Questions (FAQ)

Let’s answer five frequently asked questions about PCI Express and PCIe 5.

a. What is PCI Express 5?

With the preliminary specification announced in 2017, PCIe 5 is a high-speed serial computer expansion bus standard that moves data at high bandwidth between multiple components. The PCIe 5.0 specification was formally released in May of 2019.

You might be wondering why a new PCI Express standard like PCIe 5 is needed. Well, PCIe 5 offers twice the data transfer rate of its PCIe 4 predecessor, delivering 32 GT/s vs. 16 GT/s. This speed increase is critical to support new AI/ML applications and cloud-centric computing.

b. Why both GT/s and GB/s?

GT/s is a measure of raw speed – how many bits can we transfer in a second. The data rate, on the other hand, has to take into consideration the overhead for encoding the signal. Bandwidth is data rate times link width, so encoding overhead’s impact on the data rate translates directly to an impact on bandwidth.

Back in the days of PCIe 2, the encoding scheme was 8b/10b, so there was a hefty overhead penalty for encoding. With such a high overhead, it was particularly useful to have measures of transfer rate (x GT/s) and data rate (y Gbps), where “y” was only 80% of “x.”

With Gen 3 and continuing through to the present Gen 5, the PCI Express standard moved to a very efficient 128b/130b encoding scheme, so the overhead penalty is now less than 2%. As such, the link speed and the data rate are roughly the same.

For a PCIe 5 x8 link, 32 GT/s raw speed translates to 31.5 GB/s bandwidth (we chose a x8 link so we could go straight from bits to bytes). And since PCIe is a duplex link, total aggregate bandwidth rounds to 63 GB/s (32 GT/s x 8 lanes / 8 bits-per-byte x 128/130 encoding x 2 for duplex).
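That arithmetic can be written out directly. A minimal Python version of the calculation in the parentheses above:

    gt_per_s, lanes = 32, 8
    unidirectional = gt_per_s * lanes / 8 * (128 / 130)   # GB/s, after encoding overhead
    print(round(unidirectional, 1))                        # ~31.5 GB/s in each direction
    print(round(unidirectional * 2))                       # ~63 GB/s aggregate duplex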

c. What is a PCI Express lane?

So what’s a PCI Express lane? Well, a PCIe lane consists of four wires to support two differential signaling pairs. One pair transmits data (from A to B), while the other receives data (from B to A). Want to know the best part? Each PCIe lane is designed to function as a full-duplex transceiver which can simultaneously transfer 128-bit data packets in both directions.

d. What does PCIe x16 mean?

We’ve discussed lanes, but what do they have to do with x16? Well, the term “PCIe x16” is used to refer to a 16-lane link instantiated on a board or a card. Physical PCIe links may include 1, 2, 4, 8, 12, 16 or 32 lanes. The 32-lane link is a pretty rare beast, so in practical terms the x16 represents the top end of the PCI Express link options.

e. What is PCI Express used for?

We’ve talked a lot about PCIe 5, but what is PCI Express actually used for?

You can think of the PCIe interface as the system “backbone” that transfers data at high bandwidth between various compute nodes. What’s the bottom line? Put simply, PCIe 5 rapidly moves data between CPUs, GPUs, FPGAs, networking devices and ASIC accelerators using links with various lane widths configured to meet the bandwidth requirements for the linked devices.

PCIe 5 vs. PCIe 4

Here’s a handy by-the-numbers comparison of PCIe 5 vs. PCIe 4 with the actual aggregate (duplex) bandwidth adjusted for the encoding overhead.

Comparison table: PCI express 5 vs PCIe 4
Comparison table: PCIe 5 vs PCIe 4

 

PCIe 5: Applications & Market Adoption

AI/ML and Cloud Computing

No surprise, PCIe 5 is the fastest PCI Express ever. While the speed upgrade makes the applications of today run faster, what’s particularly exciting is that PCIe 5 is enabling new applications in markets such as AI/ML and cloud computing.

AI applications generate, move and process massive amounts of data at real-time speeds. An example is a smart car which can generate as much as 4TB of data per day!

But that’s not all: the size of AI/ML training models is doubling every 3-4 months. The torrent of data and the rapid growth in training models are putting tremendous stress on every aspect of the compute architecture, with interconnections between devices and systems being of critical importance. Also critical is fast access to memory, as AI/ML workloads are extremely compute intensive.

But while AI/ML is one major megatrend, there are others. Data centers are changing, with enterprise workloads moving to the cloud at a rapid pace. Those applications mean moving more data, often with real-time speed and latency requirements.

This shift to the cloud, along with ever-more sophisticated AI/ML applications, is accelerating the adoption of higher speed networking protocols that are experiencing a doubling in speed about every two years: 100GbE ->200GbE-> 400GbE.

Now this is where PCI Express 5 comes in. PCIe 5 delivers duplex link bandwidth of almost 128 GB/s in a x16 configuration. Put simply, PCI Express 5 effectively addresses the demands of AI/ML and cloud computing by supporting higher speed networking protocols as well as higher speed interconnections between system devices.

Complete PCI Express 5 Digital Controller Solutions from Rambus

Rambus offers a highly configurable PCIe 5.0 digital controller.

The Rambus PCIe 5.0 Controller can be paired with 3rd-party PHYs or those developed in house. Rambus can provide integration and verification of the entire interface subsystem.

Conclusion

In “PCI Express 5 vs. 4: What’s New?” we explain how PCI Express is the system backbone that transfers data at high bandwidth between CPUs, GPUs, FPGAs and ASIC accelerators using links of variable lane widths depending on the bandwidth needs of the linked devices.

We also detail how the latest PCI Express standard, PCIe 5, represents a doubling over PCIe 4 with a raw speed of 32GT/s vs. 16GT/s translating to total duplex bandwidth for a x16 link of ~128 GB/s vs. ~64 GB/s.

We then explored how the higher data rates of PCIe 5 are enabling system designers to support a new generation of cloud computing and AI/ML applications.

Explore more primers:
Hardware root of trust: All you need to know
Side-channel attacks: explained
DDR5 vs DDR4 – All the Design Challenges & Advantages
Compute express link: All you need to know
MACsec Explained: From A to Z
The Ultimate Guide to HBM2E Implementation & Selection

 

Side-channel attacks explained: everything you need to know https://www.rambus.com/blogs/side-channel-attacks/ https://www.rambus.com/blogs/side-channel-attacks/#respond Thu, 14 Oct 2021 13:35:04 +0000 https://www.rambus.com/?post_type=blogs&p=60902 In this blog post, we take an in-depth look at the world of side-channel attacks.

We describe how side-channel attacks work and detail some of the most common attack methodologies. We also explore differential power analysis (DPA), an extremely powerful side-channel attack capable of obtaining and analyzing statistical measurements across multiple operations. In addition, we provide a walkthrough of a DPA attack and explain how different countermeasures with varying levels of effectiveness can be used to prevent side-channel attacks.

Table of contents
  1. What is a side-channel attack?
  2. How does a side channel attack work?
  3. What attacks use side channel analysis?
  4. DPA explained
  5. DPA & Paul Kocher
  6. Technical example of a differential power analysis attack
  7. Countermeasures: Preventing Side-channel attacks
  8. Final thoughts

What is a side-channel attack?

A side-channel attack (SCA) is a security exploit that attempts to extract secrets from a chip or a system. This can be achieved by measuring or analyzing various physical parameters. Examples include supply current, execution time, and electromagnetic emission. These attacks pose a serious threat to modules that integrate cryptographic systems. Indeed, many side-channel analysis techniques have proven successful in breaking an algorithmically robust cryptographic operation and extracting the secret key.

Introduction to Side-Channel Attacks

How does a side channel attack work?

A side-channel attack does not target a program or its code directly. Rather, a side-channel attack attempts to gather information or influence the program execution of a system by measuring or exploiting indirect effects of the system or its hardware. Put simply, a side-channel attack breaks cryptography by exploiting information inadvertently leaked by a system. One such example is the van Eck phreaking attack, also known as a Transient Electromagnetic Pulse Emanation Standard (TEMPEST) attack. This attack monitors the electromagnetic field (EMF) radiation emitted by a computer screen to view information before it is encrypted.

What attacks use side channel analysis?

There are a growing number of known side-channel attack vectors. Some of the most common attacks are:

  • Timing attack: Analyzes the time a system spends executing cryptographic algorithms. Keep on reading: Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems »
  • Electromagnetic (EM) attack: Measures and performs a signal analysis on the electromagnetic radiation emitted from a device.
  • Simple power analysis (SPA): Directly observes the power and electromagnetic (EM) variations of a cryptographic system during operations.
  • Differential power analysis (DPA): Obtains and analyzes detailed statistical measurements across multiple operations.
  • Template attack: Recovers cryptographic keys by exploiting an identical “template” device and comparing side-channel data.

DPA explained: Why is this black-box attack considered one of the most efficient and dangerous attacks?

  • Definition: A Differential Power Analysis (DPA) is a form of side-channel attack that monitors variations in the electrical power consumption or electro-magnetic emissions of a target device. The basic method involves partitioning a set of traces into subsets, then subsequently computing the difference of the averages of these subsets.
  • Differences: DPA is an extremely powerful technique that obtains and analyzes statistical measurements across multiple operations.
  • What makes DPA more efficient/dangerous? Given enough traces, extremely minute correlations can be isolated—no matter how much noise is present in the measurements. DPA can even extract information about individual gate-switching, an individual transistor turning on or off, or the interaction between one gate and another.

DPA & Paul Kocher: An introduction to differential power analysis

How does an attacker target a device or system using DPA? In the paper titled “Introduction to Differential Power Analysis,” Paul Kocher describes how information inadvertently leaked through power consumption and other side channels can be analyzed to extract secret keys from a wide range of devices.

The attacks are practical, non-invasive, and highly effective—even against complex and noisy systems where cryptographic computations account for only a small fraction of the overall power consumption.

Technical example of a differential power analysis attack

The following steps detail the DPA attack process.

1. Make power consumption measurements of the last few rounds of 1000 DES operations. Each sample set consists of 100000 data points. The data collected can be represented as a two-dimensional array S[0…999][0…99999], where the first index is the operation number and the second index is the sample. For this example, the attacker is also assumed to have the encrypted ciphertexts, C[0…999].

2. The attacker next chooses a key-dependent selection function D. In this case, the selection function would have the form D(Ki,C), where Ki is some key information and C is a ciphertext.

For example, the attacker’s goal will be to find the 6 bits of the DES key that are provided as the input to the DES S box 4, so Ki is a 6-bit input. The result of D(Ki,C) would be obtained by performing the DES initial permutation (IP) on C to obtain R and L, performing the E expansion on R, extracting the 6-bit input to S4, XORing with Ki, and using the XOR result as the input to the standard DES S4 lookup operation.

A target bit (for example, the most significant bit) of the S result is selected. The P permutation is applied to the bit. The result of the D(Ki,C) function is set to 0 if the single-bit P permutation result and the corresponding bit in L are equal, and otherwise D(Ki,C) yields 1.

3. A differential average trace T[0…63][0…99999] is constructed from the data set S using the results of the function D. In particular, for each key guess i and each sample point j, T[i][j] is the average of S[k][j] over the operations k where D(i, C[k]) = 1, minus the average of S[k][j] over the operations where D(i, C[k]) = 0.

4. The attacker knows that there is one correct value for Ki; other values are incorrect. The attack goal is to identify the correct value. In the trace T[i][0…99999] where i=Ki, D(i,C[k]) for any k will equal the value of the target bit in L of the DES operation before the DES F function result was XORed. When the target device performed the DES operations, this bit value was stored in registers, manipulated in logic units, etc. — yielding detectable power consumption differences.

Thus, for the portions of the trace T[i=Ki] where that bit was present and/or manipulated, the sample set T[i] will show power consumption biases. However, for samples T[i != Ki], the value of D(i,C[k]) will not correspond to any operation actually computed by the target device. As a result, the trace T[i] will not be correlated to anything actually performed, and will average to zero. (Actually, T[i != Ki] will show small fluctuations due to noise and error that is not statistically filtered out, and due to biases resulting from statistical properties of the S tables. However, the largest biases will correspond to the correct value of Ki.)

5. The steps above are then repeated for the remaining S boxes to find the 48 key bits for the last round. The attack can then be repeated to find the previous round’s subkey (or the remaining 8 bits can be found using a quick search).
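For readers who think in code, the sketch below is a simplified, illustrative rendering of step 3, not the exact analysis pipeline: for each 6-bit key guess it partitions the traces with a selection function D like the one described above and differences the means. S, C and D are placeholders for the measured traces, ciphertexts and selection function.

    import numpy as np

    def differential_traces(S, C, D, n_guesses=64):
        # S: power traces, shape (n_ops, n_samples); C: ciphertexts; D: selection function
        n_ops, n_samples = S.shape
        T = np.zeros((n_guesses, n_samples))
        for ki in range(n_guesses):
            d = np.array([D(ki, C[k]) for k in range(n_ops)], dtype=bool)
            if d.any() and (~d).any():
                # Difference of the averages of the two partitions (step 3)
                T[ki] = S[d].mean(axis=0) - S[~d].mean(axis=0)
        return T

    # The key guess whose trace shows the largest biases is the likely 6-bit subkey:
    # best_ki = np.argmax(np.abs(differential_traces(S, C, D)).max(axis=1))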

Countermeasures: Preventing Side-channel attacks

Countermeasures fall into two main categories:

Category 1: Eliminate or reduce the release of such information.

Countermeasures for category 1

  • Jam the emitted channel with noise: Specifically, random delays are introduced to deter timing attacks. The arbitrary and artificial “noise” forces an adversary to collect more measurements. It should be noted that standalone noise introduction is incapable of sufficiently masking side-channel emissions. DPA conducted against a device can effectively bypass stand-alone noise countermeasures, ultimately allowing the signal to be isolated.
  • Apply power line conditioning and filtering: Although somewhat effective, this method may not eliminate all minute correlations—and could potentially allow a determined attacker to compromise system security.
  • Analyze and evaluate: All electronic systems should be carefully evaluated with a Test Vector Leakage Assessment (TVLA) platform such as the Rambus DPA Workstation (DPAWS) to identify sensitive side-channel leakage.
  • Implement a silicon-based hardware root of trust: Rambus DPA Resistant hardware cores (DPARC)—which feature integrated countermeasures—are built around optimized implementations of industry accepted ciphers such as AES, SHA-256, RSA and ECC. These countermeasures have been designed and extensively validated using the Test Vector Leakage Assessment (TVLA) methodology revealing no leakage beyond 100 million traces, which means the cores are protected against univariate first and second-order side-channel attacks beyond 1 billion operations.

Category 2: Eliminate the relationship between the leaked information and the secret data.

Countermeasures for category 2

  • Apply blinding techniques: This technique alters the algorithm’s input (for asymmetric encryption schemes) into an unpredictable state to prevent leakage.
  • Implement masking: This countermeasure randomly splits every sensitive intermediate variable occurring in the computation into d + 1 shares. Although widely used in practice, masking is often treated as an empirical solution and its effectiveness is rarely formally proven. A minimal illustration of share splitting follows below.
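As a minimal illustration of the masking idea (not a production implementation), the Python sketch below splits a sensitive byte into d + 1 random XOR shares that recombine to the original value:

    import secrets

    def mask(value, d=2):
        shares = [secrets.randbits(8) for _ in range(d)]   # d random shares
        last = value
        for s in shares:
            last ^= s                                      # final share completes the XOR
        return shares + [last]                             # d + 1 shares in total

    secret_byte = 0x3C
    shares = mask(secret_byte, d=2)
    recovered = 0
    for s in shares:
        recovered ^= s
    assert recovered == secret_byte                        # shares XOR back to the secret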

Final thoughts

Side-channel attacks conducted against electronic equipment and infrastructure are relatively simple and inexpensive to execute. An attacker does not necessarily need to know specific implementation details of the cryptographic device to perform these attacks and extract keys. Side-channel attacks have successfully cracked the hardware or software implementations of numerous cryptosystems including block ciphers such as DES, AES, Camellia, IDEA and Misty1. Side-channel attacks have also broken stream ciphers (RC4, RC6, A5/1 and SOBER-t32) and public key ciphers. Since all physical electronic systems routinely leak information, effective side-channel countermeasures should be implemented at the design stage to ensure protection of sensitive keys and data.

Here at Rambus, we developed fundamental solutions and techniques for protecting devices against DPA and related side-channel attacks, along with supporting tools, programs, and services. Learn more about our DPA Countermeasure solutions.

Explore more primers:
Hardware root of trust: All you need to know
PCI Express 5 vs. 4: What’s New?
DDR5 vs DDR4 – All the Design Challenges & Advantages
Compute express link: All you need to know
MACsec Explained: From A to Z
The Ultimate Guide to HBM2E Implementation & Selection

Read more on the “Side-channel attacks” topic:

The importance of protecting military equipment from side-channel attacks

Side-Channel Attacks Target Machine Learning (ML) Models

Detecting and analyzing side-channel vulnerabilities with TVLA

Cracking SIM cards with side-channel attacks

Side-channel attack targets deep neural networks (DNNs)

TEMPEST side-channel attacks recover AES-256 encryption keys

Side-Channel Analysis Demo: FPGA Board
