HBM and GDDR6 Archives - Rambus

High Bandwidth Memory (HBM): Everything You Need to Know

[Updated on October 30, 2025] In an era where data-intensive applications, from AI and machine learning to high-performance computing (HPC) and gaming, are pushing the limits of traditional memory architectures, High Bandwidth Memory (HBM) has emerged as a high-performance, power-efficient solution. As industries demand faster, higher throughput processing, understanding HBM’s architecture, benefits, and evolving role in next-gen systems is essential.

In this blog, we’ll explore how HBM works, how it compares to previous generations, and why it’s becoming the cornerstone of next-generation computing.

What is High Bandwidth Memory (HBM) and How is it Reshaping the Future of Computing?


As computing races toward higher speeds and greater efficiency, memory bandwidth has emerged as a major bottleneck for workloads like AI, high-performance computing, and data analytics. This is where High Bandwidth Memory (HBM) comes in. HBM is a cutting-edge 2.5D and 3D memory architecture designed with an exceptionally wide data path, enabling massive throughput and performance gains. Unlike traditional memory architectures that rely on horizontal layouts and narrow interfaces, HBM takes a vertical approach: stacking memory dies atop one another and connecting them with through-silicon vias (TSVs). This 3D-stacked design drastically shortens data travel paths, enabling higher bandwidth and lower power consumption in a compact footprint.

HBM operates at incredible multi-gigabit speeds. When you combine that speed with a very wide data path, the result is staggering bandwidth, often measured in hundreds of Gigabytes per second (GB/s) and even reaching into the Terabytes per second (TB/s) range.

To put this into perspective, an HBM4 device running at 8 Gb/s delivers 2.048 TB/s of bandwidth. That level of performance is what makes HBM4 a leading choice for AI training hardware.
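As a rough illustration of where these numbers come from, here is a minimal sketch that computes peak bandwidth from the per-pin data rate and interface width; the 8 Gb/s rate and 2048-bit width are the HBM4 figures quoted above, and the function name is just for illustration.

```python
def peak_bandwidth_gb_s(data_rate_gbps_per_pin: float, interface_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin data rate (Gb/s) x interface width (bits) / 8 bits per byte."""
    return data_rate_gbps_per_pin * interface_width_bits / 8

# HBM4 example from the text: 8 Gb/s per pin across a 2048-bit interface
hbm4 = peak_bandwidth_gb_s(8.0, 2048)
print(f"HBM4 per-device bandwidth: {hbm4:.0f} GB/s (~{hbm4 / 1000:.3f} TB/s)")  # 2048 GB/s, i.e. 2.048 TB/s
```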

What is a 2.5D/3D Architecture?

2.5D and 3D architectures refer to advanced integration techniques that improve performance, bandwidth, and power efficiency by bringing components closer together—literally.

HBM4 Uses a 2.5D/3D Architecture

3D Architecture
The “3D” part is easy to see. In 3D architecture, chips are stacked vertically and connected through TSVs (vertical electrical connections that pass through the silicon dies). An HBM memory device is a packaged 3D stack of DRAM, forming a compact, high-performance memory module. Think of it as a high-rise building of chips with elevators (TSVs) connecting the floors.

2.5D Architecture
In a 2.5D setup, multiple chips, like a CPU, a GPU, and in our case, HBM device stacks, are placed side-by-side on a silicon interposer – a thin substrate of silicon that acts as a high-speed communication bridge. The interposer contains the fine-pitch wiring that enables fast, low-latency connections between the chips.
Why do we need a silicon interposer? The data path between each HBM4 memory device and the processor requires 2,048 “wires” or traces. With the addition of command and address, clocks, etc., the number of traces needed grows to about 3,000.

Thousands of traces are far more than can be supported on a standard PCB. Therefore, a silicon interposer is used as an intermediary to connect the memory device(s) and processor. As with an integrated circuit, finely spaced traces can be etched in the silicon interposer, providing the number of wires needed for the HBM interface. The HBM device(s) and the processor are mounted atop the interposer in what is referred to as a 2.5D architecture.

Because HBM uses both the 2.5D and 3D architectures described above, it is referred to as a 2.5D/3D memory solution.

How is HBM4 Different from HBM3E, HBM3, HBM2, or HBM (Gen 1)?

HBM4 represents a significant leap forward from its predecessors—HBM3E, HBM3 and earlier generations—in terms of bandwidth, capacity, efficiency and architectural innovation. With each generation, we’ve seen an upward trend in data rate, 3D-stack height, and DRAM chip density. That translates to higher bandwidth and greater device capacity with each upgrade of the specification.

When HBM launched, it started with a 1 Gb/s data rate and a 1024-bit wide interface. HBM delivered 128 GB/s of bandwidth, a huge step forward at the time.
Since then, every generation has pushed the limits a little further. HBM2, HBM3, and now HBM3E have all scaled bandwidth primarily by increasing the data rate. For example, HBM3E runs at 9.6 Gb/s, enabling 1229 GB/s of bandwidth per stack. That’s impressive, but HBM4 takes things to an entirely new level. HBM4 doesn’t just tweak the speed; it doubles the interface width from 1024 bits to 2048 bits. This architectural shift means that even at a modest 8 Gb/s data rate, HBM4 can deliver 2.048 TB/s of bandwidth per stack. That’s nearly double what HBM3E offers.

Chip architects aren’t stopping at one stack. In fact, they’re designing systems with higher attach rates to feed the insatiable appetite of AI accelerators and next-gen GPUs. Imagine a configuration with eight HBM4 stacks, each running at 8 Gb/s. The result? A staggering 16.384 TB/s of memory bandwidth. That’s the kind of throughput needed for massive AI models and high-performance computing workloads.

The table below summarizes the key differences between HBM4 and earlier generations.

| Generation | Data Rate (Gb/s) | Interface Width (bits) | Bandwidth per Device (GB/s) | Stack Height | Max. DRAM Die Capacity (Gb) | Max. Device Capacity (GB) |
|---|---|---|---|---|---|---|
| HBM | 1.0 | 1024 | 128 | 8 | 16 | 16 |
| HBM2 | 2.0 | 1024 | 256 | 8 | 16 | 16 |
| HBM2E | 3.6 | 1024 | 461 | 12 | 24 | 36 |
| HBM3 | 6.4 | 1024 | 819 | 16 | 32 | 64 |
| HBM3E | 9.6 | 1024 | 1229 | 16 | 32 | 64 |
| HBM4 | 8.0 | 2048 | 2048 | 16 | 32 | 64 |
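The bandwidth column follows directly from data rate times interface width. A small sketch (generation figures taken from the table above) recomputes it, along with the eight-stack HBM4 configuration mentioned earlier:

```python
# (data rate in Gb/s, interface width in bits), per the table above
generations = {
    "HBM":   (1.0, 1024),
    "HBM2":  (2.0, 1024),
    "HBM2E": (3.6, 1024),
    "HBM3":  (6.4, 1024),
    "HBM3E": (9.6, 1024),
    "HBM4":  (8.0, 2048),
}

for name, (rate_gbps, width_bits) in generations.items():
    bandwidth_gb_s = rate_gbps * width_bits / 8  # GB/s per device
    print(f"{name:6s}: {bandwidth_gb_s:7.1f} GB/s per device")

# Eight HBM4 devices, as in the accelerator configuration described above
print(f"8 x HBM4: {8 * 8.0 * 2048 / 8 / 1000:.3f} TB/s aggregate")  # 16.384 TB/s
```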

What are the Additional Features of HBM4?

Beyond raw bandwidth, HBM4 also introduces enhancements in power, memory access, and RAS over HBM3E.

    • Double the Memory Channels: HBM4 doubles the number of independent channels per stack to 32, with 2 pseudo-channels per channel. This gives designers more flexibility in accessing the DRAM devices in the stack (see the sketch after this list).
    • Improved Power Efficiency: HBM4 supports VDDQ options of 0.7V, 0.75V, 0.8V or 0.9V and VDDC of 1.0V or 1.05V. The lower voltage levels improve power efficiency.
    • Compatibility and Flexibility: The HBM4 interface standard ensures backwards compatibility with existing HBM3 controllers, allowing for seamless integration and flexibility in various applications.
    • Directed Refresh Management (DRFM): HBM4 incorporates Directed Refresh Management (DRFM) for improved Reliability, Availability, and Serviceability (RAS) including improved row-hammer mitigation.
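A quick way to see what the doubled channel count means for access granularity is to divide the interface width by the channel and pseudo-channel counts. The sketch below uses the HBM3 figures quoted later in this archive (16 channels across a 1024-bit interface) alongside the HBM4 figures above; it illustrates the arithmetic rather than quoting the specification.

```python
def channel_widths(interface_bits: int, channels: int, pseudo_per_channel: int = 2):
    """Return (bits per channel, bits per pseudo-channel) for an HBM interface."""
    per_channel = interface_bits // channels
    return per_channel, per_channel // pseudo_per_channel

print("HBM3:", channel_widths(1024, 16))  # (64, 32)
print("HBM4:", channel_widths(2048, 32))  # (64, 32)
# The channel width stays the same; doubling the interface width is what allows twice
# as many independent channels, which is where the extra scheduling flexibility comes from.
```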

Rambus HBM Memory Controller Cores for AI and High-Performance Workloads

Rambus delivers a comprehensive portfolio of HBM controller cores engineered for maximum speed and efficiency. Designed for high bandwidth and ultra-low latency, these controllers enable cutting-edge performance for AI training, machine learning, and advanced computing applications.

The lineup includes our industry-leading HBM4 memory controller, supporting data rates up to 10 Gb/s and offering exceptional flexibility for next-generation workloads. With Rambus HBM controllers, designers can achieve superior throughput, scalability, and reliability for demanding AI and HPC environments.

Summary

As computing demands continue to skyrocket, HBM stands out as a transformative technology that addresses the critical bottleneck of memory bandwidth. By leveraging advanced 2.5D and 3D architectures, HBM delivers massive throughput, exceptional power efficiency, and scalability for next-generation workloads. With HBM4 doubling interface width and introducing new features for flexibility and reliability, it is poised to become the backbone of AI, HPC, and data-intensive applications. Understanding this evolution is key to achieving the performance required for tomorrow’s most demanding systems.

Explore more resources:
HBM4 Memory: Break Through to Greater Bandwidth
Unleashing the Performance of AI Training with HBM4
Ask the Experts: HBM3E Memory Interface IP

Nidish Kamath Talks HBM4 and AI in Rambus Ask the Experts

To coincide with the launch of the industry’s first HBM4 Controller IP from Rambus, we talked to Nidish Kamath, Director of Product Management for Memory Interface IP.

The discussion highlighted how AI applications are driving the increased demand for HBM-based systems; the transition to Generative AI applications has led to significant performance and efficiency demands on the underlying compute infrastructure. The HBM4 standard, currently under development by JEDEC, will introduce new features designed to support the future memory requirements of AI applications.

Rambus is supporting designers with the transition to a new generation of HBM designs with an innovative digital controller that manages some of the implementation challenges that emerge when designing at higher data rates.

Check out the full video interview below or skip to read the key takeaways.

Expert

  • Nidish Kamath, Director of Product Management, Rambus

Key Takeaways

  1. AI Drives HBM Evolution: The rapid evolution of the HBM specification is driven by the increasing demands of AI applications as they evolve from machine learning to more generalized and widely deployed AI. These applications pose critical performance and efficiency challenges for the underlying compute infrastructure.
  2. HBM4 Standard Development: The HBM4 standard, currently under development by JEDEC, is set to introduce a doubled channel count per stack compared to HBM3, with a larger physical footprint. HBM4 will support speeds of 6.4 Gigabits per second (Gbps) with ongoing discussions regarding support for higher data rates.
  3. HBM4 Implementation Challenges: HBM4 will specify 24 and 32 Gigabit capacities, with options for supporting 4-, 8-, and 16-high TSV stacks. The increased channel count introduces implementation challenges such as packaging complexities, increased power density, as well as thermal and DRAM refresh management challenges.
  4. Rambus HBM4 Controller Solution: The Rambus HBM4 Controller IP is designed to manage the complexity of data parallelism at higher speeds. For example, it has a re-ordering logic that optimizes the outgoing HBM transactions and incoming HBM read data to keep the high bandwidth data interface efficiently utilized for the given performance and power target.
  5. Rambus HBM Expertise and Partnerships: The Rambus Memory Controller engineering team has over a decade of specialized expertise in designing high performance memory interface IP, including over 150 design wins for HBM and GDDR. The team works closely with PHY memory vendors to ensure any new PHY releases are fully tested out and supported for end customers.

Key Quote

Today’s AI applications pose critical performance and efficiency challenges for the underlying compute infrastructure. We are seeing widespread use of GPUs and AI accelerators that need to evolve quickly to meet the demanding performance requirements of these applications. This is one of the key reasons why we are seeing HBM4-based system development proceed at a more rapid pace compared to previous generations of the standard.

HBM3E Memory Interface IP On the Latest ATE

This episode of “Ask the Experts” features a discussion on High Bandwidth Memory (HBM) with memory experts Frank Ferro and Nidish Kamath. The conversation focused on the role of HBM in today’s computing landscape, particularly for data center, AI, and High-Performance Computing (HPC) applications.

The experts highlighted the advantages of HBM3E, including higher memory bandwidth, higher capacity in a compact form factor, and improved power efficiency. They also discussed some of the challenges of implementing HBM, such as managing the complexity of data parallelism at higher speeds, its unique 2.5D architecture, and thermal management.

The interview concluded with a discussion on how Cadence and Rambus work together to deliver complete HBM3E memory subsystem solutions for customers.

Watch the full video interview to hear the details or skip below to read the key takeaways.

Experts

  • Frank Ferro, Group Director of Memory and Storage IP, Cadence
  • Nidish Kamath, Director of Product Management, Rambus

Key Takeaways

  1. HBM’s Crucial Role: The growth of AI is placing new demands on computing infrastructures, particularly in terms of performance and efficiency. HBM has quickly become a crucial element in meeting these requirements, especially for GPUs and AI accelerators.
  2. HBM Specification Evolution: The rapid evolution in the HBM specification in recent years has been driven by the phenomenal growth in data, particularly in AI training models. HBM3E offers high memory bandwidth performance, which is needed to train today’s large language models.
  3. HBM vs DDR Memory: HBM has three advantages over traditional DDR memory: higher memory bandwidth, higher capacity in a compact form factor, and improved power efficiency. HBM3E provides a maximum bandwidth of up to 1.2 Terabytes per second per HBM3E memory device. 
  4. HBM3E Implementation Challenges: Managing the complexity of data parallelism at higher speeds is important for the controller. Implementing HBM3E also presents challenges at the physical layer as HBM requires a silicon interposer, and many designers are unfamiliar with this 2.5D architecture.
  5. Cadence-Rambus Collaboration: Cadence and Rambus have experience working together to leverage their respective areas of expertise to deliver HBM3E memory subsystems for customers. Cadence focuses on the physical layer, while Rambus designs memory controllers that work seamlessly with Cadence PHYs.

Key Quote

HBM has three key advantages: higher memory bandwidth, higher capacity in a compact form factor, and improved power efficiency. HBM3E has a per pin maximum of 9.6 Gigabits per second. Up to 1024 of these high-speed IOs provide data connectivity to the 12 or more stacked DRAM chips in the 3D package and provide a maximum bandwidth of up to 1.2 Terabytes per second.

Rambus Advances AI 2.0 with GDDR7 Memory Controller IP

As the latest addition to the Rambus portfolio of industry-leading interface and security digital IP for AI 2.0, the GDDR7 memory controller will provide the breakthrough memory throughput required by servers and clients in the next wave of AI inference.

Memory Solutions for AI 2.0

AI 2.0 represents the revolutionary world of generative AI. AI 2.0 leverages the enormous growth in Large Language Models (LLMs) and their kin to create new multimodal content. Multimodality means that text, images, speech, music, and video can be combined as inputs to create outputs in all of these media and more. Examples include creating a 3D model from an image or a video from a text prompt.

LLMs have scaled to over a trillion parameters with data sets in the billions of samples. Training LLMs requires enormous computational power supported by the latest high-performance memory solutions.

Supercharging AI Inference with GDDR7

The output of the AI 2.0 training process is an inference model that can be employed to create new multimodal content from a user’s prompts. Since accuracy and fidelity increase with model size, there is an ongoing push to larger and larger inference models. And as AI inference becomes increasingly pervasive and moves out from the data center to the edge and endpoints, it drives the need for more powerful processing engines with tailored high-performance memory solutions across the entire computing landscape.

GPUs have been the inference engines of choice, and in the case of edge and endpoint applications, such as servers and desktops, these have been GPUs using GDDR6 memory. GDDR6, however, has reached the practical limit of standard NRZ signaling at 24 Gigabits per second (Gbps) data rates. To meet the bandwidth needs of future GPUs, a new generation of GDDR using a new signaling scheme is required. Enter GDDR7 memory, which uses PAM3 signaling to boost data rates to 40 Gbps and higher.

Rambus Silicon IP for AI 2.0

As the preferred silicon IP supplier for AI 2.0, Rambus offers industry-leading HBM, PCIe and CXL Controller IP and now the industry’s first GDDR7 Memory Controller IP. The Rambus GDDR7 Controller provides a full-featured, bandwidth-efficient solution for GDDR7 memory implementations. It supports 40 Gbps operation, providing 160 Gigabytes per second (GB/s) of throughput for a GDDR7 memory device, a 67% increase over the industry’s highest throughput GDDR6 Controller (also from Rambus). The Rambus GDDR7 Controller enables a new generation of GDDR memory deployments for cutting-edge AI accelerators, graphics and high-performance computing (HPC) applications.
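As a back-of-the-envelope check on those figures, the sketch below assumes a 32-bit-wide device interface (the per-device data width quoted for GDDR6 elsewhere in this archive) and compares GDDR7 at 40 Gbps with GDDR6 at its 24 Gbps limit.

```python
DEVICE_WIDTH_BITS = 32  # per-device data width assumed for both GDDR6 and GDDR7

def device_bandwidth_gb_s(pin_rate_gbps: float) -> float:
    """Peak per-device bandwidth in GB/s for a given per-pin data rate."""
    return pin_rate_gbps * DEVICE_WIDTH_BITS / 8

gddr6 = device_bandwidth_gb_s(24.0)  # 96 GB/s
gddr7 = device_bandwidth_gb_s(40.0)  # 160 GB/s
print(f"GDDR6 @ 24 Gbps: {gddr6:.0f} GB/s")
print(f"GDDR7 @ 40 Gbps: {gddr7:.0f} GB/s ({(gddr7 / gddr6 - 1) * 100:.0f}% higher)")
```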

“Delivering greater memory performance is mission critical as AI 2.0 workloads push bandwidth requirements higher than ever before,” said Neeraj Paliwal, general manager of Silicon IP, at Rambus. “With our breakthrough GDDR7 Controller IP solution, designers can quickly take advantage of this latest generation of GDDR memory at industry-leading throughput.”

“GDDR7 memory offers significant performance gains over GDDR6,” said Soo-Kyoum Kim, vice president, memory semiconductors at IDC. “The Rambus GDDR7 Controller IP solution will be a vital tool for anyone that wants to take advantage of the improved speed and latency features offered by GDDR7.”

Rambus GDDR7 Controller key features:

  • Supports all GDDR7 link features including PAM3 and NRZ signaling
  • Supports a broad range of GDDR7 device sizes and speeds
  • Optimized for high efficiency and low latency across a wide variety of traffic scenarios
  • Flexible AXI interface support
  • Low-power support (self-refresh, hibernate self-refresh, dynamic frequency scaling, etc.)
  • Reliability, Availability and Serviceability (RAS) features – such as end-to-end data path parity, parity protection for stored registers, etc.
  • Comprehensive memory test support
  • Integration support for third-party PHYs available
  • Validated utilizing the latest GDDR7 VIP and memory vendor memory models


The Rambus GDDR7 Memory Controller IP is available now. Learn more about the Rambus GDDR7 Controller here or download our white paper, Supercharging AI Inference with GDDR7.

[Infographic]: The Powerful Technologies that Enable Systems like ChatGPT to Thrive

Generative AI has been making waves in the tech industry. The capability to understand context and perform tasks like creating and summarizing content with astonishing accuracy in seconds showcases the cutting-edge potential that generative AI has to transform business processes.

Have you ever thought about the technologies that enable generative AI, including ChatGPT and Google Bard? Semiconductor technologies like DDR5, High Bandwidth Memory (HBM), GDDR, and PCI Express are critical in the training and deployment of generative AI.

Security will be another essential requirement as Generative AI proliferates to the edge and increasingly to client systems and smart end points. Safeguarding AI data and assets will require security anchored in hardware.

Check out the Rambus infographic below, “The Powerful Technologies that Enable Systems like ChatGPT to Thrive” to learn more.


Rambus HBM3 Controller IP Gives AI Training a New Boost

As AI continues to grow in reach and complexity, the unrelenting demand for more memory requires the constant advancement of high-performance memory IP solutions. We’re pleased to announce that our HBM3 Memory Controller now enables an industry-leading memory throughput of over 1.23 Terabytes per second (TB/s) for training recommender systems, generative AI and other compute-intensive AI workloads.

According to OpenAI, the amount of compute used in the largest AI training has increased at a rate of 10X per year since 2012, and this is showing no signs of slowing down any time soon! The growth of AI training data sets is being driven by a number of factors. These include complex AI models, vast amounts of online data being produced and made available, as well as a continued desire for more accuracy and robustness of AI models.

OpenAI’s very own ChatGPT, the most talked-about large language model (LLM) of this year, is a great example to illustrate the growth of AI data sets. When ChatGPT was first released to the public in November 2022, it was built on GPT-3, a model with 175 billion parameters. GPT-4, released just a few months later, is reported to use upwards of 1.5 trillion parameters. This staggering growth illustrates just how large data sets are becoming in such a short period of time.

As AI applications evolve and become more complex, more advanced models, larger data sets and massive data processing needs require lower latency, higher bandwidth memory for training. Delivering the highest per device bandwidth of any available memory, HBM3 has become the memory of choice for AI training hardware.

With its unique 2.5D/3D architecture, HBM memory offers significantly higher bandwidth when compared to traditional DDR-based memories, resulting in faster data access and processing vital for AI training tasks. HBM is also extremely power efficient given its position in relation to the GPU/CPU, and its compact form factor offers many benefits for devices where space is at a premium.

The Rambus HBM3 Memory Controller delivers a market-leading data rate of 9.6 Gigabits per second (Gb/s), supporting the continued evolution of HBM3 beyond the top specification speed of 6.4 Gb/s. The interface features 16 independent channels, each containing 64 bits, for a total data width of 1024 bits. At the 9.6 Gb/s data rate, this provides a total interface bandwidth of 1228.8 GB/s, or in other words, over 1.23 Terabytes per second (TB/s) of memory throughput! HBM3 memory solutions are evolving in the market, and the Rambus HBM3 Memory Controller supports this trend as HBM scales to new performance levels.

Want to dive into some of the benefits of HBM3 memory in more detail? Check out our new “HBM3: Everything You Need to Know” blog.

Powering the Next Wave of AI Inference with the Rambus GDDR6 PHY at 24 Gb/s

Rambus is, once again, leading the way in memory performance solutions with today’s announcement that the Rambus GDDR6 PHY now reaches performance of up to 24 Gigabits per second (Gb/s), the industry’s highest data rate for GDDR6 memory interfaces!

AI/ML inference models are growing rapidly in both size and sophistication, and because of this we are seeing increasingly powerful hardware deployed at the network edge and in endpoint devices. For inference, memory throughput speed and low latency are critical. GDDR6 memory offers an impressive combination of bandwidth, capacity, latency and power that makes it ideal for these applications.

The GDDR6 interface supports 2 channels, each with 16 bits for a total data width of 32 bits. With speeds up to 24 Gb/s per pin, the Rambus GDDR6 PHY offers a maximum bandwidth of up to 96 GB/s. This represents a 50% increase in available bandwidth, compared with the previous generation 16G GDDR6 PHY.
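To make the 50% figure concrete, here is a short sketch (values taken from the paragraph above) comparing the new 24 Gb/s PHY with the previous-generation 16 Gb/s PHY over the 32-bit GDDR6 interface just described:

```python
GDDR6_WIDTH_BITS = 32  # 2 channels x 16 bits, as described above

def gddr6_bandwidth_gb_s(pin_rate_gbps: float) -> float:
    """Peak GDDR6 bandwidth in GB/s for a given per-pin data rate."""
    return pin_rate_gbps * GDDR6_WIDTH_BITS / 8

prev_gen = gddr6_bandwidth_gb_s(16.0)  # 64 GB/s with the earlier 16G PHY
new_gen = gddr6_bandwidth_gb_s(24.0)   # 96 GB/s at 24 Gb/s per pin
print(f"16 Gb/s PHY: {prev_gen:.0f} GB/s, 24 Gb/s PHY: {new_gen:.0f} GB/s "
      f"(+{(new_gen / prev_gen - 1) * 100:.0f}%)")
```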

Of course, hitting such high data rates also comes with some challenges. Maintaining signal integrity (SI) at speeds of 24 Gb/s, particularly at lower voltages, requires significant expertise. Designers face tighter timing and voltage margins, while the number of sources of loss, and their effects, rises rapidly. This is where the long-standing Rambus expertise in SI comes in and allows customers to maintain the SI of their system, even at these new 24G data rates.

Check out our “From Data Center to End Device: AI/ML Inference with GDDR6” white paper for a detailed look at GDDR6 memory capabilities and discover why it is ideally suited to meet the challenges of AI inference applications.

Accelerating AI/ML applications in the data center with HBM3

Semiconductor Engineering Editor in Chief Ed Sperling recently spoke with Frank Ferro, Senior Director of Product Management at Rambus, about accelerating AI/ML applications in the data center with HBM3. Introduced by JEDEC in early 2022, the latest iteration of the high bandwidth memory standard increases the per-pin data rate to 6.4 Gigabits per second (Gb/s), double that of HBM2.

HBM3 maintains the 1024-bit wide interface of previous generations—while extending the track record of bandwidth performance set by what was originally dubbed the “slow and wide” HBM memory architecture. Since bandwidth is the product of data rate and interface width, 6.4 Gb/s x 1024 enables 6,554 Gb/s. Dividing by 8 bits/byte yields a total bandwidth of 819 Gigabytes per second (GB/s).

HBM3 also supports 3D DRAM devices of up to 12-high stacks—with provision for a future extension to as high as 16 devices per stack—for device densities of up to 32Gb. In real-world terms, a 12-high stack of 32Gb devices translates to a single HBM3 DRAM device of 48GB capacity. Moreover, HBM3 doubles the number of memory channels to 16 and supports 32 virtual channels (with two pseudo-channels per channel). With more memory channels, HBM3 can support higher stacks of DRAM per device and finer access granularity.
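A quick sanity check on those capacity and channel figures: device capacity is just the per-die density times the stack height, converted to bytes. A minimal sketch (the 16-high case reflects the future extension mentioned above):

```python
def stack_capacity_gb(die_density_gbit: int, stack_height: int) -> float:
    """Total device capacity in GB for a 3D stack of DRAM dies."""
    return die_density_gbit * stack_height / 8  # 8 bits per byte

print(f"12-high stack of 32Gb dies: {stack_capacity_gb(32, 12):.0f} GB")  # 48 GB
print(f"16-high stack of 32Gb dies: {stack_capacity_gb(32, 16):.0f} GB")  # 64 GB (future extension)

channels = 16                    # HBM3 doubles the channel count to 16
pseudo_channels = channels * 2   # two pseudo-channels per channel
print(f"HBM3: {channels} channels, {pseudo_channels} pseudo-channels")
```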

Eliminating memory bandwidth bottlenecks

“HBM3 is all about bandwidth,” says Ferro. “There are many high-end accelerator cards going into the data center for AI [applications], particularly AI training. A lot of these systems have a good [number] of processors—but you’ve got to keep these processors fed [which means] memory bandwidth is now the bottleneck.”

To highlight IP requirements and potential design choices for the next generation of HBM3-based silicon, Ferro sketches a generic AI accelerator model with purpose-built processors running a neural network.

“You’ve got a processor—probably multiple processors—and these must get fed from memory. So, when you’re doing for example, image recognition training, you’ve got to put lots of data into the system [to enable high-accuracy inference],” he elaborates. “Clearly, you need a lot of memory bandwidth and that’s really where HBM3 comes into the picture. Although HBM2 and HBM2E [offer] very high bandwidth, processors still need to get fed with [even] more data.”

According to Ferro, memory is currently one of the most critical bottlenecks in the data center, especially for AI/ML applications.

“If you look at the data sets for AI, they’re just growing at exponential rates,” says Ferro. “Data increases from month-to-month and puts a lot of pressure on the memory side.”

Balancing price, performance, and power

As Ferro points out, requirements for specific workloads—such as image processing, financial modeling, and pharmaceutical simulations—play a major role in influencing the design of AI accelerators.

“In the picture above, I’m showing two HBM3 memory devices, a configuration that will provide 1.6 terabytes [per second] of performance. If you’re doing genome sequencing or financial transactions, you may need more—or less—bandwidth [depending on workload],” he explains. “So, you might add two more HBMs to double that bandwidth even further. We’ve even seen systems that go up to eight HBMs. The basic architecture [remains] the same, although you’re tuning the system from an optimization standpoint.”

Additional design considerations include power and cost. As Ferro points out, HBM3 improves energy efficiency by dropping operating voltage to 1.1V and leveraging low-swing 0.4V signaling.

“You’re going to want to tune and balance the system to efficiently meet application [requirements] while staying within your cost and power budgets,” he adds.

To effectively determine tradeoffs that balance price, performance, and power, Ferro recommends that system designers first gauge memory processing requirements and then select an optimal implementation. For example, if only a terabyte per second of bandwidth is needed, perhaps a single HBM2E memory device will suffice. If the application demands more bandwidth, multiple HBM3 devices will likely be a better fit.
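One way to frame that tradeoff is to work backwards from a target bandwidth to a device count. A rough sizing sketch, using the per-device bandwidth figures from the HBM generation table earlier in this archive (the actual choice also depends on power, cost, and capacity, as discussed above):

```python
import math

PER_DEVICE_GB_S = {"HBM2E": 461, "HBM3": 819}  # per-device bandwidth, from the generation table above

def devices_needed(target_gb_s: float, generation: str) -> int:
    """How many HBM devices of a given generation are needed to reach a bandwidth target."""
    return math.ceil(target_gb_s / PER_DEVICE_GB_S[generation])

for target in (800, 1600, 3200):  # bandwidth targets in GB/s
    print(f"{target} GB/s -> HBM2E x {devices_needed(target, 'HBM2E')}, "
          f"HBM3 x {devices_needed(target, 'HBM3')}")
```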

PCIe 6 and chiplets

As Ferro notes, PCIe will also play a major role in influencing future AI accelerator designs. Indeed, PCIe 5 offers a transfer rate of 32 Gigatransfers per second (GT/s) per pin, while PCIe 6 will double this rate to 64 GT/s.

“You’ve got to look at how much data you will be bringing into the system, how much data you’re bringing out, and how these processors need to get fed,” he elaborates. “For example, you can [potentially] partition some [workloads] dynamically, so if you decide to split it into multiple jobs—because a lot of this is happening in parallel—maybe you don’t [need to] use all of that bandwidth [or] processing power [for a single task], although you can do multiple things at once.”

According to Ferro, minimizing die size is also an important consideration, especially for HBM implementations. This is one reason the semiconductor industry is eyeing chiplets for AI accelerators, as the technology enables system designers to mix and match different components based on specific workload requirements, shrink overall die size, and reduce costs.

“[With chiplets], you can potentially go with a cheaper process node for the I/O controller, for example, but if you need the most advanced process node for your processor, you can [do so while] balancing overall system cost,” he adds.

Rambus Design Summit Interview Series: Steven Woo

Rambus Fellow, Steven Woo, returns to the Rambus Design Summit stage tomorrow, and we are so excited for his keynote: Advancing Computing in the Accelerator Age! In our last interview before the show, we met with Steven to chat about his background, CXL, and some of the biggest challenges for computing in the years ahead.

Read on for Steven’s full interview and don’t forget to register for Rambus Design Summit, happening tomorrow!

Register for Rambus Design Summit!

Question: Can you tell us a bit about your background?
Steven: My background is in computer architecture, and I’ve done research work in multiprocessor architectures, parallel programming, and neural networks. I’ve always been interested in improving the performance of computer systems, and memory systems are critical to faster computing. I’ve led and worked on several projects here at Rambus pushing DRAM and memory performance in PCs and servers, domain-specific architectures for applications like machine learning, and advanced architectures for near-data processing.

Question: What are you working on at Rambus these days?
Steven: I’m currently working in Rambus Labs, the research organization within Rambus, where I lead a team of senior architects chartered with developing innovations for future DRAMs and memory systems. We get to work on longer-term research projects as well as with our business units on nearer-term programs. There are a lot of interesting challenges for future memory systems, and we’re working on solutions that apply to data centers, mobile computing, and high-performance systems.

Question: CXL is such an exciting emerging technology – how do you see that impacting the future of data center architecture?
Steven: CXL is one of the most disruptive technologies that’s happened over the last 20 years. It will support emerging datacenter usage models by providing a cache-coherent interconnect for processors and accelerators, as well as memory expansion for applications that process large amounts of data. CXL will ultimately enable higher performance and improved resource sharing, reducing overall cost of ownership.

Question: What do you think are the biggest challenges for computing in the years ahead?
Steven: As the world’s digital data continues to increase, new innovations are needed so that processing can keep up.  With performance increasingly limited by data movement, the industry must focus on faster and more power-efficient interconnects and memory systems. Applications and usage models are changing, so system architectures must continue to evolve as well. Accelerators offer new ways to process data more quickly, and resource disaggregation enables higher resource utilization and improved cost of ownership that will influence the direction of computing architectures in the coming years.

AI Accelerates HBM Momentum

In a recent EE Times article, Gary Hilson notes that high bandwidth memory (HBM) deployments are becoming more mainstream due to the massive growth and diversity in artificial intelligence (AI) applications.

“HBM is [now] less than niche. It’s even become less expensive, but it’s still a premium memory and requires expertise to implement,” writes Hilson. “As a memory interface for 3D-stacked DRAM, HBM achieves higher bandwidth while using less power in a form factor that’s significantly smaller than DDR4 or GDDR5 by stacking as many as eight DRAM dies with an optional base die which can include buffer circuitry and test logic.”

According to Jim Handy, principal analyst with Objective Analysis, GPUs and AI accelerators have an “unbelievable hunger” for bandwidth and HBM gets them where they want to go.

“The applications where HBM is being used need so much computing power that HBM is really the only way to do it,” Handy tells the publication. “If you tried doing it with DDR, you’d end up having to have multiple processors instead of just one to do the same job, and the processor cost would end up more than offsetting what you saved in the DRAM.”

Early HBM3 hardware will reportedly be capable of ~1.4x more bandwidth than HBM2E. As the standard evolves, this number is expected to increase to ~1.075TB/s of memory bandwidth per stack, with maximum I/O transfer rates of up to 8.4Gbps. This means that a four-stack HBM3 solution running at 665GB/s per stack will deliver a total bandwidth of ~2.7TB/s.

As Hilson emphasizes, moving to HBM3 requires careful planning and expertise, which is why Avery Design Systems is creating a streamlined ecosystem for design and verification to make HBM3 adoption as easy as possible. In late 2021, Avery announced that Rambus would use Avery’s HBM3 memory model to verify its HBM3 PHY and controller subsystem.

The Rambus HBM3-ready memory interface consists of a fully integrated physical layer (PHY) and digital memory controller, the latter drawing on technology from the company’s recent acquisition of Northwest Logic. The subsystem supports data rates of up to 8.4 Gbps and delivers as much as 1 terabyte per second of bandwidth, thereby doubling the performance of high-end HBM2E memory subsystems.

“People are starting to move from architecting for HBM3 to starting chip implementation,” states Chris Browy, VP of Sales and Marketing at Avery Design Systems. “Now that there are more AI chips coming online and the competition is fierce, everybody’s looking to take advantage of the latest memory architectures.”

According to Frank Ferro, Rambus Senior Director of Product Marketing for IP Cores, the neural networks in AI applications require a significant amount of data both for processing and training—with training sets alone increasing 10x per year.

For AI training and high-performance applications, says Ferro, HBM3 can deliver more than one terabyte per second with two DRAM stacks. With four DRAM stacks, this number increases to 3.2 terabytes per second, offering significant processing power for AI and high-performance computing applications. In addition, HBM3 delivers better power and area efficiency, as the DRAM stack and SoC are placed in a single package substrate.

Interested in learning more? The full text of “AI Expands HBM Footprint” by Gary Hilson is available on EE Times and the Rambus HBM3 PHY solution page can be viewed here.
