Memory PHYs Archives - Rambus

At Rambus, we create cutting-edge semiconductor and IP products, providing industry-leading chips and silicon IP to make data faster and safer.

Powering the Next Wave of AI Inference with the Rambus GDDR6 PHY at 24 Gb/s

Rambus is once again leading the way in memory performance solutions with today's announcement that the Rambus GDDR6 PHY now reaches data rates of up to 24 Gigabits per second (Gb/s), the industry's highest for GDDR6 memory interfaces!

AI/ML inference models are growing rapidly in both size and sophistication, and because of this we are seeing increasingly powerful hardware deployed at the network edge and in endpoint devices. For inference, memory throughput speed and low latency are critical. GDDR6 memory offers an impressive combination of bandwidth, capacity, latency and power that makes it ideal for these applications.

The GDDR6 interface supports two channels of 16 bits each, for a total data width of 32 bits. With speeds of up to 24 Gb/s per pin, the Rambus GDDR6 PHY delivers a maximum bandwidth of up to 96 GB/s, a 50% increase over the previous-generation 16 Gb/s GDDR6 PHY.
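
For readers who want to check the arithmetic, here is a minimal back-of-the-envelope sketch (illustrative Python, not from the original post) showing how the 96 GB/s figure and the 50% uplift follow from the interface width and per-pin data rate:

```python
# Back-of-the-envelope GDDR6 bandwidth check (illustrative only).
CHANNELS = 2           # GDDR6 defines two independent channels
BITS_PER_CHANNEL = 16  # 16 data bits per channel -> 32 bits total
BITS_PER_BYTE = 8

def gddr6_bandwidth_gbytes(data_rate_gbps: float) -> float:
    """Peak device bandwidth in GB/s for a given per-pin data rate in Gb/s."""
    total_pins = CHANNELS * BITS_PER_CHANNEL
    return data_rate_gbps * total_pins / BITS_PER_BYTE

bw_24g = gddr6_bandwidth_gbytes(24.0)  # 24 Gb/s/pin * 32 pins / 8 = 96.0 GB/s
bw_16g = gddr6_bandwidth_gbytes(16.0)  # 16 Gb/s/pin * 32 pins / 8 = 64.0 GB/s
print(f"24 Gb/s PHY: {bw_24g:.0f} GB/s, uplift over 16 Gb/s: {bw_24g / bw_16g - 1:.0%}")
```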

Of course, hitting such high data rates also comes with challenges. Maintaining signal integrity (SI) at 24 Gb/s, particularly at lower voltages, requires significant expertise: timing and voltage margins tighten, and the number of loss sources and the magnitude of their effects both rise rapidly. This is where Rambus' long-standing SI expertise comes in, allowing customers to maintain the signal integrity of their systems even at these new 24 Gb/s data rates.

Check out our “From Data Center to End Device: AI/ML Inference with GDDR6” white paper for a detailed look at GDDR6 memory capabilities and discover why it is ideally suited to meet the challenges of AI inference applications.

Accelerating AI/ML applications in the data center with HBM3

Semiconductor Engineering Editor in Chief Ed Sperling recently spoke with Frank Ferro, Senior Director of Product Management at Rambus, about accelerating AI/ML applications in the data center with HBM3. Introduced by JEDEC in early 2022, the latest iteration of the High Bandwidth Memory standard increases the per-pin data rate to 6.4 Gigabits per second (Gb/s), double the 3.2 Gb/s of HBM2E.

HBM3 maintains the 1024-bit wide interface of previous generations—while extending the track record of bandwidth performance set by what was originally dubbed the “slow and wide” HBM memory architecture. Since bandwidth is the product of data rate and interface width, 6.4 Gb/s x 1024 enables 6,554 Gb/s. Dividing by 8 bits/byte yields a total bandwidth of 819 Gigabytes per second (GB/s).

HBM3 also supports 3D DRAM stacks of up to 12 die high—with provision for a future extension to as many as 16 die per stack—and individual die densities of up to 32Gb. In real-world terms, a 12-high stack of 32Gb die translates to a single HBM3 DRAM device with 48GB of capacity. Moreover, HBM3 doubles the number of memory channels to 16 and splits each channel into two pseudo-channels, for 32 pseudo-channels in total. With more memory channels, HBM3 can support taller DRAM stacks per device and finer access granularity.
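
As a quick sanity check on the numbers in the two paragraphs above, the following sketch (illustrative Python, not part of the original article) reproduces the 819 GB/s per-stack bandwidth and 48 GB per-stack capacity figures:

```python
# Illustrative HBM3 per-stack arithmetic (figures from the article above).
BITS_PER_BYTE = 8

def hbm3_stack_bandwidth_gbytes(pin_rate_gbps: float = 6.4, width_bits: int = 1024) -> float:
    """Peak bandwidth of one HBM3 stack in GB/s: data rate x interface width."""
    return pin_rate_gbps * width_bits / BITS_PER_BYTE

def hbm3_stack_capacity_gbytes(die_density_gbit: int = 32, stack_height: int = 12) -> float:
    """Capacity of one HBM3 stack in GB: per-die density x number of die."""
    return die_density_gbit * stack_height / BITS_PER_BYTE

print(f"{hbm3_stack_bandwidth_gbytes():.1f} GB/s per stack")  # 6.4 * 1024 / 8 = 819.2
print(f"{hbm3_stack_capacity_gbytes():.0f} GB per stack")     # 32 * 12 / 8   = 48
```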

Eliminating memory bandwidth bottlenecks

“HBM3 is all about bandwidth,” says Ferro. “There are many high-end accelerator cards going into the data center for AI [applications], particularly AI training. A lot of these systems have a good [number] of processors—but you’ve got to keep these processors fed [which means] memory bandwidth is now the bottleneck.”

To highlight IP requirements and potential design choices for the next generation of HBM3-based silicon, Ferro sketches a generic AI accelerator model with purpose-built processors running a neural network.

“You’ve got a processor—probably multiple processors—and these must get fed from memory. So, when you’re doing for example, image recognition training, you’ve got to put lots of data into the system [to enable high-accuracy inference],” he elaborates. “Clearly, you need a lot of memory bandwidth and that’s really where HBM3 comes into the picture. Although HBM2 and HBM2E [offer] very high bandwidth, processors still need to get fed with [even] more data.”

According to Ferro, memory is currently one of the most critical bottlenecks in the data center, especially for AI/ML applications.

“If you look at the data sets for AI, they’re just growing at exponential rates,” says Ferro. “Data increases from month-to-month and puts a lot of pressure on the memory side.”

Balancing price, performance, and power

As Ferro points out, requirements for specific workloads—such as image processing, financial modeling, and pharmaceutical simulations—play a major role in influencing the design of AI accelerators.

“In the picture above, I’m showing two HBM3 memory devices, a configuration that will provide 1.6 terabytes [per second] of bandwidth. If you’re doing genome sequencing or financial transactions, you may need more—or less—bandwidth [depending on workload],” he explains. “So, you might add two more HBMs to double that bandwidth even further. We’ve even seen systems that go up to eight HBMs. The basic architecture [remains] the same, although you’re tuning the system from an optimization standpoint.”

Additional design considerations include power and cost. As Ferro points out, HBM3 improves energy efficiency by dropping operating voltage to 1.1V and leveraging low-swing 0.4V signaling.

“You’re going to want to tune and balance the system to efficiently meet application [requirements] while staying within your cost and power budgets,” he adds.

To effectively determine tradeoffs that balance price, performance, and power, Ferro recommends that system designers first gauge memory processing requirements and then select an optimal implementation. For example, if the application needs only a few hundred gigabytes per second of bandwidth, a single HBM2E memory device may suffice. If it demands more bandwidth, multiple HBM3 devices will likely be a better fit.
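
To make that sizing exercise concrete, here is a hypothetical helper (illustrative Python, assuming per-stack peak bandwidths of roughly 410 GB/s for HBM2E and 819 GB/s for HBM3, figures cited elsewhere in this archive) that estimates how many stacks a given bandwidth target implies:

```python
import math

# Assumed peak per-stack bandwidth in GB/s (HBM2E: ~410, HBM3: ~819).
STACK_BANDWIDTH_GBS = {"HBM2E": 410.0, "HBM3": 819.0}

def stacks_needed(target_gbs: float, generation: str) -> int:
    """Minimum number of HBM stacks of the given generation to hit a bandwidth target."""
    return math.ceil(target_gbs / STACK_BANDWIDTH_GBS[generation])

# Example: a 1.6 TB/s accelerator needs 4 HBM2E stacks but only 2 HBM3 stacks.
for gen in ("HBM2E", "HBM3"):
    print(gen, stacks_needed(1600.0, gen))
```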

PCIe 6 and chiplets

As Ferro notes, PCIe will also play a major role in influencing future AI accelerator designs. Indeed, PCIe 5 offers a transfer rate of 32 gigatransfers per second (GT/s) per lane, while PCIe 6 will double this rate to 64 GT/s.

“You’ve got to look at how much data you will be bringing into the system, how much data you’re bringing out, and how these processors need to get fed,” he elaborates. “For example, you can [potentially] partition some [workloads] dynamically, so if you decide to split it into multiple jobs—because a lot of this is happening in parallel—maybe you don’t [need to] use all of that bandwidth [or] processing power [for a single task], although you can do multiple things at once.”

According to Ferro, minimizing die size is also an important consideration, especially for HBM implementations. This is one reason the semiconductor industry is eyeing chiplets for AI accelerators, as the technology enables system designers to mix and match different components based on specific workload requirements, shrink overall die size, and reduce costs.

“[With chiplets], you can potentially go with a cheaper process node for the I/O controller, for example, but if you need the most advanced process node for your processor, you can [do so while] balancing overall system cost,” he adds.

Rambus Design Summit Interview Series: Steven Woo

Rambus Fellow Steven Woo returns to the Rambus Design Summit stage tomorrow, and we are excited for his keynote: Advancing Computing in the Accelerator Age! In our last interview before the show, we met with Steven to chat about his background, CXL, and some of the biggest challenges for computing in the years ahead.

Read on for Steven’s full interview and don’t forget to register for Rambus Design Summit, happening tomorrow!

Register for Rambus Design Summit!

Question: Can you tell us a bit about your background?
Steven: My background is in computer architecture, and I’ve done research work in multiprocessor architectures, parallel programming, and neural networks. I’ve always been interested in improving the performance of computer systems, and memory systems are critical to faster computing. I’ve led and worked on several projects here at Rambus pushing DRAM and memory performance in PCs and servers, domain-specific architectures for applications like machine learning, and advanced architectures for near-data processing.

Question: What are you working on at Rambus these days?
Steven: I’m currently working in Rambus Labs, the research organization within Rambus, where I lead a team of senior architects chartered with developing innovations for future DRAMs and memory systems. We get to work on longer-term research projects as well as with our business units on nearer-term programs. There are a lot of interesting challenges for future memory systems, and we’re working on solutions that apply to data centers, mobile computing, and high-performance systems.

Question: CXL is such an exciting emerging technology – how do you see that impacting the future of data center architecture?
Steven: CXL is one of the most disruptive technologies to emerge over the last 20 years. It will support emerging datacenter usage models by providing a cache-coherent interconnect for processors and accelerators, as well as memory expansion for applications that process large amounts of data. CXL will ultimately enable higher performance and improved resource sharing, reducing overall cost of ownership.

Question: What do you think are the biggest challenges for computing in the years ahead?
Steven: As the world’s digital data continues to increase, new innovations are needed so that processing can keep up.  With performance increasingly limited by data movement, the industry must focus on faster and more power-efficient interconnects and memory systems. Applications and usage models are changing, so system architectures must continue to evolve as well. Accelerators offer new ways to process data more quickly, and resource disaggregation enables higher resource utilization and improved cost of ownership that will influence the direction of computing architectures in the coming years.

Rambus Design Summit Featured Speaker: Frank Ferro

Thanks to everyone who joined us for Rambus Design Summit 2021. Over the coming weeks we’ll highlight the webinars and panels from the event, all available now on-demand.

Watch Selecting the Right High Bandwidth Memory Solution

About Frank Ferro

Frank Ferro is the senior director of product management at Rambus Inc. responsible for memory interface IP products. Having spent more than 20 years at AT&T, Lucent and Agere Systems, he has extensive experience in wireless communications, networking and consumer electronics fields. Mr. Ferro holds an executive MBA from the Fuqua School of Business at Duke University, an M.S. in computer science and a B.S.E.T. in electronic engineering technology from the New Jersey Institute of Technology.

Session Topic: Selecting the Right High Bandwidth Memory Solution

“Today you see CPUs and GPUs running the neural networks…but as these networks mature and become more specialized, we’re seeing new architectures emerging that are going to take advantage of the specific problem they’re trying to solve.”

An exponentially rising tide of data has led to the development of application-specific silicon to tackle the requirements of demanding workloads such as AI/ML training, Advanced Driver Assistance Systems (ADAS) for automotive, networking, graphics and HPC. Keeping these processors and accelerators fed requires state-of-the-art memory solutions that deliver extremely high bandwidth. Frank Ferro will discuss design and implementation considerations of HBM2E and GDDR6 memory subsystems to address the bandwidth needs of next-generation computing applications.

View this session on-demand here!

451 Research Report: Interconnecting AnalogX and PLDA with Rambus

Compute Express Link (CXL) will enable memory expansion and pooling. Memory pooling with CXL 2.0 allows workloads to be matched to the available memory in the pool, leading to improved performance, higher memory utilization, and better TCO. This emerging technology will help revolutionize the future of data center architectures.

In a report titled “Rambus buys into CXL interconnect ecosystem with two new deals” (June 2021), John Abbott, principal research analyst at 451 Research, part of S&P Global Market Intelligence, outlines Rambus’ recently announced acquisitions of AnalogX and PLDA.

He notes, “The additions of PLDA and AnalogX will speed up Rambus’ initiative to enable memory expansion and pooling in disaggregated infrastructure through the emerging Compute Express Link (CXL) interconnect ecosystem. CXL 2.0 is a PCIe-based technology intended to make it easier to connect CPUs with memory and specialist accelerators, and to separate memory from physical servers to improve memory bandwidth, capacity and efficiency.”

Read the report here.

 

Additional Resources:

Rambus CXL Memory Interconnect Initiative
CXL White Paper: “Enabling a New Era of Data Center Architecture”
Press Release: “Rambus Advances New Era of Data Center Architecture with CXL™ Memory Interconnect Initiative”

Stacking memory for AI/ML training with HBM2E

Frank Ferro, Senior Director of Product Management at Rambus, recently penned an article for Semiconductor Engineering that takes a closer look at high bandwidth memory (HBM) and its 2.5D (stacked) architecture for AI/ML training. As Ferro notes, the influence of AI/ML grows daily, touching nearly every industry across the globe.

“In marketing, healthcare, retail, transportation, manufacturing and more, AI/ML is a catalyst for great change,” he explains. “This rapid advance is powerfully illustrated by the growth in AI/ML training capabilities, which have grown by a factor of 10X every year since 2012.”

According to Ferro, AI/ML neural network training models can currently exceed 10 billion parameters, although this number will soon jump to over 100 billion. This is made possible by enormous gains in computing power thanks to Moore’s Law and Dennard scaling.

“At some point, however, the trend line of processing power, doubling every two years, would be overtaken by one that doubles every three-and-a-half months,” he elaborates. “That point is now. To make matters worse, Moore’s Law is slowing, and Dennard scaling has stopped, at a time when arguably we need them most.”

With no slackening in demand, says Ferro, it will take improvements in every aspect of computer hardware and software to stay on pace.

“Among these, memory capacity and bandwidth will be critical areas of focus to enable the continued growth of AI. If we can’t continue to scale down (via Moore’s Law), then we’ll have to scale up,” he states. “[This is precisely why] the industry has responded with 3D-packaging of DRAM in JEDEC’s High Bandwidth Memory (HBM) standard. By scaling in the Z-dimension, we can realize a significant increase in capacity.”

As Ferro points out, the latest iteration of HBM – HBM2E – supports 12-high stacks of DRAM with memory capacities of up to 24 GB per stack. However, that greater capacity would be of little use to AI/ML training without rapid access. As such, the HBM2E interface provides bandwidth of up to 410 GB/s per stack. In real-world terms, this means an implementation with four stacks of HBM2E memory can deliver nearly 100 GB of capacity at an aggregate bandwidth of 1.6 TB/s.
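
The four-stack figures follow directly from the per-stack numbers; a minimal sketch (illustrative Python, assuming the 3.2 Gb/s per-pin rate and 1,024-bit data interface quoted later in this article) shows the arithmetic:

```python
# Illustrative HBM2E system arithmetic (per-stack figures from the article).
BITS_PER_BYTE = 8
PIN_RATE_GBPS = 3.2     # per-pin data rate (quoted below)
INTERFACE_BITS = 1024   # data pins per stack (quoted below)
STACK_CAPACITY_GB = 24  # 12-high stack capacity

stack_bw_gbs = PIN_RATE_GBPS * INTERFACE_BITS / BITS_PER_BYTE  # 409.6 GB/s per stack
stacks = 4
print(f"capacity: {stacks * STACK_CAPACITY_GB} GB")            # 96 GB (~100 GB)
print(f"bandwidth: {stacks * stack_bw_gbs / 1000:.2f} TB/s")   # ~1.64 TB/s
```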

Ferro also emphasizes that with AI/ML accelerators deployed in hyperscale data centers, it is critical to take heat dissipation issues and power constraints into consideration.

“HBM2E provides very power-efficient bandwidth by running a ‘wide and slow’ interface. Slow, at least in relative terms: HBM2E operates at up to 3.2 Gbps per pin. Across a wide interface of 1,024 data pins, the 3.2 Gbps data rate yields a bandwidth of 410 GB/s,” he explains. “To the data add clock, power management and command/address, and the number of ‘wires’ in the HBM interface grows to about 1,700.”

Since this is far more than can be supported on a standard PCB, a silicon interposer is used as an intermediary to connect memory stack(s) and processor. In simple terms, it is the use of a silicon interposer that makes this a 2.5D architecture. As with an IC, finely spaced traces can be etched in the silicon interposer to achieve the number of wires needed for the HBM interface.

“With 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small footprint. In data center environments, where physical space is increasingly constrained, HBM2E’s compact architecture offers tangible benefits,” he continues. “Further, by keeping data rates relatively low, and the memory close to the processor, overall system power is kept low.”

According to Ferro, HBM2E memory delivers what AI/ML training needs, with high bandwidth, high capacity, compactness, and power efficiency. But there is a catch: the design trade-offs with HBM are increased complexity and cost. More specifically, the silicon interposer is an additional element that must be designed, characterized, and manufactured.

“3D stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories,” he states. “[As a result,] implementation and manufacturing costs are higher for HBM2E than for a high-performance memory built using traditional manufacturing methods such as GDDR6 DRAM.”

Nevertheless, overcoming complexity through innovation is what the semiconductor industry has done time and again to push computing performance to new heights. With AI/ML, the economic benefits of accelerating training runs are enormous, not only because of better utilization of training hardware, but also because of the value created when trained models are deployed in inference engines across millions of AI-powered devices.

In addition, says Ferro, designers can greatly mitigate the challenges of higher complexity with their choice of IP supplier.

“Integrated solutions such as the HBM2E memory interface from Rambus ease implementation and provide a complete memory interface sub-system consisting of co-verified PHY and digital controller,” he explains. “Further, Rambus has extensive experience in interposer design with silicon-proven HBM/HBM2 implementations benefiting from Rambus’ mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities.”

As Ferro observes, the progress of AI/ML has been breathtaking in recent years, and improvements to every aspect of computing hardware and software will be needed to keep this scorching pace on track.

“For memory, AI/ML training demands bandwidth, capacity and power efficiency all in a compact footprint. HBM2E memory, using a 2.5D architecture, answers AI/ML training’s call for ‘all of the above’ performance,” he concludes.

Powering the Next Wave of AI Applications

Artificial Intelligence/Machine Learning (AI/ML) is growing at a blistering pace. The size of the largest training models has passed 100 billion parameters and is on pace to hit a trillion in the next year. The impact of AI/ML is being felt across the industry landscape, in higher education, and in financial markets. Underpinning this growth is the rapid advancement of computer hardware technology, with specific emphasis on AI/ML-tailored memory solutions that provide extremely high bandwidth. Check out this new infographic that captures some of the high-level trends and highlights two high-performance memories, HBM2E and GDDR6 DRAM, that are powering the next wave of AI applications.

Powering the Next Wave of AI Applications - Infographic

HBM2E targets AI/ML training

Frank Ferro, Senior Director of Product Management at Rambus, has written a detailed article for Semiconductor Engineering that explains why HBM2E is a perfect fit for Artificial Intelligence/Machine Learning (AI/ML) training. As Ferro points out, AI/ML growth and development are proceeding at a lightning pace. Indeed, AI training capabilities have jumped by a factor of 300,000 (10X annually) over the past 8 years. This trend continues to drive rapid improvements in nearly every area of computing, including memory bandwidth.

HBM: A Need for Speed

Introduced in 2013, High Bandwidth Memory (HBM) is a high-performance 3D-stacked SDRAM architecture.

“Like its predecessor, the second generation HBM2 specifies up to 8 memory die per stack, while doubling pin transfer rates to 2 Gbps,” Ferro explains. “HBM2 achieves 256 GB/s of memory bandwidth per package (DRAM stack), with the HBM2 specification supporting up to 8 GB of capacity per package.”

As Ferro notes, JEDEC announced the HBM2E specification in late 2018 to support increased bandwidth and capacity.

“With transfer rates rising to 3.2 Gbps per pin, HBM2E can achieve 410 GB/s of memory bandwidth per stack,” he explains. “In addition, HBM2E supports 12‑high stacks with memory capacities of up to 24 GB per stack.”

As Ferro points out, all versions of HBM run at a relatively low data rate compared to a high-speed memory such as GDDR6.

“High bandwidth is achieved [using] an extremely wide interface. Specifically, each HBM2E stack running at 3.2 Gbps connects to its associated processor through an interface of 1,024 data ‘wires,’” he adds.

With command and address, says Ferro, the number of wires increases to about 1,700, which is far more than can be supported on a standard PCB. As such, a silicon interposer is used as an intermediary to connect memory stack(s) and processor. As with an SoC, finely spaced data traces can be etched in the silicon interposer to achieve the desired number of wires needed for the HBM interface.

HBM2E, Ferro emphasizes, offers the capability to achieve tremendous memory bandwidth. More specifically, four HBM2E stacks connected to a processor can collectively deliver over 1.6 TB/s of bandwidth.

“With 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small footprint. Further, by keeping data rates relatively low, and the memory close to the processor, overall system power is kept low,” he adds.

HBM Design Tradeoffs

Unsurprisingly, the design tradeoffs around HBM are increased complexity and costs. Specifically, says Ferro, the interposer is an additional element that must be designed, characterized, and manufactured.

“3D stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories (including GDDR),” he explains. “The net is that implementation and manufacturing costs are higher for HBM2E than for memory using traditional manufacturing methods such as GDDR6 or DDR4.”

However, Ferro emphasizes, the benefits of HBM2E make it the superior choice for AI training applications.

“The performance is outstanding, and higher implementation and manufacturing costs can be traded off against savings of board space and power,” he elaborates. “In data center environments, where physical space is increasingly constrained, HBM2E’s compact architecture offers tangible benefits. Its lower power translates to lower heat loads for an environment where cooling is often one of the top operating costs.”

For training, says Ferro, bandwidth and capacity are “critical” requirements. This is particularly so given that training capabilities are on pace to double every 3.43 months.
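
To put that doubling rate in perspective, a quick compound-growth calculation (illustrative only, not from the article) shows what doubling every 3.43 months implies over a full year:

```python
# Compound growth implied by a 3.43-month doubling time (illustrative arithmetic).
MONTHS_PER_YEAR = 12
DOUBLING_MONTHS = 3.43

growth_per_year = 2 ** (MONTHS_PER_YEAR / DOUBLING_MONTHS)
print(f"{growth_per_year:.1f}x per year")  # roughly 11x annually
```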

“Training workloads now run over multiple servers to provide the needed processing power – flipping virtualization on its head,” he explains. “Given the value created through training, there is a powerful ‘time-to-market’ incentive to complete training runs as quickly as possible. Furthermore, training applications run in data centers increasingly constrained for power and space, so there is a premium for solutions that offer power efficiency and smaller size.”

Given all these requirements, HBM2E is an ideal memory solution for AI training hardware. It provides excellent bandwidth and capacity capabilities: 410 GB/s of memory bandwidth with 24 GB of capacity for a single 12‑high HBM2E stack. Its 3D structure provides these features in a very compact form factor and at a lower power thanks to a low interface speed and proximity between memory and processor.

According to Ferro, this means designers can both realize the benefits of HBM2E memory and mitigate implementation challenges through their choice of IP supplier.

“Rambus offers a complete HBM2E memory interface sub-system consisting of a co-verified PHY and controller. An integrated interface solution greatly reduces implementation complexity,” he states. “Further, Rambus’ extensive mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities help ensure first-time-right design execution.”

As Ferro concludes, the growth of AI/ML training capabilities requires sustained, across-the-board improvements in both hardware and software to maintain the current pace. As part of this mix, memory is a critical enabler.

“HBM2E memory is an ideal solution, offering bandwidth and capacity at low power in a compact footprint [that] hits all of AI/ML training’s key performance requirements. With a partner like Rambus, designers can harness the capabilities of HBM2E memory to supercharge their next generation of AI accelerators,” he adds.

How PCIe 5 Can Accelerate AI and ML Applications

Rambus’ Suresh Andani has written a detailed Semiconductor Engineering article that explores how PCIe 5 can effectively accelerate AI and ML applications. According to Andani, the rapid adoption of sophisticated artificial intelligence/machine learning (AI/ML) applications and the shift to cloud-based workloads have significantly increased network traffic in recent years. However, the traditional virtualization paradigm can no longer keep up with AI/ML applications and cloud-based workloads that are quickly outpacing server compute capacity.

“AI workloads – including machine learning and deep learning – require a new generation of computing architectures,” he explains. “This is because AI applications generate, move and process massive amounts of data at real time speeds. For example, a smart car generates around 4TB of data per day, while AI and ML training model sizes continue to double approximately every 3-4 months.”

As Andani notes, AI applications across multiple verticals are demanding significant amounts of memory bandwidth to support the processing of extremely large data sets. Moreover, unlike traditional multi-level caching architectures, AI applications require direct and fast access to memory.

“Additional characteristics and requirements of AI-specific applications include parallel computing, low-precision computing and empirical analysis assumption,” he explains. “AI/ML workloads are [also] extremely compute intensive – and they are shifting system architecture from traditional CPU-based computing towards more heterogenous/distributed computing.”

Looking beyond AI/ML applications, says Andani, the conventional data center paradigm is evolving due to the ongoing shift to cloud computing.

“Enterprise workloads are moving to the cloud: 45% were cloud-based in 2017, while over 60% were cloud-based in 2019. As such, data centers are leveraging hyperscale computing and networking to meet the needs of cloud-based workloads,” he elaborates. “Because the economies of scale are driven by increasing the bandwidth per physical unit of space, this new cloud-based model (along with AI/ML applications) is accelerating the adoption of higher speed networking protocols that double in speed approximately every two years: 100GbE -> 200GbE -> 400GbE -> 800GbE.”

The steady march towards 400GbE cloud networking and the evolution of sophisticated AI/ML workloads is pushing the need for doubling the PCIe bandwidth every two years to effectively move data between compute nodes.

“PCIe 5 – with an aggregate link bandwidth of 128 GB/s in a x16 configuration – addresses these demands without ‘boiling the ocean’ as it is built on the proven PCIe framework,” he explains. “Essentially, the PCIe interface is the backbone that moves high-bandwidth data between various compute nodes (CPUs, GPUs, FPGAs, custom-built ASIC accelerators) in a heterogenous compute setup.”

For system designers, says Andani, significant signal integrity experience is required to support the latest networking protocols like 400GbE.

“The performance of SoCs is contingent upon how fast data can be moved in, out and between other components. Because the physical size of SoCs remains approximately constant, bandwidth increases are primarily achieved by increasing the speed (data rate) of data per pin. Issues related to higher speeds – such as loss, crosstalk and reflections – all become more pronounced as data rates increase,” he adds.

As Andani emphasizes, significant increases in speed are required to support AI/ML applications such as massive training models and real-time inference.

“This means that all supporting technologies – such as CPU, memory access bandwidth and interface speeds – need to double every 1-2 years. PCIe 5.0, the latest PCIe standard, represents a doubling over PCIe 4.0: 32 GT/s vs. 16 GT/s, with a x16 link bandwidth of 128 GB/s.”
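
For reference, here is how the headline 128 GB/s figure is typically derived (an illustrative sketch, not from the article; the aggregate number counts traffic in both directions of the link):

```python
# Illustrative PCIe 5.0 x16 bandwidth arithmetic (not from the original article).
GT_PER_S = 32.0          # PCIe 5.0 raw transfer rate per lane
ENCODING = 128 / 130     # 128b/130b line-code efficiency
LANES = 16
BITS_PER_BYTE = 8

per_direction_gbs = GT_PER_S * ENCODING * LANES / BITS_PER_BYTE  # ~63 GB/s each way
aggregate_gbs = 2 * per_direction_gbs                            # ~126 GB/s both ways
print(f"{per_direction_gbs:.0f} GB/s per direction, ~{aggregate_gbs:.0f} GB/s aggregate")
# The commonly quoted 128 GB/s figure is the same calculation ignoring encoding overhead.
```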

To effectively meet the demands of AI/ML applications and cloud-based workloads, says Andani, a PCIe 5.0 interface should be a comprehensive solution built on an advanced process node such as 7nm (FinFET), and it should comprise a co-verified PHY and digital controller. The PCIe 5.0 interface should also support Compute Express Link (CXL) connectivity between the host processor and workload accelerators for heterogenous computing.

“The introduction of CXL (which uses the same transport layer as PCIe 5) provides high-performance computing (HPC) and AI/ML system designers with a low-latency cache-coherent interconnect to virtually unify the system memory across various compute nodes,” he elaborates.

Additional key features and capabilities should include:

  • 32 GT/s bandwidth per lane with 128 GB/s bandwidth in x16 configuration
  • Backward compatibility to PCIe 4.0, 3.0 and 2.0
  • Advanced multi-tap transmitter and receiver equalization to compensate for more than 36 dB of insertion loss

“PCIe 5.0, the latest PCIe standard, represents a doubling over PCIe 4.0: 32 GT/s vs. 16 GT/s, with an aggregate x16 link bandwidth of 128 GB/s. At these speeds, it is important for system designers to have significant signal integrity experience to mitigate loss, crosstalk and reflections,” Andani concludes.

HiPEAC Tech Transfer Award Highlights DRAMSys4.0 Collaboration

HiPEAC, a European network of almost 2,000 world-class computing systems researchers, named Matthias Jung (Fraunhofer IESE), Lukas Steiner, and Norbert Wehn (TUK) as winners of the prestigious Tech Transfer Award for their work on DRAMSys4.0. As an ongoing collaborator with Fraunhofer IESE on DRAMSys4.0, Rambus salutes the award recipients.

“The work Fraunhofer IESE had done in building the initial version of DRAMSys4 well positioned it to meet virtual prototyping challenges. However, certain fidelity, usability, and feature set gaps can only be uncovered in real world use,” says James Tringali, Technical Director at Rambus. “With Rambus helping Fraunhofer IESE to identify and address these gaps, future versions of DRAMSys4 will become even more valuable to the technical community. The HiPEAC Technology Transfer Award is an affirmation of this value.”

DRAMSys4.0 is a flexible and fast DRAM subsystem design exploration framework based on SystemC TLM-2.0. It is designed to address the challenges of different DRAM architectures with respect to applications, performance, power, temperature, and retention errors. DRAMSys can be expected to accelerate the design space exploration of memory systems, especially for the adoption of new memory standards, while reducing time-to-market compared to traditional RTL modeling.

“For Fraunhofer IESE and TUK, the cooperation with Rambus is an important partnership and a huge step towards a broader application and transfer of the DRAMSys tool in companies,” Tringali explains. “Rambus’ efforts to optimize DRAMSys4 will be part of an open-source roadmap that was established when Fraunhofer IESE initially released DRAMSys4. This is key for enabling customers to benefit from our highly vetted innovations and reproduce our results as desired. In this regard, Rambus is accentuating the ‘science’ in computer science.”

As a technology leader in the memory systems design space, says Tringali, Rambus understands and appreciates the critical importance of using high fidelity models to showcase innovations.

“At Rambus, we are well versed in the complex nature of memory system operations and their impacts on system-level performance. In this regard, we understand what it takes to create a flexible, high-fidelity virtual prototyping platform,” Tringali concludes.

Visit DRAMSys4.0 to learn more and to discover further details about this notable award. Congratulations to the winners!
