SerDes PHYs Archives – Rambus

PCIe 6.1 – All you need to know about PCI Express Gen6
https://www.rambus.com/blogs/pcie-6/ | January 23, 2024

[Updated January 23, 2024] The PCI Express® 6.0 (PCIe® 6.0) specification was released by PCI-SIG® in January 2022. This new generation of the ubiquitous PCIe standard brought with it many exciting new features designed to boost performance for compute-intensive workloads including data center, AI/ML and HPC applications. PCIe 6.0 has now evolved to version 6.1 of the standard.

Find out all about PCIe 6.1 in the article below.

What is PCIe 6.1?

Since PCIe 3, each new generation of the standard has seen a doubling in the data rate. PCIe 6.1 boosts the data rate to 64 gigatransfers per second (GT/s), twice that of PCIe 5.0. For a x16 link, which is typical of graphics and network cards, the bandwidth of the link reaches 128 gigabytes per second (GB/s). As in previous generations, the PCIe 6.1 link is full duplex, so it can deliver that 128 GB/s bandwidth in both directions simultaneously for a total bandwidth capacity of 256 GB/s.

PCIe has proliferated widely beyond servers and PCs, with its economies of scale making it attractive for data-centric applications in IoT, automotive, medical and elsewhere. That being said, the initial deployments of PCIe 6.1 will target applications requiring the highest bandwidth possible and those can be found in the heart of the data center: AI/ML, HPC, networking and cloud graphics.

The following table shows the evolution of the PCIe specification over time:

| PCIe Specification | Data Rate per Lane (GT/s) | Encoding | x16 Unidirectional Bandwidth (GB/s) | Specification Ratification Year |
|---|---|---|---|---|
| 1.x | 2.5 | 8b/10b | 4 | 2003 |
| 2.x | 5 | 8b/10b | 8 | 2007 |
| 3.x | 8 | 128b/130b | 15.75 | 2010 |
| 4.0 | 16 | 128b/130b | 31.5 | 2017 |
| 5.0 | 32 | 128b/130b | 63 | 2019 |
| 6.x | 64 | PAM4/FLIT | 128 | 2022 |
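To make the relationship between transfer rate, encoding overhead, and bandwidth concrete, here is a minimal Python sketch (our own illustration, not from the specification) that reproduces the x16 unidirectional bandwidth column above. For PCIe 6.x, framing overhead is handled at the FLIT level rather than by line encoding, so the encoding factor is simply 1.0 here.

```python
# Reproduce the x16 unidirectional bandwidth column of the table above.
# Encoding efficiency is payload bits per transferred bit; for PAM4/FLIT
# the overhead is accounted at the FLIT level, so we use 1.0.
GENERATIONS = {
    # generation: (data rate per lane in GT/s, encoding efficiency)
    "1.x": (2.5, 8 / 10),
    "2.x": (5, 8 / 10),
    "3.x": (8, 128 / 130),
    "4.0": (16, 128 / 130),
    "5.0": (32, 128 / 130),
    "6.x": (64, 1.0),
}

def x16_bandwidth_gbyte_s(rate_gt_s, encoding, lanes=16):
    """Unidirectional bandwidth in GB/s: GT/s x lanes x encoding / 8 bits per byte."""
    return rate_gt_s * lanes * encoding / 8

for gen, (rate, enc) in GENERATIONS.items():
    print(f"PCIe {gen}: {x16_bandwidth_gbyte_s(rate, enc):.2f} GB/s")
# PCIe 1.x: 4.00 ... PCIe 3.x: 15.75 ... PCIe 5.0: 63.02 ... PCIe 6.x: 128.00
```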

What’s new with PCIe 6.1?

To achieve 64 GT/s, PCIe 6.1 introduces new features and innovations:

1. PAM4 Signaling:

On the electrical layer, PCIe 6.1 uses PAM4 signaling (“Pulse Amplitude Modulation with four levels”), which packs 2 bits into each unit interval using four amplitude levels (00, 01, 10, 11). PCIe 5.0 and earlier generations used NRZ modulation, with 1 bit per unit interval and two amplitude levels (0, 1).

[Figure: Comparison of NRZ modulation and PAM4 modulation]

2. Forward Error Correction (FEC)

There are always tradeoffs, and the transition to PAM4 signal encoding brings a significantly higher Bit Error Rate (BER) than NRZ. This prompted the adoption of a Forward Error Correction (FEC) mechanism to mitigate the higher error rate. Fortunately, the PCIe 6.1 FEC is sufficiently lightweight to have minimal impact on latency. It works in conjunction with a strong CRC (Cyclic Redundancy Check) to keep the Link Retry probability under 5×10⁻⁶. This new FEC feature targets an added latency under 2 ns.

While PAM4 signaling is more susceptible to errors, channel loss is unchanged compared to PCIe 5.0: PAM4 carries two bits per symbol, so doubling the data rate to 64 GT/s leaves the symbol rate, and therefore the Nyquist frequency, the same as that of PCIe 5.0’s 32 GT/s NRZ. The reach of PCIe 6.1 signals on a PCB is therefore the same as that of PCIe 5.0.

3. FLIT Mode:

PCIe 6.1 introduces FLIT mode, where packets are organized in Flow Control Units of fixed size, as opposed to the variable sizes of past PCIe generations. The original motivation for FLIT mode was that error correction requires working with fixed-size packets; however, FLIT mode also simplifies data management at the controller level and results in higher bandwidth efficiency, lower latency, and a smaller controller footprint. Let’s address bandwidth efficiency for a minute: with fixed-size packets, the framing of packets at the Physical Layer is no longer needed, a 4-byte savings for every packet. FLIT encoding also does away with the 128b/130b encoding and DLLP (Data Link Layer Packet) overhead of previous PCIe specifications, resulting in significantly higher TLP (Transaction Layer Packet) efficiency, especially for smaller packets.
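As a rough illustration of the small-packet efficiency gain, the Python sketch below compares per-TLP efficiency under the two schemes. The FLIT layout (a 256-byte flit carrying 236 bytes of TLP data, plus 6 bytes of DLP, 8 bytes of CRC and 6 bytes of FEC) follows the published PCIe 6.x FLIT definition; the non-FLIT model (4-byte framing plus 128b/130b encoding, ignoring DLLPs) is a simplified assumption for illustration only.

```python
# Back-of-envelope TLP efficiency: FLIT mode vs. non-FLIT mode.
# FLIT layout per the PCIe 6.x spec: a 256-byte flit carries 236 bytes of
# TLP data; the remaining 20 bytes are DLP (6), CRC (8) and FEC (6).
FLIT_SIZE, FLIT_TLP_BYTES = 256, 236

def flit_efficiency():
    return FLIT_TLP_BYTES / FLIT_SIZE  # ~92.2%, independent of packet size

def non_flit_efficiency(tlp_bytes, framing_bytes=4):
    """Simplified non-FLIT model: per-packet framing plus 128b/130b encoding.
    (Real links add DLLP and other overhead, so this is optimistic.)"""
    return (tlp_bytes / (tlp_bytes + framing_bytes)) * (128 / 130)

for size in (16, 64, 256):
    print(f"{size}-byte TLP: non-FLIT {non_flit_efficiency(size):.1%}, "
          f"FLIT {flit_efficiency():.1%}")
# Small packets benefit most: a 16-byte TLP goes from ~78.8% to ~92.2%.
```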

4. Other changes in PCIe 6:

  • L0p mode – enabling traffic to run on a reduced number of lanes to save power
  • A new PIPE specification – for the PHY to Controller interface

PCIe 6.1 Fun Fact: the x32 and x12 interface widths from earlier generations are dropped. While defined in PCIe 5.0 and earlier specifications, these widths were never implemented in the market.

Why PCIe 6.1 now?

Before 2015, the PCIe specification stayed comfortably ahead of the bandwidth required by mainstream use cases. Since 2015, global data traffic has exploded. Data centers transitioned to 100G Ethernet (and up), pushing the bottleneck to the PCIe interconnects in servers and network devices.

The PCIe 6.1 specification fully supports the transition to 800G Ethernet in data centers: 800 gigabits per second (Gb/s) requires 100 GB/s of unidirectional bandwidth, which falls within the 128 GB/s envelope of a x16 PCIe 6.1 link; 800G Ethernet, like PCIe, is full duplex. Further, data center general compute and networking are not the sole driving forces behind PCIe 6.1. AI/ML accelerators have an insatiable need for more bandwidth. Processing AI/ML training models is all about speed: the faster accelerators can move data in and out, the more efficiently and cost-effectively training can be executed.
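A quick sanity check of the 800G Ethernet claim, with the unit conversions spelled out (a back-of-envelope sketch that assumes line-rate traffic and ignores protocol overheads):

```python
# Can a x16 PCIe 6.1 link keep up with an 800G Ethernet port?
ethernet_gbyte_s = 800 / 8              # 800 Gb/s = 100 GB/s per direction
pcie6_x16_gbyte_s = 64 * 16 / 8         # 64 GT/s x 16 lanes = 128 GB/s per direction
print(ethernet_gbyte_s <= pcie6_x16_gbyte_s)  # True; both links are full duplex,
                                              # so comparing one direction suffices
```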

Conclusion

PCIe is everywhere in modern computing architectures, and we expect PCIe 6.1 will gain quick adoption in performance-critical applications in AI/ML, HPC, cloud computing and networking.

Rambus offers PCIe 6.1 controller IP, featuring an Integrity and Data Encryption (IDE) engine which provides state-of-the-art security for the PCIe links and the valuable data transferred over them.

PCI Express 5 vs. 4: What’s New? [Everything You Need to Know]
https://www.rambus.com/blogs/pci-express-5-vs-4/ | September 7, 2023

Introduction

What’s new about PCI Express 5 (PCIe 5)? The latest PCI Express standard, PCIe 5, represents a doubling of speed over the PCIe 4.0 specification.

We’re talking about 32 gigatransfers per second (GT/s) vs. 16 GT/s, with an aggregate x16 link duplex bandwidth of almost 128 gigabytes per second (GB/s).

This speed boost is needed to support a new generation of artificial intelligence (AI) and machine learning (ML) applications as well as cloud-based workloads.

Both are significantly increasing network traffic. In turn, this is accelerating the implementation of higher speed networking protocols which are seeing a doubling in speed approximately every two years.

You can find much more about PCIe 5 in the article below.

Table of contents

1. PCI Express: Frequently Asked Questions (FAQ)
2. PCIe 5 – A New Era
3. PCIe 5 vs. PCIe 4 (+Comparison table included)
4. PCIe 5: Applications and Market Adoption
5. Complete PCIe 5 Interface Solutions from Rambus
6. Conclusion


PCI Express: Frequently Asked Questions (FAQ)

Let’s answer five frequently asked questions about PCI Express and PCIe 5.

a. What is PCI Express 5?

PCIe 5 is a high-speed serial computer expansion bus standard that moves data at high bandwidth between multiple components. The preliminary specification was announced in 2017, and the PCIe 5.0 specification was formally released in May of 2019.

You might be wondering why a new PCI Express standard like PCIe 5 is needed. Well, PCIe 5 offers twice the data transfer rate of its PCIe 4 predecessor, delivering 32 GT/s vs. 16 GT/s. This speed increase is critical to support new AI/ML applications and cloud-centric computing.

b. Why both GT/s and GB/s?

GT/s is a measure of raw speed – how many bits can we transfer in a second. The data rate, on the other hand, has to take into consideration the overhead for encoding the signal. Bandwidth is data rate times link width, so encoding overhead’s impact on the data rate translates directly to an impact on bandwidth.

Back in the days of PCIe 2, the encoding scheme was 8b/10b, so there was a hefty overhead penalty for encoding. With such a high overhead, it was particularly useful to have separate measures of transfer rate (x GT/s) and data rate (y Gbps), where “y” was only 80% of “x.”

With Gen 3, and continuing through to the present Gen 5, the PCI Express standard moved to a very efficient 128b/130b encoding scheme, so the overhead penalty is now less than 2%. As such, the link speed and the data rate are roughly the same.

For a PCIe 5 x8 link, 32 GT/s raw speed translates to 31.5 GB/s of bandwidth (we chose a x8 link so we could go straight from bits to bytes). And since PCIe is a duplex link, total aggregate bandwidth rounds to 63 GB/s (32 GT/s x 8 lanes / 8 bits-per-byte x 128/130 encoding x 2 for duplex).

c. What is a PCI Express lane?

So what’s a PCI Express lane? Well, a PCIe lane consists of four wires supporting two differential signaling pairs. One pair transmits data (from A to B), while the other receives data (from B to A). Want to know the best part? Each PCIe lane functions as a full-duplex transceiver, transferring data packets in both directions simultaneously.

d. What does PCIe x16 mean?

We’ve discussed lanes, but what do they have to do with x16? Well, the term “PCIe x16” is used to refer to a 16-lane link instantiated on a board or a card. Physical PCIe links may include 1, 2, 4, 8, 12, 16 or 32 lanes. The 32-lane link is a pretty rare beast, so in practical terms the x16 represents the top end of the PCI Express link options.

e. What is PCI Express used for?

We’ve talked a lot about PCIe 5, but what is PCI Express actually used for?

You can think of the PCIe interface as the system “backbone” that transfers data at high bandwidth between various compute nodes. What’s the bottom line? Put simply, PCIe 5 rapidly moves data between CPUs, GPUs, FPGAs, networking devices and ASIC accelerators using links with various lane widths configured to meet the bandwidth requirements for the linked devices.

PCIe 5 vs. PCIe 4

Here’s a handy by-the-numbers comparison of PCIe 5 vs. PCIe 4 with the actual aggregate (duplex) bandwidth adjusted for the encoding overhead.

Comparison table: PCIe 5 vs PCIe 4 (x16 link, aggregate duplex bandwidth adjusted for encoding overhead)

|  | PCIe 4 | PCIe 5 |
|---|---|---|
| Data rate per lane | 16 GT/s | 32 GT/s |
| Encoding | 128b/130b | 128b/130b |
| x16 aggregate (duplex) bandwidth | ~63 GB/s | ~126 GB/s |
| Specification ratification year | 2017 | 2019 |

PCIe 5: Applications & Market Adoption

AI/ML and Cloud Computing

No surprise, PCIe 5 is the fastest PCI Express ever. While the speed upgrade makes the applications of today run faster, what’s particularly exciting is that PCIe 5 is enabling new applications in markets such as AI/ML and cloud computing.

AI applications generate, move and process massive amounts of data at real-time speeds. An example is a smart car which can generate as much as 4TB of data per day!

But that’s not all: the size of AI/ML training models is doubling every 3-4 months. This torrent of data, and the rapid growth in training models, is putting tremendous stress on every aspect of the compute architecture, with interconnections between devices and systems being of critical importance. Also critical is fast access to memory, as AI/ML workloads are extremely compute intensive.

But while AI/ML is one major megatrend, there are others. Data centers are changing, with enterprise workloads moving to the cloud at a rapid pace. Those applications mean moving more data, often with real-time speed and latency requirements.

This shift to the cloud, along with ever-more sophisticated AI/ML applications, is accelerating the adoption of higher speed networking protocols that are experiencing a doubling in speed about every two years: 100GbE → 200GbE → 400GbE.

Now this is where PCI Express 5 comes in. PCIe 5 delivers duplex link bandwidth of almost 128 GB/s in a x16 configuration. Put simply, PCI Express 5 effectively addresses the demands of AI/ML and cloud computing by supporting higher speed networking protocols as well as higher speed interconnections between system devices.

Complete PCI Express 5 Digital Controller Solutions from Rambus

Rambus offers a highly configurable PCIe 5.0 digital controller.

The Rambus PCIe 5.0 Controller can be paired with 3rd-party PHYs or those developed in house. Rambus can provide integration and verification of the entire interface subsystem.

Conclusion

In “PCI Express 5 vs. 4: What’s New?” we explain how PCI Express is the system backbone that transfers data at high bandwidth between CPUs, GPUs, FPGAs and ASIC accelerators using links of variable lane widths depending on the bandwidth needs of the linked devices.

We also detail how the latest PCI Express standard, PCIe 5, represents a doubling over PCIe 4, with a raw speed of 32 GT/s vs. 16 GT/s translating to a total duplex bandwidth for a x16 link of ~128 GB/s vs. ~64 GB/s.

We then explore how the higher data rates of PCIe 5 enable system designers to support a new generation of cloud computing and AI/ML applications.

Explore more primers:
Hardware root of trust: All you need to know
Side-channel attacks: explained
DDR5 vs DDR4 – All the Design Challenges & Advantages
Compute express link: All you need to know
MACsec Explained: From A to Z
The Ultimate Guide to HBM2E Implementation & Selection

PCIe 6.0 Takes the Spotlight
https://www.rambus.com/blogs/pcie6-takes-the-spotlight/ | June 21, 2023

Written by Frank Ferro and Lou Ternullo

We wrapped up a great PCI-SIG Developers Conference (DevCon) last week which really showed off the strength and momentum of the PCI Express® community. There was great engagement with everyone who stopped by the booth, and we appreciate the time of everyone who had the chance to do so. While PCIe 5.0 just recently reached the market in the latest state-of-the-art server and client systems, the demand for more bandwidth is unrelenting. So, this DevCon was the opportunity to shine the spotlight on the generation for the next wave of computing systems: PCIe® 6.0.

[Photo: Rambus Booth at PCI-SIG Developers Conference 2023]

PCIe 6.0 represents a real watershed event for the standard, because for the first time in its storied history, we’re moving from tried-and-true NRZ to a new signaling scheme, PAM4. With PAM4 signaling (“Pulse Amplitude Modulation with four levels”) you get 2 bits per unit interval across four amplitude levels (00, 01, 10, 11), whereas PCIe 5.0 and earlier generations used NRZ modulation with 1 bit per unit interval and two amplitude levels (0, 1). With PAM4, instead of talking about a clean eye, we need to talk about “three clean eyes” between the four voltage levels.

That’s exactly what we demo’ed at DevCon, with our PCIe 6.0 PHY running at 64 GT/s. In long reach and short reach implementations, we showed off Bit Error Rate (BER) performance that far exceeded the spec. With SI/PI being integral to our engineering DNA, we’ve designed our PHY with the headroom to ensure first-time right implementations for the most demanding applications.

[Photo: Rambus PCIe 6.0 PHY Demo]

Given PAM4’s inherently higher BER compared to NRZ, the PCIe 6.0 standard incorporates Forward Error Correction (FEC) in the controller to mitigate the higher error rate. The PCIe 6.0 FEC is kept lightweight to have minimal impact on latency (under 2 ns). FEC requires fixed-size packets, so away go the variable-size packets of PCIe 5.0 and in come FLITs with PCIe 6.0.

FLIT mode packets are organized in Flow Control Units of fixed size, as opposed to the variable sizes of past PCIe generations. In addition to supporting FEC, FLIT mode also simplifies data management at the controller level and results in higher bandwidth efficiency, lower latency, and a smaller controller footprint.

[Photo: Rambus PCIe 6.0 Digital Controller Demo]

In our PCIe 6.0 Digital Controller demo we showed operation to the full PCIe 6.0 spec, including FLIT mode. We sent Transaction Layer Packets (TLPs) both from Root Port to Endpoint, and Endpoint to Root Port. This demo used Root Port and Endpoint instantiations of the Rambus PCIe 6.0 Controller IP implemented in two internally developed FPGA boards to accommodate the current unavailability of PCIe 6.0 host devices. We utilized our internally developed embedded debugger and logic analyzer, XpressAGENT, to trace and display TL Packets. In compliance with the PCIe 6.0 specification, the controller is backwards compatible with non-FLIT mode NRZ operation when interoperating with PCIe 5.0 and earlier generation devices.

Whether you need a PCIe 6.0 PHY, PCIe 6.0 Digital Controller or a full PCIe 6.0 Interface Subsystem, we’ve got you covered. You can check out all our PCIe IP offerings here and get in touch with us at rambus.com.

Accelerating AI/ML applications in the data center with HBM3
https://www.rambus.com/blogs/accelerating-ai-ml-applications-in-the-data-center-with-hbm3/ | November 3, 2022

Semiconductor Engineering Editor in Chief Ed Sperling recently spoke with Frank Ferro, Senior Director of Product Management at Rambus, about accelerating AI/ML applications in the data center with HBM3. Introduced by JEDEC in early 2022, the latest iteration of the high bandwidth memory standard increases the per-pin data rate to 6.4 Gigabits per second (Gb/s), double that of HBM2.

HBM3 maintains the 1024-bit wide interface of previous generations, extending the track record of bandwidth performance set by what was originally dubbed the “slow and wide” HBM memory architecture. Since bandwidth is the product of data rate and interface width, 6.4 Gb/s x 1024 bits yields roughly 6,554 Gb/s. Dividing by 8 bits/byte gives a total bandwidth of 819 gigabytes per second (GB/s).

HBM3 also supports 3D DRAM stacks of up to 12 dies, with provision for a future extension to as many as 16 dies per stack, and die densities of up to 32Gb. In real-world terms, a 12-high stack of 32Gb dies translates to a single HBM3 DRAM device of 48GB capacity. Moreover, HBM3 doubles the number of memory channels to 16, with two pseudo channels per channel for 32 pseudo channels in total. With more memory channels, HBM3 can support higher stacks of DRAM per device and finer access granularity.
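The headline numbers above follow directly from the interface parameters; a minimal sketch of the arithmetic:

```python
# HBM3 bandwidth and capacity from the figures quoted above.
pin_rate_gbit_s = 6.4        # per-pin data rate in Gb/s
interface_width = 1024       # bits, unchanged from earlier HBM generations
print(pin_rate_gbit_s * interface_width / 8)   # 819.2 GB/s per device

stack_height = 12            # dies per stack (16-high planned for the future)
die_density_gbit = 32        # Gb per die
print(stack_height * die_density_gbit / 8)     # 48 GB per 12-high stack
```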

Eliminating memory bandwidth bottlenecks

“HBM3 is all about bandwidth,” says Ferro. “There are many high-end accelerator cards going into the data center for AI [applications], particularly AI training. A lot of these systems have a good [number] of processors—but you’ve got to keep these processors fed [which means] memory bandwidth is now the bottleneck.”

To highlight IP requirements and potential design choices for the next generation of HBM3-based silicon, Ferro sketches a generic AI accelerator model with purpose-built processors running a neural network.

“You’ve got a processor—probably multiple processors—and these must get fed from memory. So, when you’re doing for example, image recognition training, you’ve got to put lots of data into the system [to enable high-accuracy inference],” he elaborates. “Clearly, you need a lot of memory bandwidth and that’s really where HBM3 comes into the picture. Although HBM2 and HBM2E [offer] very high bandwidth, processors still need to get fed with [even] more data.”

According to Ferro, memory is currently one of the most critical bottlenecks in the data center, especially for AI/ML applications.

“If you look at the data sets for AI, they’re just growing at exponential rates,” says Ferro. “Data increases from month-to-month and puts a lot of pressure on the memory side.”

Balancing price, performance, and power

As Ferro points out, requirements for specific workloads—such as image processing, financial modeling, and pharmaceutical simulations—play a major role in influencing the design of AI accelerators.

“In the picture above, I’m showing two HBM3 memory devices, a configuration that will provide 1.6 terabytes [per second] of performance. If you’re doing genome sequencing or financial transactions, you may need more—or less—bandwidth [depending on workload],” he explains. “So, you might add two more HBMs to double that bandwidth even further. We’ve even seen systems that go up to eight HBMs. The basic architecture [remains] the same, although you’re tuning the system from an optimization standpoint.”

Additional design considerations include power and cost. As Ferro points out, HBM3 improves energy efficiency by dropping operating voltage to 1.1V and leveraging low-swing 0.4V signaling.

“You’re going to want to tune and balance the system to efficiently meet application [requirements] while staying within your cost and power budgets,” he adds.

To effectively determine tradeoffs that balance price, performance, and power, Ferro recommends that system designers first gauge memory processing requirements and then select an optimal implementation. For example, if under half a terabyte per second of bandwidth is needed, perhaps a single HBM2E memory device will suffice. If the application demands more bandwidth, multiple HBM3 devices will likely be a better fit.
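Ferro’s sizing advice reduces to a simple capacity calculation. Below is a hedged sketch; the per-device bandwidth figures assume nominal 3.6 Gb/s HBM2E and 6.4 Gb/s HBM3 pin rates over the 1024-bit interface, and a real design must also weigh power, cost, and floorplan.

```python
import math

# Nominal per-device bandwidth in GB/s (pin rate x 1024 bits / 8 bits per byte);
# actual parts vary by speed bin.
DEVICE_BANDWIDTH = {"HBM2E": 3.6 * 1024 / 8, "HBM3": 6.4 * 1024 / 8}

def devices_needed(required_gbyte_s, kind):
    """Devices required to hit a bandwidth target (bandwidth only)."""
    return math.ceil(required_gbyte_s / DEVICE_BANDWIDTH[kind])

print(devices_needed(400, "HBM2E"))   # 1 device (460.8 GB/s nominal)
print(devices_needed(1600, "HBM3"))   # 2 devices, matching the 1.6 TB/s example above
```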

PCIe 6 and chiplets

As Ferro notes, PCIe will also play a major role in influencing future AI accelerator designs. Indeed, PCIe 5 offers a transfer rate of 32 gigatransfers per second (GT/s) per pin, while PCIe 6 will double this rate to 64 GT/s.

“You’ve got to look at how much data you will be bringing into the system, how much data you’re bringing out, and how these processors need to get fed,” he elaborates. “For example, you can [potentially] partition some [workloads] dynamically, so if you decide to split it into multiple jobs—because a lot of this is happening in parallel—maybe you don’t [need to] use all of that bandwidth [or] processing power [for a single task], although you can do multiple things at once.”

According to Ferro, minimizing die size is also an important consideration, especially for HBM implementations. This is one reason the semiconductor industry is eyeing chiplets for AI accelerators, as the technology enables system designers to mix and match different components based on specific workload requirements, shrink overall die size, and reduce costs.

“[With chiplets], you can potentially go with a cheaper process node for the I/O controller, for example, but if you need the most advanced process node for your processor, you can [do so while] balancing overall system cost,” he adds.

Boosting Data Center Performance to the Next Level with PCIe 6.0 & CXL 3.0
https://www.rambus.com/blogs/boosting-data-center-performance-to-the-next-level-with-pcie-6-0-cxl-3-0/ | October 24, 2022

2022 has seen major updates to two standards critical to the future evolution of the data center: PCI Express® (PCIe®) and Compute Express Link™ (CXL™). The two are interwoven, and in this blog, we’ll look at their relationship and the impact of latest developments.

Like many standards in the computing world, PCIe has proliferated far beyond its original remit. Over the past two decades, it has become not just the de facto standard for computing connectivity, but has also expanded into new applications, such as IoT, automotive, government, and many more. With its most recent update to PCIe 6.0, it is poised to take data center performance to the next level.

PCIe 6.0 boosts signaling rates to 64 gigatransfers per second (GT/s), twice that of PCIe 5.0. Initial designs incorporating PCIe 6.0 will be where bandwidth demands are most intense right now: in the heart of the data center. For bandwidth-hungry, data-intensive workloads, the extra bandwidth offered by PCIe 6.0 will certainly be a game changer!

CXL, first introduced in 2019, adopted the ubiquitous PCIe standard for its physical layer protocol (CXL.io). At that time, PCIe 5.0 was the latest standard, and CXL 1.0, 1.1 and the subsequent 2.0 generation all used PCIe 5.0’s 32 GT/s signaling.

In August 2022, CXL 3.0 was released, adopting the PCIe 6.0 physical interface. This new version of the CXL specification introduced new features that promise to increase data center performance and scalability, while reducing the total cost of ownership (TCO). CXL 3.0, like PCIe 6.0, uses PAM4 to boost signaling rates to 64 GT/s with no additional latency.

Beyond this, it offers multi-tiered switching and switch-based fabrics, along with improved memory sharing and pooling capabilities. Combined, these three key features enable new use models and increased flexibility in data center architectures. This facilitates the move to distributed, composable architectures and higher performance levels for AI/ML and other compute-intensive or memory-intensive workloads.

For SoC designers, the number of signal integrity and power integrity (SI/PI) issues compound as data rates rise. Designing for 64 GT/s operation can be exceedingly tricky. Rambus has over 30 years of renowned leadership in SI/PI and has helped chip makers successfully implement hundreds of PCIe and CXL designs. With today’s announcement of a PHY that supports both PCIe 6.0 and CXL 3.0, we offer an easy to integrate solution that will help you take your chip design to the next level of performance.

CXL™ 3.0 Turns Up Scalability to 11
https://www.rambus.com/blogs/cxl-3-0-turns-up-scalability-to-11/ | August 2, 2022

The CXL™ Consortium (of which Rambus is a member) has now released the 3.0 specification of the Compute Express Link™ (CXL) standard. CXL 3.0 introduces compelling new features that promise to increase data center performance, scalability and TCO. CXL has evolved rapidly from its introduction in 2019. The 1.0/1.1 specification enabled prototyping of CXL solutions. With 2.0 and the introduction of memory pooling, CXL reached the deployment phase. Now with CXL 3.0, we have capabilities that will power the scaling phase.

So, what’s new in CXL 3.0? Well, first up there’s a step function increase in data rate. CXL 1.x and 2.0 use the PCI Express® (PCIe®) 5.0 electricals for their physical layer: NRZ signaling at 32 Gigatransfers per second (GT/s). CXL 3.0 keeps that same philosophy of building on broadly adopted PCIe technology and extends it to the latest 6.0 version of the PCIe standard released earlier this year. That boosts CXL 3.0 data rates to 64 GT/s using PAM4 signaling.

A second big addition with CXL 3.0 is multi-tiered switching which enables the implementation of switch fabrics. CXL 2.0 allowed for a single layer of switching. CXL 2.0 switches can connect to upstream hosts and downstream devices, but not other switches, and the scale is limited to the available ports on a switch. With CXL 3.0, switch fabrics are enabled, where switches can connect to other switches, vastly increasing the scaling possibilities.

Among additional features, CXL 3.0 introduces peer-to-peer direct memory access and enhancements to memory pooling where multiple hosts can coherently share a memory space on a CXL 3.0 device. These features enable new use models and increased flexibility in data center architectures. Taken together with 64 GT/s signaling and fabric switching, CXL 3.0 puts us on the road for composable server systems which optimize performance and TCO.

CXL is a once-in-a-decade technological force that will transform data center architectures. Supported by a who’s who of industry players including hyperscalers, system OEMs, platform and module makers, chip makers and IP providers, its rapid development is a reflection of the tremendous value it can deliver. Rambus is proud to be a member of the CXL Consortium and to provide chip and IP solutions that will shape the data center of the future.

What is PCIe 4.0? PCI Express 4 explained
https://www.rambus.com/blogs/pci-express-4/ | November 2, 2021

PCI Express® 4.0, also known as PCIe 4.0 or PCIe Gen 4, is the fourth generation of the Peripheral Component Interconnect Express (PCI Express) expansion bus specification, which is developed, published, and maintained by the PCI Special Interest Group (PCI-SIG). It is an open standard.

In this blog, you’ll learn all about PCI Express 4 performance vs PCIe 3.0. More specifically:
1. PCIe 4.0 bandwidth
2. Market applications: Who needs PCIe 4.0?
3. PCIe 3.0 vs 4.0: Comparison table
4. PCI Express 4.0 controller IP solutions from Rambus
5. Conclusion

Read our primer? Jump to: PCI Express 5 vs. 4: What’s New? 

PCIe 4.0 bandwidth

PCIe 4.0 doubles the interconnect bandwidth of the PCIe 3.0 specification, achieving 16 GT/s per lane, while preserving compatibility with software and mechanical interfaces. The PCIe 4.0 architecture is compatible with prior generations of PCIe technology.

[Figure: PCIe 4 speed chart – PCI Express growth from 2002 onward]

Market applications: Who needs PCIe 4.0?

Big Data needs throughput

According to Gary King, Weatherhead University Professor, “The data flow is so fast that the total accumulation of the past two years—a zettabyte—dwarfs the prior record of human civilization”. Internet, ubiquitous smartphone usage and increased marketing accelerated the Big Data revolution, and the Internet of Things (IoT) will increase the needs for fast and efficient data management environments. More throughput and lower power are necessary to prevent a bottleneck in the emergence of Big Data.

Networking applications

[Figure: PCI Express 4 lane bandwidth (GB/s)]

PCIe 4.0 can handle 40Gb Ethernet with an 8-lane configuration, and can handle 100Gb Ethernet (requiring 25 GB/s of duplex bandwidth) with a x16 link.

PCIe 3.0 vs 4.0: Comparison table

|  | PCIe 3.0 | PCIe 4.0 |
|---|---|---|
| Data rate per lane | 8 GT/s | 16 GT/s |
| Encoding | 128b/130b | 128b/130b |
| x16 aggregate (duplex) bandwidth | ~31.5 GB/s | ~63 GB/s |
| Specification ratification year | 2010 | 2017 |

Bandwidth in the table above is for a x16 link.

PCIe 4.0 specifications

There are no encoding changes from PCIe 3.0 to 4.0, and there were only minor updates in terms of the protocol.

There were also minor changes in terms of link-level management, and PCIe 4.0 enables more robust equalization.

In terms of performance, with PCIe 4.0, throughput per lane is 16 GT/s. The link is full duplex, which means data can be sent and received simultaneously for a throughput of 32 GT/s per lane. For a x16 configuration, this provides 64 GB/s of raw bandwidth (16 GT/s x 2 (duplex) x 16 lanes / 8 bits/byte = 64 GB/s). There is some encoding overhead: the 128b/130b encoding is 98.46% efficient, so the actual bandwidth is 63 GB/s.

PCI Express 4.0 controller IP solutions from Rambus

Rambus offers a broad portfolio of PCI Express controller IP from PCIe 2.0 to PCIe 6.0. For PCIe 4.0 specifically, the Rambus PCIe 4.0 Controller IP offers a highly flexible, silicon-proven solution. It can be seamlessly integrated with the customer’s choice of PHY IP.

Why choose Rambus PCIe 4.0 controller IP?

For the reliability:

  • Rambus has over 20 years of experience in the design of IP cores for SoCs with specialization in high-speed interface protocols and technologies with a specific focus on PCIe. This history includes hundreds of successful production tapeouts.
  • The proven Rambus PCIe 3.0 architecture is preserved to enable easy migration to PCIe 4.0. No interface change is necessary; existing behavior is preserved for seamless integration.

For the flexibility:

Flexibility of the supported PIPE Configurations for PCIe 4.0:

• PIPE 16-bit is supported in x1, x2, x4, x8 and x16 with 500MHz PIPE clock at 8 Gb/s (ASIC)
• PIPE 32-bit is supported in x1, x2, x4, x8 with 500MHz PIPE clock at 16 Gb/s (ASIC)
• PIPE 64-bit will be supported in x1, x2 and x4 with 250MHz PIPE clock at 16 Gb/s (ASIC/FPGA)

Flexibility of the core configuration to meet spec evolutions

For the supported features:

Features already proven in 3.0, optimized for PCIe 4.0

  • Endpoint, root port, switch, dual-mode shared silicon
  • Virtualization-ready with SRIOV and ATS/ARI (networking, data center)
  • Multi-function
  • AER and data integrity mechanism
  • Complete power management support: legacy, ASPM L0s/L1, OBFF, L1 PM substate with CLKREQ
  • End-end TLP prefixes
  • Supports Extension Device ECN

Conclusion

The 4th generation of the PCI Express standard builds on the widely adopted PCIe 3.0 architecture while doubling the link transfer rate to 16 GT/s. In a x16 configuration, PCIe 4.0 supports 100Gb Ethernet and can deliver nearly 64 GB/s of duplex bandwidth. For PCIe 4.0 interface implementations, Rambus offers the silicon-proven PCIe 4.0 Controller IP backed by industry-renowned Rambus service and support.

New interface architectures enable data scaling
https://www.rambus.com/blogs/new-interface-architectures-enable-data-scaling/ | June 3, 2021

Suresh Andani, senior director of product marketing at Rambus, has written an article for Semiconductor Engineering that takes a closer look at why data center scaling requires new interface architectures.

As Andani notes, global data traffic is growing at an exponential rate. More specifically, 5G networks are enabling billions of AI-powered IoT devices untethered from wired networks, while machine learning’s voracious appetite for enormous data sets continues to grow. Moreover, data-intensive video streaming for both entertainment and business applications continues to accelerate, as ADAS and autonomous vehicles add yet another torrent of data.

“Nowhere is the impact of all this growth being felt more intensely than in the data centers at the heart of the global network,” he explains. “Ethernet networks interconnect the switch, server and storage devices of the data center, and are scaling rapidly to meet the tidal wave of data.”

According to Andani, Ethernet’s evolution kicked into high gear with the 25 Gigabit Ethernet Consortium’s introduction of 25G and 50G standards in July 2014. Flash forward to 2020, and the re-branded Ethernet Technology Consortium announced the 800G Ethernet standard in April. This represents a 16X increase in 6 years from the 50G milestone, or more than 2.5X increase every two years. As such, 1.6T Ethernet should be on pace to make its appearance in late 2022 or early 2023.
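Checking the arithmetic behind that growth rate (a back-of-envelope sketch that treats the growth as a smooth exponential):

```python
# 50G (2014) to 800G (2020): implied growth per two-year period.
growth = 800 / 50                  # 16x over roughly 6 years
per_two_years = growth ** (2 / 6)  # ~2.52x, i.e. "more than 2.5X every two years"
print(f"{per_two_years:.2f}x every two years")
```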

“At the silicon-level, scaling to smaller process nodes has enabled switch ASICs to advance from 12.8 Terabits per second (Tbps), to 25.6T, to the upcoming generation 51.2T,” writes Andani. “But there is a critical issue that arises with the end of Dennard scaling: power.”

Indeed, under Dennard scaling, die shrinks enabled a doubling of bandwidth and a doubling of transistor density with a halving of per-transistor power, allowing power per area to remain constant. However, in a post-Dennard world, power consumption has become a dominant factor in system architectural design, as power per area rises even with process node shrinks.

“12.8T switch ASICs used 25G and 50G medium reach (MR) and long reach (LR) SerDes for 25G, 50G and 100G Ethernet links. As switch ASICs migrate to 25.6T ASICs and 400G ports, 512 SerDes running at 50G are needed to move data on and off chip,” states Andani. “That number of MR and LR SerDes burn too much power to be practical. Architecturally, this motivates moving from the electrical domain to the optical domain to keep the power budget in check.”

In response, says Andani, the SerDes for 25.6T ASICs transition to 50G very short reach (VSR) interfaces to link the silicon with on-board or pluggable optical modules.

“8 50G lanes are aggregated for a 400G Ethernet connection. The power savings afforded with the simpler VSR architecture, eliminating much of the DFE circuitry required for an MR/LR class SerDes, is substantial,” he elaborates. “But with the move to 51.2T ASICS and 800G Ethernet, it most likely won’t be enough. [For] 51.2T ASICs, the SerDes need to scale to 100G to fit them all on chip. As everything has doubled, 512 SerDes running at 100G are needed for the bandwidth. These can be aggregated for 64 links of 800G Ethernet.”
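The lane and port counts in this progression follow from dividing switch capacity by SerDes lane rate; a minimal sketch (nominal rates, duplex lanes):

```python
def serdes_lanes(switch_capacity_gbit_s, lane_rate_gbit_s):
    """SerDes lanes needed to move the full switch capacity on and off chip."""
    return switch_capacity_gbit_s // lane_rate_gbit_s

print(serdes_lanes(25_600, 50))        # 512 lanes at 50G for a 25.6T ASIC
print(serdes_lanes(51_200, 100))       # 512 lanes at 100G for a 51.2T ASIC
print(serdes_lanes(51_200, 100) // 8)  # 64 ports of 800G (8 x 100G lanes each)
```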

As signal losses rise with data rate, the 100G VSR circuitry will be more complex and burn more power than the 50G variety. Thus, the architecture will need to evolve to one which moves the optics even closer to the silicon.

“Here’s where co-packaged optics (CPO) enters the picture. By integrating 800G optical engines in the same package as our switch ASIC, we can drop the size, complexity and power of the 100G links to extra short reach (XSR) requirements,” Andani explains. “Using 100G XSR links running at less than 1 picojoule per bit, we can reduce the I/O power by more than 80% and the switch ASIC thermal design power (TDP) by more than 25% compared to an MR/LR SerDes running at the same speed. This gives us the means to scale to the 51.2T ASICs needed for the next generation of network switches.”
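To put the sub-1 pJ/bit figure in context, here is a hedged sketch of the I/O power implied by a given energy-per-bit (one direction of traffic only; a real power budget adds clocking, logic, and other overheads):

```python
def io_power_watts(throughput_gbit_s, pj_per_bit):
    """I/O power in watts: bits per second x joules per bit."""
    return throughput_gbit_s * 1e9 * pj_per_bit * 1e-12

print(io_power_watts(51_200, 1.0))  # ~51 W of SerDes I/O power for 51.2 Tb/s at 1 pJ/bit
```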

With the end of Dennard Scaling, architectural innovation is the key to advancing the performance of network devices. Moving optics closer to the silicon, with optical modules and then with CPO, enables us to overcome the growing obstacle of power.

“112G XSR SerDes and CPO make possible network devices with 800G and then 1.6T Ethernet links to handle the growing flood of data traffic,” concludes Andani.

Designing chiplet and co-packaged optics architectures with 112G XSR SerDes
https://www.rambus.com/blogs/designing-chiplet-and-co-packaged-optics-architectures-with-112g-xsr-serdes/ | May 13, 2021

Suresh Andani, senior director of product marketing at Rambus, has written an article for Semiconductor Engineering that takes an in-depth look at how 112G XSR SerDes can be used to optimally design chiplet and co-packaged optics architectures.

As Andani notes, conventional chip designs are struggling to achieve the scalability, as well as the power, performance, and area (PPA), expected of leading-edge designs.

“With the slowing of Moore’s Law, high complexity ASICs increasingly bump up against reticle limits. The demise of Dennard scaling means power consumption is [also] a growing challenge,” he explains. “In this context, disaggregated architectures such as chiplets or co-packaged optics (CPO) become truly viable alternatives to the traditional monolithic SoC scaling approach.”

According to Andani, aggregating multiple chiplets to perform the function of a single monolithic IC de-risks the overall system by reducing complexity and increasing yields.

“This is precisely why chiplets are already enabling a wide range of applications across multiple mainstream markets such as the data center, networking, 5G, high-performance computing (HPC), and artificial intelligence/machine learning (AI/ML),” he elaborates. “It also allows different process technologies to be mixed and matched with functions implemented in the most appropriate node.”

As Andani observes, a similar rationale exists for CPO, with next-generation 51.2 Terabit per second (Tbps) switch ASICs slated to be co-packaged with sixty-four (64) 800G silicon photonics die.

“Trying to integrate both logic and the silicon photonics required for the needed bandwidth in a single chip would be completely impractical,” he states. “Conversely, very short reach electrical links interconnecting a packaged 51.2T ASIC with separately packaged optical modules would burn too much power.”

The solution, says Andani, is co-packaging ASIC and optics connected with ultra-low power, extra short reach (XSR) SerDes links. This disaggregated architecture allows for logic and optics to be implemented in the best-fit process nodes for each, reduces complexity and achieves power targets.

For AI/ML and HPC SoCs, the 112G XSR SerDes is used to bridge purpose-built accelerator chiplets for natural language processing, video transcoding, and image recognition. Another popular use case, says Andani, is the die disaggregation of large SoCs (which are hitting reticle size limits for manufacturable yields) into multiple smaller die that are connected with XSR links over organic substrate.

“As the semiconductor industry turns towards chiplets and CPO to enable high-performance products, implementation of the SerDes PHY is critical to effectively maintain high speeds and signal integrity across XSR and ultra-short reach (USR) distances,” he explains. “From our perspective, 112G XSR SerDes PHYs fabricated in advanced process nodes – such as 7nm – can successfully deliver the required speed and signal integrity demanded by chiplets and co-packaged optics. The 112G XSR interface – formalized by the Optical Internetworking Forum (OIF) – offers extremely high throughput capabilities, even though it is designed for low complexity and very low power consumption.”

As Andani points out, 112G XSR SerDes PHYs should be tailored for the ultra-low power and area requirements of die-to-die (D2D) and die-to-optical-engine (D2OE) interfaces, supporting NRZ and PAM-4 signaling at multiple data rates for maximum design flexibility.

Additional features should optimally include a high-bandwidth ~2Tbps/mm of uni-directional beachfront efficiency, support for channels up to 10dB insertion loss without DFE (to save power), multiple lane configurations to allow flexible ASIC floorplan integration, extensive Design-For-Test (DFT) capabilities to aid manufacturing of Known Good Dice (KGD), static and run-time debug capabilities, and software and scripts for enhanced bring-up and validation.

“In a world of Moore’s Law slowing and Dennard Scaling stopped, the traditional approach of implementing monolithic SoCs faces real challenges,” he adds. “Complexity, yield, and power consumption become insurmountable obstacles to delivering the PPA required by leading-edge applications.”

Fortunately, disaggregated architectures, namely chiplets and CPO, break through these limits. High-speed, extra and ultra-short reach links delivered by 112G XSR SerDes PHYs are the key technology for interconnecting chiplets, ASICs and optics.

“With 112G XSR SerDes, chiplets and CPO will enable the most demanding applications across the data center, networking, 5G, HPC, and AI/ML markets,” Andani concludes.

Overcoming high-speed SerDes IP integration challenges: Part 2
https://www.rambus.com/blogs/overcoming-high-speed-serdes-ip-integration-challenges-part-2/ | April 21, 2021

In this two-part blog series based on a recent Semiconductor Engineering article, Rambus engineers Niall Sorensen and Malini Narayana Moorthi take an in-depth look at how to overcome high-speed SerDes IP integration challenges. In part one, the two point out that SerDes design is a complex process which requires a multidisciplinary team of analog, digital, and physical designers and software engineers, as well as support from silicon-validation and operations teams. Perhaps not surprisingly, it has become more time and cost effective to source SerDes from an IP vendor specializing in the technology.

The role of the SerDes IP vendor is to ensure the effective integration of its IP into the ASIC. However, SerDes IP vendors must deal with multiple challenges and issues during the integration process, including the IP pin list and ensuring accurate design simulation (both topics are discussed in depth in part one of this two-part blog series). SerDes IP vendors must also understand timing analysis and place and route, board/PCB design issues, bring-up readiness, as well as silicon compliance, debug, and failure analysis.

Timing Analysis/Place & Route

According to Sorensen and Moorthi, SerDes clock speeds in the digital layer are increasing to rates of 1 GHz and beyond. As data rates increase and bus widths remain steady, this is a natural progression. For example, PCIe Gen4 runs at 16 GT/s with a 16-bit bus width and a 1 GHz digital clock speed.
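The relationship between serial rate, parallel bus width, and digital-layer clock is a simple division; a small sketch using the PCIe Gen4 figures above:

```python
def parallel_clock_ghz(serial_rate_gt_s, bus_width_bits):
    """Digital-layer clock implied by a serial rate and a PIPE-style bus width."""
    return serial_rate_gt_s / bus_width_bits

print(parallel_clock_ghz(16, 16))  # PCIe Gen4: 16 GT/s over a 16-bit bus -> 1.0 GHz
```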

“Higher speed clocks result in lower design latencies on the data path which is beneficial for communication speed over networks. However, this creates a challenge for timing closure with a SerDes design that can include a complex array of clock loopbacks and control blocks running at these transmit (TX) and receive (RX) clock speeds,” the two explain. “If the digital layer of the SerDes design is provided to ASIC integrator in a soft Verilog or VHDL format, the ASIC integrator is expected to synthesize, place and route (P&R) and close timing successfully on the design. Given that the ASIC integrator is generally not intimately familiar with the IP design, this can present a challenge.”

To successfully close timing, say Sorensen and Moorthi, Rambus provides the ASIC integrator with synthesis design constraints (SDC). Essentially, the SDC offers a high-level description of the clocking scheme for the design, including clock definitions, clock groups, clock frequencies, clock crossing domains, and non-standard clocking schemes. In addition, an application note is provided with detailed information on clock mux design, suggested clock tree timing budgets for the critical high-speed TX and RX data paths, and instructions on clock balancing for loopback and mission mode clocking.

“For the analog portion of the design, Rambus provides liberty timing models for the analog front end (AFE) portion of the design that is a hardened custom transistor design,” the two elaborate. “The liberty timing model specifies the timing delays on the interface between the AFE and the digital layer. Rambus also supports annotated gate-level simulations to catch any functional timing issues that may not be caught using static timing analysis (STA) tools.”

Board/PCB Design Issues

As data rates increase, board design becomes increasingly challenging. This means a wide range of factors must be taken into consideration when designing a PCB and package, including:

  • Crosstalk
  • Impedance matching
  • Intra-channel skew
  • Supply decoupling
  • Insertion/return loss
  • Regulator supply noise
  • Connector choice

This is precisely why Rambus provides a detailed application note with instructions and specifications that give the ASIC integrator signoff criteria for all the aspects of channel and noise quality outlined above.

“Although many ASIC integrators are very knowledgeable of high-speed board and package design, and while outside the design of the SerDes IP, it is always helpful to provide clear instructions to avoid these kinds of issues,” Sorensen and Moorthi state. “In addition, Rambus provides several debug tools as part of the IP to analyze signal integrity (SI) and channel related issues, including an eye diagram and an impulse response analyzer to find channel reflections, crosstalk, etc.”

[Figure 1: Example Eye Diagram (Rambus)]

[Figure 2: Example Impulse Response Diagram (Rambus)]

Bring-Up Readiness

According to Sorensen and Moorthi, there are many questions that should be answered by the time a chip finally comes back to the ASIC integrator’s lab, since there is typically a lull of several months between tapeout and the chip’s return. These questions include:

Do/did you:

  • Have the equipment you need to bring up the design?
  • Conduct a full package and board review to remove SI surprises?
  • Run system simulations?
  • Inform IP providers of when the chip is expected so they can plan on support?
  • Review the relevant portion of the test plan by the IP applications team?
  • Familiarize yourself with, and simulate, the different test and debug modes?
  • Have a procedure to obtain and implement updates?

“You can never be too prepared, and with planned reviews with your IP vendor support team you can minimize bring-up issues with new chips,” they add.

Silicon Compliance, Debug & Failure Analysis

As Sorensen and Moorthi observe, compliance, debug, production yield, and system debug can all pose challenges during customer production ramp-up. For example, system issues are often brought to SerDes IP vendors since these eventually result in poor link performance.

“[This means] the IP needs to have sufficient isolation test paths to isolate the issue at hand. While isolation sounds basic, the problem gets compounded by repeatability challenges, board-to-board variations, socketed vs. soldered assembly,” the two elaborate. “Once isolation of the system failure is identified, the issue must be reproduced to understand trigger points. This then leads to cause-and-effect analysis.”

As the two emphasize, it is often not the systematic issues but the one-off Monte Carlo variation issues that tend to be tricky yet important to root-cause. Therefore, the IP vendor should have enough observability to enable such debugging, which is often challenging at high speeds. Moreover, it is important to ensure that the design can meet the compliance targets through sufficient design verification of the IP or via design configurability.

“With the relentless demand for more network bandwidth, the speed of SerDes will continue to climb. Increasingly, ASIC companies will turn to third-party SerDes IP providers for these critical building blocks motivated by considerations of time-to-market, lowering total cost, and reducing risk,” Sorensen and Moorthi conclude. “Choice of IP vendor needs to encompass the vendor’s ability to support all the elements of IP integration including simulation, analysis, package and board design, bring up and system debug. Though greater complexity is introduced with every speed bump, the application of best practices with the right IP choice will ensure successful implementation of new ASIC designs.”

Interested in learning more? Part one of this two-part blog series is available here.
