Xilinx HBM2 Internals

Conventions

  • MiB = Mebibytes (2^20 bytes)
  • Gb = Gigabits (2^30 bits = 2^27 bytes, or 128MiB)
  • GiB = Gibibytes (2^30 bytes)

Background on HBM2

HBM2 is a type of stacked memory, with each stack containing multiple DRAM dies (chips), and each die supporting 2 channels. Every channel accesses a different (and independent) set of DRAM banks - as a result, requests for one channel cannot access data from another channel. This can be thought of as each channel “owning” a certain address range. Channels may be clocked independently. For more details on HBM, see the Wikipedia article.

Xilinx HBM Interface Details

Xilinx’s HBM devices contain two Samsung Aquabolt HBM2 stacks, each with a 1024-bit wide bus. For the technical reader, I believe the 8GiB devices in particular use Samsung KHA844801X-MC12 HBM2 (16 banks; 2 channels per die.) Each stack has multiple dies, and each die has two channels, each of which is divided into two pseudo-channels with 64 bits of I/O. Pseudo-channels operate in a semi-independent manner: reads and writes can be issued at the same time as other commands such as activations (“opening” a row for access) and precharges (“closing” a row after access.) Full channels, as noted above, are independent and may even be clocked independently.

Now, the HBM interface has 32 AXI3 ports[1], each of which provides memory-mapped read/write access to a given portion of the HBM2 address space[2]. Each port controls a single pseudo-channel - this also means that sixteen of them go to one stack, and sixteen to the other. For an 8GiB (2x32Gb stack) device, every pseudo-channel controls access to 2Gb (2^28 bytes, or 256MiB). For 16GiB (2x64Gb stack) devices, every pseudo-channel controls access to 4Gb (2^29 bytes, or 512MiB). It’s important to realize that every pseudo-channel is limited to its own section of memory - accesses through it cannot reach another pseudo-channel’s memory.[2]
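To make the “owning” idea concrete, here is a hypothetical sketch for an 8GiB part. It assumes the simplest possible layout - the 32 pseudo-channel regions packed contiguously, one after another - which is an assumption on my part; the real address mapping depends on how the Xilinx HBM IP is configured (see footnote 2):

```python
# Hypothetical sketch: which AXI port/pseudo-channel "owns" a byte address
# on an 8 GiB part, ASSUMING the 32 regions are packed contiguously.
# The real mapping depends on the Xilinx HBM IP configuration.
PC_SIZE = 256 * 2**20  # 256 MiB per pseudo-channel (2 Gb)
NUM_PCS = 32           # 32 pseudo-channels == 32 AXI ports

def owning_port(addr: int) -> int:
    """Index of the pseudo-channel whose 256 MiB window holds this address."""
    assert 0 <= addr < PC_SIZE * NUM_PCS, "address outside the 8 GiB space"
    return addr // PC_SIZE

print(owning_port(0x0000_0000))    # 0  (first pseudo-channel, stack 0)
print(owning_port(0x1000_0000))    # 1  (first byte past the 256 MiB line)
print(owning_port(0x1_F000_0000))  # 31 (last pseudo-channel, stack 1)
```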

Knowing your limits

Since I like to push the limits of performance, it is important to define what those limits are. I’ve done my best to create an environment with as few unintended limitations as possible. The known (and intended) limitations are the AXI clock frequency and the memory speed itself. Let’s first establish the numbers without any memory speed changes (that is, at stock speeds.) The first thing to know about the AXI clock is that the memory data rate runs at a 4:1 ratio to it. This means that at 450MHz, it is perfectly matched to a 900MHz HBM2 clock (as HBM2 is double data rate (DDR), its data bus transfers at twice the memory clock.) We can verify this by first calculating the maximum possible data rate for the HBM2, and then the maximum data rate through the AXI3 ports:

Memory max data rate at 1800MHz (900MHz DDR): 64 bits per pseudo-channel * 
2 pseudo-channels per memory controller * 8 channels * 1,800,000,000Hz 
* 2 stacks = 3,686,400,000,000 bits/second, or 460,800,000,000 bytes/second
(~429GiB/s, or 460.8GB/s).

AXI max data rate at 450MHz: 256 bits per AXI port * 16 ports per stack
* 2 stacks * 450,000,000Hz = 3,686,400,000,000 bits/second,
or 460,800,000,000 bytes/second (~429GiB/s or 460.8GB/s).

As you can see, they are identical, so neither will limit me any more than the other. Since I want to test the limits of the memory itself, I brought my AXI clock up to 600MHz - that clock will not limit me until the memory passes 1200MHz, so I can, in effect, forget about that limitation (for now) and focus on the memory.

Memory max data rate at 2400MHz (1200MHz DDR): 64 bits per pseudo-channel * 
2 pseudo-channels per memory controller * 8 channels * 2,400,000,000Hz 
* 2 stacks = 4,915,200,000,000 bits/second, or 614,400,000,000 bytes/second 
(~572GiB/s or 614.4GB/s).

AXI max data rate at 600MHz: 256 bits per AXI port * 16 ports per stack 
* 2 stacks * 600,000,000Hz = 4,915,200,000,000 bits/second, 
or 614,400,000,000 bytes/second (~572GiB/s or 614.4GB/s).
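If you’d like to double-check these figures, the arithmetic is simple enough to script. Here’s a small Python sketch that reproduces both scenarios above (the constants are just the bus widths and port counts already discussed - nothing new is assumed):

```python
# Peak data rates for the HBM2 stacks and the AXI ports, using the
# bus widths and port counts from the figures above.
PC_BITS = 64        # I/O bits per pseudo-channel
PCS_PER_MC = 2      # pseudo-channels per memory controller (channel)
CHANNELS = 8        # channels per stack
STACKS = 2
AXI_BITS = 256      # data bits per AXI3 port
AXI_PORTS = 16      # AXI ports per stack

def hbm_bytes_per_sec(ddr_hz: int) -> float:
    """Peak HBM2 rate at the given data (DDR) transfer rate."""
    return PC_BITS * PCS_PER_MC * CHANNELS * STACKS * ddr_hz / 8

def axi_bytes_per_sec(axi_hz: int) -> float:
    """Peak aggregate rate through all AXI ports at the given AXI clock."""
    return AXI_BITS * AXI_PORTS * STACKS * axi_hz / 8

for ddr_hz, axi_hz in ((1_800_000_000, 450_000_000),   # stock clocks
                       (2_400_000_000, 600_000_000)):  # raised clocks
    print(f"{ddr_hz/1e6:.0f}MT/s HBM2: {hbm_bytes_per_sec(ddr_hz)/1e9:.1f}GB/s, "
          f"{axi_hz/1e6:.0f}MHz AXI: {axi_bytes_per_sec(axi_hz)/1e9:.1f}GB/s")
```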

Memory timings, and the refresh effect

Memory timings are very important, but often ignored. Here, I will demonstrate how a very simple change to a single timing parameter provides around 1.87% more bandwidth - but first, some background! All DRAM relies on capacitors to store bits, and capacitors leak - meaning stored data will not last long (and the warmer the chip, the faster it fades!) To prevent this, all of memory is periodically read, and the same values are written back. This operation is known as a “refresh”, and how often it occurs is known as the “refresh rate.”

During a refresh, no other accesses may occur - all reads/writes must wait. From the Xilinx AXI HBM Controller Product Guide (PG276), “The base refresh interval (tREFI) for the HBM stacks is 3.9μs… The base rate of 3.9 μs is used for temperatures from 0°C to 85°C. Between 85°C and 95°C, tREFI is reduced to 1.95 μs.” From my reading of the registers live, this is true. However, the refresh period is stored as a count of memory clock cycles, so the interval it represents depends on the memory frequency. The formulas to convert to and from the stored format (here called “ref_per”) follow:

MemClk(MHz) * time_in_μs = ref_per
ref_per / MemClk(MHz) = time_in_μs

As a working example, let’s use the stock frequency (ref_per is 3510):

900MHz * 3.9μs = 3510

And let’s see what interval that same ref_per value of 3510 represents with an 1150MHz clock:

3510 / 1150MHz = 3.052μs

Yikes. Our refresh interval has gone DOWN (that is, we’re doing refreshes more often), meaning that our performance is being limited quite a bit! So… let’s see how we can move it back to 3.9μs (it’s safe to go higher, but this is just an example):

1150MHz * 3.9μs = 4485
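These conversions are trivial to script as well. Here’s a minimal Python sketch (the ref_per naming follows the text above; values are memory clock cycles):

```python
def us_to_ref_per(mem_clk_mhz: float, interval_us: float) -> int:
    """Refresh interval in μs -> ref_per register value (memory clock cycles).
    MHz * μs conveniently cancels to a plain cycle count."""
    return round(mem_clk_mhz * interval_us)

def ref_per_to_us(mem_clk_mhz: float, ref_per: int) -> float:
    """ref_per register value -> refresh interval in μs."""
    return ref_per / mem_clk_mhz

print(us_to_ref_per(900, 3.9))    # 3510: the stock value
print(ref_per_to_us(1150, 3510))  # ~3.052μs: refreshing too often at 1150MHz
print(us_to_ref_per(1150, 3.9))   # 4485: restores a 3.9μs interval
```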

I said I’d demonstrate a simple change which provides around 1.87% more bandwidth, and I meant it. Here are results for a benchmark with the HBM2 clock at 1150MHz, the AXI clock at 600MHz, and all stock timings.

[wolf@surody ~/FPGA/HBMWork]$ ./memtest-new /dev/ttyUSB1
Starting self-test (sequential) with burst length 64 bytes.
Completed writing pattern to memory in 305882325 cycles.
Took 0.510 seconds to read 274 GB (256 GiB).
Speed: 539.184 GB/sec (502.154 GiB/sec)

Completed reading pattern from memory in 304593511 cycles.
Took 0.508 seconds to read 274 GB (256 GiB).
Speed: 541.465 GB/sec (504.279 GiB/sec)
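As a sanity check, the reported time and bandwidth follow directly from the cycle counts, assuming the counter ticks at the 600MHz AXI clock (which the numbers bear out):

```python
# Reproduce the reported figures from a raw cycle count, ASSUMING the
# counter runs at the 600 MHz AXI clock (consistent with the output above).
AXI_HZ = 600_000_000
BYTES = 256 * 2**30            # 256 GiB transferred per pass

cycles = 305_882_325           # write pass, stock timings
secs = cycles / AXI_HZ
print(f"{secs:.3f} s, {BYTES / secs / 1e9:.3f} GB/s")  # 0.510 s, 539.184 GB/s
```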

After changing ref_per to 4485, these were the results:

Starting self-test (sequential) with burst length 64 bytes.
Completed writing pattern to memory in 299825873 cycles.
Took 0.500 seconds to read 274 GB (256 GiB).
Speed: 550.075 GB/sec (512.297 GiB/sec)

Completed reading pattern from memory in 298995921 cycles.
Took 0.498 seconds to read 274 GB (256 GiB).
Speed: 551.602 GB/sec (513.719 GiB/sec)
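For the skeptical reader, the gains can be recomputed straight from the raw cycle counts above (with the AXI clock fixed, fewer cycles means more bandwidth):

```python
# Recompute the bandwidth gains from the cycle counts in the two runs above.
write_before, write_after = 305_882_325, 299_825_873
read_before,  read_after  = 304_593_511, 298_995_921
print(f"write: {(write_before / write_after - 1) * 100:.2f}% faster")  # ~2.02%
print(f"read:  {(read_before  / read_after  - 1) * 100:.2f}% faster")  # ~1.87%
```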

  1. Advanced eXtensible Interface, a high-speed communications bus designed for on-chip communications. See the Wikipedia article on the subject, and for in-depth information, see the ARM Advanced Microcontroller Bus Architecture (AMBA) protocol specifications.

  2. Only when using Xilinx’s switching subsystem (which has a performance penalty) can you access all of the HBM2 memory from any AXI port. This doesn’t change the fact that pseudo-channels only control a portion of the memory - it just routes the request to the right port internally.