AMD has presented the first x86 processor made with 7 nm lithography: the second-generation Epyc server processor, which will go on sale next year.
Rome: 64 cores, 8 DDR4 channels, PCI-E 4.0
Server processors with the Zen 2 architecture will be sold under the Epyc brand. They are compatible with existing infrastructure – second-generation Epyc fits the same socket and will work on existing motherboards (after a firmware update). The new Epycs have a PCI Express 4.0 controller – they are the first x86 server processors to offer such fast PCI-E links (previously PCI-E 4.0 was available only in IBM Power9 processors). This part of the functionality will in most cases require a new motherboard; in most existing servers, second-generation Epyc will offer PCI-E 3.0.
The top second-generation Epyc models will have 64 cores (128 threads) – twice as many as the current AMD server processors. Thanks to compatibility with the existing socket, the 8-channel memory controller and 128 PCI-E lanes (of which 64 can be used as the link to a second processor) have been retained.
Macroarchitecture: one processor in several parts
The next generation of Epycs will have a modular, heterogeneous design. The first-generation Epyc consisted of four identical silicon dies joined in one package using Infinity Fabric. The cores, memory interfaces and I/O were distributed equally among all the dies. In our opinion, this design was the main source of the financial success that the Zen architecture brought AMD: only one type of silicon die was produced, which could be used for desktop processors with few cores, Threadripper processors with a medium number of cores, and many-core server processors alike. This approach saved a great deal of money both in die production (lithographic masks, the time and resources needed to ramp yields to a commercially viable level) and in processor binning: yields for small dies are much better than for a single die with a large area.
In Zen 2 processors, the processor's functions have been separated into two types of silicon dies, each manufactured in a different process. The x86 cores and the fastest, closest cache levels sit in separate dies made in the 7 nm process. Elements that do not shrink easily with a process update, i.e. the input/output interfaces and the memory controller, were split off and placed in a separate die made in the 14 nm process at GlobalFoundries factories (this information was confirmed by AMD representatives). There are nine dies in one Rome processor. It is worth noting that if AMD wants to offer second-generation Epyc models with fewer cores, it is enough to mount exactly as many compute dies as needed in one package, plus one IO die. Every first-generation Epyc, even the eight-core models, requires four at least partially functional dies – only the four together provide eight memory channels and 128 PCI-E lanes.
The above block diagram fairly faithfully represents the organization of the Rome processor – except that the eight dies with x86 cores are shown as two blocks ("chiplets"). Eight Infinity Fabric links (marked with the symbol ∞) connect the compute dies to the central IO die. The two blocks marked IO are PCI-E interfaces providing a total of 128 lanes. Half of them can be used as four Infinity Fabric links to a second processor; one processor then provides 64 PCI-E lanes, just like the first-generation Epyc processors. The eight-channel memory controller is also the equivalent of Naples', but it is unified rather than distributed across four dies as in the first-generation Epyc. The entire memory pool attached to one processor is uniform: access from every core to every address in memory has the same bandwidth and latency. Such a memory organization is a great advantage over first-generation Epyc processors. Only two processors form a NUMA configuration, and in terms of memory access a single-socket machine is very close to single-socket machines with Intel processors (we are skipping the upcoming Cascade Lake AP systems, which will have memory organized as NUMA within one socket).
In a processor organized this way the memory pool is uniform, and the maximum bandwidth available to a single thread is the largest possible. In a first-generation Epyc processor, from the point of view of one core, 1/4 of the memory pool is available locally, via the memory controller on the same die, and 3/4 of the pool is "at a distance" of one Infinity Fabric link. In the second-generation Epyc, because the compute dies contain no part of the memory controller, the entire memory pool is one IF link away. We do not know what changes have been made to the Infinity Fabric links, but we have been assured that they are improved relative to those in the first-generation Epycs. Depending on how much they have improved, memory-access latency in the second-generation Epyc may still be longer than the latency of local memory access (the best case) in the first generation.
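A back-of-envelope model makes this trade-off concrete. The latency numbers below are purely illustrative assumptions, not AMD figures:

```python
# Average memory-access latency seen by one core, modeled with
# assumed (illustrative) latencies in nanoseconds.
local_ns = 90.0    # assumed: 1st-gen Epyc, controller on the same die
remote_ns = 140.0  # assumed: memory one Infinity Fabric hop away

# 1st-gen Epyc: 1/4 of the pool is local, 3/4 is one IF hop away.
naples_avg = 0.25 * local_ns + 0.75 * remote_ns   # 127.5 ns

# 2nd-gen Epyc: the whole pool sits behind one (improved) IF hop, so
# latency is uniform. Whether it beats Naples' best (local) case
# depends entirely on how much the link improved.
rome_uniform = 120.0  # assumed improved IF-hop latency
print(naples_avg, rome_uniform)
```

Uniform latency also removes the worst case: no thread ever pays the full remote penalty, which matters most for software that was never NUMA-tuned.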
The IO die also contains an additional cache – unfortunately, AMD has not yet revealed any details. We know nothing about its capacity or organization: will it be shared by all cores, and will it act as a memory-side cache (accelerating all operations on memory, including transfers between RAM and IO space – a solution that could be useful in systems with multiple GPUs) or a CPU-side cache (accelerating only transfers between the x86 cores and RAM)?
Splitting a processor into dies with different functions is a vision the semiconductor industry has presented for many years. Intel, too, showed it at one of its conferences dedicated to manufacturing capabilities. As far as we know, no manufacturer of highly integrated chips has put this technique into practice until now. So far, only identical dies have been connected this way (e.g. Xilinx FPGAs), or chips that function independently of each other (e.g. CPU and memory, or GPU and memory). Dividing the processor into compute and IO functions is a sign that AMD is not afraid to be the first company in the industry to take a risky step.
The 7 nm technological process
The most important part of the new Epyc processors is manufactured in TSMC's factories in Taiwan, in a 7 nm class process using the latest developments in "traditional" lithography (with 193 nm laser exposure). So far, two processors manufactured in this process are commercially available: the Apple A12 and the Huawei Kirin 980. We do not know whether TSMC offered AMD a special version of the process, different from the one used by Apple and Huawei. According to AMD, this process provides twice the transistor packing density of GlobalFoundries' 14 nm. It can also deliver a 25% higher clock speed (at the same power consumption) or 50% lower power consumption (at the same clock speed). In the Epyc processors, AMD chose to save energy: energy efficiency matters far more in the server world, and twice as many cores had to fit into the same power budget.
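Taken at face value, AMD's scaling figures explain the design choice. A quick sketch (the 14 nm core count is an illustrative assumption):

```python
# AMD's quoted 7 nm vs 14 nm scaling options, applied naively.
density_gain = 2.0   # transistors per unit area
clock_gain   = 1.25  # clock speed at equal power
power_ratio  = 0.5   # power at equal clock

cores_14nm = 32      # illustrative: a top 1st-gen Epyc

# Choosing the power saving: each core draws half the power at the
# same clock, so twice as many cores fit in the same energy budget.
cores_same_budget = cores_14nm / power_ratio
print(cores_same_budget)   # 64.0
```

Choosing the 25% clock gain instead would have kept the core count flat – exactly the opposite of what the server market rewards.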
AMD representatives repeatedly stressed during the conference how important 7 nm lithography is for the company's further plans. It was also mentioned that more than just those two chips made in this process have already left TSMC's factories.
Zen 2 microarchitecture
Zen 2 cores are an evolutionary development of the Zen architecture. The most important change is a significant expansion of the part responsible for vector and floating-point calculations. The width of the main data path has been doubled: the processor can load and store twice as much data per clock cycle (256 bits per cycle).
The width of the execution units has also been doubled: each can now perform two 128-bit operations simultaneously, or one 256-bit operation, per clock cycle. At the same time, the processor's front end can dispatch more floating-point operations per cycle (in Zen: no more than 4 micro-ops; in Zen 2, more), and it can also retire and write back the results of more operations. This improvement will raise floating-point performance even in programs that do not use wide vector instructions – for example, in some of the routines used in games.
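The effect of the widening on peak throughput can be sketched per cycle (final clocks are unknown, and the two-FMA-pipes-per-core figure is our assumption based on Zen's known layout):

```python
# Per-core peak FP64 throughput before and after the widening
# described above (a sketch, not an AMD-published figure).
VEC_BITS_ZEN, VEC_BITS_ZEN2 = 128, 256   # native FP datapath width
FMA_PIPES = 2                            # assumed FMA-capable pipes per core
FLOPS_PER_FMA = 2                        # a fused multiply-add = 2 FLOPs

def fp64_flops_per_cycle(vec_bits):
    doubles = vec_bits // 64             # FP64 lanes per vector
    return doubles * FMA_PIPES * FLOPS_PER_FMA

print(fp64_flops_per_cycle(VEC_BITS_ZEN))    # 8
print(fp64_flops_per_cycle(VEC_BITS_ZEN2))   # 16 – doubled, as described
```

Multiplied by 64 cores, the per-cycle doubling is what lets one Rome socket plausibly match two previous-generation sockets in FP-heavy code.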
The processor's front end (the part responsible for fetching, decoding and distributing instructions to the execution units) has also been improved. Few details were given: we only know that branch prediction and prefetching have been improved and the micro-op cache has been enlarged (which allows decoding to be skipped when the same operation was performed recently). It seems these changes will better hide cache-access latency.
The Zen 2 instruction set is almost identical to Zen's. Despite the wider floating-point units, Zen 2 does not execute AVX-512 instructions; 256-bit AVX2 instructions will at best run twice as fast as on Zen (one per clock cycle). Two instructions have also been added: one that writes a cache line back to RAM, and RDPID, which identifies the specific core within a multi-core processor – the same instructions will also be available in Intel's Ice Lake processors (coming … nobody knows when).
Like previous AMD processors, the second-generation Epyc is immune to some types of side-channel attacks: Meltdown and L1TF. According to AMD representatives, Zen 2 contains hardware improvements that hinder Spectre attacks and attacks on SMT – unfortunately, no details were provided. As far as we know, no effective attack on Zen processors exploiting SMT has been demonstrated yet – the improvements in Zen 2 should make it the safest modern x86 architecture (although, of course, there is no total immunity to side-channel or other hardware attacks).
In the second-generation Epyc, the memory and virtual-machine encryption techniques (TSME and SEV) have been developed further: the new processors can manage more cryptographic keys at the same time, allowing more independently encrypted, isolated virtual machines on a single processor.
According to AMD, one 64-core second-generation Epyc processor offers higher performance than the two best Intel processors available today (Xeon Platinum 8180) or the two best first-generation Epyc processors (Epyc 7601). AMD's reference machine, codenamed Daytona, was demonstrated live running the c-ray benchmark.
Test machine with a second-generation Epyc processor. The same motherboard in a 2U server chassis forms AMD's reference platform, Daytona.
This is a fairly simple ray-tracing performance test that mainly uses floating-point and AVX vector instructions (but not AVX-512); the data set is very small and fits in the cache, and the benchmark itself scales almost linearly with the number of cores.
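The near-linear scaling follows from the structure of the workload: every ray is independent, so the image splits cleanly across cores with no shared state. A minimal sketch of that pattern (the `shade` function is a hypothetical stand-in for per-ray work, not c-ray's actual code):

```python
# Why an embarrassingly parallel benchmark like c-ray scales with cores:
# each pixel is computed independently, so the work partitions cleanly.
from multiprocessing import Pool

def shade(pixel):
    # stand-in for a per-ray computation (CPU-bound, no shared data)
    x, y = pixel
    return (x * 31 + y * 17) % 255

def render(width, height, workers):
    pixels = [(x, y) for y in range(height) for x in range(width)]
    with Pool(workers) as pool:
        return pool.map(shade, pixels)   # pixels split across processes

if __name__ == "__main__":
    image = render(64, 64, workers=4)
    print(len(image))  # 4096
```

Real workloads rarely look like this: once threads contend for memory bandwidth or shared data, the scaling curve flattens well before 128 threads.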
One second-generation Epyc finished rendering faster than two Epyc 7601s or two Xeon Platinum 8180s. We were assured that software optimizations and higher clock speeds will give the final version of the second-generation Epyc even better performance.
Of course, performance results presented by manufacturers should always be taken with a grain of salt – especially since practical programs that lean heavily on vector instructions also require high memory bandwidth, an aspect of the system that c-ray completely ignores.
What do we not know yet?
The Epyc announcement raises many questions, most of which AMD representatives are not yet willing to answer. The most important unknown technical details include:
- the cache organization in Rome, especially the additional memory in the IO die
- the organization of cores within the compute die – we do not know whether they are still grouped into CCX blocks of four cores with a shared cache
- clock speeds, the number of models and the breadth of the second-generation Epyc lineup
- launch date – we only know the year (2019)
In turn, the Zen 2 macroarchitecture raises many questions about future desktop processors with Zen 2 cores. AMD will probably again decide to serve many market segments with the smallest possible number of distinct dies – thereby gaining the most from Zen 2's modular nature. If the same compute dies as in Rome are used in desktop processors (which is not certain), they will probably have to be paired with a new, different IO die. This could have a big impact on memory access in the next generation of Ryzen processors.