Alright, let's tackle the problem at the root. I don't know whether this has already been posted, but let's get technical while keeping an overall view of what seems to be the new CPU's main issue: latency.
Taken from Reddit which, say what you will, is always a useful source. Let's start to understand why, in certain situations, even the good old FX-8350 seems to threaten Ryzen, and what has planted doubts in so many people who were waiting for its release.
https://www.reddit.com/r/Amd/comment...discussion_of/
Hello team red,
I saw this link: Hardware.fr over at /r/hardware and thought I'd start a discussion about it over here since this sub is seeing a lot more traffic at the moment.
NOTE: I work as a computer engineer, but only have a BS. There are many people smarter than me and that know more about this stuff than I do. If I get anything wrong in this post, please do speak up. This is just my understanding of the situation. I will try to explain the article to the best of my ability.
Benchmarks
Ryzen's main memory latency (how long the processor takes to fetch data from RAM) is horrid compared to the competing Intel processor (the 6900K), and also horrid compared to the FX-8350: it sits at 98 ns, versus around 70 ns for both the Intel chip and the FX-8350.
Going down a step, we look at the latency of the three levels of cache. In general, the L1 and L2 caches of Ryzen and the 6900K are comparable. The 6900K has higher L1 and L3 bandwidth, and Ryzen wins out in L2. However, Ryzen's L3 latency sits at 46.6 ns, whereas the 6900K's is at 17.3 ns.
Memory Access Basics
Basically, the CPU checks memory locations moving outward from the core: first the small, fast L1 cache, then the larger, slower L2, then the even larger, slower L3, and so on until it reaches main memory. This is an oversimplification, but it gives you an idea of how it works.
That means the massive 28 ns main-memory latency gap (98 ns vs. ~70 ns) can be almost completely explained by the terrible L3 performance: the L3 latency gap (46.6 ns vs. 17.3 ns) is itself about 29 ns.
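For anyone who wants to see where numbers like these come from: they are normally measured with a pointer chase, where each load depends on the previous one, so the CPU cannot overlap or prefetch the accesses. Below is a minimal sketch of the idea in C (not the tool Hardware.fr used; the 64 MiB buffer is simply chosen to be larger than any cache level, and shrinking it to fit in L1/L2/L3 makes the time per hop drop to roughly that level's latency):

/* Minimal pointer-chasing latency sketch. Dependent loads through a randomly
 * shuffled ring force each access to wait for the previous one, so the
 * average time per hop roughly approximates the latency of whichever level
 * the working set fits into (TLB misses add a little on top). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    (64 * 1024 * 1024 / sizeof(size_t))  /* 64 MiB: well past any L3 */
#define HOPS 10000000UL

int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) next[i] = i;

    /* Sattolo's shuffle: produces one big cycle through all N slots and
       defeats the hardware prefetcher. */
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (unsigned long h = 0; h < HOPS; h++) p = next[p];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / HOPS, p);
    free(next);
    return 0;
}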
Why is the L3 performance so bad?
This is where things get weird. If you have been reading the architectural details of Ryzen, you'll know that the 8c/16t chips have 2 CCXs on them, each containing 4 cores. You'll also know that
Ryzen's L3 cache is not a true general-purpose cache. It's a victim cache.
The Victim Cache
A victim cache is a specialized type of cache. The way it works is that any cache lines about to be evicted from the inner cache level (L2, in Ryzen's case) are put into the victim cache. It then mostly behaves like a normal cache, until data needs to be pulled from it: at that point, the line in the inner cache and the line in the victim cache are swapped.
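To make the swap mechanics concrete, here is a toy model in C (purely illustrative; the sizes and replacement policy are made up and far simpler than Ryzen's real L3): lines evicted from a tiny direct-mapped inner cache drop into a two-entry victim buffer, and a hit in the victim buffer swaps the line back in rather than refetching it from memory.

/* Toy victim cache: direct-mapped inner cache + small victim buffer. */
#include <stdio.h>

#define SETS  4   /* direct-mapped inner cache: one line per set */
#define VWAYS 2   /* fully associative victim buffer */

typedef struct { int valid; unsigned tag;  } Line;   /* inner cache entry   */
typedef struct { int valid; unsigned addr; } VLine;  /* victim buffer entry */

static Line  inner[SETS];
static VLine victim[VWAYS];
static int   vnext = 0;   /* round-robin replacement in the victim buffer */

static void access_addr(unsigned addr) {
    unsigned set = addr % SETS, tag = addr / SETS;

    if (inner[set].valid && inner[set].tag == tag) {
        printf("addr %2u: inner hit\n", addr);
        return;
    }
    /* Look in the victim buffer; on a hit, swap it with the inner line. */
    for (int w = 0; w < VWAYS; w++) {
        if (victim[w].valid && victim[w].addr == addr) {
            if (inner[set].valid)
                victim[w].addr = inner[set].tag * SETS + set; /* displaced line takes its slot */
            else
                victim[w].valid = 0;
            inner[set].valid = 1;
            inner[set].tag = tag;
            printf("addr %2u: victim hit (swapped)\n", addr);
            return;
        }
    }
    /* Miss everywhere: the evicted inner line drops into the victim buffer. */
    if (inner[set].valid) {
        victim[vnext].valid = 1;
        victim[vnext].addr = inner[set].tag * SETS + set;
        vnext = (vnext + 1) % VWAYS;
    }
    inner[set].valid = 1;
    inner[set].tag = tag;
    printf("addr %2u: miss, filled from memory\n", addr);
}

int main(void) {
    /* Addresses 0, 4 and 8 all map to set 0, so they keep evicting each
       other; the victim buffer turns most re-references into swaps. */
    unsigned trace[] = { 0, 4, 0, 4, 8, 0 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        access_addr(trace[i]);
    return 0;
}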
It should be noted that some of Intel's Haswell product line also implemented an L4 victim cache that was shared by the CPU and iGPU, called Crystal Well. For Skylake, Intel canned the victim cache concept as it is quite complicated compared to a normal cache.
The swap, and why it's bad for certain situations
I noted above that the 8c/16t chips have 2 CCXs on them. Each CCX contains 8MB of the L3 cache, for a total of 16MB. Ryzen's architecture is such that if a thread on one CCX needs to access the cache in the other CCX, it needs to talk through a bus system that goes through the memory controller. The bandwidth of this interconnection is only 22GB/s, about the speed of DDR3-1600. SLOW. This introduces two possible problematic scenarios:
A thread saturates more than half of the total L3 cache. In this case, let's say a thread in CCX0 needs to access a library that is 12MB large. It's all in the L3 cache, but some of it is in the half of the cache that is in CCX1. It queries both caches simultaneously, but has to wait for some of the data in CCX1.
Windows likes to shuffle work between cores often. Basically, it will move serial computations (serial not in the traditional sense, but in the sense that they re-use data that has already been computed and put into cache) onto other threads and cores with no regard for where that data physically lives. This makes sense for Intel and older AMD products, where the last-level cache is unified across all cores. It is bad for Ryzen, because moving those cache accesses from one CCX to the other is extremely expensive.
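Until scheduler fixes land, the blunt software workaround is explicit CPU affinity: keep a thread on cores that share a CCX, so the data it leaves in cache never has to cross the interconnect. Here is a minimal Linux sketch using pthreads (on Windows the equivalent call would be SetThreadAffinityMask). Note that which logical CPUs belong to which CCX depends on the system, so the 0-3 range below is only an assumption for illustration.

/* Pin a worker thread to an assumed set of cores on one CCX. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    /* Whatever this thread caches now stays within one CCX's L3 slice. */
    printf("worker running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    cpu_set_t ccx0;
    CPU_ZERO(&ccx0);
    for (int cpu = 0; cpu < 4; cpu++)   /* assumption: cores 0-3 share a CCX */
        CPU_SET(cpu, &ccx0);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Restrict the thread before it starts: the scheduler may still move it
       between these four cores, but never across to the other CCX. */
    pthread_attr_setaffinity_np(&attr, sizeof(ccx0), &ccx0);

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}

Build with gcc -pthread; doing this by hand is roughly what a CCX-aware scheduler would try to do automatically.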
THE GOOD
This problem can partially be remedied in software. It is likely that AMD and Microsoft are working on scheduler adaptations to keep threads from wandering across CCXs as much as possible. Additionally, it's been reported that one of the optimizations coming in Windows 10's Game Mode is to move threads around less. Also of note: when Crystal Well was released (Intel's foray into victim caches), Linux kernel and driver updates were able to double the performance it had shown before the software upgrades. Reference: Phoronix
THE BAD
This problem is, at its core, a limitation of Ryzen's architecture. It cannot be entirely fixed by software. Even with scheduling fixes, either of the two above scenarios will result in performance hiccups.
THE UGLY
GAMING.
Games are a prime candidate for the second scenario above (thread roaming).
It remains to be seen how much or how little an updated scheduler will help in games. Some games pin workloads to specific threads, but share data across threads. It might be that a thread in CCX0 writes some data to later be read back by another thread, but the game engine has pinned the second calculation to a thread in CCX1 instead of allowing the Windows scheduler to decide. In this case, the crossover penalty is still incurred.
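If you want to see the crossover penalty directly, a core-to-core ping-pong test makes it visible: two threads bounce a flag in a shared cache line back and forth and you time the round trip, once with both threads pinned to the same CCX and once with them on different CCXs. A rough C sketch follows (the core numbers 0 and 4 are an assumption about which logical CPUs sit on different CCXs; adjust for your topology):

/* Core-to-core ping-pong: measures the round trip of a cache line handoff. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int flag = 0;

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin_to(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int ping_cpu = 0, pong_cpu = 4;   /* assumed to sit on different CCXs */
    pthread_t t;
    pthread_create(&t, NULL, pong, &pong_cpu);
    pin_to(ping_cpu);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: ~%.0f ns (cores %d <-> %d)\n", ns / ROUNDS, ping_cpu, pong_cpu);
    pthread_join(t, NULL);
    return 0;
}

On a chip with a unified L3 like the 6900K the two placements should look similar; on Ryzen, the different-CCX case should be visibly slower, and that extra round trip is exactly the crossover penalty described above.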
In light of this, and also reading the various comments at the link I posted, let's try to tackle the problem rationally, because it is real,
and, as I said earlier in the other thread, it cannot be fully fixed in software.
Does anyone have the knowledge to give an impartial opinion on this?
Let's avoid remarks like "it costs half as much" and so on, because in here they are useless....