
AMD Comments on Apparent Performance Bug with Windows Scheduler



Last year AMD released its second-generation Threadripper CPUs, including a 32-core model that should be very effective in some tasks. Unfortunately, as some discovered after release, this performance potential is not always realized, which led many to speculate that memory bandwidth was to blame. To reach 32 cores for the 2990WX or 24 cores for the 2970WX, AMD connects four active dies together, and because each core supports SMT, the thread count is double the core count. Each die has its own memory controller, but the Threadripper design gives only two of the four dies direct access to memory, limiting the platform to quad-channel. Many concluded that quad-channel memory simply could not feed all of the cores, especially those on the two dies with indirect access. Not everyone was satisfied with this hypothesis though, including Wendell from Level1Techs, who reached out to Ian Cutress of AnandTech and Jeremy of Bitsum.

One hole in the memory-bandwidth theory is that the issue is not present in Linux, which led many to decide the problem must lie with the Windows scheduler, the part of the operating system that decides which logical processors a program's threads run on. It would seem the Linux scheduler is better, but Wendell wanted to keep experimenting, so he got an AMD EPYC 7551, which is also a 32-core processor like the 2990WX but supports octa-channel memory. This testing further confirmed the issue is not memory bandwidth, as the EPYC chip also demonstrated the performance loss under Windows. By this time, though, Wendell had also discovered that, curiously, removing CPU 0 from the processor affinity while a multi-threaded application was running restored performance on Windows, again indicating a scheduler issue, but still an odd one.
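To make the CPU 0 workaround concrete, here is a minimal sketch of the kind of affinity bitmask it implies, with one bit per logical processor and bit 0 cleared. The helper name `affinity_mask_without` is hypothetical and the 64-thread count is taken from the 2990WX figures above. The code only computes the mask and does not apply it to any process.

```python
def affinity_mask_without(excluded, logical_cpus):
    """Return a bitmask with one bit per logical CPU, with the excluded CPUs cleared.

    Hypothetical helper for illustration only; it does not call any OS API.
    """
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    mask = (1 << logical_cpus) - 1  # start with every logical CPU allowed
    for cpu in excluded:
        if not 0 <= cpu < logical_cpus:
            raise ValueError(f"CPU {cpu} is out of range")
        mask &= ~(1 << cpu)  # clear the bit for this CPU
    return mask


# 32 cores with SMT give 64 logical processors; drop CPU 0 from the set.
mask = affinity_mask_without({0}, 64)
print(hex(mask))  # 0xfffffffffffffffe
```

A mask like this is the sort of value an affinity setting works with. How it gets applied (Task Manager, a tool such as CorePrio, or an API call) is outside what this sketch covers.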

After more testing and the work with Ian Cutress and Jeremy, it was determined that the Windows scheduler has a problematic approach to handling NUMA nodes. Standing for Non-Uniform Memory Access, NUMA is used to arrange clusters of processors in systems that have more than one, such as dual-socket systems and now Threadripper and EPYC CPUs. The Windows scheduler has a concept of a 'best NUMA node', which it assigns to each program, and it then tries to run the program's threads on that node. Apparently this is what causes the issue. If one launches a program that will use all 32 cores/64 threads across all of the NUMA nodes, the scheduler keeps trying to put these threads on the 'best NUMA node', pushing others off. The result is core contention and performance loss as the multi-threaded program is shuffled between cores and threads.
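The contention described above can be illustrated with a toy model. This is not the actual Windows scheduler logic, which the article does not detail. It only compares how many software threads end up sharing a logical CPU when all of them are steered to one preferred node and when they are spread over every node. The node and CPU counts are assumptions matching the 2990WX layout (four dies, 8 cores with SMT each), and all function names are hypothetical.

```python
from collections import Counter

NODES = 4           # assumed: one NUMA node per die
CPUS_PER_NODE = 16  # assumed: 8 cores x 2 SMT threads per die


def place_on_ideal_node(num_threads, ideal_node=0):
    """Toy policy: every thread is steered onto the CPUs of one preferred node."""
    first = ideal_node * CPUS_PER_NODE
    return [first + (t % CPUS_PER_NODE) for t in range(num_threads)]


def place_across_nodes(num_threads):
    """Toy policy: threads are spread over every logical CPU in the system."""
    total = NODES * CPUS_PER_NODE
    return [t % total for t in range(num_threads)]


def max_threads_per_cpu(placement):
    """Return the largest number of threads sharing any single logical CPU."""
    return max(Counter(placement).values())


print(max_threads_per_cpu(place_on_ideal_node(64)))  # 4
print(max_threads_per_cpu(place_across_nodes(64)))   # 1
```

In this simplified picture, 64 threads forced onto one 16-CPU node means four threads competing for each logical CPU while the other three nodes sit idle. That is the flavor of contention the testing pointed to.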

As a result, Bitsum has added a feature to its CorePrio software called 'NUMA Disassociator' that adjusts thread affinity while software is running, but this is only a soft fix, as it is the Windows scheduler that needs to be addressed. Interestingly, it was discovered that Windows already has a hotfix in place for systems with two NUMA nodes, where the 'best NUMA node' mechanism is disabled.

While Wendell shared this discovery earlier in the month, Ian Cutress has now written an article at AnandTech that includes comments from AMD on the situation. Though the company does not go into details, it does state that it has tickets open with Microsoft's Windows team on this and commends Wendell for his work and discovery, though apparently the actual issue is slightly different. When a fix is found and ready, we can expect an announcement for it, along with details on how it and other optimizations made at that time will impact performance.



Source: AnandTech
