Notes From a Parallel Universe

30 August 2017

A frequent request I hear is to clarify processor nomenclature on our roadmap. To be fair, it is confusing, with internal code names, marketing names, platform names, and more. Here’s an attempt to lay it out from our perspective as a rugged COTS board vendor.

Note that some forward-looking details are covered by our NDAs with the chip vendors, and those are not included here. Everything here is in the public domain, with all that implies (you know you can’t believe everything on the web, right?). Nomenclature is only part of the story, though: changes in processor microarchitecture and parallelism become ever more important when trying to get the maximum performance out of a compute platform, so we’ll have a look at what is going on there too.

The current state of the art in the Intel world is represented by the 7th Gen Kaby Lake parts and the 8th Gen Kaby Lake R(efresh) parts announced in August 2017. Note that the 8th Gen parts announced so far are system-on-chip devices intended for the laptop market – compellingly low power (15W TDP), but with no ECC memory, a prerequisite for most mission-critical applications. Expect to see other 8th Gen devices announced this year – possibly some from the Coffee Lake portfolio – blurring the lines between architecture updates and device “generation” and making things yet more complicated.

Intel has formally announced the availability of AVX-512, the next significant speed bump for vectorizable code. It has been available for some time on the Xeon Phi family, is now available on the Xeon Processor Scalable Family, and is widely expected to make its way into embedded processors in the not-too-distant future. Because the vector registers double in width from 256 to 512 bits, this provides a doubling of the peak theoretical FLOPS over CPUs with AVX2.
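
To make the width jump concrete, here is a minimal sketch (not production code: the function names are mine, and it assumes the array length is a multiple of the vector width) of the same vector add written with AVX2 and AVX-512F intrinsics. One 512-bit instruction processes sixteen single-precision floats where AVX2 handles eight.

    #include <immintrin.h>

    /* 256-bit AVX2: eight floats per instruction. */
    void vadd_avx2(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
        }
    }

    /* 512-bit AVX-512F: sixteen floats per instruction - the source of
       the doubled peak FLOPS. Requires a compiler flag such as
       -mavx512f and a CPU that supports it. */
    void vadd_avx512(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(c + i, _mm512_add_ps(va, vb));
        }
    }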

One thing that becomes apparent when looking at the trends is that, to extract maximum performance from newer devices, applications must be written to exploit an increasing number of cores and threads. Another is the growing dependence on the AVX/AVX2/AVX-512 vector engines to compensate for the slowing growth in clock rates. That’s good news for those of us who understand how to vectorize and thread code - but that doesn’t include everyone.
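
To show what that looks like at the source level, here is a minimal sketch using a standard OpenMP pragma (the function name is illustrative): one directive spreads the loop iterations across cores and asks for vector code within each thread.

    /* One loop, exploited two ways: iterations are divided across
       cores (parallel for) and packed into vector lanes within each
       thread (simd). Build with, e.g., gcc -O3 -fopenmp -march=native */
    void scale_and_add(float *y, const float *x, float a, int n)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }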

Solutions

How, then, to extract that performance without relying on very specific skill-sets? Fortunately, there are some solutions. Firstly, you can call upon math libraries that are written to use the available vector instructions and to launch multiple threads across cores, such as Abaco’s AXISLib or Intel’s MKL. These work great if you are willing to modify source code to manually replace looped algorithms with calls to the library.
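As a sketch of what that looks like in practice, the scale-and-add loop from earlier collapses into a single call to MKL’s CBLAS interface (AXISLib exposes comparable routines through its own API):

    #include <mkl_cblas.h>

    /* The hand-written loop y[i] += a * x[i] becomes one library call.
       cblas_saxpy computes y = a*x + y; MKL's implementation is
       vectorized and threaded internally, so no pragmas are needed. */
    void scale_and_add_mkl(float *y, const float *x, float a, int n)
    {
        cblas_saxpy(n, a, x, 1, y, 1);
    }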

If that is still not acceptable, all is not lost. Vectorizing compilers have been around for decades, initially in the realm of supercomputing, but increasingly in embedded and now mainstream computing. I first used them some 20 years ago. The latest ones will take your standard C/C++ code and look for opportunities to automatically replace compute loops with vector code - and in some cases with some form of threading exploitation such as OpenMP.
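
For example, a loop like the one below needs no intrinsics or library calls at all - you simply ask the compiler to vectorize and to report what it did. The flags shown in the comment are representative examples for GCC and the Intel compiler, not an exhaustive recipe.

    /* Plain C left to the auto-vectorizer. Representative report flags:
         GCC:   gcc -O3 -march=native -fopt-info-vec-all vmul.c
         Intel: icc -O3 -xHost -qopt-report=5 -qopt-report-phase=vec vmul.c
       Each produces a report naming the loops that vectorized and
       explaining those that did not. */
    void vmul(float *restrict c, const float *restrict a,
              const float *restrict b, int n)
    {
        for (int i = 0; i < n; i++)   /* independent iterations: a    */
            c[i] = a[i] * b[i];       /* textbook vectorization case  */
    }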

Of course, life isn’t always simple, so loops will often fail to vectorize or thread because of detected loop dependencies, other factors that could produce incorrect results, or ugly memory access patterns. A compiler worth its salt will produce a report that identifies such loops and their snags, with suggestions for how the code could be restructured to allow automatic optimization. It is also worth bearing in mind that the AVX2 vector execution units belong to the physical core and are shared between its two hardware threads. This means that for highly optimized code, enabling Hyper-Threading, which doubles the number of virtual cores, can actually hurt performance.
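
To make those snags concrete, here is a small sketch (function names are mine) contrasting a true loop-carried dependency, which the compiler must refuse to vectorize as written, with an apparent one that a one-word restructuring removes:

    /* (1) True dependency: each iteration consumes the previous
       iteration's result, so vectorizing as-is would change the
       answer. Expect the report to flag this loop. */
    void prefix_sum(float *a, const float *b, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + b[i];
    }

    /* (2) Apparent dependency: without help the compiler cannot prove
       dst and src never overlap. The C99 'restrict' qualifiers make
       that promise explicit - exactly the kind of restructuring a
       good vectorization report will suggest. */
    void double_it(float *restrict dst, const float *restrict src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }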

Equally, if code is not fully optimized for AVX2, then Hyper-Threading may help: the second hardware thread can fill the cycles the first spends waiting on stalled pipelines with useful work. Tools like Intel’s Vectorization Advisor can prove extremely advantageous in identifying and characterizing performance drains and opportunities for optimization.
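
When in doubt, measure. A throwaway harness along these lines (the thread counts and the stand-in kernel are assumptions - substitute your own core topology and hot loop) turns the Hyper-Threading question into an experiment rather than a guess:

    #include <omp.h>
    #include <stdio.h>

    #define N (1 << 22)
    static float x[N], y[N];

    /* Stand-in kernel; the orphaned 'omp for' binds to the parallel
       region created in main(). */
    static void kernel(void)
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            y[i] += 2.0f * x[i];
    }

    int main(void)
    {
        /* Assumed topology: 4 physical cores, 8 hardware threads. */
        int counts[2] = { 4, 8 };
        for (int c = 0; c < 2; c++) {
            omp_set_num_threads(counts[c]);
            double t0 = omp_get_wtime();
            #pragma omp parallel
            for (int rep = 0; rep < 100; rep++)
                kernel();
            printf("%d threads: %.3f s\n", counts[c], omp_get_wtime() - t0);
        }
        return 0;
    }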

Next time, we’ll look at what is going on with some of the other processors of interest for specific niches.



Peter Thompson

Peter Thompson is Vice President, Product Management at Abaco Systems. He first started working on High Performance Embedded Computing systems when a 1 MFLOP machine was enough to give him a hernia while carrying it from the parking lot to a customer’s lab. He is now very happy to have 27,000 times more compute power in his phone, which weighs considerably less.