Here’s what HPC programmers are constantly looking for:
(1) performance – on all the systems they care about (and hopefully future ones too).
(2) portability – who likes rewrites just because a new architecture comes along, right?
(3) maintainability – just like in any other computer science field, code should be clean, such that it can be changed without huge costs, preferably by the scientists who are directly working with the applications.
The following is a tale of how these desirable properties will affect your work if you decide to look into accelerator programming.
Say you are at the onset of programming an HPC application. No problem, right? You know how the underlying machine works in terms of memory architecture and ALUs. (Or not? Well, that’s no problem either – compilers have become so good, I hear, they will surely figure it out.) You know which numeric approximation will map your problem most efficiently. You know all about Roofline performance modelling, so you can verify whether your algorithm performs on the hardware the way you expected. You know what you’re supposed to do when you encounter data parallelism. So – let’s sit down and do it!
You hear that your organisation is ordering a new cluster. In order to get closer to Exascale, this cluster will sport these fancy new accelerators, so all new HPC software projects should evaluate whether and how they can make use of coprocessors. You start reading up on the accelerator landscape. OpenCL, CUDA, OpenACC, OpenMP, ArrayFire, Tesla, Intel MIC, Parallella… Your head starts spinning – all these hardware and software tools have lots of overlap, but also significant differences. In particular, they are all very different from the x86 CPU architecture. Why is that?
It essentially comes down to the fact that in 2005, the free lunch was over.
CPU vendors still claim to follow Moore’s law today. However, while in past decades you could simply recompile your code for a new CPU, since around 2005 you have been forced to rethink: processor clock rates have hit their limit, so significant speedups now come from additional levels of parallelism that you need to care about. Today we have multiple cores per socket, and the vector sizes per thread keep growing. So in HPC, you can no longer just distribute your load onto the nodes and let them do their work sequentially – you usually care about one or two more levels of parallelization. Here is where accelerators come into play: if you have to implement all that multithreading anyway – why not make it worthwhile? Why care about six or eight threads if we can have thousands? What if your application is limited by memory bandwidth rather than computation? Well, today’s accelerators offer 200-300 GB/s of memory bandwidth, compared to around 50 GB/s on the CPU, so we’re still good, right?
How about the fact that we need to copy our data back and forth between main memory and accelerator memory over a slow (around 5 GB/s) PCI Express bus? Won’t this kill any potential speedup we could get from the higher memory bandwidth on the device? Well yes, that’s the biggest gotcha: we need to move as much of our program onto the device as possible, so that we can iterate while keeping most of the data in device memory. You come up with some simple formulas to figure out whether a speedup is possible.
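Such a formula might look like the following back-of-the-envelope sketch (my own illustration, not from the post, using the bandwidth figures mentioned above):

```python
# Toy model for a memory-bandwidth-bound kernel: the device must amortize
# the PCIe transfer cost before its higher bandwidth pays off.

def estimated_speedup(bytes_moved, iterations,
                      cpu_bandwidth, gpu_bandwidth, pcie_bandwidth):
    """Estimated GPU-over-CPU speedup when iterating `iterations` times
    over `bytes_moved` bytes. Bandwidths are in bytes per second."""
    cpu_time = iterations * bytes_moved / cpu_bandwidth
    # The GPU pays the PCIe copy once in each direction, then iterates
    # entirely in device memory.
    gpu_time = (2 * bytes_moved / pcie_bandwidth
                + iterations * bytes_moved / gpu_bandwidth)
    return cpu_time / gpu_time

GB = 1e9
# A single iteration: the copy dominates, offloading loses badly.
print(round(estimated_speedup(1 * GB, 1, 50 * GB, 250 * GB, 5 * GB), 2))     # → 0.05
# A thousand iterations: the copy is amortized, approaching the 250/50 = 5x ratio.
print(round(estimated_speedup(1 * GB, 1000, 50 * GB, 250 * GB, 5 * GB), 2))  # → 4.55
```

The numbers make the point of the next paragraph: without repeated iteration over device-resident data, the PCIe copy eats the entire bandwidth advantage.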
So you notice that in order to get a good speedup on your accelerator, at least one of three characteristics needs to hold: (1) your data is iterated over many times (e.g. simulations, solvers), (2) your code is computation-heavy compared to its input/output memory, (3) your code performs lots of local memory operations that only touch device memory, compared to the input/output.
You conclude that your code matches these criteria. So, let’s do this! You take a good look at your existing multi-100k-line application written in Fortran. Shouldn’t be too difficult to integrate, right? After all, we have these fancy new directives that we can put around the parallel loops, and it will all just work automatically, right?
You start to notice a problem. Since creating new processes or threads used to be expensive, your existing program has been parallelized as coarse-grained as possible. In some parts of your code you see a single big parallel loop over your parallel domains X and Y, with lots of zero- or one-dimensional sequential operations (tens of thousands of code lines in a deep call graph) executed inside this loop. In the simple GPGPU examples you’ve done, be it CUDA, OpenCL or OpenACC, all you’ve ever seen are tight loops over a few hundred lines of code, tops. You learn that GPUs are not good with deep call graphs – their context switching is limited to kernels, so everything going on within a kernel (essentially a parallel loop) needs to be inlined. You try to do this with an OpenACC-enabled compiler, but at around the second call graph level you give up.
You come to the conclusion: the only way this code is going to be parallelized for GPGPU is by rewriting it with tight loops, passing 2D or 3D data around instead of 0D or 1D. What you initially thought would be a simple act of wrapping code turns into a major rewrite with almost 100% of the code inside the main iteration affected. Alright, that’s annoying – but since the next cluster’s architecture has already been decided, you figure this will have to be done anyway. (Surely you haven’t given up yet, have you?)
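To make the shape of that rewrite concrete, here is a toy sketch of my own (pure Python standing in for the Fortran structure; the function names are invented): point-wise scalar helpers buried under one coarse parallel loop, versus a helper rewritten to take the whole 3D field so that only tight loops remain to offload.

```python
# Before: 0D helper functions, called from inside one coarse-grained
# parallel loop over the X/Y domains. The deep call graph lives in here.
def saturation(t):
    return 0.01 * t * t

def physics_column(t_column):
    # one vertical column of values, processed sequentially
    return [saturation(t) for t in t_column]

def step_coarse(temp):
    # temp[x][y] is a column; the x/y loops are the only parallelism
    return [[physics_column(temp[x][y]) for y in range(len(temp[0]))]
            for x in range(len(temp))]

# After: the helper itself takes the full 3D field. The x/y loops have
# become a tight kernel body that a GPU directive could parallelize,
# but every declaration and access gained x,y indices.
def saturation_3d(temp):
    nx, ny, nz = len(temp), len(temp[0]), len(temp[0][0])
    return [[[0.01 * temp[x][y][z] * temp[x][y][z] for z in range(nz)]
             for y in range(ny)] for x in range(nx)]
```

Both versions compute the same field; the difference is purely structural, which is exactly why the scientists in the next paragraph object to doing it by hand.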
While preparing for the next big cluster update is surely a swell thing, what about all the existing CPU-only clusters? They will remain operational for years to come, and your application is still needed on them. Furthermore, the scientists working with your software need to be able to constantly tweak it to fit their latest needs. If you simply let them use the old version on these machines, it will diverge from your nicely accelerated one pretty quickly. And while talking to them, you find out that they are not happy with your plans: why should the code now be written in higher dimensions? All that’s going to do is insert x,y without any offsets everywhere in the code, in each array access and each declaration. All that manual labor just to serve the architecture – and who knows what new loop structure the next big architectural change will demand; will we have to rewrite everything again? Then comes the final blow: you run some tests with this new loop structure, and you start seeing huge performance losses when executing it on the CPU. Analyzing them, you notice that
(1) since this codebase is often memory bandwidth limited, cache performance is essential.
(2) passing around higher dimensional data flushes out the CPU cache between the procedures, so the cache’s effect goes down significantly.
(3) GPU and CPU have different optimal storage orders in this case – the CPU likes Z-first, since lots of calculations over different values of Z are close together in the code. The GPU meanwhile likes X-first, since all its small cores read and store values in lock step – and X is mapped to their thread IDs, which in turn map directly to the hardware cores.
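To make the two layouts concrete, here is a small pure-Python sketch of my own (not from the post) showing how element (x, y, z) maps to a flat memory buffer under each storage order:

```python
# Linearized index of element (x, y, z) in a flat buffer, for the two
# storage orders discussed above.

def index_z_first(x, y, z, nx, ny, nz):
    # z varies fastest: a sequential loop over z touches adjacent
    # addresses, which is what CPU cache lines and prefetchers like.
    return (x * ny + y) * nz + z

def index_x_first(x, y, z, nx, ny, nz):
    # x varies fastest: neighbouring GPU threads (mapped to x) access
    # adjacent addresses, giving coalesced memory transactions.
    return (z * ny + y) * nx + x

nx = ny = nz = 4
# CPU-friendly layout: stepping z by 1 moves 1 element in memory.
assert index_z_first(1, 2, 3, nx, ny, nz) - index_z_first(1, 2, 2, nx, ny, nz) == 1
# GPU-friendly layout: stepping x by 1 moves 1 element in memory.
assert index_x_first(2, 2, 3, nx, ny, nz) - index_x_first(1, 2, 3, nx, ny, nz) == 1
```

The same element sits at a completely different offset in each layout, which is why a single hand-written storage order cannot serve both architectures well.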
So, essentially, you’ve come up with a portable solution, but it is not performance portable. How are you going to explain to all your users that they now have to write code in a more complicated 3D version, just so that it runs slower on their current machines? Isn’t there a better way? You remember Larry Wall’s three great programmer virtues: laziness, impatience and hubris. Is it smart to do these code transformations by hand? It looks like it affects all the basic properties of ideal HPC code negatively: it is neither performance portable nor very maintainable. Has no one come up with a better solution? Can’t the machine do this for you?
Yes. The machine can.
You start googling keywords like “Fortran GPU storage order”, “Fortran performance portable” or “Fortran GPGPU”. Again and again, one framework comes up. It is open source, it is called Hybrid Fortran, and it was created to solve exactly the issues you’ve just run into.
Do you see the big difference between the code transformation we talked about above and the one introduced here? Yes – all your computational code stays the same; you can keep it in whatever low dimensionality you like. Hybrid Fortran takes care of the necessary transformations at compile time (so there is no runtime overhead). There are only two things you need to specify in addition:
(1) Where is the code to be parallelized? (Can be specified for CPU and GPU separately.)
(2) Which symbols need to be transformed to different dimensions?
Hybrid Fortran is simply a (Python-based) preprocessor that parses these annotations together with your Fortran code structure, declarations, accessors and procedure calls, and then writes separate versions of your code – one for CPU with OpenMP parallelization and one for GPU with CUDA Fortran.
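The core idea of such a preprocessor can be sketched in a few lines of Python. This is a drastically simplified illustration of my own – the `@parallel` marker and `transform` function are invented for this sketch; real Hybrid Fortran annotations are considerably richer:

```python
# Toy source-to-source transform: replace a parallelization marker in
# Fortran source text with a backend-specific directive, leaving the
# computational code untouched.

def transform(source, backend):
    """Emit one backend-specific version of the annotated source."""
    out = []
    for line in source.splitlines():
        if line.strip() == "@parallel":
            if backend == "openmp":
                out.append("!$OMP PARALLEL DO")        # CPU version
            elif backend == "cuda":
                out.append("!$cuf kernel do")           # CUDA Fortran version
            else:
                raise ValueError("unknown backend: %s" % backend)
        else:
            out.append(line)                            # code passes through
    return "\n".join(out)

fortran_source = "@parallel\ndo i = 1, n\n  a(i) = b(i)\nend do"
print(transform(fortran_source, "openmp").splitlines()[0])  # → !$OMP PARALLEL DO
```

One annotated source, two generated versions – the actual tool additionally rewrites declarations, array accesses and call signatures, which is precisely what makes the low-dimensional style shown earlier sustainable.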
But what about the performance?
Well, see for yourself:
| Name | Performance Results | Performance Characteristic | Speedup HF on 6 Core vs. 1 Core | Speedup HF on GPU vs. 6 Core | Speedup HF on GPU vs. 1 Core |
| --- | --- | --- | --- | --- | --- |
| Japanese Physical Weather Prediction Core (121 Kernels) | Slides Only | | | | |
| 3D Diffusion (Source) | XLS | Memory Bandwidth Bound | 1.06x | 10.94x | 11.66x |
| Particle Push (Source) | XLS | Computationally Bound, Sine/Cosine operations | 9.08x | 21.72x | 152.79x |
| Poisson on FEM Solver with Jacobi Approximation (Source) | XLS | Memory Bandwidth Bound | 1.41x | 5.13x | 7.28x |
| MIDACO Ant Colony Solver with MINLP Example (Source) | XLS | Computationally Bound, Divisions | 5.26x | 10.07x | 52.99x |
Where available, the comparison baseline is a reference C version; otherwise it is the Hybrid Fortran CPU implementation. A Kepler K20x was used as GPU and a Westmere Xeon X5670 as CPU (TSUBAME 2.5). All results were measured in double precision. The CPU cores were limited to one socket using thread affinity ‘compact’ with 12 logical threads. For CPU, the Intel compilers ifort/icc with the ‘-fast’ setting were used. For GPU, the PGI compiler with the ‘-fast’ setting and CUDA compute capability 3.x was used. All GPU results include the memory copy time from host to device.
For the Diffusion and Particle Push examples, there is also a model analysis (see XLS links) as well as comparisons with CUDA C, OpenACC/OpenMP C and OpenACC Fortran available (see attachments at the end of this post).
But wait, there’s more!
You notice that in this framework, the portability goes even further.
(1) Hardware-specific properties such as the vector size are abstracted away from the code through a ‘template’ attribute for parallel regions. When new hardware with a different vector size comes along, your code only needs to be adapted in the one place that defines the templates.
(2) The Python preprocessor keeps the parallelization implementation completely separated from the parser. When a new architecture comes along with a different Fortran parallelization scheme (for example ARM multiprocessors with OpenCL), all that is needed is a new implementation class. As an example, here is the Python class that implements the OpenMP parallelization scheme:
class OpenMPFortranImplementation(FortranImplementation):
    def __init__(self):
        self.currDependantSymbols = None

    def declarationEnd(self, dependantSymbols, routineIsKernelCaller, currRoutineNode, currParallelRegionTemplate):
        self.currDependantSymbols = dependantSymbols
        return FortranImplementation.declarationEnd(self, dependantSymbols, routineIsKernelCaller, currRoutineNode, currParallelRegionTemplate)

    def parallelRegionBegin(self, parallelRegionTemplate):
        if self.currDependantSymbols == None:
            raise Exception("parallel region without any dependant arrays")
        openMPLines = "!$OMP PARALLEL DO DEFAULT(Private) "
        openMPLines += "SHARED(%s)\n" % (', '.join([symbol.deviceName() for symbol in self.currDependantSymbols]))
        return openMPLines + FortranImplementation.parallelRegionBegin(self, parallelRegionTemplate)

    def parallelRegionEnd(self, parallelRegionTemplate):
        openMPLines = "\n!$OMP END PARALLEL DO\n"
        return FortranImplementation.parallelRegionEnd(self, parallelRegionTemplate) + openMPLines

    def subroutineEnd(self, dependantSymbols, routineIsKernelCaller):
        self.currDependantSymbols = None
        return FortranImplementation.subroutineEnd(self, dependantSymbols, routineIsKernelCaller)
You have found a solution to port your code that
(1) works for all kinds of shared memory data parallelizations on GPU and CPU,
(2) enables the fastest portable code you can find thus far,
(3) comes within a few percent of the highest measured performance, both on GPU and CPU,
(4) requires the least amount of code changes for coarse-grained parallelizations, i.e. allows scientists to keep their code in low dimensions,
(5) can be adapted for new parallelization schemes without changing your code.
The cake is not a lie this time.