Webinar Recording and Q&A: High-Performance Computing with C++

The recording of our May 29th webinar with Dmitri Nesteruk, High-Performance Computing with C++, is now available on the JetBrains YouTube channel.

Languages such as JavaScript may receive a lot of hype nowadays, but for high-performance, close-to-the-metal computing, C++ is still king. This webinar takes you on a tour of the HPC universe, with a focus on parallelism, be it instruction-level (SIMD), data-level, task-based (multithreading, OpenMP), or cluster-based (MPI).

We also discuss how specific hardware can significantly accelerate computation by looking at two such technologies: NVIDIA CUDA and Intel Xeon Phi. (Some scarier tech, such as FPGAs, is also mentioned.) The slides used in the webinar are available here.

We received plenty of questions during the webinar, and we’d like to use this opportunity to highlight some of them here, including those we didn’t have a chance to answer during the webinar itself. Please find the questions below, with answers from the presenter, Dmitri Nesteruk.

Q: Why are you not a fan of OpenCL?
A: I generally think that OpenCL is a great idea. The ability to run the same code on x86, GPU, FPGA and Xeon Phi is fantastic. However, this hinges on a very important assumption, namely that it is reasonable to run the same kind of computation on all these devices.

In practice this isn’t always the case. x86 and Xeon Phi are great for general-purpose code. GPGPU is restricted mainly to data-parallel numerics. FPGAs are an entirely different beast – they are not as good for numerics, but excellent for structured data with a high degree of intrinsic parallelism arising from the architecture being used.

On the GPGPU side, where OpenCL is the principal way to program AMD graphics cards, I’d say NVIDIA has won the war, at least for now. Whether or not they make better devices, their tooling support and their marketing efforts have earned them the top spot. CUDA C is more concise than OpenCL, which is great, but of course you have to keep in mind that OpenCL tries to be compatible with all architectures, so you’ve got to expect it to be more verbose. However, programming OpenCL is like programming CUDA using the Driver API, and I’d much rather use CUDA’s higher-level Runtime API, together with the excellent libraries (including Thrust, where applicable).
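To give a flavour of that conciseness, here is a minimal SAXPY sketch using Thrust (not from the webinar, just an illustration): the library call replaces the explicit kernel, launch configuration, and memory copies you would otherwise write by hand.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// y = a * x + y, computed element-wise on the GPU
struct saxpy
{
    float a;
    explicit saxpy(float a) : a(a) {}
    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

int main()
{
    // device_vector allocates GPU memory and copies data implicitly
    thrust::device_vector<float> x(1 << 20, 1.0f);
    thrust::device_vector<float> y(1 << 20, 2.0f);

    // one library call instead of a hand-written kernel and memcpys
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy(2.0f));
    return 0;
}
```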

Q: CUDA has to copy data to the device’s RAM, right? At which magnitude of data do we gain something from CUDA?
A: Yes, and there are actually five (yes, five!) types of memory on CUDA devices that can be used: global, local, shared, constant and texture. The time delay in sending data to and from the device does prevent us from using CUDA for real-time purposes, but if you’re not trying to do that, then what you’re going to be worrying about is saturating the device with enough data to make the computations worthwhile. This depends first of all on how well your problem parallelizes, but assuming it does, the question becomes how much time a single unit of work actually takes.

The logic here is simple: if a unit of work takes less than the time to send data to/from the device, consider keeping it on the CPU and vectorizing it if possible. If, however, you’ve got processes that take up more time, then it might make sense to do the calculations on the GPU. And keep in mind that the GPU is capable of supporting a form of task-based parallelism (streams), so if you’ve got distinct computation processes, you can try pipelining them onto the device.
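Here is a hedged sketch of that pipelining idea with streams. The `process` kernel is a placeholder for whatever your computation is, and the code assumes the host buffer was allocated with cudaMallocHost (page-locked memory is required for copies to be truly asynchronous) and that the data divides evenly into chunks.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel standing in for "a distinct computation process".
__global__ void process(float* data, int n) { /* ... */ }

// Split the work into chunks and issue copy-in, compute and copy-out on
// separate streams so transfers and kernels from different chunks overlap.
void pipeline(float* host, int n, int chunks)
{
    int chunk = n / chunks;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    std::vector<cudaStream_t> streams(chunks);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (int i = 0; i < chunks; ++i)
    {
        float* src = host  + i * chunk;
        float* dst = d_buf + i * chunk;
        cudaMemcpyAsync(dst, src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(dst, chunk);
        cudaMemcpyAsync(src, dst, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaDeviceSynchronize();
    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d_buf);
}
```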

The best way to tell if the GPU is the right solution or not is to write your algorithm and do a performance measurement.

Q: Can you elaborate on the protocol parsing?
A: Let’s look at the problem from a higher level. Any specific single computation problem you may have is likely to be faster when done in hardware than in software. But in practice, you’re unlikely to be designing an ASIC for each particular problem, because it’s very expensive, and because changing an ASIC is impossible once it’s designed and produced.

On the other hand, some problems do benefit from extra computation resources. For example, parsing data from a protocol such as FIX allows you to offload some of the work (e.g., building order books) from the CPU, and also to avoid paying the costs of moving data from your network card to RAM and back. These might seem like trivial costs, but given the kind of technology war that trading firms are engaged in, every microsecond helps!

It also so happens that FPGAs are excellent at parsing fixed data formats. This means you can build a NIC that uses an FPGA to parse the data then and there, structure it, maybe even analyze it, and send it upstream faster than an ordinary server would. Plenty of commercial solutions exist for this, but designing your own is also a lot of fun. FPGAs also offer a lot of additional benefits that I mentioned in the webinar, such as relative scalability (you can have 20 FPGAs on a single card, if you want).
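For readers unfamiliar with FIX: it is a sequence of tag=value pairs separated by the SOH character (0x01). Here is a minimal software sketch of the same job such a card does in hardware; `parse_fix` is just an illustrative name, not part of any real product.

```cpp
#include <map>
#include <string>

// Split a FIX message into tag=value pairs; fields end with SOH (0x01).
std::map<int, std::string> parse_fix(const std::string& msg)
{
    std::map<int, std::string> fields;
    std::size_t pos = 0;
    while (pos < msg.size())
    {
        std::size_t eq  = msg.find('=', pos);
        std::size_t soh = msg.find('\x01', eq);
        if (eq == std::string::npos || soh == std::string::npos) break;
        fields[std::stoi(msg.substr(pos, eq - pos))] =
            msg.substr(eq + 1, soh - eq - 1);
        pos = soh + 1;
    }
    return fields;
}

// e.g. parse_fix("35=D\x01" "55=MSFT\x01")[55] yields "MSFT"
```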

Q: Would you recommend using the Intel Xeon Phi for complex simulation (like car simulator)?

A: This depends on what you’re simulating and whether the problem is actually decomposable into independently executing units of work. If you’re after agent-based modeling, then with the Xeon Phi you’re in luck, because it supports just about every parallel paradigm under the sun. You can use MPI or Pthreads or something else, and have the parts of the system interact with one another via messages.

It really depends on the specifics of the problem you’re trying to solve.
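For a taste of the message-passing style mentioned above, here is a minimal MPI sketch: two ranks exchanging a piece of simulation state. Compile with mpic++ and run with, e.g., mpirun -np 2.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        double state = 42.0; // some piece of simulation state
        MPI_Send(&state, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        double state;
        MPI_Recv(&state, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f\n", state);
    }

    MPI_Finalize();
    return 0;
}
```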

Q: As an alternative to inline assembly and intrinsics, you can ensure vectorization and use of SIMD instructions by giving hints to the compiler through pragmas. Why not mention that?

A: Indeed! In fact, one of the pain points of SIMD is that you have to abandon the normal ways of writing code. So having the compiler vectorize it for you is great – it leaves you free to do other things. The same goes for OpenMP, which follows exactly the same idea: mark a loop as parallel and the compiler will do its best to parallelize it.
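For example, a couple of pragmas is all it takes (built with an OpenMP-enabled compiler, e.g. with -fopenmp): `parallel for` spreads iterations across threads, and `simd` asks the compiler to vectorize them.

```cpp
// Scale a vector: threaded across cores and vectorized within each core.
void scale(const float* x, float* y, int n, float a)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i];
}
```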

The concern I have is that these processes are always a black box. The compiler can easily understand a simple for loop with no side effects, but problems aren’t always that simple, and when other variables and constructs get entangled, it is sometimes better to hand-craft critical paths rather than trust the compiler to do a good job of rewriting them.

Q: In the domain you currently work, which HW alternative do you recommend using: FPGAs, GPGPUs, or anything else?
A: It’s ‘horses for courses.’ Quant finance falls into two categories: analysis, when you sit at your desk and try to calibrate a model, and execution, when your co-located servers trade on some market or other. The rules of the game are drastically different in each.

For analysis, anything goes, because you’re not time-constrained by a fast-moving market. The more computational power you have, the faster you’re going to get your ideas validated. Numeric simulations (e.g., Monte-Carlo) fit nicely on GPUs and are especially great when you can allow yourself to drop to single precision. Xeon Phis are general-purpose workhorses – anything can run on them – but they are a relatively new technology that I’m still getting to grips with.
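To give a flavour of the Monte-Carlo-on-GPU case, here is a hedged sketch estimating pi in single precision with cuRAND; the grid dimensions and trial counts are arbitrary, and error checking is omitted for brevity.

```cpp
#include <curand_kernel.h>
#include <cstdio>

// Each thread throws `trials` random darts at the unit square and counts
// how many land inside the quarter circle; hits accumulate atomically.
__global__ void mc_pi(unsigned long long seed, int trials, int* hits)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, id, 0, &state);

    int local = 0;
    for (int i = 0; i < trials; ++i)
    {
        float x = curand_uniform(&state); // single precision is often enough
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) ++local;
    }
    atomicAdd(hits, local);
}

int main()
{
    int* d_hits;
    cudaMalloc(&d_hits, sizeof(int));
    cudaMemset(d_hits, 0, sizeof(int));

    const int blocks = 64, threads = 256, trials = 1000;
    mc_pi<<<blocks, threads>>>(1234ULL, trials, d_hits);

    int hits = 0;
    cudaMemcpy(&hits, d_hits, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_hits);

    double total = double(blocks) * threads * trials;
    std::printf("pi ~ %f\n", 4.0 * hits / total);
    return 0;
}
```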

When it comes to execution, it’s all about speed. An arbitrage opportunity comes in and disappears in the blink of an eye, and thus FPGA-based hardware can help you capture it before others react. Having great hardware is not enough if you’ve got a huge ping to the exchange, obviously. I haven’t tried GPUs and FPGAs on the execution side: I suspect they might be relevant, but it’s something I haven’t investigated yet.

Keep in mind that not all trading is high-frequency trading. Some firms run things on MATLAB which, while not as fast as hand-tuned C++ with SIMD instructions, provides sufficient execution speed coupled with a vast array of built-in math libraries, meaning you don’t have to roll your own. Other quant institutions, including some of the major ones, survive just fine running their trades off Excel spreadsheets – possibly the slowest computation mechanism there is.

Q: You are talking a lot about FPGAs. I don’t quite understand what FPGAs have to do with C++ programming.

A: It’s fair to say that FPGAs are typically programmed using Hardware Description Languages (HDLs) such as VHDL and Verilog. However, an interesting trend is the support for OpenCL on FPGAs. Hopefully the relationship between OpenCL and C++ is self-evident. Plus, since we’re discussing HPC technologies, FPGAs definitely deserve a mention.
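To make the connection concrete, here is a hedged sketch: an OpenCL kernel is written in a C-like dialect and typically shipped as a string inside a C++ host program, and FPGA toolchains consume essentially the same kernel source, compiling it offline into a hardware configuration rather than GPU machine code.

```cpp
#include <string>

// An OpenCL kernel (C-like dialect) embedded in a C++ host program.
const std::string saxpy_source = R"CLC(
__kernel void saxpy(const float a,
                    __global const float* x,
                    __global float* y)
{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
)CLC";
```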

Q: Does ReSharper C++ support CUDA? / Will the IDE support Embedded Systems?
A: Support for CUDA, as well as for the Intel C++ compiler, is at the top of my personal wish list, and I’ve been pestering the ReSharper developers with related issues. While I hesitate to make any promises, I don’t imagine CUDA support is that difficult, considering there’s only one language extension (the triple angle brackets) and the rest is fairly normal C++ that should be parseable straight away. I could be wrong, of course.
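For reference, that one extension is the kernel launch syntax; everything around it is ordinary C++ (a minimal sketch):

```cpp
// __global__ marks a function launched from the host and run on the GPU;
// the chevrons configure the launch: <<<number of blocks, threads per block>>>.
__global__ void add_one(int* data)
{
    data[threadIdx.x] += 1;
}

// called from ordinary C++ host code:
// add_one<<<1, 256>>>(device_ptr);
```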

Regarding the C++ IDE and its support for embedded: if you’re after cross-compilation, it will work right now, though of course the IDE will not parse compiler output. As for supporting specific libraries – we’ll consider those after the initial release which, as you may have guessed, will target the general platform rather than any particular device.

The future direction of the C++ IDE will largely depend on user demand. At the moment, it’s difficult to predict where the product will head. So keep posting and voting for feature requests!

About the Presenter

Dmitri Nesteruk is a developer, speaker, podcaster and a technical evangelist for JetBrains. His interests lie in software development and integration practices in the areas of computation, quantitative finance and algorithmic trading. He is an instructor of an entry-level course in Quantitative Finance. His technological interests include C#, F# and C++ programming as well as high-performance computing using technologies such as CUDA. He has been a C# MVP since 2009.