Archive for the 'GPU' Category

Homogeneous vs. heterogeneous multi-core: hardware strategies (Part 1)

Tuesday, September 2nd, 2008

By Patrick Leonard, VP Engineering & Product Strategy

All of the major computer hardware vendors have been moving to multi-core CPUs for some time now, creating the situation that we refer to as the Multi-core Dilemma, which I have written about for several years. Recently, more choices for parallel hardware have become available for enterprise applications. This will lead to some great opportunities but will also require tricky decisions from software development organizations.

Some of the new multi-core hardware will look more homogeneous, like a group of CPUs. Some will look more heterogeneous, like a CPU surrounded by specialized cores such as GPUs (graphics processing units) and others. Both approaches are going to benefit the industry on the whole, but there are pros and cons for each that should be factored into your software development planning. This is my breakdown of the hardware vendors’ strategies.

Intel and Sun are pursuing the homogeneous CPU multi-core strategy:

  • Intel’s Larrabee project is a many-core CPU strategy that they argue will reduce (or eliminate) the need to use separate GPUs and other accelerators for general-purpose computing, although it seems likely that it could bear some similarities to GPUs in the areas of floating point and vector operations. One of the main advantages of this approach is that it leverages the existing x86 instruction set. Not everyone loves the x86 instruction set, but there are certainly huge benefits to keeping the existing toolchains (compilers, debuggers, profilers) that are already in place. At Rogue Wave, we have tested multi-core Xeon systems with our Hydra product and see very good scalability with this approach for common business applications.
  • Sun recently released Niagara2, their follow-up to the industry-leading Niagara multi-core server.

IBM, AMD and nVidia are taking the heterogeneous approach:

  • AMD added GPU hardware to their offering with their purchase of ATI, and has put “stream computing” and “accelerated computing” (formerly known as “Fusion”) at the center of their strategy. They have partnered with Rogue Wave as part of this strategy. The vision here is basically a fusion of the CPU and specialized “accelerators” so that the hardware is more tuned to different use cases. For example, CPUs are great at time slicing and scheduling, while GPUs are great at processing math in parallel. AMD recently made a public embrace of the nascent OpenCL standard as a programming model for GPUs.
  • IBM, in partnership with Sony and Toshiba, has brought the CellBE processor to market.  The basic idea here is similar to AMD’s accelerated computing: specialized hardware tuned to different use cases all on a single chip.
  • nVidia is focused on the GPU. Focus has a lot of advantages; nVidia is ahead of AMD in getting GPUs into the mainstream market (although this is still very early stage) and has a more robust API and tool environment with CUDA. Many people in both industry and academia have reported significant throughput increases using nVidia GPUs for complex compute-intensive problems that are capable of running in a massively parallel environment. Rogue Wave is also working closely with nVidia and we have seen excellent performance using their hardware for compute-intensive parallel problems.

All of these strategies are pursuing a similar goal: increased throughput through parallel hardware.  All of them will require changes to existing software to take advantage of this increased compute power.  All of them will also require software developers to think differently about how they design software in the future.

Parallel Computing Discussion at ACM in Chicago

Friday, June 27th, 2008

By Patrick Leonard, VP Engineering & Product Strategy

Recently I had the opportunity to address the Chicago chapter of ACM (Association for Computing Machinery) on the subject of parallel computing. In addition to giving me the opportunity to make a Star Wars reference or two, it was a very interesting conversation on the subject. About a quarter of the attendees were from the financial services industry and close to half of the attendees were interested in using GPUs.

The main point of my talk was to give an overview of current trends in the industry and to discuss a model for parallel computing in software development. Most parallel programming has been focused on the task and data level with tools like OpenMP and MPI. Data parallelism continues to be very important, but I suggest that there is a higher level of granularity in parallel computing - Service Parallelism.

Service Parallelism is essentially the intersection of SOA and parallel computing. Rather than taking a loop or function inside of a program and making it parallel, you take a whole service and run multiple instances of that service (loops can be parallel too, they are just running inside the service).

There are several advantages to this:

  • If you already have services there is little to no recoding required.
  • Changing service parallelism means a change in configuration, not in code, so ongoing maintenance is much easier.
  • Service parallelism separates the parallel aspect of the application from the logic so your application developers don’t have to be experts in the parallel model.
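To make the idea concrete, here is a minimal sketch in Python. It is not Hydra's API or any Rogue Wave product code; the names `CONFIG`, `price_service`, and `run_service_parallel` are hypothetical and exist only for illustration. The service itself is plain sequential code, and the number of concurrent instances comes from configuration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical configuration: in a real deployment this would live in a
# config file, so changing the degree of parallelism needs no code change.
CONFIG = {"service_instances": 4}

def price_service(request):
    """A stand-in 'service': ordinary sequential business logic."""
    return {"id": request["id"], "total": request["qty"] * request["unit_price"]}

def run_service_parallel(service, requests, instances):
    """Dispatch requests across up to `instances` concurrent copies of `service`."""
    with ThreadPoolExecutor(max_workers=instances) as pool:
        # map() returns results in the same order as the requests
        return list(pool.map(service, requests))

requests = [{"id": i, "qty": i + 1, "unit_price": 2.5} for i in range(8)]
results = run_service_parallel(price_service, requests, CONFIG["service_instances"])
print(results[3])
```

Note that `price_service` knows nothing about threads. For CPU-bound services in Python you would likely swap in a process pool or separate service hosts, but the shape stays the same: the parallelism sits outside the service logic.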

If you are interested in a review of industry trends on multi-core CPU, GPU, and ideas for software parallelism, take a look at the slides on the Rogue Wave web site.

Intel’s ‘Ct’

Wednesday, June 25th, 2008

By Patrick Leonard, VP Engineering & Product Strategy

Intel recently announced that they are working on a new programming language specifically designed for multi-core CPU hardware - called ‘Ct’. Ct is ‘C’ for throughput, and is essentially the C programming language with extensions.

It is similar in many ways to CUDA from nVidia and Brook+ from AMD, although Ct is for CPUs and CUDA & Brook+ are for GPUs (see earlier post re: GPUs). This is likely to be a good thing for software developers who are working on getting existing and yet-to-be-written software to scale appropriately on multi-core hardware.

Ct uses the combination of a compiler and runtime to take much of the burden of parallelism from the software developer. For example, the basic tasking unit is a ‘future’, which can be executed now or later and receives data consistency guarantees from the runtime. You can find details on how it will work on Intel’s site.
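The 'future' concept is not unique to Ct. As a rough illustration of the general idea only (this is Python's standard `concurrent.futures` module, not Ct syntax, and the `dot` function is a made-up example), work is handed to a runtime that decides when to run it, and the caller asks for the value when it is needed:

```python
from concurrent.futures import ThreadPoolExecutor

def dot(xs, ys):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(xs, ys))

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a future immediately; the pool decides when it runs
    fut_a = pool.submit(dot, [1, 2, 3], [4, 5, 6])
    fut_b = pool.submit(dot, [1, 1], [10, 20])
    # result() blocks until the corresponding value is ready
    total = fut_a.result() + fut_b.result()

print(total)
```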

It does, however, highlight again the split that has occurred in hardware design - all vendors are going multi-core/multi-thread, but some are taking more of a homogeneous CPU approach, and some are taking a more heterogeneous GPU (accelerator) approach.

For software engineers, this means productivity challenges (“how do I get my existing code to run on GPUs, how do I get it to scale on multi-core CPUs”) as well as portability issues (“I don’t really want to maintain code written in CUDA, Brook+ and Ct, even though they are all variants of C”). This is all related to the Multi-core Dilemma that I have written about previously on the Intel Blog site and elsewhere.

Rogue Wave’s ‘Hydra’ product uses Service Parallelism to address the Multi-core Dilemma on CPUs, and we have worked with Intel a great deal on this, as it is complementary to Ct and other Intel technologies like TBB.

We are also working with both nVidia and AMD on Project “Gazelle” to address GPUs. “Gazelle” will generate optimized code for nVidia and AMD GPUs, and could do the same for Intel Ct in the future to ease migration for existing applications.

Pac-Man crashed the SIFMA 2008 party. His message: “I will help you save on your electricity bill”

Thursday, June 19th, 2008

June 10th, New York City: At about 100 degrees, this was probably one of the hottest spring days in 2008. The SIFMA (Securities Industry and Financial Markets Association) technology management exhibit was just opening up, and to keep all the suit-wearing businessmen cool, the hotel’s air conditioners were throwing many BTUs away…

This reminded me of the reason why I was there, standing at the AMD booth, demonstrating computation running on an AMD FireStream graphics card…

It all started almost a year ago, when most of our Wall Street customers asked us whether we could help with programming for GPUs (graphics processing units), more widely known as ‘accelerated graphics cards’.

Their interest is pretty simple - financial computations require a LOT of computing power. And with a traditional CPU-based approach like a grid, a LOT of computing power requires a LOT of electrical power, which at the end of the day is lost in the air conditioning system…

‘Accelerated computing’, based on hardware derived from what is commonly known as ‘graphics cards’, is widely seen as the best chance to save a lot of those BTUs…

Why?

First, GPUs can accelerate computations by a large factor.

Pac-Man was released in 1980, and by today’s standards, moving a flat yellow circle across the screen is no longer considered a technological achievement. As a matter of fact, 3dfx Inc. revolutionized gaming in 1995 by introducing the first ‘consumer accelerated graphics card’, delivering ‘mind bending graphics’. At the same time, Microsoft introduced the first version of their DirectX API, now the leading reference in game development. And 13 years later, GPUs are able to create three-dimensional images more than 200 times faster than regular CPUs.

It is not a stretch to establish a parallel between the Black-Scholes paper (published in 1973 and introducing basic option pricing concepts) and Pac-Man. After all, they both created a new industry. But while the ‘accelerated computing’ revolution happened in 1995 for video-gaming, Financial and General Purpose accelerated computing is being revolutionized today.

AMD and nVidia are the first to introduce new dedicated cards that are no longer limited to graphics and linear algebra, but can also run full double-precision C-like programs on extremely large sets of data. At the same time, general-purpose APIs are becoming available, and preliminary tests show a 10X - 40X improvement for some applications. That is still a bit shy of the acceleration we are seeing in the graphics world, but remember we are talking about first-generation hardware.
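The workloads that benefit are those where the same formula is applied independently to every element of a large data set. As a minimal sketch, here is the Black-Scholes call price mentioned above, written in plain Python for readability rather than as GPU code. The input values are made up for illustration, and this is not one of the programs from the tests cited here.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(spot, strike, rate, vol, years):
    """Black-Scholes price of a European call option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Hypothetical inputs: (spot, strike, rate, volatility, years to expiry)
options = [(100.0, float(k), 0.05, 0.2, 1.0) for k in range(80, 121, 10)]

# Each price depends only on its own inputs, so every element could be
# computed by a separate GPU thread with no coordination between them.
prices = [bs_call(*o) for o in options]
print(prices)
```

On a GPU the list comprehension would become a kernel launched over the whole array, but the per-element function would look much the same.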

Second, GPUs can save power.

Even at a 20X improvement, a single GPU offers the performance of 20 CPU cores, and it consumes roughly 150 watts or less… It is also much smaller: compare the 8-core server I was using at the SIFMA show with the AMD FireStream card (about a fourth of a shoe box).

SIFMA 2008 was the perfect opportunity to confirm that ‘accelerated computing’ is the future. The overall feeling, however, is that GPU development is in most cases slowed down by still-maturing APIs and hardware, even though people are seriously investigating it.

Hedging the investment made in any single one of those new technologies is still the main concern; people are reluctant to put all their eggs in one basket.
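The arithmetic behind that claim can be written down as a back-of-the-envelope sketch. It uses only the two figures quoted here (20X and 150 watts) and deliberately leaves the CPU's per-core power draw as a parameter, since no figure is given for it. It also ignores the host system the card sits in, so treat it as a rough break-even calculation and not a measurement.

```python
GPU_WATTS = 150.0  # upper bound quoted in the post
SPEEDUP = 20       # "even at a 20X improvement" figure from the post

def watts_per_core_equivalent(gpu_watts, speedup):
    """Power the GPU spends per CPU core's worth of work."""
    return gpu_watts / speedup

def gpu_saves_power(cpu_watts_per_core, gpu_watts=GPU_WATTS, speedup=SPEEDUP):
    """True if the GPU does the same work for less power than the CPU cores."""
    return watts_per_core_equivalent(gpu_watts, speedup) < cpu_watts_per_core

# Break-even point: the GPU wins whenever a CPU core draws more than this.
break_even = watts_per_core_equivalent(GPU_WATTS, SPEEDUP)
print(break_even)  # 7.5 watts per core-equivalent
```

In other words, under these figures the GPU comes out ahead whenever a CPU core, with its share of the server around it, draws more than 7.5 watts.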

As a matter of fact, lessons can be learned from history: When developing video games in 1995, the API of choice was Glide, actually published by the leading and only vendor in the accelerated 3D card market: 3dfx Inc. But “By 2000, the improved performance of Direct3D and OpenGL on the average personal computer, coupled with the huge variety of new 3D cards on the market, the widespread support of these standard APIs by the game developer community and the closure of 3dfx, would make Glide obsolete.” (source: Wikipedia)

I almost forgot! But that was really my reason to be there.

I have been helping with a lot of parallel assessments for Wall Street and non-Wall Street customers, evaluating current implementations of CPU-intensive and non-CPU-intensive applications to see how GPUs and other techniques like multi-threading and service grids can help improve throughput and reduce latency. And with the support of our development team, we can provide solutions to quickly evaluate, code and test GPU implementations (on multiple APIs) and multi-core implementations.

You will be surprised at some of the results…

GPU Programming For C++ Developers

Thursday, February 7th, 2008

It appears as if the latest way to find more processing power is heading into the mainstream. Both nVidia and AMD have GPU cards dedicated to processing instead of graphics. The popular book series GPU Gems has even released a third volume that contains quite a bit of information and examples using the CUDA architecture. However, if you’ve looked at CUDA you may have had a Bud Light commercial moment like I did and said, “Dude…”. The years of not writing code may be skewing my view on it, but to me this is not a simple API. It seems that without easier-to-use, higher-level tools, GPU processing might be restricted to the uber-brains of software engineering who do matrix math in their heads for fun. Now I don’t want to single out CUDA, because Brook doesn’t seem likely to win any Easy Button awards either.

What is clearly needed is an abstract API on top of both vendor cards, that lets C++ developers write code without a huge learning curve. That’s an easy statement to make, but a lot of complexities arise. I have been asking some of our customers about what they think an API like this should look like, what level of complexity to hide and not hide, etc. While there is some consensus, there is also just as much differing opinion. The only obvious answer seems to be that people want to take advantage of GPUs, and they’re looking for a way to make it easier for them to do so. If you have strong feelings on what you would like to see, please let them be known.