
Archive for June, 2008

Parallel Computing Discussion at ACM in Chicago

Friday, June 27th, 2008

By Patrick Leonard, VP Engineering & Product Strategy

Recently I had the opportunity to address the Chicago chapter of the ACM (Association for Computing Machinery) on the subject of parallel computing. In addition to giving me the opportunity to make a Star Wars reference or two, it was a very interesting conversation on the subject. About a quarter of the attendees were from the financial services industry, and close to half were interested in using GPUs.

The main point of my talk was to give an overview of current trends in the industry and to discuss a model for parallel computing in software development. Most parallel programming has focused on the task and data level, with tools like OpenMP and MPI. Data parallelism continues to be very important, but I suggest that there is a higher level of granularity in parallel computing - Service Parallelism.

Service Parallelism is essentially the intersection of SOA and parallel computing. Rather than taking a loop or function inside a program and making it parallel, you take a whole service and run multiple instances of that service (loops can be parallel too; they are just running inside the service).

There are several advantages to this:

  • If you already have services there is little to no recoding required.
  • Changing service parallelism means a change in configuration, not in code, so ongoing maintenance is much easier.
  • Service parallelism separates the parallel aspect of the application from the logic so your application developers don’t have to be experts in the parallel model.
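
To make the idea concrete, here is a minimal Python sketch. It is not Hydra's API: `price_service`, `dispatch`, and the `INSTANCES` setting are hypothetical names, and a thread pool merely stands in for separately deployed service instances. The point is only that the service logic stays sequential while the degree of parallelism is a configuration value.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical configuration value: the only thing that changes when
# the degree of service parallelism changes.
INSTANCES = 4

def price_service(request):
    """Hypothetical service: plain sequential logic, unaware of parallelism."""
    return {"id": request["id"], "total": request["qty"] * request["unit_price"]}

def dispatch(requests, instances=INSTANCES):
    """Run up to `instances` copies of the service concurrently.

    pool.map returns results in the same order as the requests.
    """
    with ThreadPoolExecutor(max_workers=instances) as pool:
        return list(pool.map(price_service, requests))

requests = [{"id": i, "qty": i + 1, "unit_price": 2.5} for i in range(8)]
results = dispatch(requests)
```

Calling `dispatch(requests, instances=1)` returns the same results, which is what lets the instance count be tuned without touching the service code.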

If you are interested in a review of industry trends on multi-core CPU, GPU, and ideas for software parallelism, take a look at the slides on the Rogue Wave web site.

Intel’s ‘Ct’

Wednesday, June 25th, 2008

By Patrick Leonard, VP Engineering & Product Strategy

Intel recently announced that they are working on a new programming language specifically designed for multi-core CPU hardware - called ‘Ct’. Ct is ‘C’ for throughput, and is essentially the C programming language with extensions.

It is similar in many ways to CUDA from nVidia and Brook+ from AMD, although Ct is for CPUs and CUDA & Brook+ are for GPUs (see earlier post re: GPUs). This is likely to be a good thing for software developers who are working on getting existing and yet-to-be-written software to scale appropriately on multi-core hardware.

Ct uses the combination of a compiler and runtime to take much of the burden of parallelism from the software developer. For example, the basic tasking unit is a ‘future’, which can be executed now or later and receives data consistency guarantees from the runtime. You can find details on how it will work on Intel’s site.

It does, however, highlight again the split that has occurred in hardware design - all vendors are going multi-core/multi-thread, but some are taking more of a homogeneous CPU approach, and some are taking a more heterogeneous GPU (accelerator) approach.

For software engineers, this means productivity challenges (“how do I get my existing code to run on GPUs, how do I get it to scale on multi-core CPUs”) as well as portability issues (“I don’t really want to maintain code written in CUDA, Brook+ and Ct, even though they are all variants of C”). This is all related to the Multi-core Dilemma that I have written about previously on the Intel Blog site and elsewhere.

Rogue Wave’s ‘Hydra’ product uses Service Parallelism to address the Multi-core Dilemma on CPUs, and we have worked with Intel a great deal on this, as it is complementary to Ct and other Intel technologies like TBB.

We are also working with both nVidia and AMD on Project “Gazelle” to address GPUs. “Gazelle” will generate optimized code for nVidia and AMD GPUs, and could do the same for Intel Ct in the future to ease migration for existing applications.

Pac-Man crashed the SIFMA 2008 party. His message: “I will help you save on your electricity bill”

Thursday, June 19th, 2008

June 10th, New York City: At about 100 degrees, this was probably one of the hottest spring days in 2008. The SIFMA (Securities Industry and Financial Markets Association) technology management exhibit was just opening up, and to keep all the suit-wearing businessmen cool, the hotel’s air conditioners were throwing many BTUs away…

This wasn’t without reminding me of the reason why I was there, standing at the AMD booth, demonstrating computation running on an AMD FireStream graphic card…

It all started almost a year ago, when most of our Wall Street customers asked us whether we could help with programming for the GPU (Graphics Processing Unit), more widely known as the ‘accelerated graphics card’.

Their interest is pretty simple - financial computations require a LOT of computing power. And with a traditional CPU-based approach like a grid, a LOT of computing power requires a LOT of electrical power, which at the end of the day is lost in the air conditioning system…

It is foreseen that ‘accelerated computing’ based on hardware derived from what is commonly known as ‘graphics cards’ is the best chance to save a lot of those BTUs…

Why?

First, GPU can accelerate computations by a huge ratio.

Pac-Man was released in 1980, and by today’s standards, moving a flat yellow circle across the screen is no longer considered a technological achievement. As a matter of fact, 3dfx Inc. revolutionized gaming in 1995 by introducing the first ‘consumer accelerated graphics card’, delivering ‘mind bending graphics’. At about the same time, Microsoft introduced the first version of its DirectX API, now the leading reference in game development. And 13 years later, GPUs are able to create three-dimensional images more than 200 times faster than regular CPUs.

It is not a stretch to establish a parallel between the Black-Scholes paper (published in 1973 and introducing basic option pricing concepts) and Pac-Man. After all, they both created a new industry. But while the ‘accelerated computing’ revolution happened in 1995 for video-gaming, Financial and General Purpose accelerated computing is being revolutionized today.

AMD and nVIDIA are the first to introduce new dedicated cards that are no longer limited to graphics and linear algebra, but can also run full double-precision C-like programs on extremely large sets of data. At the same time, general-purpose APIs are becoming available, and preliminary tests show 10X - 40X improvement for some applications. That is still a bit shy of the acceleration we are seeing in the graphics world, but remember we are talking about first-generation hardware.

Second, GPUs can save power.

Even at a 20X improvement, a single GPU offers the performance of 20 CPU cores while consuming less than about 150 watts… And to picture the difference in size, compare the 8-core server I was using at the SIFMA show with the AMD FireStream card (about a fourth of a shoe box).

SIFMA 2008 was the perfect opportunity to confirm that ‘accelerated computing’ is the future. The overall feeling, however, is that in most cases GPU development is being slowed by an industry that is still maturing in both APIs and hardware, even though people are seriously investigating it.

And hedging the investment made in a single one of those new technologies is still the main concern, people being a bit reluctant to put all their eggs in the same basket.

As a matter of fact, lessons can be learned from history: When developing video games in 1995, the API of choice was Glide, actually published by the leading and only vendor in the accelerated 3D card market: 3dfx Inc. But “By 2000, the improved performance of Direct3D and OpenGL on the average personal computer, coupled with the huge variety of new 3D cards on the market, the widespread support of these standard APIs by the game developer community and the closure of 3dfx, would make Glide obsolete.” (source: Wikipedia)

I almost forgot! But that was really my reason to be there.

I have been helping with many parallel assessments for Wall Street and non-Wall Street customers, evaluating current implementations of CPU-intensive and non-CPU-intensive applications to see how GPUs and other techniques like multi-threading and service grids can help improve throughput and reduce latency. And with the support of our development team, we can provide solutions to quickly evaluate, code and test GPU implementations (on multiple APIs) as well as multi-core technology.

You will be surprised at some of the results…

Why would SOA become the dominant architecture for software development?

Wednesday, June 18th, 2008

In a recent blog post, Alex Cameron with EDS talks about SOA becoming the dominant architecture for software development. I could definitely see how this could be true. It seems software development has progressed and chosen certain styles of programming languages for a reason. As languages such as Java and C++ separated interfaces from implementations, developers realized they could more easily use another developer’s work without having to know what was going on under the covers. Companies and managers saw that they could more efficiently manage and control large projects with various teams interacting with each other. It led to easier-to-understand software, more productive development teams, and even simpler documentation, as the interface was a great guide to what the component did.

So what is the extension of that? Not only would that developer like to use someone else’s work without knowing anything about it, but they also want to have access to work done on other OSes, on different hardware, in different languages, and all without having to understand the details. So the previous model of finding a .h file or some other class description in the appropriate programming language would be replaced by a search of WSDLs for the functionality needed. No longer would the developer be limited by language, platform, or in some cases, even geography or affiliation.

Rogue Wave / AMD partnership for Multi-core CPU and GPU

Wednesday, June 11th, 2008

By Patrick Leonard, VP Engineering & Product Strategy

Expansion of our Relationship

Rogue Wave and AMD have a long-standing partnership to advance C++ software development on AMD’s Opteron CPU platform. I’m excited that our two companies have recently announced an expansion of that relationship to make it easier for software applications to take advantage of the additional computing power available on multi-core CPUs and on GPUs (graphics processing units).

For several years, increased performance from all hardware vendors has largely come from additional “cores” instead of faster clock speeds. This provided significant additional processing power, but most existing software doesn’t adequately take advantage without significant modification. This is called the “Multi-core Dilemma”.

Challenge and Opportunity

The Multi-core Dilemma is both a challenge and an opportunity that will increase rapidly as the number of cores and threads continues to increase. A typical GPU already has 128 threads. For applications that lend themselves to parallel processing, this can mean a significant gain in throughput.

Although GPUs have the potential for even greater processing power than their CPU counterparts (for certain applications), there are additional challenges as well:
1. Developer productivity - use of the software tools requires special training.
2. Portability - software written for one vendor’s GPU will not run on other GPUs or on CPUs.

Our partnership is designed to address both of these issues, and to close the gap between hardware and software that has been widening over the past few years.

Although both companies are committed to broadly applicable solutions, our initial focus is on the financial services industry, where much of the activity is already happening.

What are your experiences with multi-core CPU and / or GPU? Please post a response with your thoughts.

Matrix multiply in parallel - is a different result ok?

Monday, June 9th, 2008

By Patrick Leonard, VP Engineering & Product Strategy

When moving a production application from one system to another, extensive testing is generally done to ensure, among other things, that results from the new systems agree with expected results from the old system. This is true whether changing operating systems, hardware, or anything else.

For example, many financial services firms have moved from Unix systems to Linux for a variety of good reasons. When moving quantitative analysis applications, they had to verify - to multiple significant digits - that calculations done on Linux would not be different from what they got in the old system.

Different is not always wrong - sometimes a different new result is “more correct” - but it takes effort and time to verify that and make sure.

Now many companies are moving from sequential processing to parallel processing. This can actually be a bit trickier. Certain mathematical algorithms calculate differently in a parallel environment than in a sequential environment. This may not have anything to do with the implementation, it is often just the nature of the numbers.

Matrix multiplication is an example of this. Because floating-point addition is not associative, multiplying matrices in parallel can produce a different outcome: the products are the same, but the subsequent additions are necessarily done in a different order.

Here is an example (thanks to David Haney):

Given two 4 x 4 matrices (A and B), you would normally calculate the result in 0,0 as:


(A00 * B00) + (A01 * B10) + (A02 * B20) + (A03 * B30)

If you change the order of operations though, like the following (note the parens):


((A00 * B00) + (A01 * B10)) + ((A02 * B20) + (A03 * B30))

Then you might see different results, depending on how the floating point rounding turns out. You probably won’t see much skew at this scale (especially if all of the numbers are roughly the same magnitude), but if you were dealing with a 1024 x 1024 matrix, you would probably start seeing some variation.
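
The effect is easy to reproduce. The following Python sketch computes one element of a product both ways; the row and column values are made up for illustration and deliberately mix very large and very small magnitudes so that the difference shows up with only four terms. With values of similar magnitude the two orderings would usually agree much more closely.

```python
def dot_sequential(row, col):
    """Left-to-right accumulation, as in a simple sequential loop."""
    total = 0.0
    for a, b in zip(row, col):
        total += a * b
    return total

def dot_pairwise(row, col):
    """Sum each half separately, then combine, as a parallel reduction would."""
    products = [a * b for a, b in zip(row, col)]

    def reduce(values):
        if len(values) == 1:
            return values[0]
        mid = len(values) // 2
        return reduce(values[:mid]) + reduce(values[mid:])

    return reduce(products)

# Illustrative values only: one row of A and one column of B.
row = [1e17, 1.0, -1e17, 1.0]
col = [1.0, 1.0, 1.0, 1.0]

sequential = dot_sequential(row, col)  # ((1e17 + 1) - 1e17) + 1  -> 1.0
pairwise = dot_pairwise(row, col)      # (1e17 + 1) + (-1e17 + 1) -> 0.0
print(sequential, pairwise)            # the exact answer would be 2.0
```

Neither ordering recovers the exact answer here, which is the sense in which a different result is not necessarily a less correct one.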

There are algorithms for breaking up a matrix multiply that keep results equivalent to the sequential version while still executing at least partly in parallel, but from what we’ve seen those methods are less efficient than algorithms that do some amount of reordering.

The outcome, although different, may not be any less “correct”. But that difference may have business consequences that need to be planned for. Regardless of the software programming model and technology used to go parallel, this is something to be mindful of.

Release at Any Time: the Documentation Perspective

Tuesday, June 3rd, 2008

At Rogue Wave, the trend is towards agile development, with frequent releases of new features between major product releases. To this end, we maintain an impressive infrastructure of nightly automated testing of a large code base across a daunting number of platform, compiler, and database combinations. The system includes extensive reporting of test results against code quality baselines, regression analysis, and ongoing fixing of priority bugs. The goal is to maintain the code base at a high level of quality such that we can release at any time with confidence.

As a documentation person, the good news is that Rogue Wave has always valued documentation highly, and considers good documentation an important part of the product. The challenge is that documentation must therefore strive for the same level of consistent, release-at-any-time quality.

== Getting There with the Process Automation ==

When I realized that the documentation team either needed to match the agility and automation of the development teams or risk becoming less relevant, I could take comfort in the fact that documentation already had in place considerable process automation. For many years at Rogue Wave, a conversion architecture has supported the ability to reconvert FrameMaker source documents into the release formats easily and at any time. An added feature of this process was extensive reporting of formatting and linking problems found during the conversions.

The first step was to create infrastructure to support automated nightly conversion runs. The biggest obstacle was automating up-to-date PDF creation, the one main distribution format that was still created manually. A utility called FrameScript was the solution to that problem. With a little more creative jiggling, we reached a state where all documentation could be converted nightly, and the conversion error reporting neatly summarized on a single point of access Web page.

So far so good, but what customers expect to see is not an amorphous bundle of document files. They expect a well-formed document distribution, with convenient access points to the information. So we next devised a process for defining a manifest of everything that needed to be in a given product distribution, and a script to act on that manifest to create document distributions exactly as we expected to deliver to customers. Naturally we added some testing, too, resulting in a nightly distribution quality report.
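
As an illustration of the manifest idea only, here is a small Python sketch. It is not our actual tooling: the manifest layout, the file names, and the `build_distribution` function are all hypothetical. It copies each listed file into a distribution tree and reports anything the nightly conversion failed to produce, which is the kind of result a nightly distribution quality report could summarize.

```python
import shutil
import tempfile
from pathlib import Path

def build_distribution(manifest, source_root, dist_root):
    """Copy each manifest entry into dist_root.

    `manifest` maps a path inside the distribution to a path inside the
    converted output. Returns the source paths that were missing.
    """
    source_root, dist_root = Path(source_root), Path(dist_root)
    missing = []
    for dest_rel, src_rel in manifest.items():
        src = source_root / src_rel
        if not src.is_file():
            missing.append(src_rel)  # report it; do not abort the whole build
            continue
        dest = dist_root / dest_rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    return missing

# Small self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_root = Path(tmp) / "converted"
    dist_root = Path(tmp) / "dist"
    (src_root / "html").mkdir(parents=True)
    (src_root / "html" / "index.html").write_text("<html></html>")

    manifest = {
        "docs/index.html": "html/index.html",
        "docs/userguide.pdf": "pdf/userguide.pdf",  # not generated in this demo
    }
    missing = build_distribution(manifest, src_root, dist_root)
    built = sorted(p.relative_to(dist_root).as_posix()
                   for p in dist_root.rglob("*") if p.is_file())
```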

== Document Health ==

All well and good, but all of this automation counts for very little without a commitment and a process to maintain good document quality — what we choose to call document health. Scripts are very non-judgemental, which is the inspiration for the old saying about the consequences of feeding them garbage. So while we in documentation were emulating the automated processes of our development colleagues, we were also adopting their scrum-based agile methodology. As they work on a feature, we work beside them on its documentation. Critically, we also continually monitor the reports that come out of our nightly automation, and attempt to keep the errors at or near zero. This works quite well with the incremental changes expected with an established, stable product, not quite so well with the major revisions and refactorings that are the inevitable burden with dynamically changing newer products.

Even if the picture is sometimes less than completely rosy, there is no arguing with the vision. When it is going well, this approach gloriously meets its intended goal. The document distributions that are created each night are exactly the documentation we intend to release. If the document set is reasonably stable and we are on top of the errors, we truly can on any given day publish a document distribution to release engineering and be proud to have it given to our customers.

It doesn’t get any better than that.

Life after CORBA

Tuesday, June 3rd, 2008

I have been involved in distributed computing for a number of years, and recognize that Service Oriented Architecture is just another approach to getting distributed applications to work together. Previous generations include things like RPCs, DCE (http://en.wikipedia.org/wiki/Distributed_computing_environment), CORBA (http://www.omg.org/), etc. The advantage of SOA lies in the fact that the underlying standards, i.e., XML, SOAP, etc., are broadly accepted across the industry, so interoperability between vendor products is much more real now than it has ever been.

I happen to have a good deal of experience in the CORBA world, having worked for Visigenic Software both before and after it was acquired by Borland. CORBA was an effective tool for connecting distributed objects, providing both language and platform neutrality. This was true so long as your platform was not Windows, because then you had to deal with COM/DCOM and the world of COM/CORBA bridges. This split between Microsoft and the rest of the world was a key issue that ultimately limited the proliferation of CORBA, but not before it was broadly adopted, particularly in the Telco and Financial Services industries. You still see many implementations of CORBA in what are now referred to as legacy applications, but not as much in newly developed systems.

Many of our SourcePro C++ customers also use CORBA orbs, most often Orbix. What we are finding is that many customers have applications that use older versions of Orbix that are no longer supported, and yet they continue to pay significant maintenance fees on those licenses. One customer explained that they feel they are at risk every time they touch the application, because if something breaks, they have no good avenue to seek help. This is not the ideal that IT strives for, i.e., it is both expensive and risky. The good news is that for many customers, there is a better alternative that is easy to put in place.

In many cases, orbs were used essentially as a communications mechanism between remote applications, maybe handling the mediation between C++ and Java applications. Today, this problem can easily be solved using a Web services approach. Rogue Wave has a product known as HydraExpress that has the capability to easily turn a C++ application into a service. For CORBA users the paradigm is familiar. This product can take WSDL (remember IDL?) as an input and generate stubs and skeletons for a Web services client or server. There are open source tools (http://search.cpan.org/~perrad/CORBA-XMLSchemas-0.41/idl2wsdl.pl) that help you convert IDL to WSDL, which is the key step in the process. It is not always that simple, but often it is darn close. Once complete, you have an application based upon modern standards that gives you more flexibility and less risk, at significantly lower cost. Sounds pretty good to me.

The problems inherent in ticket distribution

Monday, June 2nd, 2008

I recently bought tickets to an upcoming once-in-a-lifetime concert event. The tickets were being sold through an online ticket distributor which seems to have a firm hold on the market for ticket distribution. There were quite a few people trying for these tickets and I was expecting lots of problems. Here in Colorado we lived through the World Series ticket fiasco of 2007 and I was expecting nothing less for this one. I anticipated slow page loads, having to refresh often, being dependent on luck to get through, and ultimately I expected to come away with no tickets.

However, I was pleasantly surprised. The site never failed. It allowed me to specify the tickets I was looking for, then it searched, and then I was presented with seats that I could buy as well as an option to search again. Then came the surcharges: $14.50 convenience on each ticket; $4.50 Processing; $2.50 delivery; and a $4.00 Facility charge on each ticket. Granted the facility charge is probably from the venue itself, but the rest go to the distributor. It made me think: How can they charge so much without driving customers away? I would certainly use someone else if there was an option.

All we have to do is look at that 2007 World Series to see that the process isn’t that easy. You have to manage impossible amounts of traffic that come in over a very short period of time. Seats have to be held and assigned in the order in which the requests come in without giving the same seats away twice. There has to be a process for holding and releasing tickets. And you have to have a scalable server workforce to handle anything from the small venues where 30 tickets might be sold up through events where you might have 300,000 tickets for a series, all of which might sell out within an hour or two.
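
To show what holding and releasing seats involves, here is a minimal single-process Python sketch. The `SeatInventory` class, its method names, and the hold timeout are all hypothetical, and it says nothing about how any real ticketing system or Hydra works. A lock serializes requests so a seat can never be given away twice, and holds that are not converted into a purchase expire and go back on sale. Ordering requests fairly across many servers is a separate problem that this sketch does not attempt.

```python
import threading
import time

class SeatInventory:
    """Hypothetical in-memory inventory: a seat is available, held, or sold."""

    def __init__(self, seats, hold_seconds=300):
        self._lock = threading.Lock()      # serializes all state changes
        self._available = set(seats)
        self._holds = {}                   # seat -> (buyer, expiry time)
        self._sold = {}                    # seat -> buyer
        self._hold_seconds = hold_seconds

    def _release_expired(self, now):
        # Return seats whose hold has timed out to the available pool.
        for seat, (_, expiry) in list(self._holds.items()):
            if expiry <= now:
                del self._holds[seat]
                self._available.add(seat)

    def hold(self, buyer, seat, now=None):
        """Reserve a seat for a buyer; False if it is held or sold."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self._release_expired(now)
            if seat not in self._available:
                return False
            self._available.remove(seat)
            self._holds[seat] = (buyer, now + self._hold_seconds)
            return True

    def purchase(self, buyer, seat, now=None):
        """Complete a sale; only the buyer with a live hold succeeds."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self._release_expired(now)
            holder = self._holds.get(seat)
            if holder is None or holder[0] != buyer:
                return False
            del self._holds[seat]
            self._sold[seat] = buyer
            return True

# Explicit `now` values make the demonstration deterministic.
inventory = SeatInventory(["A1", "A2"], hold_seconds=60)
alice_hold = inventory.hold("alice", "A1", now=0)      # True: seat was free
bob_blocked = inventory.hold("bob", "A1", now=1)       # False: Alice holds it
bob_hold = inventory.hold("bob", "A1", now=61)         # True: Alice's hold expired
alice_buy = inventory.purchase("alice", "A1", now=62)  # False: her hold was lost
bob_buy = inventory.purchase("bob", "A1", now=62)      # True: sold exactly once
```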

This use case is tailor-made for Hydra. Hydra allows for scalability, maintaining proper order, failover in case of a server crash, and will take advantage of the extra processing power allowed by multi-core hardware. Hydra will also allow new servers to come online to handle an increase in demand with minimal configuration change and no disruption to existing services that are being provided. This way, idle servers can be assigned to high demand areas when needed, and can be moved back and forth between projects or events as volume changes. Hydra will simplify the real difficulties of ticket distribution and let someone work on the business model and user experience rather than the technological difficulties.

Next task: Take on the ticket distribution market!