Programming for multicore: An introduction to OpenMP using GCC-4.4

About a couple of months back, I happened to attend a short seminar on multi-core programming by Intel here at Hyderabad. What I liked immensely about it was that it was not yet another blatant advertising campaign on some hardware or software product by some industry giant.

It was about the paradigm shift that the chip industry is undergoing – the trend towards more cores, rather than higher gigahertz horsepower. However, if you are an average Joe developer like me, you probably program your applications without leveraging the power of two or more cores simultaneously. By default, we don’t ‘think parallel’ for various reasons. For one, threading is not an easy concept. The seminar looked at some of Intel’s software offerings that help developers (especially Visual Studio developers) to create, debug and optimize threaded/multicore applications. (However, this post will not focus on those tools – you may want to visit www.intel.com/go/parallel for more information).

On a related thread (pun intended!), gcc-4.4.0 was released recently. This added support for version 3.0 of the OpenMP specification. OpenMP is something I had heard of before, but never actually tried. It is an API for C, C++ and Fortran programmers that enables you to ‘parallel program’ easily. Jargonspeak calls it ‘platform independent shared memory multiprocessing’. In effect, it’s threads without the associated headache of thread management. By the way, gcc has supported OpenMP way back from version 4.2. So, you don’t need the latest bleeding edge version for this. However, should you want to, on Windows you can always download the excellent TDM MingW builds for gcc-4.4.0 (latest direct link). If you’re a Linux geek, you probably know how to get gcc-4.4 for your distro anyway. Also, Microsoft Visual C++ Express does not include/support OpenMP – hence my experiments are limited to gcc on both Win and Lin.

All right then, let’s see how OpenMP aids a classic case of parallelization: matrix multiplication. Agreed – this is a rather simple programming problem, and real world problems are usually harder to parallelize than this. However, this should serve as a good starting point to explore further.

So, here’s the basic matrix multiplication loop that we want to parallelize, assuming arr1 and arr2 are inputs, and arr3 is the output array:


for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    temp = 0;
    for(k=0; k<n; ++k) {
      temp += arr1[i][k] * arr2[k][j];
    }
    arr3[i][j] = temp;
  }
}

OpenMP is mostly a set of compiler directives (pragmas) and library routines. In this case, it's enough for us to add on single statement before our loop.


#pragma omp parallel for private(i, j, k, temp)
for(i=0; i<n; ++i) {
  for(j=0; j<n; ++j) {
    temp = 0;
    for(k=0; k<n; ++k) {
      temp += arr1[i][k] * arr2[k][j];
    }
    arr3[i][j] = temp;
  }
}

That’s it! This pragma tells the OpenMP subsystem to do it’s little magic behind the scenes and parallelize the ‘for loop’ following it.

Here’s the complete program, which contains additional code to initialize the arrays arr1 and arr2 pseudo-randomly, and to calculate the timings taken by the normal and the parallelized versions. You can compile the program with gcc-4.4 by the simple command:

gcc -fopenmp matmul.c

And on Windows, you might need to edit the PATH variable to include the GNU libgomp runtime (libgomp-1.dll). (Libgomp is GNU’s implementation of OpenMP). Here’s how I did it, for example:

set PATH=D:\MinGW\lib\gcc\mingw32\bin;%PATH%

So, the end result? Here are 4 sets of execution outputs. Two from Windows (TDM gcc-4.4.0):

Enter dimension ('N' for 'NxN' matrix) (100-2000): 1000
Populating array with random values...
Completed array init.
Crunching without OMP... took 23.032000 seconds.
Crunching with OMP... took 13.000000 seconds.

Enter dimension ('N' for 'NxN' matrix) (100-2000): 2000
Populating array with random values...
Completed array init.
Crunching without OMP... took 216.140000 seconds.
Crunching with OMP... took 118.641000 seconds.

And two from Linux (Ubuntu 9.04, gcc-4.3.3):

Enter dimension ('N' for 'NxN' matrix) (100-2000): 1000
Populating array with random values...
Completed array init.
Crunching without OMP... took 21.623144 seconds.
Crunching with OMP... took 13.686926 seconds.

Enter dimension ('N' for 'NxN' matrix) (100-2000): 2000
Populating array with random values...
Completed array init.
Crunching without OMP... took 189.184673 seconds.
Crunching with OMP... took 104.220751 seconds.

That’s almost doubling the speed, while adding one statement to your program! Actually two statements, if you include the include directive for <omp.h>. I’m sure you’d agree that for this case, OpenMP provides a really easy way of utilizing the idle core of most desktop machines out there. The good part is, even on a single core machine, the code works the way it should (the pragmas essentially NOP out, since there’d be no benefit in parallelizing on one core).

A look at the CPU utilization proves to be interesting too. (By the way, my home system runs an AMD Althon64 X2 4600 dual core, at a clock speed of 2.4GHz). In the first case, here’s a snap of the system information (using the excellent Process Explorer). Notice how the CPU usage remains peaked at around 50%, and the second CPU is mostly idle. Please click on the images below for the full view.

1

And here’s the usage when the OpenMP crunching is in action:

2

That’s more like it. Both horses in action, CPU peaked at 100%. Similar stuff can be seen on Linux, using Ubuntu’s (rather, GNOME’s) inbuilt System Monitor:

ubuntu sm

The portion where the red and orange worms collide at the top is the duration of the OpenMP version of the matrix multiplication program.

As already stated, matrix multiplication is an ideal case – and such 2x speedup on dual core machines are possible with only such ideal problems. However, there often are, if you look closely enough, parts of your program that can be parallelized. Further, we have not even scratched the surface of what’s possible using OpenMP 3.0. It goes way beyond parallelizing simple for loops. (Here’s the link to the spec in PDF).

And for sure, OpenMP is not the only way to go parallel portably. If you work in C++, you would have heard of the Boost C++ libraries. Give boost::threads a go!

With Intel gearing up for the release of its eight core Nehalem EX processors, and with AMD’s six core Istanbul processor already finding its way into mainstream desktop boards, there remains only one thing to say: if there is a time to think in parallel, this is it!

~Raj