About a couple of months back, I happened to attend a short seminar on multi-core programming by Intel here at Hyderabad. What I liked immensely about it was that it was not yet another blatant advertising campaign on some hardware or software product by some industry giant.
It was about the paradigm shift that the chip industry is undergoing – the trend towards more cores, rather than higher gigahertz horsepower. However, if you are an average Joe developer like me, you probably program your applications without leveraging the power of two or more cores simultaneously. By default, we don’t ‘think parallel’ for various reasons. For one, threading is not an easy concept. The seminar looked at some of Intel’s software offerings that help developers (especially Visual Studio developers) to create, debug and optimize threaded/multicore applications. (However, this post will not focus on those tools – you may want to visit www.intel.com/go/parallel for more information).
On a related thread (pun intended!), gcc-4.4.0 was released recently. This added support for version 3.0 of the OpenMP specification. OpenMP is something I had heard of before, but never actually tried. It is an API for C, C++ and Fortran programmers that enables you to ‘parallel program’ easily. Jargonspeak calls it ‘platform independent shared memory multiprocessing’. In effect, it’s threads without the associated headache of thread management. By the way, gcc has supported OpenMP way back from version 4.2. So, you don’t need the latest bleeding edge version for this. However, should you want to, on Windows you can always download the excellent TDM MingW builds for gcc-4.4.0 (latest direct link). If you’re a Linux geek, you probably know how to get gcc-4.4 for your distro anyway. Also, Microsoft Visual C++ Express does not include/support OpenMP – hence my experiments are limited to gcc on both Win and Lin.
All right then, let’s see how OpenMP aids a classic case of parallelization: matrix multiplication. Agreed – this is a rather simple programming problem, and real world problems are usually harder to parallelize than this. However, this should serve as a good starting point to explore further.
So, here’s the basic matrix multiplication loop that we want to parallelize, assuming arr1 and arr2 are inputs, and arr3 is the output array:
for(i=0; i<n; ++i) {
for(j=0; j<n; ++j) {
temp = 0;
for(k=0; k<n; ++k) {
temp += arr1[i][k] * arr2[k][j];
}
arr3[i][j] = temp;
}
}
OpenMP is mostly a set of compiler directives (pragmas) and library routines. In this case, it's enough for us to add on single statement before our loop.
#pragma omp parallel for private(i, j, k, temp)
for(i=0; i<n; ++i) {
for(j=0; j<n; ++j) {
temp = 0;
for(k=0; k<n; ++k) {
temp += arr1[i][k] * arr2[k][j];
}
arr3[i][j] = temp;
}
}
That’s it! This pragma tells the OpenMP subsystem to do it’s little magic behind the scenes and parallelize the ‘for loop’ following it.
Here’s the complete program, which contains additional code to initialize the arrays arr1 and arr2 pseudo-randomly, and to calculate the timings taken by the normal and the parallelized versions. You can compile the program with gcc-4.4 by the simple command:
gcc -fopenmp matmul.c
And on Windows, you might need to edit the PATH variable to include the GNU libgomp runtime (libgomp-1.dll). (Libgomp is GNU’s implementation of OpenMP). Here’s how I did it, for example:
set PATH=D:\MinGW\lib\gcc\mingw32\bin;%PATH%
So, the end result? Here are 4 sets of execution outputs. Two from Windows (TDM gcc-4.4.0):
Enter dimension ('N' for 'NxN' matrix) (100-2000): 1000
Populating array with random values...
Completed array init.
Crunching without OMP... took 23.032000 seconds.
Crunching with OMP... took 13.000000 seconds.
Enter dimension ('N' for 'NxN' matrix) (100-2000): 2000
Populating array with random values...
Completed array init.
Crunching without OMP... took 216.140000 seconds.
Crunching with OMP... took 118.641000 seconds.
And two from Linux (Ubuntu 9.04, gcc-4.3.3):
Enter dimension ('N' for 'NxN' matrix) (100-2000): 1000
Populating array with random values...
Completed array init.
Crunching without OMP... took 21.623144 seconds.
Crunching with OMP... took 13.686926 seconds.
Enter dimension ('N' for 'NxN' matrix) (100-2000): 2000
Populating array with random values...
Completed array init.
Crunching without OMP... took 189.184673 seconds.
Crunching with OMP... took 104.220751 seconds.
That’s almost doubling the speed, while adding one statement to your program! Actually two statements, if you include the include directive for <omp.h>. I’m sure you’d agree that for this case, OpenMP provides a really easy way of utilizing the idle core of most desktop machines out there. The good part is, even on a single core machine, the code works the way it should (the pragmas essentially NOP out, since there’d be no benefit in parallelizing on one core).
A look at the CPU utilization proves to be interesting too. (By the way, my home system runs an AMD Althon64 X2 4600 dual core, at a clock speed of 2.4GHz). In the first case, here’s a snap of the system information (using the excellent Process Explorer). Notice how the CPU usage remains peaked at around 50%, and the second CPU is mostly idle. Please click on the images below for the full view.
And here’s the usage when the OpenMP crunching is in action:
That’s more like it. Both horses in action, CPU peaked at 100%. Similar stuff can be seen on Linux, using Ubuntu’s (rather, GNOME’s) inbuilt System Monitor:
The portion where the red and orange worms collide at the top is the duration of the OpenMP version of the matrix multiplication program.
As already stated, matrix multiplication is an ideal case – and such 2x speedup on dual core machines are possible with only such ideal problems. However, there often are, if you look closely enough, parts of your program that can be parallelized. Further, we have not even scratched the surface of what’s possible using OpenMP 3.0. It goes way beyond parallelizing simple for loops. (Here’s the link to the spec in PDF).
And for sure, OpenMP is not the only way to go parallel portably. If you work in C++, you would have heard of the Boost C++ libraries. Give boost::threads a go!
With Intel gearing up for the release of its eight core Nehalem EX processors, and with AMD’s six core Istanbul processor already finding its way into mainstream desktop boards, there remains only one thing to say: if there is a time to think in parallel, this is it!
~Raj



28 comments ↓
Nice post! Although I never had to deal with large scale programs that could benefit with openMP, the idea still shows a lot of potential.
It is really unbelievable that with just a single statement you can double the performance !
Kudos to the GCC team in producing yet another marvelous piece of software!
~seemanta
Gonna have to figure out how to leverage this in my code…
BTW, here is the output from a Q6600@2.4Ghz (Linux Fedora 11, gcc 4.4.0)
Enter dimension (‘N’ for ‘NxN’ matrix) (100-2000): 1000
Populating array with random values…
Completed array init.
Crunching without OMP… took 11.320387 seconds.
Crunching with OMP… took 3.974353 seconds.
Enter dimension (‘N’ for ‘NxN’ matrix) (100-2000): 2000
Populating array with random values…
Completed array init.
Crunching without OMP… took 118.371741 seconds.
Crunching with OMP… took 38.730126 seconds.
With gcc 4.4, now we can use std::thread also. No need to use boost just for threading
Shouldn’t that be a multiply in the innermost loop?
@Kevin: 4 cores, hence (almost) 4x improvement. Way cool!
@Shridhar: Thats good news. That’s in fact one of the good things about the Boost guys – many of them are part of the C++ standards committee – so the good stuff ends up in the C++ standard as well!
@mike: Oops, my bad. Corrected!
Nice test.
But you should really take the printf() calls out of the
loops. You don’t want to measure that, do you?
Stefan
@Stefan: I’m sorry I had uploaded a ‘debug’ version which had the printf statements. I removed those now. The run times reported are using the original version though – else printf alone takes up a lot of time!
IMO, this is definitely the ideal solution for integrating parallel-processing into most applications – let the compiler do it, rather than trying to instantiate and manage threads/processes. Multiple threads/processes can be useful for performing multiple (distinct) jobs simultaneously (though in many such cases a good compiler should be able to take care of making those tasks parallel), but parallel for speed optimization should definitely be handled by the compiler where possible.
ArchLinux GCC4.4 AMD PhenomII 955BE@3.7GHz
[root@myhost amdpc]# ./matmul.r
Enter dimension (‘N’ for ‘NxN’ matrix) (100-2000): 1000
Populating array with random values…
Completed array init.
Crunching without OMP… took 11.956630 seconds.
Crunching with OMP… took 3.271322 seconds.
[root@myhost amdpc]# ./matmul.r
Enter dimension (‘N’ for ‘NxN’ matrix) (100-2000): 2000
Populating array with random values…
Completed array init.
Crunching without OMP… took 122.894132 seconds.
Crunching with OMP… took 35.093673 seconds.
—
very nice program. =))
On Intel 32-core Xeon server – X7560 @ 2.27GHz
Populating array[2000] with random values…
Completed array init.
Crunching without OMP… took 90.139761 seconds.
Crunching with OMP with 32 threads… took 3.308093 seconds.
hello..i want ask something..
i want do sorting by using open mp in c++..
it’s possible to run?maybe my coding is wrong..
please help me..
Thanks for your information,But i need to develop a multicore application for an embedded system in rtos,i read that pthread have advantage over openmp including portability.Which of the method is most suitable in the case of rtos.Moreover i wish to known that whether gcc had the multicore support?Which are the tool that provide the thread performance information.
I got some gomp examples going as executables, but the parallel for loop variables don’t increment properly when implemented in Windows DLL code!?
Hi, I try to use openmp in kdevelop while building a project of wxWidgets and not understand where should I use -fopenmp in project options. Anyone can help me in this context?
I was trying to duplicate the nice code boxes
that allow me to look at the source code print it out,etc.
But cannot find this in your .css files. WHere does
this functionality lurk?
Hello your article published is an excellent one especially for students like beginners to understand the concept easily. I am interested to recieve more example codes and theory based articles from you
Thank You,
Praveenraj
BE CS
PESIT-Bangalore.
[...] http://rajorshi.net/blog/2009/05/programming-for-multicore-introduction-openmp-gcc/ [...]
when i am trying
gcc title.c
…its givng error like omp_set_num_threads undefiened.
when i gave the command as u told that is
gcc -fopenmp prog1.c
it was givng errors like
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\Avi>cd C:\
C:\>gcc -fopenmp title.c
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `_mingw_CRTSt
artup’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:193: undefined reference to `__dyn
_tls_init_callback’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:202: undefined reference to `SetUn
handledExceptionFilter@4′
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:207: undefined reference to `__cpu
_features_init’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:208: undefined reference to `_fpre
set’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `mingw32_init
_mainargs’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/init.c:60: undefined reference to `_CRT_g
lob’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/init.c:60: undefined reference to `__getm
ainargs’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `mingw32_init
_fmode’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:70: undefined reference to `_CRT_f
mode’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:97: undefined reference to `__p__f
mode’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:97: undefined reference to `_fmode
‘
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `_mingw_CRTSt
artup’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:224: undefined reference to `_pei3
86_runtime_relocator’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:236: undefined reference to `__mai
n’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:244: undefined reference to `__p__
environ’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:250: undefined reference to `_cexi
t’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:252: undefined reference to `ExitP
rocess@4′
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `mingw32_init
_fmode’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:72: undefined reference to `_fmode
‘
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:83: undefined reference to `_imp__
_iob’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:83: undefined reference to `_setmo
de’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:87: undefined reference to `_CRT_f
mode’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:87: undefined reference to `_setmo
de’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:91: undefined reference to `_CRT_f
mode’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:91: undefined reference to `_setmo
de’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `gnu_exceptio
n_handler’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:162: undefined reference to `signa
l’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:135: undefined reference to `signa
l’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:116: undefined reference to `signa
l’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:140: undefined reference to `signa
l’
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:121: undefined reference to `signa
l’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o:C:\MinGW\msys\1.0\src\ming
wrt/../mingw/crt1.c:165: more undefined references to `signal’ follow
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `gnu_exceptio
n_handler’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:167: undefined reference to `_fpre
set’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `mainCRTStart
up’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:262: undefined reference to `_imp_
___set_app_type’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `WinMainCRTSt
artup’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:276: undefined reference to `_imp_
___set_app_type’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `atexit’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:288: undefined reference to `_imp_
_atexit’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/../../../crt2.o: In function `onexit’:
C:\MinGW\msys\1.0\src\mingwrt/../mingw/crt1.c:296: undefined reference to `_imp_
__onexit’
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/crtbegin.o:cygming-crtbegin.c:(.text+0xe):
undefined reference to `GetModuleHandleA@4′
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/crtbegin.o:cygming-crtbegin.c:(.text+0×23)
: undefined reference to `GetProcAddress@8′
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/crtbegin.o:cygming-crtbegin.c:(.text+0×51)
: undefined reference to `GetModuleHandleA@4′
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/crtbegin.o:cygming-crtbegin.c:(.text+0×66)
: undefined reference to `GetProcAddress@8′
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/crtbegin.o:cygming-crtbegin.c:(.text+0x9a)
: undefined reference to `GetModuleHandleA@4′
c:/mingw/bin/../lib/gcc/mingw32/4.6.2/crtbegin.o:cygming-crtbegin.c:(.text+0xaf)
: undefined reference to `GetProcAddress@8′
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0xa): undefined referenc
e to `__main’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0x3b): undefined referen
ce to `printf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0×47): undefined referen
ce to `puts’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0x5b): undefined referen
ce to `scanf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0x6f): undefined referen
ce to `fopen’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0×86): undefined referen
ce to `puts’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0×92): undefined referen
ce to `exit’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0xb6): undefined referen
ce to `fscanf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0xe4): undefined referen
ce to `fscanf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0xfb): undefined referen
ce to `fclose’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0×109): undefined refere
nce to `clock’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0x12f): undefined refere
nce to `clock’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0×160): undefined refere
nce to `printf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0x17c): undefined refere
nce to `printf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0×194): undefined refere
nce to `printf’
C:\DOCUME~1\Avi\LOCALS~1\Temp\ccYSWgNR.o:title.c:(.text+0x1c2): undefined refere
nce to `printf’
collect2: ld returned 1 exit status
morever i set my PATH variable to
C:\MinGW\lib\gcc\4.6.2\
i am using MinGW gcc 4.6.2
please give any useful suggestion to me.
my program is like this
#include
#include
#include
#include
#include
#define MAX 10000000
void mergesort(float a[],int,int);
void merge(float a[],int,int,int);
float a[MAX],b[MAX];
char ar[MAX];
int main()
{
FILE *fp;
int i,n=0,j;
float time;
clock_t start,stop;
double startTime,endTime,wt;
int iCPU=omp_get_num_procs();
omp_set_num_threads(iCPU);
printf(“nNo of cpu’s=%d\n”,iCPU);
printf(“Parallel version of Merge sort.\nEnter the data file name :\n”);
scanf(“%s”,ar);
if((fp=fopen(ar,”r”))==NULL)
{
printf(“Can’t open file\n”);
exit(1);
}
i=fscanf(fp,”%f”,&a[n]);
while(i != EOF)
{
//printf(“%d,”,a[n]);
n++;
i=fscanf(fp,”%f”,&a[n]);
}
fclose(fp);
startTime=omp_get_wtime();
start=clock();
mergesort(a,0,n-1);
stop=clock();
endTime=omp_get_wtime();
printf(“\nNo of elements=%d\nstart time=%10.9f”,n,((float)(start)/ CLOCKS_PER_SEC));
printf(“\nstop=%10.9f”,((float)(stop)/CLOCKS_PER_SEC));
printf(“\nElapsed time=%10.9f”,(double)(endTime-startTime));
printf(“\nTime taken :%10.9f secs\n”,(((float)(stop-start))/CLOCKS_PER_SEC));
}
void mergesort(float a[],int low,int high)
{
int mid;
if (low<high)
{
mid=(low+high)/2;
#pragma omp parallel
{
#pragma omp sections nowait
{
#pragma omp section
mergesort(a,low,mid);
#pragma omp section
mergesort(a,mid+1,high);
}}
merge(a,low,mid,high);
}
}
void merge(float a[],int low,int mid,int high)
{
int i,h,j,k;
h=i=low;
j= mid + 1;
while((h<=mid) && (j<=high))
if (a[h]mid)
for(k=j;k<=high;k++)
b[i++]=a[k];
else
for(k=h;k<=mid;k++)
b[i++]=a[k];
for(k=low;k<=high;k++)
a[k]=b[k];
}
but its not geting executed using MInGW
m suing header files as omp.h stdio.h time.h…and sm more it didnt pasted in abv post…Program is working file without omp_num_threads and all
So,please tell me how can i use omp here…..
do i need to install threads or need to perform any other settings???
Excellent Post Mr.RAJ……….Felt that I learnt something
Hello there Nice blog. Do you want to invitee publish on my very own someday? If that’s the case you should let me know by way of e mail or simply answer this kind of comment since My partner and my spouse and i actually subscribed to notifications and can understand should you choose.
Sorry to the huge evaluate, but I am just really loving the new Zune, and desire this, as well because the excellent reviews a few other people have got written, will assist you to decide whether it’s the right choice for anyone.
The facts on your website is good. Finding transmit for more postings.
Hi there! Someone in my Facebook group shared this website with us so I came to take a look. I’m definitely enjoying the information. I’m book-marking and will be tweeting this to my followers! Wonderful blog and excellent design.
I’m not sure why but this website is loading incredibly slow for me. Is anyone else having this issue or is it a issue on my end? I’ll check back later on and see if the problem still exists.
Study it, liked it, thanks for it
It’s a shame you don’t have a donate button! I’d definitely donate to this superb blog! I suppose for now i’ll settle for bookmarking and adding your RSS feed to my Google account. I look forward to fresh updates and will share this website with my Facebook group. Chat soon!
Leave a Comment