# Optimising a modified ccm3.2 climate model

By Arild Burud and Egil Støren

## Introduction

Jón Egill Kristjánsson (Department of Geophysics, University of Oslo) is using a modified version of the NCAR CCM3 climate model to study the indirect effect of aerosol particles in the atmosphere. The model is parallelised with OpenMP and is implemented on the gridur.ntnu.no computer, typically using 8 processors. The modifications by Kristjánsson had not been optimised for efficient use of the computer, so there was a potential for reducing the time needed to run this model. The NoSerC team have now investigated the program and made modifications that reduce the time consumption (in CPU seconds) by approximately 25%. This brief report summarises the methods and findings of the optimisation process.

## Description of changes

Three routines have been modified because of hardcoded directory paths (directories containing data files for I/O):

- chini.F
- initabc.F
- data.F

One file has been modified to remove a Y2K error (which caused the year in program output to be shown as 20102 instead of 2002):

- getdate.c

## Optimised routines

### Optimisation of parmix.F

The routine parmix was by itself the most time-consuming routine in the whole program, and several techniques have been used to improve its speed. Loops have been modified to move calculations out of the loops where possible, or even to change the loop order so that less work is done in the innermost loops. Several loops contained calculations of constant expressions (not dependent on the loop index), which were therefore moved outside the loops. One loop involved a calculation of log10() that turned out to be constant also between calls to parmix; an array was therefore declared in a common block so that the necessary values could be precalculated during the initialisation of the model (changes made in const.h, constants.F and koagsub.F).
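To illustrate the precalculation idea, here is a minimal sketch. The common block /precalc/, the array rlog10tab and the table size ntab are hypothetical names chosen for the example; the actual declarations in const.h, constants.F and koagsub.F differ.

```fortran
      program precalc_demo
c     Sketch: log10() values that are constant between calls to a
c     routine are computed once during initialisation and stored in
c     an array in a common block.  All names here are illustrative.
      integer ntab
      parameter (ntab=10)
      real rlog10tab(ntab)
      common /precalc/ rlog10tab
      integer k
      real x

c     initialisation phase (cf. constants.F): fill the table once
      do k = 1, ntab
         rlog10tab(k) = log10(real(k))
      enddo

c     later, inside the frequently called routine (cf. parmix):
c     look the value up instead of calling log10() every time
      x = 0.0
      do k = 1, ntab
         x = x + rlog10tab(k)
      enddo
      write(*,*) 'sum of precalculated log10 values:', x
      end
```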
Parmix also calls the routines intccnb and intccns with several arrays as arguments. The original version of parmix spent time copying the contents of matrices into temporary arrays before calling intccnb and intccns, and afterwards copying the resulting arrays back into the matrices. This has been changed so that the routines now work directly on the data in the matrices, as sketched below.
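The following is a minimal sketch of this change, under simplified assumptions: the routine process and the arrays field and tmp are hypothetical stand-ins (the real argument lists of intccnb and intccns are longer). The point is that a contiguous part of the matrix can be passed directly instead of being copied in and out of a temporary array.

```fortran
      program pass_demo
c     Sketch: removing copy-in/copy-out around a subroutine call.
c     The routine process stands in for intccnb/intccns; names and
c     dimensions are illustrative.
      integer nlons, nlevs
      parameter (nlons=8, nlevs=4)
      real field(nlons,nlevs), tmp(nlons)
      integer i, k

      do k = 1, nlevs
         do i = 1, nlons
            field(i,k) = real(i+k)
         enddo
      enddo

c     original pattern: copy a column into a temporary array,
c     call the routine, then copy the result back
      do i = 1, nlons
         tmp(i) = field(i,1)
      enddo
      call process(tmp, nlons)
      do i = 1, nlons
         field(i,1) = tmp(i)
      enddo

c     optimised pattern: let the routine work directly on the
c     matrix storage (a column is contiguous in memory)
      call process(field(1,2), nlons)

      write(*,*) field(1,1), field(1,2)
      end

      subroutine process(a, n)
c     stand-in for intccnb/intccns: updates its argument in place
      integer n, i
      real a(n)
      do i = 1, n
         a(i) = 2.0*a(i)
      enddo
      end
```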
### Optimisation of trcab.F

Loops have been modified to move calculations out of the loops where possible.

### Optimisation of intccnb.F

This routine has been more or less completely rewritten. The original file contained several loops of the kind `do long=1,nlons`, and also double loops of the kind `do ictot=1,5`. First, the outer and inner loops of these double loops were interchanged, and then all the smaller loops over `long=1,nlons` were merged into one large loop. This made it unnecessary to keep the variables isup1, isup2, ict1, ict2, ifbc1, ifbc2, ifaq1 and ifaq2 as arrays, and they were converted to scalars. The loops in which these variables are set were also optimised by converting them to 'do while' loops, so that each loop can be exited as soon as the correct value has been found (see the sketch below). Finally, in the last part of the routine, where the arrays f111, f112 etc. are set, parts of the formulae were found to be identical in several statements. These common parts were computed once and assigned to new temporary variables.
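A minimal sketch of the early-exit search is shown here. The table tab, its length ntab and the test value are hypothetical; in the real code the searched tables and the resulting indices (such as ict1) are those used by intccnb.

```fortran
      program search_demo
c     Sketch: replacing a fixed-length search loop by a 'do while'
c     loop that exits as soon as the wanted index is found.  The
c     table and values are illustrative only.
      integer ntab
      parameter (ntab=5)
      real tab(ntab), value
      integer ict1, i
      data tab /0.1, 0.2, 0.5, 1.0, 2.0/
      value = 0.7

c     original pattern: scan the whole table even after the
c     bracketing index has been found
      ict1 = 1
      do i = 1, ntab-1
         if (value .ge. tab(i)) ict1 = i
      enddo
      write(*,*) 'full scan gives ict1 =', ict1

c     optimised pattern: stop as soon as the bracketing index is
c     found; ict1 is now a scalar, not an array over longitudes
      ict1 = 1
      do while (ict1 .lt. ntab-1 .and. value .ge. tab(ict1+1))
         ict1 = ict1 + 1
      enddo
      write(*,*) 'early exit gives ict1 =', ict1
      end
```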
### Optimisation of backgra.F, backgrl.F and backgro.F

These small routines all contained a double loop of the following form (the bounds of the outer loop over i are not shown here):

```fortran
      do i=1,...
         do long=1,nlons
            lon = ind(long)
            Nnatkl(lon,i) = basic(i)*
     &           (0.5*(pint(lon,lev)+pint(lon,lev+1))/ps0)**vexp
         enddo
      enddo
```

The cost of the power operator `**` depends strongly on how the exponent is written, as the following measured performance ratios show:
| original expression | replacement | performance ratio (original : replacement) |
|---|---|---|
| `a**2.0` | `a**2` | 17:1 |
| `a**3.0` | `a**3` | 4:1 |
| `a**3.0` | `a*a*a` | 65:1 |
| `a**5.0` | `a**5` | 3:1 |
| `a**5.0` | `a*a*a*a*a` | 33:1 |
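To make the meaning of the table concrete, here is a minimal sketch (not code from the model; the variable a and its value are chosen for the example) of the three ways of writing the same power:

```fortran
      program power_demo
c     Illustration of the table above: the same power can be written
c     with a real exponent, an integer exponent, or as explicit
c     multiplications, at very different cost.
      real a, r1, r2, r3
      a = 1.7

c     real exponent: typically evaluated via exp and log (slowest)
      r1 = a**3.0

c     integer exponent: avoids exp/log, but may still go through a
c     library routine
      r2 = a**3

c     explicit multiplications: cheapest of the three alternatives
      r3 = a*a*a

      write(*,*) r1, r2, r3
      end
```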
## Results

The table below compares the optimised program, the original modified program and the unmodified ccm3.2 model for different numbers of processors. The timings were obtained with the time command, and all times are given in seconds.
| # processors | CPU time, optimised (s) | Elapsed time, optimised (s) | CPU time, original (s) | Elapsed time, original (s) | CPU time, ccm3.2 (s) | Elapsed time, ccm3.2 (s) |
|---|---|---|---|---|---|---|
| 1 | 807.9 | 826 | 1055.0 | 1062 | 354.1 | 359 |
| 2 | 798.2 | 416 | 1075.2 | 553 | 373.5 | 190 |
| 4 | 834.4 | 229 | 1295.0 | 345 | 400.4 | 103 |
| 8 | 1032.0 | 153 | 1467.3 | 207 | 655.9 | 86 |
| 16 | 1238.4 | 102 | 1414.8 | 112 | 534.3 | 37 |
## Concluding remarks

A general lesson from this work is that the power operator `**` is expensive when the exponent is written as a real number (e.g. "a**2.0"), since such expressions are evaluated using exp and log; integer exponents or explicit multiplications are considerably cheaper, as shown in the table above.
Another approach to parallelisation may also be considered: as seen from our results, the OpenMP version of ccm3 does not scale well above 8 processors. MPI or other parallelisation frameworks could be used to exploit more processors; the ccm3 documentation indicates that the algorithms used are tailored for up to 32 processors.

Finally, updating the program to a newer release of ccm3 than the version it is currently based on should be considered.