GENERAL INFORMATION: TRCADV STANDALONE TEST
This is a test harness for the tracer advection component of the atmospheric
forecast model FIM (rev. 2269). The test harness first reads in realistic
initial values from a G5 model run (10242 horizontal points) with 64 vertical
levels. Next, one timestep of the tracer advection calculation is
performed. Finally, the GPU results of the advection calculation are compared
to baseline results generated on the CPU with the same compiler: either a
vendor compiler or the F2C-ACC compiler developed at NOAA.
Build and run scripts are provided for the F2C-ACC, Cray, PGI and CAPS
compilers on Titan. Routines that run on the GPU are contained in a single file
called trcadv_compute.F90. The source code contains OpenACC and F2C-ACC
directives, which are built and run with the appropriate makefiles and run scripts.
History:
Jim Rosinski 2012-04-17:
Mark Govett 2012-08-20: Modifications to support GPU builds
Mark Govett 2013-10-30: OpenACC parallelization with PGI, Cray, CAPS;
                        added build and run automation for Titan
----------------------------------------------------------------------------------
BUILD AND RUN INSTRUCTIONS FOR TITAN
1. Load the appropriate modules:
     source setup.titan.[f2c][cray][pgi][caps]
2. If using the CAPS or PGI compilers, replace the original source with one
   that works around bugs in these compilers:
     cp trcadv_compute.F90[.pgi][.caps] trcadv_compute.F90
   Once the bugs are fixed, these copies can be eliminated.
3. Build using the appropriate makefile:
     make -f Makefile.titan.[f2c][cray][pgi][caps]
4. Edit (as needed) and run using the appropriate script:
     qsub runscript.titan.[f2c][cray][pgi][caps]
5. View the appropriate output file to determine if the results are correct:
     more stdout.titan.[f2c][cray][pgi][caps]
   See further details below under VALIDATION.
6. View performance results:
     grep trcadv[123] cud_profile_trcadv.[f2c][pgi][cray][caps]
Note that while GPTL performance results are reported in the stdout.titan
output, these timings include data transfers between CPU and GPU, and thus are
not valid for timing each kernel's computation time. Use the timings from
the CUDA profiler instead.
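
The bracketed suffixes in the steps above mean "pick one compiler tag". As a
minimal sketch, the Python helper below expands the steps into the concrete
command lines for one tag. The function name titan_commands is hypothetical and
not part of the harness; the commands are only assembled as strings, not
executed, and the file names are taken from the steps above.

```python
# Hypothetical helper (not part of the harness): expand the bracketed
# file-name patterns from the build and run steps for one compiler tag.
TAGS = ("f2c", "cray", "pgi", "caps")
NEEDS_WORKAROUND = ("pgi", "caps")  # the source-replacement step applies only here


def titan_commands(tag):
    """Return the command lines, in order, for one compiler tag."""
    if tag not in TAGS:
        raise ValueError("unknown compiler tag: %r" % (tag,))
    cmds = ["source setup.titan.%s" % tag]
    if tag in NEEDS_WORKAROUND:
        # Copy the compiler-specific workaround source over the original.
        cmds.append("cp trcadv_compute.F90.%s trcadv_compute.F90" % tag)
    cmds += [
        "make -f Makefile.titan.%s" % tag,
        "qsub runscript.titan.%s" % tag,
        "more stdout.titan.%s" % tag,
        "grep trcadv[123] cud_profile_trcadv.%s" % tag,
    ]
    return cmds


if __name__ == "__main__":
    for line in titan_commands("pgi"):
        print(line)
```

The output is meant to be pasted into a login shell on Titan one line at a
time, since "source" only takes effect in the current shell and the output
files exist only after the batch job has finished.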
----------------------------------------------------------------------------------
PERFORMANCE RESULTS
No attempt was made to keep data resident on the GPU, or to optimize data
movement between CPU and GPU. The performance results are intended to measure
execution time of each GPU kernel without timing the data movement. This is
measured using output from the CUDA profiler.
                        trcadv1   trcadv2   trcadv3
F2C-ACC V5.2
  f2c opt                  1025       376       650
  f2c bitwise exact        1031       396       655
CCE Version 8.2.0
  parallel                 2895       757      1233
  parallel-fast            1205       607      1076
PGI Version 13.9
  pgi,parallel             1536       872       850
CAPS, Version 3.4.0
  caps 864
  caps bitwise exact 847
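
Because the table lists per-kernel times, one quick way to compare
configurations is to sum the three kernels for each row. The sketch below does
that arithmetic on the values above. The dictionary layout and the function
names are illustrative assumptions; the CAPS rows are left out because only a
single value is reported for them, and the units are those of the CUDA profiler
output, which are not stated here.

```python
# Per-kernel times copied from the table above: (trcadv1, trcadv2, trcadv3).
# CAPS is omitted because the table gives only one value per CAPS row.
KERNEL_TIMES = {
    "f2c opt": (1025, 376, 650),
    "f2c bitwise exact": (1031, 396, 655),
    "parallel": (2895, 757, 1233),       # CCE 8.2.0
    "parallel-fast": (1205, 607, 1076),  # CCE 8.2.0
    "pgi,parallel": (1536, 872, 850),    # PGI 13.9
}


def total_time(config):
    """Sum of the three kernel times for one configuration."""
    return sum(KERNEL_TIMES[config])


def relative_to(config, baseline="f2c opt"):
    """Total time of config divided by the total time of the baseline."""
    return total_time(config) / total_time(baseline)


if __name__ == "__main__":
    for name in KERNEL_TIMES:
        print("%-18s total=%5d  ratio=%.2f"
              % (name, total_time(name), relative_to(name)))
```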
----------------------------------------------------------------------------------
VALIDATION:
Validation is done by comparing the output of a GPU run against the output of
a CPU run generated by the same compiler. Ideally, the GPU results should be
bitwise exact compared to the baseline CPU results. F2C-ACC and CAPS results
are bitwise exact between the CPU and GPU runs. The Cray and PGI results are not.
When results are not bitwise exact, statistics are written to stdout.titan.[arch].
For a gross estimate of how close each tracer calculation was to the reference
calculation, look for the string:
"average number of matching decimal digits"
Your results should be similar to or better than those printed below for each
of the tracers. If not, there is likely a problem with your modifications.
The output also reports the largest pointwise absolute difference and the
largest pointwise relative difference. Again, your results should be similar
to those shown below.
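
The "number of decimal digits" figures in the reference output below are
consistent with taking the negative base-10 logarithm of a relative difference,
where the relative difference is |a - b| divided by the larger magnitude of the
two values. The sketch below reproduces that arithmetic. It is inferred from
the printed numbers, not taken from compare_output_data.F90, so treat the
formula and the function names as assumptions.

```python
import math


def relative_diff(a, b):
    """|a - b| scaled by the larger magnitude; 0.0 when the values are equal."""
    if a == b:
        return 0.0
    return abs(a - b) / max(abs(a), abs(b))


def matching_digits(a, b):
    """Approximate number of matching decimal digits between a and b."""
    rel = relative_diff(a, b)
    if rel == 0.0:
        return float("inf")  # identical values match in every digit
    return -math.log10(rel)


if __name__ == "__main__":
    # arr1, arr2 at the max relative diff for trc_tdcy, tracer 1.
    a, b = -1.5180185e-04, -8.7767839e-05
    print("relative diff = %.7f" % relative_diff(a, b))
    print("decimal digits = %.1f" % matching_digits(a, b))
```

For the pair used in the example, the result rounds to the 0.4 digits printed
in the reference output for trc_tdcy, tracer 1.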
---------------------------------------------------------------------------
Reference output from a test harness run using pgf90:
reading resolution info
nip= 10242
nvl= 64
npp= 6
nabl= 3
ntra= 4
dt= 360.0000
reading Adams-Bashforth values
reading prox info
reading trc_edg
reading tr3d
reading trdp
reading trc_tdcy
reading trl_tdcy
reading massfx
reading dp3d
Diff stats for tr3d_ref vs. tr3d:
tracer= 1
141586 diffs were found of a possible 655488
max diff= 4.8828125E-04 at k,ipn= 64 5304
arr1,arr2= 2054.493 2054.494
number of decimal digits= 6.6
max relative diff= 3.3439824E-07 at k,ipn= 9 10130
arr1,arr2= 273.7835 273.7834
number of decimal digits= 6.5
average number of matching decimal digits for points with diffs= 7.0
tracer= 2
159853 diffs were found of a possible 655488
max diff= 5.5879354E-09 at k,ipn= 3 942
arr1,arr2= 1.7527623E-02 1.7527629E-02
number of decimal digits= 6.5
max relative diff= 1.7598884E-05 at k,ipn= 49 2256
arr1,arr2= 1.3184734E-09 1.3184502E-09
number of decimal digits= 4.8
average number of matching decimal digits for points with diffs= 7.0
tracer= 3
43745 diffs were found of a possible 655488
max diff= 1.1641532E-10 at k,ipn= 20 937
arr1,arr2= 8.0762274E-04 8.0762262E-04
number of decimal digits= 6.8
max relative diff= 1.3976785E-03 at k,ipn= 36 1096
arr1,arr2= 4.4339701E-10 4.4277729E-10
number of decimal digits= 2.9
average number of matching decimal digits for points with diffs= 7.0
tracer= 4
159524 diffs were found of a possible 655488
max diff= 5.4569682E-12 at k,ipn= 56 767
arr1,arr2= 1.5167213E-05 1.5167218E-05
number of decimal digits= 6.4
max relative diff= 1.4990462E-06 at k,ipn= 33 6139
arr1,arr2= 2.6069809E-08 2.6069770E-08
number of decimal digits= 5.8
average number of matching decimal digits for points with diffs= 7.0
Diff stats for trc_tdcy_ref vs. trc_tdcy:
tracer= 1
6640 diffs were found of a possible 655488
max diff= 2.1576881E-04 at k,ipn= 29 9804
arr1,arr2= -1.861748 -1.861964
number of decimal digits= 3.9
max relative diff= 0.4218263 at k,ipn= 13 3667
arr1,arr2= -1.5180185E-04 -8.7767839E-05
number of decimal digits= 0.4
average number of matching decimal digits for points with diffs= 5.0
tracer= 2
61687 diffs were found of a possible 655488
max diff= 1.0026270E-08 at k,ipn= 35 2454
arr1,arr2= -4.3764179E-05 -4.3754153E-05
number of decimal digits= 3.6
max relative diff= 0.5107914 at k,ipn= 57 1441
arr1,arr2= 5.4012350E-14 2.6423308E-14
number of decimal digits= 0.3
average number of matching decimal digits for points with diffs= 5.5
tracer= 3
10595 diffs were found of a possible 655488
max diff= 2.6921043E-10 at k,ipn= 27 9846
arr1,arr2= -1.1448831E-04 -1.1448804E-04
number of decimal digits= 5.6
max relative diff= 2.4494989E-02 at k,ipn= 34 6401
arr1,arr2= -1.4294983E-09 -1.3944828E-09
number of decimal digits= 1.6
average number of matching decimal digits for points with diffs= 6.0
tracer= 4
52870 diffs were found of a possible 655488
max diff= 2.6005864E-12 at k,ipn= 51 5781
arr1,arr2= 2.4323910E-08 2.4326511E-08
number of decimal digits= 4.0
max relative diff= 0.4084507 at k,ipn= 10 3128
arr1,arr2= -1.0783041E-14 -1.8228474E-14
number of decimal digits= 0.4
average number of matching decimal digits for points with diffs= 5.2
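
To compare a new run against the reference output above, the "average number
of matching decimal digits" lines can be pulled out of stdout.titan.[arch] per
array and tracer. This is a minimal sketch, assuming the line layout shown
above (a "Diff stats for X vs. Y:" header followed by "tracer= N" blocks); the
function name parse_average_digits is hypothetical and not part of the harness.

```python
import re

HEADER = re.compile(r"Diff stats for (\S+) vs\. (\S+):")
TRACER = re.compile(r"tracer=\s*(\d+)")
AVERAGE = re.compile(
    r"average number of matching decimal digits.*=\s*([0-9.]+)")


def parse_average_digits(text):
    """Map (array name, tracer number) -> average matching decimal digits."""
    result = {}
    array = tracer = None
    for line in text.splitlines():
        m = HEADER.search(line)
        if m:
            # New array section: remember its name, reset the tracer.
            array, tracer = m.group(2), None
            continue
        m = TRACER.search(line)
        if m:
            tracer = int(m.group(1))
            continue
        m = AVERAGE.search(line)
        if m and array is not None and tracer is not None:
            result[(array, tracer)] = float(m.group(1))
    return result
```

Parsing both the reference output and a new stdout file this way gives two
dictionaries with the same keys; any key whose value is clearly lower in the
new run than in the reference is worth investigating.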
------------------------------------------------------------------------------------
SOURCE CODE
------------------------------------------------------------------------------------
compare_output_data.F90   compare output
copytoGPUinit.F90         copy data to GPU - only needed for F2C-ACC
driver.F90                main program
trcadv_compute.F90        accelerator routine
trcadv.F90                driver routine for trcadv
variables.F90             module variables