ESRL Global Systems Division

Fine-Grain Parallel Computing: the Next Frontier in High Performance Computing

The Boulder HPC Facility: Exploring New Computing Technologies for NOAA



GPU: NVIDIA Kepler K20X (2688 cores)

A new generation of high-performance computing has emerged, called Massively Parallel Fine-Grain (MPFG) computing.  MPFG hardware falls into two general categories: Graphics Processing Units (GPUs) from NVIDIA and AMD, and the Many Integrated Core (MIC) architecture offered by Intel.  These chips can be ten to fifty times more powerful than CPUs.  Rather than the 12-16 powerful cores found in CPUs, they rely on hundreds to thousands of simple compute cores to execute calculations simultaneously.  In addition to higher performance, MPFG chips also consume less power per floating-point operation (flop), making them an attractive, energy-efficient alternative to CPUs.
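
To make the idea of fine-grain parallelism concrete, the hypothetical Fortran kernel below (not taken from any NOAA model; all names are invented) distributes every grid point of a simple update across the thousands of threads an MPFG chip provides, using a standard OpenACC directive to express the mapping:

    ! Hypothetical sketch: update an nz-level, nip-column grid.
    ! A CPU sweeps these loops with a few cores; with the OpenACC
    ! directive, each of the nz*nip iterations becomes its own GPU thread.
    subroutine update(nz, nip, a, x, y)
      integer, intent(in)    :: nz, nip
      real,    intent(in)    :: a, x(nz,nip)
      real,    intent(inout) :: y(nz,nip)
      integer :: k, ipn
    !$acc parallel loop collapse(2)
      do ipn = 1, nip                    ! horizontal columns
        do k = 1, nz                     ! vertical levels
          y(k,ipn) = a * x(k,ipn) + y(k,ipn)
        end do
      end do
    end subroutine update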



Intel MIC (Many Integrated Core): 61 cores, 512-bit SIMD vector registers

Major Activities in 2012 / 2013
  • Fortran to CUDA (or C) Compiler:  The F2C-ACC compiler was released to the public in June 2009.  The compiler continues to be developed, and has been used to parallelize and run the FIM and NIM models on NVIDIA GPUs.
    F2C-ACC Version 5.5.1 was released in May 2014.  This minor release fixes a small error in some simple unit tests, no longer used, that exercised GPU constant memory.
  • OpenACC GPU Compiler Evaluation:  We would like to adopt the commercial compilers once they are ready.  In a recent evaluation of the Cray, PGI and CAPS compilers, we compared their performance to that of our F2C-ACC compiler.  Results show the code they generate is significantly slower than code generated by F2C-ACC, but further optimizations may be possible.  To help improve their performance, we plan to share the code used in our evaluation: kernels from the FIM, NIM and WRF models packaged into simple standalone tests (a representative kernel skeleton is sketched after the table below).

The standalone tests, configured to run on ORNL's Titan, can be downloaded, configured and run using the makefile, runscript and module scripts provided for each compiler.  The code examples are clean and efficient, with the highly parallelizable loop structure found in most weather and climate codes.

FIM: advection         README.titan         Download (300 MBytes)
FIM: continuity        Available Soon
WRF PBL Physics        Available Soon
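
The real kernels are contained in the downloads above; the hypothetical skeleton below (names and sizes invented) illustrates only the loop structure these tests share, annotated with the standard OpenACC directives the commercial compilers accept.  Independent horizontal columns map onto gangs (thread blocks) and vertical levels onto vector lanes (threads):

    ! Hypothetical skeleton in the style of the FIM/NIM test kernels.
    ! Columns (ipn) are independent, so each is assigned to a gang;
    ! the vertical (k) loop within a column runs across vector lanes.
    subroutine apply_tendency(nz, nip, dt, fld, tend)
      integer, intent(in)    :: nz, nip
      real,    intent(in)    :: dt, tend(nz,nip)
      real,    intent(inout) :: fld(nz,nip)
      integer :: k, ipn
    !$acc parallel loop gang
      do ipn = 1, nip
    !$acc loop vector
        do k = 1, nz
          fld(k,ipn) = fld(k,ipn) + dt * tend(k,ipn)
        end do
      end do
    end subroutine apply_tendency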

  • NIM Model Dynamics:  The dynamics were parallelized using F2C-ACC for the GPU and OpenMP for the Intel MIC.  Parallelization efforts focused on (1) maintaining a single source code for CPU, GPU and MIC, and (2) running efficiently on all architectures (see the single-source sketch after this list).  Similar effort was required to parallelize the model using OpenMP for the MIC (native mode) and F2C-ACC for the GPU; most of that effort went into adapting the Fortran code to expose fine-grain parallelism while maintaining performance portability.  We have run the NIM on over 1000 GPUs on ORNL's Titan machine and on hundreds of nodes of NSF's TACC MIC cluster.  Further work is ongoing to optimize inter-GPU and inter-MIC communication performance on these systems to improve scalability.
  • NIM Model Physics:  Early work to extract a standalone test of the WRF PBL (Planetary Boundary Layer) physics yielded encouraging results on the GPU in 2011.  This has led to a more general approach to parallelization, mostly focused on small changes that expose fine-grain parallelism and handle data-alignment issues on the MIC (an alignment sketch also follows this list).  Parallelization of selected routines is expected to begin in November 2013.
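
Both directive sets used in the dynamics work are ordinary Fortran comments, which is what makes a single source practical: each compiler honors its own directives and ignores the other's.  The loop nest below is a hypothetical illustration with invented names, and the ACC$ directive spelling is approximate; consult the F2C-ACC documentation for the exact syntax:

    ! Hypothetical single-source loop nest.  The !$OMP directives take
    ! effect when building for the CPU or MIC with OpenMP; the !ACC$
    ! directives (approximate spelling) take effect when translating
    ! with F2C-ACC for the GPU.  Each set is a plain comment to the
    ! other compiler, so one source serves all three architectures.
    subroutine step(nz, nip, dt, f, u)
      integer, intent(in)    :: nz, nip
      real,    intent(in)    :: dt, f(nz,nip)
      real,    intent(inout) :: u(nz,nip)
      integer :: k, ipn
    !ACC$REGION(<nz>,<nip>) BEGIN
    !$OMP PARALLEL DO PRIVATE(k)
    !ACC$DO PARALLEL(1)
      do ipn = 1, nip                    ! horizontal: OMP threads / GPU blocks
    !ACC$DO VECTOR(1)
        do k = 1, nz                     ! vertical: GPU threads in a block
          u(k,ipn) = u(k,ipn) + dt * f(k,ipn)
        end do
      end do
    !$OMP END PARALLEL DO
    !ACC$REGION END
    end subroutine step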
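
The alignment changes for the MIC are of the following flavor: with 512-bit SIMD registers a vector load covers 16 single-precision values, so vertical columns are padded to a multiple of 16 and the compiler is told the accesses are aligned.  The sketch below is hypothetical (the directive is Intel-specific, and allocations must also be aligned, e.g. with ifort's -align array64byte option), not an excerpt from NIM:

    ! Hypothetical MIC alignment fix.  If the leading dimension of t is
    ! padded to a multiple of 16 floats, every column starts on a 64-byte
    ! boundary and the k loop vectorizes without peel/remainder code.
    subroutine damp_columns(nz, nip, coef, t)
      integer, intent(in)    :: nz, nip
      real,    intent(in)    :: coef
      real,    intent(inout) :: t(:,:)   ! leading dim padded to multiple of 16
      integer :: k, ipn
      do ipn = 1, nip
    !DIR$ VECTOR ALIGNED                 ! Intel directive: assume aligned access
        do k = 1, nz
          t(k,ipn) = coef * t(k,ipn)
        end do
      end do
    end subroutine damp_columns

A caller would allocate the field with a padded leading dimension, for example nzpad = ((nz + 15) / 16) * 16 followed by allocate (t(nzpad, nip)).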

Presentations

  • Coming ...

Prepared by Mark Govett, Mark.W.Govett@noaa.gov
Date of last update: November 1, 2013