Benchmarks at CyberInfrastructure Partnership

HPCC

HPCC Chart

The HPC Challenge benchmark set [HPCC] consists of seven synthetic benchmarks. These are listed in the following table along with the primary system parameters that determine the performance of each.

 
Benchmark      Flop Rate   Memory Bandwidth   Memory Latency   Interconnect Bandwidth   Interconnect Latency
HPL            X           -                  -                X                        -
DGEMM          X           -                  -                -                        -
STREAM         -           X                  -                -                        -
PTRANS         -           X                  -                X                        -
RandomAccess   -           -                  X                -                        X
FFTE           -           X                  -                X                        -
bench_lat_bw   -           -                  -                X                        X

Three benchmarks each target a single system parameter: DGEMM, STREAM, and bench_lat_bw, whose bandwidth and latency parts are measured separately. The other four benchmarks – HPL, PTRANS, RandomAccess, and FFTE – are more complex in that their performance depends on at least two system parameters.

Most of the benchmarks output more than one metric. However, eight primary metrics are normally reported. These are listed as the column headings of the following table along with two derived metrics (G-STREAM Triad and HPL percent of peak).

Results for HPCC 1.0.0 obtained on the computers at NCSA and SDSC are listed at various processor counts in the table and at 1,024 processors in the associated column chart. For the complex benchmarks, the tabulated results at different processor counts represent weak scaling scans on each computer, i.e., the work per processor is kept roughly constant.
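To illustrate weak scaling, the sketch below uses a common rule of thumb for HPL: choose the matrix order N so that the matrix fills a fixed fraction of the aggregate memory, so that N grows as the square root of the processor count. The function name, the 2 GiB of memory per processor, and the 80% fill fraction are assumptions made for illustration, not the settings used for the results reported here.

```python
import math

GIB = 1024 ** 3

def hpl_matrix_order(num_procs, mem_per_proc_bytes, fill_fraction=0.8):
    """Largest order N such that an N x N matrix of 8-byte doubles
    fits in the given fraction of the aggregate memory (hypothetical helper)."""
    total_bytes = num_procs * mem_per_proc_bytes * fill_fraction
    return math.isqrt(int(total_bytes // 8))

# Assumed values: 2 GiB per processor, 80% of memory used for the matrix.
for p in (64, 256, 1024):
    n = hpl_matrix_order(p, 2 * GIB)
    per_proc_gib = n * n * 8 / p / GIB
    print(f"procs={p:5d}  N={n:7d}  matrix memory per processor={per_proc_gib:.2f} GiB")
```

Quadrupling the processor count roughly doubles N, while the matrix memory held by each processor stays constant.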

To run HPCC, it is necessary to provide a library containing the BLAS (Basic Linear Algebra Subprograms). For the IBM computers, the ESSL math library was used. For Cobalt, the SCS library was used. For the other computers, CMKL was used.

DGEMM performs matrix-matrix multiplication and is one of the BLAS. It is the primary subroutine used in HPL to solve a dense linear system. DGEMM employs a blocking algorithm to achieve high data reuse and minimize memory access. DGEMM and HPL are typically used to measure the maximum flop rate that a computer can sustain.
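The blocking idea can be sketched in a few lines. The version below is a plain illustration of loop tiling, not the algorithm of any particular BLAS library; the function name and the default block size are assumptions. Each tile of the operands is reused for a whole tile of the result before the loops move on, which is what reduces traffic to main memory in an optimized implementation.

```python
def blocked_matmul(a, b, block=2):
    """Compute c = a * b for list-of-lists matrices using loop tiling."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer three loops walk over tiles; inner three loops work inside one tile.
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for jj in range(0, m, block):
                for i in range(ii, min(ii + block, n)):
                    for p in range(kk, min(kk + block, k)):
                        a_ip = a[i][p]
                        for j in range(jj, min(jj + block, m)):
                            c[i][j] += a_ip * b[p][j]
    return c

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(blocked_matmul(a, b))
```

The operation count is 2*n*k*m flops regardless of the block size; blocking changes only the order of memory accesses.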

HPCC contains two variants of DGEMM: SingleDGEMM, which runs on only a single processor, and StarDGEMM (or EP-DGEMM), which runs simultaneously on all of the processors. Results for the two variants are essentially the same and independent of the total number of processors.

The results presented here are for EP-DGEMM. They show that T2 and Cobalt are fastest for this metric, with Mercury and Tungsten slightly slower. T2 achieves 91% of its peak flop rate, while Cobalt achieves 96% of its peak flop rate.

STREAM consists of four unit-stride loops that measure memory bandwidth by accessing a block of memory too large to fit in the largest level of cache. The loop considered here is STREAM Triad: a(i) = b(i) + s*c(i). SingleSTREAM Triad and StarSTREAM Triad (or EP-STREAM Triad) variants are available, similar to those for DGEMM. In this case, the Star or EP variant is slower than the Single variant because of memory contention. Results for both variants are independent of the total number of processors beyond a single node.
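The Triad loop and the way its bandwidth is counted can be sketched as follows. STREAM counts two loads and one store of an 8-byte word per iteration, i.e. 24 bytes per element. The function names are assumptions, and the timing of a Python list comprehension says nothing about hardware memory bandwidth; the sketch only shows the arithmetic.

```python
import time

def stream_triad(b, c, s):
    """Return a where a[i] = b[i] + s * c[i]."""
    return [bi + s * ci for bi, ci in zip(b, c)]

def triad_bandwidth_bytes_per_s(n, seconds, word_bytes=8):
    """Bytes moved per second, counting two loads and one store per element."""
    return 3 * word_bytes * n / seconds

n = 200_000  # illustrative size only; a real run must exceed the cache size
b = [1.0] * n
c = [2.0] * n
start = time.perf_counter()
a = stream_triad(b, c, 3.0)
elapsed = time.perf_counter() - start
print(f"Triad: {triad_bandwidth_bytes_per_s(n, elapsed) / 1e6:.1f} MB/s (interpreter-bound)")
```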

Results for the EP variant are shown in the table and column chart. Also shown in the table are results for the G-STREAM Triad metric, which is simply the EP-STREAM Triad metric multiplied by the number of processors. Cobalt and Mercury are fastest for EP-STREAM Triad, while DataStar is slightly slower.

Bench_lat_bw is the interconnect bandwidth and latency benchmark in HPCC. Three tests are included: Ping Pong, Naturally Ordered Ring, and Randomly Ordered Ring.
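The difference between the two ring tests lies in how neighbors are chosen. The sketch below simulates only the neighbor assignment, without any message passing; the function names and the seed are assumptions for illustration. In a naturally ordered ring, rank r exchanges data with ranks r-1 and r+1, which are often on the same node; in a randomly ordered ring, the ranks are permuted first, so most exchanges cross the interconnect.

```python
import random

def ring_neighbors(order):
    """Map each rank to its (left, right) neighbors for a given ring order."""
    size = len(order)
    return {order[i]: (order[i - 1], order[(i + 1) % size]) for i in range(size)}

def natural_ring(num_ranks):
    """Ring in which ranks follow their natural numbering."""
    return ring_neighbors(list(range(num_ranks)))

def random_ring(num_ranks, seed=0):
    """Ring in which the rank order is a random permutation."""
    order = list(range(num_ranks))
    random.Random(seed).shuffle(order)
    return ring_neighbors(order)

print(natural_ring(4))
print(random_ring(8))
```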

Random Ring bandwidths and latencies are presented here and vary with processor count in a way that depends upon the interconnect topology. Cobalt and DataStar have the highest Random Ring bandwidths, whereas Blue Gene has the lowest (hence fastest) Random Ring latency, followed by Cobalt and DataStar. As expected, the Random Ring bandwidths and latencies for the computers with custom interconnects – Blue Gene, Cobalt, and DataStar – are substantially better than for the computers with commodity interconnects.

HPL is the high-performance version of the widely reported Linpack benchmark. It solves a dense linear system of equations. Most of the computation time is spent in DGEMM, and the communication time is modest. Thus, the relative speeds of HPL and DGEMM are similar across the computers, except for HPL on Tungsten, which is slower. On 1,024 processors, T2, Mercury, and Cobalt run HPL the fastest and achieve 74%, 73%, and 66% of peak, respectively. HPL performance drops on Cobalt in going from 1,008 to 1,024 processors, presumably because of contention with the operating system when all processors in a node are used.
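The percent-of-peak figure can be derived from the problem size and the run time. The sketch below keeps only the leading-order operation count for an LU-based solve, (2/3) N^3, and omits the lower-order terms; the function names and all input values are hypothetical and are not taken from the results reported here.

```python
def hpl_flops(n):
    """Leading-order operation count for solving an n x n dense system by LU."""
    return (2.0 / 3.0) * n ** 3

def percent_of_peak(n, seconds, num_procs, peak_flops_per_proc):
    """Sustained flop rate as a percentage of the aggregate peak rate."""
    sustained = hpl_flops(n) / seconds
    return 100.0 * sustained / (num_procs * peak_flops_per_proc)

# Hypothetical run: N = 100,000 on 1,024 processors with a 6 Gflop/s peak each.
print(f"{percent_of_peak(100_000, 150.0, 1024, 6.0e9):.1f}% of peak")
```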

PTRANS does a parallel transpose of a large matrix. Memory and interconnect bandwidths are both important for performance, with computation time somewhat larger than communication time on DataStar. The relative performance of PTRANS is thus roughly a weighted average of that of EP-STREAM Triad and Random Ring bandwidth. On 1,024 processors, DataStar runs PTRANS fastest, followed by T2. Cobalt runs slower than might be expected on 1,024 processors, but runs very fast on 1,008 processors.
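Why a transpose stresses both memory and the interconnect can be seen from a simplified model. The sketch below assumes a square grid of processes that each own one square block, which is a simpler layout than the one used in the benchmark itself; the function names are hypothetical. Every off-diagonal block must travel to the process in the mirrored grid position (interconnect traffic), and every block must also be transposed locally (memory traffic).

```python
def transpose_exchanges(grid):
    """List (source, destination) process pairs for a grid x grid block transpose."""
    return [((i, j), (j, i)) for i in range(grid) for j in range(grid) if i != j]

def block_transpose(blocks):
    """Transpose a square grid of blocks: block (j, i) moves to (i, j)
    and is itself transposed on arrival."""
    g = len(blocks)
    return [[[list(col) for col in zip(*blocks[j][i])] for j in range(g)]
            for i in range(g)]

# A 4 x 4 matrix held as a 2 x 2 grid of 2 x 2 blocks.
blocks = [[[[1, 2], [5, 6]], [[3, 4], [7, 8]]],
          [[[9, 10], [13, 14]], [[11, 12], [15, 16]]]]
print(block_transpose(blocks))
print(len(transpose_exchanges(4)), "blocks cross the interconnect on a 4 x 4 grid")
```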

FFTE performs one-dimensional fast Fourier transforms. Single, Star, and MPI (or G) variants are available. The results presented here are for the G variant, which uses MPI to spread the FFTs over multiple processors. As for PTRANS, memory and interconnect bandwidths are again important for performance, with computation dominating communication on DataStar. On 1,024 processors, DataStar and T2 are fastest for G-FFTE and run at about the same speed. Blue Gene does surprisingly well and is third fastest.
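For readers unfamiliar with the computation being timed, here is a textbook recursive radix-2 FFT together with the conventional 5 n log2(n) operation estimate used to convert FFT run times into flop rates. This is a minimal single-process sketch, not FFTE's own algorithm; the function names are assumptions, and the input length must be a power of two.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_gflops(n, seconds):
    """Flop rate in Gflop/s using the conventional 5 n log2(n) estimate."""
    return 5.0 * n * math.log2(n) / seconds / 1e9

print(fft([1, 1, 1, 1]))
```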

RandomAccess measures so-called Gup/s (giga-updates per second), the rate at which integers can be randomly updated or loaded from memory. Again, Single, Star, and MPI (or G) variants are available, with the results presented here for the G variant. In this case, memory latency and interconnect latency are most important for performance, and communication greatly dominates computation on DataStar. Thus the relative performance for G-RandomAccess across computers should be similar to that for Random Ring latency. This is roughly the case, with Blue Gene being the fastest on both benchmarks. Mercury and DataStar are next fastest for G-RandomAccess on 1,024 processors.
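The update loop behind this metric can be sketched as follows: a pseudo-random 64-bit stream selects table entries, and each selected entry is XORed with the random value. The shift-and-XOR generator below follows the general form used by the benchmark as far as we recall; the function names, the seed, and the table size are assumptions, and this single-process version leaves out the communication that dominates the G variant.

```python
MASK64 = (1 << 64) - 1
POLY = 0x7

def next_random(ran):
    """Advance a 64-bit shift-and-XOR pseudo-random sequence by one step."""
    high_bit = ran >> 63
    return ((ran << 1) & MASK64) ^ (POLY if high_bit else 0)

def random_access(table, num_updates, seed=1):
    """XOR pseudo-random values into pseudo-random table slots, in place.
    The table length must be a power of two."""
    size = len(table)
    ran = seed
    for _ in range(num_updates):
        ran = next_random(ran)
        table[ran & (size - 1)] ^= ran
    return table

def gups(num_updates, seconds):
    """Giga-updates per second."""
    return num_updates / seconds / 1e9

table = list(range(16))
random_access(table, 64)
print(table)
```

Because XOR is its own inverse, applying the same update sequence a second time restores the original table, which gives a simple way to check a run.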

[HPCC] HPC Challenge Benchmark, http://icl.cs.utk.edu/hpcc/index.html