Benchmarks at CyberInfrastructure Partnership

Application Benchmarks

Benchmark results are presented here for five applications that are heavily used at NCSA and/or SDSC. They span a range of disciplines and code types, as can be seen from the following table.

                                                               Flop  Memory     Memory   Interconnect  Interconnect  I/O
Application  Discipline           Type                         rate  bandwidth  latency  bandwidth     latency       rate
-----------  -------------------  ---------------------------  ----  ---------  -------  ------------  ------------  ----
GAMESS       Chemistry            Ab initio quantum chemistry        X          X                      X             X
MILC         Physics              Quantum chromodynamics       X                         X
NAMD         Biophysics           Biomolecular dynamics        X                         X
PARATEC      Materials science    Ab initio quantum mechanics  X                         X             X
WRF          Atmospheric science  Weather prediction           X     X                   X             X             X

Also shown in the table are the primary system parameters that determine the performance of each application. For GAMESS, the dominant memory parameter depends upon the problem type. For some other applications, the dominant parameters depend upon processor count. For example, as the processor count increases for a fixed problem size, the flop speed can become more limiting than the memory bandwidth, and the interconnect latency can become more limiting than the interconnect bandwidth.

One or more benchmark problems were run for each application. Benchmark runs were typically made during normal production. All processors per node were used, except on Cobalt for fewer than 512 processors and on Blue Gene for PARATEC. The primary performance metric is run time, generally wall-clock time for a full run unless specified otherwise.

The following column chart gives an overall summary of performance. It shows the relative speed (inverse run time) of the various computers for each application run on one benchmark problem at a single processor count. In each case the results are normalized to the speed of DataStar.
The computers are ordered roughly according to their performance averaged over all the applications. The fastest computer clearly varies from application to application: DataStar is slightly faster than Cobalt for GAMESS, T2 is slightly faster than Cobalt for MILC, Cobalt and DataStar are the same speed for NAMD, and Cobalt is considerably faster than the other computers for PARATEC and WRF.
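
As a minimal sketch of the normalization used in the chart, relative speed can be computed as the reference run time divided by each computer's run time. The function name and the run times below are hypothetical, invented purely for illustration; they are not measured benchmark results.

```python
def relative_speeds(run_times, reference):
    """Return each computer's speed relative to the reference computer.

    Speed is taken as inverse run time, so
    relative speed = t_reference / t_computer.
    """
    t_ref = run_times[reference]
    return {name: t_ref / t for name, t in run_times.items()}


# Hypothetical wall-clock times in seconds (NOT measured results).
example_times = {"DataStar": 100.0, "Cobalt": 80.0, "T2": 125.0}

speeds = relative_speeds(example_times, reference="DataStar")
for name, speed in speeds.items():
    print(f"{name}: {speed:.2f}")
```

With this convention the reference computer always has relative speed 1, and a value above 1 means a shorter run time than the reference.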

The relative performance also varies with problem size and processor count. For the problem sizes shown in the chart (except for MILC), the processor counts are at the high end of the scaling range for most of the computers. Because of its high-performance custom switch, DataStar generally scales to higher processor counts, where its relative performance improves.

More detailed tabular and graphical results for additional problem sizes and processor counts are presented in the discussion of each application. Included there are scaling plots of normalized speed per processor versus processor count.

Two types of scaling scans are routinely done. In a strong scaling scan, the problem size is held fixed as the processor count increases. In a weak scaling scan, the problem size increases with processor count, typically holding the work per processor constant. Strong scaling scans are reported for the application benchmarks here.
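
The difference between the two scan types can be sketched as follows. This is an illustration only: the function name, the sizes, and the processor counts are hypothetical, and the sketch assumes that work is proportional to problem size, which need not hold for a real application.

```python
def scan_sizes(base_size, processor_counts, mode):
    """Total problem size at each processor count in a scaling scan.

    base_size is the total problem size at the smallest processor count.
    Assumes work is proportional to problem size (an assumption).
    """
    p_min = min(processor_counts)
    if mode == "strong":
        # Strong scaling: total problem size is fixed.
        return {p: base_size for p in processor_counts}
    if mode == "weak":
        # Weak scaling: size grows with p so size per processor is constant.
        return {p: base_size * p // p_min for p in processor_counts}
    raise ValueError("mode must be 'strong' or 'weak'")


counts = [16, 32, 64]  # hypothetical processor counts
strong = scan_sizes(1000, counts, "strong")
weak = scan_sizes(1000, counts, "weak")
print(strong)
print(weak)
```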

To understand scaling for a specific problem size, the normalized speed per processor is a particularly useful performance metric. Let t(i,p) be the run time on p processors of computer i. Then

- the speed per processor is 1/(p t(i,p)), and
- the normalized speed per processor is (q t(r,q))/(p t(i,p)),
where q is a reference number of processors, and r is a reference computer.

If computer i is the same as the reference computer r, then the metric is just the parallel efficiency relative to processor count q, which is typically taken to be the minimum of the scaling scan. If i and r are different, then the metric gives the parallel efficiency of computer i scaled by its speed ratio relative to computer r at processor count q.
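
The normalized speed per processor defined above can be sketched in a few lines. The nested-dictionary layout, the function name, and the run times are assumptions made for illustration; the run times are invented and are not benchmark results.

```python
def normalized_speed_per_processor(t, i, p, r, q):
    """Compute (q * t(r,q)) / (p * t(i,p)).

    t[computer][processors] holds run times; r is the reference
    computer and q is the reference processor count.
    """
    return (q * t[r][q]) / (p * t[i][p])


# Hypothetical run times in seconds (NOT measured results).
t = {
    "DataStar": {64: 400.0, 128: 220.0, 256: 125.0},
    "Cobalt": {64: 320.0, 128: 180.0, 256: 110.0},
}

r, q = "DataStar", 64  # reference computer and processor count
for computer, times in t.items():
    for p in sorted(times):
        value = normalized_speed_per_processor(t, computer, p, r, q)
        print(f"{computer:8s} p={p:4d} normalized speed/proc = {value:.3f}")
```

For the reference computer the value at p = q is exactly 1 and the remaining values are its parallel efficiency relative to q. For another computer, the value at p = q is its speed ratio relative to the reference, and values at larger p scale that ratio by the computer's own parallel efficiency.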