Cluster Performance

The performance of a cluster can be characterized by running a few standard benchmark tests on it. The results can then be analyzed to form a baseline against which the cluster can be compared with other supercomputing systems. We ran the HPC Challenge (HPCC) benchmark suite on our cluster to estimate its speed relative to other clusters.

HPC Challenge Benchmark

The HPC Challenge Benchmark (HPCC), funded by the DARPA HPCS program, is a suite of tests that evaluates the performance of high-end architectures. Seven benchmarks are included: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff latency/bandwidth.
High Performance Linpack (HPL) measures the floating point execution rate for solving a dense system of linear equations. HPL is the standard benchmark used for the TOP500 list.

Running HPCC

To run the HPCC benchmark, a message passing library (MPICH / LAM) and an optimized BLAS implementation (GotoBLAS, ATLAS) are required. The input configuration file (HPL.dat or HPCCINF.txt) is tuned according to the cluster architecture; a sample input file and launch command are sketched after the list below. For our architecture, hpccinf.txt defines the following parameter values:

N : 5760
NB : 80
PMAP : Column-major process mapping
P : 1
Q : 5
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
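
For reference, an hpccinf.txt encoding these values might look like the sketch below. The layout follows the stock HPCC input template (an HPL.dat-style file with a few trailing lines for PTRANS); the header lines, output device, residual threshold and PTRANS sizes are template defaults rather than values from our run, so treat them as assumptions to be tuned.

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    8            device out (6=stdout,7=stderr,file)
    1            # of problems sizes (N)
    5760         Ns
    1            # of NBs
    80           NBs
    1            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    1            Ps
    5            Qs
    16.0         threshold
    1            # of panel fact
    2            PFACTs (0=left, 1=Crout, 2=Right)
    1            # of recursive stopping criterium
    4            NBMINs (>= 1)
    1            # of panels in recursion
    2            NDIVs
    1            # of recursive panel fact.
    1            RFACTs (0=left, 1=Crout, 2=Right)
    1            # of broadcast
    1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
    1            # of lookahead depth
    1            DEPTHs (>=0)
    2            SWAP (0=bin-exch,1=long,2=mix)
    64           swapping threshold
    0            L1 in (0=transposed,1=no-transposed) form
    0            U  in (0=transposed,1=no-transposed) form
    1            Equilibration (0=no,1=yes)
    8            memory alignment in double (> 0)
    ##### This line (no. 32) is ignored (it serves as a separator). #####
    0            Number of additional problem sizes for PTRANS
    1200 10000 30000          values of N
    0            Number of additional blocking sizes for PTRANS
    40 9 8 13 13 16 17 20 21 32  values of NB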

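With the input file in place in the working directory, the benchmark is launched with as many MPI processes as P * Q. On an MPICH-style installation spanning our five nodes, the invocation might look like the following (the machine file name is an assumption; it simply lists the five host names, one per line):

    mpirun -np 5 -machinefile machines ./hpcc

The results are written to hpccoutf.txt in the same directory.
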
Analysing HPCC output

HPCC runs all seven tests on the cluster and stores the results in the 'hpccoutf.txt' file.
An explanation of the input/output parameters follows:

T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

(Figure: HPL result table from hpccoutf.txt; original image: linpack.jpg)

Thus, the Rmax value obtained is 9.735 GFlop/s.
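
When comparing several runs, it is convenient to pull this figure out of hpccoutf.txt programmatically. The short Python sketch below assumes the file ends with the usual summary section of key=value lines (for example HPL_Tflops, reported in TFlop/s); the script and function names are ours, not part of HPCC:

    # parse_hpcc.py: minimal sketch that extracts the HPL result from hpccoutf.txt.
    # Assumes the output contains key=value summary lines such as HPL_Tflops=0.009735.

    def read_summary(path="hpccoutf.txt"):
        """Collect key=value pairs from an HPCC output file into a dict."""
        summary = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if "=" in line:
                    key, _, value = line.partition("=")
                    summary[key.strip()] = value.strip()
        return summary

    if __name__ == "__main__":
        results = read_summary()
        if "HPL_Tflops" in results:
            # HPL_Tflops is given in TFlop/s; convert to GFlop/s to compare with Rpeak.
            print("HPL Rmax: %.3f GFlop/s" % (float(results["HPL_Tflops"]) * 1000.0))
        else:
            print("No HPL_Tflops entry found; check that the run completed.")

Run from the benchmark directory, it should print a value matching the Gflops column above.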

Rpeak for x86 and x86_64 processors (Pentium 4 or later) is 2 FLOP/cycle (with SSE2) * clock rate in GHz * number of processors.
For our system of five single-processor 3.0 GHz machines, Rpeak = 2 * 3.0 * 5.

Rpeak = 30 GFlop/s

With the ATLAS BLAS library, the measured Rmax of 9.735 GFlop/s therefore corresponds to the following efficiency:

Efficiency = (Rmax / Rpeak) * 100
= (9.735 / 30) * 100 ≈ 32.45%

Thus, the efficiency of the cluster is approximately 32% of its theoretical peak.
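
The same arithmetic can be captured in a few lines of Python, which makes it easy to recompute the numbers if the node count or clock rate changes (the factor of 2 FLOP/cycle assumes SSE2 as above; the file and function names are ours):

    # efficiency.py: recompute Rpeak and HPL efficiency for a small cluster.

    def rpeak_gflops(processors, ghz, flops_per_cycle=2):
        """Theoretical peak in GFlop/s: FLOP/cycle * clock rate (GHz) * processor count."""
        return flops_per_cycle * ghz * processors

    def efficiency_percent(rmax, rpeak):
        """HPL efficiency: Rmax as a percentage of Rpeak."""
        return 100.0 * rmax / rpeak

    if __name__ == "__main__":
        rpeak = rpeak_gflops(processors=5, ghz=3.0)        # 30 GFlop/s for our cluster
        print("Rpeak      = %.1f GFlop/s" % rpeak)
        print("Efficiency = %.2f%%" % efficiency_percent(9.735, rpeak))  # about 32.45%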
