WIEN2k

Adding a new dimension to DFT calculations of solids ...

Hardware Benchmarks for

single cpu performance (serial lapw1c)

parallel cpu performance (mpi-parallel lapw1_mpi)



Below you can find some timings of serial and parallel benchmarks run on various platforms and using different compilers.
To contribute a new benchmark time, please download the serial or parallel benchmark, run it ("x lapw1 -c" or "x lapw1 -p" with a proper .machines file) and send the timings (total cpu and wall time, plus the partial times from "grep HORB *output1*") together with a short description of the hardware and software to the WIEN mailing list or to P. Blaha.
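The partial times can also be pulled out of the output programmatically. The following is only a minimal sketch: it assumes the timing lines look like the ones quoted further down this page, and the name parse_timing and the regular expression are illustrative, not part of WIEN2k.

```python
import re

# Pattern for the timing lines as they are quoted on this page, e.g.
# "TIME HAMILT (CPU)  =   346.7, HNS =   198.6, DIAG =  1188.6"
# An optional "HORB = ..." field is accepted as well.
# Assumption: real output1 files use the same layout.
TIMING_RE = re.compile(
    r"TIME HAMILT \((?P<kind>CPU|WALL)\s*\)\s*=\s*(?P<hamilt>[\d.]+),"
    r"\s*HNS\s*=\s*(?P<hns>[\d.]+),"
    r"(?:\s*HORB\s*=\s*(?P<horb>[\d.]+),)?"
    r"\s*DIAG\s*=\s*(?P<diag>[\d.]+)"
)

def parse_timing(line):
    """Return a dict of partial times in seconds, or None if the line does not match."""
    m = TIMING_RE.search(line)
    if m is None:
        return None
    out = {"kind": m.group("kind")}
    for key in ("hamilt", "hns", "horb", "diag"):
        value = m.group(key)
        if value is not None:
            out[key] = float(value)
    return out

sample = "aurora_1_4:    TIME HAMILT (CPU)  =    88.1, HNS =    78.4, DIAG =   514.4"
print(parse_timing(sample))
```

Applied to each line returned by the grep command, this gives the HAMILT, HNS and DIAG times as numbers that can be put into a table.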

At present the Intel Nehalem processor (Core i7) seems to be the fastest processor (and at a very good price). Using all 4 cores, the benchmark time comes down to 41 sec (on the "slowest" Core i7 processor!), and the loss when running 4 jobs in k-parallel mode is just a few percent.

Itanium2 systems, IBM 52A (Power5) systems and the (much cheaper) Intel P4 Conroe Core2 Duo PCs are quite similar in single-cpu performance. For some reason (SSE3 instructions) the AMD cpus perform a little slower, but they show very good scaling for parallel jobs.

Please note that on multi-core (or multi-cpu) systems the performance can decrease drastically when running N lapw1 jobs on N cores in parallel, because the memory-bus speed is limited (see the quad-core cpus). Thus, memory bandwidth seems to be the most important factor for the performance of a "k-parallel" scf-cycle.


Serial benchmark: NMAT=3481, complex

Intel Core i7 980x, 3.33GHz       65 sec    ifort11 (+mkl)(1 job with 1 thread)

Intel Core i7 920, 2.66 GHz       91 sec    ifort11 (+mkl)(1 job with 1 thread)
Intel Core i7 920, 2.66 GHz       57 sec    ifort11 (+mkl)(1 job with 2 threads)
Intel Core i7 920, 2.66 GHz       40 sec    ifort11 (+mkl)(1 job with 4 threads)

P4 dual-Xeon, 3.6 GHz            165 sec    ifort9 + mkl8 (1 job with 1 thread!)
P4 dual-Xeon, 3.6 GHz            125 sec    ifort9 + mkl8 (1 job with 2 threads!)

bi-Xeon 5320 (overcl 2.67GHz)    119 sec    ifort9.1 + mkl9.0 (1 job with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)     90 sec    ifort9.1 + mkl9.0 (1 job with 2 threads)
bi-Xeon 5320 (overcl 2.67GHz)     76 sec    ifort9.1 + mkl9.0 (1 job with 4 threads)
bi-Xeon 5320 (overcl 2.67GHz)     69 sec    ifort9.1 + mkl9.0 (1 job with 8 threads)

P4 Core2 Duo E6600, 2.4 GHz      128 sec    ifort10.1+mkl9.1, OMP_NUM_THREADS=1
P4 Core2 Duo E6600, 2.4 GHz      103 sec    ifort10.1+mkl9.1, OMP_NUM_THREADS=2

Xeon X3210 Quadcore 2.13GHz      140 sec    ifort10.1+cmkl10.0  1 job, 1 thread
Xeon X3210 1066 MHz FSB           88 sec    ifort10.1+cmkl10.0  1 job, 2 threads
Xeon X3210                       112 sec    ifort10.1+cmkl10.0  2 jobs, 2 threads
Xeon X3210                       228 sec    ifort10.1+cmkl10.0  4 jobs, 1 thread

IBM 52A  1.90GHz Power5+(1 cpu)  135 sec    xlf10.1,-q64 -O5,ESSL4.2
IBM 52A  (-"-,2 cpus)             83 sec     - " -
IBM 52A  (-"-,2 cpus, SMT=on)     80 sec     - " -

Itanium2(1.6GHz,SGI Altix 3700)  122 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00
Itanium2(-"-, 2 threads)          90 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00

AMD-Opteron, single cpu, 2.4GHz  190 sec    ifort(9.1.40) + libgoto_opteron64p-r1.09.so
AMD-Opteron, single cpu, 2.8GHz  167 sec    ifort(10.1.11) + libgoto_opteron64p-r1.23.so


Serial benchmark with parallel jobs (tests the "real" performance under full load with a "k-parallel" job): NMAT=3481, complex

Intel Core i7 980x, 3.33 GHz   ifort11 (+mkl)
1 job (1 thread) 65 sec; 6 jobs (1 thread) 89 sec

Intel Core i7 920, 2.66 GHz   ifort11 (+mkl)
Jobs   1 Thread    2 Threads   4 Threads   (times in sec)
1         99            62          41
2        100            70
4        104

1333 MHz FSB Dual-Clovertown X5355 @ 2.66GHz, 667 MHz Memory
Jobs   1 Thread    2 Threads   4 Threads   8 Threads   (times in sec)
1        132            88          66         62
2        145           104          98
4        177           163

1600 MHz FSB Dual-Harpertown 2.8 GHz with 800 MHz Memory
Jobs   1 Thread    2 Threads   4 Threads   (times in sec)
1        134            83          67
2        123            94
4        148           134

AMD-Opteron Dual CPU/Dual Core 2.8GHz (IBM 3455), 8GB RAM DDR2 667MHz
ifort 10.1.11 + libgotoopteron64p-r1.23.so
Jobs   1 Thread    2 Threads   4 Threads   (times in sec)
1        167           120         101
2        168           122
4        174
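The effect of full load can be condensed into a single number per machine. The sketch below is an illustration only: it takes the 1-job and 4-job times (1 thread per job) from the tables above and reports how much of the ideal throughput is retained. The function name kparallel_efficiency is hypothetical.

```python
# (1-job time, 4-job time) in seconds, 1 thread per job, copied from the tables above.
timings = {
    "Core i7 920": (99, 104),
    "Dual-Clovertown X5355": (132, 177),
    "Dual-Harpertown 2.8 GHz": (134, 148),
    "Opteron 2.8 GHz (IBM 3455)": (167, 174),
}

def kparallel_efficiency(t_single, t_loaded):
    """Fraction of ideal throughput retained when several jobs run at the same time.

    With perfect scaling each job would still take t_single seconds,
    so the ratio t_single / t_loaded equals 1.0 in the ideal case.
    """
    return t_single / t_loaded

for name, (t1, t4) in timings.items():
    eff = kparallel_efficiency(t1, t4)
    # t4 / 4 is the effective wall time per job when 4 jobs share the machine.
    print(f"{name:28s} {t4 / 4:5.1f} s per job, efficiency {eff:.0%}")
```

A value close to 100% means the machine can run one lapw1 job per core with almost no penalty; lower values point to the memory-bus limit mentioned above.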


A "historical list" can be found here.

MPI-parallel benchmark: NMAT=11571, real, full diagonalization

P4 dual-Xeon, 3.6 GHz, Infiniband, ifort9+cmkl8 (first number: jobs/node; second number: nodes).

aurora_serial: TIME HAMILT (CPU)  =   346.7, HNS =   198.6, DIAG =  1188.6
aurora_serial: TOTAL CPU TIME:   1737.0 (INIT =      3.1 + K-POINTS =   1733.9)

aurora_1_2:    TIME HAMILT (CPU)  =   169.9, HNS =   145.2, DIAG =   991.1
aurora_1_2:    TOTAL CPU TIME:   1309.6 (INIT =      3.1 + K-POINTS =   1306.5)

aurora_1_4:    TIME HAMILT (CPU)  =    88.1, HNS =    78.4, DIAG =   514.4
aurora_1_4:    TOTAL CPU TIME:    684.1 (INIT =      3.0 + K-POINTS =    681.1)

aurora_1_8:    TIME HAMILT (CPU)  =    44.7, HNS =    41.6, DIAG =   304.6
aurora_1_8:    TOTAL CPU TIME:    394.3 (INIT =      3.1 + K-POINTS =    391.2)

aurora_1_16:   TIME HAMILT (CPU)  =    25.0, HNS =    23.7, DIAG =   196.8
aurora_1_16:   TOTAL CPU TIME:    248.8 (INIT =      3.1 + K-POINTS =    245.7)

aurora_1_32:   TIME HAMILT (CPU)  =    14.7, HNS =    13.8, DIAG =   137.6
aurora_1_32:   TOTAL CPU TIME:    169.4 (INIT =      3.1 + K-POINTS =    166.3)

aurora_2_1:    TIME HAMILT (CPU)  =   194.8, HNS =   171.4, DIAG =  1554.2
aurora_2_1:    TOTAL CPU TIME:   1923.8 (INIT =      3.2 + K-POINTS =   1920.7)

aurora_2_2:    TIME HAMILT (CPU)  =   103.2, HNS =    90.1, DIAG =   816.6
aurora_2_2:    TOTAL CPU TIME:   1013.3 (INIT =      3.1 + K-POINTS =   1010.2)

aurora_2_4:    TIME HAMILT (CPU)  =    46.5, HNS =    48.0, DIAG =   427.6
aurora_2_4:    TOTAL CPU TIME:    525.4 (INIT =      3.1 + K-POINTS =    522.3)

aurora_2_8:    TIME HAMILT (CPU)  =    25.5, HNS =    27.6, DIAG =   287.9
aurora_2_8:    TOTAL CPU TIME:    344.4 (INIT =      3.1 + K-POINTS =    341.2)

aurora_2_16:   TIME HAMILT (CPU)  =    15.2, HNS =    15.4, DIAG =   179.6
aurora_2_16:   TOTAL CPU TIME:    213.5 (INIT =      3.1 + K-POINTS =    210.4)

aurora_2_32:   TIME HAMILT (CPU)  =     9.8, HNS =     9.4, DIAG =   127.7
aurora_2_32:   TOTAL CPU TIME:    150.3 (INIT =      3.2 + K-POINTS =    147.1)
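To compare how the individual parts scale, the timings can be turned into speedups. This is a rough sketch under two assumptions: the serial run is taken as the 1-process baseline, and the reported CPU times are used as a stand-in for elapsed time. Only the runs with 1 job per node are included.

```python
# One MPI job per node on the dual-Xeon cluster; values copied from the listing above.
# processes -> (HAMILT, HNS, DIAG, TOTAL) in seconds.
# Assumption: the serial run serves as the 1-process baseline.
aurora = {
    1: (346.7, 198.6, 1188.6, 1737.0),
    2: (169.9, 145.2, 991.1, 1309.6),
    4: (88.1, 78.4, 514.4, 684.1),
    8: (44.7, 41.6, 304.6, 394.3),
    16: (25.0, 23.7, 196.8, 248.8),
    32: (14.7, 13.8, 137.6, 169.4),
}

def speedup(baseline, value):
    """Ratio of the baseline time to the measured time."""
    return baseline / value

base = aurora[1]
print("procs  HAMILT   DIAG  TOTAL  efficiency")
for procs, row in aurora.items():
    s_ham = speedup(base[0], row[0])
    s_diag = speedup(base[2], row[2])
    s_tot = speedup(base[3], row[3])
    # Parallel efficiency: total speedup divided by the number of processes.
    print(f"{procs:5d} {s_ham:7.1f} {s_diag:6.1f} {s_tot:6.1f} {s_tot / procs:10.0%}")
```

In these numbers HAMILT scales far better than DIAG, so the full diagonalization determines the total time at larger process counts.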

SUN AMD-2.4GHz dual-core/dual-cpu, Infiniband, SUNstudio10 (first number: jobs/node; second number: nodes).

luna-serial:  TIME HAMILT (CPU)  =   763.3, HNS =   265.0, DIAG =  2255.3
luna-serial:  TOTAL CPU TIME:   3286.3 (INIT =      2.5 + K-POINTS =   3283.8)

luna-mpi_2_1: TIME HAMILT (CPU)  =   384.5, HNS =   199.2, DIAG =  1504.9
luna-mpi_2_1: TOTAL CPU TIME:   2091.4 (INIT =      2.5 + K-POINTS =   2088.9)

luna-mpi_4_1: TIME HAMILT (CPU)  =   196.6, HNS =   105.2, DIAG =   785.2
luna-mpi_4_1: TOTAL CPU TIME:   1089.9 (INIT =      2.5 + K-POINTS =   1087.3)

luna-mpi_2_2: TIME HAMILT (CPU)  =   193.7, HNS =   103.7, DIAG =   752.3
luna-mpi_2_2: TOTAL CPU TIME:   1052.6 (INIT =      2.5 + K-POINTS =   1050.1)

luna-mpi_4_2: TIME HAMILT (CPU)  =   102.2, HNS =    58.5, DIAG =   546.2
luna-mpi_4_2: TOTAL CPU TIME:    709.8 (INIT =      2.6 + K-POINTS =    707.2)

luna-mpi_4_4: TIME HAMILT (CPU)  =    53.8, HNS =    31.6, DIAG =   251.6
luna-mpi_4_4: TOTAL CPU TIME:    340.0 (INIT =      2.6 + K-POINTS =    337.4)

luna-mpi_4_8: TIME HAMILT (CPU)  =    31.6, HNS =    18.3, DIAG =   176.9
luna-mpi_4_8: TOTAL CPU TIME:    229.7 (INIT =      2.6 + K-POINTS =    227.1)

Xeon X3210 2.13GHz Quad Core, 1066 MHz FSB (first number: jobs/node; 2nd number: nodes).

1 MPI, 1 Thread     1423 Secs HAMILT (CPU ) =   223.0, HNS=174.1, DIAG=  1021.4
2 MPI, 1 Thread     1242 Secs HAMILT (WALL) =   120.3, HNS=129.8, DIAG=   988.4
4 MPI, 1 Thread     1175 Secs HAMILT (WALL) =    80.1, HNS=116.8, DIAG=   977.9

20 nodes AMD-Opteron Dual CPU/Dual Core 2.8GHz (IBM 3455), 8GB RAM DDR2 667MHz, Voltaire 20Gbps Infiniband
ifort 10.1.11 + libgotoopteron64p-r1.23.so # Intel Cluster MKL, MPI-CH1.2
Even more detailed data can be found here.

 1 core/ 1 node: TIME HAMILT (CPU) = 314.8, HNS = 341.4, HORB = 0.0, DIAG = 1990.4
                 TOTAL CPU TIME: 2648.7 (INIT = 2.1 + K-POINTS = 2646.7)
 4 core/ 4 node: TIME HAMILT (CPU) = 79.6, HNS = 91.9, HORB = 0.0, DIAG = 483.8
                 TOTAL CPU TIME: 657.5 (INIT = 2.1 + K-POINTS = 655.4)
 8 core/ 8 node: TIME HAMILT (CPU) = 40.0, HNS = 48.6, HORB = 0.0, DIAG = 268.4
                 TOTAL CPU TIME: 359.3 (INIT = 2.1 + K-POINTS = 357.2)
16 core/16 node: TIME HAMILT (CPU) = 22.0, HNS = 26.7, HORB = 0.0, DIAG = 159.9
                 TOTAL CPU TIME: 210.8 (INIT = 2.1 + K-POINTS = 208.7)

 4 core/ 1 node: TIME HAMILT (CPU) = 85.7, HNS = 96.0, HORB = 0.0, DIAG = 668.0
                 TOTAL CPU TIME: 851.8 (INIT = 2.1 + K-POINTS = 849.7)
 8 core/ 2 node: TIME HAMILT (CPU) = 40.5, HNS = 50.8, HORB = 0.0, DIAG = 344.8
                 TOTAL CPU TIME: 438.3 (INIT = 2.1 + K-POINTS = 436.3)
12 core/ 3 node: TIME HAMILT (CPU) = 27.6, HNS = 35.8, HORB = 0.0, DIAG = 247.1
                 TOTAL CPU TIME: 312.6 (INIT = 2.1 + K-POINTS = 310.5)
16 core/ 4 node: TIME HAMILT (CPU) = 22.1, HNS = 27.9, HORB = 0.0, DIAG = 194.3
                 TOTAL CPU TIME: 246.6 (INIT = 2.1 + K-POINTS = 244.5)
20 core/ 5 node: TIME HAMILT (CPU) = 18.1, HNS = 22.6, HORB = 0.0, DIAG = 165.3
                 TOTAL CPU TIME: 208.3 (INIT = 2.1 + K-POINTS = 206.3)
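The two Opteron listings above allow a direct comparison between spreading a job over many nodes and packing 4 cores onto each node. The following sketch only rearranges the TOTAL CPU TIME values given above; the function name packing_penalty is hypothetical.

```python
# TOTAL CPU TIME in seconds for the same core count, copied from the two listings above.
# cores -> (one core per node, four cores per node)
totals = {
    4: (657.5, 851.8),
    8: (359.3, 438.3),
    16: (210.8, 246.6),
}

def packing_penalty(spread, packed):
    """Relative slowdown when the same number of cores is packed onto fewer nodes."""
    return packed / spread - 1.0

for cores, (spread, packed) in totals.items():
    penalty = packing_penalty(spread, packed)
    print(f"{cores:2d} cores: {penalty:.0%} slower with 4 cores per node")
```

The penalty shrinks as the core count grows, but in all three cases the run with one core per node is the faster one.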

* libgoto blas libraries are available from: http://www.tacc.utexas.edu/resources/software/




©2001 by P. Blaha and K. Schwarz