WIEN2k



Adding a new dimension to DFT calculations of solids ...

Hardware Benchmarks for

single cpu performance (serial lapw1c)

parallel cpu performance (mpi-parallel lapw1_mpi)



Below you can find some timings of serial and parallel benchmarks run on various platforms and using different compilers.
If you want to contribute a new benchmark timing, please download the serial or parallel benchmark, run it ("x lapw1", or "x lapw1 -p" with a proper .machines file), and send the timings (total CPU and wall time, plus the partial times from "grep HORB *output1*") together with a short description of the hardware and software to the WIEN mailing list or to P. Blaha.
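The partial times requested above can also be collected programmatically. The following is a minimal sketch, assuming the timing lines look like the "TIME HAMILT ..." lines quoted on this page; the helper name parse_timing_line is hypothetical, and real output1 files may contain variations this pattern does not cover.

```python
import re

# One "NAME = value" pair; the optional "(CPU)" / "(WALL)" tag after HAMILT is skipped.
# The layout is an assumption based on the timing lines quoted on this page.
PAIR = re.compile(
    r"(HAMILT|HNS|HORB|DIAG|SYNC)\s*(?:\(\s*(?:CPU|WALL)\s*\))?\s*=\s*(\d+(?:\.\d+)?)"
)

def parse_timing_line(line):
    """Return the partial times of one 'TIME HAMILT ...' line as a dict of floats."""
    return {name: float(value) for name, value in PAIR.findall(line)}

# Sample line copied from the MPI benchmark section of this page.
sample = ("TIME HAMILT (WALL) =    35.4, HNS =    26.9, "
          "HORB =     0.0, DIAG =   113.2, SYNC =     0.0")
times = parse_timing_line(sample)
partial_sum = round(sum(times.values()), 1)
print(times, partial_sum)
```

The sum of the partial times can be compared with the K-POINTS value reported in the same output to check that nothing was missed.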

At present Intel processors (at least Core i7) seem to be the fastest processors, but modern AMD CPUs seem to have caught up (at a quite good price). Using 8 cores of an Intel Core i9-12900K (3.2 GHz), the benchmark time for a single k-point comes down to 3.7 sec. For comparison, 6 cores of an Intel Core i7-3930K need 14 sec, while running 6 single-thread jobs in k-parallel mode on that processor takes only 49 sec for 6 k-points (about 8.2 sec per k-point).

Please note that on multi-core (or multi-CPU) systems the performance can decrease drastically when running N lapw1 jobs on N cores in parallel, because the memory-bus speed is limited (Multi-core cpus). Memory bandwidth therefore seems to be the most important factor for the performance of a "k-parallel" scf cycle, and thus for the real throughput.
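The slowdown under full load can be quantified from the k-parallel timings listed further down this page. The sketch below uses the Intel Core i7-3930K single-thread figures; the function name k_parallel_stats and the definition of "efficiency" (single-job time divided by loaded time) are choices made here for illustration, not part of the benchmark itself.

```python
def k_parallel_stats(wall_times):
    """wall_times maps number of simultaneous 1-thread jobs -> wall time in sec.

    Returns {jobs: (sec_per_k_point, efficiency)}, where efficiency is the
    single-job time divided by the loaded time (1.0 = no slowdown under load).
    """
    single = wall_times[1]
    return {jobs: (wall / jobs, single / wall) for jobs, wall in wall_times.items()}

# Intel Core i7-3930K, 1 thread per job, numbers from the k-parallel list on this page.
i7_3930k = {1: 37, 2: 38, 4: 47, 6: 49}
for jobs, (per_k, eff) in k_parallel_stats(i7_3930k).items():
    print(f"{jobs} jobs: {per_k:5.1f} sec/k-point, efficiency {eff:.2f}")
```

An efficiency well below 1.0 at full load indicates that the jobs are competing for a shared resource such as memory bandwidth, even though the time per k-point still improves.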


Serial benchmark: NMAT=3481, complex

Intel Core i9-12900K (8 cores@3.2 + 8 cores@2.4GHz), oneapi-2023
k=1, omp=1,     13.3 sec/k-point
k=1, omp=2,      7.4 sec/k-point
k=1, omp=4,      4.6 sec/k-point
k=1, omp=8,      3.7 sec/k-point

Intel Core i9-9900K 4.70GHz       15.4 sec  ifort 21.1            (1 job with 1 thread)
Intel Core i9-9900K 4.70GHz        9.1 sec  ifort 21.1            (1 job with 2 threads)
Intel Core i9-9900K 4.70GHz        7.1 sec  ifort 21.1            (1 job with 4 threads)
Intel Core i9-9900K 4.70GHz        6.4 sec  ifort 21.1            (1 job with 8 threads)

AMD Ryzen 5 5600G, 6 cores, CPU max MHz:4464; (January 2023)
gfortran 11.3 + openblas zenp-r0.3.21:
k=1, omp_global=1, 20.9 sec/k-point
k=1, omp_global=2, 13.2 sec/k-point
k=1, omp_global=4,  9.5 sec/k-point
k=1, omp_global=6,  8.8 sec/k-point
OneAPI (ifort 2021.8.0 + mkl 2023.0.0)
k=1, omp_global=1, 29.7 sec/k-point
k=1, omp_global=2, 17.4 sec/k-point
k=1, omp_global=4, 12.0 sec/k-point
k=1, omp_global=6, 10.0 sec/k-point
+++++++++++++++
test_case.output1_gfortran:  TIME HAMILT (WALL) =     5.7, HNS =     1.6, DIAG =    13.3
test_case.output1_ifort:     TIME HAMILT (WALL) =     4.0, HNS =     2.7, DIAG =    21.3
--- from these lines one sees that ifort is still faster than gfortran on AMD (HAMILT),
--- but openblas is faster than mkl (HNS+DIAG) on AMD, probably because mkl switches off some optimizations for AMD
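The comparison drawn from the two output1 lines above can be made explicit step by step. This is a small sketch using only the three wall times quoted there; the function name compare and the build labels are hypothetical.

```python
# Wall times (sec) from the two output1 lines quoted above.
gfortran_openblas = {"HAMILT": 5.7, "HNS": 1.6, "DIAG": 13.3}
ifort_mkl = {"HAMILT": 4.0, "HNS": 2.7, "DIAG": 21.3}

def compare(a, b, name_a, name_b):
    """For each step return (faster build, time ratio slower/faster)."""
    result = {}
    for step in a:
        if a[step] <= b[step]:
            result[step] = (name_a, b[step] / a[step])
        else:
            result[step] = (name_b, a[step] / b[step])
    return result

comparison = compare(gfortran_openblas, ifort_mkl, "gfortran+openblas", "ifort+mkl")
for step, (winner, ratio) in comparison.items():
    print(f"{step:6s} faster with {winner} by a factor of {ratio:.2f}")
```

Since DIAG is by far the largest contribution, the library used for the diagonalization matters more for the total time than the compiler used for HAMILT.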

Intel i9-10980XE    3.0 GHz (18 cores, Hyperthreading on, Ifort  19.0.1.144)
k=1, omp_global=1, Total 27.1 sec, 27.1 sec/k-point
k=1, omp_global=2, Total 14.7 sec, 14.7 sec/k-point
k=1, omp_global=4, Total  9.1 sec, 9.1 sec/k-point
k=1, omp_global=8, Total  8.0 sec, 8.0 sec/k-point
k=1, omp_global=18, Total 6.9 sec, 6.9 sec/k-point

Intel i7-8700K  3.7 GHz  omp_global=1, Total 20.4 sec (Hyperthreading off, Ifort  19.0.1.144)
Intel i7-8700K  3.7 GHz  omp_global=2, Total 12.4 sec
Intel i7-8700K  3.7 GHz  omp_global=3, Total 11.4 sec
Intel i7-8700K  3.7 GHz  omp_global=4, Total 10.2 sec
Intel i7-8700K  3.7 GHz  omp_global=5, Total 10.2 sec
Intel i7-8700K  3.7 GHz  omp_global=6, Total 10.5 sec

Intel(R) Core(TM) i7-7820X  3.6 GHz (Hyperthreading off, Ifort  19.0.1.144)
k=1, omp_global=1, Total 24.6 sec, 24.6 sec/k-point
k=1, omp_global=2, Total 14.6 sec, 14.6 sec/k-point
k=1, omp_global=3, Total 11.2 sec, 11.2 sec/k-point
k=1, omp_global=4, Total 10.1 sec, 10.1 sec/k-point
k=1, omp_global=5, Total 9.0 sec, 9.0 sec/k-point
k=1, omp_global=6, Total 9.0 sec, 9.0 sec/k-point
k=1, omp_global=7, Total 7.9 sec, 7.9 sec/k-point
k=1, omp_global=8, Total 8.0 sec, 8.0 sec/k-point

Intel Core i7-7820X 3.60GHz       23.5 sec  ifort 21.1 (Hyper. on)(1 job with 1 thread)
Intel Core i7-7820X 3.60GHz       13.4 sec  ifort 21.1            (1 job with 2 threads)
Intel Core i7-7820X 3.60GHz        8.2 sec  ifort 21.1            (1 job with 4 threads)
Intel Core i7-7820X 3.60GHz        6.1 sec  ifort 21.1            (1 job with 8 threads)

Intel Core i7-3930K 3.20GHz       36 sec    composerxe-2013.4.183 (1 job with 1 thread)
Intel Core i7-3930K 3.20GHz       23 sec    composerxe-2013.4.183 (1 job with 2 threads)
Intel Core i7-3930K 3.20GHz       16 sec    composerxe-2013.4.183 (1 job with 4 threads)
Intel Core i7-3930K 3.20GHz       14 sec    composerxe-2013.4.183 (1 job with 6 threads)

IBM p460 Power7                   53 sec     (1 thread only)

Intel Core i7-2600 3.40 GHz       37 sec    composerxe-2011.4.191 (1 job with 1 thread)
Intel Core i7-2600 3.40 GHz       26 sec    composerxe-2011.4.191 (1 job with 2 threads)
Intel Core i7-2600 3.40 GHz       22 sec    composerxe-2011.4.191 (1 job with 4 threads)

Intel Core i7 980x, 3.33GHz       65 sec    ifort11 (+mkl)(1 job with 1 thread)

Intel Core i7 920, 2.66 GHz       91 sec    ifort11 (+mkl)(1 job with 1 thread)
Intel Core i7 920, 2.66 GHz       57 sec    ifort11 (+mkl)(1 job with 2 threads)
Intel Core i7 920, 2.66 GHz       40 sec    ifort11 (+mkl)(1 job with 4 threads)

P4 dual-Xeon, 3.6 GHz            165 sec    ifort9 + mkl8 (1 job with 1 thread!)
P4 dual-Xeon, 3.6 GHz            125 sec    ifort9 + mkl8 (1 job with 2 threads!)

bi-Xeon 5320 (overcl 2.67GHz)    119 sec    ifort9.1 + mkl9.0 (1 job with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)     90 sec    ifort9.1 + mkl9.0 (1 job with 2 threads)
bi-Xeon 5320 (overcl 2.67GHz)     76 sec    ifort9.1 + mkl9.0 (1 job with 4 threads)
bi-Xeon 5320 (overcl 2.67GHz)     69 sec    ifort9.1 + mkl9.0 (1 job with 8 threads)

P4 Core2 Duo E6600, 2.4 GHz      128 sec    ifort10.1+mkl9.1, OMP_NUM_THREADS=1
P4 Core2 Duo E6600, 2.4 GHz      103 sec    ifort10.1+mkl9.1, OMP_NUM_THREADS=2

Xeon X3210 Quadcore 2.13GHz      140 sec    ifort10.1+cmkl10.0  1 job, 1 thread
Xeon X3210 1066 MHz FSB           88 sec    ifort10.1+cmkl10.0  1 job, 2 threads
Xeon X3210                       112 sec    ifort10.1+cmkl10.0  2 jobs, 2 threads
Xeon X3210                       228 sec    ifort10.1+cmkl10.0  4 jobs, 1 thread

IBM 52A  1.90GHz Power5+(1 cpu)  135 sec    xlf10.1,-q64 -O5,ESSL4.2
IBM 52A  (-"-,2 cpus)             83 sec     - " -
IBM 52A  (-"-,2 cpus, SMT=on)     80 sec     - " -

Itanium2(1.6GHz,SGI Altix 3700)  122 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00
Itanium2(-"-, 2 threads)          90 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00

AMD-Opteron, single cpu, 2.4 GHz 190 sec    ifort(9.1.40) + libgoto_opteron64p-r1.09.so
AMD-Opteron, single cpu, 2.8 GHz 167 sec    ifort(10.1.11) + libgoto_opteron64p-r1.23.so


Serial benchmark with parallel jobs (Tests the "real" performance under full load with a "k-parallel" job): NMAT=3481, complex

Intel Core i9-12900K (8 cores@3.2 + 8 cores@2.4GHz), oneapi-2023
k=2, omp=8,     13.5 sec (108.0 sec/16k-points or 6.8 sec/k-point)
k=4, omp=4,     17.1 sec (68.4 sec/16k-points or  4.3 sec/k-point)
k=8, omp=2,     29.9 sec (59.8 sec/16k-points or  3.7 sec/k-point)
k=16, omp=1,    55.8 sec (55.8 sec/16k-points or  3.5 sec/k-point)
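The projected times in parentheses above follow from simple arithmetic. The sketch below assumes that the 16 k-points are processed in batches of k simultaneous jobs and that every batch takes the measured wall time; the function name project is hypothetical.

```python
import math

def project(total_k_points, parallel_jobs, wall_per_batch):
    """Return (total wall time, wall time per k-point) for a full k-point set."""
    batches = math.ceil(total_k_points / parallel_jobs)
    total = batches * wall_per_batch
    return total, total / total_k_points

# (k-parallel jobs, measured wall time in sec for one batch), from the list above.
measured = [(2, 13.5), (4, 17.1), (8, 29.9), (16, 55.8)]
for jobs, wall in measured:
    total, per_k = project(16, jobs, wall)
    print(f"k={jobs:2d}: {total:6.1f} sec for 16 k-points, {per_k:.2f} sec/k-point")
```

For this processor the projection favours many single-thread k-parallel jobs over fewer jobs with more OpenMP threads, provided enough k-points are available to fill all cores.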

Intel i9-10980XE  3.0 GHz (18 cores, Hyperthreading on, 1 thread, Ifort  19.0.1.144)
k=1 27.1 sec; k=2 25.6 sec; k=4 31.2 sec; k=6 31.9 sec; k=9 44.0 sec (4.9 sec/k-point); k=15 62.8 sec (4.2 sec/k-point); k=18 77.5 sec (4.3 sec/k-point)

Intel(R) Core(TM) i7-8700K  3.7 GHz  (1 thread)
k=1 20.4 sec; k=2 22.2 sec; k=3 25.8 sec; k=6 40.8 sec, (6.8 sec/k-point)

Intel Core i7-7820X 3.60GHz    ifort 21.1 (1 thread, Hyperthr off)
k=1 24.6 sec; k=2 24.4 sec; k=4 27.0 sec; k=8 34.4 sec, (4.3 sec/k-point)

Intel Core i7-7820X 3.60GHz    ifort 21.1 (1 thread, Hyperthr on)
1 job: 24 sec;    2 jobs: 26 sec;   4 jobs: 27 sec;   8 jobs: 31 sec (3.9 sec/k-point)

Intel Core i7-7820X 3.60GHz    ifort 21.1 (2 threads)
1 job: 14 sec;    2 jobs: 15 sec;   4 jobs: 19 sec (4.9 sec/k-point);    

Intel Core i7-3930K 3.20GHz    composerxe-2013.4.183 (1 thread)
1 job: 37 sec;    2 jobs: 38 sec;   4 jobs: 47 sec;   6 jobs: 49 sec 

Intel Core i7-3930K 3.20GHz    composerxe-2013.4.183 (2 threads)
1 job: 23 sec;    3 jobs: 39 sec;   6 jobs: 50 sec      

Intel Core i7-2600 3.40 GHz    composerxe-2011.4.191 (1 thread)
1 job: 37 sec;    2 jobs: 43 sec;    4 jobs: 62 sec 

Intel Core i7 980x, 3.33 GHz   ifort11 (+mkl)
1 job (1 thread) 65 sec; 6 jobs (1 thread) 89 sec

Intel Core i7 920, 2.66 GHz   ifort11 (+mkl)
Jobs   1 Thread    2 Threads   4 Threads
1         99            62          41		
2        100            70			
4        104              			

1333 FSB Dual-Clovertown X5355  @ 2.66GHz, 667 Memory					
Jobs   1 Thread    2 Threads   4 Threads   8 Threads
1        132            88          66         62	
2        145           104          98	
4        177           163	

1600 FSB Dual-Harpertown 2.8 GHz with 800 MHz Memory					
Jobs   1 Thread    2 Threads   4 Threads
1        134            83          67		
2        123            94			
4        148           134			

AMD-Opteron Dual CPU/Dual Core 2.8 GHz (IBM 3455), 8 GB RAM DDR2 667MHz
Ifort 10.1.11 + libgotoopteron64p-r1.23.so
Jobs   1 Thread    2 Threads   4 Threads
1        167           120         101		
2        168           122			
4        174           			


A "historical list" can be found here.

MPI-parallel benchmark: NMAT=11571, real, full diagonalization

Intel Core i7-7820X 3.60GHz, ifort 21.1, elpa2020.10, 20Gb Infiniband, openmpi (labels: cores-per-node_nodes: total cores)

serial 4_1:    TIME HAMILT (WALL) =    35.4, HNS =    26.9, HORB =     0.0, DIAG =   113.2, SYNC =     0.0
             > SUM OF WALL CLOCK TIMES:    176.2 (INIT =      0.6 + K-POINTS =    175.6)

serial 8_1:    TIME HAMILT (WALL) =    24.4, HNS =    21.9, HORB =     0.0, DIAG =    81.9, SYNC =     0.0
             > SUM OF WALL CLOCK TIMES:    129.0 (INIT =      0.6 + K-POINTS =    128.4)

4_1 (4 cores)  TIME HAMILT (WALL) =    32.9, HNS =    17.8, HORB =     0.0, DIAG =   106.7, SYNC =     0.2
4_1            ===> TOTAL CPU       TIME:    157.8 (INIT =      0.5 + K-POINTS =    157.3)

8_1:(8 cores)  TIME HAMILT (WALL) =    19.1, HNS =    10.0, HORB =     0.0, DIAG =    74.0, SYNC =     0.1
8_1:           ===> TOTAL CPU       TIME:    103.4 (INIT =      0.5 + K-POINTS =    102.9)

8_2:(16 cores) TIME HAMILT (WALL) =    10.1, HNS =     6.0, HORB =     0.0, DIAG =    48.3, SYNC =     0.0
8_2:           ===> TOTAL CPU       TIME:     64.9 (INIT =      0.5 + K-POINTS =     64.4)


8_4:(32 cores) TIME HAMILT (WALL) =     5.8, HNS =     3.5, HORB =     0.0, DIAG =    40.0, SYNC =     0.0
8_4:           ===> TOTAL CPU       TIME:     49.9 (INIT =      0.5 + K-POINTS =     49.4)	   

8_8:(64 cores) TIME HAMILT (WALL) =     3.1, HNS =     2.1, HORB =     0.0, DIAG =    24.0, SYNC =     0.1
8_8:           ===> TOTAL CPU       TIME:     29.9 (INIT =      0.6 + K-POINTS =     29.3)
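The scaling of the MPI runs above can be summarized as speedup and parallel efficiency. This is a minimal sketch using the TOTAL CPU TIME values of the 4_1 to 8_8 runs and taking the 4-core run as the reference; the function name strong_scaling and the choice of reference are assumptions made here.

```python
def strong_scaling(times, base_cores):
    """times maps core count -> time in sec; returns {cores: (speedup, efficiency)}."""
    base = times[base_cores]
    return {
        cores: (base / t, (base / t) / (cores / base_cores))
        for cores, t in sorted(times.items())
    }

# TOTAL CPU TIME values of the MPI runs above (4_1, 8_1, 8_2, 8_4, 8_8).
i7_7820x_mpi = {4: 157.8, 8: 103.4, 16: 64.9, 32: 49.9, 64: 29.9}
for cores, (speedup, eff) in strong_scaling(i7_7820x_mpi, base_cores=4).items():
    print(f"{cores:2d} cores: speedup {speedup:4.2f}, efficiency {eff:4.2f}")
```

An efficiency of 1.0 would mean that doubling the cores halves the time; the values obtained here show how far this matrix size is from that ideal at larger core counts.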

P4 dual-Xeon, 3.6 GHz, Infiniband, ifort9+cmkl8 (first number: jobs/node; 2nd number: nodes).

aurora_serial: TIME HAMILT (CPU)  =   346.7, HNS =   198.6, DIAG =  1188.6
aurora_serial: TOTAL CPU TIME:   1737.0 (INIT =      3.1 + K-POINTS =   1733.9)

aurora_1_2:    TIME HAMILT (CPU)  =   169.9, HNS =   145.2, DIAG =   991.1
aurora_1_2:    TOTAL CPU TIME:   1309.6 (INIT =      3.1 + K-POINTS =   1306.5)

aurora_1_4:    TIME HAMILT (CPU)  =    88.1, HNS =    78.4, DIAG =   514.4
aurora_1_4:    TOTAL CPU TIME:    684.1 (INIT =      3.0 + K-POINTS =    681.1)

aurora_1_8:    TIME HAMILT (CPU)  =    44.7, HNS =    41.6, DIAG =   304.6
aurora_1_8:    TOTAL CPU TIME:    394.3 (INIT =      3.1 + K-POINTS =    391.2)

aurora_1_16:   TIME HAMILT (CPU)  =    25.0, HNS =    23.7, DIAG =   196.8
aurora_1_16:   TOTAL CPU TIME:    248.8 (INIT =      3.1 + K-POINTS =    245.7)

aurora_1_32:   TIME HAMILT (CPU)  =    14.7, HNS =    13.8, DIAG =   137.6
aurora_1_32:   TOTAL CPU TIME:    169.4 (INIT =      3.1 + K-POINTS =    166.3)

aurora_2_1:    TIME HAMILT (CPU)  =   194.8, HNS =   171.4, DIAG =  1554.2
aurora_2_1:    TOTAL CPU TIME:   1923.8 (INIT =      3.2 + K-POINTS =   1920.7)

aurora_2_2:    TIME HAMILT (CPU)  =   103.2, HNS =    90.1, DIAG =   816.6
aurora_2_2:    TOTAL CPU TIME:   1013.3 (INIT =      3.1 + K-POINTS =   1010.2)

aurora_2_4:    TIME HAMILT (CPU)  =    46.5, HNS =    48.0, DIAG =   427.6
aurora_2_4:    TOTAL CPU TIME:    525.4 (INIT =      3.1 + K-POINTS =    522.3)

aurora_2_8:    TIME HAMILT (CPU)  =    25.5, HNS =    27.6, DIAG =   287.9
aurora_2_8:    TOTAL CPU TIME:    344.4 (INIT =      3.1 + K-POINTS =    341.2)

aurora_2_16:   TIME HAMILT (CPU)  =    15.2, HNS =    15.4, DIAG =   179.6
aurora_2_16:   TOTAL CPU TIME:    213.5 (INIT =      3.1 + K-POINTS =    210.4)

aurora_2_32:   TIME HAMILT (CPU)  =     9.8, HNS =     9.4, DIAG =   127.7
aurora_2_32:   TOTAL CPU TIME:    150.3 (INIT =      3.2 + K-POINTS =    147.1)

SUN AMD-2.4GHz dual-core/dual-cpu, Infiniband, SUNstudio10 (first number: jobs/node; 2nd number: nodes).

luna-serial: TIME HAMILT (CPU)  =   763.3, HNS =   265.0, DIAG =  2255.3
luna-serial: TOTAL CPU TIME:   3286.3 (INIT =      2.5 + K-POINTS =   3283.8)

luna-mpi_2_1:TIME HAMILT (CPU)  =   384.5, HNS =   199.2, DIAG =  1504.9
luna-mpi_2_1:TOTAL CPU TIME:   2091.4 (INIT =      2.5 + K-POINTS =   2088.9)

luna-mpi_4_1:TIME HAMILT (CPU)  =   196.6, HNS =   105.2, DIAG =   785.2
luna-mpi_4_1:TOTAL CPU TIME:   1089.9 (INIT =      2.5 + K-POINTS =   1087.3)

luna-mpi_2_2:TIME HAMILT (CPU)  =   193.7, HNS =   103.7, DIAG =   752.3
luna-mpi_2_2:TOTAL CPU TIME:   1052.6 (INIT =      2.5 + K-POINTS =   1050.1)

luna-mpi_4_2:TIME HAMILT (CPU)  =   102.2, HNS =    58.5, DIAG =   546.2
luna-mpi_4_2:TOTAL CPU TIME:    709.8 (INIT =      2.6 + K-POINTS =    707.2)

luna-mpi_4_4:TIME HAMILT (CPU)  =    53.8, HNS =    31.6, DIAG =   251.6
luna-mpi_4_4:TOTAL CPU TIME:    340.0 (INIT =      2.6 + K-POINTS =    337.4)

luna-mpi_4_8:TIME HAMILT (CPU)  =    31.6, HNS =    18.3, DIAG =   176.9
luna-mpi_4_8:TOTAL CPU TIME:    229.7 (INIT =      2.6 + K-POINTS =    227.1)

Xeon X3210 2.13GHz Quad Core, 1066 MHz FSB (first number: jobs/node; 2nd number: nodes).

1 MPI, 1 Thread     1423 Secs HAMILT (CPU)  =   223.0, HNS=174.1, DIAG=  1021.4
2 MPI, 1 Thread     1242 Secs HAMILT (WALL) =   120.3, HNS=129.8, DIAG=   988.4
4 MPI, 1 Thread     1175 Secs HAMILT (WALL) =    80.1, HNS=116.8, DIAG=   977.9

20 nodes AMD-Opteron Dual CPU/Dual Core 2.8 GHz (IBM 3455), 8 GB RAM DDR2 667MHz, Voltaire 20Gbps Infiniband
Ifort 10.1.11 + libgotoopteron64p-r1.23.so # Intel Cluster MKL, MPI-CH1.2
Even more detailed data can be found here.

 1 core/ 1 node: TIME HAMILT (CPU) = 314.8, HNS = 341.4, HORB = 0.0, DIAG = 1990.4
                 TOTAL CPU TIME: 2648.7 (INIT = 2.1 + K-POINTS = 2646.7)
 4 core/ 4 node: TIME HAMILT (CPU) = 79.6, HNS = 91.9, HORB = 0.0, DIAG = 483.8
                 TOTAL CPU TIME: 657.5 (INIT = 2.1 + K-POINTS = 655.4)
 8 core/ 8 node: TIME HAMILT (CPU) = 40.0, HNS = 48.6, HORB = 0.0, DIAG = 268.4
                 TOTAL CPU TIME: 359.3 (INIT = 2.1 + K-POINTS = 357.2)
16 core/16 node: TIME HAMILT (CPU) = 22.0, HNS = 26.7, HORB = 0.0, DIAG = 159.9
                 TOTAL CPU TIME: 210.8 (INIT = 2.1 + K-POINTS = 208.7)

 4 core/ 1 node: TIME HAMILT (CPU) = 85.7, HNS = 96.0, HORB = 0.0, DIAG = 668.0
                 TOTAL CPU TIME: 851.8 (INIT = 2.1 + K-POINTS = 849.7)
 8 core/ 2 node: TIME HAMILT (CPU) = 40.5, HNS = 50.8, HORB = 0.0, DIAG = 344.8
                 TOTAL CPU TIME: 438.3 (INIT = 2.1 + K-POINTS = 436.3)
12 core/ 3 node: TIME HAMILT (CPU) = 27.6, HNS = 35.8, HORB = 0.0, DIAG = 247.1
                 TOTAL CPU TIME: 312.6 (INIT = 2.1 + K-POINTS = 310.5)
16 core/ 4 node: TIME HAMILT (CPU) = 22.1, HNS = 27.9, HORB = 0.0, DIAG = 194.3
                 TOTAL CPU TIME: 246.6 (INIT = 2.1 + K-POINTS = 244.5)
20 core/ 5 node: TIME HAMILT (CPU) = 18.1, HNS = 22.6, HORB = 0.0, DIAG = 165.3
                 TOTAL CPU TIME: 208.3 (INIT = 2.1 + K-POINTS = 206.3)

* libgoto blas libraries are available from: http://www.tacc.utexas.edu/resources/software/




©2001 by P. Blaha and K. Schwarz