CRYSTAL in parallel: replicated and distributed (MPP) data

Roberto Orlando
Dipartimento di Chimica
Università di Torino
Via Pietro Giuria 5, 10125 Torino (Italy)
roberto.orlando@unito.it

Torino, September 5th, 2013
Why parallel?
•Faster time to solution;
•More available memory;
•High Performance Computing (HPC) resources are available, and not many software packages can run efficiently on thousands of processors.
The programmer’s concerns:
•Load imbalance: the time taken will be that of the longest job;
•Handling communications: the processors will need to talk to each other and
communication is slow;
•Handling Input/Output (I/O): in most cases I/O is slow and should be avoided.
The User’s concerns:
•Choose an appropriate number of processors to run a job depending on the
problem size (mostly determined by the number of basis functions in the unit cell).
Towards large unit cell systems

A CRYSTAL job can be run:
•serially: crystal
•in parallel: Pcrystal or MPPcrystal, both using the Message Passing Interface (MPI) for communications

Pcrystal
•uses replicated data storage in memory

MPPcrystal
•for large unit cell systems on high-performance computers
•uses Linear Algebra Library parallel routines (ScaLAPACK) for diagonalization, matrix products, Cholesky decomposition
•enhanced distribution of data in memory among processors
Running CRYSTAL in parallel
Pcrystal
• full parallelism in the calculation of the interactions (one- and
two-electron integrals)
• distribution of tasks in the reciprocal space: one k point per
processor
• no call to external libraries
• few inter-process communications
MPPcrystal
• full parallelism in the calculation of the interactions (one- and two-electron integrals)
• double-level distribution of tasks in the reciprocal space:
one k point to a subset of processors
• use of Linear Algebra Library parallel routines (ScaLAPACK) for diagonalization, matrix products, Cholesky decomposition
• enhanced distribution of data in memory among processors
• many inter-process communications
Pcrystal and MPPcrystal in action
Example of a CRYSTAL calculation:
•10,000 basis functions
•16 processors available
•4 k points sampled in reciprocal space

[Figure: task distribution in the two parallel modes.
Pcrystal - tasks in real space (integrals): all 16 processors active; tasks in reciprocal space: one k point (k1-k4) per processor, so 4 processors are active and the remaining 12 are idle.
MPPcrystal - tasks in real space (integrals): all 16 processors active; tasks in reciprocal space: each k point is assigned to a subset of 4 processors (0-3, 4-7, 8-11, 12-15), so all processors remain active.]
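To make the picture concrete, the sketch below (plain Python, illustrative only; the function names are invented and do not exist in CRYSTAL) maps the reciprocal-space work of this example onto the 16 processors in the two modes.

```python
def pcrystal_map(n_procs, k_points):
    """Replicated data: one k point per processor; leftover processors idle."""
    return {p: ([k_points[p]] if p < len(k_points) else [])
            for p in range(n_procs)}

def mppcrystal_map(n_procs, k_points):
    """Distributed data: each k point is handled by a subset of processors."""
    subset = n_procs // len(k_points)              # processors per k point
    return {k: list(range(i * subset, (i + 1) * subset))
            for i, k in enumerate(k_points)}

k_points = ["k1", "k2", "k3", "k4"]
print(pcrystal_map(16, k_points))    # processors 4-15 idle in this step
print(mppcrystal_map(16, k_points))  # k1 -> 0-3, k2 -> 4-7, k3 -> 8-11, k4 -> 12-15
```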
Pcrystal - Implementation
Standard compliant:
•Fortran 90
•MPI for message passing
Replicated data:
•Each k point is independent: each processor performs the linear algebra
(FC=EC) for a subset of the k points that the job requires;
•Very few communications (potentially good scaling), but potential load
imbalance;
•Each processor has a complete copy of all the matrices used in the linear
algebra;
•The limit on the size of a job is given by the memory required to store the linear algebra matrices for one k point;
•The number of k points limits the number of processors that can be exploited: in general Pcrystal scales very well provided the number of processors ≤ the number of k points.
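A back-of-the-envelope model of this last point (illustrative Python, not CRYSTAL code): with replicated data the reciprocal-space step cannot go faster than the number of k points allows, because each processor must handle at least one whole k point.

```python
import math

def pcrystal_kspace_speedup(n_k, n_procs):
    """Idealized speedup of the reciprocal-space step: each processor handles
    ceil(n_k / n_procs) k points (equal cost per k point, no communication)."""
    rounds = math.ceil(n_k / min(n_procs, n_k))
    return n_k / rounds

for n_procs in (2, 4, 8, 16, 32):
    print(n_procs, pcrystal_kspace_speedup(8, n_procs))
# the speedup saturates at 8 (= number of k points); extra processors gain nothing
```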
MPPcrystal – Implementation I
Standard compliant:
•Fortran 90
•MPI for message passing
•ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices
•www.netlib.org/scalapack/scalapack_home.html
Distributed data:
•Each processor holds only a part of each of the matrices used in the linear
algebra (FC=EC);
•Number of processors that can be exploited is NOT limited by the number of k
points (great for large Γ point only calculations);
•Use ScaLAPACK for, e.g.:
 Cholesky decomposition
 matrix-matrix multiplies
 linear equation solves
•As the data are distributed, communications are required to perform the linear algebra;
•However, there are N³ operations but only N² data to communicate.
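For reference, ScaLAPACK lays each matrix out block-cyclically over a two-dimensional process grid. The sketch below (plain Python, an illustration of the layout only, not the CRYSTAL implementation) returns the process-grid coordinates owning a given global matrix element, assuming square blocks of size nb.

```python
def owner(i, j, nb, p_rows, p_cols):
    """Process-grid coordinates owning global element (i, j) of a matrix
    distributed block-cyclically with nb x nb blocks over a
    p_rows x p_cols process grid (all indices 0-based)."""
    return (i // nb) % p_rows, (j // nb) % p_cols

# e.g. a 2 x 2 process grid with 64 x 64 blocks:
print(owner(100, 300, 64, 2, 2))   # -> (1, 0)
```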
MPPcrystal – Implementation II
Scaling:
•Scaling gets better for larger systems;
•Very rough rule of thumb: a system with N basis functions can exploit up to around N/20 processors (optimal ratio: N/50);
•One further method that MPPcrystal uses is multilevel parallelism: with 4 real k points and 32 processors, each diagonalization is done by 8 processors, so each diagonalization only has to scale to fewer processors;
•This is complicated by complex k points;
•Very useful for medium-to-large systems (for a big enough problem it can scale very well).
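The rules of thumb above can be condensed into a small helper (illustrative Python; the thresholds N/20 and N/50 and the 4-k-point/32-processor example come from this slide, the 77,560-AO figure from the MCM-41 X10 slide below).

```python
def suggested_procs(n_basis):
    """Rule of thumb: up to ~N/20 processors can be exploited for
    N basis functions; ~N/50 is closer to the optimal ratio."""
    return n_basis // 50, n_basis // 20

def procs_per_k(n_procs, n_real_k):
    """Multilevel parallelism: one subset of processors per real k point;
    each diagonalization runs only within its subset."""
    return n_procs // n_real_k

print(suggested_procs(77560))   # MCM-41 X10 (77,560 AOs): ~1551 to ~3878 processors
print(procs_per_k(32, 4))       # 4 real k points on 32 processors -> 8 each
```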
Features not implemented in MPPcrystal:
•MPPcrystal will fail quickly and cleanly if a requested feature is not implemented, such as:
 symmetry adaptation of the Crystalline Orbitals (for large high-symmetry systems Pcrystal may be more effective)
 CPHF
 Raman intensities
MCM-41 mesoporous material model
P. Ugliengo, M. Sodupe, F. Musso, I. J. Bush, R. Orlando, R. Dovesi, Advanced Materials 20, (2008).
B3LYP approximation
Hexagonal lattice with P1 symmetry
580 atoms per cell (7800 basis functions)
[Figure: B3LYP simulated IR spectrum compared with the spectrum recorded on a micelle-templated silica (MTS/423 K), calcined at 823 K and water-outgassed at 423 K; 3800-3000 cm⁻¹ range.]

[Figure: simulated powder spectrum: no relevant reflections at higher 2θ because of short-range disorder.]
MCM-41: increasing the unit cell
R. Orlando, M. Delle Piane, I. J. Bush, P. Ugliengo, M. Ferrabone, R. Dovesi, J. Comput. Chem. 33, 2276 (2012).

Supercells of the MCM-41 have been grown along the c crystallographic axis: Xn (the side along c is n times that in X1).
X10 contains 77,560 AOs in the unit cell.

Calculations run on IBM SP6 at Cineca:
•Power6 processors (4.7 GHz) with peak performance of 101 Tflop/s
•Infiniband X4 DDR internal network

[Figure: speedup vs number of cores (NC) for SCF + total energy gradient calculations, with Speedup = 32 · T32 / TNC (T32 and TNC: wall-clock times on 32 and NC cores, respectively).]
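A small helper for the speedup definition used in the plot, with the 32-core run as reference (illustrative Python; the timings in the example are hypothetical, only the definition comes from the slide).

```python
def speedup(t_32, t_nc):
    """Speedup relative to the 32-core run: 32 * T32 / TNC."""
    return 32.0 * t_32 / t_nc

def efficiency(t_32, t_nc, n_cores):
    """Parallel efficiency: speedup divided by the number of cores used."""
    return speedup(t_32, t_nc) / n_cores

# hypothetical timings: 10.0 h on 32 cores, 1.4 h on 256 cores
print(speedup(10.0, 1.4))          # ~228.6
print(efficiency(10.0, 1.4, 256))  # ~0.89
```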
MCM-41: scaling of the main steps in MPPcrystal

[Figure: parallelization efficiency of the main steps for MCM-41 supercells (e.g. X4): two-electron integrals, one-electron integrals, Fock matrix diagonalization, exchange-correlation functional integration, total energy gradient, preliminary steps. Percentage data measure the parallelization efficiency; data in parentheses give the amount of time spent in each task.]
Running MCM-41 on different HPC architectures
[Figure: benchmark of the MCM-41 X1 system on three HPC architectures:
•IBM Blue Gene/P at Cineca (Bologna)
•Cray XE6 - HECToR (Edinburgh)
•IBM SP6 at CINECA (Bologna)]
Memory storage optimization
Typical error message with the old static memory allocation: "TOO MANY K POINTS IN THE PACK-MONKHORST NET: INCREASE LIM001"
Most of the static allocations have been made dynamic:
•array size now fits the exact memory requirement;
•no need to recompile the code for large calculations;
•a few remaining fixed limits can be extended from input:
CLUSTSIZE (maximum number of atoms in a generated cluster; default setting: number of atoms in the unit cell)
LATVEC (maximum number of lattice vectors to be classified; default value: 3500).
natom²-size arrays “are distributed” among the cores.
Data are removed from memory as soon as they are not in use.
LOWMEM option
The LOWMEM keyword avoids the allocation of large arrays, generally at the cost of a slight increase in CPU time (it is active by default in MPPcrystal):
•atomic orbital pair elements in matrices are located in real time without storing a
large table into memory
•Fock and Density matrices are only stored in their “irreducible” forms; symmetry
related elements are computed in real time
•Expansion of AO pair electron density for the “bipolar” approximation of 2-electron
integrals into multipole moments is performed in real time instead of storing large
buffers to memory
•Information about the grid of points used in DFT exchange-correlation functional
integration (point cartesian coordinates, multiplicity, Becke’s weights) is distributed
among processors
Dynamically allocated memory can be monitored by means of the MEMOPRT and MEMOPRT2 keywords.
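As an illustration of the grid-point distribution mentioned above, a generic round-robin partition of the integration-grid data over processes might look as follows (plain Python sketch, not the actual CRYSTAL data layout).

```python
def distribute_grid(points, n_procs):
    """Assign DFT integration grid points to processes round-robin, so each
    process stores only ~len(points)/n_procs of the per-point data
    (cartesian coordinates, multiplicity, Becke weight)."""
    return {rank: points[rank::n_procs] for rank in range(n_procs)}

points = list(range(10))   # stand-ins for per-point records
print(distribute_grid(points, 4))
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```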
Speeding up two-electron integrals
F12^g = Σ_{3,4} Σ_l P34^l Σ_h [ (1^0 2^g | 3^h 4^(h+l)) − ½ (1^0 3^h | 2^g 4^(h+l)) ]

In each two-electron integral the first pair of indices (with their lattice-vector superscripts) refers to electron 1, the second pair to electron 2.

Integrals are screened on the basis of the overlap between atomic orbitals: in large unit cells a lot of (3, 4) pairs do not overlap with (1, 2^g).
The following integrals are equivalent by atomic orbital permutation:

(1^0 2^g | 3^h 4^(h+l)) = (2^0 1^(-g) | 3^(h-g) 4^(h+l-g)) = (3^0 4^l | 1^(-h) 2^(g-h)) = (3^0 4^l | 2^(g-h) 1^(-h))

Permutation symmetry is implemented for P1 symmetry.

[Figure: wall-clock time T (sec) for the two-electron integrals vs supercell size Xn (n = 0-10), comparing the "Linearization" and "Permutation symmetry" schemes.]
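A minimal Python sketch of how this permutational equivalence cuts the number of integrals to evaluate, shown for the single-cell (molecular-like) case with all lattice vectors set to zero; the helper names are invented for illustration.

```python
from itertools import product

def canonical(i, j, k, l):
    """Canonical representative of the quartet (i j | k l) under
    (ij|kl) = (ji|kl) = (ij|lk) = (kl|ij)."""
    ij = (max(i, j), min(i, j))               # order within electron 1
    kl = (max(k, l), min(k, l))               # order within electron 2
    return ij + kl if ij >= kl else kl + ij   # order the two pairs

def unique_quartets(n):
    """Keep one representative per equivalence class out of n**4 quartets."""
    return {canonical(*q) for q in product(range(n), repeat=4)}

n = 10
print(len(unique_quartets(n)), "unique out of", n ** 4)   # 1540 out of 10000
# the saving approaches a factor of 8 as n grows
```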
Improved memory storage in Pcrystal
Fg → Fk → Vk† Fk Vk

Transformation of the Fock and the Density matrices into the basis set of the Symmetry-Adapted Crystalline Orbitals (SACO) is operated directly from the “irreducible” Fg to each block of Vk† Fk Vk (one block per irreducible representation), without forming the full blocks of Fk:
•the maximum size of the matrices to be diagonalized is that of the largest block;
•parallelization goes from k points down to the irreducible representations (many more than the number of k points in highly symmetric cases).
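The gain can be illustrated with a short numpy sketch (an illustration of the linear algebra only; it assumes the orthonormal SACO coefficient blocks V_blocks have already been built): each irreducible-representation block of Vk† Fk Vk is formed and diagonalized on its own, so the largest matrix ever diagonalized is the largest block.

```python
import numpy as np

def diagonalize_by_blocks(F, V_blocks):
    """Diagonalize the Fock matrix F (Hermitian, one k point) block by block.
    V_blocks: list of (N, n_b) arrays whose columns are the orthonormal
    symmetry-adapted crystalline orbitals spanning one irreducible
    representation (assumed precomputed)."""
    eigvals = []
    for V in V_blocks:
        F_block = V.conj().T @ F @ V          # n_b x n_b block of Vk† Fk Vk
        eigvals.append(np.linalg.eigvalsh(F_block))
    return np.sort(np.concatenate(eigvals))
```

Since dense diagonalization costs roughly the cube of the matrix size, replacing one N x N diagonalization with blocks of size n_b reduces the diagonalization work by about (N/n_b)² when the blocks are of comparable size (cf. the block sizes in the next table).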
Memory storage for fullerenes of increasing size
(n,n)-Fullerenes

n     NAO      Sirr      Sred
1     840      1759      169980
2     3360     6707      716130
3     7560     14570     1609020
4     13440    25377     2847690
5     21000    39047     4432140
6     30240    55661     6362370
7     41160    75138     8638380
8     53760    97559     11260170
9     68040    122843    14227740
10    84000    151071    17541090

NAO: number of basis functions
Sirr: size of the “irreducible” part of the overlap matrix represented in real space (number of matrix elements)
Sred: size of the full overlap matrix represented in real space (number of matrix elements)

[Figure: the (7,7)-fullerene.]
Fullerenes: matrix block size in the SACO basis
(n,n)     Ag    Au    F1g    F1u    F2g    F2u    Gg     Gu     Hg     Hu     NAO
(1,1)     10    4     18     24     18     24     28     28     38     32     840
(2,2)     34    22    78     90     78     90     112    112    146    134    3360
(3,3)     72    54    180    198    180    198    252    252    324    306    7560
(4,4)     124   100   324    348    324    348    448    448    572    548    13440
(5,5)     190   160   510    540    510    540    700    700    890    860    21000
(6,6)     270   234   738    774    738    774    1008   1008   1278   1242   30240
(7,7)     364   322   1008   1050   1008   1050   1372   1372   1736   1694   41160
(8,8)     472   424   1320   1368   1320   1368   1792   1792   2264   2216   53760
(9,9)     594   540   1674   1728   1674   1728   2268   2268   2862   2808   68040
(10,10)   730   670   2070   2130   2070   2130   2800   2800   3530   3470   84000

tSCF: wall-clock time (in seconds) for running 20 SCF cycles on a single core: 316, 6565, 20054, 47412, 87445.
Conclusions
CRYSTAL
•can be run in parallel on a large number of processors efficiently, with very
good scalability;
•is portable to different HPC platforms;
•allowed the calculation of the total energy and wavefunction of MCM-41 X14, containing more than 100,000 basis functions (8000 atoms), on 2048 processors;
•has been improved as concerns data storage in memory;
•has been made more efficient in the calculation of the Coulomb and exchange series;
•Memory storage for highly symmetric cases has been drastically reduced by
extending the use of SACOs to all steps in reciprocal space;
•Task farming in Pcrystal will soon be moved from the k-point level to that of
the irreducible representations.