HighSpeed PseudoOrthogonalization for the CarParrinello Method
Tomoo AOYAMA, Takatoshi HIGUCHI and Umpei NAGASHIMA
Return
1 Introduction
The CarParrinello method [1] is a wellknown approach for abinitio calculations to research properties of solids and surfaces. The method requires huge CPU power, and the dominant part is 3DFFT [2, 3] until 1000 atoms. However, over 1000 atoms the part is the Schmidt orthogonalization [4]. Because, the number of 3DFFT calculations is O(M^{2}logM), and that of the Schmidt orthogonalization is O(M^{3}), where M is the number of vectors and O(X) means on the order of X. In this study, our objective was to reduce the number of calculations of the Schmidt orthogonalization by considering the arithmetic operations, and data transfer (for alignment of configurable hardware such as Field Programmable Gate Array (FPGA)).
Several approaches were considered. One approach was to improve the rate of data transfer [5]. Although this approach cannot reduce the number of calculations, it can be effective in improving data transfer among caches and main storage in parallel machines. Possible approaches to reduce the number of calculations are limited, because essential calculations of rotation and matrixproducts are already algorithms of O(M^{3}).
An approximate orthogonalization therefore seems to be the best option. Such an approximation is acceptable because the SelfConsistent Field (SCF) step in the CarParrinello method does not require decimal 13 digits at the nonconvergence stages.
2 Approximation approaches
2. 1 Direct generation of orthogonal vectors
One approach is to generate orthogonal vectors directly without the orthogonalization. The number of rotation calculations of M vectors is O(M^{3}). If the rotations are limited to M' elements (<M), this number decreases to O(M'M^{2}). Consider the following vectors,
where dik is the delta function, sign{} assigns the sign of the second argument to a variable of the first argument, and random() is a function to generate uniform random numbers [6] in the interval [1, 1]. The set {(Ai)k; k=1,2,...M<=N} is orthogonal. For a set of finite rotations, we therefore can construct an orthogonal vector set, although it cannot simulate many waves. We did not adopt this approach.
2. 2 Pseudoorthogonalization
Another approach is approximate orthogonalization as follows. In the Schmidt orthogonalization, a normalized vector is (Ai), and the set is {(Ai)_{k}; i<=N, k<=M}, where M is the number of vectors and N is the number of elements. A scalar P_{kl} based on their product is then
An orthogonal set {(Bi)_{k}} can be defined as
The number of calculations is N*M*(M1)/2. To reduce this number, we introduced the following pseudoorthogonalization:
This orthogonalization is valid when K=M or when {(Ai)_{k}}=dik. The vector set {(Ci)_{k}} is approximately orthogonal. The validity of this pseudoorthogonalization needs to be tested for K.
3 Numerical tests
3. 1 Vectors constructed using uniform random numbers
To test the validity of the pseudoorthogonalization, first we define the following vectors,
The random numbers are uniform in the interval [1, 1]. The vectors are normalized as
The average error of the orthogonalization was estimated as follows.

A vector set {(Ai)_{k}} was regarded as a matrix A_{ik},

Offdiagonal elements, A_{ik} (i<>k), are zero in ideal orthogonalization. Thus, S_{i<>k}A^{2}_{ik}=0.

The precision of the pseudoorthogonalization was then represented by the error amplitude Lik and the average error amplitude ~L as follows:
Figure 1 shows ~L for calculation using Eqs. (5) and (6) with {(Ai)_{k}} as a function of dimensionless D={1, 3, 5, 7, 10, 15, 20, 25, 30} and N={262, 372, 528, 748, 1060, 1500}.
Figure 1. Average error ~L of the pseudoorthogonalization.
The results reveal that ~L decreases with increasing D. The orthogonalization is complete at K=M (where K is in Eq. (6)). The result, ~L < O(2) when D>=30, N=1500, is acceptable for the initial SCF steps of the CarParrinello method. When N is a sufficiently large number, ~L ®0 due to the normalization of Eq. (8). The CPU speedup ratio is M/K.
Figure 2 shows the maximum Lik, max(Lik), of the pseudoorthogonalization.
Figure 2. Maximum error max(Lik) of the pseudoorthogonalization.
The only deficiency in this pseudoorthogonalization is that max(Lik) does not approach zero when N®¥. The nonzeros are in the space between K and M of Eq. (6). That is {i<=K and k<=K: (Ai)_{k}=dik; otherwise: (Ai)_{k}=Lik<>0}. If the waves, vectors in the CarParrinello method, are distributed in the space, the pseudoorthogonalization is not valid.
3. 2 Diagonal dominance vectors
The orthogonalization is complete when {(Ai)_{k}}=dik. When the vectors are arranged into a matrix that has diagonal dominance, and thus are called "diagonal dominance vectors," the pseudoorthogonalization approaches complete transformation. Here, these dominance vectors were constructed as
where random() is a function to generate uniform random numbers, and R is a small positive coefficient.
When R=0.1, the amplitude of the offdiagonal elements is O(1). When {R=O(n), n>=3}, however, the amplitude is smaller, namely, O(n1), and depends on D and N. Figure 3 shows ~L for the offdiagonal elements. When the boundary condition is R=0.01, there is little difference between ~L for N=262 and that for N=1500. In conclusion, the pseudoorthogonalization is effective for small N (i.e., N300).
Figure 3. Average error ~L for the offdiagonal elements of diagonal dominance vectors.
4 Precision of pseudoorthogonalization for initial vectors
4. 1 Band vector
The ordering of nonorthogonal vectors, namely, initial vectors, can be considered a set. Here, we consider this set as a matrix, and define a "band vector" as a bandstructured matrix. In the CarParrinello method, the band vector is a localized wave on the atoms related to the band and is defined as
where W is the band width and R is the same coefficient as in Eq. (11). Eq. (12) is for the band part. The band vector is normalized also. Figure 4 shows the precision of the offdiagonal elements when R=0.1 and W=3; i.e., the band width is 7. The initial sharp decrease in ~L reveals the band structure. When N=1500, ~L is 2.1 lower, indicating an improvement in precision. In conclusion, the use of band vectors in the pseudoorthogonalization is effective for localized waves.
Figure 4. Average error ~L of the offdiagonal elements in a band vector orthogonalization.
4. 2 Alternative sign vectors
The band vector improves the precision of the pseudoorthogonalization certainly; however, it is restricted within localized waves. We wish the approach to general waves. We consider orthogonal vectors,
where Re[] and Im[] are extraction functions of the real and imaginary parts of the complex. Although the vectors themselves have not been adopted in the CarParrinello method because of the small degreeoffreedom, the sign of the vectors can be used to improve the precision of nonlocalized waves. We wish to gain precision from cost of the degreeoffreedom. First, we consider a set of synthesized random numbers,
where Sgn() is a function to obtain the sign of the vector and sign{} is a function in Fortran language. Here, Srandom() is called "signrestricted (random) numbers." Using these signrestricted numbers and the concept of band vector, the real parts of the initial vectors can be defined as,
Note the absence of the parameter R as in Eq. (13). Table 1 shows the precision of the offdiagonal elements calculated using the signrestricted numbers for N=1500. This approach improved the precision by a factor of about 1.5, without requiring an Rparameter and without increasing the CPU time.
Table 1. Average precision of offdiagonal elements of an orthogonal set.
D  Original  Signrestricted  Improvement 
10  0.0171  0.0114  1.5 
15  0.0155  0.0101  1.5 
20  0.0135  0.0090  1.5 
25  0.0119  0.0080  1.5 
30  0.0099  0.0070  1.4 
5 Application of pseudoorthogonalization to the CarParrinello method
We coded a complex form of the pseudoorthogonalization into the SCF part of the CarParrinello method, where M=16, N=147, D=5, and the molecule was bulksilicon. Compared with metals, bulksilicon is a semiconductor and the wave function is localized on atoms. Figure 5 shows the calculated total energy (in units of a.u.) as a function of SCF iterations.
The difference in energy calculated using the pseudoorthogonalization and that using the Schmidtorthogonalization was less than 10^{5} [a.u.], which is sufficient precision to discuss the energy status. Both calculation methods converged relatively rapidly.
Figure 5. Convergence of SCF energies for bulksilicon calculated using pseudoorthogonalization and Schmidt orthogonalization.
6 Hardware problems using FPGA
When pseudoorthogonalization is performed using a FPGA with 10^{6} gates [7], the Input/Output (I/O) interface is the key component. Despite the vast arithmetic resources currently available, the number of I/O pins is inadequate to take advantage of pseudoorthogonalization on decimal 13 digits. This problem of a narrow interface needs to be solved by hardware designers today. The problem is found generally on 64 bits floating point numerical calculations.
6. 1 Data compression
Several data compression methods are currently available for general applications. One method involves a simple algorithm specialized for an I/O interface for an 8byte Floating Point expression (FP8), which has 13 significant digits. Pseudoorthogonalization has about 23 significant digits. The element values of orthogonal vectors are in the [x, +x] interval, where x is a positive finite number less than 1. On the condition, the FP8 should be converted to the fixed point format based on the complementary expression of 2. The fixed point format has multiple expressions of 18 bytes, which express numbers in [2^{n}, +2^{n}1], n={7, 15, 31, 63}. The conversion and the reverse algorithms are
where X is a number in FP8, and T is a number in the fixed point format of nbits. These can be expressed as FTn, which can be transferred through nbits interface pins. Thus, such conversion improves the data transfer rate in pseudoorthogonalization by 64/n.
6. 2 Precision of the FP8 to FT8/FT16 conversion
Table 2 shows the precision of conversions between FP8 and FT8/FT16 formats using uniform random numbers in the interval [1,1] for N=1500.
Table 2. Precision of the offdiagonal elements (n=8) and band parts (n=16, or diagonal elements) in the conversion from 8 or 16bit fixedpoint format to 8byte floating point format.
 Speedup ratio  Stand.Dev.  Maximum error 
n=8  8  0.0046  0.0079 
n=16  4  0.000018  0.000031 
The precision for the offdiagonal elements and band parts was O(3) and O(5), respectively, both of which are sufficient to transfer pseudoorthogonalization vectors. In conclusion, because the data transfer in pseudoorthogonalization predominantly involves offdiagonal elements, the conversion can increase the transfer rate through the interface by a factor of 8.
7 Conclusions
A highspeed orthogonalization method derived from the Schmidt orthogonalization in partial space was developed. Because this is an approximate transformation, it is called a pseudoorthogonalization. The speedup ratio is about 3.3, and the precision is O(2), which was estimated based on the amplitude of the offdiagonal elements in 1500^{th} dimensional vectors. Both the speedup ratio and precision are affected by the dimensions of the partial space.
Although this pseudoorthogonalization is highspeed, its precision is insufficient for general applications. We therefore considered two approaches to improve the precision: introduction of a band structure for vectors, and use of alternative sign random numbers. The precision improved by a factor of 2.1 and 1.5, respectively, without increasing the CPU time. The first approach can be used only for localized waves, whereas the latter can be used for general applications.
To demonstrate the applicability of this highspeed pseudoorthogonalization for largescale calculations, we coded a complex form of the pseudoorthogonalization into the SCF part of the CarParrinello method. The difference in total energy for bulksilicon calculated using the pseudoorthogonalization and the Schmidt orthogonalization was less than O(5). Although this is acceptable, the orthogonalization is an approximation with O(2); and therefore should be replaced by the original Schimidtorthogonalization at the convergence limit of the SCF.
In further research, we will execute abinitio calculations using configurable devices such as FPGA. Recent advances in FGPA include advances in various arithmetic resources, cache, FIFO (First In First Out memory), and programmable mega gates. The "bottleneck" seems to be I/O interfaces. To overcome this, in this study we considered a data compression method for the 8byte floating point format. The compression ratio is 4 or 8, and the precision is O(5) or O(3), respectively, both of which are sufficient to rapidly transfer pseudoorthogonalization vectors.
References
[ 1] Car, R. and Parrinello, M., Phys. Rev. Lett., 55, 24712474 (1985).
[ 2] Tohru Sasaki and Kazuaki Murakami, Reconfigurable 3DFFT Processor for the CarParrinello Method (in Japanese), J. Comput. Chem. Jpn, 4, 147154 (2005).
[ 3] A representative mathematical library for 3D_FFT is Intel Math Kernel Library (version 8.1). The introduction is
http://www.xlsoft.com/jp/product/intel/perflib/mkl/whatsnew.html
[ 4] http://www.cc.tsukuba.ac.jp/mimosa/workshop/2004/
[ 5] Hiroshi Murakami, An orthogonalization to realize massive datatransfer effectively for huge dimensional vectors (in Japanese), Soc. of Compt. Chem. of Japan, annual conference, 2P14 (2004).
[ 6] D. E. Knuth, The Art of Computer Programming, Stanford Univ. Press.
[ 7] Introduction for Virtex4 FPGA family, Xcelljournal, 52, XILINX Co. Ltd., San Jose, CA, 2005.
Return