High-Speed Pseudo-Orthogonalization for the Car-Parrinello Method

Tomoo AOYAMA, Takatoshi HIGUCHI and Umpei NAGASHIMA


1 Introduction

The Car-Parrinello method [1] is a well-known approach for ab-initio calculations of the properties of solids and surfaces. The method requires huge CPU power; up to about 1000 atoms the dominant part is the 3D-FFT [2, 3], but beyond 1000 atoms it becomes the Schmidt orthogonalization [4]. This is because the number of 3D-FFT operations is O(M^2 log M), whereas that of the Schmidt orthogonalization is O(M^3), where M is the number of vectors and O(X) means on the order of X. In this study, our objective was to reduce the number of calculations in the Schmidt orthogonalization by considering both the arithmetic operations and the data transfer (with a view to implementation on configurable hardware such as Field Programmable Gate Arrays (FPGAs)).
Several approaches were considered. One is to improve the rate of data transfer [5]. Although this cannot reduce the number of calculations, it is effective in improving data transfer among caches and main storage in parallel machines. Approaches that reduce the number of calculations are limited, because the essential calculations, rotations and matrix products, are already O(M^3) algorithms.
An approximate orthogonalization therefore seems to be the best option. Such an approximation is acceptable because the Self-Consistent Field (SCF) iteration in the Car-Parrinello method does not require 13 decimal digits of precision at the non-converged stages.

2 Approximation approaches

2.1 Direct generation of orthogonal vectors

One approach is to generate orthogonal vectors directly, without any orthogonalization. The number of rotation calculations for M vectors is O(M^3). If the rotations are limited to M' elements (M' < M), this number decreases to O(M'M^2). Consider the following vectors,

where δik is the Kronecker delta, sign{} assigns the sign of its second argument to its first argument, and random() generates uniform random numbers [6] in the interval [-1, 1]. The set {(Ai)k; k=1,2,...,M<=N} is orthogonal. With a finite number of rotations, we can therefore construct an orthogonal vector set, although it cannot represent many kinds of waves. We did not adopt this approach.

2.2 Pseudo-orthogonalization

Another approach is an approximate orthogonalization, as follows. In the Schmidt orthogonalization, a normalized vector is (Ai), and the set is {(Ai)k; i<=N, k<=M}, where M is the number of vectors and N is the number of elements. A scalar Pkl based on their inner product is then

An orthogonal set {(Bi)k} can be defined as

The number of calculations is N*M*(M-1)/2. To reduce this number, we introduced the following pseudo-orthogonalization:
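In the notation above, the Schmidt step just described can be sketched as follows. This is a minimal NumPy illustration; the function and array names are ours, not the paper's.

```python
import numpy as np

def schmidt_orthogonalize(A):
    """Classical Schmidt orthogonalization.

    A has shape (M, N), one vector per row. Each vector is projected
    against all previously processed vectors, which gives the
    N*M*(M-1)/2 projection cost quoted in the text.
    """
    B = A.astype(float).copy()
    M = B.shape[0]
    for k in range(M):
        for l in range(k):
            B[k] -= np.dot(B[k], B[l]) * B[l]   # Pkl = (Bk, Bl)
        B[k] /= np.linalg.norm(B[k])            # normalize the result
    return B
```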

This orthogonalization is exact when K=M or when {(Ai)k}=δik. The vector set {(Ci)k} is only approximately orthogonal, so the validity of the pseudo-orthogonalization must be tested as a function of K.
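Eq. (6) itself is not reproduced in this text. One plausible reading, consistent with the statements that the transformation is exact at K=M and that the CPU speed-up ratio is M/K, is to project each vector against only the K preceding vectors. The sketch below is our assumption, not the authors' code.

```python
import numpy as np

def pseudo_orthogonalize(A, K):
    """Hypothetical sketch of the pseudo-orthogonalization (Eq. (6)):
    each vector is projected against at most the K preceding vectors
    instead of all of them, so the cost falls from O(N*M^2) to
    roughly O(N*M*K). The transformation is exact when K >= M."""
    C = A.astype(float).copy()
    M = C.shape[0]
    for k in range(M):
        for l in range(max(0, k - K), k):        # only K nearest vectors
            C[k] -= np.dot(C[k], C[l]) * C[l]
        C[k] /= np.linalg.norm(C[k])             # normalization, Eq. (8)
    return C
```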

3 Numerical tests

3.1 Vectors constructed using uniform random numbers

To test the validity of the pseudo-orthogonalization, first we define the following vectors,

The random numbers are uniform in the interval [-1, 1]. The vectors are normalized as

The average error of the orthogonalization was estimated as follows.
  1. A vector set {(Ai)k} was regarded as a matrix Aik,
  2. Off-diagonal elements Aik (i ≠ k) are zero in an ideal orthogonalization; thus Σi≠k Aik^2 = 0.
  3. The precision of the pseudo-orthogonalization was then represented by the error amplitude Lik and the average error amplitude ~L as follows:
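Steps 1-3 can be computed as in the small sketch below; the exact formula for ~L is not reproduced in the text, so we assume it is the mean absolute value of the off-diagonal elements.

```python
import numpy as np

def average_error(A):
    """Average error amplitude ~L (our assumed form): the mean absolute
    value of the off-diagonal elements Aik (i != k), which all vanish
    for an ideal (delta-like) orthogonal set, as stated in step 2."""
    M = A.shape[0]
    off = A[:, :M].astype(float).copy()   # square part holding Aik
    np.fill_diagonal(off, 0.0)            # keep only i != k elements
    return np.abs(off).sum() / (M * (M - 1))
```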

Figure 1 shows ~L calculated using Eqs. (5) and (6) with {(Ai)k}, as a function of the dimensionless parameter D={1, 3, 5, 7, 10, 15, 20, 25, 30} for N={262, 372, 528, 748, 1060, 1500}.

Figure 1. Average error ~L of the pseudo-orthogonalization.

The results reveal that ~L decreases with increasing D. The orthogonalization is complete at K=M (where K is defined in Eq. (6)). The result ~L < O(10^-2) for D>=30 and N=1500 is acceptable for the initial SCF steps of the Car-Parrinello method. When N is sufficiently large, ~L → 0 due to the normalization of Eq. (8). The CPU speed-up ratio is M/K.
Figure 2 shows the maximum Lik, max(Lik), of the pseudo-orthogonalization.

Figure 2. Maximum error max(Lik) of the pseudo-orthogonalization.

The only deficiency of this pseudo-orthogonalization is that max(Lik) does not approach zero as N → ∞. The nonzero elements lie in the space between K and M in Eq. (6); that is, {i<=K and k<=K: (Ai)k=δik; otherwise: (Ai)k=Lik ≠ 0}. If the waves (the vectors in the Car-Parrinello method) are distributed in that space, the pseudo-orthogonalization is not valid.

3.2 Diagonal dominance vectors

The orthogonalization is complete when {(Ai)k}=δik. When the vectors are arranged into a matrix with diagonal dominance, and are thus called "diagonal dominance vectors," the pseudo-orthogonalization approaches the complete transformation. Here, these dominance vectors were constructed as

where random() is a function to generate uniform random numbers, and R is a small positive coefficient.
When R=0.1, the amplitude of the off-diagonal elements is O(10^-1). When {R=O(10^-n), n>=3}, however, the amplitude is smaller, namely O(10^-(n+1)), and depends on D and N. Figure 3 shows ~L for the off-diagonal elements. When the boundary condition is R=0.01, there is little difference between ~L for N=262 and that for N=1500. In conclusion, the pseudo-orthogonalization is effective for small N (i.e., N ≈ 300).
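Eq. (11) is not shown in this text. A plausible construction, consistent with the description (a unit diagonal plus uniform random noise of amplitude R, with rows normalized as in Eq. (8)), is:

```python
import numpy as np

def dominance_vectors(M, N, R, seed=None):
    """Hypothetical form of Eq. (11): delta_ik plus a random
    perturbation of amplitude R, with each row normalized afterwards.
    Assumes M <= N so every vector has a diagonal element."""
    rng = np.random.default_rng(seed)
    A = R * rng.uniform(-1.0, 1.0, size=(M, N))
    A[np.arange(M), np.arange(M)] += 1.0      # diagonal dominance
    return A / np.linalg.norm(A, axis=1, keepdims=True)
```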

Figure 3. Average error ~L for the off-diagonal elements of diagonal dominance vectors.

4 Precision of pseudo-orthogonalization for initial vectors

4.1 Band vector

The ordering of non-orthogonal vectors, namely initial vectors, can be considered a set. Here, we regard this set as a matrix and define a "band vector" through a band-structured matrix. In the Car-Parrinello method, a band vector is a wave localized on the atoms related to the band; it is defined as

where W is the half band width and R is the same coefficient as in Eq. (11). Eq. (12) defines the band part. The band vector is also normalized. Figure 4 shows the precision of the off-diagonal elements for R=0.1 and W=3, i.e., a band width of 7. The initial sharp decrease in ~L reveals the band structure. When N=1500, ~L is lower by a factor of 2.1, indicating an improvement in precision. In conclusion, the use of band vectors in the pseudo-orthogonalization is effective for localized waves.
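Eq. (12) is likewise left to the original; a sketch consistent with the statement that W=3 gives a band width of 7 (i.e., W is the half-width) might read:

```python
import numpy as np

def band_vectors(M, N, W, R, seed=None):
    """Hypothetical band vector of Eq. (12): random entries of
    amplitude R restricted to a band of half-width W (full width 2W+1)
    around the diagonal, plus the unit diagonal; rows normalized."""
    rng = np.random.default_rng(seed)
    A = np.zeros((M, N))
    for i in range(M):
        lo, hi = max(0, i - W), min(N, i + W + 1)
        A[i, lo:hi] = R * rng.uniform(-1.0, 1.0, size=hi - lo)
        A[i, i] += 1.0                         # unit diagonal element
    return A / np.linalg.norm(A, axis=1, keepdims=True)
```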

Figure 4. Average error ~L of the off-diagonal elements in a band vector orthogonalization.

4.2 Alternative sign vectors

The band vector certainly improves the precision of the pseudo-orthogonalization; however, it is restricted to localized waves. We wish to extend the approach to general waves. We therefore consider the orthogonal vectors,

where Re[] and Im[] extract the real and imaginary parts of a complex number. Although these vectors themselves have not been adopted in the Car-Parrinello method because of their small degree of freedom, their signs can be used to improve the precision for non-localized waves. We wish to gain precision at the cost of the degree of freedom. First, we consider a set of synthesized random numbers,

where Sgn() obtains the sign of the vector and sign{} is the Fortran sign-transfer function. Here, S-random() is called "sign-restricted (random) numbers." Using these sign-restricted numbers and the band-vector concept, the real parts of the initial vectors can be defined as

Note the absence of the parameter R of Eq. (13). Table 1 shows the precision of the off-diagonal elements calculated using the sign-restricted numbers for N=1500. This approach improves the precision by a factor of about 1.5, without requiring an R-parameter and without increasing the CPU time.
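Eqs. (14)-(15) are not reproduced in this text. The sketch below assumes the sign pattern comes from a cosine-type orthogonal vector, which is our guess at Eq. (14), and mimics the Fortran sign transfer with copysign:

```python
import numpy as np

def s_random(M, N, seed=None):
    """Hypothetical sign-restricted numbers of Eq. (15): the magnitude
    is a uniform random number, while the sign is copied from an
    orthogonal reference pattern (here cos(2*pi*i*k/N), our
    assumption), mimicking the Fortran sign{} transfer."""
    rng = np.random.default_rng(seed)
    i = np.arange(M)[:, None]
    k = np.arange(N)[None, :]
    pattern = np.cos(2.0 * np.pi * i * k / N)   # assumed Eq. (14) signs
    magnitude = np.abs(rng.uniform(-1.0, 1.0, size=(M, N)))
    return np.copysign(magnitude, pattern)
```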

Table 1. Average precision of off-diagonal elements of an orthogonal set.

5 Application of pseudo-orthogonalization to the Car-Parrinello method

We coded a complex form of the pseudo-orthogonalization into the SCF part of the Car-Parrinello method, with M=16, N=147, and D=5; the system was bulk silicon. In contrast to metals, bulk silicon is a semiconductor, and its wave function is localized on the atoms. Figure 5 shows the calculated total energy (in a.u.) as a function of SCF iterations.
The difference between the energy calculated using the pseudo-orthogonalization and that using the Schmidt orthogonalization was less than 10^-5 [a.u.], which is sufficient precision to discuss the energy status. Both methods converged relatively rapidly.

Figure 5. Convergence of SCF energies for bulk-silicon calculated using pseudo-orthogonalization and Schmidt orthogonalization.

6 Hardware problems using FPGA

When pseudo-orthogonalization is performed using an FPGA with 10^6 gates [7], the Input/Output (I/O) interface is the key component. Despite the vast arithmetic resources currently available, the number of I/O pins is inadequate to exploit pseudo-orthogonalization at 13 decimal digits. This narrow-interface problem needs to be solved by today's hardware designers, and it arises generally in 64-bit floating-point numerical calculations.

6.1 Data compression

Several data compression methods are currently available for general applications. Here we consider a simple algorithm specialized for an I/O interface for the 8-byte floating-point expression (FP8), which carries about 13 significant decimal digits. Pseudo-orthogonalization requires only about 2-3 significant digits, and the element values of the orthogonal vectors lie in the interval [-x, +x], where x is a positive finite number less than 1. Under these conditions, the FP8 value can be converted to a fixed-point format based on the two's complement representation. The fixed-point format has multiple widths of 1-8 bytes, expressing integers in [-2^n, +2^n-1], n={7, 15, 31, 63}. The conversion and reverse-conversion algorithms are

where X is a number in FP8 and T is a number in the n-bit fixed-point format. The latter can be expressed as FTn and transferred through n interface pins. Such conversion thus improves the data transfer rate in pseudo-orthogonalization by a factor of 64/n.
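The conversion formulas themselves are not reproduced above; a standard scaling of this kind, which we assume is what is intended, is:

```python
import numpy as np

def fp_to_fixed(x, n):
    """Forward conversion: map x in [-1, 1) to an n-bit two's
    complement integer by scaling with 2^(n-1) and rounding."""
    scaled = np.round(np.asarray(x) * 2 ** (n - 1))
    return np.clip(scaled, -2 ** (n - 1), 2 ** (n - 1) - 1).astype(np.int64)

def fixed_to_fp(t, n):
    """Reverse conversion: scale the integer back to floating point."""
    return np.asarray(t) / 2.0 ** (n - 1)
```

The round-trip error of this scaling is bounded by 2^-n, i.e., about O(10^-3) for n=8 and O(10^-5) for n=16, consistent with the precision reported in Table 2, while the transfer-rate gain is 64/n.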

6.2 Precision of the FP8 to FT8/FT16 conversion

Table 2 shows the precision of conversions between FP8 and FT8/FT16 formats using uniform random numbers in the interval [-1,1] for N=1500.

Table 2. Precision of the off-diagonal elements (n=8) and band parts (diagonal elements, n=16) in the conversion between the 8- or 16-bit fixed-point format and the 8-byte floating-point format.
Speed-up ratio | Stand. Dev. | Maximum error

The precision for the off-diagonal elements and the band parts was O(10^-3) and O(10^-5), respectively, both of which are sufficient for transferring pseudo-orthogonalization vectors. In conclusion, because the data transfer in pseudo-orthogonalization predominantly involves off-diagonal elements, the conversion can increase the transfer rate through the interface by a factor of 8.

7 Conclusions

A high-speed orthogonalization method derived from the Schmidt orthogonalization in a partial space was developed. Because it is an approximate transformation, it is called a pseudo-orthogonalization. The speed-up ratio is about 3.3, and the precision is O(10^-2), estimated from the amplitude of the off-diagonal elements for 1500-dimensional vectors. Both the speed-up ratio and the precision depend on the dimension of the partial space.
Although this pseudo-orthogonalization is fast, its precision is insufficient for general applications. We therefore considered two approaches to improve the precision: introduction of a band structure for the vectors, and use of alternative sign random numbers. The precision improved by factors of 2.1 and 1.5, respectively, without increasing the CPU time. The first approach can be used only for localized waves, whereas the latter can be used for general applications.
To demonstrate the applicability of this high-speed pseudo-orthogonalization to large-scale calculations, we coded a complex form of the pseudo-orthogonalization into the SCF part of the Car-Parrinello method. The difference in total energy for bulk silicon calculated using the pseudo-orthogonalization and the Schmidt orthogonalization was less than O(10^-5). Although this is acceptable, the orthogonalization is an approximation with O(10^-2) error, and it should therefore be replaced by the original Schmidt orthogonalization at the convergence limit of the SCF.
In further research, we will execute ab-initio calculations using configurable devices such as FPGAs. Recent advances in FPGAs include various arithmetic resources, cache, FIFO (First-In First-Out) memory, and programmable mega gates. The bottleneck appears to be the I/O interface. To overcome this, we considered a data compression method for the 8-byte floating-point format. The compression ratio is 4 or 8, with a precision of O(10^-5) or O(10^-3), respectively, both of which are sufficient for rapid transfer of pseudo-orthogonalization vectors.


[ 1] Car, R. and Parrinello, M., Phys. Rev. Lett., 55, 2471-2474 (1985).
[ 2] Tohru Sasaki and Kazuaki Murakami, Reconfigurable 3D-FFT Processor for the Car-Parrinello Method (in Japanese), J. Comput. Chem. Jpn, 4, 147-154 (2005).
[ 3] A representative mathematical library for the 3D-FFT is the Intel Math Kernel Library (version 8.1). The introduction is
[ 4] http://www.cc.tsukuba.ac.jp/mimosa/workshop/2004/
[ 5] Hiroshi Murakami, An orthogonalization to realize massive data-transfer effectively for huge dimensional vectors (in Japanese), Society of Computer Chemistry, Japan, Annual Conference, 2P14 (2004).
[ 6] D. E. Knuth, The Art of Computer Programming, Stanford Univ. Press.
[ 7] Introduction for Virtex-4 FPGA family, Xcelljournal, 52, XILINX Co. Ltd., San Jose, CA, 2005.