Performance evaluation of parallel sparse matrix-vector product on multi-core computers
We consider the problem of parallelizing the sparse matrix-vector multiplication algorithm for symmetric multiprocessing architectures. Matrices are stored using the compressed sparse row-column (CSRC) format, designed to profit from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using the CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although theoretically affording only limited parallelism, the first approach leads to the best performance improvements for most matrices. Keywords: parallel and distributed computing; performance analysis; sparse matrix-vector multiplication; compressed data structures; multi-core architectures; finite element method.
It is no longer feasible to expect performance gains from continuously increasing processor clock speeds. Processor vendors have instead concentrated on developing systems that group two or more processing cores onto a single socket, sharing the same memory resources. This technology, called multi-core, has been successfully employed in different application domains ranging from computer graphics to scientific computing, and it is now commonly found in high performance clusters, desktop computers, notebooks, and even mobile devices. The spread of this architecture has consequently stimulated a growing body of research on parallel algorithms. To obtain efficient implementations of parallel algorithms, one must consider the underlying architecture on which the program is intended to run. In fact, even processors belonging to the multi-core family present different hardware layouts, which can make an implementation perform poorly on one platform while running fast on another. As an example of such an issue, multi-core processors may or may not share the same path to memory, leading programmers to take care of thread and memory affinity. The finite element method is usually the first choice for numerically solving large systems of partial differential equations. Matrices arising from finite element discretizations are usually sparse, i.e., most of their entries are zero. Storing sparse matrices effectively requires the use of compressed data structures. The compressed sparse row (CSR) format is regarded as the standard way of representing sparse matrices. It stores contiguously in memory the non-zero entries belonging to the same row. While in a dense representation any element of a matrix can be randomly accessed through its row and column indices, the CSR explicitly stores in memory the combinatorial information about a matrix. Given an n × n matrix A with nnz non-zero coefficients, the original version of the CSR consists of three arrays: two integer arrays ia(n + 1) and ja(nnz) for storing combinatorial data, and one floating-point array a(nnz) containing the non-zero coefficients. The value ia(i) points to the first element of row i in the a array, i.e., row i is defined as the subset of a starting and ending at positions ia(i) and ia(i + 1) − 1, respectively. Column indices for every non-zero entry in a are stored in the ja array. There is also the compressed sparse column (CSC) format, which is the transpose of CSR. The CSR format can represent matrices of arbitrary shapes and symmetry properties. In the context of the finite element method, however, the generality provided by the CSR is underused, as most matrices have a symmetric non-zero pattern. In this case, it would be sufficient to store, roughly, half of the matrix connectivity. The compressed sparse row-column (CSRC) format was designed to profit from this fact. Basically, it stores the column indices for only half of the off-diagonal entries. As the working set size has a great impact on the performance of CSR-like data structures, the running time of algorithms such as the matrix-vector product is improved with the CSRC. Also, biconjugate gradient solvers benefit further from CSRC, because the transpose matrix is implicitly defined. It is known that the performance of finite element codes using iterative solvers is dominated by the computations associated with the matrix-vector multiplication algorithm.
In this algorithm, we are given an n × n sparse matrix A containing nnz non-zeros and a dense n-vector x, called the source vector. The output is an n-vector y, termed the destination vector, which stores the result of the operation Ax. Performing this operation using the CSR format is trivial; a minimal sketch of the kernel is shown below.
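The following fragment (our own illustration, assuming double precision data and the CSR arrays defined above) captures the whole computation:

    ! Naive CSR matrix-vector product: y = A*x.
    do i = 1, n
       y(i) = 0.0d0
       do k = ia(i), ia(i+1) - 1
          y(i) = y(i) + a(k) * x(ja(k))   ! row i dotted with the source vector
       end do
    end do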
It was observed, however, that the maximum performance in Mflops sustained by such a naïve implementation can reach only 10% of the machine peak performance. Its parallelization is also very straightforward and can lead to reasonable speedups, but the deficiencies of the sequential kernel carry over to the parallel version. Related work. The work of Goumas et al. contains a thorough analysis of a number of factors that may degrade the performance of both the sequential and multi-threaded versions of the matrix-vector multiplication kernel employing the CSR. All performance tests were carried out on three different platforms, covering SMP, SMT and ccNUMA machines. The multi-threaded version was developed using the Native POSIX Thread Library (NPTL). Two partitioning schemes were implemented, one guided by the number of rows and the other by the number of non-zero coefficients per thread. It was observed that the latter approach yields better load balancing, thus significantly improving the speedups. For large matrices, they obtained average speedups of 1.96 and 2.13 using 2 and 4 threads, respectively, on an Intel Core 2 Xeon (codename Woodcrest). On this platform, their code reached an average of 815 Mflops for 2 threads, and 849 Mflops when spawning 4 threads. These results, however, include some matrices for which superlinear speedups were observed, namely those whose working set size fluctuates around the L2 cache size. The picture changes considerably when restricted to matrices whose working sets are far from fitting in cache. More specifically, speedups drop to around 1.65 and 2 for the 2 and 4 thread cases, respectively. Memory contention is viewed as the major bottleneck of implementations of the sparse matrix-vector product. This problem was tackled by Kourtis et al. via compression techniques, reducing both the matrix connectivity and the floating-point values being stored. Although achieving good scalability, they obtained at most a 2-fold speedup on 8 threads for matrices that do not fit in cache. The experiments were conducted on two Intel Clovertowns with 4MB of L2 cache each. In the same direction, Belgin et al. have proposed a pattern-based blocking scheme for reducing the index overhead, which alleviates the pressure on the memory bus and increases performance. Accompanied by software prefetching and vectorization techniques, it provided an average sequential speedup of 1.4. Their parallel implementation requires synchronizing the accesses to the y vector. To this end, each thread maintains a private vector for storing partial values, which are summed up in a reduction step into the global destination vector. They observed average speedups around 1.04, 1.11 and 2.3 when spawning 2, 4, and 8 threads, respectively. These results were obtained on a 2-socket Intel Harpertown 5400 with 8GB of RAM and 12MB of L2 cache per socket. Different row-wise partitioning methods were considered by Liu et al. Besides splitting the number of non-zeros almost equally among threads, they evaluated the effect of the automatic scheduling mechanisms provided by OpenMP, namely the static, dynamic and guided schedules. Once more, the non-zero strategy proved the best choice. They also considered the block CSR format. The experiments were run using four AMD Opteron 870 dual-core processors, with 16GB of RAM and 2 × 1MB of L2 cache. For large matrices, both the CSR and block CSR schemes resulted in poor scalability.
The maximum speedup for such matrices was approximately 2, using 8 threads.
Williams et al. evaluated the sparse matrix-vector kernel using the CSR format on several up-to-date chip multiprocessor systems, such as the heterogeneous STI Cell. They examined various optimization strategies that can also be applied to boosting single-core performance, including software pipelining, branch elimination, SIMDization, explicit prefetching, 16-bit indices, and register, cache and translation lookaside buffer (TLB) blocking. Their code was compared against one using the MPI-based SpMV in PETSc, configured to use a shared memory device, with calls to OSKI. A row-wise approach was employed for thread scheduling. For finite element matrices, they reported maximum speedups of 1.12 and 1.77 using 8 threads on an Intel Xeon (Clovertown) processor and an AMD Opteron (Barcelona), respectively. More recently, Buluç et al. have presented a blocked structure that allows efficient computation of both Ax and A^T x in parallel. It can be roughly seen as a dense collection of sparse blocks, rather than a sparse collection of dense blocks as in the BCSR format. In experiments carried out on a ccNUMA machine featuring AMD Opteron 8214 processors (Santa Rosa), there were no improvements over a serial CSR implementation; in fact, it was always slower on banded matrices. Concerning its parallelization, however, it was proved that it yields a very attractive parallelism of Θ(nnz/√(n log n)). The code was developed using the Cilk++ extension for C++, which eases the development of multi-core applications. In practice, it scaled up to 4 threads on an Intel Xeon X5460 (Harpertown), and presented linear speedups on an AMD Opteron and an Intel Core i7 (Nehalem). On the latter, where the best results were attained, it reached speedups of 1.86, 2.97 and 3.71 using 2, 4 and 8 threads, respectively. Although the authors have emphasized that this structure would also be suitable for calculations on symmetric matrices, it does not seem to directly allow the simultaneous computation of yi ← yi + aij xj and yj ← yj + aij xi in a single loop, as CSRC does. Overview. The remainder of this paper is organized as follows. Section 2 contains a precise definition of the CSRC format, accompanied by a description of the matrix-vector multiplication algorithm using such a structure. Its parallelization is described in Sect. 3, where we present two strategies for avoiding memory conflicts during write accesses to the destination vector. Our results are shown in Sect. 4, supplemented with some pertinent remarks. We finally draw some conclusions in Sect. 5.
2 The CSRC storage format
The compressed sparse row-column (CSRC) format is aimed at profiting from the symmetric non-zero pattern of matrices arising in finite element modelling, which is the target application domain of this work. Given an arbitrary global finite element matrix A = (aij), the CSRC decomposes A into the sum AD + AL + AU, where AD, AL, and AU correspond to the diagonal, lower and upper parts of A, respectively. The sub-matrix AL (resp. AU) is stored in a row-wise (resp. column-wise) manner. In practice, the CSRC format splits the off-diagonal coefficients of A into two floating-point arrays, namely al(m) and au(m), with m = (nnz − n)/2, where the lower and upper entries of A are stored. In other words, if j < i, then aij is stored in al, and au contains its transpose aji.
Figure 1: The layout of CSRC for an arbitrary 9 × 9 non-symmetric matrix.
The diagonal elements are stored in an array denoted by ad. Two other integer arrays, ia(n + 1) and ja(m), are also maintained. These arrays can be defined in terms of either the upper or the lower coefficients of A. The ia array contains pointers to the beginning of each row (resp. column) of A in al (resp. au), and ja contains column (resp. row) indices for those non-zero coefficients belonging to the lower (resp. upper) part of A. Another interpretation is that AL is represented in CSR, while AU is stored using CSC. We illustrate the CSRC format for an arbitrary matrix in Fig. 1. Notice that the CSRC can be viewed as the sparse skyline (SSK) format restricted to structurally symmetric matrices [12, Sect. 10.4.5]. However, as shown in Sect. 2.1, minor modifications make it capable of representing rectangular matrices as well. Furthermore, to our knowledge, this is the first evaluation of such a structure on modern multiprocessor machines.
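To make the layout concrete, consider the following small structurally symmetric matrix (a 4 × 4 example of our own, not the matrix of Fig. 1):

        | 1  5  0  7 |
    A = | 2  3  8  0 |
        | 0  4  6  0 |
        | 9  0  0  2 |

Here n = 4, nnz = 10, and m = (nnz − n)/2 = 3, so the CSRC arrays would be

    ad = (1, 3, 6, 2)      ! diagonal entries
    al = (2, 4, 9)         ! lower entries, stored row-wise
    au = (5, 8, 7)         ! their transposed counterparts in the upper part
    ja = (1, 2, 1)         ! column indices of the lower entries
    ia = (1, 1, 2, 3, 4)   ! row pointers; row 1 has no lower entries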
2.1 Extension to rectangular matrices
As defined above, the CSRC cannot handle matrices whose aspect ratio is other than square. However, rectangular matrices with a remarkable property arise naturally from the overlapping partitioning strategy employed in our distributed-memory finite element code. More specifically, given an n × m matrix A constructed by our overlapping scheme, with m > n, A can be written as the sum AS + AR, where AS and AR are of order n × n and n × k, respectively, with k = m − n. In addition, the AS matrix has a symmetric non-zero pattern, and sometimes it is also numerically symmetric. Thus, it can be represented by the CSRC definition given before, while AR is stored using an auxiliary CSR scheme, as sketched below.
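A minimal sketch of the extra loop for the rectangular part, assuming AR is kept in auxiliary CSR arrays iar, jar and ar (these names are our own, purely illustrative):

    ! Contribution of the rectangular part A_R: its k = m - n extra
    ! columns multiply the tail x(n+1:m) of the source vector.
    do i = 1, n
       t = y(i)
       do kk = iar(i), iar(i+1) - 1
          t = t + ar(kk) * x(n + jar(kk))
       end do
       y(i) = t
    end do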
2.2 Sequential matrix-vector product
The sequential version of the CSRC matrix-vector multiplication algorithm has the same loop structure as for CSR. Given an n × n matrix A, it is traversed by rows, and row i is processed from left to right up to its diagonal element aii. Because we assume A is structurally symmetric, its upper part can be traversed simultaneously. That is, we are allowed to compute both yi ← yi + aij xj and yj ← yj + aji xi in the i-th loop. If A is symmetric, we can further eliminate one load instruction when retrieving its upper entries. For rectangular matrices, we insert another loop to process the coefficients stored in the auxiliary CSR. Figure 2 contains Fortran implementations of the SpMV using CSRC for square and rectangular matrices.
(a) Square matrices. (b) Rectangular matrices.
Figure 2: Code snippets for the non-symmetric matrix-vector multiplication algorithm using CSRC for (a) square and (b) rectangular matrices.
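Since the snippets themselves do not survive reproduction here, the fragment below is our reconstruction of the square-matrix kernel from the description above (array names as in Sect. 2, y assumed initialized to zero); it is a sketch, not the paper's verbatim code:

    ! CSRC matrix-vector product y = A*x for a square, structurally
    ! symmetric matrix: lower and upper parts are traversed together.
    do i = 1, n
       t = ad(i) * x(i)                ! diagonal term
       do k = ia(i), ia(i+1) - 1
          j = ja(k)                    ! column index, with j < i
          t = t + al(k) * x(j)         ! y(i) <- y(i) + a(i,j) * x(j)
          y(j) = y(j) + au(k) * x(i)   ! y(j) <- y(j) + a(j,i) * x(i)
       end do
       y(i) = y(i) + t
    end do

For a numerically symmetric matrix, au can simply alias al, which saves one load per off-diagonal entry.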
3 Parallel implementation
The matrix-vector multiplication algorithm for CSRC has the characteristic that the lower and upper parts of the input matrix A are processed simultaneously. In the i-th loop, the coefficients aij, i > j, and their respective transposes are traversed. Therefore, if two threads work on different rows, say i and j, j > i, it may happen that the thread owning row j needs to modify the same position k, k ≤ i, in vector y on which the other thread is currently working. That is, the matrix-vector product using CSRC has a race condition in the access to the destination vector y. If a thread owns row i and a second thread, owning row j, j > i, needs to modify yi, we call it a direct conflict; otherwise, we call it indirect. These conflicts are illustrated as a graph in Fig. 3a for the matrix previously presented in Fig. 1. Each vertex corresponds to a row of A. Direct and indirect conflicts are indicated by solid and dashed lines, respectively. In the graph, there are 13 direct and 7 indirect conflicts. For banded matrices, we expect the number of indirect conflicts to decrease considerably. In short, the data structure is required to support concurrent reading and writing of the y vector. These operations need to be thread-safe, but at the same time very efficient, given the fine granularity of the operations. Common strategies for circumventing this problem would employ atomic primitives, locks, or the emerging transactional memory model. The overheads introduced by these approaches are rather costly compared to the total cost of accessing the y vector. A more promising solution is to determine subsets of rows that can be handled by distinct threads concurrently. In this paper, we have considered two such solutions, here termed local buffers and colorful, as described next. Our algorithms were analyzed using the concepts of work and span [6, Ch. 27]. The work T1 of an algorithm is the total cost of running it on exactly one processor, and the span T∞ is its cost when running on an infinite number of processors. The parallelism of a given algorithm is then defined as the ratio T1/T∞. So, the greater the parallelism of an algorithm, the better the theoretical guarantees on its performance. The work of the matrix-vector multiply using the CSRC is clearly Θ(nnz). To calculate its span, we need to consider our two partitioning strategies separately.
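As a worked instance of these definitions (anticipating the spans derived in the next two subsections), in LaTeX notation:

    \mathrm{parallelism} = \frac{T_1}{T_\infty}, \qquad T_1 = \Theta(nnz), \qquad
    T_\infty = \begin{cases} \Theta(n), & \text{local buffers (Sect. 3.1)},\\
                             O(\Delta), & \text{colorful (Sect. 3.2)}, \end{cases}

so that, for example, a matrix with nnz ≈ 50n affords a parallelism of only about 50 under the local buffers scheme.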
(a) Conflict graph. (b) Local buffers. (c) Colorful.
Figure 3: Illustration of the local buffers and colorful partitioning methods for 3 threads on an arbitrary 9 × 9 matrix.
3.1 Local buffers method
One way to avoid the race condition on the y vector is to assign a distinct destination vector to each thread. That is, thread ti computes its part of the solution, stores it in a local buffer yi, and then accumulates this partial solution into the y vector. This method, here called the local buffers method, is illustrated in Fig. 3b, which shows the distribution of rows for an arbitrary non-symmetric 9 × 9 matrix. In the example, the matrix is split into three regions, and the numbers of non-zeros assigned to the threads are 7, 5 and 21. One could argue that such a row-wise partitioning may result in poor load balancing, since the work per thread strongly depends on the number of non-zeros per row. In Sect. 4, we present results for a non-zero guided approach, in which the rows are still required to be traversed contiguously. Unfortunately, the span of this strategy is Θ(n), due to the accumulation step. This gives a parallelism of O(nnz/n), which would be quite low for massively parallel processors. The platforms considered herein, however, feature at most four processors, and our experiments will show that the local buffers method provides reasonable parallelism for such systems. A sketch of the scheme is given below.
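The following OpenMP sketch is our own illustration, not the verbatim implementation; the subroutine name and the plain static schedule are assumptions (the actual code partitions rows by non-zero count):

    subroutine csrc_matvec_buffers(n, nthreads, ia, ja, ad, al, au, x, y, ybuf)
      use omp_lib
      implicit none
      integer, intent(in) :: n, nthreads, ia(n+1), ja(*)
      double precision, intent(in) :: ad(n), al(*), au(*), x(n)
      double precision, intent(out) :: y(n)
      double precision, intent(out) :: ybuf(n, nthreads)  ! one buffer per thread
      integer :: i, j, k, p, tid
      double precision :: t
    !$omp parallel private(i, j, k, p, t, tid)
      tid = omp_get_thread_num() + 1
      ybuf(:, tid) = 0.0d0             ! each thread writes only its own column
    !$omp do schedule(static)
      do i = 1, n
         t = ad(i) * x(i)
         do k = ia(i), ia(i+1) - 1
            j = ja(k)
            t = t + al(k) * x(j)
            ybuf(j, tid) = ybuf(j, tid) + au(k) * x(i)
         end do
         ybuf(i, tid) = ybuf(i, tid) + t
      end do
    !$omp end do
      ! Accumulation step: combine the per-thread buffers into y.
    !$omp do schedule(static)
      do i = 1, n
         y(i) = ybuf(i, 1)
         do p = 2, nthreads
            y(i) = y(i) + ybuf(i, p)
         end do
      end do
    !$omp end do
    !$omp end parallel
    end subroutine csrc_matvec_buffers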
3.2 Colorful method
The colorful method partitions the matrix A into sets of pairwise conflict-free rows. These sets are determined by applying the sequential coloring algorithm to the conflict graph of A. The direct conflicts of row i are exactly the rows corresponding to the column indices of the non-zero entries of row i below the diagonal. They can be computed in a single loop through the CSRC structure, as sketched below. Computing indirect conflicts is more demanding. In our implementation, these are determined with the aid of the subgraph induced by the direct conflicts: given two rows i and j, i ≠ j, if the intersection of their neighborhoods is non-empty, then they are indirectly in conflict. After coloring the conflict graph, the color classes correspond to conflict-free blocks, and the matrix-vector product can be carried out concurrently inside each block. Observe that, for a rectangular matrix, a coloring of its square part is still valid for the whole matrix, because the rectangular part is accessed by rows.
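As a rough illustration, the direct conflicts can be read straight off the CSRC index arrays (a hedged fragment; the edge-insertion helper add_edge and the graph structure adj are hypothetical):

    ! Every column index j stored for row i yields a direct conflict
    ! {i, j}: the thread owning row j may update y(i), and vice versa.
    do i = 1, n
       do k = ia(i), ia(i+1) - 1
          call add_edge(adj, i, ja(k))
       end do
    end do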
The layout of a 5-colored matrix is depicted in Fig. 3c, and its associated conflict graph in Fig. 3a. Let G denote the conflict graph of an n × n matrix A. It is known that a sequential (greedy) coloring of G uses at most ∆ + 1 colors, where ∆ denotes the maximum vertex degree in G. Therefore, the span of the colorful sparse matrix-vector product would be O(∆), and the parallelism Ω(nnz/∆). Although this provides a high degree of parallelism, the possibility of exploiting systems based on cache hierarchies decreases, which affects the code performance considerably.
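A minimal sketch of the resulting computation, assuming rows(first(c):first(c+1)-1) lists the rows of color c (bookkeeping arrays of our own devising, with y zeroed beforehand):

    do c = 1, ncolors
    !$omp parallel do private(i, j, k, t)
       do p = first(c), first(c+1) - 1
          i = rows(p)
          t = ad(i) * x(i)
          do k = ia(i), ia(i+1) - 1
             j = ja(k)
             t = t + al(k) * x(j)
             y(j) = y(j) + au(k) * x(i)   ! safe: same-color rows share no conflicts
          end do
          y(i) = y(i) + t
       end do
    !$omp end parallel do
    end do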
4 Experimental results
Our implementation was evaluated on two Intel processors: a Core 2 Duo E8200 (Wolfdale) and a Core i7 940 (Bloomfield). The Wolfdale processor runs at 2.66GHz, with 6MB of L2 cache and 8GB of RAM; the Bloomfield one runs at 2.93GHz, with 8MB of L3 cache and 8GB of RAM. Our interest in Intel Core 2 Duo machines lies in the fact that we have a dedicated 32-node cluster of such processors, on which our finite element simulations are carried out. The code was parallelized using OpenMP directives. All results were obtained using the Intel Fortran compiler (ifort) version 11.1 with level 2 optimizations enabled (-O2 flag). Machine counters were accessed through the PAPI 3.7.1 library API. The measurements of speedups and Mflops rates were carried out with PAPI instrumentation disabled. Unfortunately, we could not evaluate data cache misses on the Bloomfield machine. We ran our code on a data set comprising one dense matrix of order 10K, 16 matrices selected from the University of Florida sparse matrix collection, and 3 groups of 11 matrices each, called angical, tracer, and cube2M, of our own devising. Inside each group, the matrices correspond to one global finite element matrix output by our sequential finite element code, plus five global matrices for each of the two adopted domain partitioning schemes, overlapping (“_o” suffix) and non-overlapping (“_n” suffix), corresponding to 2, 4, 8, 16, and 32 sub-domains. The sparse matrix-vector product was run a thousand times for each matrix, which is a reasonable iteration count for iterative solvers like the preconditioned conjugate gradient method (PCG) and the generalized minimum residual method (GMRES). All values shown here correspond to the median over five such runs. Sequential performance. We have compared the sequential performance of CSRC against a standard CSR implementation. For symmetric matrices, we chose the OSKI implementation as the representative of the symmetric CSR algorithm, assuming that only the lower part of A is stored. In the sparse matrix-vector product, each element of the matrix is accessed exactly once; thus, accessing these entries incurs only compulsory misses. On the other hand, the elements of x and y are accessed multiple times. This enables us to take advantage of modern cache-based machines by reusing recently accessed values. In the CSR, the access pattern of the x vector is known to be the major hindrance to the exploitation of data reuse, because the arrays y, ia, ja and a all have stride-1 accesses. Since the y vector is no longer traversed with stride-1 in CSRC, one could argue that there would be an increase in the number of cache misses.
Figure 4: Percentages of L2 cache and TLB misses for the sequential sparse matrix-vector product on the Wolfdale processor.
Figure 5: The sequential performance in Mflops of the matrix-vector product using CSR and CSRC for different matrices on the Wolfdale processor.
However, as presented in Fig. 4, experiments on L2 cache data misses suggest just the converse, and the same occurs with the TLB misses. The performance of the algorithm considered herein is memory bound, since the number of load/store operations is at least as great as the number of floating-point multiply-add instructions. In a dense matrix-matrix product, we need to carry out O(n³) operations on O(n²) data; for the matrix-vector product, these quantities are both O(n²), which contributes to amplify the pressure on the memory. Observe that, for square matrices, the computation of Ax using the CSRC requires the execution of n multiply and nnz − n multiply-add operations, whereas the CSR algorithm requires nnz multiply-add operations. On systems without fused multiply-add instructions, the CSR and CSRC algorithms would perform 2nnz and 2nnz − n floating-point instructions, respectively. On the other hand, the number of load instructions is 3nnz for CSR and (5/2)nnz − (1/2)n for CSRC. Hence the ratio between loads and flops is approximately 1.26 for CSRC and exactly 1.5 for CSR. This bandwidth mitigation may be the most relevant reason for the efficiency of CSRC shown in Fig. 5. Furthermore, the advantage of using the extended CSRC on a rectangular matrix whose square part is symmetric is remarkable.
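In LaTeX notation, these counts give the quoted ratios (a short check of the arithmetic above; the CSRC figure tends to 5/4 as nnz grows relative to n):

    \frac{\mathrm{loads}_{\mathrm{CSR}}}{\mathrm{flops}_{\mathrm{CSR}}}
      = \frac{3\,nnz}{2\,nnz} = \frac{3}{2} = 1.5,
    \qquad
    \frac{\mathrm{loads}_{\mathrm{CSRC}}}{\mathrm{flops}_{\mathrm{CSRC}}}
      = \frac{\tfrac{5}{2}\,nnz - \tfrac{1}{2}\,n}{2\,nnz - n}
      \;\longrightarrow\; \frac{5}{4} \quad (nnz \gg n).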
Figure 6: Speedups achieved by the local buffers implementation of the parallel matrix-vector product using CSRC for different matrices on the Wolfdale processor.
Figure 7: Speedups achieved by the local buffers implementation of the parallel matrix-vector product using CSRC for different matrices on the Bloomfield processor.
Multi-thread version. Our parallel implementation was evaluated with up to 4 threads on Bloomfield, with Hyper-Threading disabled. The speedup values presented in this section are relative to the purely sequential CSRC code, not to the one-thread case. Figures 6 and 7 show the speedups attained by the local buffers strategy on Wolfdale and Bloomfield, respectively. The overhead introduced by the accumulation step at the end of the computation is noticeable when using 1 thread. However, this can easily be overcome by checking the number of threads at runtime and, if there is only one thread in the working team, operating directly on the global destination vector. Curiously, matrices viscorocks, mixtank_new, bone010, angical, and angical_n16 presented superlinear speedups on Wolfdale, even though none of them fit into the L2 cache. On the other hand, some matrices, most notably tmt_unsym, experienced slowdowns with 2 threads. Note also that the rectangular CSRC performed well only on small matrices, e.g., angical_o32, tracer_o32 and cube2M_o32. Although theoretically promising, the colorful strategy resulted in poor performance, because it decreases the spatial locality of accesses to the vectors x and y: the rows are no longer guaranteed to be traversed with stride-1.
Figure 8: Speedups achieved by the colorful implementation of the parallel matrix-vector product using CSRC for different matrices on the Wolfdale processor.
Figure 9: Speedups achieved by the colorful implementation of the parallel matrix-vector product using CSRC for different matrices on the Bloomfield processor.
Moreover, the coloring algorithm itself is a major drawback of the colorful strategy: its running time may account for a large part of the total run time. For our matrices, however, the coloring could be computed once and saved for subsequent computations.
5 Conclusions and future work
We have been concerned with the parallelization of the matrix-vector multiplication algorithm using the CSRC data structure, focusing on multi-core architectures. We have advocated that multi-core parallelization alone can be more effective than local improvements to the serial implementation. For finite element matrices, the CSRC has proven to be a better choice than the standard CSR representation. The parallel computation of the transpose matrix-vector product is considered costly for the standard CSR; an easy but still expensive solution would be to convert the matrix to the CSC format before spawning threads. In the CSRC, however, this operation is very straightforward: given a parallelization of CSRC, we just need to exchange the addresses of al and au, and we are done. Clearly, the computational costs remain the same.
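With the kernel wrapped in a subroutine as in the sketch of Sect. 3.1, this amounts to swapping the two coefficient arrays at the call site (illustrative, using our hypothetical subroutine):

    ! y = A * x   : lower entries in al, upper entries in au
    call csrc_matvec_buffers(n, nt, ia, ja, ad, al, au, x, y, ybuf)
    ! y = A^T * x : exchange the roles of al and au
    call csrc_matvec_buffers(n, nt, ia, ja, ad, au, al, x, y, ybuf)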
The solutions presented here are part of a distributed-memory implementation of the finite element method, and extend previously obtained results. The performance of that implementation, as well as the combination of the coloring approach with the non-zero guided strategy, will be the object of future investigations.
We would like to thank Prof. José A. F. Santiago for granting access to his Intel i7 machine used in our experiments. This work was partially supported by the National Council for Scientific and Technological Development (CNPq), Brazil.