Run-Time Optimization of Sparse Matrix-Vector Multiplication on SIMD Machines Louis H. Ziantz C. Ozturan Boleslaw K. Szymanski Sparse matrix-vector multiplication forms the heart of iterative linear solvers used widely in scientific computations (e.g., finite element methods). In such solvers, the matrix-vector product is computed repeatedly, often thousands of times, with updated values of thevector until convergence is achieved. In an SIMD architecture, each processor has to fetch the updated off-processor vector elements while computing its share of the product. In this paper, we report on run-time optimization of array distribution and off processor data fetching to reduce both the communication and computation time. The optimization is applied to a sparse matrix stored in a compressed sparse row-wise format. Actual runs on test matrices produced up to a 35 percent relative improvement over a block distribution with a naive multiplication algorithm while simulations overa wider range of processors indicate that up to a 60 percent improvement may be possible in some cases. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY cs-94-13
Run-Time Optimization of Sparse Matrix-Vector Multiplication on SIMD Machines
Louis H. Ziantz
C. Ozturan
Boleslaw K. Szymanski
Sparse matrix-vector multiplication forms the heart of iterative linear solvers used widely in scientific computations (e.g., finite element methods). In such solvers, the matrix-vector product is computed repeatedly, often thousands of times, with updated values of thevector until convergence is achieved. In an SIMD architecture, each processor has to fetch the updated off-processor vector elements while computing its share of the product. In this paper, we report on run-time optimization of array distribution and off processor data fetching to reduce both the communication and computation time. The optimization is applied to a sparse matrix stored in a compressed sparse row-wise format. Actual runs on test matrices produced up to a 35 percent relative improvement over a block distribution with a naive multiplication algorithm while simulations overa wider range of processors indicate that up to a 60 percent improvement may be possible in some cases.
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY
cs-94-13