Run-Time Optimization of Sparse Matrix-Vector Multiplication on SIMD Machines

Run-Time Optimization of Sparse Matrix-Vector Multiplication on SIMD Machines Louis H. Ziantz C. Ozturan Boleslaw K. Szymanski Sparse matrix-vector multiplication forms the heart of iterative linear solvers used widely in scientific computations (e.g., finite element methods). In such solvers, the matrix-vector product is computed repeatedly, often thousands of times, with updated values of thevector until convergence is achieved. In an SIMD architecture, each processor has to fetch the updated off-processor vector elements while computing its share of the product. In this paper, we report on run-time optimization of array distribution and off processor data fetching to reduce both the communication and computation time. The optimization is applied to a sparse matrix stored in a compressed sparse row-wise format. Actual runs on test matrices produced up to a 35 percent relative improvement over a block distribution with a naive multiplication algorithm while simulations overa wider range of processors indicate that up to a 60 percent improvement may be possible in some cases. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY cs-94-13

Run-Time Optimization of Sparse Matrix-Vector Multiplication on SIMD Machines

Louis H. Ziantz

C. Ozturan

Boleslaw K. Szymanski

Sparse matrix-vector multiplication forms the heart of iterative linear solvers used widely in scientific computations (e.g., finite element methods). In such solvers, the matrix-vector product is computed repeatedly, often thousands of times, with updated values of thevector until convergence is achieved. In an SIMD architecture, each processor has to fetch the updated off-processor vector elements while computing its share of the product. In this paper, we report on run-time optimization of array distribution and off processor data fetching to reduce both the communication and computation time. The optimization is applied to a sparse matrix stored in a compressed sparse row-wise format. Actual runs on test matrices produced up to a 35 percent relative improvement over a block distribution with a naive multiplication algorithm while simulations overa wider range of processors indicate that up to a 60 percent improvement may be possible in some cases.

Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY

cs-94-13