Hardware-Aware Matrix-Free Algorithms for Higher-Order Finite-Element Discretized Matrix Multivector Products
Collaborators: Gourab Panigrahi, Debashis Panda, Dr. Phani Motamarri
Recent advances in hardware-aware algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications have shown that on-the-fly matrix-vector products can reduce arithmetic complexity and improve data-access efficiency. These matrix-free approaches exploit the tensor-structured FE polynomial basis to evaluate integrals without explicit cell-level matrix construction. While existing implementations of such algorithms are well-suited for the action of FE-discretized matrices on a single vector, they do not extend directly to matrix-multivector products involving multiple vectors. To address this limitation, we propose a computationally efficient and scalable matrix-free implementation procedure for computing FE-discretized matrix-multivector products on multinode CPU architectures. Our implementation achieves speedups of up to 4.4x over a baseline cell-level matrix implementation when using 1024 vectors and FE interpolating polynomial orders ranging from 6 to 8. These speedups underscore the potential of our matrix-free implementation to improve the overall performance of subspace-iteration methods used to solve the Kohn-Sham eigenproblem incorporating non-collinear spin and spin-orbit coupling on multinode CPU and GPU architectures.
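To make the tensor-structured evaluation concrete, below is a minimal NumPy sketch of sum-factorized cell-level mass-matrix action applied to a whole batch of vectors at once. The sizes, the random placeholder data, and the cell_mass_action helper are illustrative assumptions for a single reference cell (Jacobian and coefficient factors omitted), not the paper's implementation:

```python
import numpy as np

# Hypothetical sizes: FE order p, 1D nodes n = p + 1, 1D quadrature points q.
p, n_vec = 6, 32           # polynomial order and number of vectors in the batch
n, q = p + 1, p + 2

rng = np.random.default_rng(0)
N = rng.random((q, n))     # 1D shape-function values at quadrature points
w = rng.random(q)          # 1D quadrature weights (placeholders)

# Nodal values on one cell for the whole multivector batch: (n, n, n, n_vec).
u = rng.random((n, n, n, n_vec))

def cell_mass_action(u):
    # Sum factorization: contract one tensor axis at a time instead of
    # forming the (n^3 x n^3) cell matrix explicitly.
    v = np.einsum('qi,ijkv->qjkv', N, u)   # interpolate along axis 0
    v = np.einsum('rj,qjkv->qrkv', N, v)   # ... axis 1
    v = np.einsum('sk,qrkv->qrsv', N, v)   # ... axis 2
    # Tensor-product quadrature weights (Jacobian/coefficient omitted here).
    v *= w[:, None, None, None] * w[None, :, None, None] * w[None, None, :, None]
    # Integrate back with the transposed contractions.
    v = np.einsum('sk,qrsv->qrkv', N, v)
    v = np.einsum('rj,qrkv->qjkv', N, v)
    return np.einsum('qi,qjkv->ijkv', N, v)

y = cell_mass_action(u)    # shape (n, n, n, n_vec)
```

The point of the axis-by-axis contractions is that the cost per cell and per vector scales roughly as O(n^4) in 3D (with q ~ n) rather than the O(n^6) of an explicit cell-matrix product, and the trailing batch axis keeps all vectors in the same data pass.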
References
2023 (Under Review)
Fast Hardware-Aware Matrix-Free Algorithm for Higher-Order Finite-Element Discretized Matrix Multivector Products on Distributed Systems
Gourab Panigrahi, Nikhil Kodali, Debashis Panda, and Phani Motamarri
Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating-point operations and data-access costs compared to traditional sparse-matrix approaches. This work proposes efficient matrix-free algorithms for evaluating FE-discretized matrix-multivector products on both multinode CPU and GPU architectures. We address a critical gap in existing matrix-free implementations, which are well-suited only for the action of FE-discretized matrices on a single vector. We employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping compute-communication strategies. On GPUs, we overlap compute and data movement, and use GPU shared memory, constant memory, and kernel fusion to reduce data accesses. Our implementation outperforms the baselines for the Helmholtz operator action, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, and reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (~3000 cores) and GPUs (~25 GPUs), respectively. We further benchmark the proposed implementation on a model eigenvalue problem, computing the 1024 smallest eigenvalue-eigenvector pairs with the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node, and reaching up to 3.0x and 1.4x improvement on multinode CPUs (~3000 cores) and GPUs (~25 GPUs), respectively.
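As one concrete ingredient, here is a sketch of the even-odd decomposition mentioned above, applied to a single 1D interpolation step. The symmetry assumption, the even-size restriction, and the even_odd_matvec helper are illustrative choices of mine; real implementations fuse this with SIMD batching over vectors:

```python
import numpy as np

def even_odd_matvec(N, u):
    # Apply y = N @ u using the FE shape-function symmetry
    # N[q-1-a, n-1-b] == N[a, b]; even q and n assumed for brevity.
    q, n = N.shape
    qh, nh = q // 2, n // 2
    ur = u[::-1][:nh]                          # mirrored half of the input
    ue = 0.5 * (u[:nh] + ur)                   # even component
    uo = 0.5 * (u[:nh] - ur)                   # odd component
    Ae = N[:qh, :nh] + N[:qh, ::-1][:, :nh]    # acts on the even part
    Ao = N[:qh, :nh] - N[:qh, ::-1][:, :nh]    # acts on the odd part
    ye, yo = Ae @ ue, Ao @ uo                  # two half-size matvecs
    y = np.empty(q)
    y[:qh] = ye + yo
    y[qh:] = (ye - yo)[::-1]                   # mirror symmetry of the output
    return y

# Consistency check against the dense product on a symmetric N.
rng = np.random.default_rng(0)
q, n = 8, 6
Nh = rng.random((q // 2, n))
N = np.vstack([Nh, Nh[::-1, ::-1]])            # enforce N[q-1-a, n-1-b] == N[a, b]
u = rng.random(n)
assert np.allclose(even_odd_matvec(N, u), N @ u)
```

Pairing mirrored nodes halves the multiplication count of every 1D contraction, a saving that compounds across the three tensor directions of a 3D cell.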
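Since the eigenvalue benchmark uses Chebyshev Filtered Subspace Iteration, a bare-bones sketch of the degree-m Chebyshev filter acting on a multivector block may also help. Here apply_H stands in for the matrix-free operator action, and the unscaled three-term recurrence omits the scaling that production CheFSI codes use for numerical stability:

```python
import numpy as np

def chebyshev_filter(apply_H, X, m, a, b):
    # Dampen eigencomponents of H inside [a, b] and amplify those below a;
    # X is an (n_dofs, n_vec) multivector block, apply_H the operator action.
    c = 0.5 * (b + a)                     # centre of the unwanted spectrum
    e = 0.5 * (b - a)                     # its half-width
    Y = (apply_H(X) - c * X) / e          # degree-1 Chebyshev term
    X_prev = X
    for _ in range(2, m + 1):             # three-term recurrence for T_{k+1}
        X_new = 2.0 * (apply_H(Y) - c * Y) / e - X_prev
        X_prev, Y = Y, X_new
    return Y

# Toy usage on a small dense symmetric H standing in for the FE operator.
rng = np.random.default_rng(1)
H = rng.random((50, 50)); H = 0.5 * (H + H.T)
lam = np.linalg.eigvalsh(H)
X = rng.random((50, 4))                   # 4 trial vectors
Xf = chebyshev_filter(lambda V: H @ V, X, m=10, a=lam[4], b=lam[-1])
```

In a full CheFSI sweep the filtered block is then orthonormalized and a Rayleigh-Ritz step extracts the eigenpair approximations; every apply_H call is exactly the matrix-multivector product that the proposed matrix-free kernels accelerate.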