publications
publications in reverse chronological order.
2024
- JPDC: Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems. Gourab Panigrahi, Nikhil Kodali, Debashis Panda, and Phani Motamarri. Journal of Parallel and Distributed Computing, 2024.
Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating-point operations and data access costs compared to traditional sparse matrix approaches. In this work, we address a critical gap in existing matrix-free implementations, which are not well suited for the action of FE discretized matrices on a very large number of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and strategies that overlap computation and communication. On GPUs, we develop strategies that overlap computation with data movement to achieve efficient pipelining, and we reduce data accesses through the use of GPU shared memory, constant memory, and kernel fusion. Our implementation outperforms the baselines for the Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively. We further benchmark the proposed implementation by solving a model eigenvalue problem for the 1024 smallest eigenvalue-eigenvector pairs with the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node, while reaching up to 3.0x and 1.4x improvement on multi-node CPUs (3072 cores) and GPUs (24 GPUs), respectively.
@article{panigrahi_fast_2023,
  title = {Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems},
  journal = {Journal of Parallel and Distributed Computing},
  url = {https://www.sciencedirect.com/science/article/pii/S0743731524000893},
  doi = {10.1016/j.jpdc.2024.104925},
  language = {en},
  urldate = {2023-11-06},
  author = {Panigrahi, Gourab and Kodali, Nikhil and Panda, Debashis and Motamarri, Phani},
  volume = {192},
  pages = {104925},
  year = {2024},
  issn = {0743-7315},
  keywords = {Matrix-free, Finite element method, Sum factorization, Scalable algorithms for heterogeneous architectures},
  dimensions = {true},
  ownpub = {true}
}
- Under Review: Finite-element methods for noncollinear magnetism and spin-orbit coupling in real-space pseudopotential density functional theory. Nikhil Kodali and Phani Motamarri. Oct 2024.
We introduce an efficient finite-element approach for large-scale real-space pseudopotential density functional theory (DFT) calculations incorporating noncollinear magnetism and spin-orbit coupling. The approach, implemented within the open-source DFT-FE computational framework, fills a significant gap in real-space DFT calculations using finite-element basis sets, which offer several advantages over traditional DFT basis sets. In particular, we leverage the local reformulation of DFT electrostatics to derive the finite-element (FE) discretized governing equations involving two-component spinors. We subsequently utilize an efficient self-consistent field iteration approach based on the Chebyshev filtered subspace iteration procedure, exploiting the sparsity of the local and non-local parts of the FE discretized Hamiltonian, to solve the underlying nonlinear eigenvalue problem using a two-grid strategy. Furthermore, to develop a unified framework for computing atomic forces and periodic unit-cell stresses, we propose a generalized functional, formulated within the framework of noncollinear magnetism and spin-orbit coupling, that has a stationary point at the minimum of the Kohn-Sham DFT energy functional. Validation studies against plane-wave implementations show excellent agreement in ground-state energetics, vertical ionization potentials, magnetic anisotropy energies, band structures, and spin textures. Compared with widely used plane-wave implementations on CPUs, the proposed method achieves up to 8x-11x speedups in minimum wall time for semi-periodic and non-periodic systems with ~5000-7000 electrons, and it also exhibits a significant computational advantage on GPUs.
@misc{kodali2024finiteelementmethodsnoncollinearmagnetism,
  title = {Finite-element methods for noncollinear magnetism and spin-orbit coupling in real-space pseudopotential density functional theory},
  author = {Kodali, Nikhil and Motamarri, Phani},
  url = {https://arxiv.org/abs/2410.02754},
  doi = {10.48550/arXiv.2410.02754},
  language = {en},
  year = {2024},
  month = oct,
  eprint = {2410.02754},
  dimensions = {true},
  ownpub = {true}
}
2017
- PhysRevB: Short-range atomic ordering in nonequilibrium silicon-germanium-tin semiconductors. S. Mukherjee, N. Kodali, D. Isheim, S. Wirths, J. M. Hartmann, D. Buca, D. N. Seidman, and O. Moutanabbir. Physical Review B, Apr 2017. Publisher: American Physical Society.
Precise knowledge of the atomic order in monocrystalline alloys is fundamental to understanding and predicting their physical properties. With this perspective, we utilized laser-assisted atom probe tomography to investigate the three-dimensional distribution of atoms in nonequilibrium epitaxial Sn-rich group-IV SiGeSn ternary semiconductors. Different atom probe statistical analysis tools, including frequency distribution analysis, partial radial distribution functions, and nearest-neighbor analysis, were employed to evaluate the behavior of the three elements and to compare their spatial distributions with those in an ideal solid solution. This atomistic-level analysis provided clear evidence of an unexpected repulsive interaction between Sn and Si, leading to the deviation of Si atoms from the theoretical random distribution. This departure from an ideal solid solution is supported by first-principles calculations and attributed to the tendency of the system to reduce its mixing enthalpy throughout the layer-by-layer growth process.
@article{mukherjee_short-range_2017,
  title = {Short-range atomic ordering in nonequilibrium silicon-germanium-tin semiconductors},
  volume = {95},
  issn = {2469-9950},
  url = {http://link.aps.org/doi/10.1103/PhysRevB.95.161402},
  doi = {10.1103/PhysRevB.95.161402},
  number = {16},
  urldate = {2017-04-11},
  journal = {Physical Review B},
  author = {Mukherjee, S. and Kodali, N. and Isheim, D. and Wirths, S. and Hartmann, J. M. and Buca, D. and Seidman, D. N. and Moutanabbir, O.},
  month = apr,
  year = {2017},
  note = {Publisher: American Physical Society},
  pages = {161402},
  publisher = {American Physical Society},
  dimensions = {true},
  ownpub = {true}
}