publications | Nikhil Kodali

2025

PhysRevB
Finite-element methods for noncollinear magnetism and spin-orbit coupling in real-space pseudopotential density functional theory

Nikhil Kodali and Phani Motamarri

Phys. Rev. B, May 2025

We introduce an efficient finite-element approach for large-scale real-space pseudopotential density functional theory (DFT) calculations incorporating noncollinear magnetism and spin-orbit coupling. The approach, implemented within the open-source DFT-FE computational framework, fills a significant gap in real-space DFT calculations using finite-element basis sets, which offer several advantages over traditional DFT basis sets. In particular, we leverage the local reformulation of DFT electrostatics to derive the finite-element (FE) discretized governing equations involving two-component spinors. We subsequently utilize an efficient self-consistent field iteration approach based on a Chebyshev filtered subspace iteration procedure that exploits the sparsity of the local and non-local parts of the FE-discretized Hamiltonian to solve the underlying nonlinear eigenvalue problem based on a two-grid strategy. Furthermore, we propose using a generalized functional within the framework of noncollinear magnetism and spin-orbit coupling with a stationary point at the minima of the Kohn-Sham DFT energy functional to develop a unified framework for computing atomic forces and periodic unit-cell stresses. Validation studies against plane-wave implementations show excellent agreement in ground-state energetics, vertical ionization potentials, magnetic anisotropy energies, band structures, and spin textures. The proposed method achieves up to 8x-11x speed-ups for semi-periodic and non-periodic systems with ~5000-7000 electrons in terms of minimum wall times compared to widely used plane-wave implementations on CPUs, in addition to exhibiting a significant computational advantage on GPUs.
@article{kodali2024finiteelementmethodsnoncollinearmagnetism,
  title = {Finite-element methods for noncollinear magnetism and spin-orbit coupling in real-space pseudopotential density functional theory},
  author = {Kodali, Nikhil and Motamarri, Phani},
  journal = {Phys. Rev. B},
  volume = {111},
  pages = {195129},
  numpages = {27},
  year = {2025},
  month = may,
  publisher = {American Physical Society},
  doi = {10.1103/PhysRevB.111.195129},
  url = {https://link.aps.org/doi/10.1103/PhysRevB.111.195129},
  dimensions = {true},
  ownpub = {true}
}

under review
Residual-based Chebyshev filtered subspace iteration for sparse Hermitian eigenvalue problems tolerant to inexact matrix-vector products

Nikhil Kodali, Kartick Ramakrishnan, and Phani Motamarri

May 2025

Chebyshev Filtered Subspace Iteration (ChFSI) has been widely adopted for computing a small subset of extreme eigenvalues of large sparse matrices. This work introduces a residual-based reformulation of ChFSI, referred to as R-ChFSI, designed to accommodate inexact matrix-vector products while maintaining robust convergence properties. By reformulating the traditional Chebyshev recurrence to operate on residuals rather than eigenvector estimates, the R-ChFSI approach effectively suppresses the errors made in matrix-vector products, improving the convergence behaviour for both standard and generalized eigenproblems. This tolerance of R-ChFSI to inexact matrix-vector products allows one to incorporate approximate inverses for large-scale generalized eigenproblems, making the method particularly attractive where exact matrix factorizations or iterative methods become computationally expensive for evaluating inverses. It also allows us to compute the matrix-vector products in lower-precision arithmetic, allowing us to leverage modern hardware accelerators. Through extensive benchmarking, we demonstrate that R-ChFSI achieves desired residual tolerances while leveraging low-precision arithmetic. For problems with millions of degrees of freedom and thousands of eigenvalues, R-ChFSI attains final residual norms in the range of 10^-12 to 10^-14, even with FP32 and TF32 arithmetic, significantly outperforming standard ChFSI in similar settings. In generalized eigenproblems, where approximate inverses are used, R-ChFSI achieves residual tolerances up to ten orders of magnitude lower, demonstrating its robustness to approximation errors. Finally, R-ChFSI provides a scalable and computationally efficient alternative for solving large-scale eigenproblems in high-performance computing environments.
@misc{kodali2025residualbasedchebyshevfilteredsubspace,
  title = {Residual-based Chebyshev filtered subspace iteration for sparse Hermitian eigenvalue problems tolerant to inexact matrix-vector products},
  author = {Kodali, Nikhil and Ramakrishnan, Kartick and Motamarri, Phani},
  year = {2025},
  eprint = {2503.22652},
  archiveprefix = {arXiv},
  primaryclass = {physics.comp-ph},
  url = {https://arxiv.org/abs/2503.22652},
  dimensions = {true},
  ownpub = {true}
}
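
For readers unfamiliar with the baseline method named in the abstract above, the following is a minimal NumPy sketch of standard Chebyshev filtered subspace iteration for a dense real symmetric matrix. It is not the R-ChFSI reformulation and not the authors' implementation. The function names, the test matrix `demo_matrix`, the filter degree, the iteration count, and the use of the matrix 1-norm as a spectral upper bound are all assumptions chosen for illustration.

```python
import numpy as np


def chebyshev_filter(A, X, degree, a, b, a0):
    """Apply a scaled Chebyshev polynomial of A to the block X.

    Components with eigenvalues in [a, b] are damped; those below a are
    amplified. a0 is an estimate of the smallest eigenvalue and is used
    only to scale the recurrence so that values do not overflow.
    """
    e = (b - a) / 2.0          # half-width of the damped interval
    c = (b + a) / 2.0          # centre of the damped interval
    sigma = e / (a0 - c)
    sigma1 = sigma
    Y = (A @ X - c * X) * (sigma1 / e)
    for _ in range(2, degree + 1):
        sigma2 = 1.0 / (2.0 / sigma1 - sigma)
        Y_new = 2.0 * (A @ Y - c * Y) * (sigma2 / e) - (sigma * sigma2) * X
        X, Y, sigma = Y, Y_new, sigma2
    return Y


def rayleigh_ritz(A, X):
    """Orthonormalize X and return Ritz values (ascending) and Ritz vectors."""
    Q, _ = np.linalg.qr(X)
    theta, S = np.linalg.eigh(Q.T @ A @ Q)
    return theta, Q @ S


def chfsi(A, n_wanted, block_size, degree=15, n_iter=25, seed=0):
    """Approximate the n_wanted smallest eigenpairs of symmetric A."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((A.shape[0], block_size))
    b = np.linalg.norm(A, 1)   # assumption: 1-norm as a cheap upper bound
    theta, X = rayleigh_ritz(A, X)
    for _ in range(n_iter):
        # Damp everything above the largest current Ritz value.
        X = chebyshev_filter(A, X, degree, a=theta[-1], b=b, a0=theta[0])
        theta, X = rayleigh_ritz(A, X)
    return theta[:n_wanted], X[:, :n_wanted]


def demo_matrix(n=100, seed=1):
    """Hypothetical test problem: diag(1..n) plus small symmetric noise."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n, n)) * 1e-2
    return np.diag(np.arange(1.0, n + 1)) + (noise + noise.T) / 2.0
```

Calling `chfsi(demo_matrix(), n_wanted=4, block_size=8)` returns approximations to the four smallest eigenpairs. The block is deliberately larger than the number of wanted pairs so that the wanted eigenvalues are well separated from the damped interval.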

2024

JPDC
Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems

Gourab Panigrahi, Nikhil Kodali, Debashis Panda, and Phani Motamarri

Journal of Parallel and Distributed Computing, May 2024

Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating point operations and data access costs compared to traditional sparse matrix approaches. In this work, we address a critical gap in existing matrix-free implementations, which are not well suited for the action of FE discretized matrices on a very large number of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batch size tailored to underlying hardware architectures, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping computation and communication strategies. On GPUs, we develop strategies to overlap compute with data movement for achieving efficient pipelining and reduced data accesses through the use of GPU-shared memory, constant memory and kernel fusion. Our implementation outperforms the baselines for Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively. We further benchmark the performance of the proposed implementation for solving a model eigenvalue problem for the 1024 smallest eigenvalue-eigenvector pairs by employing the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node while reaching up to 3.0x and 1.4x improvement on multi-node CPUs (3072 cores) and GPUs (24 GPUs), respectively.
@article{panigrahi_fast_2023,
  address = {Rochester, NY},
  title = {Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems},
  journal = {Journal of Parallel and Distributed Computing},
  url = {https://www.sciencedirect.com/science/article/pii/S0743731524000893},
  doi = {10.1016/j.jpdc.2024.104925},
  language = {en},
  urldate = {2023-11-06},
  author = {Panigrahi, Gourab and Kodali, Nikhil and Panda, Debashis and Motamarri, Phani},
  volume = {192},
  pages = {104925},
  year = {2024},
  issn = {0743-7315},
  keywords = {Matrix-free, Finite element method, Sum factorization, Scalable algorithms for heterogeneous architectures},
  dimensions = {true},
  ownpub = {true}
}
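
The "Sum factorization" keyword of the entry above refers to applying a tensor-product operator one dimension at a time instead of assembling it. The sketch below illustrates that idea in NumPy for a simplified stand-in, a 2D Kronecker-product operator acting on a block of vectors. It is not the paper's algorithm, and the function name `apply_kron_operator` and the dense 1D factors `A` and `B` are assumptions made for illustration.

```python
import numpy as np


def apply_kron_operator(A, B, X):
    """Compute (A kron B) @ X without forming the Kronecker product.

    A is n x n, B is m x m, and X holds k vectors of length n*m as columns.
    The operator is applied one tensor direction at a time, which is the
    core idea of sum factorization.
    """
    n, m = A.shape[0], B.shape[0]
    # Row-major reshape: entry i*m + j of each vector becomes U[i, j, :].
    U = X.reshape(n, m, -1)
    # Contract along the first direction with A ...
    T = np.einsum("ip,pqk->iqk", A, U)
    # ... then along the second direction with B.
    V = np.einsum("jq,iqk->ijk", B, T)
    return V.reshape(n * m, -1)
```

For an n x n by m x m pair, this costs on the order of n*m*(n + m) operations per vector rather than (n*m)^2 for the assembled matrix, and it acts on all k columns of `X` at once, which is the multivector setting the abstract targets.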

2017

PhysRevB
Short-range atomic ordering in nonequilibrium silicon-germanium-tin semiconductors

S. Mukherjee, N. Kodali, D. Isheim, S. Wirths, J. M. Hartmann, D. Buca, D. N. Seidman, and O. Moutanabbir

Physical Review B, Apr 2017

Publisher: American Physical Society

Precise knowledge of the atomic order in monocrystalline alloys is fundamental to understanding and predicting their physical properties. With this perspective, we utilized laser-assisted atom probe tomography to investigate the three-dimensional distribution of atoms in nonequilibrium epitaxial Sn-rich group-IV SiGeSn ternary semiconductors. Different atom probe statistical analysis tools including frequency distribution analysis, partial radial distribution functions, and nearest-neighbor analysis were employed in order to evaluate and compare the behavior of the three elements to their spatial distributions in an ideal solid solution. This atomistic-level analysis provided clear evidence of an unexpected repulsive interaction between Sn and Si leading to the deviation of Si atoms from the theoretical random distribution. This departure from an ideal solid solution is supported by first-principles calculations and attributed to the tendency of the system to reduce its mixing enthalpy throughout the layer-by-layer growth process.
@article{mukherjee_short-range_2017,
  title = {Short-range atomic ordering in nonequilibrium silicon-germanium-tin semiconductors},
  volume = {95},
  issn = {2469-9950},
  url = {http://link.aps.org/doi/10.1103/PhysRevB.95.161402},
  doi = {10.1103/PhysRevB.95.161402},
  number = {16},
  urldate = {2017-04-11},
  journal = {Physical Review B},
  author = {Mukherjee, S. and Kodali, N. and Isheim, D. and Wirths, S. and Hartmann, J. M. and Buca, D. and Seidman, D. N. and Moutanabbir, O.},
  month = apr,
  year = {2017},
  note = {Publisher: American Physical Society},
  pages = {161402},
  publisher = {American Physical Society},
  dimensions = {true},
  ownpub = {true}
}