nvidia-cusparselt-cu12
v0.8.1
NVIDIA cuSPARSELt
$ uv add nvidia-cusparselt-cu12
###################################################################################
cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication
NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with a 50% sparsity ratio:
D = Activation(\alpha op(A) \cdot op(B) + \beta op(C) + bias)
where op(A) and op(B) refer to in-place operations such as transpose or non-transpose, and \alpha and \beta are scalars or vectors.
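To make the formula concrete, the following is a minimal dense NumPy reference of the same operation. It is only a sketch of the math, not the cuSPARSELt API: the function name `reference_matmul`, its parameters, the ReLU activation, and the choice of one bias value per output row are all assumptions made for illustration, and \alpha and \beta are taken as scalars.

```python
import numpy as np

def relu(x):
    """Example activation; any elementwise function could be passed instead."""
    return np.maximum(x, 0.0)

def reference_matmul(A, B, C, alpha=1.0, beta=0.0, bias=None,
                     trans_a=False, trans_b=False, trans_c=False,
                     activation=None):
    """Dense reference for D = Activation(alpha * op(A) @ op(B) + beta * op(C) + bias)."""
    op_a = A.T if trans_a else A
    op_b = B.T if trans_b else B
    op_c = C.T if trans_c else C
    D = alpha * (op_a @ op_b) + beta * op_c
    if bias is not None:
        # Assumption: one bias value per output row, broadcast across columns.
        D = D + np.asarray(bias, dtype=D.dtype).reshape(-1, 1)
    return D if activation is None else activation(D)

# A has two non-zeros in every group of four values, i.e. 50% sparsity.
A = np.array([[1.0, 0.0, 0.0, -2.0],
              [0.0, 3.0, 4.0, 0.0]])
B = np.arange(8, dtype=float).reshape(4, 2)
C = np.ones((2, 2))

D = reference_matmul(A, B, C, alpha=2.0, beta=0.5,
                     bias=[0.0, 1.0], activation=relu)
print(D)
```

A dense reference like this can be handy for checking the numerical result of a sparse kernel on small inputs.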
The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.
Download: developer.nvidia.com/cusparselt/downloads
Provide Feedback: [email protected]
Examples: cuSPARSELt Example 1, cuSPARSELt Example 2
Blog posts:
- Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
- Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines
- Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture
================================================================================
Key Features
- NVIDIA Sparse MMA tensor core support
- Mixed-precision computation support:

  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | Input A/B | Input C | Output D | Compute | Block scaled                  | Supported SM arch         |
  +===========+=========+==========+=========+===============================+===========================+
  | FP32      | FP32    | FP32     | FP32    | No                            | 8.0, 8.6, 8.7, 9.0, 10.0, |
  +-----------+---------+----------+---------+                               | 10.1, 11.0, 12.0, 12.1    |
  | BF16      | BF16    | BF16     | FP32    |                               |                           |
  +-----------+---------+----------+---------+                               |                           |
  | FP16      | FP16    | FP16     | FP32    |                               |                           |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | FP16      | FP16    | FP16     | FP16    | No                            | 9.0                       |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | INT8      | INT8    | INT8     | INT32   | No                            | 8.0, 8.6, 8.7, 9.0, 10.0, |
  |           +---------+----------+         |                               | 10.1, 11.0, 12.0, 12.1    |
  |           | INT32   | INT32    |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP16    | FP16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | BF16    | BF16     |         |                               |                           |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | E4M3      | FP16    | E4M3     | FP32    | No                            | 9.0, 10.0, 10.1, 11.0,    |
  |           +---------+----------+         |                               | 12.0, 12.1                |
  |           | BF16    | E4M3     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP16    | FP16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | BF16    | BF16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP32    | FP32     |         |                               |                           |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | E5M2      | FP16    | E5M2     | FP32    | No                            | 9.0, 10.0, 10.1, 11.0,    |
  |           +---------+----------+         |                               | 12.0, 12.1                |
  |           | BF16    | E5M2     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP16    | FP16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | BF16    | BF16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP32    | FP32     |         |                               |                           |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | E4M3      | FP16    | E4M3     | FP32    | A/B/D_OUT_SCALE = VEC64_UE8M0 | 10.0, 10.1, 11.0,         |
  |           +---------+----------+         | D_SCALE = 32F                 | 12.0, 12.1                |
  |           | BF16    | E4M3     |         |                               |                           |
  |           +---------+----------+         +-------------------------------+                           |
  |           | FP16    | FP16     |         | A/B_SCALE = VEC64_UE8M0       |                           |
  |           +---------+----------+         |                               |                           |
  |           | BF16    | BF16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP32    | FP32     |         |                               |                           |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+
  | E2M1      | FP16    | E2M1     | FP32    | A/B/D_SCALE = VEC32_UE4M3     | 10.0, 10.1, 11.0,         |
  |           +---------+----------+         | D_SCALE = 32F                 | 12.0, 12.1                |
  |           | BF16    | E2M1     |         |                               |                           |
  |           +---------+----------+         +-------------------------------+                           |
  |           | FP16    | FP16     |         | A/B_SCALE = VEC32_UE4M3       |                           |
  |           +---------+----------+         |                               |                           |
  |           | BF16    | BF16     |         |                               |                           |
  |           +---------+----------+         |                               |                           |
  |           | FP32    | FP32     |         |                               |                           |
  +-----------+---------+----------+---------+-------------------------------+---------------------------+

- Matrix pruning and compression functionalities
- Activation functions, bias vector, and output scaling
- Batched computation (multiple matrices in a single run)
- GEMM Split-K mode
- Auto-tuning functionality (see cusparseLtMatmulSearch())
- NVTX ranging and Logging functionalities
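As an illustration of what matrix pruning and compression mean for a 50% structured sparse operand, here is a minimal NumPy sketch. It assumes the 2:4 pattern (at most two non-zeros in every group of four consecutive values along a row) and a magnitude-based pruning rule. The helper names `prune_2_4`, `compress_2_4`, and `decompress_2_4` are hypothetical, and the values-plus-indices layout is chosen for readability only; it is not the compressed format that cuSPARSELt uses internally.

```python
import numpy as np

def prune_2_4(A):
    """Keep the two largest-magnitude values in every group of four per row."""
    rows, cols = A.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = A.reshape(rows, cols // 4, 4)
    # Positions of the two largest magnitudes in each group (ties keep the earliest).
    keep = np.argsort(-np.abs(groups), axis=-1, kind="stable")[..., :2]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return np.where(mask, groups, 0).reshape(rows, cols)

def compress_2_4(pruned):
    """Store two values per group plus their positions (0..3) inside the group."""
    rows, cols = pruned.shape
    groups = pruned.reshape(rows, cols // 4, 4)
    # Non-zero positions sort first; take two and restore ascending order.
    keep = np.sort(np.argsort(groups == 0, axis=-1, kind="stable")[..., :2], axis=-1)
    values = np.take_along_axis(groups, keep, axis=-1).reshape(rows, cols // 2)
    indices = keep.reshape(rows, cols // 2).astype(np.uint8)
    return values, indices

def decompress_2_4(values, indices):
    """Rebuild the dense pruned matrix from the illustrative compressed form."""
    rows, half = values.shape
    groups = np.zeros((rows, half // 2, 4), dtype=values.dtype)
    np.put_along_axis(groups,
                      indices.reshape(rows, half // 2, 2).astype(np.intp),
                      values.reshape(rows, half // 2, 2),
                      axis=-1)
    return groups.reshape(rows, half * 2)

A = np.array([[0.1, -2.0, 3.0, 0.5, 4.0, 0.0, -0.2, 1.0]])
pruned = prune_2_4(A)
values, indices = compress_2_4(pruned)
restored = decompress_2_4(values, indices)
```

The compressed form stores half as many values as the dense matrix plus a small position index per value, which is the idea behind the storage and bandwidth savings of structured sparsity.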
================================================================================
Support
- Supported SM Architectures: SM 8.0, SM 8.6, SM 8.7, SM 8.9, SM 9.0, SM 10.0, SM 10.1 (for CTK 12), SM 11.0 (for CTK 13), SM 12.0, SM 12.1
- Supported CPU architectures and operating systems:

  +------------+--------------------+
  | OS         | CPU archs          |
  +============+====================+
  | Windows    | x86_64             |
  +------------+--------------------+
  | Linux      | x86_64, Arm64      |
  +------------+--------------------+

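The architecture list above can be encoded as a small lookup, for example to fail early on an unsupported GPU. This is a sketch only: the names `COMMON_SM`, `TOOLKIT_SPECIFIC_SM`, `supported_sm`, and `is_supported` are hypothetical, and it assumes that "for CTK 12" and "for CTK 13" mean those two entries apply only to that CUDA Toolkit major version.

```python
# Compute capabilities copied from the list above as (major, minor) tuples.
COMMON_SM = [(8, 0), (8, 6), (8, 7), (8, 9), (9, 0), (10, 0), (12, 0), (12, 1)]

# Assumption: these entries are tied to one CUDA Toolkit major version each.
TOOLKIT_SPECIFIC_SM = {12: [(10, 1)], 13: [(11, 0)]}

def supported_sm(ctk_major):
    """Return the sorted list of supported capabilities for a CUDA Toolkit major version."""
    return sorted(COMMON_SM + TOOLKIT_SPECIFIC_SM.get(ctk_major, []))

def is_supported(capability, ctk_major):
    """Check a (major, minor) compute capability against the list."""
    return tuple(capability) in supported_sm(ctk_major)

print(is_supported((9, 0), ctk_major=12))
print(is_supported((7, 5), ctk_major=12))
```

In a PyTorch environment, the capability tuple could come from `torch.cuda.get_device_capability()`; that pairing is a suggestion, not something this package requires.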
================================================================================
Documentation
Please refer to https://docs.nvidia.com/cuda/cusparselt/index.html for the cuSPARSELt documentation.
================================================================================
Installation
The cuSPARSELt wheel can be installed as follows:
pip install nvidia-cusparselt-cuXX
where XX is the CUDA major version.
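After installing, the wheel's presence can be checked from Python with the standard library. The sketch below builds the distribution name from the CUDA major version and queries the installed version; the helper names `wheel_name` and `installed_version` are hypothetical, and only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata

def wheel_name(cuda_major):
    """Build the distribution name, e.g. 12 -> 'nvidia-cusparselt-cu12'."""
    return f"nvidia-cusparselt-cu{int(cuda_major)}"

def installed_version(dist_name):
    """Return the installed version string, or None if the wheel is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

name = wheel_name(12)
version = installed_version(name)
if version is None:
    print(f"{name} is not installed")
else:
    print(f"{name} {version} is installed")
```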
Details
- Version: 0.8.1
- License: NVIDIA Proprietary Software
- Maintainer: NVIDIA Corporation