PyTorch TorchInductor integrates CuteDSL as a matrix multiplication auto-tuning backend

ME News Update, April 7 (UTC+8): the PyTorch team recently announced that CuteDSL has been integrated into TorchInductor as its fourth matrix-multiplication auto-tuning backend. The backend was selected against three criteria: it must not add excessive maintenance burden, must not slow down compilation or benchmarking, and must deliver better performance on target workloads.

CuteDSL, actively developed by NVIDIA, provides hand-optimized kernel templates. Its compilation times are comparable to the existing backends and significantly faster than the CUTLASS C++ path, which requires a full nvcc compilation. The backend is built on the same abstractions as CUTLASS C++ but is written in Python, making it faster to compile and easier to maintain, and it has demonstrated strong performance on FP8 GEMM and epilogue fusion.

The team focuses on optimizing GEMM (general matrix multiplication) because it accounts for the majority of the computational cost in Transformer models. CuteDSL generates low-level code from hand-crafted, optimized templates, avoiding the complexity of writing kernels from scratch, while fully exposing the thread and memory hierarchy and supporting architecture-specific features. (Source: InfoQ)
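The claim that GEMM dominates Transformer compute can be checked with a rough back-of-the-envelope FLOP count: the attention projections, the attention score/value matmuls, and the MLP layers are all GEMMs, while softmax, layer norm, and activations are comparatively cheap elementwise work. The sketch below uses illustrative sizes and simplified cost formulas (an assumption for demonstration, not a precise model of any particular architecture):

```python
# Rough FLOP estimate for one Transformer layer, illustrating why
# GEMM dominates the compute budget. All sizes and the elementwise
# cost model are illustrative assumptions, not measured figures.
def layer_flops(seq_len, d_model, d_ff):
    # Q, K, V, and output projections: 4 GEMMs of (s, d) x (d, d),
    # each costing ~2*s*d*d multiply-adds
    proj = 4 * 2 * seq_len * d_model * d_model
    # attention scores (s, d) x (d, s) plus weighted sum (s, s) x (s, d)
    attn = 2 * 2 * seq_len * seq_len * d_model
    # MLP up- and down-projections: (s, d) x (d, d_ff) and back
    mlp = 2 * 2 * seq_len * d_model * d_ff
    gemm = proj + attn + mlp
    # elementwise work (softmax, layer norm, activation): a few
    # passes over the activations, roughly O(s*d) and O(s*s)
    elementwise = 10 * seq_len * d_model + 5 * seq_len * seq_len
    return gemm, elementwise

gemm, ew = layer_flops(seq_len=2048, d_model=4096, d_ff=16384)
print(f"GEMM share of FLOPs: {gemm / (gemm + ew):.2%}")
```

Even with generous allowances for the elementwise operations, the GEMM share comes out above 99% at these sizes, which is why an auto-tuning backend dedicated to matrix multiplication pays off.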
