CUDA Python, here we come: Nvidia offers Python devs the gift of GPU acceleration

By Julia Schmidt

October 22, 2021

Nvidia’s CUDA team has released version 11.5 of its parallel computing platform and pushed a Python-flavoured subproject forward in the process.

With the update, CUDA Python leaves its preview phase behind and should now be fit for general usage. The collection of Cython bindings, Python API wrappers, and libraries gives data scientists and developers working in HPC and ML a simpler way of accessing the CUDA host APIs, so that they can use Nvidia GPUs to accelerate their projects.

Nvidia promises that CUDA Python code, when used correctly, performs about as well as its C++ equivalent. The project is meant to work on every platform that supports CUDA, and needs Cython 0.29.24, pytest 6.2.4, pytest-benchmark 3.4.1, and numpy 1.21.1 (or newer versions) to be fully functional.
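
To check an environment against those minimums, a small script along the following lines could help. It is only a sketch: MINIMUMS copies the versions quoted above, while parse_version and unmet_requirements are hypothetical helpers written for this illustration, not part of CUDA Python, and the parser assumes plain dotted numeric version strings.

```python
# Minimum versions as quoted in the article; the helper names are hypothetical.
MINIMUMS = {
    "Cython": "0.29.24",
    "pytest": "6.2.4",
    "pytest-benchmark": "3.4.1",
    "numpy": "1.21.1",
}


def parse_version(text):
    """Turn '1.21.1' into (1, 21, 1); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(installed):
    """Return the packages that are missing or older than the quoted minimum.

    `installed` maps package names to version strings.
    """
    unmet = []
    for name, minimum in MINIMUMS.items():
        found = installed.get(name)
        if found is None or parse_version(found) < parse_version(minimum):
            unmet.append(name)
    return unmet


if __name__ == "__main__":
    # Made-up environment: pytest is too old, pytest-benchmark is absent.
    example = {"Cython": "0.29.24", "pytest": "6.2.3", "numpy": "1.22.0"}
    print(unmet_requirements(example))  # ['pytest', 'pytest-benchmark']
```

In a real environment, the installed versions could be looked up with importlib.metadata.version() from the Python standard library. Pre-release or otherwise non-numeric version strings would need a more forgiving parser.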

Going back to CUDA itself, version 11.5 adds support for signed and unsigned normalised 8- and 16-bit types and comes with a first version of a new __int128 data type. Developers shouldn’t expect too much from the latter yet, however, as the new type still lacks broad support for math operations, libraries, and dev tools at this point. Starting with the current release, floating point division should also be a bit faster when the divisor is known at compile time, though users have to enable the corresponding optimisation via nvcc -Xcicc -opt-fdiv=1 first.
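
For readers unfamiliar with normalised integer types, the idea is that a small integer stands in for a fraction between 0 and 1 (unsigned) or -1 and 1 (signed). The sketch below shows the mapping in plain Python, assuming the convention commonly used by graphics APIs. The release announcement doesn’t spell out how CUDA performs the conversion, so the formulas and the function names unorm_to_float and snorm_to_float are assumptions made for illustration only.

```python
def unorm_to_float(value, bits):
    """Map an unsigned normalised integer onto the range [0.0, 1.0]."""
    max_value = (1 << bits) - 1          # 255 for 8 bits, 65535 for 16 bits
    if not 0 <= value <= max_value:
        raise ValueError("value does not fit into the given bit width")
    return value / max_value


def snorm_to_float(value, bits):
    """Map a signed normalised integer onto the range [-1.0, 1.0]."""
    max_value = (1 << (bits - 1)) - 1    # 127 for 8 bits, 32767 for 16 bits
    min_value = -(1 << (bits - 1))       # -128 for 8 bits, -32768 for 16 bits
    if not min_value <= value <= max_value:
        raise ValueError("value does not fit into the given bit width")
    # Assumed convention: the most negative value is clamped to -1.0.
    return max(value / max_value, -1.0)


if __name__ == "__main__":
    print(unorm_to_float(255, 8))     # 1.0
    print(snorm_to_float(-128, 8))    # -1.0
    print(snorm_to_float(0, 16))      # 0.0
```

The appeal of such types is storage: an 8-bit value takes a quarter of the memory of a 32-bit float while still covering the unit range, at the cost of precision.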

Users looking for a specific behaviour when caching data on the device side can try configuring it now using annotated pointers. CUDA 11.5 also includes the option of setting up per-process memory access policies to have more control over multiple processes sharing GPUs, and there are new functions for inclusive and exclusive scans for cooperative groups available.
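
The difference between the two scan flavours is easiest to see on a small list. The following is a sequential Python sketch of the general semantics only, not of the cooperative groups API: the function names mirror the terms used above, but the signatures, the default addition operator, and the identity parameter are assumptions chosen for illustration.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """Element i of the result combines inputs 0..i (including i)."""
    return list(accumulate(values, op))


def exclusive_scan(values, identity=0, op=operator.add):
    """Element i of the result combines inputs 0..i-1 (excluding i).

    The first element is the identity of the operation, which is an
    assumption of this sketch rather than a statement about CUDA.
    """
    values = list(values)
    if not values:
        return []
    return list(accumulate(values[:-1], op, initial=identity))


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(inclusive_scan(data))   # [1, 3, 6, 10]
    print(exclusive_scan(data))   # [0, 1, 3, 6]
```

On a GPU the same result is computed cooperatively by the threads of a group rather than in a loop, with each thread contributing one input and receiving one output.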

Other than that, the CUDA compiler team reworked the toolchain so that it can now link with cubins larger than 2GB and supports numerous pragmas for more control over diagnostic messages. Developers can now use __builtin_assume to specify an address space, which allows for more efficient loads and stores, and set the -arch=all or -arch=all-major options to generate code for multiple architectures at once.

As part of the release, the cuBLAS library gains the new auxiliary functions cublasGetStatusName() and cublasGetStatusString(), additional epilogue options to support fusion in DL training, and vector alpha support for per-row scaling in TN int32 math Matmul with int8 output.

The CUDA team hasn’t deprecated any features with this update, though Nvidia driver support for Kepler is removed beginning with R495. Things look slightly different for the Math library though, so checking the project’s documentation is advised to learn about deprecated APIs.
