diff --git a/doc/languages-frameworks/cuda.section.md b/doc/languages-frameworks/cuda.section.md
index 1a40f4cc1a24..39ad3a564b56 100644
--- a/doc/languages-frameworks/cuda.section.md
+++ b/doc/languages-frameworks/cuda.section.md
@@ -65,97 +65,6 @@ for your specific card(s).

 Library maintainers should consult [NVCC Docs](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/) and release notes for their software package.

-## Adding a new CUDA release {#adding-a-new-cuda-release}
-
-> **WARNING**
->
-> This section of the docs is still very much in progress. Feedback is welcome in GitHub Issues tagging @NixOS/cuda-maintainers or on [Matrix](https://matrix.to/#/#cuda:nixos.org).
-
-The CUDA Toolkit is a suite of CUDA libraries and software meant to provide a development environment for CUDA-accelerated applications. Until the release of CUDA 11.4, NVIDIA had only made the CUDA Toolkit available as a multi-gigabyte runfile installer, which we provide through the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute. From CUDA 11.4 and onwards, NVIDIA has also provided CUDA redistributables (“CUDA-redist”): individually packaged CUDA Toolkit components meant to facilitate redistribution and inclusion in downstream projects. These packages are available in the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
-
-All new projects should use the CUDA redistributables available in [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) in place of [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit), as they are much easier to maintain and update.
-
-### Updating CUDA redistributables {#updating-cuda-redistributables}
-
-1. Go to NVIDIA's index of CUDA redistributables:
-2. Make a note of the new version of CUDA available.
-3. Run
-
-   ```bash
-   nix run github:connorbaker/cuda-redist-find-features -- \
-      download-manifests \
-      --log-level DEBUG \
-      --version \
-      https://developer.download.nvidia.com/compute/cuda/redist \
-      ./pkgs/development/cuda-modules/cuda/manifests
-   ```
-
-   This will download a copy of the manifest for the new version of CUDA.
-4. Run
-
-   ```bash
-   nix run github:connorbaker/cuda-redist-find-features -- \
-      process-manifests \
-      --log-level DEBUG \
-      --version \
-      https://developer.download.nvidia.com/compute/cuda/redist \
-      ./pkgs/development/cuda-modules/cuda/manifests
-   ```
-
-   This will generate a `redistrib_features_.json` file in the same directory as the manifest.
-5. Update the `cudaVersionMap` attribute set in `pkgs/development/cuda-modules/cuda/extension.nix`.
-
-### Updating cuTensor {#updating-cutensor}
-
-1. Repeat the steps present in [Updating CUDA redistributables](#updating-cuda-redistributables) with the following changes:
-   - Use the index of cuTensor redistributables:
-   - Use the newest version of cuTensor available instead of the newest version of CUDA.
-   - Use `pkgs/development/cuda-modules/cutensor/manifests` instead of `pkgs/development/cuda-modules/cuda/manifests`.
-   - Skip the step of updating `cudaVersionMap` in `pkgs/development/cuda-modules/cuda/extension.nix`.
-
-### Updating supported compilers and GPUs {#updating-supported-compilers-and-gpus}
-
-1. Update `nvccCompatibilities` in `pkgs/development/cuda-modules/_cuda/db/bootstrap/nvcc.nix` to include the newest release of NVCC, as well as any newly supported host compilers.
-2. Update `cudaCapabilityToInfo` in `pkgs/development/cuda-modules/_cuda/db/bootstrap/cuda.nix` to include any new GPUs supported by the new release of CUDA.
-
-### Updating the CUDA Toolkit runfile installer {#updating-the-cuda-toolkit}
-
-> **WARNING**
->
-> While the CUDA Toolkit runfile installer is still available in Nixpkgs as the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute, its use is not recommended and should it be considered deprecated. Please migrate to the CUDA redistributables provided by the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
->
-> To ensure packages relying on the CUDA Toolkit runfile installer continue to build, it will continue to be updated until a migration path is available.
-
-1. Go to NVIDIA's CUDA Toolkit runfile installer download page:
-2. Select the appropriate OS, architecture, distribution, and version, and installer type.
-
-   - For example: Linux, x86_64, Ubuntu, 22.04, runfile (local)
-   - NOTE: Typically, we use the Ubuntu runfile. It is unclear if the runfile for other distributions will work.
-
-3. Take the link provided by the installer instructions on the webpage after selecting the installer type and get its hash by running:
-
-   ```bash
-   nix store prefetch-file --hash-type sha256
-   ```
-
-4. Update `pkgs/development/cuda-modules/cudatoolkit/releases.nix` to include the release.
-
-### Updating the CUDA package set {#updating-the-cuda-package-set}
-
-1. Include a new `cudaPackages__` package set in `pkgs/top-level/all-packages.nix`.
-
-   - NOTE: Changing the default CUDA package set should occur in a separate PR, allowing time for additional testing.
-
-2. Successfully build the closure of the new package set, updating `pkgs/development/cuda-modules/cuda/overrides.nix` as needed. Below are some common failures:
-
-| Unable to ... | During ... | Reason | Solution | Note |
-| --- | --- | --- | --- | --- |
-| Find headers | `configurePhase` or `buildPhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contain the headers |
-| Find libraries | `configurePhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contain CMake configuration files |
-| Find libraries | `buildPhase` or `patchelf` | Missing dependency on a `lib` or `static` output | Add the missing dependency | The `lib` or `static` output typically contain the libraries |
-
-In the scenario you are unable to run the resulting binary: this is arguably the most complicated as it could be any combination of the previous reasons. This type of failure typically occurs when a library attempts to load or open a library it depends on that it does not declare in its `DT_NEEDED` section. As a first step, ensure that dependencies are patched with [`autoAddDriverRunpath`](https://search.nixos.org/packages?channel=unstable&type=packages&query=autoAddDriverRunpath). Failing that, try running the application with [`nixGL`](https://github.com/guibou/nixGL) or a similar wrapper tool. If that works, it likely means that the application is attempting to load a library that is not in the `RPATH` or `RUNPATH` of the binary.
-
 ## Running Docker or Podman containers with CUDA support {#cuda-docker-podman}

 It is possible to run Docker or Podman containers with CUDA support. The recommended mechanism to perform this task is to use the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html).
@@ -201,7 +110,7 @@ $ nix run nixpkgs#jq -- -r '.devices[].name' < /var/run/cdi/nvidia-container-too
 all
 ```

-### Specifying what devices to expose to the container {#specifying-what-devices-to-expose-to-the-container}
+### Specifying what devices to expose to the container {#cuda-specifying-what-devices-to-expose-to-the-container}

 You can choose what devices are exposed to your containers by using the identifier on the generated CDI specification. Like follows:

@@ -222,7 +131,7 @@ GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: )
 By default, the NVIDIA Container Toolkit will use the GPU index to identify specific devices. You can change the way to identify what devices to expose by using the `hardware.nvidia-container-toolkit.device-name-strategy` NixOS attribute.
 :::

-### Using docker-compose {#using-docker-compose}
+### Using docker-compose {#cuda-using-docker-compose}

 It's possible to expose GPU's to a `docker-compose` environment as well. With a `docker-compose.yaml` file like follows:

@@ -256,3 +165,142 @@ services:
             - nvidia.com/gpu=0
             - nvidia.com/gpu=1
 ```
+
+## Contributing {#cuda-contributing}
+
+::: {.warning}
+This section of the docs is still very much in progress. Feedback is welcome in GitHub Issues tagging @NixOS/cuda-maintainers or on [Matrix](https://matrix.to/#/#cuda:nixos.org).
+:::
+
+### Package set maintenance {#cuda-package-set-maintenance}
+
+The CUDA Toolkit is a suite of CUDA libraries and software meant to provide a development environment for CUDA-accelerated applications. Until the release of CUDA 11.4, NVIDIA had only made the CUDA Toolkit available as a multi-gigabyte runfile installer, which we provide through the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute. From CUDA 11.4 and onwards, NVIDIA has also provided CUDA redistributables (“CUDA-redist”): individually packaged CUDA Toolkit components meant to facilitate redistribution and inclusion in downstream projects. These packages are available in the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
+
+All new projects should use the CUDA redistributables available in [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) in place of [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit), as they are much easier to maintain and update.
+
+#### Updating redistributables {#cuda-updating-redistributables}
+
+1. Go to NVIDIA's index of CUDA redistributables:
+2. Make a note of the new version of CUDA available.
+3. Run
+
+   ```bash
+   nix run github:connorbaker/cuda-redist-find-features -- \
+      download-manifests \
+      --log-level DEBUG \
+      --version \
+      https://developer.download.nvidia.com/compute/cuda/redist \
+      ./pkgs/development/cuda-modules/cuda/manifests
+   ```
+
+   This will download a copy of the manifest for the new version of CUDA.
+4. Run
+
+   ```bash
+   nix run github:connorbaker/cuda-redist-find-features -- \
+      process-manifests \
+      --log-level DEBUG \
+      --version \
+      https://developer.download.nvidia.com/compute/cuda/redist \
+      ./pkgs/development/cuda-modules/cuda/manifests
+   ```
+
+   This will generate a `redistrib_features_.json` file in the same directory as the manifest.
+5. Update the `cudaVersionMap` attribute set in `pkgs/development/cuda-modules/cuda/extension.nix`.
+
+#### Updating cuTensor {#cuda-updating-cutensor}
+
+1. Repeat the steps present in [Updating CUDA redistributables](#cuda-updating-redistributables) with the following changes:
+   - Use the index of cuTensor redistributables:
+   - Use the newest version of cuTensor available instead of the newest version of CUDA.
+   - Use `pkgs/development/cuda-modules/cutensor/manifests` instead of `pkgs/development/cuda-modules/cuda/manifests`.
+   - Skip the step of updating `cudaVersionMap` in `pkgs/development/cuda-modules/cuda/extension.nix`.
+
+#### Updating supported compilers and GPUs {#cuda-updating-supported-compilers-and-gpus}
+
+1. Update `nvccCompatibilities` in `pkgs/development/cuda-modules/_cuda/data/nvcc.nix` to include the newest release of NVCC, as well as any newly supported host compilers.
+2. Update `cudaCapabilityToInfo` in `pkgs/development/cuda-modules/_cuda/data/cuda.nix` to include any new GPUs supported by the new release of CUDA.
+
+#### Updating the CUDA Toolkit runfile installer {#cuda-updating-the-cuda-toolkit}
+
+::: {.warning}
+While the CUDA Toolkit runfile installer is still available in Nixpkgs as the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute, its use is not recommended, and it should be considered deprecated. Please migrate to the CUDA redistributables provided by the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
+
+To ensure packages relying on the CUDA Toolkit runfile installer continue to build, it will continue to be updated until a migration path is available.
+:::
+
+1. Go to NVIDIA's CUDA Toolkit runfile installer download page:
+2. Select the appropriate OS, architecture, distribution, and version, and installer type.
+
+   - For example: Linux, x86_64, Ubuntu, 22.04, runfile (local)
+   - NOTE: Typically, we use the Ubuntu runfile. It is unclear if the runfile for other distributions will work.
+
+3. Take the link provided by the installer instructions on the webpage after selecting the installer type and get its hash by running:
+
+   ```bash
+   nix store prefetch-file --hash-type sha256
+   ```
+
+4. Update `pkgs/development/cuda-modules/cudatoolkit/releases.nix` to include the release.
+
+#### Updating the CUDA package set {#cuda-updating-the-cuda-package-set}
+
+1. Include a new `cudaPackages__` package set in `pkgs/top-level/all-packages.nix`.
+
+   - NOTE: Changing the default CUDA package set should occur in a separate PR, allowing time for additional testing.
+
+2. Successfully build the closure of the new package set, updating `pkgs/development/cuda-modules/cuda/overrides.nix` as needed. Below are some common failures:
+
+| Unable to ... | During ... | Reason | Solution | Note |
+| -------------- | -------------------------------- | ------------------------------------------------ | -------------------------- | ------------------------------------------------------------ |
+| Find headers | `configurePhase` or `buildPhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contains the headers |
+| Find libraries | `configurePhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contains CMake configuration files |
+| Find libraries | `buildPhase` or `patchelf` | Missing dependency on a `lib` or `static` output | Add the missing dependency | The `lib` or `static` output typically contains the libraries |
+
+Failure to run the resulting binary is typically the most challenging to diagnose, as it may involve a combination of the aforementioned issues. This type of failure typically occurs when a library attempts to load or open a library it depends on that it does not declare in its `DT_NEEDED` section. Try the following debugging steps:
+
+1. First ensure that dependencies are patched with [`autoAddDriverRunpath`](https://search.nixos.org/packages?channel=unstable&type=packages&query=autoAddDriverRunpath).
+2. Failing that, try running the application with [`nixGL`](https://github.com/guibou/nixGL) or a similar wrapper tool.
+3. If that works, it likely means that the application is attempting to load a library that is not in the `RPATH` or `RUNPATH` of the binary.
+
+### Writing tests {#cuda-writing-tests}
+
+::: {.caution}
+The existence of `passthru.testers` and `passthru.tests` should be considered an implementation detail -- they are not meant to be a public or stable interface.
+:::
+
+In general, there are two attribute sets in `passthru` that are used to build and run tests for CUDA packages: `passthru.testers` and `passthru.tests`. Each attribute set may contain an attribute set named `cuda`, which contains CUDA-specific derivations. The `cuda` attribute set is used to separate CUDA-specific derivations from those which support multiple implementations (e.g., OpenCL, ROCm, etc.) or have different licenses. For an example of such generic derivations, see the `magma` package.
+
+::: {.note}
+Derivations are nested under the `cuda` attribute due to an OfBorg quirk: if evaluation fails (e.g., because of unfree licenses), the entire enclosing attribute set is discarded. This prevents other attributes in the set from being discovered, evaluated, or built.
+:::
+
+#### `passthru.testers` {#cuda-passthru-testers}
+
+Attributes added to `passthru.testers` are derivations which produce an executable which runs a test. The produced executable should:
+
+- Take care to set up the environment, make temporary directories, and so on.
+- Be registered as the derivation's `meta.mainProgram` so that it can be run directly.
+
+::: {.note}
+Testers which always require CUDA should be placed in `passthru.testers.cuda`, while those which are generic should be placed in `passthru.testers`.
+:::
+
+The `passthru.testers` attribute set allows running tests outside the Nix sandbox. There are a number of reasons why this is useful, since such a test:
+
+- Can be run on non-NixOS systems, when wrapped with utilities like `nixGL` or `nix-gl-host`.
+- Has network access patterns which are difficult or impossible to sandbox.
+- Is free to produce output which is not deterministic, such as timing information.
+
+#### `passthru.tests` {#cuda-passthru-tests}
+
+Attributes added to `passthru.tests` are derivations which run tests inside the Nix sandbox. Tests should:
+
+- Use the executables produced by `passthru.testers`, where possible, to avoid duplication of test logic.
+- Include `requiredSystemFeatures = [ "cuda" ];`, possibly conditioned on the value of `cudaSupport` if they are generic, to ensure that they are only run on systems exposing a CUDA-capable GPU.
+
+::: {.note}
+Tests which always require CUDA should be placed in `passthru.tests.cuda`, while those which are generic should be placed in `passthru.tests`.
+:::
+
+This is useful for tests which are deterministic (e.g., checking exit codes) and which can be provided with all necessary resources in the sandbox.
diff --git a/doc/redirects.json b/doc/redirects.json
index 4175d8ef4c91..c262568cf5ca 100644
--- a/doc/redirects.json
+++ b/doc/redirects.json
@@ -2756,33 +2756,53 @@
   "cuda": [
     "index.html#cuda"
   ],
-  "adding-a-new-cuda-release": [
-    "index.html#adding-a-new-cuda-release"
-  ],
-  "updating-cuda-redistributables": [
-    "index.html#updating-cuda-redistributables"
-  ],
-  "updating-cutensor": [
-    "index.html#updating-cutensor"
-  ],
-  "updating-supported-compilers-and-gpus": [
-    "index.html#updating-supported-compilers-and-gpus"
-  ],
-  "updating-the-cuda-toolkit": [
-    "index.html#updating-the-cuda-toolkit"
-  ],
-  "updating-the-cuda-package-set": [
-    "index.html#updating-the-cuda-package-set"
+  "cuda-contributing": [
+    "index.html#cuda-contributing"
   ],
   "cuda-docker-podman": [
     "index.html#cuda-docker-podman"
   ],
-  "specifying-what-devices-to-expose-to-the-container": [
+  "cuda-package-set-maintenance": [
+    "index.html#cuda-package-set-maintenance",
+    "index.html#adding-a-new-cuda-release"
+  ],
+  "cuda-passthru-testers": [
+    "index.html#cuda-passthru-testers"
+  ],
+  "cuda-passthru-tests": [
+    "index.html#cuda-passthru-tests"
+  ],
+  "cuda-specifying-what-devices-to-expose-to-the-container": [
+    "index.html#cuda-specifying-what-devices-to-expose-to-the-container",
     "index.html#specifying-what-devices-to-expose-to-the-container"
   ],
-  "using-docker-compose": [
+  "cuda-updating-redistributables": [
+    "index.html#cuda-updating-redistributables",
+    "index.html#updating-cuda-redistributables"
+  ],
+  "cuda-updating-cutensor": [
+    "index.html#cuda-updating-cutensor",
+    "index.html#updating-cutensor"
+  ],
+  "cuda-updating-supported-compilers-and-gpus": [
+    "index.html#cuda-updating-supported-compilers-and-gpus",
+    "index.html#updating-supported-compilers-and-gpus"
+  ],
+  "cuda-updating-the-cuda-package-set": [
+    "index.html#cuda-updating-the-cuda-package-set",
+    "index.html#updating-the-cuda-package-set"
+  ],
+  "cuda-updating-the-cuda-toolkit": [
+    "index.html#cuda-updating-the-cuda-toolkit",
+    "index.html#updating-the-cuda-toolkit"
+  ],
+  "cuda-using-docker-compose": [
+    "index.html#cuda-using-docker-compose",
     "index.html#using-docker-compose"
   ],
+  "cuda-writing-tests": [
+    "index.html#cuda-writing-tests"
+  ],
   "cuelang": [
     "index.html#cuelang"
   ],
@@ -3027,7 +3047,7 @@
   "ex-buildGoModule": [
     "index.html#ex-buildGoModule"
   ],
-  "ssec-go-toolchain-versions" : [
+  "ssec-go-toolchain-versions": [
     "index.html#ssec-go-toolchain-versions"
   ],
   "buildGoModule-goModules-override": [