
Installation

This guide covers installing the NVIDIA NIM Operator in your Kubernetes cluster.

Prerequisites

Before installing the operator, ensure your cluster meets these requirements:
1. Kubernetes version

Kubernetes v1.28 or higher is required.
kubectl version
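To automate this check, the server version string can be compared against the 1.28 minimum in a script. The sketch below is one way to do that; `meets_min_version` is a hypothetical helper (not part of kubectl or the operator), and the usage comment assumes `jq` is installed.

```shell
# Hypothetical helper: succeed if a Kubernetes version string such as
# "v1.29.3" or "v1.30.1+k3s1" is at least MAJOR.MINOR.
# Usage: meets_min_version VERSION MAJOR MINOR
meets_min_version() {
  local version major rest minor
  version="${1#v}"            # drop the leading "v"
  major="${version%%.*}"      # text before the first dot
  rest="${version#*.}"        # text after the first dot
  minor="${rest%%[!0-9]*}"    # leading digits of the minor field
  # Return 2 for anything that did not parse into two numbers.
  case "$major" in ''|*[!0-9]*) return 2 ;; esac
  case "$minor" in ''|*[!0-9]*) return 2 ;; esac
  [ "$major" -gt "$2" ] && return 0
  [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]
}

# Usage against a live cluster (assumes jq is available):
#   server="$(kubectl version -o json | jq -r '.serverVersion.gitVersion')"
#   meets_min_version "$server" 1 28 && echo "Kubernetes version OK"
```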
2. NVIDIA GPU Operator

Install the NVIDIA GPU Operator to provide GPU device plugins and drivers.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
3. cert-manager (recommended)

The operator uses cert-manager to issue admission webhook certificates when the admission controller is enabled, which is the default.
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
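Before installing the operator, it can help to wait until cert-manager is ready to issue certificates. A minimal sketch, assuming the manifest above installed cert-manager into its usual `cert-manager` namespace; the function name and the default timeout are illustrative choices, not part of this guide.

```shell
# Hypothetical helper: block until every cert-manager Deployment reports
# the Available condition, or fail when the timeout expires.
# Usage: wait_for_cert_manager [TIMEOUT]   (default: 120s)
wait_for_cert_manager() {
  local timeout="${1:-120s}"
  kubectl wait --for=condition=Available deployment --all \
    -n cert-manager --timeout="$timeout"
}
```

Call `wait_for_cert_manager 300s` on slower clusters where image pulls take longer.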
4. Cluster access

Ensure you have cluster-admin privileges to install CRDs and cluster-scoped resources.
Note: The admission controller can be disabled if cert-manager is not available, but keep it enabled in production so that resource configurations are validated.

Installation methods
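
A Helm-based install can be sketched from commands that appear elsewhere in this guide: the `nvidia` Helm repository from the prerequisites, and the `nvidia/k8s-nim-operator` chart, `nim-operator` release, and `nim-operator` namespace from the OpenShift and Upgrade sections. The wrapper function `install_nim_operator` is a hypothetical convenience, not something the chart provides.

```shell
# Hypothetical wrapper around the Helm commands used in this guide.
# Extra arguments (for example "-f values.yaml") are passed to helm install.
install_nim_operator() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia &&
  helm repo update &&
  helm install nim-operator nvidia/k8s-nim-operator \
    -n nim-operator \
    --create-namespace "$@"
}
```

For example, `install_nim_operator -f values.yaml` installs the chart with the overrides described under Configuration options.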

Verify installation

Check that the operator is running:
kubectl get pods -n nim-operator
Expected output:
NAME                                               READY   STATUS    RESTARTS   AGE
nim-operator-controller-manager-7d9f8c5b6d-x9k2p   1/1     Running   0          2m
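Verify the CRDs are installed. Filtering on apps.nvidia.com excludes CRDs that belong to the NVIDIA GPU Operator:
kubectl get crds | grep apps.nvidia.com
Expected output (names only):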
nimbuilds.apps.nvidia.com
nimcaches.apps.nvidia.com
nimpipelines.apps.nvidia.com
nimservices.apps.nvidia.com
nemocustomizers.apps.nvidia.com
nemodatastores.apps.nvidia.com
nemoentitystores.apps.nvidia.com
nemoevaluators.apps.nvidia.com
nemoguardrails.apps.nvidia.com
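The CRD check can also be scripted so that it reports exactly which definitions are absent. The following is a sketch built from the expected CRD names listed above; `missing_crds` is a hypothetical helper name.

```shell
# Hypothetical helper: read installed CRD names from stdin (one per line)
# and print every expected NIM Operator CRD that is not among them.
missing_crds() {
  local installed expected
  installed="$(cat)"
  for expected in nimbuilds nimcaches nimpipelines nimservices \
      nemocustomizers nemodatastores nemoentitystores \
      nemoevaluators nemoguardrails; do
    # -x matches whole lines, -F treats the dots literally.
    printf '%s\n' "$installed" | grep -qxF "${expected}.apps.nvidia.com" \
      || printf '%s\n' "${expected}.apps.nvidia.com"
  done
}

# Usage against a live cluster (strips the "customresourcedefinition.../" prefix):
#   kubectl get crds -o name | sed 's|.*/||' | missing_crds
```

Empty output means all nine CRDs are present.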
Check operator logs:
kubectl logs -n nim-operator -l control-plane=controller-manager

Configuration options

Operator arguments

The operator accepts these command-line arguments:
  • --health-probe-bind-address - Address for health probe server (default: :8081)
  • --metrics-bind-address - Address for metrics server (default: :8080)
  • --leader-elect - Enable leader election for HA deployments

Environment variables

Key environment variables:
  • WATCH_NAMESPACE - Namespace to watch (empty = all namespaces)
  • OPERATOR_NAMESPACE - Namespace where operator is deployed
  • OPERATOR_VERSION - Operator version (set automatically)

Logging configuration

Configure logging behavior via Helm values:
operator:
  log:
    development: false
    level: info  # debug | info | warn | error | dpanic | panic | fatal
    encoder: json  # json | console
    stacktraceLevel: error

Resource limits

Adjust operator resource requirements:
operator:
  resources:
    limits:
      cpu: "1"
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 128Mi

High availability

For production deployments, enable multiple replicas with leader election:
operator:
  replicas: 3
  args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:8080
    - --leader-elect
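With leader election enabled, only one replica reconciles at a time while the others stand by. To see which pod currently holds leadership, you can list the Lease objects in the operator namespace. This sketch assumes the operator uses Lease-based leader election and keeps its lease in `nim-operator`; neither detail is stated in this guide, and `show_leader` is a hypothetical name.

```shell
# Hypothetical helper: list leases in the operator namespace together with
# the identity of the current holder.
show_leader() {
  kubectl get leases.coordination.k8s.io -n nim-operator \
    -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity
}
```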

Admission controller configuration

The admission controller validates and mutates NIM resources before they’re persisted.
operator:
  admissionController:
    enabled: true
    tls:
      mode: "cert-manager"
      certManager:
        issuerType: "selfsigned"
        issuerName: ""
        dnsNames: []
Disabling the admission controller removes validation and defaulting for NIM resources, which may lead to misconfigurations.

Security context

The operator runs with these security settings by default:
operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
These settings are compatible with the Kubernetes restricted Pod Security Standard and with OpenShift SCCs.

Platform-specific configuration

OpenShift

The operator works out of the box on OpenShift and handles SCCs automatically:
helm install nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  --create-namespace
NIM deployments automatically use appropriate SCCs:
  • nonroot - Default for standard deployments
  • anyuid - When using proxy certificates
  • hostmount-anyuid - When using hostPath volumes

VMware TKGS

For VMware Tanzu Kubernetes Grid Service:
operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
This is the default configuration and works with TKGS pod security policies.

Air-gapped environments

For air-gapped deployments:
  1. Mirror the operator image to your private registry
  2. Point the operator at the mirrored image and configure image pull secrets:
operator:
  image:
    repository: <your-registry>/k8s-nim-operator
    tag: <version>
    pullSecrets:
      - name: private-registry-secret
  3. Ensure all dependency images are also mirrored (cert-manager, etc.)
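The values above reference a pull secret named `private-registry-secret`, which must exist in the operator namespace before the pod can pull its image. A minimal sketch for creating it follows; the function name and its three parameters are hypothetical, while the secret name and the `nim-operator` namespace come from this guide.

```shell
# Hypothetical helper: create the image pull secret referenced in the
# Helm values. The nim-operator namespace must already exist.
# Usage: create_pull_secret REGISTRY USERNAME PASSWORD
create_pull_secret() {
  local registry="$1" username="$2" password="$3"
  kubectl create secret docker-registry private-registry-secret \
    --namespace nim-operator \
    --docker-server="$registry" \
    --docker-username="$username" \
    --docker-password="$password"
}
```

Passing the password as an argument leaves it in shell history, so in practice you may prefer to read it from a file or an environment variable.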

Upgrade

Upgrade with Helm

Upgrade to the latest version:
helm repo update
helm upgrade nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  -f values.yaml

Upgrade with kubectl

For kubectl installations, upgrade CRDs first:
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
Note: For Helm installations, the operator upgrades CRDs automatically when operator.upgradeCRD is true (the default).

Uninstall

Uninstall with Helm

helm uninstall nim-operator -n nim-operator

Uninstall with kubectl

Delete operator resources:
kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml
Delete CRDs (this will delete all NIM resources):
kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
Deleting CRDs also deletes every NIMService, NIMCache, and other NIM custom resource in your cluster. Back up any resources you need before running this command.
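One way to take such a backup is to export each custom resource type to YAML before deleting the CRDs. This is a sketch only: the resource names are taken from the CRD list in the verification section, `backup_nim_resources` and the default output directory are hypothetical, and the exported YAML includes status and server-generated metadata that may need cleaning before it is re-applied.

```shell
# Hypothetical helper: write one YAML file per NIM Operator resource type,
# covering all namespaces, into the given directory.
# Usage: backup_nim_resources [OUTPUT_DIR]   (default: nim-backup)
backup_nim_resources() {
  local outdir="${1:-nim-backup}" kind
  mkdir -p "$outdir"
  for kind in nimbuilds nimcaches nimpipelines nimservices \
      nemocustomizers nemodatastores nemoentitystores \
      nemoevaluators nemoguardrails; do
    kubectl get "${kind}.apps.nvidia.com" --all-namespaces -o yaml \
      > "${outdir}/${kind}.yaml"
  done
}
```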

Delete namespace

kubectl delete namespace nim-operator

Troubleshooting

Operator pod not starting

If the operator pod fails to start:
  1. Check image pull permissions
  2. Verify cert-manager is running (if admission controller is enabled)
  3. Check resource availability
kubectl describe pod -n nim-operator -l control-plane=controller-manager

Webhook certificate issues

If you see webhook-related errors:
kubectl get certificates -n nim-operator
kubectl describe certificate -n nim-operator
kubectl logs -n cert-manager -l app=cert-manager

CRD installation fails

If CRD installation fails, ensure you have cluster-admin privileges:
kubectl auth can-i create customresourcedefinitions

Operator logs show errors

Increase log verbosity for debugging:
operator:
  log:
    level: debug
    development: true
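For a Helm installation, these two values can also be applied without editing a values file. The sketch below reuses the release, chart, and namespace names from the Upgrade section; `enable_debug_logging` is a hypothetical wrapper, and `--reuse-values` is used on the assumption that you want to keep all other settings of the running release unchanged.

```shell
# Hypothetical wrapper: switch the running release to debug logging while
# keeping every other previously supplied value.
enable_debug_logging() {
  helm upgrade nim-operator nvidia/k8s-nim-operator \
    -n nim-operator \
    --reuse-values \
    --set operator.log.level=debug \
    --set operator.log.development=true
}
```

Once you have finished debugging, run the same upgrade with `operator.log.level=info` and `operator.log.development=false` to restore the defaults shown under Logging configuration.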

Next steps

  • Quick start - Deploy your first NIM microservice
  • NIMService guide - Learn about NIMService configuration options
  • NIMCache guide - Understand model caching strategies
  • Production best practices - Configure for production deployments