Installation

This guide covers installing the NVIDIA NIM Operator in your Kubernetes cluster.

Prerequisites

Before installing the operator, ensure your cluster meets these requirements:

Kubernetes version

Kubernetes v1.28 or higher is required.

kubectl version
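For scripted pre-flight checks, the v1.28 minimum can be tested in plain shell. The sketch below is an illustration, not part of kubectl or the operator: the helper name version_at_least is hypothetical, and it compares only the major and minor components of a version string.

```shell
# version_at_least VERSION MINIMUM
# Returns 0 when VERSION (for example "v1.29.3") is at least MINIMUM
# (for example "1.28"). Hypothetical helper: the patch level and any
# vendor suffix such as "-gke.100" are ignored.
version_at_least() {
  have=${1#v}
  want=${2#v}
  have_major=${have%%.*}
  want_major=${want%%.*}
  have_rest=${have#*.}
  want_rest=${want#*.}
  # Keep only the leading digits of the minor component.
  have_minor=${have_rest%%[!0-9]*}
  want_minor=${want_rest%%[!0-9]*}
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
    return $?
  fi
  [ "$have_minor" -ge "$want_minor" ]
}
```

One way to feed it the server version, assuming jq is installed: version_at_least "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" 1.28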

NVIDIA GPU Operator

Install the NVIDIA GPU Operator to provide GPU device plugins and drivers.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace

cert-manager (recommended)

The operator uses cert-manager for admission webhook certificates when the admission controller is enabled (default).

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

Cluster access

Ensure you have cluster-admin privileges to install CRDs and cluster-scoped resources.

Note: If cert-manager is not available, you can disable the admission controller. Keeping it enabled is recommended for production use because it validates resource configurations.

Installation methods

Helm (recommended)
kubectl
Development

Install with Helm

Helm is the recommended installation method because it simplifies upgrades and configuration management.

Add the Helm repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install the operator

Install with default settings:

helm install nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  --create-namespace

Install with custom configuration

Create a values.yaml file with your configuration:

values.yaml

operator:
  replicas: 1
  image:
    repository: ghcr.io/nvidia/k8s-nim-operator
    tag: main
    pullPolicy: Always
    pullSecrets: []
  
  # Operator resource limits
  resources:
    limits:
      cpu: "1"
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 128Mi
  
  # Logging configuration
  log:
    level: info  # debug | info | warn | error
    encoder: json  # json | console
    stacktraceLevel: error
  
  # Admission controller settings
  admissionController:
    enabled: true
    tls:
      mode: "cert-manager"  # cert-manager | secret
      certManager:
        issuerType: "selfsigned"  # selfsigned | clusterissuer | issuer
        issuerName: ""
        dnsNames: []
  
  # Node scheduling
  nodeSelector: {}
  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: In
                values: [""]

# Enable NVIDIA Node Feature Discovery rules
nfd:
  nodeFeatureRules:
    deviceID: false

# Enable Dynamo (optional dependency)
dynamo:
  enabled: false

Install with custom values:

helm install nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  --create-namespace \
  -f values.yaml

Upgrade the operator

helm upgrade nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  -f values.yaml

Install with kubectl

For environments where Helm is not available, you can install the operator with kubectl and the release manifests.

Install CRDs

kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml

Install operator

kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml

This will:

Create the nim-operator namespace
Deploy the operator controller
Configure RBAC permissions
Set up admission webhooks

Note: With a kubectl installation, you must update the CRDs manually during upgrades. With Helm, the CRDs are upgraded automatically when operator.upgradeCRD is true (the default).

Install for development

For development and testing, you can build and deploy from source:

Clone the repository

git clone https://github.com/NVIDIA/k8s-nim-operator.git
cd k8s-nim-operator

Build the operator image

make build \
  IMAGE_NAME=<your-registry>/k8s-nim-operator \
  VERSION=<tag> \
  -f deployments/container/Makefile

Push to registry

docker push <your-registry>/k8s-nim-operator:<tag>

Install CRDs

make install

Deploy operator

make deploy IMG=<your-registry>/k8s-nim-operator:<tag>

Verify installation

Check that the operator is running:

kubectl get pods -n nim-operator

Expected output:

NAME                                               READY   STATUS    RESTARTS   AGE
nim-operator-controller-manager-7d9f8c5b6d-x9k2p   1/1     Running   0          2m
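In automation it can be more convenient to block until the pod is Ready than to read the table by eye. A minimal sketch follows, assuming the operator pod carries the control-plane=controller-manager label used by the log command in this guide; the function name wait_for_operator is hypothetical.

```shell
# wait_for_operator [NAMESPACE] [TIMEOUT]
# Blocks until every pod with the controller-manager label reports Ready,
# and returns non-zero if the timeout expires first.
# Hypothetical helper; defaults match the namespace used in this guide.
wait_for_operator() {
  kubectl wait pod \
    --namespace "${1:-nim-operator}" \
    --selector control-plane=controller-manager \
    --for=condition=Ready \
    --timeout="${2:-120s}"
}
```

Call it as wait_for_operator, or as wait_for_operator nim-operator 5m for a longer timeout. kubectl wait reports an error when no pod matches the selector yet, so run it after the install command has returned.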

Verify the CRDs are installed:

kubectl get crds | grep nvidia.com

Expected output:

nimbuilds.apps.nvidia.com
nimcaches.apps.nvidia.com
nimpipelines.apps.nvidia.com
nimservices.apps.nvidia.com
nemocustomizers.apps.nvidia.com
nemodatastores.apps.nvidia.com
nemoentitystores.apps.nvidia.com
nemoevaluators.apps.nvidia.com
nemoguardrails.apps.nvidia.com
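To check the list mechanically rather than by eye, the comparison can be sketched as a small filter. This is an illustration only: the function name missing_crds is hypothetical, and the expected names are copied from the output above, so the list may need adjusting for your operator version.

```shell
# missing_crds
# Reads installed CRD names from stdin, one per line, and prints each
# expected CRD that is absent. Accepts plain names as well as the
# "customresourcedefinition.apiextensions.k8s.io/<name>" form printed by
# `kubectl get crds -o name`. Prints nothing when all CRDs are present.
missing_crds() {
  # Strip everything up to the last slash so both input forms compare equal.
  installed=$(sed 's|^.*/||')
  for crd in nimbuilds nimcaches nimpipelines nimservices \
             nemocustomizers nemodatastores nemoentitystores \
             nemoevaluators nemoguardrails; do
    printf '%s\n' "$installed" | grep -qxF "${crd}.apps.nvidia.com" \
      || printf '%s\n' "${crd}.apps.nvidia.com"
  done
}
```

Usage: kubectl get crds -o name | missing_crds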

Check operator logs:

kubectl logs -n nim-operator -l control-plane=controller-manager

Configuration options

Operator arguments

The operator accepts these command-line arguments:

--health-probe-bind-address - Address for health probe server (default: :8081)
--metrics-bind-address - Address for metrics server (default: :8080)
--leader-elect - Enable leader election for HA deployments

Environment variables

Key environment variables:

WATCH_NAMESPACE - Namespace to watch (empty = all namespaces)
OPERATOR_NAMESPACE - Namespace where operator is deployed
OPERATOR_VERSION - Operator version (set automatically)

Logging configuration

Configure logging behavior via Helm values:

operator:
  log:
    development: false
    level: info  # debug | info | warn | error | dpanic | panic | fatal
    encoder: json  # json | console
    stacktraceLevel: error

Resource limits

Adjust operator resource requirements:

operator:
  resources:
    limits:
      cpu: "1"
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 128Mi

High availability

For production deployments, run multiple replicas with leader election enabled, so that one replica acts as the leader while the others stand by:

operator:
  replicas: 3
  args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:8080
    - --leader-elect

Admission controller configuration

The admission controller validates and mutates NIM resources before they’re persisted.

operator:
  admissionController:
    enabled: true
    tls:
      mode: "cert-manager"
      certManager:
        issuerType: "selfsigned"
        issuerName: ""
        dnsNames: []

Disabling the admission controller removes validation and defaulting for NIM resources, which may lead to misconfigurations.

Security context

The operator runs with these security settings by default:

operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
  containerSecurityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

These settings are compatible with restricted pod security standards and OpenShift SCCs.

Platform-specific configuration

OpenShift

The operator works out of the box on OpenShift, with SCCs handled automatically:

helm install nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  --create-namespace

NIM deployments automatically use appropriate SCCs:

nonroot - Default for standard deployments
anyuid - When using proxy certificates
hostmount-anyuid - When using hostPath volumes

VMware TKGS

For VMware Tanzu Kubernetes Grid Service:

operator:
  podSecurityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true

This is the default configuration and works with TKGS pod security policies.

Air-gapped environments

For air-gapped deployments:

Mirror the operator image to your private registry
Configure image pull secrets:

operator:
  image:
    repository: <your-registry>/k8s-nim-operator
    tag: <version>
    pullSecrets:
      - name: private-registry-secret

Ensure all dependency images are also mirrored (cert-manager, etc.)
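The mirroring and pull-secret steps above can be sketched in shell as follows. This is a sketch under assumptions, not an official procedure: both function names are hypothetical, docker is assumed as the image tool (as in the development instructions), and the nim-operator namespace must already exist before the secret is created.

```shell
# mirror_image SOURCE TARGET
# Pulls SOURCE, retags it as TARGET, and pushes TARGET to the private
# registry. Hypothetical helper; stops at the first failing step.
mirror_image() {
  docker pull "$1" &&
  docker tag "$1" "$2" &&
  docker push "$2"
}

# create_pull_secret SERVER USERNAME PASSWORD
# Creates the secret named under operator.image.pullSecrets in the
# values above. Hypothetical helper.
create_pull_secret() {
  kubectl create secret docker-registry private-registry-secret \
    --namespace nim-operator \
    --docker-server="$1" \
    --docker-username="$2" \
    --docker-password="$3"
}
```

For example: mirror_image ghcr.io/nvidia/k8s-nim-operator:<version> <your-registry>/k8s-nim-operator:<version>, followed by create_pull_secret <your-registry> <user> <password>. Passing the password as an argument leaves it in shell history, so consider reading it from a protected variable.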

Upgrade

Helm upgrade

Upgrade to the latest version:

helm repo update
helm upgrade nim-operator nvidia/k8s-nim-operator \
  -n nim-operator \
  -f values.yaml

kubectl upgrade

For kubectl installations, upgrade CRDs first:

kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml
kubectl apply -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml

The operator automatically upgrades CRDs when using Helm with operator.upgradeCRD: true (default).

Uninstall

Uninstall with Helm

helm uninstall nim-operator -n nim-operator

Uninstall with kubectl

Delete operator resources:

kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/install.yaml

Delete CRDs (this will delete all NIM resources):

kubectl delete -f https://github.com/NVIDIA/k8s-nim-operator/releases/latest/download/crds.yaml

Deleting CRDs will delete all NIMService, NIMCache, and other NIM resources in your cluster. Ensure you have backups if needed.
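One possible way to take such a backup before deleting the CRDs is to export the custom resources to a file. The sketch below is an assumption-laden example: the function name backup_nim_resources is hypothetical, and it covers only NIMService and NIMCache, so extend the resource list with any other kinds you use.

```shell
# backup_nim_resources [FILE]
# Writes every NIMService and NIMCache from all namespaces to FILE as YAML.
# Hypothetical helper; FILE defaults to nim-backup.yaml.
backup_nim_resources() {
  kubectl get nimservices.apps.nvidia.com,nimcaches.apps.nvidia.com \
    --all-namespaces -o yaml > "${1:-nim-backup.yaml}"
}
```

The exported YAML includes status and server-generated metadata such as resourceVersion and uid, so it may need cleaning before it can be re-applied to a cluster.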

Delete namespace

kubectl delete namespace nim-operator

Troubleshooting

Operator pod not starting

If the operator pod fails to start:

Check image pull permissions
Verify cert-manager is running (if admission controller is enabled)
Check resource availability

kubectl describe pod -n nim-operator -l control-plane=controller-manager

Webhook certificate issues

If you see webhook-related errors:

kubectl get certificates -n nim-operator
kubectl describe certificate -n nim-operator
kubectl logs -n cert-manager -l app=cert-manager

CRD installation fails

If CRD installation fails, ensure you have cluster-admin privileges:

kubectl auth can-i create customresourcedefinitions
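Installing and upgrading CRDs needs more than the create verb, so it can help to check several verbs at once. A minimal sketch, assuming the verb list below is a reasonable approximation of what an install or upgrade uses; the function name check_crd_permissions is hypothetical.

```shell
# check_crd_permissions
# Prints one line for each verb the current user may NOT perform on CRDs
# and returns non-zero if any verb is denied. Relies on `kubectl auth can-i`
# exiting non-zero when the answer is "no". Hypothetical helper.
check_crd_permissions() {
  denied=0
  for verb in create get list update patch delete; do
    if ! kubectl auth can-i "$verb" customresourcedefinitions >/dev/null 2>&1; then
      echo "missing permission: $verb customresourcedefinitions"
      denied=1
    fi
  done
  return "$denied"
}
```

Run check_crd_permissions with the same kubeconfig context you use for the installation; no output and a zero exit status mean every listed verb is allowed.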

Operator logs show errors

Increase log verbosity for debugging:

operator:
  log:
    level: debug
    development: true
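For a Helm installation, one way to apply these values without editing values.yaml is to override them on an upgrade. The sketch below assumes the release name, chart, and namespace used earlier in this guide; the function name set_operator_log_level is hypothetical.

```shell
# set_operator_log_level LEVEL [DEVELOPMENT]
# Re-applies the values of the current release and overrides only the
# logging keys. DEVELOPMENT defaults to false. Hypothetical helper.
set_operator_log_level() {
  helm upgrade nim-operator nvidia/k8s-nim-operator \
    --namespace nim-operator \
    --reuse-values \
    --set operator.log.level="$1" \
    --set operator.log.development="${2:-false}"
}
```

For example, set_operator_log_level debug true while investigating, then set_operator_log_level info afterwards. Without a --version flag, helm upgrade uses the newest chart in the local repository cache, so pin --version to the installed chart version if only the log level should change.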

Next steps

Quick start - Deploy your first NIM microservice
NIMService guide - Learn about NIMService configuration options
NIMCache guide - Understand model caching strategies
Production best practices - Configure for production deployments
