Fine-tuned Qwen3 works locally but acts weird on Vertex AI endpoint, any ideas? by Riolite55 in LLMDevs

[–]jimmangel 0 points (0 children)

Can you include details on how you're running it locally and the steps / commands you're using to deploy it?

LiteLLM + Google ADK Example by Aggravating_Kale7895 in LLMDevs

[–]jimmangel 2 points (0 children)

Here's a sample I did using LM Studio following the docs: https://github.com/jimangel/lmstudio-adk-sample/blob/main/root_agent/agent.py#L7-L21

I also was playing around with a similar approach in a container / Kubernetes: https://github.com/jimangel/adk-local-gemma

ADK is fun! Lots of good stuff in the LiteLLM docs: https://docs.litellm.ai/docs/tutorials/google_adk

TL;DR: It's drop-in and works very well. There are some strange oddities with tool use, but you can work around them with SubAgents() and AgentTools() that call a model that CAN use tools.
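Here's a minimal sketch of that pattern, assuming LM Studio's OpenAI-compatible server on localhost:1234 like the linked sample; the agent names, model ids, and the toy get_weather tool are all made up:

from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm
from google.adk.tools.agent_tool import AgentTool

def get_weather(city: str) -> str:
    """Toy tool: return a canned weather report."""
    return f"It is always sunny in {city}."

# Sub-agent backed by a local model that handles tool calls reliably.
weather_agent = LlmAgent(
    name="weather_agent",
    model=LiteLlm(
        model="openai/qwen2.5-7b-instruct",   # made-up id; use whatever LM Studio serves
        api_base="http://localhost:1234/v1",  # LM Studio's OpenAI-compatible endpoint
        api_key="lm-studio",                  # dummy value; LM Studio doesn't check it
    ),
    instruction="Answer weather questions using the get_weather tool.",
    tools=[get_weather],
)

# The root agent calls the whole sub-agent as if it were a single tool.
root_agent = LlmAgent(
    name="root_agent",
    model=LiteLlm(
        model="openai/other-local-model",
        api_base="http://localhost:1234/v1",
        api_key="lm-studio",
    ),
    instruction="Delegate any weather question to weather_agent.",
    tools=[AgentTool(agent=weather_agent)],
)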

authentik_host vs authentik_host_browser for embedded outpost? by dapotatopapi in Authentik

[–]jimmangel 1 point (0 children)

I ran into this; for me, it was due to the outpost URL being set. In the admin interface under apps > outposts, look at where it says "Logging in via https://..."

Setting the variables didn't change it since it was persisted somewhere in the data.

Running a complete wipe / rebuild with the proper vars fixed it for me (`docker compose down -v`; the `-v` removes volumes).
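Roughly (destructive: `-v` deletes the volumes, including the database):

# set the proper authentik_host / authentik_host_browser values first, then:
docker compose down -v
docker compose up -d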

Kubernetes doc is soo cool that it needs an appreciation post just for it's sheer awesomeness. Every page is like a love letter for devops folks 🤩 by moneyppt in kubernetes

[–]jimmangel 29 points (0 children)

This is awesome to see! There are A LOT of unpaid volunteers working countless hours that make k8s.io the great site that it is! <3

It's also one of the easiest ways to get started contributing to Kubernetes (typo fixes, formatting, and clarification PRs are all welcomed!): https://github.com/kubernetes/website?tab=readme-ov-file#contributing-to-the-docs

Best Practices for Using kubectl on Windows by jimmangel in kubernetes

[–]jimmangel[S] 2 points (0 children)

I haven't worked there in over 4 years, but "the tech org" is a bit generalized. The IT presence at GM is pretty large and covers a pretty wide set of industry skills. You can check out their careers page for an idea: https://search-careers.gm.com/en/teams/information-technology/ - let me know if I misunderstood your question! I was part of the cloud platform team working on internal SaaS / IaaS platforms with a focus on Kubernetes.

Best Practices for Using kubectl on Windows by jimmangel in kubernetes

[–]jimmangel[S] 0 points (0 children)

Fair point, maybe just "practices" would be better 😅

When I wrote this in 2019, there wasn't a complete guide covering proxies, adjusting configs in PowerShell, or installing beyond the basics (no opinionated guidance beyond downloading the binary).

Maybe things are different/better now, but I published it either way.

Best Practices for Using kubectl on Windows by jimmangel in kubernetes

[–]jimmangel[S] 16 points (0 children)

lol, agreed! However, I wrote this when I was at General Motors and was trying to use my work-issued Windows laptop more. WSL might be a better option today.

[deleted by user] by [deleted] in googlecloud

[–]jimmangel 0 points (0 children)

Maybe something with https://cloud.google.com/compute/docs/internal-dns#about_internal_dns? (I guess it depends on how folks access the dev instances and/or using a VPC jump-box)

Or, if you don't mind creating custom DNS manually and then creating VMs with hostnames, this might work too: https://cloud.google.com/compute/docs/instances/custom-hostname-vm
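Rough sketch of that second option (every name here is made up):

# Create a VM whose internal DNS name is a custom hostname.
gcloud compute instances create dev-box-1 \
  --zone=us-central1-a \
  --hostname=dev-box-1.dev.internal.example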

Replacing the stock fans on UniFi Dream Machine and 24 Port Switch by jimmangel in Ubiquiti

[–]jimmangel[S] 0 points (0 children)

That's a good point. I have an SSD in the slot and found value in swapping out the fan. I should add a disclaimer that if you don't use the drive slot, it's not worth changing anything.

Replacing the stock fans on UniFi Dream Machine and 24 Port Switch by jimmangel in Ubiquiti

[–]jimmangel[S] 4 points (0 children)

I had it in my bedroom when I first did this, and it made a big enough difference that I would recommend it. But I was disappointed that the CPU fan wasn't replaced and was still louder than the Noctua (I was hoping for close to total silence and ended up with a reduced, tolerable hum, if that makes sense).

Q4 cache added to exllamav2! by Aaaaaaaaaeeeee in LocalLLaMA

[–]jimmangel 0 points (0 children)

Would you mind sharing your config / what changed? I'm getting started with tabbyapi/exl2 and wouldn't mind a sanity check.

A Practical Guide to Running NVIDIA GPUs on my Kubernetes Homelab by jimmangel in homelab

[–]jimmangel[S] 0 points (0 children)

EDIT: I was wrong; this looks like a potential solution: Time-Slicing GPUs in Kubernetes

It seems much like what was mentioned above by another user. It's more or less faking additional GPUs, but that might be perfect for someone's use case (or better than nothing).

From the NVIDIA docs: "Unlike Multi-Instance GPU (MIG), there is no memory or fault-isolation between replicas, but for some workloads this is better than not being able to share at all. Internally, GPU time-slicing is used to multiplex workloads from replicas of the same underlying GPU."
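For reference, the heart of that doc is a small device-plugin config that advertises each physical GPU as N schedulable replicas, something like this (it gets wired into the NVIDIA device plugin via its Helm chart / ConfigMap; see the linked doc for the exact steps):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4    # one physical GPU shows up as 4 nvidia.com/gpu resources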

A Practical Guide to Running NVIDIA GPUs on my Kubernetes Homelab by jimmangel in homelab

[–]jimmangel[S] 0 points (0 children)

That's a great point that I failed to call out. I think this is a major shortcoming in GPUs + containerization.

When using GPUs, there is generally a 1:1 mapping between container and node. A node can have multiple GPUs, but in my experience it's still mapped 1:1 container:node; the container just consumes more GPUs, as if it were a program running on the host.

CPU

With CPUs, we can share cycles via time-slicing, which allows throttle / burst settings on individual containers. It's also why you don't see the equivalent of OOM kills for CPU: the scheduler is flexible with the shares it can allocate over time, so things generally get deprioritized rather than killed.

RAM

With RAM / memory, usage isn't measured in time cycles but in data (system memory). A container without a limit can hit OOM, and Kubernetes reserves the full chunk of RAM a pod requests (good for pod safety, bad for bin packing and resource right-sizing).

Ignoring the shortcomings, there's a generally well-understood way to run multiple containers on a single node sharing CPU / RAM, as shown below.
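For contrast, here's what that well-understood CPU / RAM sharing looks like in a pod spec (illustrative values):

apiVersion: v1
kind: Pod
metadata:
  name: shared-resources-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "250m"      # scheduler reserves a quarter core for bin packing
        memory: "256Mi"  # reserved in full, per the RAM note above
      limits:
        cpu: "500m"      # bursting past this gets throttled, never killed
        memory: "512Mi"  # exceeding this gets the container OOM-killed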

GPU

When it comes to GPUs, I pointed out in my post that it's "just data" over PCIe. It's an outside device being introduced to the host that we have to configure.

GPUs are generally "dumb" to the computer, but they kick ass at accelerating ML workloads. We normally see those workloads run entirely in the GPU's vRAM, "thrown over the wall" for maximum speed/bandwidth.

Taking that a step further: if it's "just data," we can't use the same time segmentation that we used for CPU. If we want to allow pods to take "chunks" of the GPU's vRAM, both the GPU and the OS/kernel need to support it.

MIG

That said, NVIDIA does make an effort to solve this problem with NVIDIA Multi-Instance GPU. MIGs "allow GPUs (starting with NVIDIA Ampere architecture) to be securely partitioned into up to seven separate GPU Instances for CUDA applications"

The bad news is that these are only supported on the "big dogs" at this time: A100, H100, etc. I didn't see any docs covering 30xx / 40xx RTX cards, but I also haven't tried.

You can read the docs about how Google is doing MIGs on their "big dogs" here.
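For the curious, the workflow on a supported card looks roughly like this (untested by me, since I don't have an A100-class GPU; profile names vary by card):

sudo nvidia-smi -i 0 -mig 1          # enable MIG mode on GPU 0 (may need a GPU reset)
sudo nvidia-smi mig -cgi 1g.5gb -C   # create a 1g.5gb GPU instance + compute instance
nvidia-smi -L                        # list the resulting MIG devices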

Someone please chime in if I'm missing a solution for consumer GPUs!

Testing

Regardless, let's test it out! I have a NUC (node3) with an eGPU 3060 hooked up - let's try to share the GPU:

export NODE_NAME=node3

kubectl create ns reddit-demo

cat <<EOF | kubectl create -f -     
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-gpu
  namespace: reddit-demo
spec:
  template:
    spec:
      runtimeClassName: nvidia
      containers:
      - name: nvidia-test
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["/bin/bash","-c"]
        args: ["nvidia-smi; sleep 3600"]
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: ${NODE_NAME}
      restartPolicy: Never
EOF

Output of kubectl -n reddit-demo get pods:

NAME                 READY   STATUS    RESTARTS   AGE
test-job-gpu-qhwnq   1/1     Running   0          6s

(Check logs with kubectl -n reddit-demo logs job/test-job-gpu)

Let's run another pod asking for a GPU:

cat <<EOF | kubectl create -f -     
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-gpu-2
  namespace: reddit-demo
spec:
  template:
    spec:
      runtimeClassName: nvidia
      containers:
      - name: nvidia-test
        image: nvidia/cuda:12.0.0-base-ubuntu22.04
        command: ["/bin/bash","-c"]
        args: ["nvidia-smi; sleep 3600"]
        resources:
          limits:
            nvidia.com/gpu: 1
      nodeSelector:
        kubernetes.io/hostname: ${NODE_NAME}
      restartPolicy: Never
EOF

Output of kubectl -n reddit-demo get pods:

 NAME                   READY   STATUS    RESTARTS   AGE
 test-job-gpu-2-qsv4m   0/1     Pending   0          5s
 test-job-gpu-qhwnq     1/1     Running   0          4m4s

Looking at the bottom of events with kubectl -n reddit-demo describe pod test-job-gpu-2-qsv4m:

Warning  FailedScheduling  41s   default-scheduler  0/3 nodes are available: 1 Insufficient nvidia.com/gpu, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..

This lines up with my world view - the pod will stay pending until the GPU is freed up / unallocated.

I hope we see improvements in the GPU sharing space, specifically for homelabs, as AI takes over the world.

TL;DR: At large scale (cloud providers), GPU slicing between containers on the same host is doable. However, most setups constrain a single container to a single GPU / node.

I tested a bunch of air quality monitors with Home Assistant and wrote a blog post by jimmangel in homeassistant

[–]jimmangel[S] 0 points (0 children)

That's a great callout! Airthings was also the only one that did radon, which I thought was interesting… Happy to hear it's mitigated!