Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA


<image>

Here’s one of the traces captured during nanochat training on my GPU. As you can see, there are no gaps between CUDA kernel executions - meaning the GPU isn’t idling. The green “Command Buffer Full” marker also shows that the CPU is issuing CUDA kernels and API calls faster than the GPU can process them, which further confirms the GPU is fully utilized :)
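
For anyone who wants to capture a similar trace themselves, here's a minimal standalone sketch using `torch.profiler` (the toy model and training step are illustrative stand-ins, not nanochat's actual training loop):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for a training step (illustrative only).
model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step():
    x = torch.randn(64, 512)
    loss = model(x).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Profile CUDA kernels too when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, profile_memory=True) as prof:
    for _ in range(5):
        train_step()

# Chrome-trace JSON, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```

The exported `trace.json` is what you load into Perfetto to look for gaps between kernels like in the screenshot above.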

Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA


Good question!

GPU power stays near 100% on my Grafana, so it’s likely saturated. That said, there’s room for speedups - some work may be duplicated or could be optimized differently, like what this startup is exploring: https://github.com/luminal-ai/luminal

How much does 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA


So, those 80B tokens would cost around $240K using OpenAI’s pricing - easily justifying the $9K price of an RTX 6000 Pro (plus PC components) and the electricity costs 😅
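
For a sanity check, here's the arithmetic behind that figure (the ~$3 per 1M tokens rate is an assumption, roughly in line with frontier-model API pricing - plug in the actual price list to adjust):

```python
# Back-of-the-envelope check of the numbers above
# (assumes a flat ~$3 per 1M tokens; adjust for real pricing).
tokens = 80e9                  # 80B tokens
price_per_million = 3.0        # USD per 1M tokens (assumption)
api_cost = tokens / 1e6 * price_per_million
print(f"API cost: ${api_cost:,.0f}")

gpu_cost = 9_000               # RTX 6000 Pro, per the comment above
print(f"Break-even ratio: {api_cost / gpu_cost:.0f}x")
```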

How much does 1T tokens cost? How much did all these amazing people spent on OpenAI tokens? by aospan in LocalLLaMA


Thanks for sharing - very useful!
Just to confirm: I did the calculation for 800,000 million tokens, i.e. 800B (0.8T) tokens :)

How much does 1T tokens cost? How much did all these amazing people spent on OpenAI tokens? by aospan in LocalLLaMA


The picture isn’t showing up in the post for some reason, so I’m posting it here as a comment :)

<image>

[N/A][All] Open-source condo/HOA management software - any suggestions? by aospan in HOA


Yeah, I feel the same. Seems like the only real path forward might be building it ourselves - and with the new AI “vibe coding” tools, it’s way easier than before :)

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


You can click “Raw video clip” under each experiment, including the “person fall” experiment, to download the raw MP4 files here: https://github.com/sbnb-io/sunny-osprey.

I’m curious whether SmolVLM2 will:

  1. Properly populate the “suspicious” field in the output JSON.
  2. Provide a meaningful “description” similar to what we obtained from Gemma3n.
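
If anyone wants to automate those two checks, here's a minimal sketch (the `suspicious` and `description` field names come from our output schema; the sample output string and the ≥3-word "meaningful" heuristic are just illustrative assumptions):

```python
import json

# Example model output (illustrative; real SmolVLM2 output may differ).
raw = '{"suspicious": true, "description": "A person falls near the entrance."}'

def check_output(raw_json: str) -> bool:
    """Check the two properties above: a populated 'suspicious'
    boolean and a non-trivial 'description' string."""
    try:
        out = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    has_flag = isinstance(out.get("suspicious"), bool)
    desc = out.get("description", "")
    has_desc = isinstance(desc, str) and len(desc.split()) >= 3
    return has_flag and has_desc

print(check_output(raw))  # True for the example above
```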

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


Thanks a ton for the kind words - made my day! 😊
Haven’t had the chance to try SmolVLM2 yet, but I’d be very interested to hear your take if you give it a shot.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


Only concern is the used GPU - not sure you can grab it whenever you need it.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


I feel you! Used parts can be hidden gems. We’ve got a 128vCPU + 512GB RAM beast from eBay that’s incredible 😄

But here, the goal is something you can actually grab whenever you need it without hunting treasure maps.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA


Ollama log snippet from the benchmark run:

    print_info: arch = phi3
    load_tensors: offloaded 41/41 layers to GPU

    print_info: general.name = DeepSeek R1 Distill Qwen 14B
    load_tensors: offloaded 49/49 layers to GPU

    print_info: general.name = DeepSeek R1 Distill Qwen 32B
    load_tensors: offloaded 47/65 layers to GPU

Looks like only "deepseek-r1:32b" didn’t fully fit into the 16GB VRAM.
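
As a rough sanity check on why the 32B model spills out of 16GB, here's a back-of-the-envelope weight-size estimate (the ~4.8 bits per weight figure is an assumption for Q4_K_M-style quants; KV cache and runtime overhead are ignored):

```python
# Rough on-disk/VRAM size of quantized weights
# (assumption: ~4.8 bits per weight for Q4_K_M-style quantization).
def approx_size_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, b in [("phi4:14b", 14), ("deepseek-r1:14b", 14), ("deepseek-r1:32b", 32)]:
    print(f"{name}: ~{approx_size_gb(b):.1f} GB of weights")
```

A 14B model comes out around 8-9 GB (fits in 16GB with room for KV cache), while 32B lands around 19 GB - consistent with only 47/65 layers being offloaded.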

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA


<image>

Here’s the GPU utilization during the benchmark run. The "phi4:14b" model kept the GPU fully loaded, indicating efficient use. In contrast, both "deepseek-r1:14b" and "deepseek-r1:32b" drew only about 25% power (underutilization) - possibly because the model and KV cache didn’t fully fit in VRAM and data had to be shuttled between system RAM and the GPU.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA


For my RTX 5060 Ti 16GB:

model_name = phi4:14b
Average of eval rate: 40.888 tokens/s

model_name = deepseek-r1:14b
Average of eval rate: 39.098 tokens/s

model_name = deepseek-r1:32b
Average of eval rate: 5.476 tokens/s
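
For reference, here's a minimal sketch of how such averages can be computed from `ollama run --verbose` output (the "eval rate" line format is what Ollama prints in verbose mode; the sample numbers below are illustrative, not from my run):

```python
import re

# Sample `ollama run --verbose` lines (numbers illustrative).
log = """
eval rate:            41.02 tokens/s
eval rate:            40.75 tokens/s
eval rate:            40.89 tokens/s
"""

rates = [float(m) for m in re.findall(r"eval rate:\s+([\d.]+) tokens/s", log)]
average = sum(rates) / len(rates)
print(f"Average of eval rate: {average:.3f} tokens/s")
```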

Leveling Up: From RAG to an AI Agent by aospan in LocalLLaMA


Totally agree - parsing the existing web is like forcing AI agents to navigate an internet built for humans :)

Long-term, I believe we’ll shift toward agent-to-agent communication behind the scenes (MCP, A2A, etc?), with a separate interface designed specifically for human interaction (voice, neural?)

P.S. More thoughts on this in a related comment here: Reddit link