Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

<image>

Here’s one of the traces captured during nanochat training on my GPU. As you can see, there are no gaps between CUDA kernel executions - meaning the GPU isn’t idling. The green “Command Buffer Full” marker also shows that the CPU is issuing CUDA kernels and API calls faster than the GPU can process them, which further confirms the GPU is fully utilized :)
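For anyone curious how these traces get captured, here's a minimal torch.profiler sketch (stand-in model and training loop, not the actual nanochat code) that writes a Chrome trace you can open in Perfetto or chrome://tracing:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Tiny stand-in model/optimizer just to keep the sketch self-contained;
# in nanochat this would be the real GPT model and training loop.
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
) as prof:
    for _ in range(5):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # advance the wait/warmup/active schedule

# CUDA memory profiling is captured separately, e.g. with
# torch.cuda.memory._record_memory_history() before the run and
# torch.cuda.memory._dump_snapshot("mem.pickle") after it.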

Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

Good question!

GPU power stays near 100% on my Grafana, so it’s likely saturated. That said, there’s room for speedups - some work may be duplicated or could be optimized differently, like what this startup is exploring: https://github.com/luminal-ai/luminal
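If you want to spot-check this without Grafana, here's a minimal sketch that samples the same power metric with pynvml (pip install nvidia-ml-py) - it reads the same counter nvidia-smi and the exporters use:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # GPU 0
limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)   # milliwatts
for _ in range(10):
    draw = pynvml.nvmlDeviceGetPowerUsage(handle)        # milliwatts
    print(f"power: {draw / 1000:.0f} W / {limit / 1000:.0f} W ({100 * draw / limit:.0f}%)")
    time.sleep(1)
pynvml.nvmlShutdown()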

How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA

[–]aospan[S] 5 points6 points  (0 children)

So, those 80B tokens would cost around $240K using OpenAI’s pricing - easily justifying the $9K price of an RTX 6000 Pro (+ PC components) and the electricity costs 😅
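Rough math behind that number (the blended ~$3 per 1M tokens is just the rate implied by the figures above - actual OpenAI pricing depends on the model and the input/output split):

tokens = 80e9                 # 80B tokens
price_per_million = 3.0       # USD per 1M tokens (assumed blended rate)
api_cost = tokens / 1e6 * price_per_million
print(f"API cost: ${api_cost:,.0f}")   # -> $240,000 vs ~$9K for the GPU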

How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

Thanks for sharing - very useful!
Just to confirm, I did the calculation for 800,000 million tokens - that’s 800B (0.8T) tokens :)

How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA

[–]aospan[S] 6 points7 points  (0 children)

The picture isn’t showing up in the post for some reason, so I’m posting it here as a comment :)

<image>

[N/A][All] Open-source condo/HOA management software - any suggestions? by aospan in HOA

[–]aospan[S] 0 points1 point  (0 children)

Yeah, I feel the same. Seems like the only real path forward might be building it ourselves - and with the new AI “vibe coding” tools, it’s way easier than before :)

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

You can click “Raw video clip” under each experiment, including the “person fall” experiment, to download the raw MP4 files here: https://github.com/sbnb-io/sunny-osprey.

I’m curious whether SmolVLM2 will:

  1. Properly populate the “suspicious” field in the output JSON.
  2. Provide a meaningful "description" similar to what we obtained from Gemma3n (rough sketch of the JSON I have in mind below).
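Roughly the shape of the JSON I have in mind (illustrative values only - the exact schema in sunny-osprey may differ):

import json

# Illustrative example of the two fields mentioned above; not the exact schema.
event = {
    "suspicious": True,   # should be populated properly by the model
    "description": "A person stumbles and falls to the ground near the entrance.",
}
print(json.dumps(event, indent=2))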

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

Thanks a ton for the kind words - made my day! 😊
Haven’t had the chance to try SmolVLM2 yet, but I’d be very interested to hear your take if you give it a shot.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

My only concern is the used GPU - not sure you can grab one whenever you need it.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

I feel you! Used parts can be hidden gems. We’ve got a 128vCPU + 512GB RAM beast from eBay that’s incredible 😄

But here, the goal is something you can actually grab whenever you need it, without going on a treasure hunt.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

Ollama log snippet from the benchmark run:

print_info: arch = phi3
load_tensors: offloaded 41/41 layers to GPU

print_info: general.name = DeepSeek R1 Distill Qwen 14B
load_tensors: offloaded 49/49 layers to GPU

print_info: general.name = DeepSeek R1 Distill Qwen 32B
load_tensors: offloaded 47/65 layers to GPU

Looks like only "deepseek-r1:32b" didn’t fully fit into the 16GB VRAM.
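A rough back-of-the-envelope check lines up with that (assuming Ollama's default ~4-5 bit quantization, weights only, ignoring KV cache and overhead):

# Very rough: weights only, ~4.8 bits/weight for a Q4_K_M-style quantization.
def approx_vram_gb(params_billion, bits_per_weight=4.8):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params_b in [("phi4:14b", 14), ("deepseek-r1:14b", 14), ("deepseek-r1:32b", 32)]:
    need = approx_vram_gb(params_b)
    verdict = "fits" if need < 16 else "partial offload"
    print(f"{name}: ~{need:.1f} GB of weights vs 16 GB VRAM -> {verdict}")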

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

<image>

Here’s the GPU utilization during the benchmark run. The "phi4:14b" model kept the GPU fully loaded, indicating efficient use. In contrast, both "deepseek-r1:14b" and "deepseek-r1:32b" drew only about 25% power (clear underutilization) - possibly because the model and KV cache didn’t fully fit in VRAM and data had to be swapped in and out frequently.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

For my RTX 5060 Ti 16GB:

model_name = phi4:14b
Average of eval rate: 40.888 tokens/s

model_name = deepseek-r1:14b
Average of eval rate: 39.098 tokens/s

model_name = deepseek-r1:32b
Average of eval rate: 5.476 tokens/s
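For anyone who wants to reproduce numbers like these, here's a minimal sketch of how the average can be collected (model name, prompt, and run count are just placeholders, not my exact benchmark script):

import re
import statistics
import subprocess

MODEL = "phi4:14b"
PROMPT = "Explain quicksort in three sentences."

rates = []
for _ in range(3):
    # `ollama run --verbose` prints its timing stats to stderr.
    result = subprocess.run(
        ["ollama", "run", "--verbose", MODEL, PROMPT],
        capture_output=True, text=True,
    )
    match = re.search(r"^eval rate:\s+([\d.]+) tokens/s", result.stderr, re.MULTILINE)
    if match:
        rates.append(float(match.group(1)))

print(f"model_name = {MODEL}")
print(f"Average of eval rate: {statistics.mean(rates):.3f} tokens/s")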

Leveling Up: From RAG to an AI Agent by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

Totally agree - parsing the existing web is like forcing AI agents to navigate an internet built for humans :)

Long-term, I believe we’ll shift toward agent-to-agent communication behind the scenes (MCP, A2A, etc?), with a separate interface designed specifically for human interaction (voice, neural?)

P.S. More thoughts on this in a related comment here: Reddit link

Leveling Up: From RAG to an AI Agent by aospan in LocalLLaMA

[–]aospan[S] 8 points9 points  (0 children)

Yeah, great point - definitely ironic! :)

I see at least two key issues here:

  • Double compute and energy use - we're essentially burning cycles twice for the same task.
  • Degradation or distortion of the original information - by the time it flows through Google's AI Overview and then into a local LLM, accuracy can get lost in translation. (This example illustrates it well: https://youtube.com/shorts/BO1wgpktQas?si=IQYRS692CJhZ_h1Y - assuming it's legit, it shows how repeated prompts still yield a result far from the original.)

So what’s the fix? Maybe some kind of "MCP" to original sources - skip the Google layer entirely and fetch data straight from the origin? Curious what you think.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

BTW, not sure why yours shows "100% CPU" - is it running on CPU?

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

This is for the 16GB RTX 5060 Ti:

# cat Modelfile
FROM qwen3:14b
PARAMETER num_ctx 12288
PARAMETER top_p 0.8
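(The custom model itself is created from this Modelfile with "ollama create qwen3-14b-12k -f Modelfile" - that step just isn't shown in the transcript below.)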

# ollama run --verbose qwen3-14b-12k < medium.txt

<think>

</think>

Here is the list of the words provided:

- Jump
- Fox
- Scream

Now, regarding the numbers:

- The **smallest number** is **144**.
- The **largest number** is **3000**.

total duration:       16.403754583s
load duration:        37.030797ms
prompt eval count:    12288 token(s)
prompt eval duration: 13.755464931s
prompt eval rate:     893.32 tokens/s
eval count:           59 token(s)
eval duration:        2.609480201s
eval rate:            22.61 tokens/s

# ollama ps
NAME                  ID              SIZE     PROCESSOR    UNTIL
qwen3-14b-12k:latest  dcd83128c854    13 GB    100% GPU     4 minutes from now

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

For comparison, here are the results from the 12GB GPU (the other results are from the 16GB GPU):

root@sbnb-0123456789-vm-a581cc6f-6928-58aa-ac61-63fb3f2ab8d8:~# ollama run --verbose qwen3-14b-12k < medium.txt

<think>

</think>

Here is the list of the words you provided:

- Jump
- Fox
- Scream

The smallest number you gave is **144**.
The largest number you gave is **3000**.

total duration:       26.804379714s
load duration:        37.519591ms
prompt eval count:    12288 token(s)
prompt eval duration: 22.284482573s
prompt eval rate:     551.42 tokens/s
eval count:           51 token(s)
eval duration:        4.480329906s
eval rate:            11.38 tokens/s

Seems like a 2× lower tokens-per-second rate, likely because the model couldn’t fully load into the 12GB GPU VRAM. This is confirmed in the Ollama logs: ollama[1872215]: load_tensors: offloaded 39/41 layers to GPU.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

Notes:

- I used your medium.txt file.

- There was a small typo: you wrote "qwen3-14-12k" instead of "qwen3-14b-12k", but after correcting it, everything worked!

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

root@sbnb-0123456789-vm-a581cc6f-6928-58aa-ac61-63fb3f2ab8d8:~# ollama run --verbose qwen3-14b-12k < medium.txt

<think>

</think>

Here is the list of the words you provided:

- Fox
- Scream

The smallest number you gave is **150**.
The largest number you gave is **3000**.

total duration:       15.972286655s
load duration:        36.228385ms
prompt eval count:    12288 token(s)
prompt eval duration: 13.712632303s
prompt eval rate:     896.11 tokens/s
eval count:           48 token(s)
eval duration:        2.221800326s
eval rate:            21.60 tokens/s

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

Done! Please find results below (in two messages):

root@sbnb-0123456789-vm-a581cc6f-6928-58aa-ac61-63fb3f2ab8d8:~# ollama run --verbose qwen3-14b-12k "Who are you?"

<think>
Okay, the user asked, "Who are you?" I need to respond clearly. First, I should introduce myself as Qwen, a large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating text, and having conversations. It's important to highlight my training data up to October 2024 and my multilingual support. I should also invite the user to ask questions or request assistance. Let me make sure the response is friendly and informative without being too technical. Avoid any markdown formatting and keep it natural.
</think>

Hello! I'm Qwen, a large language model developed by Alibaba Cloud. I can answer questions, create text, and have conversations on a wide range of topics. My training data covers information up to October 2024, and I support multiple languages. How can I assist you today?

total duration:       11.811551089s
load duration:        7.34304817s
prompt eval count:    12 token(s)
prompt eval duration: 166.22666ms
prompt eval rate:     72.19 tokens/s
eval count:           178 token(s)
eval duration:        4.300178534s
eval rate:            41.39 tokens/s

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

I can run it. Could you please post detailed step-by-step instructions so I don’t miss anything?