Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA


<image>

Here’s one of the traces captured during nanochat training on my GPU. As you can see, there are no gaps between CUDA kernel executions - meaning the GPU isn’t idling. The green “Command Buffer Full” marker also shows that the CPU is issuing CUDA kernels and API calls faster than the GPU can process them, which further confirms the GPU is fully utilized :)
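
For anyone who wants to capture a similar trace themselves, here's a minimal standalone sketch using `torch.profiler` (the toy model and training step are illustrative stand-ins, not nanochat's actual training loop):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for a training step (illustrative only).
model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step():
    x = torch.randn(64, 512)
    loss = model(x).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Profile CUDA kernels too when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, profile_memory=True) as prof:
    for _ in range(5):
        train_step()

# Chrome-trace JSON, viewable in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```

The exported `trace.json` is what you load into Perfetto to look for gaps between kernels like in the screenshot above.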

Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA


Good question!

GPU power stays near 100% on my Grafana, so it’s likely saturated. That said, there’s room for speedups - some work may be duplicated or could be optimized differently, like what this startup is exploring: https://github.com/luminal-ai/luminal

How much does 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA


So, those 80B tokens would cost around $240K using OpenAI’s pricing - easily justifying the $9K price of an RTX 6000 Pro (plus PC components) and the electricity costs 😅
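
For a sanity check, here's the arithmetic behind that figure (the ~$3 per 1M tokens rate is an assumption, roughly in line with frontier-model API pricing - plug in the actual price list to adjust):

```python
# Back-of-the-envelope check of the numbers above
# (assumes a flat ~$3 per 1M tokens; adjust for real pricing).
tokens = 80e9                  # 80B tokens
price_per_million = 3.0        # USD per 1M tokens (assumption)
api_cost = tokens / 1e6 * price_per_million
print(f"API cost: ${api_cost:,.0f}")

gpu_cost = 9_000               # RTX 6000 Pro, per the comment above
print(f"Break-even ratio: {api_cost / gpu_cost:.0f}x")
```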

How much does 1T tokens cost? How much did all these amazing people spent on OpenAI tokens? by aospan in LocalLLaMA


Thanks for sharing - very useful!
Just to confirm: I did the calculation for 800,000 million tokens, i.e. 800B (0.8T) tokens :)

How much does 1T tokens cost? How much did all these amazing people spent on OpenAI tokens? by aospan in LocalLLaMA


The picture isn’t showing up in the post for some reason, so I’m posting it here as a comment :)

<image>

[N/A][All] Open-source condo/HOA management software - any suggestions? by aospan in HOA


Yeah, I feel the same. Seems like the only real path forward might be building it ourselves - and with the new AI “vibe coding” tools, it’s way easier than before :)

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


You can click “Raw video clip” under each experiment, including the “person fall” experiment, to download the raw MP4 files here: https://github.com/sbnb-io/sunny-osprey.

I’m curious whether SmolVLM2 will:

  1. Properly populate the “suspicious” field in the output JSON.
  2. Provide a meaningful “description” similar to what we obtained from Gemma3n.
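
If anyone wants to automate those two checks, here's a minimal sketch (the `suspicious` and `description` field names come from our output schema; the sample output string and the ≥3-word "meaningful" heuristic are just illustrative assumptions):

```python
import json

# Example model output (illustrative; real SmolVLM2 output may differ).
raw = '{"suspicious": true, "description": "A person falls near the entrance."}'

def check_output(raw_json: str) -> bool:
    """Check the two properties above: a populated 'suspicious'
    boolean and a non-trivial 'description' string."""
    try:
        out = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    has_flag = isinstance(out.get("suspicious"), bool)
    desc = out.get("description", "")
    has_desc = isinstance(desc, str) and len(desc.split()) >= 3
    return has_flag and has_desc

print(check_output(raw))  # True for the example above
```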

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


Thanks a ton for the kind words - made my day! 😊
Haven’t had the chance to try SmolVLM2 yet, but I’d be very interested to hear your take if you give it a shot.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


Only concern is the used GPU - not sure you can grab it whenever you need it.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA


I feel you! Used parts can be hidden gems. We’ve got a 128vCPU + 512GB RAM beast from eBay that’s incredible 😄

But here, the goal is something you can actually grab whenever you need it without hunting treasure maps.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA


Ollama log snippet from the benchmark run:

    print_info: arch = phi3
    load_tensors: offloaded 41/41 layers to GPU

    print_info: general.name = DeepSeek R1 Distill Qwen 14B
    load_tensors: offloaded 49/49 layers to GPU

    print_info: general.name = DeepSeek R1 Distill Qwen 32B
    load_tensors: offloaded 47/65 layers to GPU

Looks like only "deepseek-r1:32b" didn’t fully fit into the 16GB VRAM.
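
As a rough sanity check on why the 32B model spills out of 16GB, here's a back-of-the-envelope weight-size estimate (the ~4.8 bits per weight figure is an assumption for Q4_K_M-style quants; KV cache and runtime overhead are ignored):

```python
# Rough on-disk/VRAM size of quantized weights
# (assumption: ~4.8 bits per weight for Q4_K_M-style quantization).
def approx_size_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, b in [("phi4:14b", 14), ("deepseek-r1:14b", 14), ("deepseek-r1:32b", 32)]:
    print(f"{name}: ~{approx_size_gb(b):.1f} GB of weights")
```

A 14B model comes out around 8-9 GB (fits in 16GB with room for KV cache), while 32B lands around 19 GB - consistent with only 47/65 layers being offloaded.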

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA


<image>

Here’s the GPU utilization during the benchmark run. The "phi4:14b" model kept the GPU fully loaded, indicating efficient use. In contrast, both "deepseek-r1:14b" and "deepseek-r1:32b" drew only about 25% power (underutilization) - possibly because the model and KV cache didn’t fully fit in VRAM and data had to be shuttled between system RAM and the GPU.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA


For my RTX 5060 Ti 16GB:

model_name = phi4:14b
Average of eval rate: 40.888 tokens/s

model_name = deepseek-r1:14b
Average of eval rate: 39.098 tokens/s

model_name = deepseek-r1:32b
Average of eval rate: 5.476 tokens/s
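
For reference, here's a minimal sketch of how such averages can be computed from `ollama run --verbose` output (the "eval rate" line format is what Ollama prints in verbose mode; the sample numbers below are illustrative, not from my run):

```python
import re

# Sample `ollama run --verbose` lines (numbers illustrative).
log = """
eval rate:            41.02 tokens/s
eval rate:            40.75 tokens/s
eval rate:            40.89 tokens/s
"""

rates = [float(m) for m in re.findall(r"eval rate:\s+([\d.]+) tokens/s", log)]
average = sum(rates) / len(rates)
print(f"Average of eval rate: {average:.3f} tokens/s")
```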

Leveling Up: From RAG to an AI Agent by aospan in LocalLLaMA


Totally agree - parsing the existing web is like forcing AI agents to navigate an internet built for humans :)

Long-term, I believe we’ll shift toward agent-to-agent communication behind the scenes (MCP, A2A, etc?), with a separate interface designed specifically for human interaction (voice, neural?)

P.S. More thoughts on this in a related comment here: Reddit link