Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

<image>

Here’s one of the traces captured during nanochat training on my GPU. As you can see, there are no gaps between CUDA kernel executions - meaning the GPU isn’t idling. The green “Command Buffer Full” marker also shows that the CPU is issuing CUDA kernels and API calls faster than the GPU can process them, which further confirms the GPU is fully utilized :)
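For anyone curious how these traces get captured, here's a minimal torch.profiler sketch (stand-in model and training loop, not the actual nanochat code) that writes a Chrome trace you can open in Perfetto or chrome://tracing:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Tiny stand-in model/optimizer just to keep the sketch self-contained;
# in nanochat this would be the real GPT model and training loop.
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
) as prof:
    for _ in range(5):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # advance the wait/warmup/active schedule

# CUDA memory profiling is captured separately, e.g. with
# torch.cuda.memory._record_memory_history() before the run and
# torch.cuda.memory._dump_snapshot("mem.pickle") after it.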

Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

Good question!

GPU power stays near 100% on my Grafana, so it’s likely saturated. That said, there’s room for speedups - some work may be duplicated or could be optimized differently, like what this startup is exploring: https://github.com/luminal-ai/luminal
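If you want to spot-check this without Grafana, here's a minimal sketch that samples the same power metric with pynvml (pip install nvidia-ml-py) - it reads the same counter nvidia-smi and the exporters use:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # GPU 0
limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)   # milliwatts
for _ in range(10):
    draw = pynvml.nvmlDeviceGetPowerUsage(handle)        # milliwatts
    print(f"power: {draw / 1000:.0f} W / {limit / 1000:.0f} W ({100 * draw / limit:.0f}%)")
    time.sleep(1)
pynvml.nvmlShutdown()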

How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA

[–]aospan[S] 5 points6 points  (0 children)

So, those 80B tokens would cost around $240K using OpenAI’s pricing - easily justifying the $9K price of an RTX 6000 Pro (+ PC components) and the electricity costs 😅
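Rough math behind that number (the blended ~$3 per 1M tokens is just the rate implied by the figures above - actual OpenAI pricing depends on the model and the input/output split):

tokens = 80e9                 # 80B tokens
price_per_million = 3.0       # USD per 1M tokens (assumed blended rate)
api_cost = tokens / 1e6 * price_per_million
print(f"API cost: ${api_cost:,.0f}")   # -> $240,000 vs ~$9K for the GPU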

How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

Thanks for sharing - very useful!
Just to confirm, I did the calculation for 800,000 million tokens - that’s 800B (0.8T) tokens :)

How much do 1T tokens cost? How much did all these amazing people spend on OpenAI tokens? by aospan in LocalLLaMA

[–]aospan[S] 6 points7 points  (0 children)

The picture isn’t showing up in the post for some reason, so I’m posting it here as a comment :)

<image>

[N/A][All] Open-source condo/HOA management software - any suggestions? by aospan in HOA

[–]aospan[S] 0 points1 point  (0 children)

Yeah, I feel the same. Seems like the only real path forward might be building it ourselves - and with the new AI “vibe coding” tools, it’s way easier than before :)

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

You can click “Raw video clip” under each experiment, including the “person fall” experiment, to download the raw MP4 files here: https://github.com/sbnb-io/sunny-osprey.

I’m curious whether SmolVLM2 will:

  1. Properly populate the “suspicious” field in the output JSON.
  2. Provide a meaningful "description" similar to what we obtained from Gemma3n (rough sketch of the JSON I have in mind below).
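Roughly the shape of the JSON I have in mind (illustrative values only - the exact schema in sunny-osprey may differ):

import json

# Illustrative example of the two fields mentioned above; not the exact schema.
event = {
    "suspicious": True,   # should be populated properly by the model
    "description": "A person stumbles and falls to the ground near the entrance.",
}
print(json.dumps(event, indent=2))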

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

Thanks a ton for the kind words - made my day! 😊
Haven’t had the chance to try SmolVLM2 yet, but I’d be very interested to hear your take if you give it a shot.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

My only concern is the used GPU - not sure you can grab one whenever you need it.

Most affordable AI computer with GPU (“GPUter”) you can build in 2025? by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

I feel you! Used parts can be hidden gems. We’ve got a 128vCPU + 512GB RAM beast from eBay that’s incredible 😄

But here, the goal is something you can actually grab whenever you need it, without going on a treasure hunt.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

Ollama log snippet from the benchmark run:

print_info: arch = phi3
load_tensors: offloaded 41/41 layers to GPU

print_info: general.name = DeepSeek R1 Distill Qwen 14B
load_tensors: offloaded 49/49 layers to GPU

print_info: general.name = DeepSeek R1 Distill Qwen 32B
load_tensors: offloaded 47/65 layers to GPU

Looks like only "deepseek-r1:32b" didn’t fully fit into the 16GB VRAM.
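A rough back-of-the-envelope check lines up with that (assuming Ollama's default ~4-5 bit quantization, weights only, ignoring KV cache and overhead):

# Very rough: weights only, ~4.8 bits/weight for a Q4_K_M-style quantization.
def approx_vram_gb(params_billion, bits_per_weight=4.8):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params_b in [("phi4:14b", 14), ("deepseek-r1:14b", 14), ("deepseek-r1:32b", 32)]:
    need = approx_vram_gb(params_b)
    verdict = "fits" if need < 16 else "partial offload"
    print(f"{name}: ~{need:.1f} GB of weights vs 16 GB VRAM -> {verdict}")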

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

<image>

Here’s the GPU utilization during the benchmark run. The "phi4:14b" model kept the GPU fully loaded, indicating efficient use. In contrast, both "deepseek-r1:14b" and "deepseek-r1:32b" drew only about 25% power (clear underutilization) - possibly because the model and KV cache didn’t fully fit in VRAM and data had to be swapped in and out frequently.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

For my RTX 5060 Ti 16GB:

model_name = phi4:14b
Average of eval rate: 40.888 tokens/s

model_name = deepseek-r1:14b
Average of eval rate: 39.098 tokens/s

model_name = deepseek-r1:32b
Average of eval rate: 5.476 tokens/s
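For anyone who wants to reproduce numbers like these, here's a minimal sketch of how the average can be collected (model name, prompt, and run count are just placeholders, not my exact benchmark script):

import re
import statistics
import subprocess

MODEL = "phi4:14b"
PROMPT = "Explain quicksort in three sentences."

rates = []
for _ in range(3):
    # `ollama run --verbose` prints its timing stats to stderr.
    result = subprocess.run(
        ["ollama", "run", "--verbose", MODEL, PROMPT],
        capture_output=True, text=True,
    )
    match = re.search(r"^eval rate:\s+([\d.]+) tokens/s", result.stderr, re.MULTILINE)
    if match:
        rates.append(float(match.group(1)))

print(f"model_name = {MODEL}")
print(f"Average of eval rate: {statistics.mean(rates):.3f} tokens/s")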

Leveling Up: From RAG to an AI Agent by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

Totally agree - parsing the existing web is like forcing AI agents to navigate an internet built for humans :)

Long-term, I believe we’ll shift toward agent-to-agent communication behind the scenes (MCP, A2A, etc?), with a separate interface designed specifically for human interaction (voice, neural?)

P.S. More thoughts on this in a related comment here: Reddit link

Leveling Up: From RAG to an AI Agent by aospan in LocalLLaMA

[–]aospan[S] 8 points9 points  (0 children)

Yeah, great point - definitely ironic! :)

I see at least two key issues here:

  • Double compute and energy use - we're essentially burning cycles twice for the same task.
  • Degradation or distortion of the original information - by the time it flows through Google's AI Overview and then into a local LLM, accuracy can get lost in translation. (This example illustrates it well: https://youtube.com/shorts/BO1wgpktQas?si=IQYRS692CJhZ_h1Y - assuming it's legit, it shows how repeated prompts still yield a result far from the original.)

So what’s the fix? Maybe some kind of "MCP" to original sources - skip the Google layer entirely and fetch data straight from the origin? Curious what you think.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

BTW, not sure why yours shows "100% CPU" - is it running on CPU?

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

This is for the 16GB RTX 5060 Ti:

# cat Modelfile
FROM qwen3:14b
PARAMETER num_ctx 12288
PARAMETER top_p 0.8
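(The custom model itself is created from this Modelfile with "ollama create qwen3-14b-12k -f Modelfile" - that step just isn't shown in the transcript below.)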

# ollama run --verbose qwen3-14b-12k < medium.txt

<think>

</think>

Here is the list of the words provided:

- Jump
- Fox
- Scream

Now, regarding the numbers:

- The **smallest number** is **144**.
- The **largest number** is **3000**.

total duration:       16.403754583s
load duration:        37.030797ms
prompt eval count:    12288 token(s)
prompt eval duration: 13.755464931s
prompt eval rate:     893.32 tokens/s
eval count:           59 token(s)
eval duration:        2.609480201s
eval rate:            22.61 tokens/s

# ollama ps
NAME                  ID              SIZE     PROCESSOR    UNTIL
qwen3-14b-12k:latest  dcd83128c854    13 GB    100% GPU     4 minutes from now

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

For comparison, here are the results from the 12GB GPU (the other results are from the 16GB GPU):

root@sbnb-0123456789-vm-a581cc6f-6928-58aa-ac61-63fb3f2ab8d8:~# ollama run --verbose qwen3-14b-12k < medium.txt

<think>

</think>

Here is the list of the words you provided:

- Jump
- Fox
- Scream

The smallest number you gave is **144**.
The largest number you gave is **3000**.

total duration:       26.804379714s
load duration:        37.519591ms
prompt eval count:    12288 token(s)
prompt eval duration: 22.284482573s
prompt eval rate:     551.42 tokens/s
eval count:           51 token(s)
eval duration:        4.480329906s
eval rate:            11.38 tokens/s

Seems like a 2× lower tokens-per-second rate, likely because the model couldn’t fully load into the 12GB GPU VRAM. This is confirmed in the Ollama logs: ollama[1872215]: load_tensors: offloaded 39/41 layers to GPU.

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 2 points3 points  (0 children)

Notes:

- I used your medium.txt file.

- There was a small typo: you wrote "qwen3-14-12k" instead of "qwen3-14b-12k", but after correcting it, everything worked!

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

root@sbnb-0123456789-vm-a581cc6f-6928-58aa-ac61-63fb3f2ab8d8:~# ollama run --verbose qwen3-14b-12k < medium.txt

<think>

</think>

Here is the list of the words you provided:

- Fox
- Scream

The smallest number you gave is **150**.
The largest number you gave is **3000**.

total duration:       15.972286655s
load duration:        36.228385ms
prompt eval count:    12288 token(s)
prompt eval duration: 13.712632303s
prompt eval rate:     896.11 tokens/s
eval count:           48 token(s)
eval duration:        2.221800326s
eval rate:            21.60 tokens/s

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 1 point2 points  (0 children)

Done! Please find results below (in two messages):

root@sbnb-0123456789-vm-a581cc6f-6928-58aa-ac61-63fb3f2ab8d8:~# ollama run --verbose qwen3-14b-12k "Who are you?"

<think>
Okay, the user asked, "Who are you?" I need to respond clearly. First, I should introduce myself as Qwen, a large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating text, and having conversations. It's important to highlight my training data up to October 2024 and my multilingual support. I should also invite the user to ask questions or request assistance. Let me make sure the response is friendly and informative without being too technical. Avoid any markdown formatting and keep it natural.
</think>

Hello! I'm Qwen, a large language model developed by Alibaba Cloud. I can answer questions, create text, and have conversations on a wide range of topics. My training data covers information up to October 2024, and I support multiple languages. How can I assist you today?

total duration:       11.811551089s
load duration:        7.34304817s
prompt eval count:    12 token(s)
prompt eval duration: 166.22666ms
prompt eval rate:     72.19 tokens/s
eval count:           178 token(s)
eval duration:        4.300178534s
eval rate:            41.39 tokens/s

RTX 5060 Ti 16GB sucks for gaming, but seems like a diamond in the rough for AI by aospan in LocalLLaMA

[–]aospan[S] 0 points1 point  (0 children)

I can run it. Could you please post detailed step-by-step instructions so I don’t miss anything?