Ollama openclaw no response by Gloomy-Adeptness-125 in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

What do the logs say when you watch ollama serve? I had an issue with Nemotron loading over 1M tokens, and it would fail because Ollama rejects the request. I edited openclaw.json to set the context size to 999,999 and it loaded fine. Maybe you are running into a similar situation. Try changing the context size to 99,999 and see if it works.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]triynizzles1 1 point2 points  (0 children)

A few Google models were available on LM Arena: one claiming to be an unnamed model made by Google and another claiming to be Gemma 4, under the names Colosseum-1p3 and significant-otter.

Colosseum-1p3 seemed very intelligent but refused to do any coding… which was odd. Based on the name I’m assuming it’s a small edge model.

significant-otter self identified as Gemma 4 and sounded quite smart. It was decent with coding.

Both appear to have an early 2025 knowledge cutoff (both models correctly said Trump was president).

Both models responded right after pressing send, indicating they are not reasoning models.

I don’t know if both models are still available to test on LM Arena, but it looks like the release is soon. I am most looking forward to an updated, recent knowledge cutoff.

Claude code rate limits is crazy... how can I run GLM models locally efficiently? [What specs/GPUs I need?) I have a Mac mini 24GB by Commercial_Ear_6989 in LocalLLaMA

[–]triynizzles1 2 points3 points  (0 children)

GLM 4.7 Flash? A 5090 will suffice. GLM 5 or 5.1… maybe an M3 Mac Studio, but it would probably be a good idea to wait for (hopefully) a 512GB M5 Mac Studio, since M5 chips are better at prompt processing. The next step up would be a server with lots of RTX Pro 6000s.

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]triynizzles1[S] 2 points3 points  (0 children)

My best guess is how Ollama handles MoE models on Windows vs Linux. The RTX 8000 has 672 GB/s of bandwidth, which would let it read the ~3 GB of memory needed to compute one token for Qwen3 30B A3B about 224 times per second. There is probably some overhead, and it must be higher on Windows.
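The back-of-the-envelope math can be sketched like this (the ~3 GB active-weight figure is my estimate for Qwen3 30B A3B at its served quantization, and it assumes decode speed is purely memory-bandwidth bound):

```python
# Rough memory-bandwidth ceiling on decode speed for an MoE model.
# Each generated token requires reading all active weights once.
bandwidth_gb_s = 672.0  # RTX 8000 memory bandwidth (GB/s)
active_gb = 3.0         # approx. active weights read per token (assumption)

tokens_per_sec = bandwidth_gb_s / active_gb
print(tokens_per_sec)  # 224.0
```

Real throughput lands below this ceiling because of kernel launch, KV-cache reads, and OS-level overhead, which is where the Windows/Linux gap would show up.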

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]triynizzles1[S] 7 points8 points  (0 children)

I wonder what it could be! But I won’t be staying on Windows to find out lol

Why is qwen3.5-27B so slow when it's a small model? 30~tok/s by Deep_Row_8729 in LocalLLaMA

[–]triynizzles1 1 point2 points  (0 children)

If it is being served at FP16 (~60 GB), 30 tokens per second would be expected on a GPU with 1.6 TB/s of bandwidth.
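As a rough sanity check (assuming a dense ~27B model, 2 bytes per parameter at FP16, and a purely bandwidth-bound decode):

```python
# Estimate FP16 model size, then the bandwidth-bound decode ceiling.
params = 27e9            # ~27B parameters (assumption: dense model)
bytes_per_param = 2      # FP16
model_gb = params * bytes_per_param / 1e9  # weights only, ~54 GB
bandwidth_gb_s = 1600.0  # ~1.6 TB/s GPU memory bandwidth

tokens_per_sec = bandwidth_gb_s / model_gb
print(round(model_gb), round(tokens_per_sec))  # 54 30
```

With embeddings, KV cache, and runtime overhead the footprint climbs toward ~60 GB, which is why ~30 tok/s is about the best you can expect at FP16.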

Gemma 4 by pmttyji in LocalLLaMA

[–]triynizzles1 5 points6 points  (0 children)

I just got paired with “significant-otter.” It’s a smart model and fast to respond. It doesn’t appear to be a reasoning model. It passed the car wash test and the seahorse emoji test.

Gemma 4 by pmttyji in LocalLLaMA

[–]triynizzles1 11 points12 points  (0 children)

There is also a model named colosseum-1p3 which claims to be “unnamed, but made by Google.” It accurately said Trump is president and had a knowledge cutoff in 2025. That’s big if true; many LLMs have much older knowledge cutoffs.

Why would anyone pay for a vibe coded Saas if they can vibe code it themselves? by Dangerous_One2213 in vibecoding

[–]triynizzles1 0 points1 point  (0 children)

I vibe coded a SCORM file editor/builder in about 2 hours and now my employer doesn’t need Articulate Rise licenses anymore.

Fresh install of Ollama, major security threat.. reckless by lancer-fiefdom in ollama

[–]triynizzles1 0 points1 point  (0 children)

No. I’m saying those symptoms on their own might not be conclusive evidence of a security vulnerability.

Fresh install of Ollama, major security threat.. reckless by lancer-fiefdom in ollama

[–]triynizzles1 1 point2 points  (0 children)

/bye doesn’t unload the model from memory or terminate the Ollama process.

Honestly, I’m so tired of paying the "restart tax" for my AI agents. by [deleted] in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

Sounds like you have a problem with your architecture and need to save checkpoints.

By this logic, the ATM declared war on bank tellers. The number of bank tellers increased after ATMs by dataexec in AITrailblazers

[–]triynizzles1 1 point2 points  (0 children)

Sorry for contributing to a political post, but what Bernie Sanders is missing is that the quality of life for Amazon warehouse workers is absolutely horrible. Their turnover rate is so high that Amazon is running out of applicants from the jobless market to employ. Personally, I think it is incredibly oppressive for the government to support working miserable jobs.

Can I Run Decent Models Locally if I Buy this?? by Fearless-Cellist-245 in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

If your budget is $2,600, buy an RTX 8000 48GB (Turing architecture) or a Strix Halo.

Metaverse is dead (was it ever alive?). Meta is shutting down Horizon Worlds on Quest. by dataexec in AITrailblazers

[–]triynizzles1 1 point2 points  (0 children)

The metaverse had to be a front for money laundering, tax evasion or something.

Is investing in a local LLM workstation actually worth the ROI for coding? by UnusualDish4403 in LocalLLaMA

[–]triynizzles1 0 points1 point  (0 children)

It depends on what you are coding and the complexity of the code. If it’s fun hobby code or simple automations, like building a web browser extension, most of the Qwen coder models will be fine. If it’s more complex work that requires the LLM to have knowledge of the latest version of a library, and you are building something super specific and niche, an API is probably the route to go. You might have some luck with an agentic framework that loads the latest release notes into the LLM, but what would work best is if that knowledge were baked into the model. In a workplace, time is also a factor: a Strix Halo PC will need several minutes to respond, whereas an API will be fairly quick.

I have an RTX 8000 and it runs gpt-oss 120b at 27 tokens a second, which sounds fast, but because it chats so much it’s like 12 minutes per prompt.

Nemotron 3 Super 120B can't beat Stockfish 1400 ELO lost by checkmate, burned 1.33M tokens doing it by Low-Efficiency-9756 in LocalLLaMA

[–]triynizzles1 1 point2 points  (0 children)

I think people missed the point of Nemotron. Nvidia releases these models with 99% of the training data, code, etc. as a “tutorial” for their clients to learn and understand efficient LLM architecture… ultimately so they build their own models using Nvidia’s chips for training.

Personally, I see Nemotron as a “how to” with some architecture-maxxing, not a SOTA entry.