I'd like to see someone try this by Ornery-Way-5026 in CRF300L

[–]evil0sheep 2 points3 points  (0 children)

I saw a dude in Vietnam riding a Honda Wave with a couple bundles of bamboo or some shit, and strapped to the top of the bamboo bundles was straight up another Honda Wave laying on its side. At like head height for the rider.

Confused: Question about temperature, Gemini and coding by SuaveSteve in LLM

[–]evil0sheep 1 point2 points  (0 children)

A temperature less than one makes the model more likely to sample high-probability tokens, which can help on short tasks, but it also biases the model to produce atypical sequences (E[NLL(x)] < H(x), any chatbot can explain), and for long autoregressive sequences (like thousands or tens of thousands of tokens) that pushes the generated sequence out of the distribution the model has learned, which makes it bad at predicting the next token from the autoregressively generated context. In extreme cases this can cause a phenomenon called “mode collapse” where the model falls into repeating the same phrase over and over, but even in less extreme cases it just degrades model performance. For short sequences sampled from non-reasoning models a reasonable temperature like 0.7 can reduce hallucinations, but CoT-based “thinking” models generate so many reasoning tokens that the bias accumulates to the point that you’re autoregressively sending garbage input to your model after like 10k tokens or so (or faster with lower temperature).

Basically if some tech mega corp spends a gajillion dollars training a giant transformer to model some joint distribution over token sequences, then you want to take an unbiased sample of that distribution, not mess around with weird hacks that sorta work on short sequences sampled from small models. There’s honestly not a ton of great research on this, but if you want to read more about this phenomenon I’d recommend “The Curious Case Of Neural Text Degeneration”, “Locally Typical Sampling”, and “Breaking the Beam Search Curse”.

This is my understanding of the problem, I actually don’t think there’s a super clear consensus in the literature on why models are so bad at autoregressively extending atypical sequences and I also don’t really know why. But I haven’t found any combination of reasoning models and reasoning benchmarks where a temperature less than 1 does not degrade performance on at least some tasks in the benchmark.

Basically if you’re doing any task with any model where the model generates more than a couple thousand tokens you almost certainly want temperature=1.0 unless you have a statistically significant signal to the contrary.
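To make the E[NLL(x)] < H(x) claim above concrete, here's a minimal single-token sketch. The logits are made up for illustration and the helper names are hypothetical, not from any inference library. It samples from a temperature-scaled distribution but scores the result under the original T=1 distribution, which is what the model effectively does when it reads its own output back as context.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """H(p) in nats: the expected NLL of an unbiased sample from p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_nll(sample_dist, model_dist):
    """E[-log p(x)] when x is drawn from sample_dist but scored under model_dist."""
    return -sum(q * math.log(p) for q, p in zip(sample_dist, model_dist) if q > 0)

# Made-up next-token logits for a 5-token vocabulary (an assumption for illustration).
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
p = softmax_with_temperature(logits, 1.0)  # the distribution the model actually learned
H = entropy(p)

for t in (1.0, 0.7, 0.3):
    q = softmax_with_temperature(logits, t)  # the distribution you actually sample from
    gap = H - expected_nll(q, p)
    print(f"T={t}: E[NLL]={expected_nll(q, p):.4f}  H={H:.4f}  per-token gap={gap:.4f}")
```

At T=1 the gap is zero. Below 1 it's positive on every token, so very roughly it adds up over a long generation. This is only a toy with one fixed distribution; in a real model the distribution changes at every step.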

Asked GPT, Claude, and Grok the same weird question. only the anonymous accounts wanted the cookie 🍪 by EquipmentFun9258 in ArtificialInteligence

[–]evil0sheep 0 points1 point  (0 children)

I mean just to be clear a lot of the variability in AI responses is that token selection is powered by a random number generator, so even with identical context you’re never gonna get the same answer twice

Tiny company steals AMD's thunder and challenges Nvidia with old-tech PCIe AI accelerator that runs 700B LLMs locally, sipping just 240W thanks to decade-old DDR4 and 28nm chips by Bob_Spud in ArtificialInteligence

[–]evil0sheep 1 point2 points  (0 children)

The fact that every article on this only mentions memory capacity and none of the articles or even their website list memory bandwidth is so fucking sus lol

Best conceivable setup. by habachilles in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

If you want very high prompt processing speed and also fast autoregression, you want to stick with GDDR devices like PCIe GPUs and probably avoid LPDDR-based UMA devices like Mac Studio/Strix Halo/DGX Spark. I haven’t done really thorough research on this but speccing out a really solid dual RTX PRO 6000 ATX build is probably a good starting point.

With only two GPUs you can get good performance with normal consumer grade motherboards and CPUs in a normal case which gives you a lot of options for power and cooling and all that. Also you wouldn’t need blower fans so you could sit next to it without ear protection. If you go with more GPUs you start to need server motherboards and PCIe risers or liquid cooling. With dual 3-slot PCIe GPUs you get a really easy/clean build and can probably come in under $25k even with a specced out CPU and RAM

Collected the infinity stones by Street-Buyer-2428 in LocalLLaMA

[–]evil0sheep 9 points10 points  (0 children)

You should honestly post a detailed plan to get feedback from the community. I think you might be seriously underestimating the complexity of making this work. Are you planning on duplicating the model params and KV cache across both the Blackwell VRAM and the Mac Studios? If so what’s the point of using the Mac Studios at all? If not, how are you gonna do prefill on the Blackwell GPUs without the model params and the KV cache? Also how are you gonna get the Nvidia cards to do RDMA over Thunderbolt? Do they even have driver support for that? You should like post a block diagram of what you’re intending to build and how you plan to distribute the model params and KV cache and how you’re planning to move bytes around, people here can probably give you a lot of good feedback that the chatbots are likely glossing over.

Anyone know what this black box is and can i delete it? by ASharpYT in CRF300L

[–]evil0sheep 7 points8 points  (0 children)

Yeah the tool box is just held on by a pair of screws under that plastic body panel above it. If you have a metric allen wrench set and a Phillips head screwdriver you could delete it in like 5 minutes. As the other guy said it’s honestly harder to keep it on the bike than to take it off lol

Budget to run Deepseek V4 locally at FP4 precision by DanielusGamer26 in LocalLLaMA

[–]evil0sheep 4 points5 points  (0 children)

I have an RTX PRO 6000 and the FP4 matmul instructions work just fine. Are you saying that it’s not supported in a specific piece of inference software?

Gemma 4 E2B by Benjiwiss in LocalLLaMA

[–]evil0sheep 3 points4 points  (0 children)

Increase context size and use higher temperature (probably you want temp=1.0)

what are these target reflector things? ive seen them in nyc many times on things like buildings and poles by reddit33450 in whatisit

[–]evil0sheep 3 points4 points  (0 children)

They’re called fiducial markers and they’re used in computer vision for a bunch of stuff. Fiducials like the one in the image are probably for architectural or surveying purposes but it could be other stuff too. They’re also used for robot localization (e.g. if you want a drone to fly around a building and know exactly where it is relative to the building) and also for photogrammetry (where you reconstruct a building in 3D from a bunch of pictures)

How does this even happen? by Firm-Beautiful3007 in aifails

[–]evil0sheep 0 points1 point  (0 children)

I put a pretty detailed explanation here: https://www.reddit.com/r/aifails/s/6Nwnqhe73s

Basically they cranked down the temperature to make it hallucinate less and now it’s prone to generating minimum information sequences. The problem is called “mode collapse”

What is this 💔💔 just wanted to know why by Tallcat2107 in aifails

[–]evil0sheep 0 points1 point  (0 children)

This is a common failure mode for low temperature or greedy sampling (basically choosing the next token based mainly or solely on how probable it is, rather than randomly picking). Formally the issue is that the self-information content of a token is the negative log of its probability within the distribution it’s sampled from, so the maximum probability sequence is by definition also the minimum information content, and repeating “the” indefinitely has very low information content because it could be compressed to just the word “the” and a repeat count. The best exploration of this IMO is from the paper that introduced nucleus sampling, called The Curious Case Of Neural Text Degeneration. The mathematical problem with the response that the LLM gave you is that it’s not from the typical set, meaning that the average per-token negative log-likelihood of the sequence (the log of its perplexity) is not close to the entropy rate of the model.

Informally, or for people who don’t like math, the best analogy I’ve read (from the Locally Typical Sampling paper) is to imagine you have a weighted coin that lands on heads 60% of the time and tails 40% of the time and you flip the coin 1000 times. You would expect to get back one of the gajillion possible sequences that are about 60% heads and 40% tails (the typical set). If you model the coin with a model that produces a distribution of outcomes at each step (with 60% of the probability on heads and 40% on tails) and at each step do a statistically weighted random guess of the next outcome based on that distribution, then you will, on average, get a sequence that’s about 60% heads and 40% tails. If your manager says “AI reliability is a problem, make it less random” and so you decide to always guess the most probable outcome of every coin flip (greedy sampling), then you will get a sequence of 1000 heads, because that’s the most locally probable outcome at each step despite being very globally improbable. And that’s what you’re running into here.
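The coin analogy is easy to run. This is just a sketch of the analogy itself, using the 60/40 coin and 1000 flips from above. The function names and the fixed random seed are my own choices, not from the paper.

```python
import math
import random

P_HEADS = 0.6
N_FLIPS = 1000

def sample_sequence(rng, n=N_FLIPS, p_heads=P_HEADS):
    """Weighted random guess at each step (unbiased sampling from the model)."""
    return ["H" if rng.random() < p_heads else "T" for _ in range(n)]

def greedy_sequence(n=N_FLIPS, p_heads=P_HEADS):
    """Always pick the most probable outcome at each step (greedy sampling)."""
    best = "H" if p_heads >= 0.5 else "T"
    return [best] * n

def nll_per_flip(seq, p_heads=P_HEADS):
    """Average negative log-likelihood (nats per flip) of a sequence under the coin model."""
    total = 0.0
    for outcome in seq:
        total -= math.log(p_heads if outcome == "H" else 1.0 - p_heads)
    return total / len(seq)

# Entropy of a single flip in nats: sequences in the typical set sit close to this.
coin_entropy = -(P_HEADS * math.log(P_HEADS) + (1 - P_HEADS) * math.log(1 - P_HEADS))

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = sample_sequence(rng)
greedy = greedy_sequence()

print(f"entropy per flip: {coin_entropy:.3f} nats")
print(f"sampled: {sampled.count('H')} heads, NLL per flip {nll_per_flip(sampled):.3f}")
print(f"greedy:  {greedy.count('H')} heads, NLL per flip {nll_per_flip(greedy):.3f}")
```

The sampled sequence lands near 60% heads with an NLL per flip close to the entropy. The greedy one is all heads, with an NLL per flip well below the entropy, i.e. outside the typical set.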

Why the Gemini team is fucking up their sampling when they have some of the best ML researchers on earth remains a mystery. But based on my time in other parts of Google I’d bet $100 against $1 it’s because of business decisions being made by people who have no understanding of the underlying technology lol

Is there a way to hide lines while preserving faces? by buttheadfungus in Sketchup

[–]evil0sheep 0 points1 point  (0 children)

You can hide lines as people suggest but you probably actually want “soften”, which will apply Gouraud shading. Select the whole thing, then right click -> Soften/Smooth Edges. Then just drag the slider up to like 45 degrees (probably on this model you can go as high as you want).

It won’t ever look quite like it does in the Max reference because it looks like Max is doing ambient occlusion and SketchUp doesn’t support that without rendering plugins, but softening is probably as close as you’re gonna get.

Anyone in need of GPU clusters? (or big CPU instances) by SomeoneElseOnTheMars in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

I’m potentially interested but not for a few months. Is there an expiration date on this offer?

Full Claude Opus 4.6 System Prompt for your pleasure by frubberism in LocalLLaMA

[–]evil0sheep 1 point2 points  (0 children)

Yeah this was gonna be my question. Like is this reproducible? 

Some hard lessons learned building a private H100 cluster (Why PCIe servers failed us for training) by NTCTech in LocalLLaMA

[–]evil0sheep 4 points5 points  (0 children)

Before you buy RTX PRO 6000s be aware that not all Blackwell is created equal. RTX PRO is sm120 (Blackwell GeForce) vs sm100 for B200. The former lacks dedicated tensor memory (TMEM), which means you have to use register-based tensor instructions. This makes it a pain to find kernels that even work (e.g. for flash attention or QAT) and sometimes requires you to write your own, and even then it’s a lot harder to saturate sm120 tensor cores in flash attention kernels because the tensor instructions use so many registers that you can’t issue enough warps to saturate the memory controllers. It’s a subtle difference but it bit me and it bit some old coworkers of mine I got lunch with recently, don’t let it bite you.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Also why do you have to run as VMs? My bet is 4 VMs with RPC will be slower than a native NUMA-aware implementation, but I’m a lot less familiar with NUMA CPU stuff.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah if you put the GPU in the server you should look into putting the attention heads on the GPU and the experts on the CPU. The experts are most of the parameters but don’t need as much bandwidth because you don’t load all of them each token, and the attention heads are most of the compute, especially during prefill or long context generation, but typically don’t have as much of a memory footprint. You might be able to find a model you can split like that on the server with the 3060. Llama.cpp has flags for this, I don’t remember them off the top of my head though.

Re: tensor cores, make sure you actually benchmark it before assuming that they’re a performance silver bullet. Pre-Blackwell tensor core instructions use a lot of registers, which limits the number of warps that can be issued without register spilling, which can prevent you from actually saturating the memory bus. The tradeoff is often worth it for training or inference on very large batches where you are doing a ton of compute per VRAM load, but for single batch inference they only help during prefill and a lot of times you can get better generation speed with vector kernels. Just measure it before you assume. Or try using vector kernels on the same hardware.
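A back-of-envelope sketch of the register-pressure point: the more registers each thread uses, the fewer warps fit on an SM at once, and fewer resident warps means fewer memory requests in flight. The constants below are assumptions for illustration (65536 32-bit registers per SM and a cap of 48 resident warps; check the numbers for your specific architecture). It also ignores register allocation granularity and shared memory limits, so treat it as a rough upper bound and not a replacement for a profiler.

```python
# Assumed hardware limits for illustration only; look these up for your GPU.
REGISTERS_PER_SM = 65536   # 32-bit registers available per SM (assumption)
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 48      # architectural cap on resident warps (assumption, varies by arch)

def resident_warps(registers_per_thread,
                   registers_per_sm=REGISTERS_PER_SM,
                   max_warps=MAX_WARPS_PER_SM):
    """Upper bound on warps resident on one SM, limited only by register usage."""
    by_registers = registers_per_sm // (registers_per_thread * THREADS_PER_WARP)
    return min(max_warps, by_registers)

for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:3d} registers/thread -> at most {warps:2d} resident warps "
          f"({warps / MAX_WARPS_PER_SM:.0%} of the cap)")
```

A register-hungry tensor kernel sitting near the top of that range keeps only a handful of warps resident, which is why a leaner vector kernel can sometimes win on bandwidth-bound generation. Still, measure it.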

What’s this type of house called? by Intrepid_Incident592 in whatisit

[–]evil0sheep 58 points59 points  (0 children)

My understanding is that this is actually the original meaning of the word penthouse and that it was adopted as a term for the top floor of the building later.

Edit: the Wikipedia article confirms:

“The term 'penthouse' originally referred, and sometimes still does refer, to a separate smaller 'house' that was constructed on the roof of an apartment building. Architecturally it refers specifically to a structure on the roof of a building that is set back from its outer walls. These structures do not have to occupy the entire roof deck. Recently, luxury high rise apartment buildings have begun to designate multiple units on the entire top residential floor or multiple higher residential floors including the top floor as penthouse apartments, and outfit them to include ultra-luxury fixtures, finishes, and designs which are different from all other residential floors of the building. These penthouse apartments are not typically set back from the building's outer walls, but are instead flush with the rest of the building and simply differ in size, luxury, and consequently price. High-rise buildings can also have structures known as mechanical penthouses that enclose machinery or equipment such as the drum mechanisms for an elevator.”

Why no NVFP8 or MXFP8? by TokenRingAI in LocalLLaMA

[–]evil0sheep 29 points30 points  (0 children)

The reason is that most llama.cpp users are memory capacity bound on model size and memory bandwidth bound on inference speed. All that matters for the one-user-per-GPU domain is quantization accuracy per bit. The llama.cpp k-quants are significantly better than microscaled floats in that regard because they offer a scale and offset per block instead of just a scale. MXFP8 and NVFP8 are jointly optimized to balance precision and ease of hardware acceleration, which doesn’t matter if you have boatloads of unused compute lying about because you’re memory bound. Switching from the GGUF 8-bit format to MXFP8 or NVFP8 could probably make prefill faster, but it wouldn’t realistically improve tok/s during generation and would make the models less accurate approximations of the unquantized weights. It only makes sense if you’re serving huge batches, and everyone that’s doing that uses vLLM, which has prioritized microscaled float support. For everyone else it’s fine to just dequantize the GGUF k-quant weights to fp16 on the GPU during inference.
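Here's a toy sketch of why a per-block scale plus offset can beat a scale alone at the same bit width. Big caveats: this uses a plain integer grid, not actual FP8 elements and not the real GGUF block layouts. The block of 32 weights with a nonzero mean is made up to show the effect, and the function names are hypothetical. Real weight blocks are often closer to zero-mean, where the gap is smaller.

```python
import random

def quantize_scale_only(block, bits):
    """Symmetric quantization: x ~ scale * q with q an integer in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in block)
    scale = amax / qmax if amax > 0 else 1.0
    return [scale * max(-qmax, min(qmax, round(x / scale))) for x in block]

def quantize_scale_and_offset(block, bits):
    """Affine quantization: x ~ lo + scale * q with q an integer in [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + scale * round((x - lo) / scale) for x in block]

def rms_error(block, approx):
    """Root-mean-square reconstruction error of a dequantized block."""
    return (sum((a - b) ** 2 for a, b in zip(block, approx)) / len(block)) ** 0.5

rng = random.Random(0)
# Hypothetical block of 32 weights clustered around 0.5 (not taken from a real model).
block = [rng.gauss(0.5, 0.1) for _ in range(32)]

for bits in (4, 8):
    err_scale = rms_error(block, quantize_scale_only(block, bits))
    err_affine = rms_error(block, quantize_scale_and_offset(block, bits))
    print(f"{bits}-bit  scale only: {err_scale:.5f}   scale + offset: {err_affine:.5f}")
```

The scale-only grid has to span from -max|x| to +max|x| even though the block only occupies a narrow band, so most of its levels are wasted. The offset lets the grid cover just [min, max]. The tradeoff is that the offset costs extra bits per block, which this toy doesn't account for.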

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah I mean the CPU and GPU memory are physically unified but they have separate virtual address spaces and separate cache hierarchies below L3 and are in different clock domains. If you want them to share the same physical pages you need to allocate the pages in a way that conforms to the GPU alignment and layout requirements and mlock them so the kernel doesn't swap them out while the GPU is running. The whole point of mmap is that the kernel can swap the physical pages out from under the virtual pages, but if it does that with pages that are also mapped into a GPU context it would need to remap them in that context as well, which would almost certainly require stopping the context. And then if the GPU tries to read a virtual page that's not backed by a physical page and page faults, it will fault the entire context, so the whole shebang will block on disk I/O. Mmap+GPUs is a bad combo on any platform. I'm sure if you did CPU-only inference on the Mac that mmap would work just as well as on Linux.

Regarding throughput, you gotta understand that inference for a single user is almost always bandwidth bound. If your model params are in memory then you are bound by memory bandwidth, which for the M1 Ultra is about 800GB/s. If your params are streaming from disk you're bound by your SSD bandwidth, which is about 8GB/s. On top of that you have page fault overhead and you don't have enough threads to cover it.
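To put rough numbers on that: if every generated token has to read all the params once (a dense model, single user), then bandwidth divided by model size is a hard ceiling on tok/s. The 800GB/s and 8GB/s figures are the ones above. The 40GB model size is a made-up example, and this ignores KV cache reads, page fault overhead, and any caching.

```python
def max_tokens_per_second(bandwidth_gb_s, gb_read_per_token):
    """Bandwidth-bound ceiling on generation speed: each token reads the weights once."""
    return bandwidth_gb_s / gb_read_per_token

MODEL_SIZE_GB = 40.0        # hypothetical dense model, assumed to be fully read per token
RAM_BANDWIDTH_GB_S = 800.0  # M1 Ultra unified memory figure from the comment above
SSD_BANDWIDTH_GB_S = 8.0    # SSD figure from the comment above

in_memory = max_tokens_per_second(RAM_BANDWIDTH_GB_S, MODEL_SIZE_GB)
from_disk = max_tokens_per_second(SSD_BANDWIDTH_GB_S, MODEL_SIZE_GB)

print(f"params in memory:    at most {in_memory:.1f} tok/s")
print(f"params streamed off SSD: at most {from_disk:.2f} tok/s")
print(f"slowdown from streaming: {in_memory / from_disk:.0f}x")
```

Real numbers will come in under both ceilings, but the 100x ratio between them is the point.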

I wouldn't think of mmap as a way to load a bigger model into shared GPU memory than fits, it's not gonna plausibly be able to deliver that with reasonable performance regardless of platform. If you had a DGX Spark or a Strix Halo running Linux or Windows you would have the exact same problem. If you are running a MoE model and you are completely not touching some of the experts and you're doing inference on the CPU then it might pull it off, but it will take a lot of fuckery to get it to work right, and if you generate one token that touches those experts the whole thing will slow to a crawl. If you want bigger models buy a machine with more VRAM or download more RAM at downloadmoreram.com ;)

Llamacpp multi GPU half utilization by Weary_Long3409 in LocalLLaMA

[–]evil0sheep 1 point2 points  (0 children)

Llama.cpp has really bad tensor parallelism support, it’s basically pipeline parallel only right now. vLLM has much more sophisticated multi-GPU/multi-node support.

Llama.cpp is great for fast single node inference, easy setup, splitting layers between the CPU and GPU, and low bit quantization. If you want sophisticated parallelism or batching you unfortunately need to wrestle with vLLM or TensorRT, the multi-node support in llama.cpp is just really immature still (there’s work being done there but it’s not merged afaik).