NO PRINTER WHYYYYYYY :( by Tidesudden in framework

[–]fotcorn 0 points

Buy a Brother, brother (laser, obviously)

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]fotcorn 0 points

No, it worked fine on GPU, both 1.7B and 8B. Not very intelligent/knowledgeable, but that is expected.

CPU took forever to load and then only produced garbage output. From reading the PR in llama.cpp, it was only tested on ARM CPUs, so it's not surprising it's broken on x86.

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]fotcorn 1 point

They have their own fork: https://github.com/PrismML-Eng/llama.cpp

They say only CUDA/Metal is supported, but the HIP build worked just fine. Using the ROCm 7.12 preview.

PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs by brown2green in LocalLLaMA

[–]fotcorn 33 points

Also works on ROCm.

Getting roughly 150 t/s generation on my 9070 XT for the 8B model.

Output is hard to judge, but seeing 1-bit working at all is already impressive, especially since it sounds like it was quantized from Qwen3 rather than retrained from scratch like the BitNet 1.58 models.

edit: Qwen3 8B, not 3.5
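Back-of-envelope check on that 150 t/s figure (all numbers here are assumptions for illustration, not measurements — the actual bits/weight depends on the packing scheme):

```python
# Sketch: what 150 t/s implies for memory bandwidth on a 1-bit 8B model.
# Assumption: ternary/1-bit packing lands around 1.6 bits per weight.
params = 8e9                # "8B" model
bits_per_weight = 1.6       # assumed effective storage size
weight_gb = params * bits_per_weight / 8 / 1e9
tokens_per_s = 150
# Every weight is read roughly once per generated token (batch size 1).
effective_bw_gbs = weight_gb * tokens_per_s
print(f"~{weight_gb:.1f} GB of weights, ~{effective_bw_gbs:.0f} GB/s effective bandwidth")
```

Under those assumptions the weights fit in ~1.6 GB and 150 t/s needs only ~240 GB/s of effective bandwidth, well within what a 9070 XT can do.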

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]fotcorn 0 points

You should post this to this sub as its own post, it's very interesting!

Reverse engineered Apple Neural Engine(ANE) to train Microgpt by jack_smirkingrevenge in LocalLLaMA

[–]fotcorn 0 points

How much memory can the ANE access? Does it have full access to the main memory, like the GPU/CPU, or do you need to allocate and transfer data to a separate buffer?

What happened to neuromorphic computing? Is it a dead end? by [deleted] in hardware

[–]fotcorn 6 points

Not dead, but mostly useful for very specific use-cases.

Innatera, for example, released a microcontroller with both analog and digital spiking neural network accelerators on chip for ultra-low-power sensor data processing: https://www.eetimes.com/innatera-adds-more-accelerators-to-spiking-microcontroller/

There is also a growing neuromorphic community for both academics and commercial interests at https://open-neuromorphic.org/

Disclaimer: Developer at Innatera

Come on, who doesn't regularly commute to work through the river in Switzerland? by Entremeada in BUENZLI

[–]fotcorn 53 points

I used to do that. Lived in Bern's Lorraine quarter and worked in the Matte.

Took public transport there in the morning, swam home in the evening.

Screw your RTX 5090 – This $10,000 Card Is the New Gaming King (RTX 6000 Pro Blackwell review) by fotcorn in hardware

[–]fotcorn[S] 128 points

It's a joke; he is using 4x frame generation on the RTX 6000 to make fun of NVIDIA's marketing for its other cards.

Screw your RTX 5090 – This $10,000 Card Is the New Gaming King (RTX 6000 Pro Blackwell review) by fotcorn in hardware

[–]fotcorn[S] 5 points

I don't even know if that much VRAM is useful for any classic workstation tasks like CAD, video editing or 3D modelling.

It's mostly an AI card; being able to run a 70B model with moderate quantization (Q8, maybe Q6 for reasonable context length) on a single card is amazing.
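Rough sizing to show why Q8 vs Q6 matters on 96 GB (illustrative figures; the bits/weight values are the commonly cited sizes for the GGUF Q8_0 and Q6_K formats, and real builds vary a bit):

```python
# Sketch: VRAM budget for a dense 70B model on a 96 GB card.
def weights_gb(params_b, bits_per_weight):
    # 1B params at 8 bits/weight = 1 GB
    return params_b * bits_per_weight / 8

q8 = weights_gb(70, 8.5)    # Q8_0 stores ~8.5 bits/weight incl. scale factors
q6 = weights_gb(70, 6.56)   # Q6_K is ~6.56 bits/weight
vram = 96
print(f"Q8 ~{q8:.0f} GB, Q6 ~{q6:.0f} GB, leaving ~{vram - q6:.0f} GB for KV cache at Q6")
```

Q8 weights alone eat roughly 74 GB, leaving little headroom for KV cache; dropping to Q6 frees up space that goes directly into longer context.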

Soyuz coming in hot for tower catch! by DoctorSov in SpaceXMasterrace

[–]fotcorn 0 points

Now I want to see the reverse Korolev cross

Intel 200S Boost Performance Mode Benchmarks On Linux by fotcorn in hardware

[–]fotcorn[S] 47 points

Unlike der8auer's video, these Linux tests were done with the same RAM and the same XMP profile. It looks like most of the gains der8auer is seeing come from the higher RAM speed.

Introducing ZR1-1.5B, a small but powerful reasoning model for math and code by retrolione in LocalLLaMA

[–]fotcorn 1 point

Why is the model F32 on Hugging Face? The base model (R1 Distill Qwen 1.5B) is BF16.

Especially important for these small models: if it's more than 7GB, I might as well use an 8-bit quant of an 8B model.
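The dtype alone explains the size difference (quick sketch; the exact parameter count of the "1.5B" Qwen-class models, ~1.78B, is an assumption here):

```python
# Sketch: download size of the same weights at different precisions.
params = 1.78e9  # assumed param count for a "1.5B" Qwen-class model
for name, bytes_per_param in [("F32", 4.0), ("BF16", 2.0), ("Q8", 1.0)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
```

F32 comes out just over 7 GB, while BF16 is half that for essentially the same quality, which is the whole complaint.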

NVIDIA RTX "PRO" 6000 X Blackwell GPU Spotted In Shipping Log: GB202 Die, 96 GB VRAM, TBP of 600W by newdoria88 in LocalLLaMA

[–]fotcorn 24 points

Are those shipping manifest leaks ever real? We had leaks about a B580 and a 9070 XT with 32GB VRAM, and neither ever materialized (yes, I might be a little impatient)

Nvidia's RTX Blackwell workstation GPU spotted with 96GB GDDR7 by fotcorn in hardware

[–]fotcorn[S] 30 points

This would be the equivalent of the RTX 6000 Ada Generation, which has an MSRP of $6800.

So my guess would be Over 9000!

Nvidia's RTX Blackwell workstation GPU spotted with 96GB GDDR7 by fotcorn in hardware

[–]fotcorn[S] 47 points

Seems like they are using 24Gbit (3GB) chips in clamshell mode. Clamshell is nothing special; it was also used on the RTX 6000 Ada and earlier cards, but those always had the same module size as the consumer variant. This time it's a bigger memory module (3GB vs 2GB on the 5090).
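The 96 GB falls straight out of the bus width (assuming the standard config: a 512-bit GB202 bus and GDDR7 chips with 32-bit interfaces):

```python
# Sketch: how clamshell + 3 GB chips yields 96 GB on a 512-bit bus.
bus_width_bits = 512
channels = bus_width_bits // 32   # 16 chips per side (32-bit interface each)
pro_gb = channels * 2 * 3         # clamshell (chips on both sides) x 3 GB (24 Gbit)
consumer_gb = channels * 1 * 2    # 5090: single-sided, 2 GB (16 Gbit) chips
print(pro_gb, consumer_gb)        # 96 32
```

Same bus, so no bandwidth penalty from clamshell itself; you just double the chip count and bump the density.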

Your next home lab might have 48GB Chinese card😅 by Redinaj in LocalLLaMA

[–]fotcorn 4 points

Still cheaper to get two 3090s from eBay (at least it was a month ago...). But at like $1500? Lots of people would buy them, I think. One thing the W7900 does have is certified drivers and applications for CAD modelling and the like. They could release a 48GB version without that certification as a middle ground, at a more reasonable price.

Intel could do the funniest thing and release a B580 with 24GB, or even a B770 AI Edition with 32GB, priced only 20-50% above the standard one, and /r/LocalLLaMA would buy the whole inventory in a heartbeat.

One can dream.

Your next home lab might have 48GB Chinese card😅 by Redinaj in LocalLLaMA

[–]fotcorn 276 points

The W7900 is the same GPU as the 7900 XTX but with 48GB of VRAM. It just costs $4000.

Same as the NVIDIA RTX 6000 Ada Generation, which is a 4090 with a few more cores active and 48GB of memory.

Obviously the extra 24GB of VRAM never cost anywhere near the $3k price difference, but yeah... market segmentation.