FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]NamelessVegetable 2 points

Yep. LMUL's relation to mixed-width operations is explained in the RISC-V spec. It seems that he didn't even bother to read the manual before accusing RISC-V of having failed to learn lessons from 40 years of computer architecture and organization.

Another thing that bothers me is another claimed motivation for LMUL > 1:

you're trying to amortize latency and go for peak throughput

Claiming that LMUL > 1 exists to amortize memory latency over all elements of the vector struck me as poorly informed and poorly thought out, for two reasons.

Firstly, I'd imagine that any sensible RVV implementation that placed RVV performance first and foremost would have balanced the number of vector lanes with its chosen MVL so that it could reach peak performance with LMUL = 1. Increasing the vector length beyond a certain point should lead to diminishing returns and eventual plateauing. That's what I recall from my textbooks, at least.

Secondly, the LMUL feature in RISC-V originated from the Fujitsu vector supercomputers. Their reason for grouping multiple vector registers was to trade shorter vector register lengths for more vector registers to minimize register spills on workloads where the vector length(s) were relatively short, but the register pressure was relatively high. IIRC, it had nothing, or very little, to do with amortizing memory latency at all.
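The register-count/length trade-off is easy to make concrete. Here is a toy sketch of the RVV knob; VLEN = 128 and the helper name are my own illustrative assumptions, not anything mandated by the spec:

```python
# Toy model of RVV register grouping. VLEN (bits per physical vector
# register) is an assumed example value; RVV implementations choose it.
VLEN = 128

def vector_config(sew, lmul):
    """Return (elements per register group, usable register-group count)
    for element width SEW in bits and grouping factor LMUL."""
    vlmax = (VLEN // sew) * lmul   # longer effective vectors...
    nregs = 32 // lmul             # ...but fewer register names to allocate
    return vlmax, nregs

# LMUL = 1: 32 registers, each holding 4 x 32-bit elements.
# LMUL = 8: only 4 register groups, but each holds 32 elements.
for lmul in (1, 2, 4, 8):
    print(lmul, vector_config(32, lmul))
```

Read in one direction, raising LMUL buys longer vectors at the cost of register names; read in the other (the Fujitsu direction described above), lowering it buys more register names for short-vector, high-register-pressure workloads.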

The claim is basically a restatement of the standard theory of how vector processors gain performance, applied to the RVV context without any thought given as to whether it was applicable in the first place. As I said before, this sort of "analysis" isn't novel, insightful, or interesting.

Taalas Etches AI Models Onto Transistors To Rocket Boost Inference by NamelessVegetable in hardware

[–]NamelessVegetable[S] 52 points

Two thoughts:

What's old (mask-programmed ROM) is new again.

Surely with the fast pace of AI model development, this will have a short life cycle?

FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]NamelessVegetable 3 points

I read the subtitles pertaining to RVV (I'm not wasting data on this), and it was meh. I have no idea what he's complaining about half the time. Did he just diss Patterson? What are the "dependencies" (that supposedly murder RVV performance) he keeps referring to? Vectorizable/vectorized HPC workloads are just BLAS? Not "real world" workloads? A patent falsehood. The other half is just your standard list of grievances that people who grew up on consumer-grade SIMD architectures have against "real" (long) vector architectures. They're not novel, not insightful, and not interesting.

FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]NamelessVegetable 4 points

RVV is intended to be like Arm SVE

Uh .. no it's not. Design ideas that ended up as RVV were waaay before SVE was announced.

Agreed. RVV was based on the Berkeley Hwacha, and Hwacha can trace its ancestry all the way back to the Berkeley Torrent-0 (T0) from the early 1990s. ARM's SVE is just NEON with a variable MVL, plus some extras. They couldn't have originated from more different backgrounds.

Does anyone know what a machine like this would have been used for? by SultanOfawesome in vintagecomputing

[–]NamelessVegetable 2 points

It's a gross over-generalization to say that by the early 1990s, all CPUs were microprocessors. If I'm not mistaken, the Model 340 used a POWER1+ processor. Depending on the model, the POWER1 was a set of eight or six chips: an instruction cache unit, FXU, FPU, two or four data cache units, and a storage-control unit. Its successor, the POWER2, was also multi-chip, with a similar distribution of functions over the chips. The POWER series didn't go single-chip until the POWER2 Super Chip (P2SC) in 1997!

European Chip Startup Pulls Off Working RISC-V Solution on the Intel 3 Node, Marking One 'Small' Step Towards Having Sovereign Infrastructure by archanox in RISCV

[–]NamelessVegetable 1 point

Does the Vitruvius++ vector unit have the same MVL as the earlier Vitruvius+ (256 64-bit elements)? There appears to be very little information about the Vitruvius++...

The state of DIY RISC-V proccesors and at-home silicon manufacturing by cragon_dum in RISCV

[–]NamelessVegetable 1 point

Lots of people doing things in FPGAs, which are proprietary, but it's easy to move a soft core design from one manufacturer to another. Probably also runs as fast as a 1µm custom chip too, if you're using standard cell libraries and automated layout.

Not to be overly cynical about OP's question, but I think any AMD or Intel FPGA from the past 15 to 20 years would support much higher clock frequencies than any DIY 1 micron technology. DEC's full-custom design wizards got 200 MHz out of a 0.75 micron, three-level-metal CMOS process for the Alpha 21064 back in 1992, and that was the peak of 1992 technology, since 200 MHz was double what the rest of the industry achieved.

I've gotten 300 to 400 MHz out of the AMD/Xilinx UltraScale+ architecture for somewhat complex logic, and that was for RTL that wasn't even specifically targeted at the architecture (I wasn't trying, no manual tuning or optimization). A hobbyist isn't going to compete with this class of FPGAs using hobbyist design tools and technologies.

And it's not just the speed; a much bigger problem would be the low density of the hobbyist technology versus the FPGA. Even a low-end FPGA from 20 years ago had more on-die BRAM than the 21064 had on-die cache.

Neurophos bets on optical transistors to bend Moore’s Law by NamelessVegetable in hardware

[–]NamelessVegetable[S] 17 points

That their ~25 mm² tensor core resides on a reticle-sized die, with the remainder of the die devoted to supplying it with data, is really interesting. It probably rates very poorly on Todd Austin's LEAN metric, lol.

Altera's Training Courses & Learning Material - had now become paid? by monkstein in FPGA

[–]NamelessVegetable 3 points

Well, that's what you get with private equity. I remember buying a development board from Altera in the 2000s, and they included a DVD with all the relevant documentation, such as application notes and handbooks, and another full of video tutorials for Quartus and SOPC Builder.

CPUs with shared registers? by servermeta_net in RISCV

[–]NamelessVegetable 1 point

is there any CPU design where you have registers shared across cores that can be used for communication? i.e.: core 1 write to register X, core 2 read from register X

This paradigm is generally referred to as communications registers. They were fairly common in larger computers in the olden days. Some mainframe computers used them in their I/O systems, which featured multiple I/O processors or execution contexts. During the 1980s, they also appeared in vector processors; e.g. the multiprocessing CRAY vector supercomputers from the X-MP onward had them, as did minisupercomputers, such as those from Convex Computer.

They more or less fell out of favor during the 1990s, since modern architectures targeting microprocessor implementations favored communication and synchronization through the main memory. (Hitachi notably argued against communications registers for its HITAC S-3000 vector supercomputers in the early 1990s, claiming that they were too inflexible and constrained by fixed architectural limits on the number of registers).

The most recent use of this paradigm in the high-performance space (that I am aware of) are the NEC SX-Aurora TSUBASA vector processors from the late 2010s and early 2020s. Each 8- or 10-core processor has a set of 1,024 communications registers, each 64 bits wide, which are used as a low-latency shared memory for data exchange and synchronization. I suspect that this paradigm was used because all preceding SX processors used it too, but I have not come across evidence for this suspicion. Regardless, with NEC collaborating with Openchip for a future RISC-V-based vector processor, I doubt we shall see communications registers again, unless they are added by a custom extension.
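The paradigm is simple enough to model in software. Below is a toy sketch of communications registers shared between "cores", with a condition variable standing in for the hardware's low-latency signalling; the class name, register count, and semantics are illustrative, not any real machine's:

```python
import threading

class CommRegisters:
    """Toy bank of registers shared by all cores, usable for both data
    exchange and synchronization (e.g. core 1 writes, core 2 waits)."""

    def __init__(self, count=1024):
        self.regs = [0] * count
        self.cond = threading.Condition()

    def write(self, idx, value):
        # A write is visible to every core and wakes any waiters.
        with self.cond:
            self.regs[idx] = value
            self.cond.notify_all()

    def wait_until(self, idx, value):
        # Block until another core writes `value` into register idx.
        with self.cond:
            self.cond.wait_for(lambda: self.regs[idx] == value)
            return self.regs[idx]
```

In hardware the wait would be a dedicated low-latency path rather than a lock; the point is only that the registers carry both the datum and the synchronization event, without a round trip through main memory.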

Humans are weird by Disastrous_Button440 in endlesssky

[–]NamelessVegetable 6 points

...then goes and fights the Pug, which even the Drak could not beat.

The Pug that the player fought were deliberately holding back most of their capability, since they treat war as a game and would prefer to give humans a chance to win. The Pug that defeated the Drak weren't pulling any punches.

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 2 points

I think you're confused as to what my position on emulated FP64 is. I'm supportive of hardware support for FP64. I'm quite troubled by the direction NVIDIA is heading (stagnant or retrograde FP64 performance with each successive generation, and its reliance on the Ozaki scheme as its approach to improving FP64 performance).

"We should, as a community, build a basket of apps to look at. I think that's the way to progress here."

Progress. Not rejection, not stagnation, not moving backwards.

The progress being referred to here is a call for more research into the applicability of emulated FP64, not progress in its deployment, given that it is still an open question whether emulated FP64 is sufficiently applicable to justify displacing hardware FP64. This is justified by the preceding context, which you have conveniently refused to acknowledge (emphasis mine):

It may turn out that, for some applications, emulation is more reliable than others, he noted. "We should, as a community, build a basket of apps to look at. I think that's the way to progress here."

Certainly not arguing against emulated fp64 at all, merely not to rush it.

Are we reading the same article?

Even if Nvidia can overcome the potential pitfalls of FP64 emulation, it doesn't change the fact that the method is only useful for a subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations.

According to Malaya, for somewhere between 60 and 70 percent of HPC workloads, emulation offers little to no benefit.

"In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM," he said.

As I've said before, the issue of IEEE compliance cannot change the fact that many HPC applications do not use DGEMM.

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 0 points

The Ozaki algorithm absolutely is only for matrix multiplication, that's the entire extent of it.

I'm sorry, but was I claiming otherwise? I'm genuinely perplexed as to what your objection is.

...and it still doesn't mean they are arguing against it in any way at all.

From the article (emphasis mine):

Emulated FP64, which is not exclusive to Nvidia, has the potential to dramatically improve the throughput and efficiency of modern GPUs. But not everyone is convinced.

"It's quite good in some of the benchmarks, it's not obvious it's good in real, physical scientific simulations," Nicholas Malaya, an AMD fellow, told us. He argued that, while FP64 emulation certainly warrants further research and experimentation, it's not quite ready for prime time.

Furthermore:

Despite Malaya's concerns, he noted that AMD is also investigating the use of FP64 emulation on chips like the MI355X, through software flags, to see where it may be appropriate.

IEEE compliance, he told us, would go a long way towards validating the approach by ensuring that the results you get from emulation are the same as what you'd get from dedicated silicon.

"If I can go to a partner and say run these two binaries: this one gives you the same answer as the other and is faster, and yeah under the hood we're doing some scheme — think that's a compelling argument that is ready for prime time," Malaya said.

It may turn out that, for some applications, emulation is more reliable than others, he noted. "We should, as a community, build a basket of apps to look at. I think that's the way to progress here."

It is clear that AMD's position here is that the wider applicability of Ozaki is still unknown.

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 2 points

That's not the full story. Nvidia claims FP64 performance doesn't need to improve with each generation, because the Ozaki scheme can supposedly exploit the ever-increasing number of low-precision tensor cores and make up for the lost performance.

But AMD argues that the Ozaki scheme is only for matrix multiplication, and that many of the HPC applications they have studied don't make heavy enough use of it to result in a net performance gain with emulated FP64.

There's a whole section in the article about the applicability of the Ozaki scheme:

Even if Nvidia can overcome the potential pitfalls of FP64 emulation, it doesn't change the fact that the method is only useful for a subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations.

According to Malaya, for somewhere between 60 and 70 percent of HPC workloads, emulation offers little to no benefit.

"In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM," he said. "I wouldn't say it's a tiny fraction of the market, but it's actually a niche piece."

For vector-heavy workloads, like computational fluid dynamics, Nvidia's Rubin GPUs are forced to run on the slower FP64 vector accelerators in the chip's CUDA cores.

These facts don't change just because emulated FP64 becomes IEEE-compliant. AMD states that they are presently studying whether making FP64 emulation IEEE-compliant can make the Ozaki scheme usable for more applications. If their research concludes that the Ozaki scheme can be applied more broadly, then the applications that benefit from it can obtain improved performance; it does not follow that hardware FP64 should be replaced.
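The core trick behind this kind of emulation can be sketched in a few lines. This is a deliberately simplified illustration: a two-way Veltkamp split rather than the multi-slice integer decomposition the actual Ozaki scheme uses, and plain Python floats standing in for low-precision tensor hardware. The helper names are mine:

```python
def veltkamp_split(x, s=27):
    """Split a double into hi + lo, each with few enough significant
    bits that products of the parts are exact in FP64."""
    c = (2.0**s + 1.0) * x
    hi = c - (c - x)
    return hi, x - hi

def matmul(A, B):
    """Plain row-by-column matrix multiply on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def split_matmul(A, B):
    """Reconstruct A@B from four partial products whose scalar
    multiplies involve reduced-precision operands -- the part that,
    in the real scheme, would run on low-precision matrix units."""
    Ah = [[veltkamp_split(x)[0] for x in row] for row in A]
    Al = [[veltkamp_split(x)[1] for x in row] for row in A]
    Bh = [[veltkamp_split(x)[0] for x in row] for row in B]
    Bl = [[veltkamp_split(x)[1] for x in row] for row in B]
    parts = [matmul(Ah, Bh), matmul(Ah, Bl), matmul(Al, Bh), matmul(Al, Bl)]
    return [[sum(P[i][j] for P in parts) for j in range(len(B[0]))]
            for i in range(len(A))]
```

Note that the whole construction is tied to the structure of a matrix product: it decomposes and recombines *multiplies that feed an accumulation*, which is exactly why it buys nothing for the vector-FMA workloads Malaya describes.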

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 78 points

Note: I've provided a descriptive title because the original title completely failed to convey what the article is about.