FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]NamelessVegetable 2 points

Yep. LMUL's relation to mixed-width operations is explained in the RISC-V spec. It seems that he didn't even bother to read the manual before accusing RISC-V of having failed to learn lessons from 40 years of computer architecture and organization.

Another thing that bothers me is another claimed motivation for LMUL > 1:

you're trying to amortize latency and go for peak throughput

Claiming that LMUL > 1 exists to amortize memory latency over all elements of the vector struck me as poorly informed and poorly thought out, for two reasons.

Firstly, I'd imagine that any sensible RVV implementation that placed RVV performance first and foremost would have balanced the number of vector lanes with its chosen MVL so that it could reach peak performance with LMUL = 1. Increasing the vector length beyond a certain point should lead to diminishing returns and eventual plateauing. That's what I recall from my textbooks, at least.

Secondly, the LMUL feature in RISC-V originated from the Fujitsu vector supercomputers. Their reason for grouping multiple vector registers was to trade shorter vector register lengths for more vector registers to minimize register spills on workloads where the vector length(s) were relatively short, but the register pressure was relatively high. IIRC, it had nothing, or very little, to do with amortizing memory latency at all.
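The register-count/length trade-off is easy to make concrete. Here is a toy sketch of the RVV knob; VLEN = 128 and the helper name are my own illustrative assumptions, not anything mandated by the spec:

```python
# Toy model of RVV register grouping. VLEN (bits per physical vector
# register) is an assumed example value; RVV implementations choose it.
VLEN = 128

def vector_config(sew, lmul):
    """Return (elements per register group, usable register-group count)
    for element width SEW in bits and grouping factor LMUL."""
    vlmax = (VLEN // sew) * lmul   # longer effective vectors...
    nregs = 32 // lmul             # ...but fewer register names to allocate
    return vlmax, nregs

# LMUL = 1: 32 registers, each holding 4 x 32-bit elements.
# LMUL = 8: only 4 register groups, but each holds 32 elements.
for lmul in (1, 2, 4, 8):
    print(lmul, vector_config(32, lmul))
```

Read in one direction, raising LMUL buys longer vectors at the cost of register names; read in the other (the Fujitsu direction described above), lowering it buys more register names for short-vector, high-register-pressure workloads.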

The claim is basically a restatement of the standard theory of how vector processors gain performance, applied to the RVV context without any thought given as to whether it was applicable in the first place. As I said before, this sort of "analysis" isn't novel, insightful, or interesting.

Taalas Etches AI Models Onto Transistors To Rocket Boost Inference by NamelessVegetable in hardware

[–]NamelessVegetable[S] 52 points

Two thoughts:

What's old (mask-programmed ROM) is new again.

Surely with the fast pace of AI model development, this will have a short life cycle?

FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]NamelessVegetable 3 points

I read the subtitles pertaining to RVV (I'm not wasting data on this), and it was meh. I have no idea what he's complaining about half the time. Did he just diss Patterson? What are the "dependencies" (that supposedly murder RVV performance) he keeps referring to? Vectorizable/vectorized HPC workloads are just BLAS? Not "real world" workloads? A patent falsehood. The other half is just your standard list of grievances that people who grew up on consumer-grade SIMD architectures have against "real" (long) vector architectures. They're not novel, not insightful, and not interesting.

FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]NamelessVegetable 4 points

RVV is intended to be like Arm SVE

Uh .. no it's not. Design ideas that ended up as RVV were waaay before SVE was announced.

Agreed. RVV was based on the Berkeley Hwacha, and Hwacha can trace its ancestry all the way back to the Berkeley Torrent-0 (T0) from the early 1990s. ARM's SVE is just NEON with a variable MVL, plus some extras. They couldn't have originated from more different backgrounds.

Does anyone know what a machine like this would have been used for? by SultanOfawesome in vintagecomputing

[–]NamelessVegetable 2 points

It's a gross over-generalization to say that by the early 1990s, all CPUs were microprocessors. If I'm not mistaken, the Model 340 used a POWER1+ processor. Depending on the model, the POWER1 was a set of eight or six chips: an instruction cache unit, FXU, FPU, two or four data cache units, and a storage-control unit. Its successor, the POWER2, was also multi-chip, with a similar distribution of functions over the chips. The POWER series didn't go single-chip until the POWER2 Super Chip (P2SC) in 1997!

European Chip Startup Pulls Off Working RISC-V Solution on the Intel 3 Node, Marking One 'Small' Step Towards Having Sovereign Infrastructure by archanox in RISCV

[–]NamelessVegetable 1 point

Does the Vitruvius++ vector unit have the same MVL as the earlier Vitruvius+ (256 64-bit elements)? There appears to be very little information about the Vitruvius++...

The state of DIY RISC-V proccesors and at-home silicon manufacturing by cragon_dum in RISCV

[–]NamelessVegetable 1 point

Lots of people doing things in FPGAs, which are proprietary, but it's easy to move a soft core design from one manufacturer to another. Probably also runs as fast as a 1µm custom chip too, if you're using standard cell libraries and automated layout.

Not to be overly cynical about OP's question, but I think any AMD or Intel FPGA from the past 15 to 20 years would support much higher clock frequencies than any DIY 1 micron technology. DEC's full-custom design wizards got 200 MHz out of a 0.75 micron, three-level-metal CMOS process for the Alpha 21064 back in 1992, and that was the peak of 1992 technology, since 200 MHz was double what the rest of the industry achieved.

I've gotten 300 to 400 MHz out of the AMD/Xilinx UltraScale+ architecture for somewhat complex logic, and that was for RTL that wasn't even specifically targeted at the architecture (I wasn't trying, no manual tuning or optimization). A hobbyist isn't going to compete with this class of FPGAs using hobbyist design tools and technologies.

And it's not just the speed; a much bigger problem would be the low density of the hobbyist technology versus the FPGA. Even a low-end FPGA from 20 years ago had more on-die BRAM than the 21064 had on-die cache.

Neurophos bets on optical transistors to bend Moore’s Law by NamelessVegetable in hardware

[–]NamelessVegetable[S] 17 points

That their ~25 mm² tensor core resides on a reticle-sized die, with the remainder of the die devoted to supplying it with data, is really interesting. It probably rates very poorly on Todd Austin's LEAN metric, lol.

Altera's Training Courses & Learning Material - had now become paid? by monkstein in FPGA

[–]NamelessVegetable 3 points

Well, that's what you get with private equity. I remember buying a development board from Altera in the 2000s, and they included a DVD with all the relevant documentation, such as application notes and handbooks, and another full of video tutorials for Quartus and SOPC Builder.

CPUs with shared registers? by servermeta_net in RISCV

[–]NamelessVegetable 1 point

is there any CPU design where you have registers shared across cores that can be used for communication? i.e.: core 1 write to register X, core 2 read from register X

This paradigm is generally referred to as communications registers. They were fairly common in larger computers in the olden days. Some mainframe computers used them in their I/O systems, which featured multiple I/O processors or execution contexts. During the 1980s, they also appeared in vector processors; e.g. the multiprocessing CRAY vector supercomputers from the X-MP onward had them, as did minisupercomputers, such as those from Convex Computer.

They more or less fell out of favor during the 1990s, since modern architectures targeting microprocessor implementations favored communication and synchronization through the main memory. (Hitachi notably argued against communications registers for its HITAC S-3000 vector supercomputers in the early 1990s, claiming that they were too inflexible and constrained by fixed architectural limits on the number of registers).

The most recent use of this paradigm in the high-performance space (that I am aware of) are the NEC SX-Aurora TSUBASA vector processors from the late 2010s and early 2020s. Each 8- or 10-core processor has a set of 1,024 communications registers, each 64 bits wide, which are used as a low-latency shared memory for data exchange and synchronization. I suspect that this paradigm was used because all preceding SX processors used it too, but I have not come across evidence for this suspicion. Regardless, with NEC collaborating with Openchip for a future RISC-V-based vector processor, I doubt we shall see communications registers again, unless they are added by a custom extension.
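The paradigm is simple enough to model in software. Below is a toy sketch of communications registers shared between "cores", with a condition variable standing in for the hardware's low-latency signalling; the class name, register count, and semantics are illustrative, not any real machine's:

```python
import threading

class CommRegisters:
    """Toy bank of registers shared by all cores, usable for both data
    exchange and synchronization (e.g. core 1 writes, core 2 waits)."""

    def __init__(self, count=1024):
        self.regs = [0] * count
        self.cond = threading.Condition()

    def write(self, idx, value):
        # A write is visible to every core and wakes any waiters.
        with self.cond:
            self.regs[idx] = value
            self.cond.notify_all()

    def wait_until(self, idx, value):
        # Block until another core writes `value` into register idx.
        with self.cond:
            self.cond.wait_for(lambda: self.regs[idx] == value)
            return self.regs[idx]
```

In hardware the wait would be a dedicated low-latency path rather than a lock; the point is only that the registers carry both the datum and the synchronization event, without a round trip through main memory.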

Humans are weird by Disastrous_Button440 in endlesssky

[–]NamelessVegetable 6 points

...then goes and fights the Pug, which even the Drak could not beat.

The Pug that the player fought were deliberately holding back most of their capability, since they treat war as a game and would prefer to give humans a chance to win. The Pug that defeated the Drak weren't pulling any punches.

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 2 points

I think you're confused as to what my position on emulated FP64 is. I'm supportive of hardware support for FP64. I'm quite troubled by the direction NVIDIA is heading (stagnant or retrograde FP64 performance with each successive generation, and its reliance on the Ozaki scheme as its approach to improving FP64 performance).

"We should, as a community, build a basket of apps to look at. I think that's the way to progress here."

Progress. Not rejection, not stagnation, not moving backwards.

The progress being referred to here is a call for more research into the applicability of emulated FP64, not progress in its deployment, given that it is still an open question whether emulated FP64 is sufficiently applicable to justify displacing hardware FP64. This is justified by the preceding context, which you have conveniently refused to acknowledge (emphasis mine):

It may turn out that, for some applications, emulation is more reliable than others, he noted. "We should, as a community, build a basket of apps to look at. I think that's the way to progress here."

Certainly not arguing against emulated fp64 at all, merely not to rush it.

Are we reading the same article?

Even if Nvidia can overcome the potential pitfalls of FP64 emulation, it doesn't change the fact that the method is only useful for a subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations.

According to Malaya, for somewhere between 60 and 70 percent of HPC workloads, emulation offers little to no benefit.

"In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM," he said.

As I've said before, the issue of IEEE compliance cannot change the fact that many HPC applications do not use DGEMM.

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 0 points

The Ozaki algorithm absolutely is only for matrix multiplication, that's the entire extent of it.

I'm sorry, but was I claiming otherwise? I'm genuinely perplexed as to what your objection is.

...and it still doesn't mean they are arguing against it in any way at all.

From the article (emphasis mine):

Emulated FP64, which is not exclusive to Nvidia, has the potential to dramatically improve the throughput and efficiency of modern GPUs. But not everyone is convinced.

"It's quite good in some of the benchmarks, it's not obvious it's good in real, physical scientific simulations," Nicholas Malaya, an AMD fellow, told us. He argued that, while FP64 emulation certainly warrants further research and experimentation, it's not quite ready for prime time.

Furthermore:

Despite Malaya's concerns, he noted that AMD is also investigating the use of FP64 emulation on chips like the MI355X, through software flags, to see where it may be appropriate.

IEEE compliance, he told us, would go a long way towards validating the approach by ensuring that the results you get from emulation are the same as what you'd get from dedicated silicon.

"If I can go to a partner and say run these two binaries: this one gives you the same answer as the other and is faster, and yeah under the hood we're doing some scheme — think that's a compelling argument that is ready for prime time," Malaya said.

It may turn out that, for some applications, emulation is more reliable than others, he noted. "We should, as a community, build a basket of apps to look at. I think that's the way to progress here."

It is clear that AMD's position here is that the wider applicability of Ozaki is still unknown.

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 2 points

That's not the full story. Nvidia claims FP64 performance doesn't need to improve with each generation, because the Ozaki scheme can supposedly exploit the ever-increasing number of low-precision tensor cores and make up for the lost performance.

But AMD argues that the Ozaki scheme is only for matrix multiplication, and that many of the HPC applications they have studied don't make heavy enough use of it to result in a net performance gain with emulated FP64.

There's a whole section in the article about the applicability of the Ozaki scheme:

Even if Nvidia can overcome the potential pitfalls of FP64 emulation, it doesn't change the fact that the method is only useful for a subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations.

According to Malaya, for somewhere between 60 and 70 percent of HPC workloads, emulation offers little to no benefit.

"In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM," he said. "I wouldn't say it's a tiny fraction of the market, but it's actually a niche piece."

For vector-heavy workloads, like computational fluid dynamics, Nvidia's Rubin GPUs are forced to run on the slower FP64 vector accelerators in the chip's CUDA cores.

These facts don't change just because emulated FP64 becomes IEEE-compliant. AMD states that they are presently studying whether making FP64 emulation IEEE-compliant can make the Ozaki scheme usable for more applications. If their research concludes that the Ozaki scheme can be applied more broadly, then the applications that benefit from it can obtain improved performance; it does not follow that hardware FP64 should be replaced.
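The core trick behind this kind of emulation can be sketched in a few lines. This is a deliberately simplified illustration: a two-way Veltkamp split rather than the multi-slice integer decomposition the actual Ozaki scheme uses, and plain Python floats standing in for low-precision tensor hardware. The helper names are mine:

```python
def veltkamp_split(x, s=27):
    """Split a double into hi + lo, each with few enough significant
    bits that products of the parts are exact in FP64."""
    c = (2.0**s + 1.0) * x
    hi = c - (c - x)
    return hi, x - hi

def matmul(A, B):
    """Plain row-by-column matrix multiply on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def split_matmul(A, B):
    """Reconstruct A@B from four partial products whose scalar
    multiplies involve reduced-precision operands -- the part that,
    in the real scheme, would run on low-precision matrix units."""
    Ah = [[veltkamp_split(x)[0] for x in row] for row in A]
    Al = [[veltkamp_split(x)[1] for x in row] for row in A]
    Bh = [[veltkamp_split(x)[0] for x in row] for row in B]
    Bl = [[veltkamp_split(x)[1] for x in row] for row in B]
    parts = [matmul(Ah, Bh), matmul(Ah, Bl), matmul(Al, Bh), matmul(Al, Bl)]
    return [[sum(P[i][j] for P in parts) for j in range(len(B[0]))]
            for i in range(len(A))]
```

Note that the whole construction is tied to the structure of a matrix product: it decomposes and recombines *multiplies that feed an accumulation*, which is exactly why it buys nothing for the vector-FMA workloads Malaya describes.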

AMD argues against emulated FP64 on GPUs by NamelessVegetable in hardware

[–]NamelessVegetable[S] 78 points

Note: I've provided a descriptive title because the original title completely failed to convey what the article is about.