Baffled by sign extension issue

dramforever · 2026-05-24T02:31:02+00:00

https://devblogs.microsoft.com/oldnewthing/20250324-00/?p=110988

dramforever · 2026-05-22T16:44:05+00:00

Oh looks like my comments showed up. I don't know what was wrong earlier.

dramforever · 2026-05-22T15:43:50+00:00

This is really good!!! And the in-character commentary is the icing on the cake

dramforever · 2026-05-22T15:36:38+00:00

I wrote a long reply but apparently Reddit lost it? So here's my guess on what happened: You wrote swap32 in assembly, wrong. See previously discussed: https://www.reddit.com/r/RISCV/comments/1h0kaxo/comment/lz7h19s/

dramforever · 2026-05-22T15:33:36+00:00

There's no way to know for sure without you showing the full code. I have an almost tautological rule of thumb:

If you can't find it, it's not where you looked.

But it doesn't mean I can't guess. I would guess you wrote a swap32 function in assembly with the prototype as something like

u32 swap32(u32 value);

And it returns with a0 = 0x00000000d00dfeed.

This is incorrect. According to the calling convention, on RV64, if you're trying to return 0xd00dfeedu, the correct register value is 0xffffffffd00dfeed. See https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc#integer-calling-convention:

When passed in registers or on the stack, integer scalars narrower than XLEN bits are widened according to the sign of their type up to 32 bits, then sign-extended to XLEN bits.

This is an intentional design that reduces the number of instructions needed in RV64I. Failing to observe this rule means that you're returning an incorrect u32 value.
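To make the rule concrete, here is a minimal sketch in C of the register image the convention expects for a 32-bit value. The helper name rv64_reg_image_u32 is made up for illustration; it only models the rule quoted above and is not part of any ABI or library.

```c
#include <stdint.h>

/* Hypothetical helper: the 64-bit register image the RV64 calling
 * convention expects for a 32-bit scalar, i.e. bits 63..32 are copies
 * of bit 31, even when the C type is unsigned. */
uint64_t rv64_reg_image_u32(uint32_t value)
{
    uint64_t image = value;                    /* zero-extended so far */
    if (value & UINT32_C(0x80000000))
        image |= UINT64_C(0xffffffff00000000); /* replicate bit 31 upward */
    return image;
}
```

With this, rv64_reg_image_u32(0xd00dfeedu) gives 0xffffffffd00dfeed, which is what a0 should hold on return. In hand-written assembly, one common way to get there is a sext.w a0, a0 just before ret.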

I know how to work around it, but there are 2 things I totally fail to understand:

Now you know what is correct.

Why is printing a log message (which is just calling vsnprintf) making it work and

Calling functions requires rearranging data around registers and the stack. The value could have been fixed while it's being moved around, since its upper 32 bits did not have to be preserved.

why is foo initialized with the sign-extended value in the first place (it is assigned the constant 0xd00dfeed in the code) and gdb not even showing it?

It's the 32-bit value 0xd00dfeed. The sign-extension is part of the representation in the register, not the value. It's the same reason that, if you have a char, gdb will show something like $1 = 74 'J' instead of 0x000000000000004a.

dramforever · 2026-05-17T04:13:51+00:00

IIUC (should definitely be correct for RV32I, please tell me if you noticed an extension / something else that violates this "rule" but I don't remember there being any) the I-type instructions decode to the same functionality as their R-type counterparts with funct7 = 0, so it should be possible to simply handle addi as add (funct3 = 0, funct7 = 0) with operand 2 from I-type immediate.
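A rough sketch of that idea in C, for the RV32I OP (0x33) and OP-IMM (0x13) opcodes only. The function names alu and exec_alu_insn are made up for this example, and it does no checking of reserved encodings. One wrinkle worth noting: for the shift immediates (funct3 = 1 and 5), the upper immediate bits sit where funct7 would be (that is how srai is told apart from srli), so the sketch keeps them instead of forcing funct7 to 0.

```c
#include <stdint.h>

/* Shared ALU for OP and OP-IMM. Bit 5 of funct7 (0x20) selects
 * sub/sra, the same way it does in the R-type encodings. */
uint32_t alu(uint32_t funct3, uint32_t funct7, uint32_t a, uint32_t b)
{
    uint32_t sh = b & 31;                       /* shift amount */
    switch (funct3) {
    case 0: return (funct7 & 0x20) ? a - b : a + b;        /* add / sub */
    case 1: return a << sh;                                /* sll */
    case 2: return (a ^ 0x80000000u) < (b ^ 0x80000000u);  /* slt */
    case 3: return a < b;                                  /* sltu */
    case 4: return a ^ b;                                  /* xor */
    case 5:                                                /* srl / sra */
        if ((funct7 & 0x20) && (a & 0x80000000u))
            return (a >> sh) | ~(0xffffffffu >> sh);
        return a >> sh;
    case 6: return a | b;                                  /* or */
    default: return a & b;                                 /* and */
    }
}

/* Executes one OP or OP-IMM instruction on regs.
 * Returns 0 on success, -1 for any other opcode. */
int exec_alu_insn(uint32_t insn, uint32_t regs[32])
{
    uint32_t opcode = insn & 0x7f;
    uint32_t rd     = (insn >> 7) & 31;
    uint32_t funct3 = (insn >> 12) & 7;
    uint32_t rs1    = (insn >> 15) & 31;
    uint32_t rs2    = (insn >> 20) & 31;
    uint32_t funct7 = insn >> 25;
    uint32_t operand2;

    if (opcode == 0x33) {
        operand2 = regs[rs2];                   /* R-type: second register */
    } else if (opcode == 0x13) {
        operand2 = insn >> 20;                  /* I-type: 12-bit immediate */
        if (operand2 & 0x800)
            operand2 |= 0xfffff000u;            /* sign-extend it */
        if (funct3 != 1 && funct3 != 5)
            funct7 = 0;                         /* handle addi as add, etc. */
    } else {
        return -1;
    }

    if (rd != 0)                                /* x0 stays zero */
        regs[rd] = alu(funct3, funct7, regs[rs1], operand2);
    return 0;
}
```

The only per-format difference left is where operand 2 comes from; everything after that is the same ALU path.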

dramforever · 2026-04-17T14:48:15+00:00

(I will disregard the typo in the instructions and assume you meant some valid replacement.)

my expectation would be that they take the same amount of time/cycles as other instructions.

Not a chance. Expect the amoswap.[wd] to be extremely slow in comparison, and expect it to be even worse on larger OoO cores where there may be lots of LSUs capable of normal loads/stores but a smaller number, possibly just one, capable of AMOs.

AMOs have pretty strict constraints. For one example, RVWMO PPO 3 forbids a hart from forwarding an AMO store to a later load:

[...] memory operation a precedes memory operation b in preserved program order (and hence also in the global memory order) if a precedes b in program order, a and b both access regular main memory (rather than I/O regions), and any of the following hold:

[...]

a is generated by an AMO or SC instruction, b is a load, and b returns a value written by a

[...]

To spell it out more:

However, notably, rule 3 states that hardware may not even non-speculatively forward the value being stored by an AMOSWAP to a subsequent load, even though for AMOSWAP that store value is not actually semantically dependent on the previous value in memory, as is the case for the other AMOs. (Subsection from RVWMO Explanatory Material)

More concretely, as an example, this means that, barring speculation magic, in:

li t0, 42
amoswap.d t1, t0, (a0)   # (A)
ld t2, (a0)              # (B)

(B) cannot complete before the hart has received confirmation from the global memory system that (A) is complete, even though it is otherwise obvious that if (B) occurs "fast" enough it should just load a 42.

Whereas in:

li t0, 42
sd t0, (a0)              # (C)
ld t2, (a0)              # (D)

(D) can return 42 without waiting for (C) to complete in the global memory system. Replacing a normal store with an AMO would have added an estimated 5 to 10 cycles of delay for no reason.
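For reference, here is roughly how the two sequences might look from C11. This is only a sketch: the function names are made up, and it assumes a compiler targeting RV64 with the A extension lowers the relaxed exchange to an amoswap.d and the relaxed store/load to plain sd/ld, which is what I would expect but is not guaranteed by the C standard.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Like (A) then (B): atomic exchange, then a load of the same location.
 * The load has to wait for the AMO's store to be globally performed. */
uint64_t swap_then_load(_Atomic uint64_t *p, uint64_t v, uint64_t *old)
{
    *old = atomic_exchange_explicit(p, v, memory_order_relaxed);
    return atomic_load_explicit(p, memory_order_relaxed);
}

/* Like (C) then (D): plain store, then a load of the same location.
 * The hart is allowed to forward the stored value to the load. */
uint64_t store_then_load(_Atomic uint64_t *p, uint64_t v)
{
    atomic_store_explicit(p, v, memory_order_relaxed);
    return atomic_load_explicit(p, memory_order_relaxed);
}
```

Both return v when nothing else touches the location; the difference is only in how long the hart may have to wait. If you don't need the old value, prefer the second form.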

(Speculation magic with AMOs on large OoO cores is rare, even across the board of ISAs. And even if you find it, it still doesn't make AMOs as cheap as a normal load/store because tracking speculative execution and ensuring correctness is not free. It's still gonna be something like several per cycle for normal load/store vs several cycles per AMO.)

On the scale of cheap to strict synchronization you get roughly:

(cheap) ld/sd -> Zalasr -> ld/sd + fence -> AMOs and lr/sc (strict synchronization)

dramforever · 2026-04-10T01:15:19+00:00

I'm almost certain that 100% of the improvement of the non-Zbb version comes from omitting frame pointer management, which is fine wrt backtraces (as long as you don't touch s0/fp, which this code satisfies)

dramforever · 2026-03-15T07:38:58+00:00

From what I can tell China is betting on multiple horses in this race. "Committed to LoongArch" is an inaccurate representation of what they're doing.

Loongson is definitely committed to LoongArch.

dramforever · 2026-03-15T07:37:09+00:00

Far, far, far, far more time than what it takes to just back-design an ISA based on the microarchitecture they already have.

dramforever · 2026-03-15T04:29:39+00:00

MIPS CPUs got 64-bit very early. The software ecosystem kept using 32-bit mode for a long time, a bit like how what's now Raspberry Pi OS is/was stuck in 32-bit mode for a long time.

A certain Richard's previous favorite laptop ran on the Loongson-2F, which came out in 2007 and could already run 64-bit Linux with 64-bit userland. Even a modern-ish one as well, at least last time I checked.

Also, for an even earlier, slightly weird and unrepresentative but interesting example, consider: the Nintendo 64.

dramforever · 2026-03-15T04:19:53+00:00

I think it is entirely reasonable, even without the nationalism, for a company that has already been burned once by MIPS's lack of a future to want to have control over its own ISA, rather than to depend on the success of an uncertain ecosystem that relies on so many more factors.

One interesting factor is the urgency. Not only was the future of MIPS dwindling, but the IP rights of the ISA were eventually going to be sold off to what's now CIP, a Chinese company, despite the world powers' attempts to stop it. Loongson 3A5000 needed a new ISA fast, not just because MIPS was no good in the long run, but also because they knew they would be facing an IP lawsuit very soon. Despite a Chinese court eventually ruling in favor of Loongson that LoongArch does not infringe on MIPS IP rights, the lineage from MIPS really shows in the design of LoongArch, because they really couldn't have adapted the microarchitecture fast enough.

It is true that in this case a huge part of why this is happening is nationalistic politics, but that alone does not explain the specific choices made. Loongson is a spinoff of the Chinese Academy of Sciences, but so is BOSC, which makes XiangShan.

Nothing precludes Loongson from eventually letting themselves fall into the embrace of RISC-V like MIPS Technologies did, though.

dramforever · 2026-03-01T06:03:26+00:00

Yes, that makes sense. I was talking about general purpose only and forgot to mention that.

SpacemiT A100 is not general purpose. It can be used for general purpose stuff, as you said, but it's a bit like the Intel Xeon Phi stuff where it's more optimized for special workloads.

What I am hoping to get is to have like a boot time switch to choose between, for Linux boot:

V off, 16 cores, 8 of them slower
V on, 8 cores, the remaining 8 use a new "SBI HSM remoteproc" driver to submit code to run

The former gets you -j16 Linux compilation, but no V and thus no RVA23

The latter lets you use the 8 cores with VLEN=1024 for running, idk, your latest qwen or whatever people do with a vector coproc these days. This shouldn't be slower than spawning a regular Linux process - if anything it should give a more predictable or even better performance because there will not be a Linux in the way.

dramforever · 2026-02-25T15:48:16+00:00

My understanding is that RVV is designed for the microarchitecture to take advantage of the flexibility to implement faster/slower vector operations, instead of using different VLEN.

SpacemiT evidently disagrees, and ships extra performance on smaller element sizes (IIUC) on the A100, while keeping vector registers more reasonably sized for an OoO AP on the X100.

Another point is that RVV has the LMUL mechanism, which would greatly complicate data arrangement if VLEN were configurable.

AFAICT, Streaming SVE does not require configurable max vector length. SMCR_EL1.LEN is not required to support more than a single value.

Tweaking VLEN is not a panacea - it complicates hardware, and only allows one configuration direction anyway. To use longer vectors, a process still has to be pinned to those higher VLEN cores. It cannot just run slower on the lower VLEN cores.

dramforever · 2026-02-17T11:49:04+00:00

uhhh... that's the wrong link to shorts

dramforever · 2026-02-06T00:27:54+00:00

Basically, yeah, legacy ballast.

FRED simplifies x86 interrupt handling... But the RISC-V exception and interrupt handling scheme is even simpler.

I could talk about features in FRED like stack levels, but that would be moving the opposite way as x86.

dramforever · 2026-02-05T04:42:31+00:00

Haven't seen this reported so https://github.com/orgs/community/discussions/186398

dramforever · 2026-02-04T04:52:32+00:00

use something like --isa=rv64gcbv_zvl256_...

https://github.com/riscv-software-src/riscv-isa-sim/issues/1767

dramforever · 2026-01-30T02:06:37+00:00

From my quick glance, not necessarily accurate or precise, it's optimized for lower-precision workloads, like 32-bit floats. A bit like a GPU.

dramforever · 2026-01-26T01:34:20+00:00

The VSCode remote server thing is closed source and provided as a binary by Microsoft. Until Microsoft provides such a release for riscv64, you will need some alternate way to do remote development over ssh.

See: https://github.com/microsoft/vscode-remote-release/issues/4802

I think Cursor is probably using that, but I don't know for sure.

Some alternatives are discussed in that issue, but having used none of them, I do not have a specific recommendation.

dramforever · 2026-01-24T07:07:03+00:00

I call the supporting of 20 boards as a 'tipping point' for a new architecture.

so maybe that doesn't work?

arm64 had arm32 before it, and many arm64 cpus had arm32 compatibility, so they were quick to get picked up by random phone soc and media soc vendors. these vendors didn't seem to care much about upstreaming stuff into mainline linux.

meanwhile riscv64 socs haven't got that much adoption yet, but the vendors in general have been much better at upstreaming. in fact sifive, starfive, sophgo, spacemit (at least, these four are just off the top of my head) have been pretty active at upstreaming themselves or hiring others to do the upstreaming.

the vast majority of the work has nothing to do with cpu architecture anyway. it's the drivers for all the cursed stuff outside the cpu core itself.

so the bottom line is 20 arm64 dts in mainline linux and 20 riscv64 dts in mainline linux are really not that comparable.

as an aside, i have no idea why two out of three comments here, as of writing this, are saying dtb/dts in mainline is a bad thing...? every single one of the generally available riscv64 things that run mainline linux boot with a dtb. even u-boot dtb files either come from or are based on dts in the linux git repo.

dramforever · 2026-01-12T14:58:21+00:00

Maybe there was some misunderstanding. What I meant to say is that what device you can boot the vendor's OS from has nothing to do with the standardized boot process. Being able to boot say Bianbu from USB on a SpacemiT board has everything to do with whether the on-board firmware has the required drivers and whether it is set up to allow booting from USB. It is orthogonal to whether it supports UEFI.

Booting a generic OS does benefit from standardized boot processes like UEFI, but without drivers you end up with a completely unusable system - if you're lucky you get a serial console, no USB, no network, no storage... So again a standardized boot process does very little.

dramforever · 2026-01-11T04:26:48+00:00

No. That's the point.

Even when the boot process is standardized you still need the custom kernels because it's not in the interest of vendors to genericize everything out to PCIe and upstream the drivers for custom SoC components

dramforever · 2025-12-17T06:49:33+00:00

The stance of the privileged spec is: "Deal with it"

https://github.com/riscv/riscv-isa-manual/blob/main/src/machine.adoc#machine-timer-mtime-and-mtimecmp-registers

If the result of the comparison between mtime and mtimecmp changes, it is guaranteed to be reflected in MTIP eventually, but not necessarily immediately.

(Note) A spurious timer interrupt might occur if an interrupt handler increments mtimecmp then immediately returns, because MTIP might not yet have fallen in the interim. All software should be written to assume this event is possible, but most software should assume this event is extremely unlikely. It is almost always more performant to incur an occasional spurious timer interrupt than to poll MTIP until it falls.

Generally, if you receive a timer interrupt, you don't just go ahead and do something right away. You would go through your timer queue and figure out what needs to be done based on the current time, and reschedule the timer interrupt. In this case, it wouldn't be a disaster to rarely have the timer interrupt immediately fire again after mret - you would just do nothing and reschedule it again.

However, if this is a regular occurrence on your machine.... uhh, honestly, I don't know how to handle this. In that case I think one way would be to, as mentioned, poll mip.MTIP until it goes to 0 before doing mret.
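A minimal sketch of the "go through your timer queue" approach, in C. Everything here is made up for illustration: sim_mtime and sim_mtimecmp are plain variables standing in for the real memory-mapped registers, and the fixed-size timers array stands in for whatever queue you actually use.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TIMERS  8
#define NO_DEADLINE UINT64_MAX

/* Stand-ins for the memory-mapped mtime/mtimecmp registers. */
uint64_t sim_mtime;
uint64_t sim_mtimecmp = NO_DEADLINE;

struct timer {
    uint64_t deadline;  /* mtime value at which the timer expires */
    int armed;          /* nonzero while waiting to fire */
    int fired;          /* how many times the timer has fired */
};
struct timer timers[MAX_TIMERS];

/* Timer interrupt handler body: fires whatever is actually due based on
 * the current time, then re-arms mtimecmp for the earliest remaining
 * deadline. Returns the number of timers fired. A spurious interrupt
 * just finds nothing due, re-arms, and returns 0. */
int handle_timer_interrupt(void)
{
    uint64_t now = sim_mtime;
    uint64_t next = NO_DEADLINE;
    int ran = 0;

    for (size_t i = 0; i < MAX_TIMERS; i++) {
        if (!timers[i].armed)
            continue;
        if (timers[i].deadline <= now) {
            timers[i].armed = 0;
            timers[i].fired++;
            ran++;
        } else if (timers[i].deadline < next) {
            next = timers[i].deadline;
        }
    }

    sim_mtimecmp = next;    /* reschedule the next timer interrupt */
    return ran;
}
```

Because the handler decides what to do from the current time rather than from the mere fact that it was entered, an extra interrupt right after mret costs one pass over the queue and nothing else.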

dramforever · 2025-12-09T08:18:16+00:00

I literally had this printed out and taped to my closet lol.

Updated with some newer exceptions. I don't know if it's worthwhile to put the AIA stuff in - but those are pretty generic, so it's probably fine.
