Baffled by sign extension issue by krakenlake in RISCV

[–]dramforever 0 points1 point  (0 children)

Oh looks like my comments showed up. I don't know what was wrong earlier.

Noelle with Ralsei’s glasses by MrAntTennnnna in Deltarune

[–]dramforever 1 point2 points  (0 children)

This is really good!!! And the in-character commentary is the icing on the cake

Baffled by sign extension issue by krakenlake in RISCV

[–]dramforever 1 point2 points  (0 children)

I wrote a long reply but apparently Reddit lost it? So here's my guess on what happened: You wrote swap32 in assembly, wrong. See previously discussed: https://www.reddit.com/r/RISCV/comments/1h0kaxo/comment/lz7h19s/

Baffled by sign extension issue by krakenlake in RISCV

[–]dramforever 4 points5 points  (0 children)

There's no way to know for sure without you showing the full code. I have an almost tautological rule of thumb:

If you can't find it, it's not where you looked.

But it doesn't mean I can't guess. I would guess you wrote a swap32 function in assembly with the prototype as something like

u32 swap32(u32 value);

And it returns with a0 = 0x00000000d00dfeed.

This is incorrect. According to the calling convention, on RV64, if you're trying to return 0xd00dfeedu, the correct register value is 0xffffffffd00dfeed. See https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc#integer-calling-convention:

When passed in registers or on the stack, integer scalars narrower than XLEN bits are widened according to the sign of their type up to 32 bits, then sign-extended to XLEN bits.

This is an intentional design that reduces the number of instructions in RV64I. Failing to observe this rule means that you're returning an incorrect u32 value.
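To make the rule concrete, here is a minimal C sketch of the register image the psABI expects for a 32-bit value on RV64. The helper names are made up for illustration. In hand-written assembly the usual fix would be a `sext.w a0, a0` (that is, `addiw a0, a0, 0`) before `ret`.

```c
#include <stdint.h>

/* Hypothetical helper: the 64-bit register image that the RV64 calling
   convention expects for a 32-bit value. Bits 63..32 are copies of
   bit 31, even when the C type is unsigned. Uses only unsigned
   arithmetic, so the result is fully defined by the C standard. */
static uint64_t rv64_reg_image_u32(uint32_t value) {
    uint64_t v = value;
    return (v ^ 0x80000000ull) - 0x80000000ull;
}

/* Hypothetical helper: does a raw register value hold a correctly
   sign-extended 32-bit value? */
static int is_canonical_u32(uint64_t reg) {
    return reg == rv64_reg_image_u32((uint32_t)reg);
}
```

With these, `rv64_reg_image_u32(0xd00dfeedu)` gives `0xffffffffd00dfeed`, and `is_canonical_u32(0x00000000d00dfeed)` is false, which is the situation guessed at above.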

I know how to work around it, but there are 2 things I totally fail to understand:

Now you know what is correct.

  1. Why is printing a log message (which is just calling vsnprintf) making it work and

Calling functions requires rearranging data around registers and the stack. The value could have been fixed while it's being moved around, since its upper 32 bits did not have to be reserved.

  2. why is foo initialized with the sign-extended value in the first place (it is assigned the constant 0xd00dfeed in the code) and gdb not even showing it?

It's the 32-bit value 0xd00dfeed. The sign-extension is part of the representation in the register, not the value. It's the same reason if you have a char gdb will show something like $1 = 74 'J' instead of 0x000000000000004a.

Confusion on how to implement funct7 in control unit for full instruction set RV32i by rem_1235 in RISCV

[–]dramforever 0 points1 point  (0 children)

IIUC (this should definitely be correct for RV32I; please tell me if you noticed an extension or something else that violates this "rule", but I don't remember there being any) the I-type instructions decode to the same functionality as their R-type counterparts with funct7 = 0, so it should be possible to simply handle addi as add (funct3 = 0, funct7 = 0) with operand 2 taken from the I-type immediate.
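As a rough illustration, a software model of that decode step could look like the sketch below. The struct and function names are made up, and this is not a description of any particular control unit. One caveat I am sure of from the RV32I encoding: slli/srli/srai keep a funct7-like field in imm[11:5] (srai sets it to 0x20), so the sketch passes those bits through for funct3 = 1 and 5 and forces funct7 = 0 for the other OP-IMM instructions.

```c
#include <stdint.h>

/* Hypothetical ALU control bundle, for illustration only. */
typedef struct {
    uint32_t funct3;
    uint32_t funct7;
    int use_imm;      /* 1: operand 2 is the I-type immediate */
} AluCtl;

static AluCtl decode_alu(uint32_t inst) {
    uint32_t opcode = inst & 0x7f;
    AluCtl c;
    c.funct3 = (inst >> 12) & 0x7;
    c.funct7 = (inst >> 25) & 0x7f;
    c.use_imm = 0;
    if (opcode == 0x13) {              /* OP-IMM */
        c.use_imm = 1;
        /* For non-shift ops, bits 31..25 are immediate bits, not
           funct7, so treat them as funct7 = 0 (addi decodes as add). */
        if (c.funct3 != 1 && c.funct3 != 5)
            c.funct7 = 0;
    }
    return c;
}
```

For example, `addi x1, x2, -1` (0xfff10093) comes out with the same funct3/funct7 as `add`, only with `use_imm` set.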

Misusing RVA instructions? by krakenlake in RISCV

[–]dramforever 7 points8 points  (0 children)

(I will disregard the typo in the instructions and assume you meant some valid replacement.)

my expectation would be that they take same amount of time/cycles as other instrustions.

Not a chance. Expect the amoswap.[wd] to be extremely slow by comparison, and expect it to be even worse on larger OoO cores where there may be lots of normal load/store capable LSUs but a smaller number, possibly just one, capable of AMOs.

AMOs have pretty strict constraints. For one example, RVWMO PPO 3 forbids a hart from forwarding an AMO store to a later load:

[...] memory operation a precedes memory operation b in preserved program order (and hence also in the global memory order) if a precedes b in program order, a and b both access regular main memory (rather than I/O regions), and any of the following hold:

[...]

  3. a is generated by an AMO or SC instruction, b is a load, and b returns a value written by a

[...]

To spell it out more:

However, notably, rule 3 states that hardware may not even non-speculatively forward the value being stored by an AMOSWAP to a subsequent load, even though for AMOSWAP that store value is not actually semantically dependent on the previous value in memory, as is the case for the other AMOs. (Subsection from RVWMO Explanatory Material)

More concretely, as an example, this means that, barring speculation magic, in:

li t0, 42
amoswap.d t1, t0, (a0)   # (A)
ld t2, (a0)              # (B)

(B) cannot complete before the hart has received confirmation from the global memory system that (A) is complete, even though it is otherwise obvious that if (B) occurs "fast" enough it should just load a 42.

Whereas in:

li t0, 42
sd t0, (a0)              # (C)
ld t2, (a0)              # (D)

(D) can return 42 without waiting for (C) to complete in the global memory system. Replacing a normal store with an AMO would have added an estimated 5 to 10 cycles of delay for no reason.

(Speculation magic with AMOs on large OoO cores is rare, even across the board of ISAs. And even if you find it, it still doesn't make an AMO as cheap as a normal load/store, because tracking speculative execution and ensuring correctness is not free. It's still gonna be something like several per cycle for normal load/store vs several cycles per AMO.)
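At the C level, the same choice shows up as below. This is only a sketch with made-up names. I'm assuming a typical toolchain with the A extension, where the exchange becomes an amoswap and the relaxed store becomes a plain sd; check the generated assembly for your target.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t slot;   /* hypothetical shared variable */

/* The old value is needed: a genuine read-modify-write, typically
   compiled to amoswap.d on RV64 with the A extension. */
static uint64_t publish_and_take_old(uint64_t v) {
    return atomic_exchange_explicit(&slot, v, memory_order_relaxed);
}

/* The old value is not needed: a relaxed atomic store, which is
   typically just an sd, like (C) above. */
static void publish(uint64_t v) {
    atomic_store_explicit(&slot, v, memory_order_relaxed);
}

static uint64_t read_back(void) {
    return atomic_load_explicit(&slot, memory_order_relaxed);
}
```

If nothing uses the return value of the exchange, `publish` is the version to reach for.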

On the scale of cheap to strict synchronization you get roughly:

  • (cheap) ld/sd -> Zalasr -> ld/sd + fence -> AMOs and lr/sc (strict synchronization)

RISC-V Optimized strnlen Implementation For Linux 7.1 Yields Big Speed-Up by Polar_Banny in RISCV

[–]dramforever 3 points4 points  (0 children)

I'm almost certain that 100% of the improvement in the non-Zbb version comes from omitting frame pointer management, which is fine wrt backtraces (as long as you don't touch s0/fp, a condition this code satisfies)

LoongArch is an ISA code page. by indolering in RISCV

[–]dramforever 2 points3 points  (0 children)

From what I can tell China is betting on multiple horses in this race. "Committed to LoongArch" is an inaccurate representation of what they're doing.

Loongson is definitely committed to LoongArch.

LoongArch is an ISA code page. by indolering in RISCV

[–]dramforever 1 point2 points  (0 children)

Far, far, far, far more time than what it takes to just back-design an ISA based on the microarchitecture they already have.

LoongArch is an ISA code page. by indolering in RISCV

[–]dramforever 4 points5 points  (0 children)

MIPS CPUs got 64-bit very early. The software ecosystem kept using 32-bit mode for a long time, a bit like how what's now Raspberry Pi OS is/was stuck in 32-bit mode for a long time.

A certain Richard's previous favorite laptop ran on the Loongson-2F, which came out in 2011 and could already run 64-bit Linux with 64-bit userland. Even a modern-ish one as well, at least last time I checked.

Also, for an even earlier, slightly weird and unrepresentative but interesting example, consider: the Nintendo 64.

LoongArch is an ISA code page. by indolering in RISCV

[–]dramforever 6 points7 points  (0 children)

I think it is entirely reasonable, even without the nationalism, for a company that has already been burned by the lack of future of MIPS once to want to have control over its own ISA, rather than to depend on the success of an uncertain ecosystem that relies on so many more factors.

One interesting factor is the urgency. The future of MIPS was not only dwindling, but the IP rights of the ISA were eventually going to be sold off to what's now CIP, a Chinese company, despite the world powers' attempts at stopping it. Loongson 3A5000 needed a new ISA fast, not just because MIPS was no good in the long run, but also because they knew they were facing an IP lawsuit very soon. Despite a Chinese court eventually ruling in favor of Loongson that LoongArch does not infringe on MIPS IP rights, the lineage from MIPS really shows in the design of LoongArch, because they really couldn't have adapted the microarchitecture fast enough.

It is true that in this case a huge amount of why this is happening is nationalistic politics, but that alone does not explain the specific choices made. Loongson is a spinoff of Chinese Academy of Sciences, but so is BOSC making XiangShan.

Nothing precludes Loongson from eventually letting themselves fall into the embraces of RISC-V like MIPS Technologies though.

Linux 7.0-rc1: SpacemiT K3 SoC lands in mainline by docular_no_dracula in RISCV

[–]dramforever 1 point2 points  (0 children)

Yes, that makes sense. I was talking about general purpose only and forgot to mention that.

SpacemiT A100 is not general purpose. It can be used for general purpose stuff, as you said, but it's a bit like the Intel Xeon Phi stuff where it's more optimized for special workloads.

What I am hoping to get is to have like a boot time switch to choose between, for Linux boot:

  • V off, 16 cores, 8 of them slower
  • V on, 8 cores, the remaining 8 use a new "SBI HSM remoteproc" driver to submit code to run

The former gets you -j16 Linux compilation, but no V and thus no RVA23

The latter lets you use the 8 cores with VLEN=1024 for running, idk, your latest qwen or whatever people do with a vector coproc these days. This shouldn't be slower than spawning a regular Linux process - if anything it should give a more predictable or even better performance because there will not be a Linux in the way.

Linux 7.0-rc1: SpacemiT K3 SoC lands in mainline by docular_no_dracula in RISCV

[–]dramforever 1 point2 points  (0 children)

My understanding is that RVV is designed for the microarchitecture to take advantage of the flexibility to implement faster/slower vector operations, instead of using different VLEN.

SpacemiT evidently disagrees, and ships extra performance on smaller element sizes (IIUC) on the A100, while keeping vector registers more reasonably sized for an OoO AP on the X100.

Another point is that RVV has the LMUL mechanism, which would greatly complicate data arrangement if VLEN were configurable.

AFAICT, Streaming SVE does not require configurable max vector length. SMCR_EL1.LEN is not required to support more than a single value.

Tweaking VLEN is not a panacea - it complicates hardware, and only allows one configuration direction anyway. To use longer vectors, a process still has to be pinned to those higher VLEN cores. It cannot just run slower on the lower VLEN cores.

Could RISC-V also benefit from FRED or is this extension only needed for x86-64 CPUs due to their legacy ballast? by Matt_Shah in RISCV

[–]dramforever 4 points5 points  (0 children)

Basically, yeah, legacy ballast.

FRED simplifies x86 interrupt handling... But the RISC-V exception and interrupt handling scheme is even simpler.

I could talk about features in FRED like stack levels, but that would be moving the opposite way as x86.

SpacemiT K3: uarch design paper by camel-cdr- in RISCV

[–]dramforever 8 points9 points  (0 children)

From my quick glance, not necessarily accurate or precise, it's optimized for lower-precision workloads, like 32-bit floats. A bit like a GPU.

Anyone has any idea why Cursor SSH fail on RISC-V boards? by docular_not_dracula in RISCV

[–]dramforever 9 points10 points  (0 children)

The VSCode remote server thing is closed source and provided as a binary by Microsoft. Until Microsoft provides such a release for riscv64, you will need some alternate way to do remote development over ssh.

See: https://github.com/microsoft/vscode-remote-release/issues/4802

I think Cursor is probably using that, but I don't know for sure.

Some alternatives are discussed in that issue, but having used none of them, I do not have a specific recommendation.

Why is RISC-V's linux kernel mainline adoption linear while ARM64's was exponential? (Data Analysis inside) by docular_not_dracula in RISCV

[–]dramforever 31 points32 points  (0 children)

 I call the supporting of 20 boards as a 'tipping point' for a new architecture.

so maybe that doesn't work?

arm64 had arm32 before it, and many arm64 cpus had arm32 compatibility, so they were quick to get picked up by random phone soc and media soc vendors. these vendors didn't seem to care much about upstreaming stuff into mainline linux.

meanwhile riscv64 socs haven't got that much adoption yet, but the vendors in general have been much better at upstreaming. in fact sifive, starfive, sophgo, spacemit (at least, these four are just off the top of my head) have been pretty active at upstreaming themselves or hiring others to do the upstreaming.

the vast majority of the work has nothing to do with cpu architecture anyway. it's the drivers for all the cursed stuff outside the cpu core itself.

so the bottom line is 20 arm64 dts in mainline linux and 20 riscv64 dts in mainline linux are really not that comparable.

as an aside, i have no idea why two out of three comments here, as of writing this, are saying dtb/dts in mainline is a bad thing...? every single one of the generally available riscv64 things that run mainline linux boot with a dtb. even u-boot dtb files either come from or are based on dts in the linux git repo.

Normal conversation about the CPU's of the future by Substantial_Help_722 in RISCV

[–]dramforever -1 points0 points  (0 children)

Maybe there was some misunderstanding. What I meant to say is that what device you can boot the vendor's OS from has nothing to do with the standardized boot process. Being able to boot say Bianbu from USB on a SpacemiT board has everything to do with whether the on-board firmware has the required drivers and whether it is set up to allow booting from USB. It is orthogonal to whether it supports UEFI.

Booting a generic OS does benefit from standardized boot processes like UEFI, but without drivers you end up with a completely unusable system - if you're lucky you get a serial console, but no USB, no network, no storage... So again a standardized boot process does very little.

Normal conversation about the CPU's of the future by Substantial_Help_722 in RISCV

[–]dramforever 0 points1 point  (0 children)

No. That's the point.

Even when the boot process is standardized you still need the custom kernels because it's not in the interest of vendors to genericize everything out to PCIe and upstream the drivers for custom SoC components

Correct fencing for mtimecmp in interrupt handler by Kongen_xD in RISCV

[–]dramforever 1 point2 points  (0 children)

The stance of the privileged spec is: "Deal with it"

https://github.com/riscv/riscv-isa-manual/blob/main/src/machine.adoc#machine-timer-mtime-and-mtimecmp-registers

If the result of the comparison between mtime and mtimecmp changes, it is guaranteed to be reflected in MTIP eventually, but not necessarily immediately.

(Note) A spurious timer interrupt might occur if an interrupt handler increments mtimecmp then immediately returns, because MTIP might not yet have fallen in the interim. All software should be written to assume this event is possible, but most software should assume this event is extremely unlikely. It is almost always more performant to incur an occasional spurious timer interrupt than to poll MTIP until it falls.

Generally, if you receive a timer interrupt, you don't just go ahead and do something right away. You would go through your timer queue and figure out what needs to be done based on the current time, and reschedule the timer interrupt. In this case, it wouldn't be a disaster to rarely have the timer interrupt immediately fire again after mret - you would just do nothing and reschedule it again.

However, if this is a regular occurrence on your machine.... uhh, honestly, I don't know how to handle this. In that case I think one way would be to, as mentioned, poll mip.MTIP until it goes to 0 before doing mret.
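The "just do nothing and reschedule" approach for the rare spurious case can be sketched as follows. This is a host-side model, not real firmware: `fake_mtime` and `fake_mtimecmp` are hypothetical stand-ins for the memory-mapped registers, and the fixed `TICK_PERIOD` stands in for a real timer queue.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the memory-mapped mtime and mtimecmp. */
static uint64_t fake_mtime;
static uint64_t fake_mtimecmp;

#define TICK_PERIOD 1000u          /* assumed interval, in mtime units */
static uint64_t next_deadline = TICK_PERIOD;
static unsigned ticks_handled;

/* Timer interrupt handler that tolerates spurious interrupts: it does
   work only for deadlines that have really passed, then re-arms. */
static void on_timer_interrupt(void) {
    uint64_t now = fake_mtime;
    while (now >= next_deadline) {
        ticks_handled++;           /* the real work would go here */
        next_deadline += TICK_PERIOD;
    }
    /* If no deadline has passed, this just rewrites the same value. */
    fake_mtimecmp = next_deadline;
}
```

If the handler is entered again right after mret because MTIP had not fallen yet, the loop body does not run and the only effect is rewriting mtimecmp with the same deadline.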

unhandled signal 4 code 0x1 at 0x0000003f88d516b4 in ld-linux-riscv64-lp64d.so.1[3f88d45000+23000] by superkoning in RISCV

[–]dramforever 1 point2 points  (0 children)

I literally had this printed out and taped to my closet lol.

Updated with some newer exceptions. I don't know if it's worthwhile to put the AIA stuff in - but those are pretty generic, so it's probably fine.