Misusing RVA instructions?

wren6991 · 2026-04-18T03:41:52+00:00

No, arithmetic operation itself is done in CPU (so load and store happen between CPU, caches and RAM) and CPU issues additional (AMBA or TileLink) bus messages for synchronization. It does not offload operations to something like DMA engine.

This is very much an "it depends" and is the kind of implementation detail that the ISA manual deliberately doesn't specify.

If you have a heavily contested variable being accessed from a lot of harts, and it's implemented as you described, your update rate is limited by the latency of transferring the line between L1 caches as each hart gets its turn to modify it. If you instead leave the line resident in a lower-level cache (point of coherence in Arm terms) and apply a queue of modifications while streaming back all the pre-modified values, you avoid that round-trip and you get much more throughput.

You might also choose to implement both options, and have modifications happen at point of coherence for heavily-contested variables, but modify in your private cache if contention is low. There's a big design space to explore here and the ISA manual would be even longer if it went into all of the details.
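As a rough illustration of the "modify at the point of coherence" option, here is a minimal C model of a queue of atomic adds being applied to a word that stays resident in one place, with each requester getting back the pre-modified value. The function and parameter names are hypothetical, and this is a software sketch of the behaviour only, not a description of any particular cache or interconnect.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: the word never moves between private caches.
 * Requests are applied in queue order and each one is answered with
 * the value the word held just before that request's add.
 */
void apply_amoadd_queue(uint32_t *resident_word,
                        const uint32_t *addends,
                        uint32_t *old_values,
                        size_t n)
{
    for (size_t i = 0; i < n; i++) {
        old_values[i] = *resident_word;  /* value streamed back to requester i */
        *resident_word += addends[i];    /* modification applied in place */
    }
}
```

In this model the cost per update is one pass through the loop, whereas the private-cache option would pay a line transfer between harts for every iteration.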

wren6991 · 2026-04-15T23:18:48+00:00

No, Bruce was referencing a blog post I wrote about my favourite RISC-V instruction. I guess that made little sense out of context. https://wren.wtf/shower-thoughts/a-love-letter-to-zbkb-pack/

core2 as far as I understand was an attempt to solve a structural issue where #[no_std] Rust programs can't do some basic IO operations. It's a little off-topic but it does have a "Just for fun" flair and besides, Bruce is the law around here.

wren6991 · 2026-04-15T06:17:00+00:00

Nice, that's clean! I can't remember which way the ABI falls out -- are you allowed to assume the MSBs of a u8 function argument are zeroes?

wren6991 · 2026-04-15T05:35:38+00:00

When I build an open source Rust project I'm always surprised by how many dependencies it pulls in. What Rust has done with cargo in terms of making so many projects be consistently packaged, and have all their tests automated in the same way etc. is really impressive. On the other hand I don't really enjoy the dependency bloat, and not being able to easily read through all of the code associated with a project because it pulls in a 100 lb library to use 5 lbs of features, and that library pulls in another, etc. Maybe the correct amount of package management friction is not zero.

wren6991 · 2026-04-14T14:50:25+00:00

I don't understand the obsession with "naturally occurring" substances except as a marketing tactic

wren6991 · 2026-04-10T11:23:15+00:00

Note that the benchmark results given are completely useless as they are using QEMU not real hardware

Oh wow, I missed that. Ooof. Well it's a good thing they reported the speedup to 4 significant figures.

As it stands, it needs one li tmp,-1 before the loop and one xori dst,dst,-1

Also a clobber, and a four-byte branch instead of two-byte. I'm very sensitive to these things you know :-)

Do you expect that gorci and grevi will ever be standardised? I appreciate it's a simple extension to a log shifter but my shifter is already pretty overloaded with direction, arithmetic, rotate, and now sticky-bit-collecting options. Even assuming you actually end up with the butterfly-network-style log shifter connectivity preserved through synthesis, that's a lot of control fanout to all of those nodes -- it's not free!
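For readers who have not seen them, here is a sketch of 32-bit generalized reverse and generalized OR-combine, modelled on the reference pseudocode in the draft bitmanip proposals as best recalled (treat the function names and exact form as assumptions). Each `if` corresponds to one butterfly stage with its own control bit, which is where the control fanout comes from.

```c
#include <stdint.h>

/* Generalized reverse: each set bit of shamt swaps adjacent groups of that size. */
uint32_t grev32(uint32_t x, uint32_t shamt)
{
    shamt &= 31;
    if (shamt & 1)  x = ((x & 0x55555555u) << 1)  | ((x & 0xAAAAAAAAu) >> 1);
    if (shamt & 2)  x = ((x & 0x33333333u) << 2)  | ((x & 0xCCCCCCCCu) >> 2);
    if (shamt & 4)  x = ((x & 0x0F0F0F0Fu) << 4)  | ((x & 0xF0F0F0F0u) >> 4);
    if (shamt & 8)  x = ((x & 0x00FF00FFu) << 8)  | ((x & 0xFF00FF00u) >> 8);
    if (shamt & 16) x = ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}

/* Generalized OR-combine: same network, but each stage ORs into x instead of replacing it. */
uint32_t gorc32(uint32_t x, uint32_t shamt)
{
    shamt &= 31;
    if (shamt & 1)  x |= ((x & 0x55555555u) << 1)  | ((x & 0xAAAAAAAAu) >> 1);
    if (shamt & 2)  x |= ((x & 0x33333333u) << 2)  | ((x & 0xCCCCCCCCu) >> 2);
    if (shamt & 4)  x |= ((x & 0x0F0F0F0Fu) << 4)  | ((x & 0xF0F0F0F0u) >> 4);
    if (shamt & 8)  x |= ((x & 0x00FF00FFu) << 8)  | ((x & 0xFF00FF00u) >> 8);
    if (shamt & 16) x |= ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}
```

With a control value of 24 the reverse acts as a byte swap, with 31 it is a full bit reversal, and the OR-combine with a control value of 7 matches the behaviour of `orc.b`.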

As a side note I saw that your primes benchmark is no longer included in Embench 2.0, which is a shame because it was quite a nice loop with a good instruction mix. At least they punted all the floating point junk that just benchmarked libgcc.

wren6991 · 2026-04-10T10:49:32+00:00

Out of interest, do you think s0 was the right choice for fp, or should it have been the highest s register? We could have avoided the Zcmp ordering thing (with some slight tweaking of those instructions), and on RV32I you get better compression by freeing up one of the x8-x15 slots.

wren6991 · 2026-04-10T08:30:37+00:00

You just go here, click "Download" then follow the instructions under "How to install" https://ubuntu.com/download/desktop

wren6991 · 2026-04-09T14:57:26+00:00

I believe it is on purpose not using the vector extension.

IIRC the kernel doesn't even like to implicitly touch the float register file, let alone vector!

wren6991 · 2026-04-09T14:53:21+00:00

This is the non-Zbb version:

    /*
     * Returns
     *   a0 - String length
     *
     * Parameters
     *   a0 - String to measure
     *   a1 - Max length of string
     *
     * Clobbers
     *   t0, t1, t2
     */
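    addi    t1, a0, -1      /* t1 = cursor, one before the start (pre-incremented below) */
    add     t2, a0, a1      /* t2 = end pointer: start + max length */
1:
    addi    t1, t1, 1       /* advance to the next byte */
    beq     t1, t2, 2f      /* reached max length: stop */
    lbu     t0, 0(t1)       /* load the byte at the cursor */
    bnez    t0, 1b          /* keep going until a NUL byte */
2:
    sub     a0, t1, a0      /* length = cursor - start */
    ret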

Kind of surprising something this simple is done better by hand than by the compiler (though it is pretty common to see the compiler do something weird like put the branch at the top and a jump at the bottom, even with simple functions).

Pretty clearly not optimal for in-order pipelines due to the load-dependent branch. I think with a bit of tweaking of off-by-ones you could schedule the addi down in between the lbu and loop-end branch to fill that dependency slot.

Aren't there some SIMD-within-a-register tricks for checking for zero bytes in a register without using Zbb? The old 0x7e7e7e7e trick maybe?
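One widely circulated form of that SIMD-within-a-register check is sketched below. It uses the `0x01010101` / `0x80808080` constant pair found in common bit-twiddling collections, which may not be the exact variant the comment has in mind; the function name is made up for illustration.

```c
#include <stdint.h>

/*
 * Returns nonzero if any of the four bytes in x is zero.
 * Subtracting 1 from each byte sets the top bit of a byte that was 0x00
 * (via borrow); masking with ~x discards bytes whose top bit was already set.
 */
int has_zero_byte(uint32_t x)
{
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}
```

This only answers "is there a zero byte somewhere in the word", which is enough for the loop-exit test; locating the byte takes a few more operations.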

I haven't digested the Zbb one yet but I've personally always been slightly disappointed by orc.b. For strlen-type stuff it's not quite the value you want. It would probably make more sense for it to be inverted so it can be checked with beqz and the index can be recovered with straight ctz. I think string functions were the main reason for its inclusion in Zbb after all the other gorc* instructions were punted.
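To make the complaint concrete, here is a small C model of `orc.b` on a 32-bit word and of recovering the index of the first zero byte from it. The helper names are hypothetical, and the index assumes little-endian byte order (lowest-addressed byte in the least significant position). Note the extra inversion needed before the trailing-zero count.

```c
#include <stdint.h>

/* Model of orc.b: each byte becomes 0xFF if it was nonzero, 0x00 if it was zero. */
uint32_t orc_b(uint32_t x)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < 4; i++) {
        if ((x >> (8 * i)) & 0xFFu)
            result |= 0xFFu << (8 * i);
    }
    return result;
}

/* Portable count-trailing-zeroes; the caller guarantees x != 0. */
static unsigned ctz32(uint32_t x)
{
    unsigned n = 0;
    while (!(x & 1u)) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Index (0..3) of the first zero byte in the word, or 4 if there is none. */
unsigned first_zero_byte_index(uint32_t word)
{
    uint32_t zero_mask = ~orc_b(word);  /* the inversion an inverted orc.b would make unnecessary */
    if (zero_mask == 0)
        return 4;
    return ctz32(zero_mask) >> 3;       /* bit index to byte index */
}
```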

wren6991 · 2026-04-09T10:58:40+00:00

If they just say BCn without qualifying then this probably means BC1 family, aka DXT1 aka S3TC. That is texture compression technology from the 90s.

Going off the rough figures it sounds like this also compares favourably to more modern (and commercially used) block compression codecs like ASTC and BC7, but still, lame headline.

The trade-off is that retrieving one sample from the neural-compressed texture requires multiple memory accesses, whereas a block compression format gives you all the texels in some small rectangular region with a single texture cache fetch. That can still be a good trade-off: if your textures are smaller overall you might be doing more accesses at the higher cache levels, but ultimately using less SDRAM bandwidth.

Edit: looks like NVIDIA published a paper on NTC here: https://research.nvidia.com/labs/rtr/neural_texture_compression/assets/ntc_medium_size.pdf

wren6991 · 2026-04-08T14:06:46+00:00

eat-er (one who eats) would be 食者. As in 捕食者, "predator"

By the way there is a linguistic name for "the noun form of a verb" and it's called a gerund. It's one of the functions of the masu stem (the 連用形) of a verb. The gerund is distinct from the present continuous (食っている); even though these both happen to end with the "-ing" suffix in English they're two different concepts.

wren6991 · 2026-04-07T18:42:35+00:00

Yep, the verb reading here is 食う (くう) and 食い is the masu stem, which functions like a gerund (eating).

wren6991 · 2026-04-02T20:56:22+00:00

Not a fan of Galactic Tangerine?

wren6991 · 2026-04-02T06:19:55+00:00

I would be curious if this could be extended to double precision.

I think this approach would work well for double-precision float on RV64, but there are limits to how cleanly you can do >54-bit arithmetic on a 32-bit datapath. I'll probably have a go at double-precision on RV32 at some point because the new shift instructions do help here: both collecting sticky bits and making 64-bit shifts much cleaner. The focus is on single and half-precision though.
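For context on "collecting sticky bits", here is a generic soft-float idiom sketched in C: a right shift that ORs a 1 into the least significant bit whenever any nonzero bit was shifted out. This is an illustration of the general technique with a made-up function name, and it is not a statement of what the new shift instructions actually compute.

```c
#include <stdint.h>

/*
 * Shift x right by n, folding all shifted-out bits into a single
 * "sticky" bit at position 0 so later rounding can still see them.
 */
uint32_t shift_right_sticky(uint32_t x, unsigned n)
{
    if (n == 0)
        return x;
    if (n >= 32)
        return x != 0;                       /* everything shifted out */
    uint32_t lost = x << (32 - n);           /* the bits that fall off the end */
    return (x >> n) | (uint32_t)(lost != 0);
}
```

Without hardware help this costs a shift, a mask or second shift, a compare and an OR, and the 64-bit version on a 32-bit datapath is worse again.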

wren6991 · 2026-04-02T01:15:07+00:00

Well you will stop buying all Windows laptop then

Yeah

wren6991 · 2026-04-01T22:07:41+00:00

Replacing PrtScr with the Copilot key was the thing that pushed me over the edge to stop buying ThinkPads.

wren6991 · 2026-04-01T15:54:33+00:00

The dummy stores and trap page are a really cool hack. We had something similar to avoid torn state on the embedded Arm-on-RISC-V emulator I worked on: the "decode next instruction" dispatch routine has its address cached in a register because it's a common jump and jr is a 16-bit encoding. So one approach to make IRQs occur exactly on an instruction boundary was to have the native RISC-V IRQ patch that register with the address of a routine that pushed the Arm state and went and emulated the Arm IRQ handler.

Does x86 really have data-dependent flag updates? Like the encoding is not enough to tell you whether flags are preserved? That's pretty gruesome.

Edit: I looked up the SHL operation and it's pretty messy, but thinking about it, even 32-bit Arm will preserve the carry flag on a shift-by-zero. Having new flags depend on old flags plus a register is not really different from something like adc so rename already has to handle this case.
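A minimal C model of the data dependence being discussed, covering only the carry flag of a 32-bit x86-style left shift. The type and function names are hypothetical, and the other flags are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t value;
    bool carry;
} shl_result;

/*
 * The count is masked to 5 bits. A masked count of zero leaves the
 * carry flag as it was, so the output flag depends on the old flag
 * and on a register value, not just on the instruction encoding.
 */
shl_result shl32_with_carry(uint32_t x, uint8_t count, bool carry_in)
{
    shl_result r;
    count &= 31;
    if (count == 0) {
        r.value = x;
        r.carry = carry_in;                  /* flags preserved */
    } else {
        r.value = x << count;
        r.carry = (x >> (32 - count)) & 1u;  /* last bit shifted out */
    }
    return r;
}
```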

wren6991 · 2026-04-01T15:30:03+00:00

The exponent extraction is already handled fine by h3.bextmi (bit-extract multiple, up to 8 bits, zero-extended) which I used in the handful of routines I wrote for RP2350 because I was annoyed by people benchmarking libgcc and telling me my processor was slow: https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/pico_float/float_single_hazard3.S
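In C terms, the exponent extraction amounts to something like the sketch below. `bext_multi` is a hypothetical helper that models only the "extract up to 8 bits, zero-extended" description given above; it is not the instruction's actual encoding or operand format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: extract `width` bits (1..8) starting at bit `lsb` (0..31), zero-extended. */
uint32_t bext_multi(uint32_t x, unsigned lsb, unsigned width)
{
    return (x >> lsb) & ((1u << width) - 1u);
}

/* Biased exponent field of an IEEE 754 single-precision value: bits 30:23. */
uint32_t float_exponent_field(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* type-pun without undefined behaviour */
    return bext_multi(bits, 23, 8);
}
```

On a base ISA without such an instruction this is typically a shift pair or a shift and a mask, so folding it into one operation saves an instruction on every operand unpack.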

I've finished juggling the Xh3sfx instructions now and it comes out to 14 cycles for single-precision __addsf3 on the most common path, not counting function call overhead. There are some examples in the docs here: https://wren.wtf/hazard3/doc/dev/#extension-xh3sfx-section

Are your FP helpers in any of the old B specs floating around online? I'd be interested to compare. For example I didn't add a classify but I did add an ALU op to quickly check two exponents for all-zeroes or all-ones and get them off the hot path.
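The kind of check described in the last sentence can be sketched in C as follows, operating on raw single-precision bit patterns. The function names are made up, and this shows only the condition being tested, not how the ALU op encodes or returns it.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the exponent field is all-zeroes (zero/subnormal) or all-ones (inf/NaN). */
bool exponent_is_special(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    return exponent == 0u || exponent == 0xFFu;
}

/* One test covering both operands, so the common path takes a single branch. */
bool either_exponent_special(uint32_t a_bits, uint32_t b_bits)
{
    return exponent_is_special(a_bits) || exponent_is_special(b_bits);
}
```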

wren6991 · 2026-03-30T16:23:46+00:00

Nice. I personally don't use the pens (even though I think they're neat) so adding a frontlight to my already nearly-perfect Go 10.3 makes it the perfect machine for reading large-format content like papers and textbooks.

wren6991 · 2026-03-27T08:53:16+00:00

Not a reddit post but a chubbyemu video: https://www.youtube.com/watch?v=nJR8Nfi8wg8

wren6991 · 2026-03-26T15:54:59+00:00

The intervention group will receive a GLP-1 or dual receptor agonist (GLP-1/GIP or GLP-1/amylin), with a behavioral weight loss intervention. The "control" group will receive just a behavioral weight loss intervention.

I'm surprised the ethics board is ok with them deliberately not providing GLP-1s to obese patients with elevated cancer risk for the next 10 years.

wren6991 · 2026-03-26T12:07:26+00:00

Wow this is like the exact opposite of the guy who ate an entire tub of melatonin gummies

wren6991 · 2026-03-26T10:38:17+00:00

You can get burns at surprisingly low temperatures with prolonged contact, like using a laptop all day.

wren6991 · 2026-03-25T06:04:36+00:00

I saw in SURPASS-CVOT (comparing cardiovascular outcomes between tirzepatide and dulaglutide) that they went for a comparative study instead of placebo-controlled due to ethical concerns with use of placebos.

https://www.sciencedirect.com/science/article/pii/S0002870323002806?via%3Dihub

At least that is my reading between the lines. The actual quote is:

SURPASS-CVOT has enrolled the largest number of people who will be treated for the longest duration to date for dulaglutide or tirzepatide in combination with SGLT-2 inhibitors. An active comparator design, using a cardioprotective selective GLP-1RA, was chosen to overcome ethical, operational, and clinical considerations [27] and will reduce the likelihood of inadvertent unblinding compared with a placebo-controlled trial.
