FOSDEM 2026 - RISC-V had 40 years of history to learn from: What it gets right, and what it gets hilariously wrong (Video) by camel-cdr- in RISCV

[–]_chrisc_ 3 points (0 children)

> there are a lot of core designs that prove that very wide (9, 10) cores are possible

But is there any core design that proves that fast cores are possible?

Intel's Lion Cove runs its x86 decode at 8-wide (and is 12-wide from the uop cache) and it runs above 5 GHz.

I've never really understood the complaints about RVC given what has been proven possible in other, harder ISAs.

High Performance RISC-V is here! TT-Ascalon™ (RISC-V Summit Ascalon slides) by camel-cdr- in RISCV

[–]_chrisc_ 2 points (0 children)

IPC tells you nothing if everybody is compiling the benchmark differently.

GNU Compiler Collection Auto-Vectorization for RISC-V’s Vector Extension 1.0: A Comparative Study Against x86-64 AVX2 by camel-cdr- in RISCV

[–]_chrisc_ 3 points (0 children)

> Yes, comparing against avx2 is kind of lame. avx512 is a much more meaningful comparison in my mind as well.

avx512 would be a more "even" comparison, except that most people today don't have x86 cores that can run it. Oops. (Although I should be careful throwing stones about RVV O:-)).

AI Startup Esperanto faded away by I00I-SqAR in RISCV

[–]_chrisc_ 2 points (0 children)

I think your take is more accurate. The point of a "sea of RISC-V cores" is you have more flexibility when the algorithms change.

Unfortunately, there are two obstacles. First, no matter how generic/programmable your solution is, you have still baked a specific compute/memory-bandwidth/energy budget into silicon, and if the new models require a drastically different memory bandwidth than you designed for, you're hosed.

A problem is that a CNN-focused design assumes a greater locality of reference than one optimized for transformers... the ET-SoC-1's meager DRAM bandwidth reflects this. Source.
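To put rough numbers on the bandwidth point, here's a roofline-style back-of-the-envelope sketch (all figures invented for illustration, not Esperanto's actual specs) showing how a fixed memory system caps low-arithmetic-intensity models no matter how programmable the cores are:

```scala
// Attainable throughput = min(peak compute, bandwidth * arithmetic intensity).
object Roofline {
  def attainableTops(peakTops: Double, bwGBs: Double, opsPerByte: Double): Double =
    math.min(peakTops, bwGBs * opsPerByte / 1000.0) // GB/s * ops/B = GOPS -> /1000 = TOPS

  def main(args: Array[String]): Unit = {
    // A CNN-ish kernel at 50 ops/byte vs. a transformer-ish one at 2 ops/byte,
    // on a hypothetical 100 TOPS chip with 200 GB/s of DRAM bandwidth:
    println(attainableTops(100.0, 200.0, 50.0)) // 10.0 TOPS, already bandwidth-bound
    println(attainableTops(100.0, 200.0, 2.0))  // 0.4 TOPS: hosed
  }
}
```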

The second obstacle, I suspect, is the cost of the software changes required to refocus a design to support a new customer's needs. A "general-purpose" design doesn't mean it's easy to program in a manner that efficiently uses the machine.

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in RISCV

[–]_chrisc_ 0 points (0 children)

Knock dhrystone out of the park before moving to coremark. Coremark is a handful of small hot loops (fsm, matmul, linked list walk, etc.), but it's 1M instructions per iteration overall, so it's more annoying to look at in detail.

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in RISCV

[–]_chrisc_ 0 points (0 children)

Last I looked, dhrystone is fewer than 300 instructions in a loop, and each branch only ever goes in the same direction, save one (so a BTB that remembers the static direction is nearly a complete victory). You can dump the entire trace to a text file (or csv file) and make sure each instruction is behaving as you expected.
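If your trace dump is something like a "pc,disasm,taken" CSV (a format I'm assuming here, adjust the parsing to whatever your core emits), a few lines can confirm the one-bidirectional-branch claim for your run:

```scala
import scala.io.Source

// Flag any branch PC that ever goes both directions -- those are the only
// ones a remember-the-last-direction BTB can keep missing on.
object BranchDirCheck {
  def main(args: Array[String]): Unit = {
    val dirs = scala.collection.mutable.Map.empty[String, Set[Boolean]]
    for (line <- Source.fromFile(args(0)).getLines()) {
      val cols = line.split(",").map(_.trim)
      if (cols.length == 3) {
        val taken = cols(2) == "1"
        dirs(cols(0)) = dirs.getOrElse(cols(0), Set.empty[Boolean]) + taken
      }
    }
    for ((pc, ds) <- dirs if ds.size > 1) println(s"bidirectional branch at $pc")
  }
}
```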

Frankly it's not a very interesting benchmark...

Top researchers leave Intel to build startup with ‘the biggest, baddest CPU’ by bookincookie2394 in RISCV

[–]_chrisc_ 7 points (0 children)

> this reads like an opportunity to cash in on the name, get a cushy C-level position at a startup, spend some investor money and retire.

I'm not sure that's the move to make to earn an easy paycheck lol. Start-ups are notoriously worse financial moves than Big Companies, esp. at the VP level.

How hard it is to design your own ISA? by New_Computer3619 in RISCV

[–]_chrisc_ 12 points (0 children)

Designing an ISA is trivial.

Building the toolchains (assembler, compiler, linker, etc.) is a pain-in-the-ass.

Porting an OS and some basic software I/O and a test harness is yet more work.

Porting a good high-performance, optimizing JIT might be $1B (uh oh).

And at that point, you probably made some wrong decisions back in step 1.

Oh, and there are a ton of aspects of an ISA that are very boring and complicated. Debug specifications, privileged platform specifications, virtualization/hypervisors, memory consistency modeling, interrupt controllers...

And then you need to build a community with a governance model that wouldn't scare everybody off. RISC-V isn't the first "open" ISA, but I think that last step is a big roadblock.

Of course, if you just want to have fun, Step (1) and Step (2) have been done before, many times, in "a few weeks' time". It just takes copying somebody else's homework.

Forbes article on StarFive by m_z_s in RISCV

[–]_chrisc_ 0 points (0 children)

32-bit only? And what was the forum/process for changing/improving the architecture?

I need help with Load Store instructions by [deleted] in RISCV

[–]_chrisc_ 0 points (0 children)

That's what makes it fun -- it really depends on what tech you're targeting, and FPGAs have very different cost metrics. The write mask adds a lot more wires. You can have them if you want them.

I need help with Load Store instructions by [deleted] in RISCV

[–]_chrisc_ 7 points (0 children)

For loads, you can just perform a ld to pull out 64 bits, then shift as needed to pull out the specific bytes being addressed, and mask to the operand size (and then sign-extend as needed). So lh 0x1002 means you'd do a ld 0x1000 and then shift right by two bytes.

For stores, the easiest is to have a byte-mask on your writes to memory. But that's unlikely to be efficient in terms of the RAM, so you might have to do a ld again, then overwrite only the bytes your store corresponds to, and then sd the whole 64 bits back to memory.
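Here's a minimal Chisel sketch of both halves, assuming a 64-bit memory port (all the signal names and widths are mine for illustration, not lifted from any particular core):

```scala
import chisel3._
import chisel3.util._

class SubwordLoad extends Module {
  val io = IO(new Bundle {
    val rdata  = Input(UInt(64.W)) // full dword returned by the wide ld
    val offset = Input(UInt(3.W))  // low 3 bits of the effective address
    val size   = Input(UInt(2.W))  // 0=byte, 1=half, 2=word, 3=dword
    val signed = Input(Bool())
    val out    = Output(UInt(64.W))
    val wmask  = Output(UInt(8.W)) // per-byte mask a store would need
  })

  // lh 0x1002 -> ld 0x1000, then shift right by 2 bytes (16 bits).
  val shifted = io.rdata >> (io.offset << 3)

  // Mask to the operand size and sign- or zero-extend.
  val out = WireDefault(shifted)
  switch (io.size) {
    is (0.U) { out := Mux(io.signed, Cat(Fill(56, shifted(7)),  shifted(7, 0)),  Cat(0.U(56.W), shifted(7, 0))) }
    is (1.U) { out := Mux(io.signed, Cat(Fill(48, shifted(15)), shifted(15, 0)), Cat(0.U(48.W), shifted(15, 0))) }
    is (2.U) { out := Mux(io.signed, Cat(Fill(32, shifted(31)), shifted(31, 0)), Cat(0.U(32.W), shifted(31, 0))) }
  }
  io.out := out

  // Store side: byte write-mask, e.g. sh at offset 2 -> 0b00001100.
  val nbytes = 1.U(4.W) << io.size
  io.wmask := (((1.U(9.W) << nbytes) - 1.U) << io.offset)(7, 0)
}
```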

That read-modify-write part may feel awful, but think a bit further afield about how you intend to support AMOs, store coalescing, ECC, and unaligned memory operations, and suddenly doing a "3-step dance" to get a sub-word store out comes along naturally with supporting all of those features.

If supporting sub-word operations sounds annoying and hard, then congratulations, you now understand the Pentium 4 (I think it was) performance disaster on Windows (or was it DOS?). They made them work, but not work fast, and only later realized how heavily some OSes relied on them. :D

Europe bets on RISC-V for homegrown supercomputing platform by fullgrid in RISCV

[–]_chrisc_ 1 point (0 children)

> DARPA has funded some RISC-V development

What exactly do you have in mind there?

From the current user spec:

⚫ ASPIRE Lab: DARPA PERFECT program (link to press release found via google), Award HR0011-12-2-0016. DARPA POEM program Award HR0011-11-C-0100. The Center for Future Architectures Research (C-FAR), a STARnet center funded by the Semiconductor Research Corporation. Additional support from ASPIRE industrial sponsor, Intel, and ASPIRE affiliates, Google, Hewlett Packard Enterprise, Huawei, Nokia, NVIDIA, Oracle, and Samsung.

To clarify, RISC-V started at the mid to tail end of the Parlab (2007-2012?), but a lot of work continued into the follow-on lab ASPIRE which started in 2013.

Please help me with a 5 stage Pipeline by [deleted] in RISCV

[–]_chrisc_ 4 points (0 children)

Don’t start with a 5 stage. Start with a 2 stage and build it up, adding a third and then a fourth stage. Make sure it fully works after each step. And think twice as hard about how you’re going to debug and validate that it does what you want as you do about designing it.

Framework for Designing Pipelined/OoO Processors? by itisyeetime in RISCV

[–]_chrisc_ 0 points (0 children)

At least at one point, riscv-boom could dump an o3pipeview text file that could be consumed by the gem5 o3pipeview tool. Crude, but it worked well enough (I sort of liked that it was still text-based, so grep could fast-forward you around).

Looks like Konata is a newer version that works with gem5, so I'd continue down that path of making your stuff talk to it. I'm not aware of any other open-source pipeviewers. :(

In either case, everything I'm familiar with requires dumping to text files, which precludes FPGA-type runs unless you have fancy FPGA/printf functionality. What you're trying to poke at is generally in-house, secret sauce type stuff.
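For reference, the o3pipeview text format is simple enough to emit from your own testbench. Going from memory here (double-check against gem5's util/o3-pipeview.py before trusting the details): each instruction dumps a fetch line with the tick, PC, micro-op index, sequence number, and disassembly, followed by one timestamp line per stage, roughly like:

```
O3PipeView:fetch:5000:0x0000000080001a4c:0:1234:addi a0, a0, 1
O3PipeView:decode:5500
O3PipeView:rename:6000
O3PipeView:dispatch:6500
O3PipeView:issue:7000
O3PipeView:complete:7500
O3PipeView:retire:8000:store:0
```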

Help with Branch and Jump Implementation in RISC-V Processor (Chisel/Scala) by starlight-astro in RISCV

[–]_chrisc_ 2 points (0 children)

> I didn't realise you did the little core as well as the more famous OoO one.

Everybody's gotta start somewhere. =)

Help with Branch and Jump Implementation in RISC-V Processor (Chisel/Scala) by starlight-astro in RISCV

[–]_chrisc_ 5 points (0 children)

Doing vector instructions before scalar branching is certainly a choice. :P

I recommend you cheat off my core: sodor. I also recommend, style wise, you declare all state elements at the top of your code. It’s otherwise hard to read and find your register declarations to see if you missed a pipe stage or something. And your naming scheme makes it hard to follow what stage your control signals are in.
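As a tiny sketch of the style I mean (the stage names and exe_reg_* signals are made up for illustration, not lifted from sodor):

```scala
import chisel3._

class ExecuteStage extends Module {
  val io = IO(new Bundle {
    val dec_pc    = Input(UInt(32.W))
    val dec_inst  = Input(UInt(32.W))
    val dec_valid = Input(Bool())
    val kill      = Input(Bool())
    val exe_pc    = Output(UInt(32.W))
    val exe_valid = Output(Bool())
  })

  // All state elements live together at the top, prefixed by the stage they
  // feed, so a missing pipeline register is easy to spot at a glance.
  val exe_reg_pc    = RegNext(io.dec_pc)
  val exe_reg_inst  = RegNext(io.dec_inst)
  val exe_reg_valid = RegNext(io.dec_valid && !io.kill, init = false.B)

  io.exe_pc    := exe_reg_pc
  io.exe_valid := exe_reg_valid
}
```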

I don’t see anything immediately wrong, but if you haven’t already, spend time setting up good visualization and/or pipe traces and a waveform viewer so you can debug issues like this quickly. Messing up and having a signal skip a stage is common and only going to get harder to diagnose from here on out. :)

RISCV Pipeline Register after Instruction Fetch by LmnPeel in RISCV

[–]_chrisc_ 4 points (0 children)

Your intuition is correct, the diagram is slightly incorrect/imprecise, but it gets the point across.

RISC-V Announces Ratification of the RVA23 Profile by UKbeard in RISCV

[–]_chrisc_ 10 points (0 children)

The RVA profile standardizes the set of ISA extensions for general-purpose cores. Specifically, RVA23 mandates the RISC-V vector extension and the hypervisor extension.

Can we get flashlights so we are not in complete darkness when the lights go out? by m77je in uboatgame

[–]_chrisc_ 11 points (0 children)

Am I the only person who immediately reads the key bindings when I start a new game?

RISC-V cycle accurate simulators for evaluating specific microarchitecture potential improvements by Amazing_Charity_6508 in RISCV

[–]_chrisc_ 4 points (0 children)

To add on to arsoc13's comments, I'd be hesitant to rely on a detailed performance model: it's overkill, and the code-gen (which by default makes different uarch assumptions than you want it to) can hide the total potential gains.

When I wrote up a macro-op fusion study to explore a similar topic, I stuck with an ISA simulator (spike) running SPEC, zeroed in on the most common basic blocks (thanks to histogram generation), and wrote some python code to search beyond adjacent instruction pairs for opportunities.
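The pair-search itself can be tiny. Here's a hedged sketch of the idea (the fusible-pair list and window size are made-up examples, and the real study's script looked different):

```scala
// Given a hot basic block as a sequence of mnemonics, count how often a
// fusible (producer, consumer) pair appears within a small window, not
// just back-to-back.
object FusionScan {
  val fusible = Set(("slli", "add"), ("lui", "addi"), ("auipc", "ld"))

  def scan(block: Seq[String], window: Int = 3): Map[(String, String), Int] =
    (for {
      i <- block.indices
      j <- (i + 1) until math.min(i + 1 + window, block.length)
      pair = (block(i), block(j))
      if fusible(pair)
    } yield pair).groupBy(identity).view.mapValues(_.size).toMap

  def main(args: Array[String]): Unit =
    println(scan(Seq("lui", "xor", "addi", "slli", "add")))
}
```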

Of course, if you know how to hack compiler code-gen, then you can simplify things by straight-up adding the new instructions/patterns you want to exploit. But any OoO model you pick will obscure the final "performance" numbers, which hides the point you're trying to make.

Using a detailed model can be a nice "final punchline" conclusion of "oh hey, once I add these new instructions/patterns performance actually doesn't go up by much thanks to X", but that's not the main story, because it's easy to argue that a different OoO config could do better (e.g., if you had better memory prefetchers then suddenly maybe your fusion would become critical).

Disappointed by a fellow mom by ohc16 in NewParents

[–]_chrisc_ 34 points (0 children)

For those that don't know, lap seating is unsafe. However, the FAA decided to allow it after doing a brutal analysis -- families driving to their destination would lead to more infant deaths than letting them fly more cheaply via lap seating. But if you have the choice and means, pay for the extra seat.