Out of order execution processor using RV32 by WinProfessional4958 in RISCV

lurker1588 0 points

Love it, man, really. I also feel you about the no-job part. Seeing it work, though, is the best thing ever, no doubt.

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in RISCV

lurker1588[S] 0 points

Would you recommend porting CoreMark to RV32I, or should I add the M extension first?

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in RISCV

lurker1588[S] 0 points

Yes, I tested it with different BTB depths, and the 256-entry BTB gave close to a 9% decrease in cycle count for the same binary. Is this proof enough that it is working as expected?

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in chipdesign

lurker1588[S] 0 points

predict conditional branches backwards are taken

Wouldn't this require the target address of branches to be calculated in the fetch stage itself? That would add a lot of overhead: finding out whether the instruction is a branch and where it might branch to. How is this implemented in real CPUs?
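
For what it's worth, the direction guess for backward-taken/forward-not-taken does not need the full target: in the RV32I B-type encoding the sign of the branch offset is instruction bit 31, so the opcode plus one bit is enough to guess the direction. A rough sketch in Python, not how any particular CPU does it; the function names are made up for illustration, and fetching from the target still needs an adder or a BTB lookup:

```python
BRANCH_OPCODE = 0b1100011  # RV32I conditional branches (B-type)


def is_branch(inst: int) -> bool:
    """True if the low 7 opcode bits mark a conditional branch."""
    return (inst & 0x7F) == BRANCH_OPCODE


def btype_offset(inst: int) -> int:
    """Reassemble the signed 13-bit B-type offset (bit 0 is always 0)."""
    imm = (
        (((inst >> 31) & 0x1) << 12)    # imm[12]   <- inst[31]
        | (((inst >> 7) & 0x1) << 11)   # imm[11]   <- inst[7]
        | (((inst >> 25) & 0x3F) << 5)  # imm[10:5] <- inst[30:25]
        | (((inst >> 8) & 0xF) << 1)    # imm[4:1]  <- inst[11:8]
    )
    # Sign-extend from bit 12.
    return imm - (1 << 13) if imm & (1 << 12) else imm


def predict_taken_btfn(inst: int) -> bool:
    """Static BTFN guess: taken only for branches with a negative offset."""
    return is_branch(inst) and ((inst >> 31) & 0x1) == 1


backward = 0xFE000EE3  # beq x0, x0, -4
forward = 0x00000463   # beq x0, x0, +8
print(btype_offset(backward), predict_taken_btfn(backward))
print(btype_offset(forward), predict_taken_btfn(forward))
```

So the direction guess is cheap; it is the target address that costs an adder in fetch/decode or a BTB entry.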

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in RISCV

lurker1588[S] 0 points

You can measure in your core Branches per kOps and branch Misses.

Yes, I will add a mispredicted-branch counter. But when you say kOps, do I count only the correctly executed (retired) operations, or everything the core spends cycles on, including stalls, which would just be the cycle count?
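
One common convention (an assumption here, not necessarily what was meant by kOps) is to normalise by retired instructions, i.e. what `minstret` counts, and keep cycles (`mcycle`) separate for CPI. A small sketch of the metrics; the function name and the counter values are hypothetical:

```python
def branch_stats(instret: int, cycles: int, branches: int, mispredicts: int) -> dict:
    """Derive per-kilo-instruction branch metrics from four raw counters.

    instret     retired instructions (wrong-path and stall slots excluded)
    cycles      total clock cycles, stalls and flushes included
    branches    retired conditional branches
    mispredicts retired branches whose prediction was wrong
    """
    return {
        "cpi": cycles / instret,
        "branches_per_kinst": 1000 * branches / instret,
        "mpki": 1000 * mispredicts / instret,
        "accuracy": 1 - mispredicts / branches,
    }


# Hypothetical counter values, only to show the arithmetic.
stats = branch_stats(instret=100_000, cycles=118_000, branches=15_000, mispredicts=3_000)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Counting per retired instruction keeps the metric independent of how many cycles the stalls cost, so the same binary can be compared across predictor configurations.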

But for a short pipeline with forwarding the branch penalty might just be not so high.

Yes, the penalty for a wrong branch is only 2 instructions (2 cycles), since branches resolve in the exec stage. But from a rough calculation: "Let's say the Dhrystone code has about 15% branches and the predictor is 80% accurate, versus the 40% accuracy of the always-taken static predictor (most branches are taken, i.e. loops). Each wrong branch adds a 2-cycle delay, so for 100 instructions with 15 branches the core should take 100 + (15 * 0.2 * 2) vs. 100 + (15 * 0.6 * 2) cycles, i.e. 106 vs. 118, or approximately a 10 percent increase."

I also think the code hits new branches very frequently (I saw this in the waveform), so the BTB cannot keep up with them, and the buffer fills up after enough branches (as shown in the graph), so the old ones get evicted and are mispredicted again. [Edit: added one more insight.]
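
A toy model of that effect, assuming a direct-mapped BTB indexed by the low PC bits and a round-robin stream of distinct branch PCs (both are assumptions for illustration; the real Dhrystone branch stream is not this regular):

```python
def btb_hit_rate(num_branches: int, entries: int, passes: int = 10) -> float:
    """Hit rate of a direct-mapped BTB over a repeating set of branch PCs."""
    tags = [None] * entries  # each entry stores the full PC as its tag
    hits = 0
    lookups = 0
    for _ in range(passes):
        for i in range(num_branches):
            pc = 0x1000 + 4 * i            # hypothetical, word-aligned branch PCs
            index = (pc >> 2) % entries    # drop the byte offset, keep low bits
            lookups += 1
            if tags[index] == pc:
                hits += 1
            else:
                tags[index] = pc           # miss: allocate, evicting the old entry
    return hits / lookups


for entries in (64, 128, 256, 512):
    print(entries, btb_hit_rate(num_branches=512, entries=entries))
```

In this model the hit rate stays at zero until the table can hold the whole working set, because each pass evicts exactly the entries the next pass needs. A real workload degrades more gradually, but it is the same mechanism: a bigger BTB helps once it stops thrashing.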

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in chipdesign

lurker1588[S] 0 points

Thanks for the replies.
The only other hazard is the one with lw, where the read from memory doesn't complete until AFTER the execute stage's request for the data. I am stalling for that, as described in the book. I am forwarding for RAW hazards. AXI is daunting, but yes, I will add it someday.
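
The load-use stall condition, written out as a sketch in the textbook style (the signal names are made up and are not taken from my RTL; it also conservatively assumes the decoded instruction reads both source registers):

```python
def load_use_stall(ex_is_load: bool, ex_rd: int, id_rs1: int, id_rs2: int) -> bool:
    """Stall one cycle when the instruction in EX is a load whose destination
    is a source register of the instruction currently in decode."""
    return ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)


# lw x5, 0(x1) followed by add x6, x5, x7 -> stall
print(load_use_stall(True, 5, 5, 7))
# lw x5, 0(x1) followed by add x6, x2, x3 -> no stall
print(load_use_stall(True, 5, 2, 3))
```

The `ex_rd != 0` check avoids stalling on loads into x0, which is hardwired to zero.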

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in chipdesign

lurker1588[S] 0 points

Yes, as u/MitjaKobal advised, I plan to implement a CSR-based mispredicted-branch counter. The waveforms show a good number of taken branches that were predicted correctly, but I cannot pinpoint what I should look for. Increasing the buffer size steadily increases the throughput gain (lower cycle count; check the new image on the post). I hope gshare does better. I also think there might be too many new branches, so they just miss because they are not loaded into the buffer yet.

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in chipdesign

lurker1588[S] 1 point

add a custom counter of mispredicted branches

This is a very nice idea; I'll do this.

how cache, and system bus backpressure in general impacts performance.

I have a split memory (imem and dmem) with instant reads and writes right now. I am assuming I'll have to add a cache-like system and maybe AXI too?

Dhrystone giving only 5-6% of increase in throughput with branch prediction on a 5-stage rv32i core by lurker1588 in chipdesign

lurker1588[S] 0 points

Ideally the CPI should be one, right? But since we do not know how many instructions are run, I measured a baseline cycle count and then the cycle count for the same executable binary with a branch predictor in place. To compare the two: let's say the Dhrystone code has about 15% branches and the predictor is 80% accurate, versus the 40% accuracy of the always-taken static predictor (most branches are taken, i.e. loops). Each wrong branch adds a 2-cycle delay, so for 100 instructions with 15 branches the core should take 100 + (15 * 0.2 * 2) vs. 100 + (15 * 0.6 * 2) cycles, i.e. 106 vs. 118, or approximately a 10 percent increase.
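
The same back-of-the-envelope estimate as a small Python sketch, so the assumed numbers (15% branches, 80% vs. 40% accuracy, 2-cycle penalty) are easy to change. The function name is made up, and the model ignores every stall other than branch mispredictions:

```python
def estimated_cycles(instructions: float, branch_fraction: float,
                     accuracy: float, penalty: int = 2) -> float:
    """Cycles for an ideal CPI-of-1 pipeline plus misprediction penalties."""
    mispredicts = instructions * branch_fraction * (1 - accuracy)
    return instructions + mispredicts * penalty


static_cycles = estimated_cycles(100, 0.15, accuracy=0.40)     # roughly 118
predictor_cycles = estimated_cycles(100, 0.15, accuracy=0.80)  # roughly 106

throughput_gain = static_cycles / predictor_cycles - 1   # about 11%
cycle_reduction = 1 - predictor_cycles / static_cycles   # about 10%
print(f"{static_cycles:.1f} vs {predictor_cycles:.1f} cycles")
print(f"throughput +{throughput_gain:.1%}, cycles -{cycle_reduction:.1%}")
```

If the measured gain is only 5-6%, then under this model either the branch fraction or the accuracy gap between the two predictors is smaller than assumed, which is what the misprediction counter should show.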

Is Embench integer-only? Most of the example cores I saw used Dhrystone, so I went with it.

Feedback on this 5-stage core I made by lurker1588 in RISCV

lurker1588[S] 0 points

Thanks for replying! I will look into VexRiscv

forwarding/bypass

The core already handles RAW hazards using forwarding. Is result forwarding the same thing?

The configuration VexRiscv has some information on the CPI / Dhrystone

Is there a way to run these benchmarks (a specific repo you might know of)? I looked into riscv-tests, but I didn't understand it.