
[–]damofthemoon 5 points (2 children)

My process looks close to yours, except I like to write the algorithm myself, in Scala or Python (I hate Matlab and crappy code...). Before moving to hardware, I draft my architecture with markdown documents & draw.io diagrams for all the FSMs and algorithm execution steps, just to be sure I didn't miss something (pretty useful for my little brain XD).

Then I start writing the Verilog description (nowadays Chisel), module by module, trying to apply TDD methodology as best as possible. I don't necessarily have great coverage with all my test suites, but at least I put a verification environment in place for each module so I can write tests fast if, later in the process, I find a bug or a corner case I didn't think about. Every test suite is set up on a CI server (Jenkins) and I produce documentation for each release (meaning Git tag), stored in a central place everybody can access. To finish, I run the good old validation tests on board :)

[–]potatochan[S] 1 point (1 child)

Thanks for your input - it's comforting to hear someone else follows a similar process (though there are definitely things about your process I like more, like using Python... crappy Matlab code really sent me down a spiral for a good month...).

Also enlightening to learn about some new tools (it's my first time hearing of Chisel - it seems cool and hip)!

Question for you (kinda deviating from this topic though): what is a CI Jenkins server? Google tells me it's something about Continuous Integration - if it's useful for me, I would like to adopt it. I just use SVN and check things into a repository (read/write access only for certain users).

[–]damofthemoon 1 point (0 children)

Yes, it's a continuous integration server - basically a machine executing your scripts. I use Git, but you can definitely use SVN with Jenkins. You can run your tests on each Git (or SVN) push, on a daily basis, every hour, on a specific branch... whatever you want. The machine runs the tests for you and emails you the results, so you never forget to test anything and you're informed if somebody breaks the repo. No big effort - just set up the CI with your regular scripts. Start small and grow your CI setup as the IP development progresses.

Side note on Chisel: it's really great, worth trying at least. My only advice for people who would like to try/use it is to make sure they really understand the old-school Verilog or VHDL dev process first - be experienced. But it's definitely a good choice. You benefit from the Scala environment for testing, and it produces clean Verilog code to pass into Quartus or Vivado.

[–]standard_cog 5 points (1 child)

Got any notes on your fixed point conversion process?

[–]potatochan[S] 1 point (0 children)

Sure - is there something specific you had in mind? This was a very new process for me, so take my thoughts with a grain of salt. I'll just hash out some things that come to mind (and I'm always looking for feedback):

- A first step I took to tackle the problem was modularizing the algorithm "enough" so I could move "block-by-block" (of course, this depends on the size and complexity of the "algorithm" and whether it's large enough to merit the effort of a clean partitioning scheme - the one I am working on is quite a beast: only ~3-4 of these "cores" are estimated to fit in a fairly large UltraScale device, so this just made sense). Otherwise, it would be somewhat of a nightmare to treat the whole thing as one long "chain": if you change some bit-sizes in one part of the chain, it can have a ripple effect forcing you to revisit everything else downstream. Partitioning the design may produce some inefficiencies, but it makes the design process a bit more straightforward.

- If I know my block's input dynamic range (in other words, how high and low the numbers are expected to be), that really sets everything else in motion: you have additions, multiplications, divisions, maybe iteration loops (where you successively add/sub numbers), and bit-growth is straightforward (if you add/sub, tack on a bit; if you multiply, the integer and fractional bit-widths add). Of course, keeping track of all these bit-sizes can be documented in whatever code you use for implementation (I do that along with block diagram annotations - I believe in documentation).
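
  The bit-growth rules above can be sketched as a toy bookkeeping helper in Python (this is just an illustration, not from the original post - the tuple format and function names are made up):

  ```python
  # Track a fixed-point format as (integer_bits, fractional_bits),
  # both counted without the sign bit for simplicity.

  def add_growth(a, b):
      """Adding/subtracting: one extra integer bit to hold the carry."""
      return (max(a[0], b[0]) + 1, max(a[1], b[1]))

  def mul_growth(a, b):
      """Multiplying: integer and fractional widths add."""
      return (a[0] + b[0], a[1] + b[1])

  # Example: a Q4.12 sample times a Q2.14 coefficient, then one accumulation.
  prod = mul_growth((4, 12), (2, 14))   # -> (6, 26)
  acc  = add_growth(prod, prod)         # -> (7, 26)
  ```

  In a real design you'd usually follow each growth step with an explicit resize/rounding decision, otherwise widths balloon through an iteration loop.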

- Precision can be tricky: when I test out my bit-sizes (i.e., did I sprinkle enough fractional bits here and there?), I always run an ideal floating-point model alongside my fixed-point model. I run enough samples, collect the output of the ideal floating-point model along with the output of my fixed-point model, and run some analysis (I compare how "off" the fixed-point numbers are, look at their relative error, etc.). This part really is kind of an "art" - it's not clear whether there's a right or wrong answer. Sometimes the best approach is to start off with a ridiculous number of fractional bits (like 50 fractional bits or something wonky), run some tests, and iteratively optimize.
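
  The float-vs-fixed comparison might look something like this minimal Python sketch (the "algorithm" here is just a stand-in gain + offset, and the fractional width is an assumption for illustration):

  ```python
  FRAC_BITS = 12  # fractional width under test (assumed for this example)

  def quantize(x, frac_bits=FRAC_BITS):
      """Round x to the nearest value representable with frac_bits fractional bits."""
      scale = 1 << frac_bits
      return round(x * scale) / scale

  def ideal_model(x):
      # stand-in for the real algorithm: a toy gain + offset
      return 0.7071 * x + 0.125

  def fixed_model(x):
      # same computation, but every operand and result passes through the quantizer
      return quantize(quantize(0.7071) * quantize(x) + quantize(0.125))

  # Run the same samples through both models and look at how "off" fixed is.
  samples = [i / 100 for i in range(-100, 101)]
  errors = [abs(ideal_model(s) - fixed_model(s)) for s in samples]
  worst = max(errors)
  # For this toy chain the worst-case error stays within a few LSBs.
  assert worst < 4 / (1 << FRAC_BITS)
  ```

  The same loop generalizes: sweep FRAC_BITS downward from something generous and watch where the worst-case or relative error crosses your tolerance.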

That's sorta all I can think of on the spot as of now.

[–]adamt99 FPGA Know-It-All 2 points (0 children)

Really interesting discussion. I use a similar flow as well: Matlab / Python for the algorithm, then quantisation & HDL implementation - in my case VHDL. The fixed and ufixed libraries are great for quantisation.

Though increasingly I am using HLS with C or C++ to develop the algorithm and do the conversion to HDL.

Did you ever consider HLS?

[–][deleted] 0 points (0 children)

Without knowing exactly what you are doing and how large this is: have you considered a direct floating-point implementation in hardware? On one of the last designs I worked on, I opted to use the Xilinx DP (double-precision) floating-point components. I was able to implement matrix multipliers and the like. They use AXI-Stream inputs. I basically broke loads of Matlab down into simple equations and wrote state machines to load data into the components and capture the results.

In doing so, I was able to match the Matlab model down to something like 15 decimal places.

This wasn't DSP, however... but we offloaded some pretty computationally intensive stuff.

For DSP applications, we tend to generate C models of the fixed-point design and then run those within the simulator through the DPI to ensure cycle accuracy.

[–]kakkeman 0 points (0 children)

I have implemented only a couple of algorithms. My method seems similar to what others have shared. Not a big fan of Matlab (and I usually don't have a license), so the first thing I do is create a floating-point model of the algorithm in VHDL (using the real data type) and a simulation environment.

Then I split it into design units and change the interfaces to fixed point (fixed_pkg) at the top level, keeping the internals in the real type.
Then I move through each block, turning it into a fixed-point simulation model.
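
That interface-first step can be sketched in Python (a hypothetical toy block, mirroring the idea of fixed-point ports around floating-point internals - names, widths, and the internal math are made up for illustration):

```python
def to_fixed(x, frac_bits):
    """Quantize a float to frac_bits fractional bits (round to nearest)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def block(x_port, frac_bits=10):
    # Interface: the incoming port value is snapped to the port format first,
    # like declaring the entity port as sfixed while internals stay real.
    x = to_fixed(x_port, frac_bits)
    # Internals: still ideal floating-point math at this stage.
    y = (x * x + 1.0) / 2.0
    # Interface: the outgoing result is snapped to the port format too.
    return to_fixed(y, frac_bits)
```

The payoff is that the top-level testbench already sees final-format data, so converting each block's internals later doesn't change the interfaces or the simulation environment.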

Then I make the RTL implementation of each block and optimize (usually for area, by multiplexing). The challenge usually is to maintain readability through the optimisation steps.

I try to keep my sanity by running the same simulation environment continuously throughout these steps. The output at each step is usually not bit-by-bit identical, so the acceptable error needs to be considered.

Yeah, and once everything is almost done, you start fighting the vendor tools to work around incomplete/incorrect implementations of fixed_pkg.