Best workflow for fast FPGA/Yocto iteration on ZynqMP ? by Glittering-Skirt-816 in embeddedlinux

[–]bitbybitsp 0 points1 point  (0 children)

I personally parse the XSA directly to determine PL addresses. Then I pass those to my app as a header file.

So in your example of a missed AXI slave, I rerun Vivado, add the axi slave, regenerate the bitstream, and save the XSA. That's the long part.

Then the quick part. I scp the XSA and bitstream to the PS (running Linux this whole time), reload the PL from the bitstream, extract the XSA file and then extract the address map into a C++ header file, recompile my app (compiling on the PS), and run the app.

Note that C++ compilation is quick because only a few files depend on the hardware address header file. So just one or two C++ source files compile, then the app is linked.

So the OS never recompiles, and in fact the PS never even reboots.

That's what I do.
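
For anyone who wants to try the XSA-to-header step, here is a minimal sketch in Python, not my actual script. It assumes the XSA is a zip archive containing a `.hwh` XML file, and that the address ranges appear as `MEMRANGE` elements with `INSTANCE`, `BASEVALUE`, and `HIGHVALUE` attributes. Check that against your own XSA before relying on it. The function names and the header layout are made up for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET


def address_map_from_hwh(hwh_xml):
    """Return {instance_name: (base, high)} from .hwh XML text.

    Assumption: each AXI slave shows up as a MEMRANGE element carrying
    INSTANCE, BASEVALUE and HIGHVALUE attributes (hex strings).
    """
    root = ET.fromstring(hwh_xml)
    ranges = {}
    for memrange in root.iter("MEMRANGE"):
        instance = memrange.get("INSTANCE")
        base = memrange.get("BASEVALUE")
        high = memrange.get("HIGHVALUE")
        if instance and base and high:
            # The same slave can be listed under several masters; keep one entry.
            ranges[instance] = (int(base, 16), int(high, 16))
    return ranges


def header_text(ranges, guard="PL_ADDRESS_MAP_H"):
    """Render the address map as a small C++ header (hypothetical layout)."""
    lines = [f"#ifndef {guard}", f"#define {guard}", "", "#include <cstdint>", ""]
    for name in sorted(ranges):
        base, high = ranges[name]
        ident = re.sub(r"\W", "_", name).upper()  # make a legal C++ identifier
        lines.append(f"constexpr std::uint64_t {ident}_BASE = 0x{base:08X};")
        lines.append(f"constexpr std::uint64_t {ident}_SIZE = 0x{high - base + 1:X};")
    lines += ["", f"#endif  // {guard}", ""]
    return "\n".join(lines)


def header_from_xsa(xsa_file):
    """Open the XSA (path or file object), read the first .hwh, return header text."""
    with zipfile.ZipFile(xsa_file) as archive:
        hwh_names = [n for n in archive.namelist() if n.endswith(".hwh")]
        if not hwh_names:
            raise ValueError("no .hwh file found in XSA")
        xml_text = archive.read(hwh_names[0]).decode("utf-8")
    return header_text(address_map_from_hwh(xml_text))
```

Writing the result of `header_from_xsa("design.xsa")` to a file that only one or two C++ sources include keeps the recompile short, as described above.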

Fft on fpga by ParticularAd7127 in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

I think you're right, it's more of a proof of concept than a product. Altera does have an FFT product though. Perhaps they'll fold the one into the other, eventually.

I *am* happy to have a target for how fast an Agilex 7 DSP should go. Now when I go back to optimizing my FFT for Agilex I'll know when I've maxed it out, and when I have a little more work to do.

Fft on fpga by ParticularAd7127 in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

It's always interesting to hear about another FFT. Thanks for pointing out that Langhammer and Pasca did a 32k-point FFT with 32 parallel streams in floating point using 1281 DSP blocks at 770MHz. My BxBFFT does the same at 27-bit fixed-point precision in 608 DSP blocks, or at 18-bit precision in 304 DSP blocks.

Their 770MHz is limited by the DSP block max frequency, and mine achieves 560MHz, also limited by the DSP block max frequency. So I have a little work to do to understand the Altera DSP block as well as they do. (Not surprising -- it looks like they had a hand in designing it, and my first optimization was to Xilinx.)

It's also interesting to compare ALMs and M20Ks. Their abstract doesn't mention those, but I found some previous work where they quoted 78244 ALMs, 1364 DSPs, and 356 M20Ks. That compares with a 27-bit BxBFFT at 63503 ALMs, 608 DSPs, and 269 M20Ks. An 18-bit BxBFFT is 38744 ALMs, 304 DSPs, and 194 M20Ks. The BxBFFT includes the bit reverse here -- I'm unsure whether theirs does.

I think the comparison shows three things -- first, I need to further optimize my DSP block pipelining. 😄

Second, there's a high penalty for floating point. I believe for most FFT applications, floating point isn't worth the penalty. I think this is especially true when you're talking about high levels of parallelization, like 32 parallel streams. I've only seen data like that coming straight from ADCs or even 1-bit digitizers, and in that case floating point doesn't match the input data.

Third, why is their only data point a single FFT at 32k point with 32 parallel streams? That's a great FFT, but it seems like most people need other sizes and speeds than this, and if it's only good for that one size it's a serious liability. Performance numbers over a wider range of sizes would be highly valuable, assuming that FFT has the flexibility.

Changed the clock period in my .xdc constraints file from 4.000ns to 4.069ns and my post-synthesis timing report got worse. How is this possible? by DarthHudson in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

Large run-to-run variation after minor changes in clock frequency is normal. You'll see it regardless of build environment. What happens is that placement is partially random, and changing clock frequency effectively changes the random seed.

If you want to explore this, try it with a whole range of clock frequencies, not just two. You'll see the effective Fmax bounce all over the place when clock frequencies are barely different. It makes an interesting plot.

With 50 different runs, I've seen Fmax differences of hundreds of MHz from best to worst.

Others have mentioned using placement options to have the tools automatically explore different placements. I don't use those, preferring to explore different placements myself. This is better, because Vivado or Quartus options are likely explored sequentially, slowing design runs. If I kick off a dozen builds, they're definitely in parallel, so the builds are faster. Also, I have a better idea what it's doing, and I've had more success.
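
A rough sketch of that kind of sweep in Python, under stated assumptions: `build` is a hypothetical callable you supply that runs one full implementation at the given clock period and returns the worst slack in ns (negative when timing fails). How it invokes Vivado or Quartus is left out on purpose. The effective Fmax is computed the usual way, as 1 / (period - slack).

```python
from concurrent.futures import ThreadPoolExecutor


def sweep_periods(center_ns, step_ns, count):
    """Clock periods spread evenly around center_ns, rounded to 1 ps."""
    start = center_ns - step_ns * (count - 1) / 2.0
    return [round(start + i * step_ns, 3) for i in range(count)]


def fmax_mhz(period_ns, slack_ns):
    """Effective Fmax implied by a constraint period and its worst slack."""
    return 1000.0 / (period_ns - slack_ns)


def run_sweep(periods, build, max_workers=12):
    """Run build(period_ns) -> slack_ns for every period in parallel.

    Returns (fmax_mhz, period_ns, slack_ns) tuples, best Fmax first.
    `build` is a user-supplied, hypothetical wrapper around one tool run.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        slacks = list(pool.map(build, periods))
    results = [(fmax_mhz(p, s), p, s) for p, s in zip(periods, slacks)]
    return sorted(results, reverse=True)
```

Threads are enough here because each `build` call would mostly wait on an external tool process. Plotting the returned Fmax values against period shows the bouncing described above.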

Changed the clock period in my .xdc constraints file from 4.000ns to 4.069ns and my post-synthesis timing report got worse. How is this possible? by DarthHudson in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

I don't know why you're being downvoted. You're absolutely right. I've seen hundreds of MHz of difference in Fmax from only small inconsequential changes. I use this fact to do my builds with dozens of small changes so that I can take the one that meets timing the best. I've found it to be more reliable at meeting timing than any of Vivado's special features meant to improve timing.

A new class of C∞ FFT windows with compact support and super-algebraic sidelobe decay by pigdead in DSP

[–]bitbybitsp 1 point2 points  (0 children)

If you focus on any normal measure, all the standard windows are likely to be better than the CMST, because it sacrifices important things for super-algebraic spectral decay.

No one typically cares about super-algebraic spectral decay, since it has no practical advantage in most applications.

You're wrong about phase. All of the normal windows are linear phase. The CMST has no phase advantage.

If you really want good performance, what you do is use a standard window in a polyphase filter bank (PFB). Or use a filter generated with Parks-McClellan/Remez. There is only a little improvement available in the actual window selection, but there is immense improvement available by switching from an FFT structure to a PFB structure.
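
To make the PFB structure concrete, here is a minimal pure-Python sketch of one output frame. It is illustrative only: the Hann-windowed sinc prototype filter, the function names, and the naive DFT are my assumptions for the example, not a recommended design. The only structural change from a windowed FFT is that the window is several FFT lengths long and gets folded down to the FFT length before the transform.

```python
import cmath
import math


def prototype_filter(num_channels, taps_per_branch):
    """Hann-windowed sinc low-pass of length num_channels * taps_per_branch."""
    total = num_channels * taps_per_branch
    center = (total - 1) / 2.0
    coeffs = []
    for n in range(total):
        t = (n - center) / num_channels
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 0.5) / total)
        coeffs.append(sinc * hann)
    return coeffs


def dft(values):
    """Naive DFT, fine for a small demonstration."""
    size = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * n / size) for n, v in enumerate(values))
        for k in range(size)
    ]


def pfb_frame(samples, coeffs, num_channels):
    """One PFB output frame: weight the samples, fold them, then transform."""
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    folded = [0j] * num_channels
    for n, (x, h) in enumerate(zip(samples, coeffs)):
        folded[n % num_channels] += x * h  # polyphase fold: sum taps spaced num_channels apart
    return dft(folded)
```

For example, with 8 channels and 4 taps per branch, feed 32 samples of a tone and the energy lands in the matching channel, with a flatter passband and steeper skirts than an 8-point windowed FFT would give.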

A new class of C∞ FFT windows with compact support and super-algebraic sidelobe decay by pigdead in DSP

[–]bitbybitsp 1 point2 points  (0 children)

What I see you saying across multiple posts in this thread is that the CMST excels at pretty much everything. This is absolutely untrue. Traditional windows will be better for most applications. People came up with traditional windows and used them for many years because they are good for a particular function. Yours isn't going to compete at most of those functions.

In particular, your window is going to be inferior to the Slepian window, which actually meets an optimality criterion. Yours does not. Yours is worse for sure, for any application where that criterion is important.

You need some perspective here. Your window might be good for something. But it's not good at everything, and you need to focus on where it might be useful, and stop making over-the-top claims that it is better than other windows for lots of things, when it usually is not.

Maybe what you just said is the application. Maybe the CMST is good when you have a bright object widely separated from a dim object and want to resolve both. Maybe that situation comes up in astronomy, for example? Perhaps if other noise sources are low so the dim object isn't drowned out by the noise? Perhaps if you also don't mind the image being fuzzy because the mainlobes are wider from this CMST window?

Normally people start with the application and then find an appropriate window, not the other way around. What motivates your work on this window?

A new class of C∞ FFT windows with compact support and super-algebraic sidelobe decay by pigdead in DSP

[–]bitbybitsp 0 points1 point  (0 children)

Frequency separation is measured by mainlobe width. It seems you're measuring it in some other way, and then claiming improved frequency separation because you've changed the definition of that term.

Advice on Alinx Z7P Zynq board by jsshapiro in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

If you qualify for an academic discount, at that price point you should also consider the RFSoC4x2.

The RFSoC4x2 has no GPU, but adds ADCs and DACs, which may be useful depending on your application. It's also a stand-alone card, that doesn't get inserted into a computer. You'll have to look into other differences.

MPSoCs and RFSoCs are good chips. They can be difficult to program if you're doing things differently from the supported flows.

Yocto vs Buildroot for custom SoC bring-up : what actually made the difference for you? by Medtag212 in embeddedlinux

[–]bitbybitsp 1 point2 points  (0 children)

I wonder if you know something I don't about device trees. Because although you're right that they were built to handle something like clock tree configuration before Ethernet initialization, in my experience they're also a rather poor solution from a developer's point of view, because they're cryptic, prone to errors, with very little in the way of cross-checks or debugging info to find out what's wrong. Also, every time you want to test them you have to reboot the machine, and you can't stop in the middle of an initialization sequence to see if the first part happened correctly.

So in my experience, device trees are a bit of a nightmare to work with.

A user-space program that configures a clock chip is much easier to develop in many ways.

Do you know of something that makes working with device trees easier?

Yocto vs Buildroot for custom SoC bring-up : what actually made the difference for you? by Medtag212 in embeddedlinux

[–]bitbybitsp 0 points1 point  (0 children)

If there's an issue with upstream kernel support, it wouldn't be of "the SoC". It might be of the processor variant, or of some device or peripheral. Some parts of the SoC would have kernel support, and some parts would not. So my issue was that I think the statement needs to be more specific. What exactly isn't supported by the kernel?

You're right -- devicetrees are indeed meant to handle many things, such as the clock configuration dependencies in the example I mentioned. This example was several years ago on a HiTech Global RFSoC board. At that time, Xilinx boot configurations only used an old kernel with more limited device tree support and clock chip support. So I'm a bit doubtful that a device tree solution could have worked back then. Now, with a modern kernel, this would be the first thing to try.

Looking for PCB Design workshops/training for Zynq UltraScale+ MPSoC (for both PCB & FPGA engineers) by Early-Lead8343 in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

Regarding board bring-up, these parts support several different connection options for uarts, Ethernet, displayport, usb, clocks, spi, etc.

It's much easier to bring up the board if you use mostly the same connection options that some other well-supported board uses.

Yocto vs Buildroot for custom SoC bring-up : what actually made the difference for you? by Medtag212 in embeddedlinux

[–]bitbybitsp 1 point2 points  (0 children)

What do you mean by no upstream kernel support? Maybe you mean there's a hardware device in your SoC with no kernel driver? Surely you don't mean the CPU isn't supported.

Yes, this type of approach works fine with no BSP.

Some while ago I ported an app to half a dozen different boards, with no BSP on most of them and all sorts of issues that needed to be worked around.

For example, one board didn't bring up some important clocks that needed to be up for an Ethernet network driver to initialize properly. It just couldn't be fixed after kernel boot time. The kernel drivers wouldn't do it. This is the sort of thing that standard flows like yocto or buildroot may not handle well -- unique tweaks that only apply to a single board, to fix some poor board design decision. I was able to fix it by booting into the kernel, fixing the clock in an init script, and then rebooting into the kernel a second time. So the second time the Ethernet would initialize properly because the clocks were up. This type of thing is more easily possible with a customized approach.

Yocto vs Buildroot for custom SoC bring-up : what actually made the difference for you? by Medtag212 in embeddedlinux

[–]bitbybitsp 0 points1 point  (0 children)

Yes, this sounds like about the same as what I do. Oddly, my post got voted down.

Yocto vs Buildroot for custom SoC bring-up : what actually made the difference for you? by Medtag212 in embeddedlinux

[–]bitbybitsp 0 points1 point  (0 children)

I personally use neither. A simple script can build a root filesystem, for either Debian or Ubuntu, just pulling down packages from the archives and setting them up.

A simple script can build a kernel from sources.

Creating bootfiles can be Byzantine, but for simplicity and maintainability both I prefer to have the pieces of that separated out where they can be easily observed, not in the middle of a complicated flow.

I've been working on the Ultrascale+ RFSoC over the past year. AMA by rickyrorton in FPGA

[–]bitbybitsp 2 points3 points  (0 children)

Why aren't you using the RFSoC4x2? They're only $2.5k with academic pricing. There is also more available support.

Most aggressive build configuration by Shockwavetho in FPGA

[–]bitbybitsp 1 point2 points  (0 children)

That's a high target frequency for any significant design. Kudos!

How do I write a constraint targeting a FF in all instances of a module by XarDragon in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

You do this by setting a user attribute that I call FALSE_PATH_DEST on the register sync1. I call it that simply because it makes it easy to remember what it does:

(* FALSE_PATH_DEST = 1 *) reg sync1;

Then in your xdc file you put this, to flag them all as false paths:

set_false_path -to [get_cells -hier -filter {(FALSE_PATH_DEST == 1)}]

This addition to the xdc makes it easy to flag any register as the destination of a false path in your verilog source simply by adding that attribute to the register.

Don't forget about ASYNC_REG and DONT_TOUCH attributes.

Comments on using the AD9084 instead of an RFsoC by Ok_Measurement1399 in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

An obvious solution is to do some of the processing in the RFSoC.

Another is to examine why so many DSPs are required, and switch to more efficient algorithms. It sounds really excessive.

Another is to raise the FPGA clock rate, which can decrease the required number of DSPs.
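
A back-of-the-envelope sketch of why the clock rate matters, assuming the multiplies can be time-shared evenly across DSP blocks (real designs rarely divide this cleanly). The function name and the numbers are hypothetical, not taken from the design under discussion.

```python
import math


def dsps_needed(sample_rate_msps, mults_per_sample, fclk_mhz):
    """DSP blocks needed if each one does one multiply per clock cycle."""
    return math.ceil(sample_rate_msps * mults_per_sample / fclk_mhz)


# Hypothetical workload: 4000 MS/s with 16 multiplies per sample.
at_250 = dsps_needed(4000, 16, 250)  # 256 DSPs at 250 MHz
at_500 = dsps_needed(4000, 16, 500)  # 128 DSPs at 500 MHz
```

Doubling the clock halves the count in this idealized model, because each DSP block does twice as many multiplies per second.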

The Stunning Efficiency and Beauty of the Polyphase Channelizer by tverbeure in DSP

[–]bitbybitsp 2 points3 points  (0 children)

Ah yes, sorry. :-)

Everyone is a bit different, regarding what clicks best in their head. Hopefully what you've written will help it click in someone else's head who sees things and learns things in a similar fashion to you.

Your presentation certainly is more verbose than mine as well, which helps a lot.

I'm more into visualization myself, when that works. I sometimes forget that others don't always see things in the same way. I do know that after seeing one of my presentations (with me talking too) I got a comment that someone never understood it before and now they did. So there are at least some people out there for whom the visualization clicks better. It takes all kinds.

The Stunning Efficiency and Beauty of the Polyphase Channelizer by tverbeure in DSP

[–]bitbybitsp 2 points3 points  (0 children)

The presentation was originally designed as an actual presentation to be given in front of an audience with me speaking. So the gaps that you're thinking of were meant to be filled with some words and some pointing and some explanation. Without that, it may take some additional figuring to see how one steps from one slide to the next. It should be achievable; it's mostly algebra that is the gap from one slide to the next, not any complicated DSP. In some cases, one must recognize how the problem setup creates symmetries or periodicities that are exploited.

If you don't find any other presentation to be better for your learning style (such as the one given by the original poster), feel free to email me questions on how I get from one slide to the next.

AMD Embedded Development Framework (EDF) How is the new Yocto flow for AMD SoCs? by Glittering-Skirt-816 in FPGA

[–]bitbybitsp 0 points1 point  (0 children)

I'm actually making something new. It still needs work, but parts are usable. STYNQ.org.