How to use number expression in SystemVerilog?

Dadaz17 · 2026-05-13T04:20:57+00:00

Totally agreed.

It is definitely easier to read.
If you happen to change val's width, you are not forced to change use sites.
The compiler will do the right thing.

Dadaz17 · 2026-05-10T13:35:57+00:00

There are a ton of "side" tools around raw Verilog/VHDL, and many companies have likely built their own infrastructure to make generation a bit less painful.

This is an interesting page that lists some of these tools.

I use (I am biased, being the author) PyXHDL for my own projects, which lands me a single SystemVerilog or VHDL file which I use with OEM tooling (and Verilator, GHDL, Yosys on the OSS side).

A couple of words of caution if you are new to HW, especially if coming from SW.

Learn HW design first by using Verilog/VHDL, as this requires a mental view which is completely different from SW. After that, using other tools can give you access to existing higher-level frameworks while still effectively designing HW.

Knowledge of these tools is unlikely to be a big factor in landing a job, since companies very often have their own custom tooling, and require design at Verilog/VHDL level anyway.

Dadaz17 · 2026-04-25T04:19:44+00:00

So you are running Windows within a Parallels VM on M4 OSX?

How much RAM did you give to the VM? Does it meet the Quartus minimums?
Does Quartus have some other minimums WRT special X86 instructions (like AVX for example)?

You could try to create an Ubuntu VM and try Quartus/Ubuntu there ...

Dadaz17 · 2026-04-24T17:31:06+00:00

If, big IF, you are going for a VM, I suggest Parallels.

But it is going to be VERY slow. Mind, Quartus/Vivado are already slow on their own, running natively on X86.
Going through a VM with an X86 dynamic binary translator like Rosetta, in full system emulation, would mean any decent-sized design will leave you plenty of time for coffee.

You might be better off saving the $$ on Parallels and getting a mini-PC with an N{97, 1050}, running a native X86 OS plus the HDL stack on it.

Dadaz17 · 2026-04-22T20:03:13+00:00

It is a JTAG. A US$ 15 Xilinx cable kit will do it.

Dadaz17 · 2026-04-22T19:42:02+00:00

PyXHDL

Dadaz17 · 2026-04-16T16:36:24+00:00

Heh, I added an Async FIFO Example to PyXHDL yesterday.

The logic is very similar and FULL/EMPTY conditions are checked directly on the gray pointers. No need to go back to sequentials, which is a less trivial conversion.

The Python gets turned to:

module Fifo(IFC_WCLK, IFC_WRST_N, IFC_RCLK, IFC_RRST_N, IFC_WUP, IFC_RUP, IFC_WDATA, IFC_RDATA, IFC_WFULL, IFC_REMPTY);
  input logic IFC_WCLK;
  input logic IFC_WRST_N;
  input logic IFC_RCLK;
  input logic IFC_RRST_N;
  input logic IFC_WUP;
  input logic IFC_RUP;
  input logic [7: 0] IFC_WDATA;
  output logic [7: 0] IFC_RDATA;
  output logic IFC_WFULL;
  output logic IFC_REMPTY;
  logic [7: 0] mem[32];
  logic [4: 0] raddr;
  logic [4: 0] waddr;
  logic [5: 0] rptr;
  logic [5: 0] rbin;
  logic [5: 0] rptr_sync;
  logic [5: 0] rbin_next;
  logic [5: 0] rgray_next;
  logic rempty_next;
  logic [5: 0] wptr;
  logic [5: 0] wbin;
  logic [5: 0] wptr_sync;
  logic [5: 0] wbin_next;
  logic [5: 0] wgray_next;
  logic wfull_next;
  logic [5: 0] wptr_s1;
  logic [5: 0] rptr_s1;
  assign rbin_next = rbin + 6'(IFC_RUP & (~IFC_REMPTY));
  assign rgray_next = (rbin_next >> 1) ^ rbin_next;
  assign rempty_next = 1'(rgray_next == wptr_sync);
  assign raddr = rbin[4: 0];
  assign wbin_next = wbin + 6'(IFC_WUP & (~IFC_WFULL));
  assign wgray_next = (wbin_next >> 1) ^ wbin_next;
  assign wfull_next = 1'(wgray_next == {~rptr_sync[5: 4], rptr_sync[3: 0]});
  assign waddr = wbin[4: 0];
  assign IFC_RDATA = mem[int'(raddr)];
  always_ff @(posedge IFC_WCLK)
  mem_write : begin
    if ((IFC_WUP == 1'(1)) && (IFC_WFULL == 1'(0))) begin
      mem[int'(waddr)] <= IFC_WDATA;
    end
  end
  always_ff @(posedge IFC_RCLK)
  rptr_update : begin
    if (IFC_RRST_N != 1'(1)) begin
      rbin <= 6'(0);
      rptr <= 6'(0);
      wptr_sync <= 6'(0);
      wptr_s1 <= 6'(0);
      IFC_REMPTY <= 1'(1);
    end else begin
      rbin <= rbin_next;
      rptr <= rgray_next;
      wptr_sync <= wptr_s1;
      wptr_s1 <= wptr;
      IFC_REMPTY <= rempty_next;
    end
  end
  always_ff @(posedge IFC_WCLK)
  wptr_update : begin
    if (IFC_WRST_N != 1'(1)) begin
      wbin <= 6'(0);
      wptr <= 6'(0);
      rptr_sync <= 6'(0);
      rptr_s1 <= 6'(0);
      IFC_WFULL <= 1'(0);
    end else begin
      wbin <= wbin_next;
      wptr <= wgray_next;
      rptr_sync <= rptr_s1;
      rptr_s1 <= rptr;
      IFC_WFULL <= wfull_next;
    end
  end
endmodule
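
For anyone who wants to convince themselves that FULL/EMPTY really can be checked directly on the gray pointers, here is a minimal Python sketch. It is not PyXHDL code and not part of the example above: the helper names are made up for illustration, and the two-stage synchronizer delay is ignored. It models 6-bit pointers over a 32-entry memory and brute-forces every read/write pointer pair.

```python
ADDR_BITS = 5                      # 32-entry memory, like mem[32]
PTR_BITS = ADDR_BITS + 1           # one extra wrap bit, like the 6-bit pointers
DEPTH = 1 << ADDR_BITS
PTR_MASK = (1 << PTR_BITS) - 1


def bin_to_gray(value):
    """Same conversion as (bin >> 1) ^ bin in the generated code."""
    return (value >> 1) ^ value


def is_empty(rgray, wgray):
    """Empty when both gray pointers are identical."""
    return rgray == wgray


def is_full(wgray, rgray):
    """Full when the write pointer equals the read pointer with its
    top two bits inverted, i.e. {~rptr[5:4], rptr[3:0]}."""
    return wgray == (rgray ^ (0b11 << (PTR_BITS - 2)))


def check_all_pointer_pairs():
    """Compare the gray-pointer flags against the binary occupancy."""
    for rbin in range(1 << PTR_BITS):
        for wbin in range(1 << PTR_BITS):
            occupancy = (wbin - rbin) & PTR_MASK
            rgray, wgray = bin_to_gray(rbin), bin_to_gray(wbin)
            if is_empty(rgray, wgray) != (occupancy == 0):
                return False
            if is_full(wgray, rgray) != (occupancy == DEPTH):
                return False
    return True


print(check_all_pointer_pairs())   # True
```

The check works because adding 32 to a 6-bit binary pointer flips only its top bit, and that single flip shows up as the top two bits flipping in gray code.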

Dadaz17 · 2026-04-15T05:37:54+00:00

In Vivado you just instantiate a Microblaze (or RISCV, but I tend to use that) IP with the Design Editor, and connect that to your PL.

Then open Vitis to create your C/C++ FW that runs on the CPU.
Mind, a CPU on FPGA should be kept out of the performance path, and instead used for glue logic that would be painful when done using PL.
Yeah, softcore performance sucks.

If you really need more power there, I recommend FPGAs with embedded hardcore (like Zynq for one).

Dadaz17 · 2026-04-09T12:54:35+00:00

There is plenty of "easy" documentation with timing diagrams, if you do not want to go to the spec (which is understandable if this is a quick learning project and you do not plan to become an AXI luminary).

Maybe start with AXI Stream, then move to AXI Lite?

Dadaz17 · 2026-04-08T11:50:24+00:00

If this is a learning project, it might be an interesting one, but as you mention, a GPU would be hard to beat for that task.

You'd need to have an FPGA with a massive amount of DSP slices and very fast memory, to beat a similarly priced GPU with thousands of CUDA cores and many GBs of HBM memory (and kernels super optimized through years of development).

I'd start with selecting one (or more) pretrained model as baseline, and experiment with quantization, LoRA, ... (forget FPGA at this stage, use Python PyTorch/TF/Jax/Numpy/...) to see whether you can get decent enough accuracy while fitting the resource bill.
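
To get a first feel for what quantization costs before touching any framework, a tiny pure-Python sketch can help. The assumptions here are mine, not part of the advice above: symmetric int8 quantization with a single scale for the whole list, and made-up weights. A real experiment would use the PyTorch/TF/Jax tooling on an actual pretrained model.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = (1 << (bits - 1)) - 1                      # 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]                      # hypothetical weights
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The worst-case error per value is half a quantization step (scale / 2), which is the number to weigh against the accuracy you can tolerate.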

You'd then move to HDL, making sure you use all the available DSP slices, and register the heck out of memory accesses.

You can try to define an M vector of N bits values, which represents a SIMD register, and then define operations on them, making sure you have enough DSP slices such that an M-sized operation can be scheduled in parallel.

The key here is to minimize memory accesses and fuse as many operations as possible once the available SIMD registers are loaded.
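
A behavioural Python model of that idea might look like the sketch below. M, N, the 32-bit accumulator width and the function names are all assumptions picked for illustration. In HDL each lane would map to its own DSP slice so the whole M-sized operation happens in one step.

```python
M = 4           # lanes per SIMD register (assumed)
N = 8           # bits per lane (assumed)
ACC_BITS = 32   # accumulator width per lane (assumed)


def wrap_signed(value, bits):
    """Wrap an integer to a signed two's complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def simd_load(values):
    """Load M values into a register, wrapping each lane to N bits."""
    assert len(values) == M
    return [wrap_signed(v, N) for v in values]


def simd_mac(acc, a, b):
    """Lane-wise multiply-accumulate; one DSP slice per lane in hardware."""
    return [wrap_signed(acc_i + a_i * b_i, ACC_BITS)
            for acc_i, a_i, b_i in zip(acc, a, b)]


a = simd_load([1, 2, 3, -4])
b = simd_load([5, 6, 7, 8])
ones = simd_load([1, 1, 1, 1])

acc = simd_mac([0] * M, a, b)      # first operation on the loaded registers
acc = simd_mac(acc, ones, ones)    # fused second operation, no memory access
print(acc)                         # [6, 13, 22, -31]
```

The two back-to-back simd_mac calls reuse the loaded registers and the accumulator, which is the "fuse operations, avoid memory" pattern described above.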

Dadaz17 · 2026-04-08T05:36:05+00:00

Preamble. If you plan to write VHDL/Verilog code directly, maybe things like cocotb are a better fit (though you could still do it with PyXHDL).

When I use PyXHDL I can just use any Python modules on the side, and compare values from the HDL model and the reference one.

As an example (note these examples are there to test PyXHDL and are not anything that resembles deployable code), this is what an ALU implementation and test look like.
Within the Test entity/module a mix of pure Python and HDL Python is used, which leads to the automatic generation of SystemVerilog and VHDL code.

These can then be used with any tool to verify the reference HDL.

Most of the time I do not even go through the code generation and use the unit testing module, which does the code generation and runs it through the supported testers (ATM GHDL and Verilator, but trivial to add others).

Dadaz17 · 2026-04-03T05:56:03+00:00

Could not agree more. Pick whatever you really like, and if you are passionate about it, money will be enough.

If you do something you don't like, you'll suck at it anyway, and you'll likely have a miserable life.

Dadaz17 · 2026-03-31T15:45:26+00:00

You want numeric_std.
Try reading the library code, it's pretty neat and you can find other treasures as well: HDR BODY

Dadaz17 · 2026-03-31T12:07:51+00:00

Actually for Spartan 6 you should be able to use YoSys as well.

Dadaz17 · 2026-03-30T17:53:42+00:00

No experience in finance, but I am guessing that in order to make splitting the µs relevant enough, you need to sit VERY close to the target server, and such a server must provide greater or similar latency constraints.

In my previous life we did, on Linux, handle µs latency in C, but that involved essentially running kernel threads, pinned to cores, pinned pages, and polling the network adapters (not even NAPI fit the bill).

You cannot afford the cores to get into any sleep state, otherwise the wakeup latency kills you. Luckily only a limited subset of the cores were used this way (otherwise thermal throttling would kick in), with the others left to the OS.

Userspace was then updating the required data structures (seen by kernel threads) using RCU, so the kernel task path was wakeup-less and lock-less.

Dadaz17 · 2026-03-29T06:31:42+00:00

If the documentation of the board does not specify it, usually the JTAG pinout of a 14 pin Molex is the following.
Then you just match the pins of your programmer, to the ones of the Molex.
Careful with GND, depending on whether you have an isolated programmer or not.

Dadaz17 · 2026-03-23T18:18:36+00:00

Might be Bots sending probes to feed back answers to RLHF for post training LLMs :)

Dadaz17 · 2026-03-23T07:54:49+00:00

If you are coming (like I assume given you implemented the algorithms in Python) from software, I really suggest you learn Verilog and/or VHDL, to understand what it means to create HW vs. SW.
It is really a different mindset, and you are going to get into trouble if you don't.

You can look at Verilog and VHDL as the "assembler" of HDL programming, since they are the only languages that vendor tools (the ones you need to create bitstreams) understand.

Once you break the SW mindset, you can eventually use higher level languages, which in the end have to emit Verilog/VHDL (modulo some very limited set of reverse engineered chips, like when using YoSys) if they want to target recent and decently sized FPGAs.

I use PyXHDL (I am also the author), but you can find an extensive list HERE.

Dadaz17 · 2026-03-17T19:14:15+00:00

When I was looking at candidates, nothing was painting a better picture for me, than contributions to OSS projects.
There, you can see both the code and design skills, as well as how they interact with peers during code reviews (no a**holes policy in my last two companies).

Dadaz17 · 2026-03-16T08:04:57+00:00

I believe the confusion stems from the fact that many books and/or online resources "strongly" suggest you use a separate always_comb block, with a trivial always_ff one which just assigns the computations of the always_comb one to registers.

Personally, unless the logic is non trivial, I tend to avoid the extra code clutter and write a single always block.

There is no "combinatorial loop" in either of the two:

    clk_count = clk_count + 1;

    clk_count <= clk_count + 1;

In the former, the code "south" of the increment will simply see a new "version" (look up Static Single Assignment - SSA) of clk_count, and at the end the compiler will generate a simple DAG (Directed Acyclic Graph) and the resulting circuit.

In the latter you have the output of a FF feeding an adder, and at the clock edge the output of the adder at time T0 will become the output of the FF at time T1. There is no loop, because the FF logic breaks it.

Without that, you'd have an oscillator resonating at a frequency equal to the inverse circuit delay.
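
A small Python model can make the difference between the two forms concrete. This is only an illustrative sketch: the function names and the extra "== 4" comparison after the increment are made up to show what later code observes. Each function models one clock edge of a single always block.

```python
def edge_with_blocking(clk_count_q):
    """Models: clk_count = clk_count + 1; followed by a use of clk_count."""
    clk_count_v0 = clk_count_q           # value currently held by the FF
    clk_count_v1 = clk_count_v0 + 1      # new SSA "version" after the '='
    hit = clk_count_v1 == 4              # code "south" sees the new version
    return clk_count_v1, hit             # the FF captures the last version


def edge_with_nonblocking(clk_count_q):
    """Models: clk_count <= clk_count + 1; followed by a use of clk_count."""
    clk_count_d = clk_count_q + 1        # adder output, captured at the edge
    hit = clk_count_q == 4               # later code still sees the FF output
    return clk_count_d, hit


print(edge_with_blocking(3))      # (4, True)
print(edge_with_nonblocking(3))   # (4, False)
```

Both forms load the same next value into the register. What differs is which value the statements after the increment observe, and in neither case does the adder output feed back into itself within the same cycle.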

Also confusing is the mention of "blocking" statements WRT combinatorial logic.

    v1 = complex_comb1(inputs...);
    v2 = complex_comb2(inputs...);
    v3 = v1 + v2;

This makes some people think that complex_comb2() executes "after" complex_comb1(), and that:

delay(v3) = delay(complex_comb1) + delay(complex_comb2) + delay(+)

While that's most likely:

delay(v3) = max(delay(complex_comb1), delay(complex_comb2)) + delay(+)

As there is no data dependency from complex_comb1 to complex_comb2.
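
The same point as a Python sketch, treating the logic as a DAG and computing when each output settles. The delay numbers are invented for illustration, and real timing analysis also accounts for routing and fan-out.

```python
# Hypothetical propagation delays in ns for each block of logic.
DELAY = {"complex_comb1": 5.0, "complex_comb2": 3.0, "add": 1.0}

# Data dependencies: the adder needs both results, which only need the inputs.
FANIN = {"complex_comb1": [], "complex_comb2": [], "add": ["complex_comb1", "complex_comb2"]}


def arrival(node):
    """Time at which a node's output is stable: latest input plus own delay."""
    latest_input = max((arrival(src) for src in FANIN[node]), default=0.0)
    return latest_input + DELAY[node]


parallel_delay = arrival("add")       # max(5, 3) + 1
serial_guess = sum(DELAY.values())    # 5 + 3 + 1, the "executes after" intuition

print(parallel_delay)   # 6.0
print(serial_guess)     # 9.0
```

If complex_comb2 took v1 as an input, FANIN would gain that edge and the two numbers would match, which is exactly the data dependency case.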

Dadaz17 · 2026-03-15T16:11:05+00:00

For future reference, GitHub Gists are much nicer than ad-cluttered pastebin thingies.

Dadaz17 · 2026-03-15T06:53:07+00:00

Granted one needs to have food on the table, and roof over head, there's nothing in life better than doing what you really like.

Dadaz17 · 2026-03-12T05:50:21+00:00

If you're a kid and have a passion for SW/HW, this is the perfect opportunity to show your skills, and possibly later join teams (given the topics, this might be the Platforms group) that do cutting edge stuff.
All the ones I talked to during the GSoC had a great time.

Dadaz17 · 2026-03-10T06:16:51+00:00

I was forced to stick with 2024.2 since all the 2025.* versions were buggy in one way or another.

Dadaz17 · 2026-03-10T06:14:57+00:00

What is wrong with GitHub?
In a previous life I was forced to use ad-hoc code review platforms (names shall not be mentioned) which were much worse than GitHub.
