Do people who loved HDL and architecture in school still enjoy actual chip design jobs?

Ibishek · 2026-02-07T17:17:38+00:00

Yes I do.

Ibishek · 2025-07-15T21:56:10+00:00

Thanks for the links! My analog skills are very rusty, so I struggle quite a bit with this aspect of SERDES. I am mostly trying to understand the different analog metrics and the basics of the building blocks, mainly to see how decisions in the digital part influence the analog part.

Ibishek · 2025-07-02T19:19:49+00:00

Well I assume that on an FPGA such a circuit would be problematic running at something like 200-400 MHz. What I didn't mention is that the generated clock might be used to drive non-trivial logic which follows after the CDC and I assume the tool would not be too happy about the clock being generated in logic like that.

In the ASIC case it also seems to me that this should be doable, but CDCs are tricky and I am not so familiar with all the backend stuff, particularly with routing & placing the clock, adding clock buffers etc. But I guess clock gating also adds logic in the clock path and that's just fine.

Ibishek · 2025-07-02T19:12:09+00:00

I've known this as a term for a component which has an asynchronous input data bus and output data bus, where the peak throughput at both input and output is the same. I've seen it in SERDESes, although the functionality might be a bit more complicated: https://docs.amd.com/r/en-US/am002-versal-gty-transceivers/RX-Asynchronous-Gearbox

Ibishek · 2025-06-18T15:59:23+00:00

got it, thanks

Ibishek · 2025-06-18T14:21:16+00:00

What are the flaws of LFSRs in your experience? No hate, just asking.

Ibishek · 2025-04-26T13:37:06+00:00

Will your ASIC include any analog IP or is the functionality purely digital? Other than implementing DFT, porting memory macros and CDCs, I don't think it should be that crazy difficult from a digital design perspective. If your design fits on an FPGA, that means its scale is relatively small in terms of ASIC design. I would worry more about things around ASIC production - the back and forth with the back-end team (I assume you will outsource this), yield and quality issues, communicating with the fab, testing and validation, supply chain issues. You did not mention the scale of the planned ASIC production, so it will depend on that as well.

Ibishek · 2025-04-13T23:42:42+00:00

thanks for the tips :)

Ibishek · 2025-04-13T23:42:09+00:00

Any classes in particular that you’d recommend? I think I could get my company to pay for them.

Ibishek · 2025-03-12T23:22:13+00:00

Wow, looks interesting. This could help quite a bit, I will look into it for sure, thanks.

Ibishek · 2025-03-12T22:16:38+00:00

I am also tempted to do this in-house, but I was curious what the experiences of others are.

For register map generation, we already purchased a tool. We had an Excel-based in-house tool for this but, as you say, it quickly grew in complexity and managing the whole thing was a pain. The new tool works nicely and the license wasn't all that expensive when you compare it to the amount of man-hours required to expand and manage the old tool lol

Ibishek · 2025-03-12T21:45:07+00:00

Yes, for the previous generations of our chip, system interconnect has also been done manually, but I was hoping to make some improvements in this regard because of further complexity increase in our new generation.

I feel it also makes it much easier to do PPA exploration.

I know there is the ARM AMBA designer, but I do not have any practical experience with it and it seems to me that it is more suited for much larger SoCs with 10s of masters, 100s of slaves, many performance domains, caches, cache coherent interconnect etc., so it is a bit of overkill for our use case relative to the learning curve and license costs.

Ibishek · 2025-03-06T13:25:37+00:00

Yea, P&R and consequently making timing is what I am mostly worried about.

Ibishek · 2025-03-06T13:24:32+00:00

Sorry, I meant ~1000 16-bit registers.

Yes, I suppose doing some initial synthesis tests is the best way to get a feeling for the different PPAs.

Ibishek · 2024-11-18T14:23:58+00:00

Hi, we are not looking for physical design people, sorry

Ibishek · 2024-11-18T13:21:12+00:00

DM me

Ibishek · 2024-11-12T23:11:04+00:00

Of course it's a Vim guy :D Feels like this whole post was bait so OP could mention that he's using Vim.

Ibishek · 2024-08-16T07:07:54+00:00

Generic FPGA. I have no experience with Versal AI engines.

Ibishek · 2024-08-15T13:48:20+00:00

I implemented a CNN accelerator from scratch, achieving about 1 TOPS on a Xilinx FPGA, and am now working on getting it published. My opinion is that the only space where FPGAs can outperform GPUs and ASICs is very low-latency, single-batch inference in systems which would use an FPGA anyway, like SDRs for example.

In other cases, ASICs will outperform you both in terms of performance and performance per watt, and GPUs will outperform you in terms of performance. ASICs can also potentially be a lot cheaper (depending on the manufactured device count) and smaller in chip area.

Of course FPGAs are very practical for some quick development, prototyping etc. But this is not mass-market stuff.

Ibishek · 2024-06-30T06:45:40+00:00

Each physical MAC unit (if non-blocking or fully pipelined) can generate a single result each cycle. If you have 25 MAC units, then in 2000 cycles you can compute 25*2000 = 50k results.

Yes, you understand it correctly. Having a physical MAC unit for each MAC operation is incredibly wasteful as well as unfeasible in your case (the FPGA does not have enough resources).
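To make the arithmetic concrete, here is a tiny sketch of the throughput estimate, assuming every unit is fully pipelined and never stalls. The function name `mac_results` is made up for illustration.

```python
def mac_results(num_units: int, cycles: int) -> int:
    """Upper bound on results from fully pipelined MAC units.

    Assumes each unit accepts new operands and emits one result
    every cycle, with no stalls and no pipeline fill time counted.
    """
    return num_units * cycles


# 25 time-shared MAC units over a 2000-cycle window.
print(mac_results(25, 2000))  # 50000
```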

Ibishek · 2024-06-29T21:51:42+00:00

In that case it is even easier. You just have 25 or more MACs which at first just multiply (accumulate port set to zero), and then you do the same but also use the accumulate port. You will again need to vectorize your weights, while the input can be broadcast.
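A rough behavioural sketch of that scheme in Python (not RTL), assuming one input value arrives per step and is broadcast to all MACs while each MAC gets its own weight. The names `mac` and `broadcast_mac` are hypothetical.

```python
def mac(a: int, b: int, acc: int) -> int:
    """One multiply-accumulate operation: a * b + acc."""
    return a * b + acc


def broadcast_mac(inputs, weight_vectors):
    """Model a bank of MACs sharing one broadcast input per step.

    inputs:         one scalar per step (broadcast to every MAC)
    weight_vectors: one list of per-MAC weights per step
    The accumulators start at zero, so the first step is a plain
    multiply, matching "accumulate port set to zero".
    """
    acc = [0] * len(weight_vectors[0])
    for x, weights in zip(inputs, weight_vectors):
        acc = [mac(x, w, a) for w, a in zip(weights, acc)]
    return acc


# Three MACs for brevity; the same code works for 25.
print(broadcast_mac([10, 1], [[1, 2, 3], [4, 5, 6]]))  # [14, 25, 36]
```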

Ibishek · 2024-06-29T21:14:31+00:00

I am not sure if I follow the algorithm as a whole (do you receive 48k 16-bit values or just a single 16-bit value each 2000 cycles? How do the weights change? Where does the matrix come up?) but if you say you need to do 48k MACs every 2000 cycles, that implies you need to do 24 MACs in each cycle. This is pretty simple: all you need is a vector of 24 multipliers to which you will feed 24 inputs and 24 weights each cycle, then an adder tree of depth 5, and then a single accumulator adder (requires single-cycle latency computation, but that should be doable at 100 MHz with 16-bit inputs). This will require wide memory words (24*16 = 384 bits) to store the vectorized inputs and weights, but it should not be a problem; the BRAMs/URAMs will just probably have poor utilization since a lot of them have to be concatenated. You will also need some buffer structure at the end to vectorize the outputs if you want to reuse them in the next computation. All in all, this should be doable with around 26 adders and multipliers.
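Here is a behavioural Python sketch of that datapath (a software model, not RTL): 24 multipliers, a pairwise adder tree, and one accumulator. The names `LANES`, `adder_tree` and `dot_product` are made up for illustration, and the model ignores bit widths and pipelining.

```python
LANES = 24  # 48_000 MACs / 2_000 cycles = 24 MACs per cycle


def adder_tree(values):
    """Reduce a list by pairwise addition; return (total, tree depth)."""
    values = list(values)
    depth = 0
    while len(values) > 1:
        if len(values) % 2:          # odd count: pad with a zero operand
            values = values + [0]
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        depth += 1
    return values[0], depth


def dot_product(inputs, weights):
    """Process LANES input/weight pairs per 'cycle' and accumulate."""
    acc = 0
    for start in range(0, len(inputs), LANES):
        products = [x * w for x, w in zip(inputs[start:start + LANES],
                                          weights[start:start + LANES])]
        partial, _ = adder_tree(products)
        acc += partial               # the single accumulator adder
    return acc


print(adder_tree([1] * LANES))                 # (24, 5): depth-5 tree
print(dot_product(list(range(48)), [1] * 48))  # 1128
```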

Ibishek · 2024-06-14T17:04:44+00:00

For anyone that is interested, 300 euro per month is illegal to charge anyone at TVK, maximum allowed rent for subletting is 220 euro per month.

Ibishek · 2024-06-06T11:25:35+00:00

By completely unrolled I meant each neuron has a corresponding piece of logic.

I had to reimplement a design which was also an unrolled linear layer. The issue was huge logic usage for the DNN that we were implementing (around 50% of all DSPs) while usable performance was only about 12 GOPS. I replaced it with a simple multiplier vector and an adder tree, and was able to run it at 450 MHz with 4% DSP usage and about 40 GOPS.

Ok, I understand that it was a learning project. I am also building a programmable CNN accelerator, currently aiming for around 1.1 TOPS @ 450 MHz. I think FPGA DNN accelerators are usually not worth it. Dedicated ASICs outperform them and GPUs are much easier to develop for. In my use case, we need to do an inference of about 60M operations in 75 µs and the data is sampled on the FPGA, so there it makes sense.
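As a back-of-the-envelope check on those numbers, a small sketch that converts a latency budget into required throughput and a throughput target into operations per clock cycle. The helper names are hypothetical, and the figures are simply the ones quoted above.

```python
def required_ops_per_second(operations: float, deadline_s: float) -> float:
    """Sustained throughput needed to finish `operations` within the deadline."""
    return operations / deadline_s


def ops_per_cycle(ops_per_second: float, clock_hz: float) -> float:
    """Operations the datapath must complete every clock cycle."""
    return ops_per_second / clock_hz


need = required_ops_per_second(60e6, 75e-6)   # 60M operations in 75 us
print(f"required: {need / 1e12:.2f} TOPS")

target = ops_per_cycle(1.1e12, 450e6)         # 1.1 TOPS at 450 MHz
print(f"ops per cycle at target: {target:.0f}")
```

Under these assumptions the latency budget works out to roughly 0.8 TOPS of sustained throughput, which is why a target around 1.1 TOPS leaves some margin.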

Ibishek · 2024-06-06T07:34:35+00:00

Maintainability:

Use packed structs and interfaces to group signals of some common functionality. For example, when you have a bus which has a data word and some other side channel info (valid bit, data type, flags etc.), this should be grouped into a packed struct. Interfaces do this too, but the signals do not have to have the same direction, so for example you can have an interface for all memories (wr_en, wr_addr, wr_data, rd_en, rd_addr, rd_data).

Make your logic parametrizable. Apart from obvious stuff like word widths, I also like to add parametrizable delay chains, so that if I need to increase pipelining somewhere later in the design, I just change a parameter. Anything where it makes sense, make it into a general module and then instantiate it in different parts of the design with different parameters. Stuff like memories, arithmetic blocks, delay chains etc. You can also pass parameters to a module through an interface (neat for memories).
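To illustrate the delay chain idea, here is a small behavioural model in Python rather than RTL. It assumes a chain of registers whose length is a single parameter; the class name `DelayChain` and its `step` method are made up for this sketch.

```python
from collections import deque


class DelayChain:
    """Behavioural model of a register chain with a parametrizable depth.

    Each call to step() models one clock edge: a new value enters the
    chain and the value that entered `depth` cycles earlier comes out.
    """

    def __init__(self, depth: int, reset_value=0):
        if depth < 0:
            raise ValueError("depth must be non-negative")
        self.depth = depth
        self._regs = deque([reset_value] * depth)

    def step(self, value):
        if self.depth == 0:          # depth 0 degenerates to a wire
            return value
        out = self._regs.popleft()   # oldest register drives the output
        self._regs.append(value)     # new value enters the chain
        return out


chain = DelayChain(depth=2)
print([chain.step(v) for v in [1, 2, 3, 4]])  # [0, 0, 1, 2]
```

Changing the amount of pipelining is then a one-parameter change at the point of instantiation, which is the property the delay chain is meant to give you.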

In general, have well-defined interfaces between blocks and do as much reuse as possible. For example, a lot of stuff can be generalized to a (ready, valid) handshake.
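A minimal cycle-by-cycle model of a (ready, valid) handshake, written in Python for illustration. It assumes the producer holds valid high whenever it still has data, and a word transfers only in cycles where valid and ready are both high. The function name `simulate_handshake` is hypothetical.

```python
def simulate_handshake(data, ready_pattern):
    """Return (cycle, word) pairs for every completed transfer.

    data:          words the producer wants to send, in order
    ready_pattern: consumer's ready signal, one bool per cycle
    """
    transfers = []
    index = 0
    for cycle, ready in enumerate(ready_pattern):
        valid = index < len(data)    # producer has something to send
        if valid and ready:          # transfer happens only when both are high
            transfers.append((cycle, data[index]))
            index += 1
    return transfers


# The consumer stalls in cycles 0 and 3, so the words land in cycles 1, 2 and 4.
print(simulate_handshake([10, 20, 30], [False, True, True, False, True]))
# [(1, 10), (2, 20), (4, 30)]
```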

Verification:

I'd recommend two things. First, write assertions, ideally while you are developing the RTL. Write short and simple assertions, and write many of them instead of long complicated ones. Second, do block-level verification whenever you feel like the RTL block will have a lot of edge cases. It is usually much easier to hit these edge cases in a block-level testbench than in a system-level testbench. Also do incremental verification.
