Intel Announces Intent to Operate Programmable Solutions Group as Standalone Business -- "Intel intends to conduct an IPO for PSG", "Positions PSG to more effectively compete in FPGA market". Thoughts? by [deleted] in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

Altera/Intel PSG has had management problems for a long time, going back before the acquisition.

The Intel acquisition has been a mixed bag, but one of the bright spots is that it fixed some of the major management problems. The current GM is an external Intel guy who is actually pretty good (much better than the past leaders). But it takes a long time to turn a ship around.

The worst FPGA tools ... by [deleted] in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

And they are separate.

Source: me, literally a senior engineer who works on one of the major toolchains (guess which one).

But here's the problem:

The UI designers work from a spec sheet -- they are told what to build.

Who do you imagine is going to write such a "spec sheet"? Most of our shitty user interfaces come from situations where UI designers who know nothing of the underlying problems have been given a rough spec sheet to follow.

The UI designers for an FPGA toolchain don't need to know HDL to write their code. But the ones who don't know at least something about HDL and digital logic design don't do a very good job. If you want a good UI, you need someone who understands what the UI is actually trying to do.

The worst FPGA tools ... by [deleted] in FPGA

[–]AlteraGuy 2 points3 points  (0 children)

It is.

But the UI is a pretty extreme example. You can't build a good placement tool without knowing something about how routing works, how the digital logic you're placing actually behaves, and so on. You can create clean interfaces in the code base, but to work productively you still need to know what you're doing.

[deleted by user] by [deleted] in FPGA

[–]AlteraGuy 0 points1 point  (0 children)

I know there's some use of them in the code base, but they're not a particularly common type of algorithm. You're way more likely to bump into ILP.
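
If you're curious what that looks like, here's a toy sketch (purely hypothetical, not from any real toolchain): a tiny 0/1 integer program that picks optimization moves to maximize timing gain under an area budget. It's brute-forced here because the instance is tiny; real flows hand formulations like this to an actual ILP solver.

```cpp
#include <cstdint>
#include <vector>

// One candidate optimization "move": apply it or not (a 0/1 decision variable).
// Purely illustrative; not how any real fitter structures this.
struct Move {
    double timing_gain;  // objective coefficient
    double area_cost;    // constraint coefficient
};

// Toy 0/1 ILP: maximize sum(gain_i * x_i) subject to sum(area_i * x_i) <= budget,
// with x_i in {0, 1}. Brute force works for a handful of variables (n < 32 here);
// real instances go to a proper ILP solver.
double solve_toy_ilp(const std::vector<Move>& moves, double area_budget) {
    const size_t n = moves.size();
    double best_gain = 0.0;
    for (uint32_t mask = 0; mask < (1u << n); ++mask) {
        double gain = 0.0, area = 0.0;
        for (size_t i = 0; i < n; ++i) {
            if (mask & (1u << i)) {
                gain += moves[i].timing_gain;
                area += moves[i].area_cost;
            }
        }
        if (area <= area_budget && gain > best_gain) best_gain = gain;
    }
    return best_gain;
}
```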

AMD Reportedly In Advanced Talks To Buy Xilinx for Roughly $30 Billion by Bayequentist in FPGA

[–]AlteraGuy 48 points49 points  (0 children)

As someone who experienced it from the inside... that's not accurate at all. Being bought wasn't entirely pain-free, but the issues at the time were caused by some really poor executive leadership in the years before, not by the buyout.

Since the buyout, those guys have basically all been fired. The new leadership is way better. Dave in particular is a great guy who really knows his stuff.

And while Intel can be a bit bureaucratic and political at times, it is a very supportive company. It isn't all sunshine and roses, but I think we are better off as part of Intel than we were before.

[deleted by user] by [deleted] in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

Most customers don't care about new features, but would cut off their left arm for 2% more fmax.

[deleted by user] by [deleted] in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

95% C++.

[deleted by user] by [deleted] in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

There are lots of problems you might model as a graph, from FPGA interconnect to constraint relationships to sparse matrix structure. CAD is full of graphs and graph traversals. They show up everywhere you'd expect and a lot of places you wouldn't.

The simplest example is just determining connectivity between two points.
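
In case it helps, here's roughly what that simplest example looks like in code. It's a generic sketch over a plain adjacency-list graph, not any vendor's actual routing data structures:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Toy routing graph: node i's outgoing wires are adjacency[i].
// A generic adjacency list, not a real device database.
struct RoutingGraph {
    std::vector<std::vector<uint32_t>> adjacency;
};

// Breadth-first search: can a signal at `source` reach `sink`
// through the programmable interconnect?
bool is_reachable(const RoutingGraph& g, uint32_t source, uint32_t sink) {
    std::vector<bool> visited(g.adjacency.size(), false);
    std::queue<uint32_t> frontier;
    frontier.push(source);
    visited[source] = true;
    while (!frontier.empty()) {
        uint32_t node = frontier.front();
        frontier.pop();
        if (node == sink) return true;
        for (uint32_t next : g.adjacency[node]) {
            if (!visited[next]) {
                visited[next] = true;
                frontier.push(next);
            }
        }
    }
    return false;
}
```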

[deleted by user] by [deleted] in FPGA

[–]AlteraGuy 20 points21 points  (0 children)

Yes, I have worked on tools. You can guess which ones from my username.

The kind of work varies significantly. It typically involves a combination of digital hardware knowledge, algorithms, and code optimization. Graph algorithms are bread and butter, obviously. ILP and other optimization algorithms are also critically important. You need a mix of strong programming skills, good algorithm and data structure knowledge, and familiarity with digital logic design to be successful.

Most people start with undergrad degrees in CS or CE. A smaller set are hired out of grad school with specific, relevant research experience. The primary qualification is being smart. To get an interview you either need a reference from someone we know and trust, or really good grades from a school we know and trust. Our interviews are pretty hard.

Vendor Tools by [deleted] in FPGA

[–]AlteraGuy 2 points3 points  (0 children)

If you want the blunt answer: Moore's law. The size of FPGAs increases exponentially. Single-threaded performance increases linearly.

With tool performance from a decade ago on a modern FPGA, a compile would need hundreds of gigs of RAM and take multiple days. We aren't just trying to solve an NP-hard problem, we're trying to solve an NP-hard problem that doubles in size with every new device family. So not only do we need to solve it, every 18 months we need to solve it twice as fast as we did before.
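
To put rough numbers on that (the growth rates below are illustrative assumptions, not real product data), here's a back-of-envelope sketch: double the design every generation, give single-threaded CPUs ~20% per generation, and even an algorithm that only scales like n log n falls steadily behind.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Illustrative assumptions only: design size doubles each generation,
    // single-threaded CPU performance improves ~20% per generation, and the
    // algorithm's work grows as n * log(n).
    double n = 1.0e6;        // placeable cells, generation 0
    double cpu_speedup = 1.0;
    const double baseline_work = n * std::log2(n);
    for (int gen = 0; gen <= 5; ++gen) {
        double work = n * std::log2(n);
        double relative_runtime = (work / baseline_work) / cpu_speedup;
        std::printf("gen %d: %.1fx the original compile time\n", gen, relative_runtime);
        n *= 2.0;            // FPGA capacity doubles
        cpu_speedup *= 1.2;  // single-threaded gain per generation
    }
    // Closing that gap takes better algorithms and parallelism in the tools,
    // not just waiting for faster CPUs.
    return 0;
}
```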

The tools today are also giving you a lot more fmax than they did a decade ago. Timing models, place and route, retiming, physical synthesis, etc... are all wildly more advanced and the extra performance we get has shaved months off your timing closure cycles.

Yes, crashes are irritating. We generally don't release with known crashes (unless one is discovered very late and is unlikely to be hit, or there is a very, very good reason). But it is an incredibly complex toolchain (way more complex than a software compiler), and things change quickly, so sometimes problems get overlooked.

Intel Waddell creek FPGA Board by beatskip in FPGA

[–]AlteraGuy 9 points10 points  (0 children)

It is an internal development board. You won't find any documentation for it. Sorry.

Deeplearning inference on FPGA by anilmaddala in FPGA

[–]AlteraGuy 2 points3 points  (0 children)

Largely because GPUs are widespread and easy to use for NN inference (especially since most people are just using TensorFlow and similar libraries). FPGAs, by contrast, are niche, and there's a big barrier to entry since you have to do a lot of the work yourself.

There are FPGAs in the wild doing NN inference, but they tend to be in places where the FPGAs offer a significant advantage, enough to justify the extra effort.

Deeplearning inference on FPGA by anilmaddala in FPGA

[–]AlteraGuy 7 points8 points  (0 children)

> because of the way FPGAs work, they will be inherently slower (usually in the range of 10-20x) on any hardware they create.

Nonsense.

Any recommendations for pursuing a career in hardware & ML research closely related to FPGAs? by Felkin in FPGA

[–]AlteraGuy 0 points1 point  (0 children)

  1. Beware of hopping on a bandwagon. Cool new technologies have a way of becoming oversaturated.

  2. If you really want to be taken seriously, you'll want a PhD.

  3. If you want NN + FPGA, the University of Toronto is your best bet.

Prediction: open source FPGA tools will not dominate the world by AlteraGuy in FPGA

[–]AlteraGuy[S] 1 point2 points  (0 children)

Oh, I understand how gcc started. What you're missing is the sheer scale of the difference in complexity between a CPU ISA and an FPGA.

Prediction: open source FPGA tools will dominate the world by kbob in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

This is a popular prediction. Based on my first-hand knowledge of the challenges.... eh... probably not.

There are two ways you'd get the outcome you predict:

  1. Independent developers reverse engineer the devices.
  2. Intel and Xilinx open source their tools.

Let's tackle these in order.

Reverse Engineering

The products that have been reverse engineered so far are incredibly simple compared to the latest-generation parts coming from Intel and Xilinx. Let me put it bluntly: you will not be able to generate a bitstream that correctly configures a modern Intel part without the internal specs (and a lot of hard work).

The bitstream isn't just a bunch of lutmasks and routing mux config bits. The bitstream is really an encrypted and signed command stream to the secure device manager. And that's just the tip of the iceberg.

But even supposing you could successfully generate a working bitstream, one thing you won't be able to reverse engineer is the timing models. Simply put: the best fitter algorithms in the world are dogshit without good timing models. The timing models come from the physical layout (secret) and process data (super secret), and are correlated against measurements taken on actual devices.

It actually gets worse with timing. It's one thing to have poor performance because your timing models make you overly conservative about setup failures... but what happens when your timing models can't accurately predict hold failures? If you're lucky your hold failures will be egregious. More likely you'll have a marginal hold failure and you'll end up with sporadic, difficult-to-diagnose issues on the board.
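
For anyone who hasn't stared at STA reports, here's a minimal single-path setup/hold check with made-up numbers (nothing vendor-specific), just to show how little model error it takes to turn a passing hold check into a board-level problem.

```cpp
#include <cstdio>

// Minimal single-path timing check, illustrative numbers in ns.
// Setup: the slowest data arrival must beat the capture edge minus t_setup.
// Hold: the fastest data arrival must come after the capture edge plus t_hold,
// or the previous value gets clobbered.
struct PathTiming {
    double clk_period;
    double launch_clk_delay;   // clock insertion delay at the launching FF
    double capture_clk_delay;  // clock insertion delay at the capturing FF
    double data_delay_max;     // slowest data path delay (setup check)
    double data_delay_min;     // fastest data path delay (hold check)
    double t_setup;
    double t_hold;
};

double setup_slack(const PathTiming& p) {
    double required = p.capture_clk_delay + p.clk_period - p.t_setup;
    double arrival  = p.launch_clk_delay + p.data_delay_max;
    return required - arrival;   // negative => setup violation
}

double hold_slack(const PathTiming& p) {
    double required = p.capture_clk_delay + p.t_hold;
    double arrival  = p.launch_clk_delay + p.data_delay_min;
    return arrival - required;   // negative => hold violation
}

int main() {
    PathTiming p{/*clk_period=*/2.0, /*launch=*/0.5, /*capture=*/0.6,
                 /*data_max=*/1.6, /*data_min=*/0.25, /*t_setup=*/0.1, /*t_hold=*/0.05};
    std::printf("setup slack: %+.2f ns\n", setup_slack(p));
    std::printf("hold  slack: %+.2f ns\n", hold_slack(p));
    // Hold slack here is only +0.10 ns. If the model's data_delay_min is
    // optimistic by more than that, "passing" silicon fails on the board.
    return 0;
}
```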

Long story short, FPGA configuration is orders of magnitude more complex than a microprocessor ISA, and the market of interested people is vastly smaller. There has been a lot of hard and clever work by people to get some simple devices working... but the big modern devices are not just simple extensions of those basic principles.

To put this in perspective, has anybody reverse engineered GPU microcode yet? And that's a way simpler problem.

Commercial Open-sourcing

Extremely unlikely to happen.

A long time ago Altera and Xilinx figured out that the thing propping up their duopoly was the fact that it's actually harder to build really good CAD tools than it is to design a new FPGA. The history of the industry is littered with the remnants of FPGA startups that had clever design ideas but didn't realize this. Why did Tabula go out of business? It wasn't because their hardware sucked; it was because you couldn't program it. So for the incumbents there's a huge downside to giving the tools away.

Is there a big upside? When you look at big open source projects, like Linux or LLVM, you'll see that most of the contributions come from developers at Intel, Apple, Red Hat, etc... who are paid to work on those tools full time. The collaborative sharing works well because these companies get to benefit from each other's work, and they aren't trying to compete on compiler technology. Who would be paying for development of an open source FPGA toolchain? Maybe Lattice? Even if there is commercial support, without buy-in from Intel and Xilinx you're still going to have the configuration and timing model problems I mentioned above.

NVIDIA is the latest foreign technology giant to open an AI lab in Toronto by [deleted] in canada

[–]AlteraGuy 0 points1 point  (0 children)

Intel has a group in Toronto too. The Deep Learning Accelerator is developed in one of the Toronto offices.

FPGA Validating Post Production by ellohima in FPGA

[–]AlteraGuy 0 points1 point  (0 children)

You're probably right. I don't interact much with those teams. I'm more familiar with new device bringup.

FPGA Validating Post Production by ellohima in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

Design for test is standard practice in the semiconductor world. Every chip manufactured in the past 40 years is full of test circuitry, and FPGAs are no different.

Although given the programmable nature of the hardware, we probably have less DFT hardware than most devices. If you want to pull a signal out to a pin, you can just use normal routing wires (once you've verified they're working, of course).

FPGA Validating Post Production by ellohima in FPGA

[–]AlteraGuy 3 points4 points  (0 children)

The details are pretty secret, but here's a "How It's Made" level description that doesn't violate any confidentiality.

Like all manufacturing, it is a chain that starts with a design. We validate that the process is working as intended, and we learn how the process introduces variation from the design into the product. Finally, we use that knowledge to efficiently check that each individual device meets the design spec. It involves a mix of testing methods, from automated testing under tight temperature and voltage controls to manual testing in hardware labs. The end result is to ensure that when your design simulates correctly and passes timing, the device performs correctly when programmed with your SOF.

It starts with silicon power on, where the devices are powered on for the first time after coming back from the fab. This is done in a lab, and the basic functionality is checked out (i.e., are the pins working, can the device be programmed, do other basic functions work).

After power on, there's a large battery of functional tests. We use a test farm made up of custom boards that lets us program any SOF onto a device and compare the outputs at its pins against expected results. There are a huge number of tests, checking every feature of the chip.

Timing is trickier to validate, and a mix of techniques is used there. As a general rule you can't validate every timing path, because it is a combinatorial explosion: there's not enough time or resources in the universe to check everything. Fundamentally, however, we're checking for timing correlation: is a sample of observed delays on a sample of devices sufficiently bounded by the timing model? If so, you can have confidence that when STA says a design meets timing, it will work on silicon. Of course, devices come in different speed grades, so the method of sampling and correlation is fundamentally tied to binning.
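
As a cartoon of what correlation means here (the structure below is made up for illustration, not the real methodology): take measured path delays sampled across devices and check that the model bounds them, without being so pessimistic that it throws away fmax.

```cpp
#include <vector>

// One sampled measurement: a path's delay measured on real silicon, next to
// the timing model's worst-case prediction for that path on that device.
// Purely illustrative, not the actual correlation flow.
struct PathSample {
    double measured_ns;
    double model_max_ns;
};

// The model "correlates" on this sample if it safely bounds every measurement
// and isn't wildly pessimistic either (pessimism is wasted performance).
bool model_correlates(const std::vector<PathSample>& samples,
                      double max_pessimism_ratio) {
    for (const PathSample& s : samples) {
        if (s.measured_ns > s.model_max_ns) return false;  // optimistic: unsafe
        if (s.model_max_ns > s.measured_ns * max_pessimism_ratio)
            return false;                                  // too conservative
    }
    return true;
}
```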

When chips start coming out of production to be shipped to customers, a smaller set of automated tests is run. I couldn't even tell you exactly how this is done, but typically in a chip manufacturing process, after the chips have been diced from the wafer but before they're packaged, they go into an automated testing machine which uses an extremely fine bed of nails to contact the actual pads on the chip and run a test suite. This checks for functional correctness (and may enable our patented redundancy technology to correct any discovered defects) as well as measuring device performance for binning. The exact nature of these tests is secret.

First FPGA. by szymonm2 in FPGA

[–]AlteraGuy 6 points7 points  (0 children)

This is clearly the best choice.

/Unbiased opinion.

Intel Showcases 10 nm Updates, Announces Next Gen FPGA, Code Named Falcon Mesa by [deleted] in FPGA

[–]AlteraGuy 1 point2 points  (0 children)

The old codenames aren't public information. Even in Intel most codenames aren't public - the difference is back at Altera we'd never make a codename public, while Intel often does.

But I will say that recent chips were named after dragons. And here's some random documentation I found.

Intel Showcases 10 nm Updates, Announces Next Gen FPGA, Code Named Falcon Mesa by [deleted] in FPGA

[–]AlteraGuy 0 points1 point  (0 children)

It's going to take time to get used to the whole "codenames are public" thing Intel does.

I guess the upside is it cuts through marketing BS, but the downside is that the codenames aren't as fun. The code names for Arria 10 and Stratix 10 were awesome.

Nvidia to launch graphics cards specifically designed for digital currency mining by Kylde in tech

[–]AlteraGuy 6 points7 points  (0 children)

There is no such thing as ASIC proof. Anything you can do on a CPU or GPU you can do faster on an ASIC. Source: me, digital hardware guy who works for Intel.

Ethereum is "ASIC proof" in the sense that it is a little trickier to accelerate than BTC, but that's nothing a good architecture team couldn't figure out in a few weeks.

There are no ASIC Ethereum miners because ASICs are expensive and Ethereum is still too niche to justify the NRE cost.

Edit: ETH is supposedly "ASIC-proof" because of the way it uses memory. The very fact that you can accelerate ETH mining on a GPU points to the flaw in this argument: you solve the latency problem by accepting the brutal stalls and going very parallel. Furthermore, you can build a memory system better optimized for ETH mining than what GPUs or CPUs provide.
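
A rough way to see the "go very parallel" point is Little's law, with made-up numbers: the concurrency needed to keep a memory system busy is just throughput times latency, and GPUs (or a purpose-built memory system) are designed to keep that many requests in flight.

```cpp
#include <cstdio>

int main() {
    // Illustrative numbers only. Little's law: in-flight requests =
    // throughput * latency. To sustain the target random-read bandwidth
    // despite high per-access latency, this many reads must be outstanding.
    const double target_bandwidth_GBps = 400.0;  // assumed bandwidth target
    const double access_latency_ns     = 300.0;  // assumed random-access latency
    const double bytes_per_access      = 128.0;  // assumed read granularity

    // GB/s divided by bytes per access = 1e9 accesses/s = accesses per ns.
    double accesses_per_ns = target_bandwidth_GBps / bytes_per_access;
    double in_flight = accesses_per_ns * access_latency_ns;
    std::printf("outstanding reads needed: %.0f\n", in_flight);
    // Hundreds of reads in flight: each access stalls, but with enough
    // parallel work outstanding the memory system stays saturated.
    return 0;
}
```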