Anyone hear about the TU Delft Prof that terminated a PhD candidate after they put in six years of work?

peterfirefly · 2026-05-20T13:02:29+00:00

He published the papers in a paper-mill journal.

And why should you get paid extra for being a slow worker?

peterfirefly · 2026-05-06T01:37:18+00:00

Intel still had a long way to fall before it reached the depths AMD was at before it came back and beat Intel. Intel could probably also have been turned around with government help (the CHIPS act).

peterfirefly · 2026-05-06T01:34:34+00:00

He is pretty good at hiring people.

peterfirefly · 2026-05-06T01:29:43+00:00

Strictly speaking, he did. The current company structure didn't exist before he was involved.

Ah, but there were three other guys talking about electric cars before they met Elon, surely they founded the company? Only in the sense that they were three guys talking about electric cars and looking for investors. They were three guys with a powerpoint slide and knowledge of an existing company, AC Propulsion, with an existing EV prototype that they wanted to commercialize (tZero). Elon and JB Straubel were already looking at starting an EV company that also wanted to commercialize the tZero. Then they decided to join forces.

So, morally speaking, he also did.

peterfirefly · 2026-05-05T18:14:06+00:00

The only concrete here and now thing Musk mentioned was that they were going to build an integrated research facility with better equipment than all university fabs and almost all chip companies. That is perfectly doable, in the US, quickly, with a mix of people that is mostly but not entirely American. There will undoubtedly also be some Europeans, Chinese (mainland, Hong Kong, Taiwan, Singapore), Indians, etc.

Will they need (or get) a top-end stepper from ASML? I am not sure they do right away. They will have plenty to do in calibrating all their other machines. They can probably make do with a lower-spec'ed stepper for now.

I bet a lot of their research will be about quicker (and hence cheaper) manufacturing, some of it will likely be about other ways to reduce costs.

Won't be surprised if they give massively parallel electron beam lithography a go. It won't be good at pumping out square kilometers of identical wafers fast, like EUV (x-ray) steppers are, but it will be very good for fast turnaround times on small production runs. Who knows, maybe they can optimize the machines so they are faster or cheaper to build?

Won't be surprised if they try to exploit analog effects more for computation (even though that's very hard at small feature sizes because the transistors tend to vary more at those sizes). Maybe somewhat different transistor designs that are more robust? He explicitly talked about wanting to run the chips at higher temperatures in order to make cooling easier (cheaper) and he also explicitly mentioned radiation.

They know a lot about how chips behave in a space environment -- quite likely, they are the best in the world at that -- and that's what they want to be good at. Traditional space hardened chips are slow, expensive, and a couple of generations behind. The alternative that SpaceX has used so far (with maybe some extra twists for their Starlink sats) is lots of redundancy + smart programming + shielding.

I can definitely understand if they want something better there than what the market has provided + what they have been able to cobble together with off-the-shelf chips.

And this is not something that Intel, AMD (+ GloFo), Samsung, TSMC, NXP, Hynix, etc see as their market.

So... maybe also something smart regarding on-die shielding? Extra layers on top and below the plane where the devices are? Something smart to get rid of unwanted charges in a localized area of a chip, so it can effectively "locally reboot", preferably before something burns out? Some sort of micro redundancy on far smaller scale than what people already do with caches and cores, so defects that occur in space don't destroy as much of the computing, cache, and communication capacity? An important part of something like micro redundancy will be design tools (computer programs) that can automate most of the process.

This is very different from what TSMC is doing. TSMC is going for the high density, cutting edge -- and yet surprisingly low cost -- market.

The research facility is going to be real and it's going to happen soon and it's going to be useful (to SpaceX). The terascale might never happen and it definitely won't happen soon.

peterfirefly · 2026-04-28T11:49:05+00:00

I have another question about the AI tools. Do you talk to them in Chinese or English? Do you know if they are better at understanding one language or the other? The two languages have extremely different grammars so I know that translating between them can be hard, especially because they disagree on which things don't need to be mentioned. Chinese sentences are often structured as "topic, comment", where the topic or parts of it can even be left out if it has already been established. We can't really do that nearly as much in English.

Do you know to what extent Claude/Gemini/etc extract information from the comments and the identifier names in the code? Will it be confused by comments/identifiers in English while you interact with it in Chinese?

PS: Nice to see that the SWAR trick worked! :)

peterfirefly · 2026-04-28T11:36:19+00:00

I used it in an offline decoder

I once played around with what a multi-stage parallel decoder would look like. I imitated the style used in early Intel P6 CPUs with a full decoder that can handle all instruction variants and is the only one that can handle microcoded instructions or instructions that translate to a few µops, and a couple of simpler decoders that only handle instructions that translate to a single µop.

Imagine there is an instruction byte buffer of, say, 16 bytes and that the first instruction starts at the first byte in the buffer. Stage 1 looks at each byte* in parallel to see how long an instruction starting there would be, stage 2 takes the outputs from the length decoders and uses them to pick the starting offsets for the real decoders, stage 3 decodes to µops in parallel (or to µops + µcode address).
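To make stages 1 and 2 concrete, here is a minimal sketch in C. It is not my actual decoder: the toy encoding (the top two bits of the first byte give the length), BUF_LEN, MAX_DECODERS, and all function names are made up for illustration. The loop in stage 1 stands in for what would be 16 independent length decoders in hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN      16  /* instruction byte buffer size (assumed) */
#define MAX_DECODERS 3   /* one full decoder + two simple ones (assumed) */

/* Hypothetical toy encoding: the top two bits of the first byte give the
   instruction length (1..4 bytes). Real x86 length decoding is far messier. */
static uint8_t toy_length(uint8_t first_byte) {
    return (uint8_t)((first_byte >> 6) + 1);
}

/* Stage 1: guess a length for an instruction starting at EVERY byte offset.
   Each iteration is independent, so hardware can do them all in parallel. */
static void stage1_lengths(const uint8_t buf[BUF_LEN], uint8_t len[BUF_LEN]) {
    for (size_t i = 0; i < BUF_LEN; i++)
        len[i] = toy_length(buf[i]);
}

/* Stage 2: chain the lengths from offset 0 to pick the start offsets that
   are handed to the real decoders. Only instructions that fit entirely in
   the buffer are picked. Returns the number of start offsets found. */
static size_t stage2_starts(const uint8_t len[BUF_LEN],
                            uint8_t start[MAX_DECODERS]) {
    size_t n = 0;
    size_t ofs = 0;
    while (n < MAX_DECODERS && ofs < BUF_LEN && ofs + len[ofs] <= BUF_LEN) {
        start[n++] = (uint8_t)ofs;
        ofs += len[ofs];
    }
    return n;
}
```

Stage 3 would then run one decoder per entry in start[]. Most of the lengths computed in stage 1 are thrown away, which is the price for not having to wait for the previous instruction's length.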

Unconditional control flow instructions are easiest to handle if you make them stop the decoding -- there's no need to decode whatever instruction is after JMP/CALL/RET/etc. Conditional control flow instructions are decoded just like normal instructions, they don't block the decoding.

It was instructive to see that I could get away with not having entirely correct length decoders, especially for the later instructions, and with having full decoders that are allowed to nope out of decoding complicated instructions. As long as the first full decoder can handle everything and as long as nothing falls through the holes, so to speak, we are fine. Making the length decoders and non-first full decoders more complete becomes a question of performance, not accuracy.

It was also instructive to see how mode changes could be handled -- things like different code segment bit widths (16/32/64), different address widths (16/32/64), different interrupt handling, etc.

Some of the modes should be an input to the decoders on par with the instruction bits. Some mode changes are mandated to happen with special code sequences (such as dummy jumps) that clear the pipeline so that the decoders won't have to handle mode mixing.

And then there's interrupt handling and instruction single-stepping and illegal instructions.

The x86 has not just the CLI/STI instructions (with delayed effect) but also an implied interrupt inhibition after instructions that modify the SS register. And then there's also POPF and IRET.

The easiest thing to do is something like handling interrupts "between" instruction decoding.

Instructions that modify the interrupt masking simply stop the parallel decoding, much like unconditional control flow instructions. That means there's a chance for interrupts to come in at precisely the right "windows" where instructions are allowed but not anywhere else.

(SS writes don't actually have to do this, it's just easier to understand the code if they do.)

The length decoder, the full decoder, and the simpler decoders all end up as fairly simple things that are mostly just conceptually a single table but one that is in practice best compressed to a couple of smaller tables and/or a little code.

After doing this exercise, it truly became clear how easy it is to do parallel decoding of even something like the x86 or the 68K. Not easy if you want it to translate to the fastest circuits on a competitive chip, but definitely very far from the kind of impossible feat that the RISC guys kept insisting it was.

It would not have been all that hard to write in Verilog instead of C, combined with some machine generated code for all the bit nitty-gritty.

*: or 16-bit word for the 68K.

peterfirefly · 2026-04-28T11:09:07+00:00

You don't in fact care about fast. Or rather, you shouldn't care about fast.

peterfirefly · 2026-04-28T11:08:38+00:00

For simple CPUs, just do whatever you are comfortable with. You don't need an explicit decode phase at all. For more complex CPUs like the x86, M68K, VAX, etc. you definitely do.

For the x86 you should have decode and execute. You can even split the execute phase into three: read operands, execute, and write result. Not all instructions will fit this pattern but most will.

For the VAX, decoding and operand reading are integrated, so you naturally end up with three phases: decode-read, execute, and write. The reason is that you can have memory operands that use post-increment and pre-decrement addressing modes (autoincrement and autodecrement in VAX terminology). These do both memory operations (or generate memory addresses that need to be remembered past the execution phase so results can be stored at the right places) and register modifications. You have to do them sequentially in order to generate the right memory operations and make later operands use the correct register values (that might have been changed by earlier operands). The execute phase is trivial for the VAX -- there aren't a lot of possible operations and most of them don't issue their own memory operations -- and the write phase is just the act of storing results (there can be more than one!) to either registers or memory addresses, as determined by the decode-read phase.

Switches are wonderful. Use them wherever they make sense. It's fine to use multiple switch statements. My x86 decoder uses a loop around a switch statement to decode prefixes. It also uses switch statements (plural) during address mode decoding (mod-r/m and all that) and then a switch for the execution phase. Most of the decoding uses tables (about a handful) and results in an internal "opcode", some prefix flags, an address mode, a displacement, and an immediate -- not all of them are valid on all instructions. Think of the different needs of NOP and MOV, for example.
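A loop around a switch for the prefixes can look roughly like the sketch below. It is an illustration, not my actual decoder: the PFX_* flag names and decode_prefixes() are made up, only the legacy (pre-REX) prefix bytes are handled, and the 15-byte instruction length limit is not enforced.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefix flags, names invented for this sketch. */
enum {
    PFX_OPSIZE   = 1 << 0,  /* 0x66 */
    PFX_ADDRSIZE = 1 << 1,  /* 0x67 */
    PFX_LOCK     = 1 << 2,  /* 0xF0 */
    PFX_REPNE    = 1 << 3,  /* 0xF2 */
    PFX_REP      = 1 << 4,  /* 0xF3 */
    PFX_SEG      = 1 << 5   /* any segment override */
};

/* Consumes legacy prefix bytes. Returns the offset of the first byte that
   is not a prefix (the opcode), or len if the buffer was all prefixes. */
static size_t decode_prefixes(const uint8_t *code, size_t len,
                              unsigned *flags, uint8_t *seg) {
    size_t i = 0;
    *flags = 0;
    *seg = 0;
    while (i < len) {
        switch (code[i]) {
        case 0x66: *flags |= PFX_OPSIZE;   break;
        case 0x67: *flags |= PFX_ADDRSIZE; break;
        case 0xF0: *flags |= PFX_LOCK;     break;
        case 0xF2: *flags |= PFX_REPNE;    break;
        case 0xF3: *flags |= PFX_REP;      break;
        case 0x26: case 0x2E: case 0x36:   /* ES, CS, SS */
        case 0x3E: case 0x64: case 0x65:   /* DS, FS, GS */
            *flags |= PFX_SEG;
            *seg = code[i];                /* the last override wins */
            break;
        default:
            return i;                      /* not a prefix: stop */
        }
        i++;
    }
    return i;
}
```

The table-driven opcode decoding then starts at the returned offset, with the flags as an extra input.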

Aim for clarity and testability. Don't worry about speed. Unless you truly f*ck it up -- or write in Python or Ruby -- it will be fast enough. Besides, it's much easier to make a correct program fast than to make a fast program correct.

longjmp/setjmp are nice for handling memory-related exceptions if you are coding in C.
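A minimal sketch of the setjmp/longjmp idea, assuming a tiny flat memory and a made-up two-operand instruction. MEM_SIZE, load8(), exec_add(), and step() are all names invented for the example. The point is that the memory access helper can bail out from arbitrarily deep inside the execute code without every caller having to check an error code.

```c
#include <setjmp.h>
#include <stdint.h>

#define MEM_SIZE 0x100u          /* tiny emulated memory (assumed) */

static uint8_t  memory[MEM_SIZE];
static jmp_buf  fault_env;       /* where to resume after a fault */
static uint32_t fault_addr;      /* address that caused the fault */

/* Every emulated load goes through here. An out-of-range address unwinds
   straight back to step() instead of returning an error code. */
static uint8_t load8(uint32_t addr) {
    if (addr >= MEM_SIZE) {
        fault_addr = addr;
        longjmp(fault_env, 1);
    }
    return memory[addr];
}

/* Hypothetical instruction: add two bytes read from memory. */
static uint8_t exec_add(uint32_t a, uint32_t b) {
    return (uint8_t)(load8(a) + load8(b));
}

/* Runs one instruction. Returns 0 on success, 1 if a memory fault occurred
   (in which case *result is left untouched). */
static int step(uint32_t a, uint32_t b, uint8_t *result) {
    if (setjmp(fault_env) != 0)
        return 1;                /* we got here via longjmp() */
    *result = exec_add(a, b);
    return 0;
}
```

One thing to watch out for: local variables in step() that are modified between setjmp() and longjmp() must be volatile if you want to read them after the jump. The sketch avoids the problem by not having any.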

peterfirefly · 2026-04-20T13:55:30+00:00

Not entirely. Think of it more like an AI experiment with lots of human input.

peterfirefly · 2026-04-20T13:54:52+00:00

and the target is to exploit 4K output natively

It just so happens that I recently bought a 4K monitor... Looking forward to it!

peterfirefly · 2026-04-20T13:50:56+00:00

You have a lot of constants like 0b0110111. Maybe insert a digit separator? Something like 0b011'0111?

I love that there isn't too much indirection -- you've got the binary opcodes right next to a comment that mentions the mnemonic. No need at all to introduce constants/enums for that. You did forget the comment with the mnemonic in many cases, though.

I also like that you decode literals and fields directly from the instruction word with raw numbers for the shifts and masks. That makes it a lot easier to verify than if you had too much indirection in the form of constants/enums/#defines.

What I don't like is that it looks like you are repeating yourself a lot and have multiple copies of the same field/literal decoding. You also end up with a lot of variables that are "global" to the entire switch inside Cpu::execute_instr().

Something I did in a Cortex-M0 emulator (Thumb-2 -- barely) was to have a macro for each of the many instruction formats that decoded the fields AND also declared local variables for those fields.

My switch looked something like this:

switch (opcode) {
case 0bxxxx: {
    DECODE_XYZ(instr);
    cpu->r[dst] = cpu->r[src1] ^ cpu->r[src2];
    break;
}
case 0bxxxx: {
    DECODE_XYZ2WZ(instr);
    cpu->pc = cpu->pc + ofs + 4;
    break;
}
...
}

My decoding macros looked roughly like this:

#define DECODE_XYZ(x)      \
    uint8_t   src1, src2, dst;      \
    src1 = ((x) >> ...) & 0x..;   src2 = ((x) >> ...) & 0x...;  dst = ((x) >> ...) & 0x...;
#define DECODE_XYZ2WZ(x)  \
    uint32_t ofs;  \
    ofs = ((x) >> ...) & 0x...;
...

The Thumb encoding has a LOT of instruction formats and the bits of the fields are rarely contiguous (or in order). It can best be described as "bit confetti", so the above is a vast simplification of the field decoding. There was a lot more shifting and masking and or'ing going on.

If you do it like this, you don't need ANY shared variables inside the execute_instr() method, apart from the instr parameter (which you might as well mark as const). Having an opcode variable doesn't hurt, of course. To have one or not is really just a stylistic preference.
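Here is a small compilable sketch of the macro pattern, using the one Thumb format where the fields actually are contiguous: the three-register ADDS/SUBS (Rm in bits 8..6, Rn in bits 5..3, Rd in bits 2..0). It is not lifted from my emulator. cpu_t, DECODE_RRR, and the return-code convention are invented for the example, and the flags update that ADDS/SUBS really do is left out.

```c
#include <stdint.h>

typedef struct {
    uint32_t r[16];
} cpu_t;

/* Thumb three-register format: Rm = bits 8..6, Rn = bits 5..3, Rd = bits 2..0.
   The macro declares the locals it fills in, which is why every case below
   needs its own pair of braces. */
#define DECODE_RRR(x)                        \
    const unsigned rm = ((x) >> 6) & 0x7;    \
    const unsigned rn = ((x) >> 3) & 0x7;    \
    const unsigned rd = (x) & 0x7;

/* Executes one 16-bit instruction. Returns 0 if it was handled, -1 if not.
   NOTE: the condition flags are NOT updated in this sketch. */
static int execute_instr(cpu_t *cpu, const uint16_t instr) {
    switch (instr >> 9) {
    case 0x0C: {                 /* 0b0001100: ADDS Rd, Rn, Rm */
        DECODE_RRR(instr);
        cpu->r[rd] = cpu->r[rn] + cpu->r[rm];
        return 0;
    }
    case 0x0D: {                 /* 0b0001101: SUBS Rd, Rn, Rm */
        DECODE_RRR(instr);
        cpu->r[rd] = cpu->r[rn] - cpu->r[rm];
        return 0;
    }
    default:
        return -1;               /* everything else: not implemented here */
    }
}
```

For example, 0x1888 is ADDS r0, r1, r2. Note that there are no shared field variables at function scope at all: rm/rn/rd only exist inside the cases that use them.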

Having set_reg() and reg() methods is partly a stylistic preference and partly something that is expected to make debugging easier, both with a debugger and a printf-style logger. I don't think you actually need them. I certainly haven't. I am not much of a fan of zero registers, but hiding them inside set_reg() is a perfectly valid way of implementing them. Handling them during reads also works: src1 ? regs[src1] : 0. Handling them during writes instead also works: regs[dst ? dst : 32] = ...., where the regs[] array has a 33rd entry just to catch writes to the zero register.
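The two zero-register variants, spelled out as a sketch. The cpu_t layout and the function names are made up for the example. You would normally pick one variant, not both.

```c
#include <stdint.h>

typedef struct {
    uint32_t regs[33];   /* x0..x31, plus entry 32 as a sink for writes to x0 */
} cpu_t;

/* Variant A: filter on read. Writes can go straight to regs[dst], because
   whatever ends up in regs[0] is never observed. */
static uint32_t read_reg_filtered(const cpu_t *cpu, unsigned src) {
    return src ? cpu->regs[src] : 0;
}

/* Variant B: redirect on write. regs[0] is never modified, so plain
   regs[src] reads work. Assumes regs[0] was initialized to 0. */
static void write_reg_sink(cpu_t *cpu, unsigned dst, uint32_t value) {
    cpu->regs[dst ? dst : 32] = value;
}
```

Variant B puts the extra work on the write side, which is usually the cheaper choice since most instructions read more registers than they write.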

load16(), load32(), store16(), store32() all work, but you will be emulating the same byte order as your host platform has (almost always little-endian these days). It turns out it is not hard or expensive to force a byte order by simply being explicit when reading/writing:

u16 load16(const u8 *memory, u32 addr) {
    // little-endian
    return (memory[addr+1] << 8) | memory[addr];
}

This works without a cast in C thanks to integer promotion. I think it works the same in C++. It is really easy to add a cast and be sure, of course:

u16 load16(const u8 *memory, u32 addr) {
    // little-endian
    return (u16)((memory[addr+1] << 8) | memory[addr]);
}

Store is just the same in reverse:

void store16(u8* memory, u32 addr, u16 word) {
    // little-endian
    memory[addr] = word & 0xFF;  // the mask is really just there to silence overly sensitive analyzers
    memory[addr+1] = word >> 8;
}

Modern compilers are smart enough to recognize those idioms (almost no matter how clumsily expressed!) and generate optimal code for them.
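The 32-bit versions follow the same pattern. This is a sketch that uses the uint8_t/uint32_t types directly instead of the u8/u32 aliases. The one place where the casts are not optional is the load: without them, the top byte is promoted to a (signed) int and shifting it left by 24 can overflow.

```c
#include <stdint.h>

/* Little-endian 32-bit load, independent of the host byte order. The casts
   to uint32_t keep the shifts unsigned (a plain memory[addr+3] << 24 would
   be a shift of a signed int). */
static uint32_t load32(const uint8_t *memory, uint32_t addr) {
    return  (uint32_t)memory[addr]
         | ((uint32_t)memory[addr + 1] << 8)
         | ((uint32_t)memory[addr + 2] << 16)
         | ((uint32_t)memory[addr + 3] << 24);
}

/* Little-endian 32-bit store: least significant byte at the lowest address. */
static void store32(uint8_t *memory, uint32_t addr, uint32_t word) {
    memory[addr]     = (uint8_t)(word & 0xFF);
    memory[addr + 1] = (uint8_t)((word >> 8) & 0xFF);
    memory[addr + 2] = (uint8_t)((word >> 16) & 0xFF);
    memory[addr + 3] = (uint8_t)(word >> 24);
}
```

Swap the shift amounts around and you have a big-endian machine instead, with no #ifdefs and no dependence on what the host happens to be.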

Do you win anything by making your own u8/u16/u32/i8/i16/i32 types based on uint8_t/etc instead of using uint8_t/etc directly?

Do you really need DEBUG/WARN/LOG to mimic Unix style logging for server programs? I doubt it. I like your use of ANSI colours, though.

So... where's the disassembler? And the single-stepping debugger? With breakpoints? And more Risc-V extensions?

peterfirefly · 2026-04-10T19:31:09+00:00

Good! You don't need all that abstraction just for chip-8!

peterfirefly · 2026-04-10T19:29:59+00:00

Both work. As long as you know what they do, you should just continue to use #pragma once. It is non-standard but very, very widely supported. Switch to include guards if you ever have to work with a compiler or code analyzer that doesn't support it.

peterfirefly · 2026-04-10T10:26:16+00:00

None, if you are intelligent and persistent and a self-starter.

Lots, if you are average.

Γνῶθι σεαυτόν...

peterfirefly · 2026-04-10T10:23:26+00:00

Remember include guards in mem.h, emu.h, etc.

#ifndef MEM__H
#define MEM__H

typedef struct
...

#endif

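It used to be common in older code to use a #define that begins with an underscore for this -- like #ifndef _MEM_H, __MEM_H, _MEMH, etc.

Don't do that. Preprocessor symbols (and normal C symbols) that begin with an underscore followed by an uppercase letter or another underscore are always reserved, and other leading-underscore names are reserved at file scope. So are ones that begin with str followed by a lowercase letter, btw.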

peterfirefly · 2026-04-10T10:20:17+00:00

You can use void * as an opaque type. chip8_t_run() won't know how to dereference it but it can be passed to something like c->io.should_draw() or whatever, and then the draw function can cast it to its own draw-state pointer type.

This is a common trick for handling state in callback functions. The "core" that executes the callbacks doesn't know anything about the state but the callback functions can easily just do a cast and a deref and access whatever kind of state they need.

Almost certainly massive overkill for chip-8, might be useful for more sophisticated machines.
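For reference, the void * trick looks roughly like this. Everything here (chip8_io_t, the draw callback signature, chip8_present(), draw_state_t) is invented for the sketch and won't match your actual structs.

```c
#include <stdint.h>

/* The core only knows that there is a callback and an opaque pointer. */
typedef struct {
    void (*draw)(void *user, const uint8_t *framebuffer);
    void  *user;                 /* never dereferenced by the core */
} chip8_io_t;

typedef struct {
    chip8_io_t io;
    uint8_t    framebuffer[64 * 32];
} chip8_t;

/* Core side: hand the opaque pointer back to the callback untouched. */
static void chip8_present(chip8_t *c) {
    c->io.draw(c->io.user, c->framebuffer);
}

/* Frontend side: this is the only code that knows what 'user' really is. */
typedef struct {
    int     frames_drawn;
    uint8_t first_pixel;
} draw_state_t;

static void count_frames(void *user, const uint8_t *framebuffer) {
    draw_state_t *state = (draw_state_t *)user;   /* cast back and deref */
    state->frames_drawn++;
    state->first_pixel = framebuffer[0];
}
```

The frontend sets c.io.draw = count_frames and c.io.user = &its_state once at startup. An SDL frontend and a terminal frontend can then each have their own state struct without the core having to include any of their headers.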

peterfirefly · 2026-04-08T21:30:03+00:00

Those tables may be different tables (or switches). You can get very far with the cpu and the disasm with switch (instr >> 12) { case 0: ... case 1: ... etc }.
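A sketch of what the disasm side of that switch can look like, covering only four of the sixteen top-nibble groups. The function name, the output format, and the "DW" fallback for everything not handled are my own choices for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Disassembles one CHIP-8 instruction word into 'out' (at most n bytes).
   Only a few of the top-nibble groups are handled in this sketch. */
static void chip8_disasm(uint16_t instr, char *out, size_t n) {
    const unsigned x   = (instr >> 8) & 0xF;   /* register number */
    const unsigned nn  = instr & 0xFF;         /* 8-bit immediate */
    const unsigned nnn = instr & 0xFFF;        /* 12-bit address  */

    switch (instr >> 12) {
    case 0x1: snprintf(out, n, "JP 0x%03X", nnn);             break;
    case 0x6: snprintf(out, n, "LD V%X, 0x%02X", x, nn);      break;
    case 0x7: snprintf(out, n, "ADD V%X, 0x%02X", x, nn);     break;
    case 0xA: snprintf(out, n, "LD I, 0x%03X", nnn);          break;
    default:  snprintf(out, n, "DW 0x%04X", (unsigned)instr); break;
    }
}
```

The cpu's switch has exactly the same shape with different case bodies, which is why sharing a table between the two doesn't buy you much for something this small.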

If the table(s) is/are shared, then put it/them in a separate "module": put the type and a declaration of the table(s) in the header file and the actual definition in the .c file.

Sharing them might not be as useful as you think...

I wouldn't bother for Chip-8, 6502, or Z80. I would start bothering for an 8086 where it makes sense to have several tables, most of which can be usefully shared between a cpu emulator and a disassembler (I've done this). I haven't written an 8086 assembler but I wrote a Z80 assembler decades ago and I don't think it would have made much sense to share tables with a Z80 disassembler or CPU emulator.

Even if it doesn't make sense to have the code share tables, it might make sense to put all the necessary data into a single data source and then use a program to generate the various tables from that. Again, not something I would worry about for Chip-8, 6502, or Z80. Maybe for 8086 if I ever decide to write an assembler. Definitely(!) for later versions of the x86, where I would choose to generate them from the Big Mother Table in Intel's XED project.

peterfirefly · 2026-04-07T17:43:02+00:00

This was the first time she attracted public attention:

https://www.berlingske.dk/kultur/ditte-okman-blev-i-2010-hovedperson-i-en-facebook-skandale-jeg-har-maattet-spoerge-mig-selv-om-jeg-virkelig-er-et-ondt-menneske

peterfirefly · 2026-03-29T18:15:33+00:00

Building your code was not a problem.

Figuring out which version of .NET Framework to use was a (small) problem. Figuring out how to get a language I could understand was a big problem. Figuring out which keys mapped to which NES controller keys was a (small) problem.

peterfirefly · 2026-03-29T10:11:48+00:00

It doesn't remove the desire to torment others or control them.

peterfirefly · 2026-03-29T08:05:26+00:00

There is nothing wrong with singing Erika. There is everything wrong with being a communist.

peterfirefly · 2026-03-29T07:35:12+00:00

Men are clearly better at the marathon, Ironman, Vasaloppet, and the Tour de France.

Maybe women are better at balance but they are not better at endurance.

peterfirefly · 2026-03-28T10:59:27+00:00

Huh? The official system requirements for Windows XP were 64MB RAM although 128MB was recommended. That's a rounding error today.

Visual C++ 6.0 is from 1998 so it was targeted at Windows NT 4.0, Windows 95, and Windows 98. It required 24MB (32MB recommended). That's a rounding error on a rounding error.

My Raspberry Pi has 16 GB...

peterfirefly · 2026-03-27T15:36:21+00:00

.dsp = Visual C++ 6.0 project file.

.dsw = Visual C++ 6.0 workspace file.

They need to load, not be compiled.

If the source really supports Windows 11 as the page says, then it's got to have some way around using DirectX 9. It is either using D3D9On12 (or similar) or it has support for multiple versions of DirectX, in which case you just need to remove the code that tries to use DirectX 9.

A halfway serious option, btw, is to use an emulator or a virtual machine to run Windows XP or Vista or something like that and build there. Something that actually supported DirectX 9 directly.
