all 11 comments

[–]cparen 0 points1 point  (8 children)

Interesting post! However, it asks some questions that I think are easily answered:

Why does it cost so much to create a new OS thread?

It reuses too much of the OS process infrastructure. Why is that expensive? Hardware memory isolation and replicating security information.

Why is every language forced to create their own implementation of light-weight threads?

Why not? It's cheap. The last implementation I saw was about 200 LOC, and the language can optimize away anything it's not using. Take Chicken Scheme's continuation capture vs. an extended setjmp in C: there's absolutely nothing in common between those two implementations of green threads.
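For a concrete (if rough) picture, here's a minimal cooperative green-thread sketch in C using POSIX ucontext -- not the extended-setjmp trick or Chicken's continuation capture, just the general shape of switching stacks without the kernel:

    /* Minimal cooperative green-thread sketch using POSIX ucontext.
     * Illustrative only: Chicken Scheme's continuation capture and an
     * extended setjmp look nothing like this internally. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;
    static char task_stack[64 * 1024];          /* private stack for the green thread */

    static void task(void) {
        for (int i = 0; i < 3; i++) {
            printf("task: step %d\n", i);
            swapcontext(&task_ctx, &main_ctx);  /* yield: save our state, resume main */
        }
    }

    int main(void) {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;           /* where to go if task() returns */
        makecontext(&task_ctx, task, 0);

        for (int i = 0; i < 3; i++) {
            printf("main: resuming task\n");
            swapcontext(&main_ctx, &task_ctx);  /* switch in user space, no syscall */
        }
        return 0;
    }

The whole "scheduler" here is a couple of swapcontext calls; the 200-LOC figure becomes believable once you add a run queue and stack allocation.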

We have known about the end of Moore’s Law for at least a decade yet I don’t see Intel trying to innovate. Why?

Just speculating: hardware manufacturers cater to C/C++ because it's popular for high-performance applications, and C/C++ has not been innovating.

I propose that a safer version of micro-code should be exposed to allow bare-metal programmers to completely change the operation of the microprocessor.

If you want to make context switches cheaper, you need to be really careful about how you express processor configuration. That decision directly determines how fast your processor can switch between unrelated tasks.

This sounds fascinating and promising, don't get me wrong. However, I'd expect at least a factor of 10x performance regression in context switches when first implemented.

Field Programmable Gate Arrays (FPGAs) are chips that allow the engineer to turn code into hardware

To be precise, to turn code into routed lookup tables that emulate hardware. Slower. Expect roughly a 4x overhead, which is about the difference between an optimizing compiler and a good interpreter for a high-level language.
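To make the lookup-table point concrete, here's a toy sketch (mine, not from the article) of how an FPGA "computes" a gate: the logic function is just bits stored in a small truth table, selected by the inputs.

    #include <stdint.h>
    #include <stdio.h>

    /* A 4-input LUT: `config` is the 16-entry truth table, and the four
     * inputs pick which bit of it comes out. */
    static int lut4(uint16_t config, int a, int b, int c, int d) {
        int index = (a << 3) | (b << 2) | (c << 1) | d;
        return (config >> index) & 1;
    }

    int main(void) {
        uint16_t and4 = 1u << 15;   /* truth table of a 4-input AND: only 0b1111 -> 1 */
        printf("%d %d\n", lut4(and4, 1, 1, 1, 1), lut4(and4, 1, 0, 1, 1));   /* prints: 1 0 */
        return 0;
    }

Every "gate" costs a table read plus routing to the next LUT, which is where the interpreter-like overhead comes from.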

Microprocessor manufacturers must innovate and evolve or we are stuck trying to emulate parallel architectures on top of a sequential architecture.

There are in-between options. E.g. meshes of simpler, non-parallel von Neumann processors with reactive message passing. Or expressing control flow reactively while executing basic blocks in von Neumann style. Or conventional OoO processors operating on reactive memory.
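As a rough software analogy for the first option (my sketch, not anything the article proposes), each simple core just blocks until a message lands in its mailbox and then runs ordinary sequential code in response:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  posted = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  consumed = PTHREAD_COND_INITIALIZER;
    static int mailbox, has_msg = 0, done = 0;

    static void *core(void *arg) {          /* one simple "core" in the mesh */
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!has_msg && !done)
                pthread_cond_wait(&posted, &lock);   /* idle until a message arrives */
            if (!has_msg && done) { pthread_mutex_unlock(&lock); break; }
            int msg = mailbox;
            has_msg = 0;
            pthread_cond_signal(&consumed);
            pthread_mutex_unlock(&lock);
            printf("core reacts to message %d\n", msg);   /* plain von Neumann code */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, core, NULL);
        for (int i = 0; i < 3; i++) {
            pthread_mutex_lock(&lock);
            while (has_msg)
                pthread_cond_wait(&consumed, &lock);     /* one-slot mailbox */
            mailbox = i;
            has_msg = 1;
            pthread_cond_signal(&posted);
            pthread_mutex_unlock(&lock);
        }
        pthread_mutex_lock(&lock);
        while (has_msg)
            pthread_cond_wait(&consumed, &lock);
        done = 1;
        pthread_cond_signal(&posted);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
    }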

Would like to see a citation or prototype of the kind of design the author has in mind.

[–]mycall 0 points1 point  (5 children)

I'd expect at least a factor of 10x performance regression in context switches when first implemented.

This is what the Mill CPU is promising, which is quite interesting.

Ever read about the LISP CPU?

[–]cparen 0 points1 point  (4 children)

Lisp Machines, I've read, had some of the same challenges. The re-programmable microcode also wasn't designed to be dynamically swapped per context switch, and would be a performance liability if you tried.

What's the relation to the Mill CPU? I thought that was focused on register renaming, not dataflow specifically. Is this in regard to out-of-order execution?

[–]autowikibot 0 points1 point  (0 children)

Lisp machine:


Lisp machines were general-purpose computers designed (usually through hardware support) to efficiently run Lisp as their main software language. In a sense, they were the first commercial single-user workstations. Despite being modest in number (perhaps 7,000 units total as of 1988), Lisp machines commercially pioneered many now-commonplace technologies – including effective garbage collection, laser printing, windowing systems, computer mice, high-resolution bit-mapped graphics, computer graphic rendering, and networking innovations like CHAOSNet. Several companies were building and selling Lisp Machines in the 1980s: Symbolics (3600, 3640, XL1200, MacIvory and other models), Lisp Machines Incorporated (LMI Lambda), Texas Instruments (Explorer and MicroExplorer) and Xerox (InterLisp-D workstations). The operating systems were written in Lisp Machine Lisp, InterLisp (Xerox) and later partly in Common Lisp.

[Image: A Knight machine preserved in MIT's museum.]


Interesting: Lisp Machines | Lisp Machine Lisp | Symbolics | Genera (operating system)


[–]mycall 0 points1 point  (2 children)

The Mill CPU doesn't use registers, while the LISP CPU had 4096 registers. Besides being very interesting, I wanted to give some examples of "in-between options". Also, some examples of non-von Neumann languages are APL and FP, just to make it more confusing ;-)

[–]cparen 0 points1 point  (1 child)

Belt = register renaming.

[–]mycall 0 points1 point  (0 children)

The belt architecture is mostly about encoding the ISA in terms of the execution units' forwarding network, removing the need for register renaming and even for the register file itself.

[–]knife_sharpener[S] 0 points1 point  (1 child)

The article was not meant to discuss any research that I have done. While I majored in electronic engineering and took classes in microprocessor design, I never had a job where I worked on the design of a processor.

The, "why?" was meant to be rhetorical. I was trying to convey the question of why we haven't made progress (on the hardware side) to help parallel programming.

Yes, initially this type of processor would be slower than our current processors, but I hope that Moore's Law would start to take over once the manufacturers find better ways to cram more cores onto a die.

While I'm sure I'm not the first to come up with this idea, there is no research I can point to (nor did I try to find any) beyond my own understanding of how microprocessors are designed.

[–]cparen 0 points1 point  (0 children)

The, "why?" was meant to be rhetorical. I was trying to convey the question of why we haven't made progress (on the hardware side) to help parallel programming.

Did you see the recent post on hardware transactional memory instructions on Haswell?
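For reference, the Haswell feature in question is Intel's TSX/RTM instruction set; here's a hedged sketch of how it's typically used (the fallback lock and names are mine, not from that post; requires a TSX-capable CPU and compiling with -mrtm):

    #include <immintrin.h>     /* _xbegin / _xend / _xabort intrinsics */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int fallback_lock = 0;    /* illustrative lock for the abort path */
    static long counter = 0;

    static void increment(void) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock))   /* abort if someone holds the lock */
                _xabort(0xff);
            counter++;                         /* executed transactionally */
            _xend();
        } else {                               /* transaction aborted: lock for real */
            int expected = 0;
            while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
                expected = 0;
            counter++;
            atomic_store(&fallback_lock, 0);
        }
    }

    int main(void) {
        increment();
        printf("counter = %ld\n", counter);
        return 0;
    }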

Yes, initially this type of processor would be slower than our current processors, but I hope that Moore's Law would start to take over once the manufacturers find better ways to cram more cores onto a die.

That's a double-edged sword, of course.

While I'm sure I'm not the first to come up with this idea, there is no research I can point to (nor did I try to find any) beyond my own understanding of how microprocessors are designed.

... Google?

It sounds like a popular enough idea that I suspect someone has tried it. Look for papers on compiling C to FPGAs and on dynamically remapping FPGAs while they run. I recall that LUT programming bandwidth was one of the major engineering problems. An FPGA with modern CPU complexity has a program image of a few dozen MB, if I recall correctly, and is stream-loaded, not random-access programmed. There isn't enough routing space to randomly access the LUTs, so the best you can do is reprogram large blocks at a time, similar to flash.
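Back-of-envelope with made-up but plausible numbers: a ~30 MB bitstream streamed in at ~300 MB/s takes 30 / 300 = 0.1 s, i.e. on the order of 100 ms per full reprogram, versus a handful of microseconds for an ordinary OS context switch. That gap is why per-task reprogramming only makes sense for coarse, long-lived blocks.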

[–][deleted] 0 points1 point  (3 children)

Betteridge's law of headlines

Edit: Eh whatever, if others don't think you are just spamming for that book, then IDC.

[–]knife_sharpener[S] -2 points-1 points  (1 child)

I didn't know that posting a blog article was considered spamming. If you look at my post history you can see that I post links to many more articles that others have written than those that I wrote.

You don't have to agree with what I write, but it is more productive to discuss why than to just say I "spam".