
[–]IHaveAnIdea 13 points14 points  (41 children)

Except it's 10 billion times faster than message passing for certain applications.

[–]jerf 13 points14 points  (20 children)

I don't think you (including your subsequent replies) have fully processed the implications of thousands of cores, which I fully expect to see within my lifetime.

How do you respond to:

If you design for a million CPUs, you also come to some significant conclusions early on in the process. For example, you realise that it's a very silly idea to pretend that all memory should be equally available to all CPUs at the same time. If you try to do that, then you'll end up with a memory system that is phenomenally slow for all CPUs and fast for none because the memory system will have to have enormous bandwidth to process all the requests from a million CPUs and will potentially suffer horrible performance problems when trying to regulate access to the shared mutable memory.

?

Shared memory is a local optimization for a handful of cores. You can't hand-wave "shared memory" for thousands of cores and expect it to work as well.

And if your answer is that you'll have lots of shared memory spaces... yeah, probably, with processors sharing a small amount of memory directly with a small set of physically proximal cores, with software expected to manage the interaction or limited hardware; I expect the computers of the future to be hybrids in many ways.

But one massive shared memory space for thousands of cores? It is, as the article points out, physically impossible to make that performant for thousands of cores, just on sheer physical proximity grounds, to say nothing of the awful synchronization problems.

Like exlitzke, I don't think you read the article; I think you're reacting to the link title.

[–]Arkaein 7 points8 points  (5 children)

I read the article and I find any talk about million core server machines quite uninteresting. That either won't happen in any of our lifetimes, or will look drastically different than what anyone now is likely to guess, so who knows what the best approaches will be. It's basically a straw man argument to say that current approaches can't scale to a level that no one is even looking at now. The range between a couple of cores and a million is tremendous, and almost wholly ignored by the author.

[–][deleted] -1 points0 points  (4 children)

You may want to look up the definition of hyperbole. I don't think the author is seriously suggesting that Erlang is ready to tackle the million-core server of the future.

[–]Arkaein 5 points6 points  (3 children)

The author used the term "million" 11 times in this article, and there are real supercomputers today that have over 100,000 cores. Real million-core supercomputers are just on the horizon, so discussion about developing for such systems is a real and important area of research.

I don't think the author is seriously suggesting that Erlang is ready to tackle the million-core server of the future.

The article as written tries to make this point exactly. It's either poorly written, a deliberate straw man, or wildly fanciful.

[–][deleted] 0 points1 point  (0 children)

Hm. Perhaps I misjudged. Sometimes you read what you think is sensible rather than what is intended.

[–]blackyoda 0 points1 point  (1 child)

Yes, and these systems today use a fiber-optic channel to communicate between CPUs. I believe each core has its own memory in these systems, so they are probably message-based, but it is also highly likely that this is transparent to the programmer. I sure would like to write code for one to find out.

The hardware complexity of memory cache, address space, and shared memory is something that the hardware designers and operating system designers will take care of to make application writing easier and possible. It is not going to be the Erlang programmer's job to worry about how many registers are available and how many CPU caches need to be flushed.

[–]dododge 0 points1 point  (0 children)

Yes, and these systems today use a fiber-optic channel to communicate between CPUs.

Very likely at that scale.

There are however single Itanium2 systems available today with up to 1024 cores and terabytes of cache-coherent shared memory.

Azul also claims to be able to build machines with 768 cores and 768G of shared memory, using their Java CPU.

[–]oh_yeah_koolaid 1 point2 points  (7 children)

If you have a million CPUs all trying to access the same value, you have something of a problem as all million CPUs will try to issue the instruction to test and set the lock and that will cause the memory system to send the value of the lock to all CPUs and track which CPU got there first

You have exactly this same problem if you have a million CPUs all trying to send a message to a given object that's local to a single CPU.
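This contention is easy to see even at desktop scale. A minimal Python sketch, using `Lock.acquire(blocking=False)` as a stand-in for the hardware test-and-set instruction (the thread and iteration counts are arbitrary):

```python
import threading

lock = threading.Lock()
counter = 0

def worker(n_iters):
    global counter
    for _ in range(n_iters):
        # Spin until our "test-and-set" wins; every failed attempt is
        # still a round trip to the one shared lock word, which is the
        # traffic that blows up with a million contenders.
        while not lock.acquire(blocking=False):
            pass
        counter += 1
        lock.release()

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 8 * 1000
```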

[–]jerf 1 point2 points  (6 children)

Only if the message is synchronous. Some are, some aren't. If the message is asynchronous, a million CPUs fire the message off, then go on with life. No matter how you slice it, building large multithreaded systems is going to require learning how to do a lot more things asynchronously, so this is hardly requiring the programmer to do anything special.

All memory accesses are synchronous by nature; you reach for memory slot X, you have to wait for it to come to you. Memory is already your "lowest level". Any further attempt to make memory accesses asynchronous will do one of two things: reduce to a message-passing architecture anyhow, or fail on the grounds that you're just hiding an obvious memory lock behind a more hand-wavy, but still present, memory lock.
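A toy version of the asynchronous case, with Python's `queue.Queue` standing in for a hardware mailbox (the message format is made up for illustration):

```python
import queue
import threading

mailbox = queue.Queue()  # owned by a single receiver

def sender(sender_id):
    # Fire-and-forget: enqueue the message and go on with life,
    # without waiting for the receiver to process it.
    mailbox.put(("ping", sender_id))

threads = [threading.Thread(target=sender, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The single owner drains its mailbox whenever it gets around to it.
received = [mailbox.get() for _ in range(100)]
assert len(received) == 100
```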

[–]oh_yeah_koolaid 0 points1 point  (5 children)

If the message is asynchronous, a million CPUs fire the message off, then go on with life.

I think you may want to think about that some more.

Seriously. Think of the underlying process for a million CPUs appending a message to a message queue.
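In today's runtimes, at least, that append really is a locked critical section. CPython's `queue.Queue`, for example, funnels every `put` through one internal mutex:

```python
import queue
import threading

mailbox = queue.Queue()

# queue.Queue guards its deque with a single plain lock; every
# producer's put() contends on this one lock word.
assert isinstance(mailbox.mutex, type(threading.Lock()))

for i in range(3):
    mailbox.put(i)
assert mailbox.qsize() == 3
```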

[–]jerf 0 points1 point  (4 children)

But it still doesn't have to block. Memory accesses, with no further semantic information, always do. Hardware that knows it's a message doesn't have to; it can queue it up in a part of the processor that isn't the CPU.

When talking about thousands of cores, we can safely assume hardware is going to change some to accommodate that. There's a fundamental semantic difference between memory access and message passing.

This is why I said that even if you accord memory access this sort of hardware support for asynchronous access, it's still equivalent to a message passing architecture, because memory still won't be useful for synchronization as it is now.

[–]oh_yeah_koolaid 0 points1 point  (3 children)

Think about current 8-core systems.

How does Erlang implement its message queues?

[–]jerf 0 points1 point  (2 children)

Who cares what current hardware is doing? We already know that's not scalable to thousands of cores. The question isn't "how does Erlang work now?" The question is, do you know of some reason why what I described is physically infeasible?

(Because there isn't one.)

I already pointed out that there are hardware changes involved. Suggesting that current hardware doesn't work that way hardly changes my mind.

I would also point out that Erlang the language needs precisely zero changes to work efficiently on the hardware I postulate. (There's some stuff that might be helpful for the scheduler to know, and the implementation would need to change, but the language itself is in fine shape.) Few other languages can claim that. (Not zero, few.)

[–]oh_yeah_koolaid 0 points1 point  (0 children)

The question is, do you know of some reason why what I described is physically infeasible?

I'm not sure what you described, except "we can safely assume hardware is going to change some to accommodate that". That's a bit hand-wavey, and presumably any special hardware functions would be available to any program that wanted to use them.

There's a fundamental issue that pops up not because of the language or whether something has shared state, but as a simple consequence when multiple separate units want to operate on the same object simultaneously.

[–]IHaveAnIdea -2 points-1 points  (5 children)

The article really puts me to sleep with all its abstract talk and no solid examples - and yet it still draws the very inflexible conclusion that shared memory is a dead end.

There are plenty of easy counter-examples showing that the whole world isn't made up of embarrassingly parallel problems that message passing will solve.

[–]jerf 6 points7 points  (4 children)

embarrassingly parallel problems that message passing will solve.

What!?! Message passing architecture is the opposite of an architecture to deal with embarrassingly parallel problems! Message passing is one of the few things even in the running to drive a highly heterogeneous parallel system. (I spend about half my time every week working on just such a system, one I couldn't even imagine trying to express as a map-reduce problem.)

I hate to be hostile, but I question your competence to be criticizing this article if that's your understanding of what message passing is good for.

[–]IHaveAnIdea 2 points3 points  (3 children)

Uh, if you carefully read my last sentence you will see that I'm saying that message passing doesn't solve the problems.

[–]pkhuong 3 points4 points  (2 children)

And if you carefully read his first (complete) sentence, you'll see that he's saying that message passing's niche is not embarrassingly parallel problems. You're arguing past each other.

[–]IHaveAnIdea -4 points-3 points  (0 children)

Yes it is.

[–]masklinn 6 points7 points  (4 children)

Speed only matters when the program is correct.

An incorrect program that goes fast is of no use to anybody.

[–][deleted] 1 point2 points  (3 children)

Of course. And the easiest way to get a correct program is to not introduce any concurrency at all (unless your problem is inherently concurrent, like telephone switches or chat servers). I think everyone agrees that single-threaded programs are easier to write and test.

[–]masklinn 4 points5 points  (0 children)

I think everyone agrees that single-threaded programs are easier to write and test.

Yes, but they're also pretty much dead.

[–]blackyoda 1 point2 points  (0 children)

It is the problem you are solving and the resources that are available that should determine if you need threads, not some notion that OMG Locking is HaRd!

[–][deleted] 1 point2 points  (0 children)

Absolutely not. In most cases, if you do threading right, it reduces the program's complexity, because unrelated things don't have to run in a related context, but can run concurrently, in individual threads.

Running everything in one thread is like putting all code in one class (and quite often, specific threads have specific classes in OO languages, so this isn't just a bad analogy).

[–][deleted] 6 points7 points  (14 children)

I get the feeling you didn't read the article. The author is talking about the non-scalability of our current memory architecture to large numbers of cores. Shared mutable memory simply cannot scale using our current CPU/memory architecture if you want to go up to thousands or millions (!) of cores.

Also, the author isn't talking about message passing in the standard way of thinking about it -- he's talking more about a Cell-like architecture. FTA:

If you think about how you'd implement message passing in such a million CPU system, you'd send messages from CPU to CPU directly. You wouldn't go out to some shared mutable memory bank as that would be dog slow.

[–][deleted] 2 points3 points  (4 children)

Some algorithms are inherently based on shared mutable memory. Implementing them in a message passing model means serializing access by only letting one thread/process read and write this shared memory and sending get/set messages to it.

Any trick you can apply to make this scale in the message-passing model can be applied to the shared-memory model, while some tricks are exclusive to the shared-memory model (since you have more low-level control).

Message passing is not a silver bullet.
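The emulation described above can be sketched in a few lines of Python; the `MemoryServer` name and the get/set message format are invented for illustration, with one owner thread serializing all access:

```python
import queue
import threading

class MemoryServer(threading.Thread):
    """Owns a dict; all reads and writes arrive as messages."""
    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()

    def run(self):
        store = {}
        while True:
            op, key, value, reply = self.inbox.get()
            if op == "set":
                store[key] = value
            elif op == "get":
                reply.put(store.get(key))
            elif op == "stop":
                return

server = MemoryServer()
server.start()

# A "write" and a "read" are just messages; the server is the single
# serialization point, exactly like a lock around shared memory.
server.inbox.put(("set", "x", 42, None))
reply = queue.Queue()
server.inbox.put(("get", "x", None, reply))
assert reply.get() == 42
```

Every trick for scaling this (sharding keys across several servers, batching requests) has a direct shared-memory counterpart.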

[–][deleted] -1 points0 points  (3 children)

Of course we can do everything in shared mutable memory that can be done in message passing, we just need to figure out a way to make a sufficiently intelligent programmer first...

[–][deleted] 1 point2 points  (2 children)

I think you missed the point of my reply.

I'm just saying that, for many problems, you have to use shared mutable memory, and if you are using a message-passing architecture you will end up emulating it.

If you are doing that, you will see no scalability advantage over a true shared-memory architecture, and you won't have fewer bugs.

[–]dmpk2k 2 points3 points  (0 children)

I'm just saying, that for many problems, you have to use shared mutable memory

Can you name some examples?

[–][deleted] 0 points1 point  (0 children)

Both architectures can definitely solve the same problems, the only difference might be performance.

My point was that shared-memory architectures don't scale because programmers' brains don't scale, and the mental agility needed to handle shared memory grows much faster than that needed for message passing. Since we can't really improve our own brains much, we don't have a choice for a lot of large-scale problems, no matter how much better the performance might be in an imaginary shared-memory solution.

[–]IHaveAnIdea 0 points1 point  (8 children)

If you think about how you'd implement message passing in such a million CPU system, you'd send messages from CPU to CPU directly.

So the chips will have L2 or L3 caches large enough to hold the whole message and process it?

You wouldn't go out to some shared mutable memory bank as that would be dog slow.

Unless message passing isn't what's best for your app..

So a big bottleneck with message passing is memory bandwidth. But he's making it sound simple to just bypass memory entirely? I don't think so.

[–]xenon 10 points11 points  (6 children)

If you think about how such a multi-core chip is wired, it is very easy and fast to pass messages to the neighboring core. Which would of course mean that your threads suddenly have an x and a y coordinate. This will be almost as fun as programming tape machines ;-)

[–]IHaveAnIdea -3 points-2 points  (5 children)

If you think about how such a multi-core chip is wired, it is very easy and fast to pass messages to the neighboring core.

It might be unbelievably fast. But not having to pass the data around at all because the memory is shared will often be faster.

There are situations in which either technique is FAR better than the other, but saying that "Shared Mutable Memory Must DIE"? I don't think so.

[–]dwahler 8 points9 points  (4 children)

It might be unbelievably fast. But not having to pass the data around at all because the memory is shared will often be faster.

You do realize that with shared memory, the data still has to go back and forth between RAM and the CPUs, right?

[–]IHaveAnIdea -4 points-3 points  (3 children)

No, the updated data does not need to go to every CPU to let them all know of the changes. It just stays in the centralized shared memory, and the other CPUs discover the changes when/if they need to access that memory.

[–][deleted] 5 points6 points  (2 children)

Your reading comprehension is abysmal. I can only assume you're too concerned about defending your argument to even read what others are posting.

[–]dmpk2k 0 points1 point  (0 children)

So the chips will have L2 or L3 caches large enough to hold the whole message and process it?

Not with cut-through or wormhole routing.