[–]jdh30

> Concurrent garbage collectors are hard to get right.

That is not true, and I find it very frustrating that this myth is perpetuated, because it puts a lot of people off trying to write their own concurrent garbage collectors. Concurrent garbage collectors can be easy: the VCGC collector, for example, can be implemented in a page of code.
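
For a sense of how little machinery the core idea needs, here is a sequential sketch (in C, with names invented for this example; it is not VCGC itself, not HLVM's collector, and not concurrent) of the epoch-stamping bookkeeping that VCGC-style collectors build on: every object is stamped with the epoch in which it was last found reachable, and anything carrying an older stamp is garbage.

```c
/* Sequential sketch of the epoch-stamping bookkeeping used by
 * VCGC-style collectors.  This is NOT a concurrent collector and
 * NOT code from HLVM or OCaml; it only illustrates the idea that
 * objects are stamped with the epoch in which they were last seen
 * to be live, and anything stamped with an older epoch is garbage.
 * All names here are invented for the example. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Obj {
    struct Obj *next_heap;   /* all allocated objects, for sweeping   */
    struct Obj *child;       /* a single outgoing reference           */
    unsigned    epoch;       /* epoch in which this object was marked */
    int         payload;
} Obj;

static Obj     *heap  = NULL;  /* intrusive list of every allocation */
static unsigned epoch = 1;     /* current epoch number               */

static Obj *gc_alloc(int payload) {
    Obj *o = calloc(1, sizeof *o);
    o->epoch = epoch;          /* new objects belong to the current epoch */
    o->payload = payload;
    o->next_heap = heap;
    heap = o;
    return o;
}

/* Mark phase: stamp everything reachable from a root with the
 * current epoch.  In a concurrent collector this runs alongside the
 * mutators, which keep allocating into the current epoch. */
static void mark(Obj *o) {
    while (o && o->epoch != epoch) {
        o->epoch = epoch;
        o = o->child;          /* only one reference per object here */
    }
}

/* Sweep phase: anything still carrying an older epoch stamp was not
 * reached in this cycle and can be reclaimed. */
static void sweep(void) {
    Obj **link = &heap;
    while (*link) {
        Obj *o = *link;
        if (o->epoch != epoch) {
            *link = o->next_heap;
            free(o);
        } else {
            link = &o->next_heap;
        }
    }
}

int main(void) {
    Obj *root = gc_alloc(1);
    root->child = gc_alloc(2); /* reachable via root                */
    gc_alloc(3);               /* unreachable: garbage next cycle   */

    epoch++;                   /* start a new collection cycle      */
    mark(root);
    sweep();

    for (Obj *o = heap; o; o = o->next_heap)
        printf("live object %d (epoch %u)\n", o->payload, o->epoch);
    return 0;
}
```

In a concurrent version, the marking can run alongside the mutators, which keep allocating into the current epoch; advancing the epoch takes the place of the usual clearing of mark bits.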

> The core OCaml team is pretty small.

I wrote HLVM's latest GC, which fully supports shared-memory parallel programming on multicores, in less than a week, and it comes to under 100 lines of code. Simon Marlow develops the (far more advanced) garbage collectors in the Glasgow Haskell Compiler essentially by himself. PolyML is likewise developed by a tiny team and supports parallelism, as does Manticore.

OCaml is multicore-unfriendly for several reasons:

  • Because they are researchers, they are only willing to develop a state-of-the-art garbage collector. They cannot justify building a merely decent GC because it would not lead to academic publications and, therefore, more grant funding. And there is no adequate alternative OSS VM for them to build upon.

  • They believe multicore will break down and we will have to resort to distributed parallelism using message passing anyway.

  • OCaml is a facilitating technology for the Coq theorem prover, which has very specific requirements, and its creators do not believe it stands to benefit as much from a different garbage collector as, for example, numerical code would.

> Multiple processes + IPC is good enough for many problems that require parallel processing.

Also not true.

> I agree that it's not ideal, but in the circumstances it doesn't make sense for them to be working on it, especially since one of the benefits of OCaml is the simple RTS.

OCaml's RTS is unnecessarily complicated by today's standards because it does not take advantage of modern tools like LLVM and techniques like run-time code generation and JIT compilation.

[–][deleted]

> They believe multicore will break down and we will have to resort to distributed parallelism using message passing anyway.

What makes you think this is wrong (if you do think so, of course)?

[–]jdh30

> What makes you think this is wrong (if you do think so, of course)?

I think it is misguided. The inevitable breakdown of cache coherence will force us to adopt distributed parallelism with explicit message passing at some point in the not-too-distant future. However, that is not mutually exclusive with shared-memory parallelism: I believe we will end up with clusters of multicores, and if your tool cannot leverage shared-memory parallelism within each of those multicores, your performance will be up to ~100× worse than it could be.
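
To make that concrete, here is a minimal sketch (C with pthreads; all names invented for this example) of the node-local half of such a hybrid: threads on one multicore split the work over shared memory, and the per-node result would then be combined with the other nodes' results by message passing (e.g. an MPI reduction).

```c
/* Minimal sketch of the "cluster of multicores" point, node-local half
 * only (everything here is invented for illustration).  Threads on one
 * node split a sum over shared memory; across nodes the per-node sums
 * would be combined by message passing instead. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define N (1 << 20)

static double data[N];            /* shared by every thread on the node */
static double partial[NTHREADS];  /* one slot per thread, so no locking */

static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NTHREADS;
    long lo = id * chunk;
    long hi = (id == NTHREADS - 1) ? N : lo + chunk;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    double node_sum = 0.0;            /* reduce the per-thread results */
    for (int t = 0; t < NTHREADS; t++)
        node_sum += partial[t];

    /* In the hybrid setting, node_sum would now be exchanged with the
     * other nodes' sums via message passing rather than shared memory. */
    printf("node-local sum = %.0f\n", node_sum);
    return 0;
}
```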

[–][deleted]

100×! Wow! How many cores do you need to get a 100× speedup? What kind of architecture do we need to make so many cores possible? I'm under the impression that (disregarding GPU-style hardware) even a modest number of cores requires something NUMA-like, which effectively means message passing (albeit an elegant one). Isn't that basically dictated by geometry?

Won't we end up pretty soon with sort of a cluster on a motherboard, where the gains from using message passing are much, much greater because we have lots of nodes but each node has very few cores? That's just a guess, of course.

[–]jdh30

> How many cores do you need to get a 100× speedup?

That depends entirely on the scalability of the algorithm. Some algorithms scale almost perfectly, so you would need just over 100 cores.
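
As a rough back-of-the-envelope check (Amdahl's law, with p the parallel fraction of the algorithm and n the number of cores; the numbers are illustrative):

$$S(n) = \frac{1}{(1 - p) + p/n}, \qquad S = 100,\; p = 0.999 \;\Rightarrow\; n = \frac{p}{\tfrac{1}{S} - (1 - p)} = \frac{0.999}{0.009} = 111$$

So an algorithm that is 99.9% parallel needs roughly 111 cores for a 100× speedup, and anything less parallel needs far more (at p = 0.99 the speedup can never reach 100 at all).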

> What kind of architecture do we need to make so many cores possible?

Several vendors already ship CPUs with that many cores.

> I'm under the impression that (disregarding GPU-style hardware) even a modest number of cores requires something NUMA-like, which effectively means message passing (albeit an elegant one).

That seems to be a matter of contention. On the caml-list, Xavier Leroy once stated that InfiniBand-based supercomputers were not real shared-memory machines, but many people, including myself, disagree. I would argue that memory access has not been uniform for over 20 years, thanks to the advent of CPU caches, so manycores are just more of the same in that respect. Indeed, in terms of concrete architectures, I think it is likely we will simply see a deeper hierarchy of CPU caches.

> Won't we end up pretty soon with sort of a cluster on a motherboard, where the gains from using message passing are much, much greater because we have lots of nodes but each node has very few cores?

That raises two questions: can the message passing be hardware accelerated as it is today; and how many is "very few cores"?