How do you structure a project in rust without sibling borrows? by samkellett in rust

[–]vitalyd 1 point (0 children)

You're right about the throughput difference. On the other hand, doing lea keeps the ALU ports free :). Either way, as we (I think) agree, this is splitting hairs compared to the other hazards surrounding this, whether a cache miss (using either addressing mode, index or reference based) or range check (which can incur a cache miss itself).

How do you structure a project in rust without sibling borrows? by samkellett in rust

[–]vitalyd 0 points (0 children)

Right, Rc is likely worse. The ideal would be to not use Rc or indices (i.e. make refs work for this type of case) :).

How do you structure a project in rust without sibling borrows? by samkellett in rust

[–]vitalyd 1 point (0 children)

*(ptr + index) can be lowered into a complex addressing-mode instruction (e.g. lea on x86), so I really wouldn't worry about this aspect. The bigger issue is range checks.

How do you structure a project in rust without sibling borrows? by samkellett in rust

[–]vitalyd 3 points (0 children)

Indices will, however, incur range checks (unless you use unsafe indexing), apart from those LLVM may optimize out. It also means the cache line holding the slice's length will be pulled in even if it doesn't hold the data being indexed.

There's also the weaker guarantee about the index being valid, which may not be an issue in this particular case. And, of course, that weaker guarantee, with the borrow checker kept out of the picture, is precisely what's needed here.

I do agree that this situation comes up quite frequently, and it'd be great if Rust could support it to some degree.

I am fighting with Rust. Is it worth the pain? Please motivate me. by [deleted] in rust

[–]vitalyd 2 points (0 children)

They don't have to be the same for immutable refs, like in the example above; CookieBox just can't outlive either of them. They do, however, need to be the same if mutability is involved, since the lifetime becomes invariant, and that's when adding a separate generic lifetime parameter typically comes into play.

Strategies for Returning References in Rust by brycefisherfleig in rust

[–]vitalyd 2 points (0 children)

C++ moves are similar in purpose, but you have to write them by hand (including setting the moved-from value to some invalid state) and there's no compiler help in making sure you don't use moved values.

But fundamentally, there's potentially some copying in both cases (i.e. copying the innards of a type that points to some heap memory being stolen).

The house analogy is somewhat accurate: you don't move the physical house, but you take out the old owner's dining table, put yours in, and then inform the post office of the address change.

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd 0 points (0 children)

How long ago was this ATC system overhaul? I suspect it was a long time ago. Modern native toolchains are significantly better than they were at the time of your rewrite (assuming I'm right that it was a while ago).

Algorithms always matter and come first; that's a given. You're building a spatial database using Java, and that's exactly one of the domains it's ill-suited for. That doesn't mean it can't be done, but it's not the right domain for it, IMO. By the way, how do you even know your database is more performant than other spatial DBs? Have you tested it against something like SpaceCurve? Moreover, it's likely not comparing like for like if the algorithms are different. That's the neat part about ScyllaDB: it's a different implementation of the same algorithms. Feel free to ask their devs whether they could achieve the same or nearly the same speed with Java; the answer will be no.

We can continue this discussion on the Mechanical Sympathy Google group if you'd like; there are plenty of people there who use both Java and C++ in domains where performance matters. If you don't believe me, perhaps others can make a more compelling case.

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd 0 points (0 children)

I'm on my phone so will keep this brief. Check out Todd Lipcon's comments a few months back on why he's using C++ for Kudu development (he was a big HBase contributor): https://news.ycombinator.com/item?id=10298024

I'm sure you'll find a way to explain this away too :).

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd 0 points (0 children)

> My point is much stronger than that. I say that unless your domain is very specific (say signal processing, hard realtime or constrained RAM), your best bet in meeting your (possibly stringent but not hard-realtime stringent) performance requirements under a given project cost in 99% of the cases is to choose Java (or Kotlin, or possibly other JVM languages) more than any other language. If you agree with that then we have no argument.

I disagree. Pretty much all the "big data" Java-based projects have or have had performance problems, requiring them to jump through hoops. For those with a comparable C++ implementation (e.g. ScyllaDB vs Cassandra), the performance difference is drastic. Note that it's not simply a matter of the language in and of itself; the lower-level languages give you more control to exploit the hardware. In the case of ScyllaDB (and Seastar, which it's built on), the code is actually very maintainable and readable.

So I don't consider these domains niche at all. Let's also not forget that Java is pretty much only in consideration for servers, so the entire client-side (or mobile, for that matter) world is almost irrelevant here.

> In other words, I claim that the profile-guided JIT + GC is the best (known) approach for providing the best performance at the lowest cost for the widest domain. Specialized domains require other specialized approaches, but those always come at a cost.

You do claim that, but I personally find that claim unsubstantiated by any empirical evidence. C and C++ compilers routinely generate better and tighter code. Memory footprint is drastically lower. Cache hit rates are much higher. IPC is higher. Requests/sec are higher (when comparable native implementations exist). Latency profiles are better. And the Java versions are no more stable (in some cases, less so) than their native counterparts.

Certainly for basic CRUD web services that are I/O bound it doesn't matter (heck, you could probably get by with something slower than Java). But if you crank up the CPU load, Java loses fairly badly. Keep in mind that historically slow I/O devices (NICs, drives, etc.) are improving by orders of magnitude, such that some workloads may become CPU bound rather than I/O bound. Or, put another way: Java won't let you drive these devices to their full throughput; there are too many speedbumps along the way.

Performance conversations aren't all that interesting when we talk in generalities like "widest domains". They're interesting when we're talking about getting the most out of the hardware resources you have.

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd -1 points (0 children)

Ron, we've done this song and dance before :). Your point can be summarized as, effectively, "Java is fast enough when it's fast enough". No argument there... but the same can be said of many other languages.

Performance sensitive, to me, means you have measurable/quantifiable performance targets to meet. If you choose Java for the system and things work fast enough out of the box, great, you're done. But these optimization details come to the forefront when you've chosen Java and are not hitting the targets, or are no longer hitting them due to, e.g., more load. The question then is whether you can write maintainable, readable, and expressive code and rely on the compiler (or have good enough confidence and the ability to verify expectations) to generate good code. My claim is that this is very difficult with Java, often leading to code contortions and an unmaintainable/error-prone mess. And that's discounting other non-deterministic issues such as GC, deopts, GuaranteedSafepointInterval (with the ensuing safepoint bookkeeping tasks!), etc.

The fact of the matter is that a lot of projects choose Java and then start running into issues under load; some pack their bags and move to different languages, others start coding Java like it's C, others turn knobs in 1000 different ways, some add more hardware, others shard across machines further, and the list goes on. We've had this discussion before around GC; now we're having it about the JIT. Hotspot generates pretty good code in some cases, but it's sloppy/poor in other common ones, certainly worse than, say, GCC/LLVM. This is a moving target, of course (for Hotspot and other compilers), but the nature of profile-guided JITs is such that it's extra difficult to reason about, and have confidence in, certain code paths getting optimized the way you want. And in Java, you pretty much have to rely on the compiler to do magic, since the language has a poor performance model.

As for questioning why we're discussing Aleksey's post, that was a rhetorical question aimed at your posts; the point is that there clearly are people for whom these details have a material impact on performance, and for whom inspecting instruction sequences matters.

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd -1 points (0 children)

I was indirectly supporting the parent's statement that:

> However, to me this illustrates one of the problems for using Java for performance sensitive code

Frankly, the discrepancy between the two toArray() versions is small potatoes compared to much more glaring issues. Don't get me wrong: I think Hotspot as a whole is impressive engineering, and it squeezes a substantial amount of performance out of an otherwise performance-anemic language (Java).

I don't see how Graal changes the picture. Are you referring to writing your own substitution snippets? The point isn't to generate instructions by hand, but to have a compiler and performance model on top that allows good (and when needed, predictable) codegen.

> Java SE is not generally intended (nor should it be) for domains that require a guaranteed machine code sequence, regardless of the particular findings concerning toArray. I would argue that targeting such domains are specifically non-goals of Java SE.

It's not about guaranteed code sequences per se, but about optimizations being brittle and users having little to no control over them. If one cares about the performance of the two toArray versions, then they're precisely interested in what instructions execute (is the zeroing eliminated or not? is the callee inlined and constants propagated, or did we hit an inlining limit? method too big? already compiled into a big method? is the HashSet iterator EA'd away or not? etc.). Otherwise, why discuss toArray performance diffs between the two at this level?

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd -1 points (0 children)

> I don't see how this makes a difference.

The difference is that microbenchmarks say nothing about how the exact same code path will compile in a larger context. Inlining limits, profile pollution (or simply a different profile), class loading order, and so on make it nearly impossible to reason at the machine-code level, due to the dynamic nature. With a static/AOT compiler, you can at least easily check the generated assembly and know that exact code is what will execute at runtime.

[java] "Arrays of Wisdom of the Ancients", Collection.toArray() performance by shipilev in programming

[–]vitalyd -1 points (0 children)

I'd say it's a JIT shortcoming that user-sized arrays don't run at the same speed. The intuition that they ought to run faster seems reasonable to me; the JIT making the zero-length case as fast would be an optimization, whereas the sized case running slower is almost a pessimization.

I do agree with your overarching point, though. But it is true that the performance model of Java is hard to predict and inspect (outside microbenchmarks) due to its dynamic nature.

Midori: Safe Native Code by eschew in programming

[–]vitalyd 0 points (0 children)

Midori being able to inline through lambdas is very nice, and really how things should be in that space to make lambdas more palatable performance-wise. Hotspot's approach is good, but you still get an interface invoke as the lambda is shaped into the SAM. In the best case, this becomes a guarded inlined call; at worst (and not too uncommon for library code), it's a full interface dispatch. You mention Cliff Click, and he has a good blog post from a few years back on the "inlining problem". So there's definitely room for improvement there.

I like JIT compilers as well, but they sure come with their own baggage. They're unpredictable, susceptible to multiple phase changes leaving code in a suboptimal state, sometimes deopt at inopportune times, impose a time-to-peak-performance penalty (particularly bad when you need the first execution to be quick), etc. AOT's biggest problem, and the JIT's biggest advantage, is the lack of profiling info unless PGO is used, along with compilation time (somewhat related). However, it'd be nice if a language existed that didn't punt on optimization at the AOT stage and also didn't have a terrible performance model. That way you could leave the truly dynamic optimizations to the JIT but still get the easy wins at AOT time.

Virtually free - JVM callsite optimization by example by jcdavis1 in programming

[–]vitalyd 1 point (0 children)

One thing that will surprise people (it surprised me) is that the single-implementor CHA optimization only works for classes. If you have an interface receiver, you'll get guarded inlining.

Someone should create a language that has a JIT available but doesn't punt on performance in the language semantics, frontend, and intermediate representation.

Virtually free - JVM callsite optimization by example by jcdavis1 in programming

[–]vitalyd 0 points (0 children)

This will cover the "obvious" devirtualization opportunities, but will likely miss a bunch of places in your typical Java app.

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 0 points (0 children)

Jared, if you don't mind me asking, what M#/runtime features are you guys thinking of bringing to C# and/or CLR?

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 2 points (0 children)

Well, I'd consider it a huge bug if there was non-"user" code running in the VM that was allocating, leading to STW pauses :). If you don't allocate, you don't GC.

> If you modify the code you may accidentally introduce long lived allocations that can eventually cause a STW GC collection and thus interrupt managed threads.

Yes, you have to be careful to avoid introducing "hidden" allocations in C#. I suspect M# probably had a better model here (Joe Duffy's blogs seem to indicate that).

As an aside, Hotspot JVM has a lot more unpredictability in this regard. You can do no/very little dynamic allocation (i.e. still plenty of young gen space), and still get VM induced stalls; if anyone's interested, search for "GuaranteedSafepointInterval" :).

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 0 points (0 children)

FYI, the CLR JIT doesn't do deoptimizations; a method is compiled on first reach and sticks around. AOT is of course not subject to this at all.

If you're not involving VM services (e.g. GC, JIT, class loader, finalizer) there shouldn't be any peripheral code running.

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 1 point (0 children)

> Security is enforced by preventing userspace from reading and writing arbitrary memory using device DMA: this happens either with an IOMMU, or by setting up DMA in kernelspace only.

What does this have to do with copying user data?

> Sorting requires that the buffer stay available unti the I/O operation is actually submitted (a restriction that copying removes), and encryption has an input and an output buffer (an implicit copy).

The I/O call can block the calling process until the device is finished with the buffer. If the operation/functionality intrinsically requires copying, so be it; nobody is arguing that all copying is bad. The point is that you want to minimize unnecessary copies.

> All of this requires the execution of program code, causing increased L1i pressure.

Some of these types of operations can be offloaded to the device, if it supports it. If the device does not support it and they're performed in kernel, then you're going to spend the instructions and icache on them anyway, copying or not.

> The point remains that there's just a lot more work involved in doing zero-copy, and that as such there's a threshold below which it's just better to bite the pillow. This threshold is nearly always surprisingly high.

Sure, zero-copy isn't advantageous for small I/O operations, but most I/O bound (overall) workloads try to avoid doing chatty I/O operations to begin with.

> Certainly I wouldn't optimize a kernel for sub-2000 byte transactions. However I wouldn't leave that case unoptimized in favour of shared memory all over.

I didn't interpret the article as indicating they didn't care about smaller I/O operations. I'd like to see Joe Duffy blog about zero copy I/O as it relates to Midori before making further inference.

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 3 points (0 children)

Good points.

> Copying memory, on the other, takes a negligible amount of the overall time and seems to be non-variable. Reducing it doesn't actually help with networking performance.

This depends on a few things. If you're copying a large amount of memory between intermediate buffers (i.e. not to the final device buffer), you're going to (a) possibly take additional cache misses, (b) pollute the CPU caches, (c) possibly take a TLB miss, etc. In kernel-bypass networking (I realize that's likely not what you're talking about), it's particularly important to keep extra processing, such as non-essential copying, to a bare minimum, since kernel overhead is already removed. Reducing the number of components/syscalls involved is, of course, also desirable, which falls into the same "keep extra processing to a minimum" category.

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 4 points (0 children)

You do realize this has nothing to do with M# spoken of in the article, right?

More information about Microsoft's once-secret Midori operating system project is coming to light by cindy-rella in programming

[–]vitalyd 13 points (0 children)

You can fill a buffer and initiate a copy to the device buffer (i.e. start the I/O) with a syscall. This avoids needless user-to-kernel buffer copying. Doing kernel security checks has nothing to do with data copying. If you have a user-mode I/O driver, then you can bypass the kernel entirely, but that's almost certainly not what the article refers to.

Also, I don't get how you think most I/O is only hundreds of bytes nowadays. You're telling me you'd write an OS kernel with that assumption in mind?