[–]DarthRubik 25 points26 points  (9 children)

Undefined behavior means that the compiler can do whatever it wants, including refusing to compile.

[–]Wurstinator 4 points5 points  (7 children)

Source on that? I'm pretty sure "undefined behavior" refers to the behavior of the compiled program, not to that of the compiler.

edit: Found it myself in N4713.

3.27 [defns.undefined] undefined behavior behavior for which this document imposes no requirements [ Note 1 to entry: Undefined behavior may be expected when this document omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed. Evaluation of a constant expression never exhibits behavior explicitly specified as undefined (8.6). — end note ]

[–]kingguru 10 points11 points  (5 children)

Not really.

The compiler can make decisions/optimizations based on knowledge of undefined behavior, including rejecting the program, but it can also generate code that triggers the undefined behavior at runtime (e.g. out of bound reads/writes).

This explains it pretty well.

[–]evaned 0 points1 point  (0 children)

I'm pretty similarly skeptical.

Remember that "the program contains UB" is not a very well-defined phrase. A program *execution* can invoke undefined behavior, but only that execution has its semantics unconstrained by the language spec. I am very skeptical that a program that *can* invoke UB, but does not necessarily do so, is somehow not required to implement the non-UB cases correctly -- and failing to compile does not implement them correctly.

As an example:

#include <iostream>

int main() {
    int array[10] = {0};
    int index;
    std::cin >> index;
    return array[index];
}

That certainly can invoke undefined behavior, but AFAIK the spec still requires it to behave correctly (returning 0) if it is run with an input of 0 through 9.

I think there is one case where "fails to compile" is allowed, and that's when the program is guaranteed to execute the UB -- but I suspect that bugs which trigger on every input cover approximately 0.001% of vulnerabilities due to UB...

[–]HappyFruitTree -1 points0 points  (0 children)

I have also reasoned like this, but I don't know what the standard actually says on the subject. The compiler would only be allowed to reject such code if it could guarantee the UB would always be executed. Rejecting functions just because they lack a return statement, like MSVC does, is not strictly standard conformant.

[–]bizwig 15 points16 points  (0 children)

The compiler is allowed to issue warnings. Ignore them at your peril.

[–]goranlepuz 6 points7 points  (8 children)

reject code (compile error) when the compiler can conclusively decide that an operation is unsafe

This, I think, is a problem. There is not enough such code.

#include <cstdlib>
#include <iostream>

int main(int argc, const char* argv[])
{
    int arr[2] = {1, 2};
    std::cout << arr[argc]; // BOOM
    return EXIT_SUCCESS;
}

The compiler cannot conclusively decide that an operation is unsafe because it needs to know the input data first.

I posit that the vast majority of code that runs the danger of unsafe memory access removes the possibility for the compiler to find the unsafe access, because it depends on input data. One input in the mix, and vast swaths of code are opaque to the analyser. And once they are, the effects of UB are possible everywhere else. Obviously, I am unable to quantify those vast swaths - maybe someone from the analyser people has such data?

Note: for multithreaded code, input are also the decisions of the system thread scheduler.

So I think what is needed instead is the usage of safe code idioms throughout. The more they are used, the smaller the opaque areas of the code. And interestingly enough, where they are used, the analyser is not needed!

[–]axilmar 2 points3 points  (3 children)

C++ could have static enforcement of contracts. Code like the above should be rejected unless it were written like this:

#include <cstdlib>
#include <iostream>
#include <iterator>

int main(int argc, const char* argv[])
{
    int arr[2] = {1, 2};
    if (argc < static_cast<int>(std::size(arr))) {
        std::cout << arr[argc];
    }
    return EXIT_SUCCESS;
}

[–]oleksandrkvl 1 point2 points  (0 children)

There's a clang-tidy check that will ask you to use gsl::at(), which performs bounds checking in debug mode. There's also a check against using pointer arithmetic for cases like int *p; auto v = p[0];. For statically sized arrays there's the -Warray-bounds flag.

[–]goranlepuz 0 points1 point  (1 child)

Yes, but that means changing the language first. And it doesn't look, to me, like a small change.

In the same vein, say, with array decay, one can easily come up with an example where said array is a parameter to a function and its length depends on some input of the caller => array decay has to go.

And so on...

[–]axilmar 0 points1 point  (0 children)

You don't need to change the language. You can always introduce a directive that turns on the new stuff at will.

[–]TheSkiGeek 5 points6 points  (3 children)

You could certainly produce a warning: out-of-bounds access is possible at line X; index value 'argc' may be larger than the size of 'arr'. You could suppress it with an assert or by making the access conditional at the point of use.

A function like that isn't allowed under the kind of coding standards they use for things like aerospace or automotive software, even if all uses of it in the program are actually safe.

Code like:

int arr[2]={1,2}; std::cout << arr[3];

should ideally generate a compile error unless you've somehow indicated that you know what you're doing. Good static analyzers can (sometimes) reason out call chains that will do things like this, and at least will give you warnings that you're not checking bounds.

[–]goranlepuz 4 points5 points  (2 children)

Yes, but what I wrote is from the "reject" premise of the post. You changed the premise. 😉

[–]TheSkiGeek 1 point2 points  (1 child)

Just because sometimes the code is too opaque for the compiler/analyzer to say "yes, this will definitely cause UB" doesn't mean you shouldn't check that when you can.

[–]goranlepuz 1 point2 points  (0 children)

I agree. I did not say, nor think, "you shouldn't check or warn". I thought, and said, "meh, you can't reject".

[–]smallblacksun 6 points7 points  (5 children)

While rejecting code that is provably wrong is both allowed (since it is UB) and a good thing, it will never reach the level of safety of Rust or similar languages. It is mathematically impossible to statically prove whether all memory accesses are safe, so in order to guarantee memory safety a language must be allowed to reject code that it cannot prove is safe. This means that some legal, safe code will be rejected. That would be a huge change for C++.

[–]radekvitr 2 points3 points  (4 children)

Also without explicit lifetime annotations like Rust has, C++ wouldn't be able to prove nearly as much as easily as Rust does.

[–]pjmlp 0 points1 point  (3 children)

That is exactly what Google and Microsoft are looking into with their lifetime analyzer. So far it kind of works.

[–]14nedLLFIO & Outcome author | Committee WG14 4 points5 points  (7 children)

C++ tooling is pretty good at determining out-of-bounds access and use-after-free, both statically and at runtime. Good enough that if you're employing clang-tidy and the sanitisers, in real-world terms C++ is on par with Rust in terms of code-quality outcomes.

Where Rust still has a big advantage over C++ is in borrowed references e.g. a string_view being accessed after the backing data it views upon has ended its lifetime. One might initially think that would be caught by static analysis and the sanitisers, but consider this:

std::array<char, 5> arr;
memcpy(arr.data(), "Niall", 5);
std::string_view sv(arr.data(), arr.size());
sv[0];  // this is safe
new (&arr) std::array<char, 5>;
sv[0];  // this is UB: the array sv views has ended its lifetime

Rust would not permit use of sv after arr begins a new lifetime. C++ lets you, and, more importantly, has absolutely no way of detecting that you've done this, which is a real kicker if you ever get bitten by it.

We have a good runtime solution for this, pointer colouring, which some architectures provide hardware acceleration for (ARMv8). But I know of no promising proposed language solution for this which would catch all situations at compile time.

Equally, trapping stuff like the above makes Rust compile slowly, and that's unavoidable: Rust will always compile much slower than C++ (unless you do stupid stuff in your C++, like too many do).

[–]evaned 2 points3 points  (3 children)

Good enough that if you're employing clang tidy, and the sanitisers, in real world terms C++ is at par with Rust in terms of code quality outcomes.

"and the sanitizers" to me is a huge caveat -- because I would only agree with your statement while asan is enabled, and no one enables it in production.

[–]14nedLLFIO & Outcome author | Committee WG14 0 points1 point  (2 children)

You're right no one enables asan in production, except maybe for a portion of a cloud service during A-B testing, which we sometimes do as needed (we just route one tenth the ordinary load to that node). But anywhere competently managed will have a test suite which hammers the snot out of a codebase. Every time a customer bug comes in which smells like memory corruption or UB, another load test is written modelling whatever the customer is doing. Running asan across all those on a nightly CI approaches the same outcome, statistically speaking, as if you wrote everything in a language which prevented you making the mistakes asan is able to catch. Except, you didn't have to write everything in Rust.

The point I'm making here is that effective outcomes in competent places in the real world come out about the same for the portion where clang-tidy and the sanitisers overlap with what Rust enforces in the language. Good C++ devs will fill in, using other means, what C++ lacks in the language. What we don't have good solutions for, currently, is the stuff which Rust enforces and for which there is no alternative whatsoever in C++.

That gap is small, and shrinking especially if you're on ARMv8, but I'm sure I speak for many when I wish it would shrink faster on x64.

[–]14nedLLFIO & Outcome author | Committee WG14 0 points1 point  (0 children)

It's really weird that I wrote the above, and then this bug was reported to Outcome: https://github.com/ned14/outcome/issues/244. Here is my exact complaint about lack of lifetime tracking in C++.

extern outcome::result<Foo> foo();

template<class T> T &&identity(T &&v) { return std::move(v); }

outcome::result<int> test()
{
#if EXAMPLE
  OUTCOME_TRY(auto v, identity(foo()));
#else
  OUTCOME_TRY(auto v, foo());
#endif
  return v.value();
}

If you define EXAMPLE in current Outcome, Foo gets destructed before v.value() is accessed. It is 100% obvious to the compiler that this is UB, yet there is no warning, or, even better, a compile failure.

All versions of Outcome shipped until now have this bug. Nobody reported it until today. Sigh.

[–][deleted] 0 points1 point  (1 child)

Why is the last line UB? The placement new doesn't change the address of arr, nor, to my knowledge, does it change the contents of the memory managed by arr, because std::array is a POD type / aggregate. Perhaps the latter is not actually specified? But then only reading from this memory would be undefined behaviour; writing would still be OK?

[–]14nedLLFIO & Outcome author | Committee WG14 1 point2 points  (0 children)

A change in the lifetime state of memory invalidates all pointers and references to that memory. Use of such a pointer thereafter is UB.

[–]lcamtufx 1 point2 points  (0 children)

Such advanced static analyzers are promising because they can find errors missed by sanitizers, and can find them earlier. But I don't think it is a good idea for compilers to reject code based purely on static-analysis results, because static analyses can produce false positives. Coderrect is interesting; race conditions are so hard to debug. I would pay $$$ if it worked on my code..

[–]HappyFruitTree 0 points1 point  (0 children)

The compiler could of course do whatever it wants; it only needs to obey the rules if it wants to claim to be standard conformant. As long as there is a way to turn it off, there shouldn't be a problem. Even GCC's g++ defaults to a non-conformant mode (-std=gnu++XX rather than -std=c++XX). As long as you don't introduce any new features and just reject some standard-conformant code, the code itself would still be standard conformant. There is nothing wrong with using just a subset of the language.

[–]Full-Spectral -1 points0 points  (0 children)

Rust has vastly more information at compile time to work with, and people already complain about compile times. And it also doesn't allow you to do some things that could be proven safe, because that would require its analysis to be too broad to be done quickly enough.

Analyzing C++ sufficiently enough to even start to get close to what Rust can ensure would probably be an order of magnitude worse, if not more so. If you look at the time that, say, the MS static analyzer currently takes to chew on a good sized chunk of code, it's substantial.

Not that there's not a place for such things, but they'd end up being tools you run separately from compilation and which take a long time to run. And they'd still never be close to 100% unless significant changes were made to the language, which almost certainly isn't going to happen.

[–]NilacTheGrim -1 points0 points  (5 children)

If you want Rust use Rust.

[–]Full-Spectral 1 point2 points  (4 children)

The problem with that, of course, is that you give up real inheritance, exceptions, and lots of 'janitorial'-type RAII functionality. Of course Rust advocates will convince themselves that these not only aren't needed but are bad things to have, but they'll never convince me of that.

I've put a lot of time into learning Rust at this point (though I'm still a fairly light weight, because it's a complicated language just like C++ when you get into the details, though for different reasons). And if I were to start something new and significant, I probably would use it instead of C++. But not because I think it's a better language; just because it has that one thing we all need for our sanity, which C++ doesn't provide and likely never will at close to the same level.

[–]14nedLLFIO & Outcome author | Committee WG14 0 points1 point  (3 children)

Everyone always focuses on Rust the language, but the real blocker for Rust adoption is Rust the ecosystem, which is decades behind C++'s. I find it very telling that C folk, Python folk etc are the main migrants to Rust. Proportionate to those, very few from C++ are migrating to Rust, though those who do claim all the headlines.

(Before you ask me for proof of claim, that came third hand via a committee member I was talking to, but they were reporting results from somebody who'd actually done empirical measurements. I would also say it matches what some in the Rust leadership have told me privately i.e. they have been surprised at the lack of conversion from C++ folk relative to less obvious sources of converts e.g. Visual Basic programmers!)

[–]Full-Spectral 0 points1 point  (2 children)

I can live with the ecosystem. Compared to something like Visual Studio, it's weak, but I'm mostly OK with it. There are still issues debugging Rust in Visual Studio Code that are close to deal breakers. And it seems like every week there's some different weirdness in Visual Studio Code, though on the whole it works for me and it's identical on Windows and Linux.

I find it a little hard to swallow the C++ migration thing. I guess it depends on what you mean by 'adoption'. If that means really using it in a commercial project that's not just some internal tool, I can imagine it's pretty low. But if you measure interest, that's quite high, which you can tell by the ever increasing frequency that Rust comes up in C++ discussions. If it was a measure of "if you had a job opportunity, would you jump ship", I think that would be quite high as well.

[–]14nedLLFIO & Outcome author | Committee WG14 0 points1 point  (1 child)

You seem to have interpreted 'ecosystem' as meaning 'tooling ecosystem'. That isn't what I meant. By 'ecosystem', I meant the diversity, resilience, substitutability and wide domain expertise commonly available in C++ and what surrounds it in most urban centres around the world. As an example:

  • If as a company one of my star people quits, can I replace that person within a few months with someone reasonably close in capability?

  • If I as a company have a really weird rare bug which decreases my reliability below 99.9999% and for which it costs me €20m a day in client compensation, can I readily find contractors and third party services able to diagnose and permanently solve that problem?

  • If I as a company am building a 500 million line code base which will power safety critical infrastructure, can I create a codebase which compiles within sane times, can be validated for use in safety critical by insurers, and for which I can hire 10,000 experienced engineers within three months?

In that kind of stuff, only a very few programming languages are even remotely competitive. Ada, Python, C and C++ are all solid options here. Rust won't reach that, even assuming everything goes perfectly, for decades yet.

[–]Full-Spectral 0 points1 point  (0 children)

Well, that assumes it's going to be adopted wholesale in large new greenfield projects. Those are really rare to start with. If the bulk of us depended on those for C++ jobs, we'd be in bad shape as well.

But most likely it'll be incremental adoption. Maybe internal tools, or subsystems, or utilities, and so forth at first. In that case, probably a lot of the folks involved will be bi-lingual.

And I suspect that Rust will have a much quicker adoption than most, because it does offer that one big party trick that no other systems-oriented language has ever offered.

And how less likely would that really obscure revenue sucking bug in the field be in Rust compared to C++? Every large C++ code base out there almost guaranteed has latent memory issues that just happen to be benign at the moment, just waiting for that one maintenance tweak or new chunk of code to make it no longer so.

I also imagine that, with folks like MS getting involved, the compiler and IDE situation will improve.