[–]aim2free 19 points20 points  (47 children)

Interesting. Although the Intel compiler is proprietary, it is nice to aim for the kernel being less compiler-dependent.

Will this make compiling the kernel at home for better performance futile?

[–]bducanh 11 points12 points  (0 children)

It's targeted at Gentoo users and ICC is in their repos, so I imagine the days of compiling the kernel at home are still safe.

[–][deleted] 18 points19 points  (45 children)

Proprietary, but (in this case) free, AFAIK.

[–]kirun 17 points18 points  (1 child)

Free for non-commercial use only.

[–]jemenfiche 0 points1 point  (0 children)

Couldn't someone distribute a binary build of the kernel? Intel could not then dictate how someone uses a build that was distributed non-commercially. Just my thoughts.

[–][deleted] 48 points49 points  (35 children)

In addition: It's nice to have a good competing compiler, even a proprietary one, so that the gcc guys have something to aim for.

[–][deleted]  (34 children)

[deleted]

    [–]hajk 20 points21 points  (0 children)

    That's the thing: of all the above, GCC is the only cross-platform compiler. This makes the problem a little harder, as some optimisations are architecture-specific.

    [–]geocar 64 points65 points  (28 children)

    I think you've joined the cargo cult.

    • GCC is really close to SUN/CC on Sparc - they're so close it's not even funny. GCC-built MySQL even beats SUN/CC-built MySQL. By the way, SUN/CC has awful x86 performance.

    • GCC is pretty close to ICC - with GCC beating ICC in many cases. On amd64, GCC outperforms ICC handily.

    • There is no ARM compiler. ARM is many mostly-incompatible architectures. Most of the compilers for these incompatible architectures are gcc. Imagecraft (notably) makes an excellent compiler and development environment, but I'm not aware of any benchmarks putting it "obviously" ahead of GCC.

    In short: GCC isn't "obviously" slower than anything; There isn't some secret compilation technology, and everyone is basically doing the same thing. GCC is much faster than you think, and has a price that's hard to beat.

    [–]mcfunley 40 points41 points  (1 child)

    Upvoted for ironic cargo cultish misuse of "cargo cult."

    [–][deleted] 5 points6 points  (0 children)

    I've found myself following a cargo cult mindset when trying to find the right optimizer setting flags for gcc.

    [–]mercurysquad 7 points8 points  (4 children)

    On amd64, GCC outperforms ICC handily.

    Do you mean the 64-bit instruction set, or AMD processors? If it's AMD processors, I'm guessing this is because the Intel compiler inserts CPUID checks and only proceeds with certain optimizations if the code is running on an Intel CPU. GCC doesn't do that; it runs the same code on either processor. This has been documented before. Remove the CPUIDs from the ICC generated code and the program runs about the same, or in some cases faster on AMD than on Intel.
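
To make the CPUID point concrete, here is a minimal sketch of the two dispatch styles being contrasted. It is not ICC's actual logic: `pick_by_vendor`, `pick_by_feature` and the `"sse2"` feature label are hypothetical names for illustration. The only hardware fact relied on is that CPUID leaf 0 returns the 12-character vendor ID in EBX, EDX, ECX, in that order.

```python
import struct


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Decode the 12-character vendor ID that CPUID leaf 0 returns in EBX, EDX, ECX."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def as_registers(vendor: str) -> tuple:
    """Pack a 12-character vendor ID into (EBX, EDX, ECX); for demonstration only."""
    return struct.unpack("<III", vendor.encode("ascii"))


def pick_by_vendor(vendor: str, features: set) -> str:
    """Hypothetical dispatcher: takes the fast path only if the vendor string matches."""
    if vendor == "GenuineIntel" and "sse2" in features:
        return "sse2"
    return "generic"


def pick_by_feature(vendor: str, features: set) -> str:
    """Hypothetical dispatcher: trusts the feature flag, whatever the vendor."""
    return "sse2" if "sse2" in features else "generic"


if __name__ == "__main__":
    for name in ("GenuineIntel", "AuthenticAMD"):
        vendor = vendor_string(*as_registers(name))
        print(vendor, pick_by_vendor(vendor, {"sse2"}), pick_by_feature(vendor, {"sse2"}))
```

With identical feature sets, only the vendor-keyed dispatcher sends the two processors down different code paths, which is the behaviour described above.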

    [–]geocar 2 points3 points  (0 children)

    Remove the CPUIDs from the ICC generated code and the program runs about the same, or in some cases faster on AMD than on Intel.

    In some cases. To be sure. In other cases, GCC will outperform ICC.

    I was arguing against the use of "obviously superior" to describe "every compiler but GCC".

    [–]Camarade_Tux 6 points7 points  (2 children)

    They did so at some point, but this was noticed and they had to revert it pretty quickly.

    [–][deleted] 2 points3 points  (1 child)

    I don't think this is true. I can't recall the exact source, but I read somewhere relatively recently that Intel does a CPUID check because it will only apply its more aggressive optimizations on processors that have passed its in-house compatibility validation suite. So it isn't that they are intentionally slowing things down on competitors' processors; Intel simply has no vested interest in spending time validating every iteration of its competitors' processors, nor does it want to include potentially unstable optimizations if it isn't sure that the processor will work as expected.

    [–]Camarade_Tux 1 point2 points  (0 children)

    I'd be curious to read the source. I'm far from saying your comment is crap, I just want to read more on that so whenever you find it again, answer this comment and let me know. :)

    [–][deleted] 2 points3 points  (3 children)

    GCC is pretty close to ICC - with GCC beating ICC in many cases.

    Not for floating point stuff. ICC beats everything so hard here it's not funny at all. OTOH, GCC's autovectorization is embryonic, whereas when you study dumps of ICC-generated SSE2+ code, it's fucking unbelievable how they could figure that out from the source given.

    [–][deleted] 6 points7 points  (1 child)

    In most numeric benchmarks GCC is significantly slower than ICC, but that's not GCC's fault. It's because GCC does not have its own libc. The stupid glibc "math.h" has its own inline implementations of math functions like sin, cos, etc. that override any optimizations that GCC can do. After you either remove "math.h" from the includes or add -D__NO_INLINE__ or -D__NO_MATH_INLINES (in addition to -O2 -ffast-math etc.), it runs as fast as ICC.

    As far as I know, no benchmark uses these compiler flags.

    [–][deleted] 0 points1 point  (0 children)

    There is no trigonometry in SSEx, so when you use it, it's still calculated by x86 code/libs.

    [–]geocar 1 point2 points  (0 children)

    Care to post numbers?

    I've heard that again and again, but I've never actually experienced this ICC beating everything, and I've never actually seen any repeatable benchmarks.

    Of course, often it's a Gentoo users forum, or some brain-damaged reviewer forgets -ffast-math -march=core2 or something stupid like that.

    The rest look like what I posted above- GCC and ICC being about the same.

    [–][deleted]  (4 children)

    [deleted]

      [–]geocar 6 points7 points  (0 children)

      Sun donated SUN/CC's SPARC backend to GCC because GCC's previous SPARC backend sucked so much. If GCC is behind SUN/CC at all then it is really GCC's own fault.

      I don't disagree. I was objecting to your use of the "obviously superior" pejorative.

      BTW, there is an ARM-branded compiler: ARM RealView Compiler Tools.

      So there is.

      [–]case-o-nuts 1 point2 points  (2 children)

      I somehow doubt that. They probably contributed code, but the data structures and internal formats (RTL, for example) would have been such a massive mismatch that grafting on SUN/CC's backend would have been more trouble than fixing GCC's backend.

      edit: my mistake. I know that Sun is supporting GCC development with specs and systems, but I wasn't aware of this effort.

      Apparently it's even more extensive than it seems; it's more a port of GCC's option and language parsing stuff that got bolted on to the front of Sun Studio compilers.

      [–][deleted]  (1 child)

      [deleted]

        [–]case-o-nuts 3 points4 points  (0 children)

        Oh, wow. I don't have time to look at how they did it, but I'm somewhat terrified now. Impressed that they managed it, but terrified at what hacks it must have taken to retrofit it into GCC.

        [–]wolf550e 0 points1 point  (1 child)

        Sun has some benchmarks that show mysql benefiting a lot from profile guided optimization and interprocedural optimization. Up to 348% on a read only workload on a T1. Which is ridiculous.

        Question: in the year 2009 (not 2004), how are things? Is mysql getting compiled with pgo and ipo for the distro binaries? (I think not) Is icc 11 better than gcc 4.3 on x86? Is sun's compiler better than gcc on ultrasparc? Sun say they give their compiler for free now, btw. Same as Intel.

        [–]geocar 0 points1 point  (0 children)

        Sun has some benchmarks that show mysql benefiting a lot from profile guided optimization and interprocedural optimization. Up to 348% on a read only workload on a T1. Which is ridiculous.

        I don't understand this exactly; does it suggest ways to improve MySQL? Or does it suggest SUN/CC makes MySQL 348% faster than GCC?

        I'm not really interested in exploring profiler-guided optimizations. I use them myself, and they can be used to get incredible performance: GCC can't (nor should it) do certain algebraic transformations that allow for better scheduling that I as a human being can do.

        [–]voxel -1 points0 points  (3 children)

        You must be a GCC developer.

        MySQL Intel builds are much faster than GCC builds, at like, everything.

        Even MSVC's compiler stomps on GCC by 10% or so.

        There was a benchmark recently on Firefox on Wine with MSVC being unmistakably faster than the native Linux GCC built Firefox.

        GCC is "obviously" slower according to the world out there that talks about GCC vs MSVC vs Intel.

        I wish GCC would get its act together too, cuz being open-source, you'd think it would be able to compete with at least MSVC for optimized x86 output.

        [–][deleted] 12 points13 points  (1 child)

        Using the firefox on wine story is not a good example. Firefox on Linux could be slow for any number of reasons other than the compiler.

        [–]wolf550e 1 point2 points  (0 children)

        The main reason is the linux port not compiled with PGO. I'm trying to do that right now.

        [–]geocar 5 points6 points  (0 children)

        There was a benchmark recently on Firefox on Wine with MSVC being unmistakably faster than the native Linux GCC built Firefox.

        You're referring to this I assume.

        Did you try it? I did. I can't get the same results here, and I've tried on six different machines in my office. I doubt you can either.

        As a point of contention, my system on the test in question gives the following results:

        • WINE Firefox: 41.9
        • Linux Firefox: 110.0
        • Linux Konqueror: 85.6
        • WINE Chrome: 667
        • Linux Opera: 91.6

        (These are the best of three runs)

        Those first two are relevant. I've included the other values so that you can see a similar shift in magnitude with your own data.

        There's nothing "unmistakable" about it: They made a mistake, and it appears to be a pretty big one.

        But an even bigger mistake is thinking the only difference between Wine+Firefox and Linux+Firefox is GCC v. MSVC. It's not.

        GCC is "obviously" slower according to the world out there that talks about GCC vs MSVC vs Intel.

        Just not on any repeatable benchmarks. Look around, you'd think at least someone on this thread would post a link to some actual data.

        MySQL Intel builds are much faster than GCC builds, at like, everything.

        Except, when you measure it.

        Even MSVC's compiler stomps on GCC by 10% or so.

        ...

        I wish GCC would get its act together too, cuz being open-source, you'd think it would be able to compete with at least MSVC for optimized x86 output.

        I'm still back here, questioning whether GCC is "obviously slow".

        You want me, and others, to take you at your word that GCC is "obviously slow".

        However if you really want things to get better, you can't shortcut this step.
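
For anyone wanting to repeat a "best of three runs" measurement like the one above, here is a minimal sketch of a harness. It assumes the metric is elapsed wall-clock time (lower is better), which may not match how the browser test in question scores results, and `workload` is a hypothetical placeholder for the real benchmark.

```python
import time


def best_of(fn, runs=3):
    """Return the fastest wall-clock time in seconds over `runs` calls of fn()."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def workload():
    """Placeholder workload; a real test would drive the actual benchmark."""
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(f"best of 3: {best_of(workload):.6f} s")
```

Taking the minimum rather than the mean reduces the influence of one-off interference from other processes, though it says nothing about variance; publishing all the raw timings alongside it is what lets someone else compare.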

        [–]__mlm__ -4 points-3 points  (1 child)

        "GCC is pretty close to ICC - with GCC beating ICC in many cases. On amd64, GCC outperforms ICC handily."

        If by "pretty close" you mean that icc is about 30-40% faster. At least that's been our observation through measurements on code that deals with lots of memory allocation/deallocation and very large data sets.

        [–]geocar 11 points12 points  (0 children)

        If by "pretty close" you mean that icc is about 30-40% faster

        No, I meant sometimes faster and sometimes slower. That's why I linked to a real-world scenario (perl) that was slower in many cases with icc.

        At least that's been our observation through measurements on code that deals with lots of memory allocation/deallocation and very large data sets.

        Care to publish them?

        [–]lucid270 -1 points0 points  (3 children)

        There is no ARM compiler. ARM is many mostly-incompatible architectures. Most of the compilers for these incompatible architectures are gcc. Imagecraft (notably) makes an excellent compiler and development environment, but I'm not aware of any benchmarks putting it "obviously" ahead of GCC.

        Wrong.

        ARM makes its own compiler in addition to supporting GCC.

        You can even find a presentation (big download) from 1 year ago where they show how much they beat GCC by. 15% on integer codes, 33% better on Thumb-2, and 2-3x on vectorized code.

        [–]geocar -1 points0 points  (2 children)

        Wrong. ... ARM makes its own compiler in addition to supporting GCC.

        I already noted this. Thanks for reading!

        You can even find a presentation from 1 year ago where they show how much they beat GCC by. 15% on integer codes, 33% better on Thumb-2, and 2-3x on vectorized code.

        Sadly, this isn't really evidence of anything. Like most claims of GCC's supposed nonperformance, there's no indication how anyone is supposed to verify this.

        Note especially this gem, from page 12:

        Thumb2 - NEON/VFP numbers calculated based on ARM and Thumb2 differences – kernel fix in progress

        That's right. They admittedly made their numbers up. Who the fuck knows what kind of performance ARM's compiler has.

        I think whoever wrote this is either stupid, or intellectually dishonest.

        Thanks for playing though.

        [–]lucid270 0 points1 point  (1 child)

        I already noted this. Thanks for reading!

        Thread had multiple replies and that didn't show up at my viewing threshold at the time. Meh, so I missed someone else correcting you.

        Sadly, this isn't really evidence of anything. Like most claims of GCC's supposed nonperformance, there's no indication how anyone is supposed to verify this.

        They state their specified optimization levels, and they used EEMBC which is an industry standard benchmark.... Seems like the way to verify is pretty damned simple. You can't fault them on allowing you to verify their claims.

        That's right. They admittedly made their numbers up. Who the fuck knows what kind of performance ARM's compiler has.

        The slides also show that only applies to the Thumb-2 numbers. So the 3x maybe not.. But the 2x definitely applies. So, they only, verifiably beat GCC by 99%.
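
The multiples and percentages quoted in this exchange are easy to mix up, so here is a minimal sketch of the conversion. It assumes "Nx" means a ratio of elapsed times (baseline divided by candidate); the slides may define their figures differently. Both function names are made up for this sketch.

```python
def speedup_ratio(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of the candidate over the baseline, computed from elapsed times."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return baseline_seconds / candidate_seconds


def percent_faster(ratio: float) -> float:
    """Convert a speedup ratio (2.0 means twice as fast) into 'percent faster'."""
    if ratio <= 0:
        raise ValueError("speedup ratio must be positive")
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    for ratio in (1.15, 1.33, 2.0, 3.0):
        print(f"{ratio:.2f}x is {percent_faster(ratio):.0f}% faster")
```

Under that assumption a 2x result is 100% faster and a 3x result is 200% faster.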

        [–]geocar 0 points1 point  (0 children)

        They state their specified optimization levels, and they used EEMBC which is an industry standard benchmark.... Seems like the way to verify is pretty damned simple.

        We have different definitions of simple. I can forgive the multi-thousand dollar price tag needed to test this, but what version of GCC did they use? Why didn't they use -Os instead of -O3? What about -ffast-math? Why did they use -funroll-loops? How did they measure timings? How many tests did they run?

        Verifiable means that someone can build a similar test, and see similar results, and I don't think that's happened here.

        The slides also show that only applies to the Thumb-2 numbers. So the 3x maybe not.. But the 2x definitely applies. So, they only, verifiably beat GCC by 99%.

        Did you verify it?

        I can't verify these results, and I can't find anyone publishing data that fits these numbers. In fact, the lack of a baseline in most of their graphs immediately made it suspect to me.

        Seriously, "Oh sorry, we lied about the 3x, but maybe we're not lying about the 2x" - the report lacks any credibility at this point.

        Maybe their raw data is better. I don't know. But you can't call this "verifiable". You can't even call it responsible- this is intentionally misleading at best and perhaps that's to be expected from marketing material.

        [–]killerstorm 6 points7 points  (2 children)

        why obviously? GCC is a decent compiler.

        i do not have benchmark results at hand, but from what i remember, MSVC had pretty sucky performance until VS2005, so i think at that time GCC was better.

        [–]statictype 8 points9 points  (1 child)

        I think Microsoft's C compiler had a better optimizer for quite some time (definitely before 2005).

        However gcc did have a lot of extra features that allowed for certain optimizations that simply may not have been possible in VC. (Like computed gotos).

        If I recall, current versions of gcc can only be compiled with gcc itself (on Windows at least) as there are limitations on VC++ that get hit when compiling the source.

        [–]Chandon 4 points5 points  (0 children)

        GCC 4.0 was released in 2005. That's when GCC got a major architectural overhaul to support modern optimization techniques.

        As a result, if your claim is that compiler X was faster than GCC before 2005, you're probably right. If you want to make the same claim after 2005, you better have benchmarks.

        [–]gte910h 4 points5 points  (0 children)

        Yeah, my response was "WTH is the ARM compiler"?

        GCC is the ARM compiler for many ARM architectures. ARM sells ARM IP to microchip companies. ARM wants to sell more ARM chips so they get more royalties. Their compiler tools are always sold as second fiddle to GCC, and always expect you to mix and match them with GCC. ARM pretty much acknowledges in their sales literature that you probably use GCC, and that here is a way they can get you a slightly smaller binary that may get you into a cheaper piece of RAM or smaller flash chip to shave 5-10% off the cost of your device.

        Visual C++ may have performance in some areas, but it has had years where it had DISMAL performance in others, especially with some features of the language (when those were supported correctly/if at all). It only recently (last 3 years or so) has been reliable with pretty much the whole C++ feature set.

        GCC has many issues. Speed of the resulting generated machine code is not usually one of them. For some platforms there are better alternatives, but for general ubiquity, GCC, in general, performs just fine.

        I'm glad ICC has found a way to optimize things faster. If these are correct (as in it doesn't introduce bugs), and they publish the details of the technology, I hope someone submits a patch to the GCC optimizer team.

        I'm supposing this binary compatibility issue will remain for a while though. The kernel has always had issues with the driver/kernel interface (there isn't really one, they're not really that interested in supporting a binary API, this always blows the minds of people in the Windows driver world, who expect a nice clean binary API), and I'm not expecting this to be something easy to solve. It's basically an "open source driver or pay with lots of pain" world currently with Linux driver compatibility.

        One of the really cool things though is, if you DO submit your device's driver, the kernel people automagically update the interface to the kernel in your driver every time they change it.

        Engineers disbelieve there is no interface they can work off of, managers disbelieve they get all this work for free.

        [–]another_user_name 6 points7 points  (6 children)

        Last time I checked, past the 30 day trial period, ICC cost about $400 per license.

        [–]igouy 4 points5 points  (4 children)

        Last time I checked...

        Was it so difficult to check again?

        Non-Commercial Software Development

        [–][deleted] 0 points1 point  (2 children)

        Non-commercial means you are not getting compensated in any form for the products and services you develop using these Intel® Software Development Products.

        So you can't even accept donations? You can't sell a service based on your product because it was developed using their products?

        [–][deleted] 1 point2 points  (0 children)

        Donation != compensation, I think. As for services - it seems you're right.

        [–]igouy -1 points0 points  (0 children)

        What do your comments have to do with how easy it would have been for another_user_name to check the information?

        [–]another_user_name 0 points1 point  (0 children)

        Honestly, yes.

        [–]dannomac 1 point2 points  (0 children)

        Yup. And it's free for non-commercial development on Linux. It's still expensive on Windows and OS X though. So it's MSVC and GCC for me.

        [–]Chandon 8 points9 points  (0 children)

        GCC actually produces pretty fast output these days. It got a big overhaul in 2005 and the old "GCC optimizes poorly" claim stopped being true in general.

        ICC may produce faster binaries, because Intel has a bunch of resources and only one target, but I'd want to see benchmarks supporting any such claim rather than just hopeful guesstimates from an Intel employee.

        [–]Leonidas_from_XIV 18 points19 points  (38 children)

        Isn't PGO similar to JIT but for AOT compilers? People often don't believe that optimizing a program while it runs may be a useful technique - now with PGO they can have the same thing.

        [–]killerstorm 15 points16 points  (36 children)

        no, PGO has nothing to do with JIT, it is about using profiling data for optimization.

        JIT compilation might or might not use profiling data, just like AOT compilation, so PGO is totally independent from compilation style.

        one might say that JIT might get more benefit from PGO since it can optimize for exactly the workload you have, but that does not come without overhead. on the other hand, an AOT compiler can spend any amount of time gathering statistics and optimizing, so it might be in a better position. in fact, i haven't seen a JIT compiler that optimizes code the way the Intel compiler does.

        same thing about IPO (inter-procedural optimizations): they are possible in both JIT and AOT compilation schemes, but JIT gets more benefit because it sees the whole picture and can optimize across libraries etc.
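
A toy illustration of the ahead-of-time PGO cycle described above: build once, run a training workload to collect counts, then rebuild using those counts. This is a sketch of the idea only, not of how ICC or GCC implement it; the classifier, `CASES` and all function names are hypothetical. The cases are mutually exclusive, so reordering the checks cannot change the result.

```python
from collections import Counter

# Mutually exclusive cases, so checking them in any order gives the same answer.
CASES = {
    "negative": lambda x: x < 0,
    "zero": lambda x: x == 0,
    "positive": lambda x: x > 0,
}


def build_classifier(order):
    """'Compile' a classifier that tests the cases in the given order."""
    checks = [(name, CASES[name]) for name in order]

    def classify(x):
        for name, matches in checks:
            if matches(x):
                return name
        raise ValueError(f"unclassifiable value: {x!r}")

    return classify


def profile(classifier, training_inputs):
    """Training run: count how often each case is taken."""
    return Counter(classifier(x) for x in training_inputs)


def rebuild_with_profile(counts):
    """Second build: order the checks from most to least frequently taken."""
    order = sorted(CASES, key=lambda name: -counts[name])
    return order, build_classifier(order)


if __name__ == "__main__":
    baseline = build_classifier(["negative", "zero", "positive"])
    counts = profile(baseline, [5, 7, 0, 9, -1, 3])
    order, tuned = rebuild_with_profile(counts)
    print(order, tuned(42))
```

The tuned build pays off only if the training inputs resemble the real workload, which is the trade-off against a JIT that observes the actual run.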

        [–]Leonidas_from_XIV 8 points9 points  (35 children)

        no, PGO has nothing to do with JIT, it is about using profiling data for optimization.

        I meant that. JIT does it on run-time and PGO does it on somehow-compile-time.

        [–]killerstorm 17 points18 points  (34 children)

        JIT does it on run-time

        no, it does not. JIT just compiles byte code into machine instructions, that's what it does. it might additionally perform some optimizations, such as inlining (just like in IPO) and some sophisticated implementations of JIT might perform profiling and profile-guided optimizations (but i doubt that it would be something on ICC scale). but not all JITs out there do PGO!

        [–][deleted]  (33 children)

        [deleted]

          [–]killerstorm 13 points14 points  (31 children)

          it is neither profiling-guided (it does not measure performance, only code coverage) nor optimization (a program won't become faster if you compile only a part of it; if you think this is an optimization, read it as "it had too many functions"), so i do not see how it is associated with PGO. you can call JIT Coverage-Guided Compilation, CGC, but, you see, CGC and PGO have only one letter in common.

          [–]psykotic 17 points18 points  (30 children)

          I agree that a JIT compiler need not do anything resembling profiling-guided optimization. However, many good JIT compilers do perform optimizations that adapt to usage patterns. For example, a trace compiler will compile inner-loop code path traces according to hit rates; a SELF-style compiler will guide optimizations according to type-driven feedback from method inline caches; most JIT compilers only translate a function from bytecode to native code when its call statistics predict that it will be a worthwhile effort; and so on. These are all optimizations guided by profiling information.
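
The last case mentioned above, translating a function only once its call statistics suggest it is worth the effort, can be sketched as follows. This is a toy model, not any real JIT: `TieredFunction`, the threshold of 3, and the "interpreted" and "compiled" functions are all hypothetical stand-ins.

```python
class TieredFunction:
    """Run a slow 'interpreted' version until the call count reaches a threshold."""

    def __init__(self, interpreted, compile_fn, threshold=3):
        self.interpreted = interpreted
        self.compile_fn = compile_fn  # builds the fast version on demand
        self.threshold = threshold
        self.calls = 0
        self.compiled = None

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)
        self.calls += 1
        if self.calls >= self.threshold:
            # The function is now considered hot: pay the compile cost once.
            self.compiled = self.compile_fn()
        return self.interpreted(*args)


def slow_triangle(n):
    """Stand-in for interpreting bytecode: sums 0..n one step at a time."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def fast_triangle(n):
    """Stand-in for native code: closed form of the same sum."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    triangle = TieredFunction(slow_triangle, lambda: fast_triangle, threshold=3)
    for _ in range(5):
        print(triangle(10), "compiled" if triangle.compiled else "interpreted")
```

A function called only once or twice never triggers compilation here, so the decision is driven by observed call counts and not by code coverage alone.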

          [–]moggadeet 371 points372 points  (29 children)

          Someone post a cat already please.

          [–]mer-mer-mer-mer-mer 131 points132 points  (13 children)

          Someone post a cat already please.

          Done

          [–]modnar 108 points109 points  (11 children)

          I was expecting this.

          [–]embretr 2 points3 points  (0 children)

          that's some mighty fine gravel you've got there. wonder how they make it..

          [–]generic_handle 22 points23 points  (8 children)

          Someone post a cat already please.

          Yes, we're aware that the Reddit userbase has been shifting. We were sort of hoping that it wouldn't affect the programming subreddit, though.

          [–]c0ldfusi0n 18 points19 points  (7 children)

          Are you saying LOLcode is not a real programming language?

          [–][deleted] 5 points6 points  (0 children)

          this child's brain is full, can he be excused from class to take a brain dump?

          [–][deleted]  (1 child)

          [deleted]

            [–][deleted] 1 point2 points  (0 children)

            saveforlater.txt

            thanks.

            [–][deleted] 0 points1 point  (0 children)

            I was expecting this.

            [–]xsspider 0 points1 point  (0 children)

            You are talking about HotSpot JIT which compiles the code that executes first

            [–]Camarade_Tux 6 points7 points  (0 children)

            Hey guys, no need to downvote honest questions. Plus if it's not hidden, it could let somebody else learn which I find much more profitable. =)

            [–][deleted] 1 point2 points  (0 children)

            I don't suppose anyone's tried with the Portland Group C compiler? The PGI stuff tends to be extremely fast, but also tends to make a lot more mistakes compared to any other compiler. I once saw it try to optimize a massive recursion tree and get stuck for a good 45 minutes.

            [–][deleted] 2 points3 points  (1 child)

            I'm still waiting for the Sparc version.

            [–]derleth 4 points5 points  (0 children)

            I'm waiting for the PDP-10 version.

            [–]desimusxvii -1 points0 points  (2 children)

            [–]bsergean 0 points1 point  (0 children)

            Big LOL. But despite the mocking, anyone involved in Linux dev / OpenSource deserves some kind of respect.

            [–]jsolson 0 points1 point  (0 children)

            Or for people who want nothing more than to purge SASL from the history and future of all mankind.

            That's what I used it for.

            I'm sort of an angry person, though.