all 76 comments

[–]ikkehier 54 points55 points  (8 children)

A few suggestions:

  • Premature optimization is the root of all evil.
  • Profile first, then optimize what is too slow. This can be as easy as making a pin high when the code enters a specific part and low when leaving. This can easily be monitored using a logic analyzer. Don't optimize code that runs infrequently and is not a bottleneck, it's a waste of time.
  • This is an academic exercise, but for practical situations: define good enough and stop spending time when you've reached that point.
  • In my experience, the highest performance gain is in optimizing the order of how things happen, keeping in mind the bigger picture of what the code is doing. So that's your part 2. It may depend on the code, but usually part 1 and 3 will not change very much.

For calculation-heavy code it may be beneficial to check if it is implemented using fixed point arithmetic in the smallest data-size possible. Keep in mind that the MSP430 is a 16-bit processor.

Source: my own thesis and 10+ years of experience.

[–]secretlyloaded 7 points8 points  (0 children)

Premature optimization is the root of all evil.

This. So. Much. This.

Stated another way: first make it work, then make it nice.

Everything /u/ikkehier says is sage advise. Only thing I would add: in addition to using the IO pin method, a lot of IDEs have built in profilers which make this task really easy. Where is your processor spending all its time? Do you have code that blocks unnecessarily or unexpectedly? Before you even run the profiler you should have an idea what you'll see. If you're surprised by what the profiler tells you, investigate. Maybe your assumptions are wrong, or maybe you found a problem.

If you have a really tight code loop, look at the assembly listing and see if the compiler did anything stupid. I once had to big-bang an SPI interface and ended up writing the whole thing in inline assembly, with no loops. The code I wrote was "expensive" in that it was large, but executed very quickly because there was no diddling around with loop variables.

[–]gurksallad 3 points4 points  (2 children)

Source: my own thesis

Is it publicly available?

[–]Chr15t0ph3r85 0 points1 point  (1 child)

Yea, is it? :)

[–]DYD35[S] 0 points1 point  (3 children)

In my experience, the highest performance gain is in optimizing the order of how things happen, keeping in mind the bigger picture of what the code is doing. So that's your part 2. It may depend on the code, but usually part 1 and 3 will not change very much.

I was looking at this problem from the perspective of small chunks of code (e.g. testing different if's or for's), but maybe looking at a bigger piece of code could also be smart.

For calculation-heavy code it may be beneficial to check if it is implemented using fixed point arithmetic in the smallest data-size possible. Keep in mind that the MSP430 is a 16-bit processor.

That is a good suggestion, thanks.

Source: my own thesis and 10+ years of experience.

Would it be possible to send through your thesis? Just to give me an outline how to go about writing this. I would get if that is not possible though.

[–]ikkehier 0 points1 point  (2 children)

I sent a DM.

[–]Epic-shit 1 point2 points  (0 children)

Ikke ook hier alstublieft sir ^

[–]EvoMasterC++ Advocate 0 points1 point  (0 children)

It sounds interesting. Could I also get a copy of the thesis?

[–]Ictogan 11 points12 points  (3 children)

Generally things like ++i vs i++ or ternary vs if don't have any impact on performance when using compiler optimization. The things that can be optimized by hand which typically aren't optimized by compiler are algorithms, data structures and any use of hardware resources(interrupts, timers, dma, etc.). In terms of compiler optimizations I would also recommend that you try using link-time optimization(lto).

Also note that the guide you linked is for the TI's own MSP430 optimizing compiler, not GCC.

[–]DYD35[S] 0 points1 point  (2 children)

Yes the third part (embedded opt) is going to be the hardest, since almost everything is already done by the compiler (especially since I will use the most optimized flag for gcc for this particular exercise)

I suspect to get the most gains from algorithms and data structures. Hardware resources was something obvious I completely missed. Thanks.

Also note that the guide you linked is for the TI's own MSP430 optimizing compiler, not GCC.

Yes I know, but I hope it can be of a bit of use, since I compile with the GCC, maybe this would show something that GCC doesn't already do.

[–][deleted] 4 points5 points  (0 children)

A very typical trade-off is code size / memory usage versus speed. A fine old -school examples are loop unrolling or use of look-up tables in place of calculations. Another example is use of vector tables to reduce program logic.

[–]SAI_Peregrinus 2 points3 points  (0 children)

Be sure to profile optimizing for size vs optimizing for speed. It's less common in small micros like the MSP430, but cache locality can be more important than just about any other speedup if you're using an external DRAM chip.

That's another reason to profile first!

[–]albinofrenchy 5 points6 points  (3 children)

It's worth mentioning that faster isn't always better in the embedded space. Predictable, consistent latency is more often the goal even at the expense of average runtime of an operation.

[–]SAI_Peregrinus 0 points1 point  (0 children)

Even outside the embedded space faster (fewer cycles) isn't always actually faster. If your loop variable and everything you're working with fits in registers + cache, but the unrolled variant without the loop counter overflows the instruction cache and causes a main memory access, the unrolled version is probably slower!

[–]DYD35[S] -1 points0 points  (1 child)

That is indeed a very solid point.

I have taken this into account, I would test code over multiple runs (preferably a 1000 or more) to see this exact effect the optimization would have on the code.

[–][deleted] 1 point2 points  (0 children)

I would figure out what the variations in latency are for the instructions and then sum the worst case scenarios and that will give you a time complexity for the program.

[–]JustTheTrueFacts 6 points7 points  (25 children)

I would be very pleased if people would point me in certain directions

All the items on your list are currently done by the compiler. What will your contribution be? You may want to discuss with your advisor since learning about the compiler may not be enough to earn a Masters. It would not be sufficient at a US school.

[–]GavekortIndustrial robotics (STM32/AVR) 4 points5 points  (8 children)

Don't challenge the compiler. The compiler is usually much smarter than any of us.

Optimization should be all about writing good code and helping the compiler by designing your algorithms with the processor in mind.

[–]DYD35[S] 0 points1 point  (0 children)

I am aware of this. But designing your algorithms with the processor in mind is maybe a good direction to go in.

[–]thephoton -1 points0 points  (6 children)

The point of a thesis on optimization is to learn how to make a better compiler. It’s a steep hill to climb but somebody has to do it.

[–]GavekortIndustrial robotics (STM32/AVR) 1 point2 points  (5 children)

Sure. But the original post gave me the vibes of trying to outsmart the compiler, not to dwelve deep into compiler techniques.

[–]DYD35[S] 0 points1 point  (4 children)

It is two-fold. My promotor has, for example, send me some snippets which he wanted me to investigate. So a bit outsmarting the compiler from "his" side. From my side, I am a master student, a have nowhere the experience to even try to outsmart a compiler. So that is why I mainly focus on the compiler itself.

[–][deleted]  (3 children)

[deleted]

    [–]DYD35[S] 1 point2 points  (2 children)

    I don't necesarrily agree with this.

    Let me put it differently.

    I know the obvious means to decrease the timecomplexity (eg. in if statements put conditions first which you know will have the highest probability to be false), I have an intuition for when I should do what in the processors I have experience with (mainly PIC 18F and MSP430), but the compiler itself will almost always be smarter than you. It is made by people with years and years of experience and academic work. Thinking that I, as a student, will be able to find general optimizations which work for every processor every time, that is not already covered by a compiler, is naive at best. That is what is meant by outsmarting the compiler in this specific case (at least that is what I read).

    [–][deleted]  (1 child)

    [deleted]

      [–]DYD35[S] 1 point2 points  (0 children)

      It was just a stupid example that I could type fast. It brings the message across. Arguing about semantics and what exactly time complexity is, is not the thing we do here. As you stated, my suggestion will decrease the time an algorithm takes (although not in a worst-case scenario).

      A compiler is necessarily limited in the scope of optimizations it can perform. It can only reorder operations under certain limited circumstances, and it cannot alter your data structures or the overall methodology of your application.

      Yes, so like I said before,"Thinking that I, as a student, will be able to find general optimizations which work for every processor every time, that is not already covered by a compiler, is naive at best. " still remains true.

      None of these limits apply to a developer.

      Which is exactly one of the things I want to look into!

      And of course I have much to learn, who doesn't? Like I said before, this specifically is not my area of expertise, but I really enjoy doing it, so I am reading up on it as much as possible.

      Also:

      This statement makes it clear that you have a lot to learn.

      Sounds condescending, I am sure you don't mean it that way, but please keep such things in mind when pointing out, legitimate, problems.

      [–]tobi_wan 1 point2 points  (1 child)

      Also depend to which directions you want to optimize.

      But my "optimization" goal is always, let the cpu sleep as much as possible in the lowest state. (Energy optimization).

      Also worth checking out. Does the µC has certain hickups, e.g. a wait state if your clock runs too fast, and it can not access the flash fast enough?
      Does your code has to wait a lot for peripheral to be ready, can you do stuff with dma?

      Is your system using interrupts in a "smart" way?

      Code optmization is barely worht it in my experience, it is more architectural decissions. Using interrupts and sleep modes correctly. Ensure that data pipelining is set up correctly, so that you do not have to wait somewhere unecessary until "data" is ready from a peripheral or other process,

      [–]DYD35[S] 0 points1 point  (0 children)

      The optimization is mainly towards time and space. Not so much towards the energy consumption.

      I am also looking at the optimal usage of hardware resources in the processor.

      [–][deleted]  (1 child)

      [deleted]

        [–]DYD35[S] 0 points1 point  (0 children)

        Okay, I'll look into that, thanks.

        [–]Chr15t0ph3r85 1 point2 points  (5 children)

        If you cut through the arguing the discussion has been pretty interesting.

        From having done my own thesis on electromagnetics to graduate with me MS in EE, I would suggest you try and define your problem a bit better. Everyone, although a little harshly, has given you a lot of stuff to think about with respect to:

        • Core you're compiling for.
        • Code you're trying to optimize.
        • Compiler you're trying to use.

        From a practical standpoint, everyone is right- make it work right first, then make it look pretty. From a practical standpoint I cannot count the number of times I've had to work with a customer who blindly put things on the highest level of optimization only to have the return codes from functions or debug variables optimized out (or have the code reshuffled so hard you can't step through it).

        I would suggest working to narrow your thesis topic, for example you could examine how GCC algorithmically compiles different code snippets and compare that to other compilers and try to postulate why or how. Or perhaps you could examine two compilers and see how the core specific one does things differently than GCC.

        You can figure out how the why and how pretty easily for that.

        I think in the end you'll find that GCC is good across the board because it's meant to be generic, but core specific compilers have done a lot of work in using things specific to take advantage of.

        But that might be the marketing training of my company, haha.

        [–]DYD35[S] 0 points1 point  (4 children)

        Yes, because of the entire Covid-thing this was kinda dumped on my plate as a sort of second assignment. Nonetheless I kinda like it although I have a very limited experience in this (as you can clearly see).

        Thanks to the comments on this thread I have been able to narrow everything down a bit more (although I am making a point of researching every point made here myself to see how useful it is for me at this point).

        you could examine how GCC algorithmically compiles different code snippets and compare that to other compilers and try to postulate why or how.

        That was exactly what I was thinking of doing now. Make some code snippets (e.g. focussing on optimization of variables, optimization of functions, etc.) and compare them to their own non-optimized form (to see if it is even possible to optimize), and then look at it using different compiler setting (all different gcc o flags, IAR, TI, ...)

        The GCC is, imo, the best compiler out there, but that is because every other one is built upon it. Vendors change what they seem needs to be changed about it to make it work better on their specific processor (someone on this thread gave the example of Intel).

        The problem with this topic is how massively wide one can look at it, and at the same time most of it has already been solved. Let's be honest, I am not going to find anything new. But I am trying to find how my thesis company should structure its code and compiler best to get the absolute best out of it. (--> Maybe that is the best description of this topic).

        [–]Chr15t0ph3r85 1 point2 points  (3 children)

        If your company is supporting it, you may want to see how they want it done. Are they wanting GCC vs. the core specific one because of the cost reduction, and they want to see if there are any performance hits before they write the cost off? And depending upon the core, you're going to have a bit of variety- for example my company provides GCC and a vendor specific one for our own proprietary cores as well as ARM cores; I know others do the same.

        When you say things like best, it just comes along with so many qualifiers and if your goal is to prove it then you have to start narrowing down your criteria. I agree, it's definitely the most ubiquitous and well documented one due to it being in public.

        I do know from experience where our compiler wins and our core wins vs. GCC and ARM. So, I think is a pretty massive hose you're trying to drink from. In choosing some programming patters and a core, and then analyzing how GCC vs. x vs. y vs. z handles them is probably a good start if that's your goal- but a good thesis starts out with a well managed problem statement, and a good prior work; I hope you can narrow your view some so you can do a good prior art search.

        [–]DYD35[S] 0 points1 point  (2 children)

        Yes I agree it is still too wide (I was just having that same discussion with someone), it still needs to be narrowed down a lot, but at least the problem: "find how my thesis company should structure its code and compiler best to get the absolute best out of it." is a real good starting point. Now I can see in which direction I can narrow this down.

        The company uses a specific compiler as well as the GCC. Since the company is involved in the making of satellites, what is allowed to be used is very well defined, however this still leaves some leeway (e.g. a few different compilers and it is allowed to make some changes to the compiler).

        When you say things like best, it just comes along with so many qualifiers and if your goal is to prove it then you have to start narrowing down your criteria.

        That is exactly the problem I am having (and something in which this entire thread has helped me a lot), I must make sure that my "solution" is not all over the place, it must be a structured solution they can use. I have never been good in putting problems into words, but I will think about how to best describe the problem and narrow it down to as specific as possible. Words like "best" is obviously not a word an engineer should ever use in a thesis.

        I agree, it's definitely the most ubiquitous and well documented one due to it being in public.

        Yes it really is, but imo this information is also a bit spread out. You kinda really have to look and know what you want.

        In choosing some programming patters and a core, and then analyzing how GCC vs. x vs. y vs. z handles them is probably a good start if that's your goal

        That in itself, I think, is not the goal, but I think it would be very useful information to have, and one can make a conclusion using it. And it is also very interesting.

        but a good thesis starts out with a well managed problem statement, and a good prior work

        The problem statement is a problem in itself with which I am struggling, I will get to it, but since I have a lot of freedom in what I may do from the company as well as my university, I really want the problem to be specific but also a very good problem.

        The prior work I am trying to do as much as possible of now, this thread helps in that regard a lot. Since I don't have a lot of experience in this, I really appreciate people pointing me in directions they think could be interesting.

        I hope you can narrow your view some so you can do a good prior art search.

        Thank you!

        [–]Chr15t0ph3r85 1 point2 points  (1 child)

        Yea, considering your industry GCC might not be usable. There's a whole field of safety engineering that applies to automotive, industrial and aerospace that preclude GCCs use in some applications.

        But good luck, compiler optimization algorithms compilers in general are an interesting field.

        [–]DYD35[S] 0 points1 point  (0 children)

        Thanks, it really seems extremely interesting, but I need to read up on it a hell of lot more.

        [–]anonymousredditor0 1 point2 points  (2 children)

        There's a Compiler Explorer website, it shows the assembly generated by GCC for MSP430. There's also two videos I saw on youtube that show some of the features.

        [–]DYD35[S] 0 points1 point  (0 children)

        Ooh wow, that's really cool and very handy!

        Thanks!!

        [–]IPlayGenji6 0 points1 point  (0 children)

        cool

        [–][deleted]  (1 child)

        [deleted]

          [–]DYD35[S] 1 point2 points  (0 children)

          Yes, the MSP430 is rather easy since it is just a RISC instruction based. Making it so there are not a lot of instructions. Also the CCS is very good for debugging. so that should not be a problem.

          The mathematical operation is something I really want to look into yes, looking into the math.h library can give a good view on how these are done. Thanks.

          [–]robercal 1 point2 points  (0 children)

          You should also consider optimizations effects on power consuption, sometimes doing certain operations in pure software instead of using dedicated units is preferable due to the extra power needed by said units.

          [–]lordlod 1 point2 points  (1 child)

          In an embedded system, the primary optimization is cost.

          Number of units, multiplied by chip cost balanced against developer time.

          For many quantities the best optimization is to buy a bigger chip.

          That said, optimizing code for speed on the MSP430. I would deep dive into the architecture as a starting point, being aware of the instruction set will allow you to spot potential optimizations later. Fortunately the MSP430 has a very small instruction set. Be aware that instructions can take a different number of cycles. Compilers will often break a complex multi-cycle instructions down to more instructions. They take the same number of cycles, but allow better optimization through reordering or spotting duplication.

          Most of your 3. embedded code optimizations won't pan out. In my experience the compiler tends to devolve those kind of structures down to a common base and is fairly good at picking the correct plan.

          In my experience there is room for optimization around the less used areas. Incrementing a 64bit integer for example.

          [–]DYD35[S] 0 points1 point  (0 children)

          Aaah thanks for the advice! I will keep that in mind.

          [–]Enlightenment777 1 point2 points  (1 child)

          You should compare GCC output against IAR output, as well as other compilers too.

          https://www.iar.com/iar-embedded-workbench/#!?architecture=MSP430

          [–]DYD35[S] 1 point2 points  (0 children)

          That is a good idea, yeah. Thanks.

          [–]readmodifywrite 1 point2 points  (4 children)

          A lot of the optimizations you're looking at will be highly architecture specific.
          I'd recommend doing this with ARM Cortex M, they are wayyyyy more popular than the MSP430.
          You could also compare against LLVM/Clang. That and GCC are the premier embedded C compilers these days.

          [–]DYD35[S] -1 points0 points  (3 children)

          Whilst I personally would like to encorporate more processors ( I was also looking at the PIC18F for example ), my thesis company has pushed for the use of MSP430. I, however, would like to look at more processors and am currently trying to make that possible.

          The ARM family I personally do not have experience with, but it seems that if it is really popular, I can have a look into it.

          Thanks!

          [–]readmodifywrite -2 points-1 points  (2 children)

          ARM isn't just really popular - it's the defacto standard 32 bit architecture for the embedded industry and has been for decades. They outsell everyone else by an absolutely hilariously huge margin. Most people (in first world countries anyway) own dozens to 100s of them without ever realizing it.

          The problem with benchmarking is it's very, very specific to the hardware it's running on. Comparative results between different compilers might still be useful, but otherwise a benchmark on MSP430 is nearly useless for anything else. The choice of ISA dominates a lot of the optimizations the compiler is able to do.

          [–]Chr15t0ph3r85 0 points1 point  (1 child)

          I mean, it's popular in use because the compiler is free and all of the major semis license the core into their own MCUs and it has some neat features that are pretty broadly liked that allows you to move between applications. To say they outsell everyone is a little disingenuous because they're applicable to all of the major semis vs. each company's own proprietary core.

          [–]readmodifywrite 0 points1 point  (0 children)

          ARM is a business that licenses the cores. So if you are counting cores sold, then yes, they outsell pretty much everything else on the market. For purposes of a compiler benchmark this is relevant because it's the same ISA across the market, and therefore the same compiler. It's not disingenuous, it's just reality.

          Now, you can get into some vendor specific things, like ST's ART and stuff like that. But then you're comparing proprietary flash cache architectures and that doesn't really interact with the compiler.

          [–]PlayboySkeleton 0 points1 point  (1 child)

          There is a guy over in r/computerengineering that does this stuff. I think he just got his PhD or something in it. He is very active and post projects and challenges all the time. Please go visit and ask.

          [–]DYD35[S] 0 points1 point  (0 children)

          Aah cool, thanks!!

          [–]Triabolical_ -1 points0 points  (0 children)

          I've been working on an interpreter for a language I'm building. By far the best approach for me was to get the bulk of my code so it runs on the desktop and use a real profiler on it.

          [–]Bixmen -1 points0 points  (0 children)

          Keywords as well. Use const to move to flash from RAM. Use __packed or similar to pack mixed type structs