
Saturday, August 10, 2019

Microchip XC32 Compiler Optimization


The humorist Mark Twain is often credited with saying,

    “There are lies, darn lies and statistics”

If Mr. Twain had been a programmer, he might have said something to this effect,

    “There are lies, darn lies and benchmarks”

Benchmarking has a long history of being bent into something that produces a good number that has nothing to do with your actual situation. This is especially true now that we have embedded processors with caches and all sorts of other performance-enhancing (and performance-limiting) factors.

In the 1980s one benchmark test really took hold: the “Sieve of Eratosthenes”, a prime-number calculation that was used quite a bit to see which compilers produced the 'fastest' code [1].

This benchmark really looked at just one aspect of performance: calculating prime numbers. It didn’t test graphics, IO speed or a host of other factors that are also important in real-world applications, so it was very narrow in scope.

It gained such prominence that the compiler writers of the era actually wrote optimizers that would detect this code sequence and apply special techniques to get the fastest possible performance on this specific benchmark. There was nothing wrong in doing this. It is like producing a car tuned for a fast 0-60 MPH time: the transmission can detect ‘pedal to the metal’ and change the shift points to get the fastest 0-60 time, which is great for bragging rights, but that is a rarely used sequence in real-life driving. In my opinion, much the same applies to most modern microprocessor benchmarks.

What’s the point then?

This ‘Optimization’ project started for two reasons,

1) I was working on a battery-powered image processing application and wanted to see what the various XC32 optimization levels bought me: if a higher optimization level sped the code up enough, I could lower the CPU clock speed and save a significant amount of power.

2) To see generally what the various PRO XC32 optimization levels do compared to the free version. It has been noted online that some people feel ‘cheated’ out of ‘significant’ performance improvements with the free version of the XC32 compiler, so we’ll take a look at that.

Note: You may know that Microchip provides a free compiler for its PIC32 processor series called XC32. It is currently based on GCC 4.8, and the free version provides the -O0 and -O1 optimization levels. The ‘paid’ version includes support and the other GCC optimization levels: -O2, -Os and -O3.

The XC32 optimization levels are not exactly the same as the standard GCC levels, but they roughly follow them, as described in the XC32 2.0x manual.


Some Notes

I used XC32 versions 2.05 and 2.10 for these tests. (These versions perform the same in my tests; the differences between them only add some new devices and fix some corner-case defects, as can be seen in the release notes.)

When debugging your program logic it is useful to set the optimization level to -O0, as this produces pretty much 1:1 code with your C source, which makes following the logic and looking at variables easy. Inlining of functions is also disabled, so functions appear as you wrote them.

-O1 is the standard optimization level for the free version of XC32, and even this level inlines functions and aggressively promotes variables to registers or the stack, making your code run much faster and be smaller, but also making it much harder to follow in a debugger. This is the optimization level you most likely want for your ‘Release’ code; after all, who doesn’t want faster, smaller code?
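If you do need to debug an optimized build, GCC-based compilers like XC32 let you exempt a single function from inlining with the standard GCC 'noinline' attribute. A quick sketch of the idea (my illustration, not from any particular project):

    /* A tiny helper like this will almost certainly be inlined at -O1. */
    static int scale_pixel(int raw, int gain)
    {
        return (raw * gain) >> 8;
    }

    /* The standard GCC 'noinline' attribute keeps a function callable and
       breakpoint-able even in an optimized build, at a small cost in speed. */
    static int __attribute__((noinline)) scale_pixel_dbg(int raw, int gain)
    {
        return (raw * gain) >> 8;
    }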

Many standard libraries, like the MIPS PIC32 DSP libraries, are written in hand-optimized MIPS assembly language and hence bypass the C compiler completely, so you get the fastest possible performance even with the free version of the compiler.
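For illustration, calling such a routine is just an ordinary C function call. The name and signature below are hypothetical stand-ins (check the actual DSP library headers for the real API):

    #include <stdint.h>

    /* Hypothetical prototype, standing in for a hand-written MIPS
       assembly routine from a DSP library. */
    void vec_add16(int16_t *out, const int16_t *a, const int16_t *b, int n);

    void MixBuffers(int16_t *out, const int16_t *a, const int16_t *b)
    {
        /* The C optimization level has no effect on the routine
           itself - its body was never C to begin with. */
        vec_add16(out, a, b, 256);
    }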

-O2, -Os and -O3 optimizations are only available with the paid version of XC32, which is quite inexpensive at less than $30.00 per month and a must for professional developers, if only for the expedited support the license includes.

[2020 Update] XC32 Version 2.40 and above now includes the -O1 and -O2 optimizations in the free version (see the release notes).

My hardware test setup for all these tests was a PIC32MZ2048EFH processor running at a 200 MHz clock speed.
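For reference, all the timing in this article boils down to counting how long a section of code takes to run. A minimal sketch of one common way to do that on a PIC32 (not the exact harness used here) is the CP0 core timer, which increments at half the system clock, i.e. 100 MHz in this setup:

    #include <xc.h>
    #include <stdint.h>

    /* The CP0 core timer ticks at SYSCLK/2: 100 ticks per
       microsecond with a 200 MHz system clock. */
    #define CORE_TICKS_PER_US  100u

    /* Time a function under test. The unsigned subtraction handles
       a single timer wrap correctly. */
    static uint32_t TimeUs(void (*code_under_test)(void))
    {
        uint32_t start = _CP0_GET_COUNT();
        code_under_test();
        return (_CP0_GET_COUNT() - start) / CORE_TICKS_PER_US;
    }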

On to the Benchmarking

One of the standard benchmarks used with advanced 32-bit processors is ‘CoreMark’ [2]. This is a multifaceted benchmark that tries to simulate many different aspects of an actual application, yet in the end produces a single performance number. When I looked at it, I found that the code is quite small and doesn’t move a lot of data around, so in execution it can probably spend most if not all of its time in any modern 32-bit processor’s cache.

Another application I used in my benchmarking was a custom image processing application that I wrote. It is not big in the sense of requiring a large program memory footprint, but it does use upwards of 57,000 bytes of data as it processes a 32 x 24 image up to 320 x 240 for display, and along the way: scales, normalizes, applies automatic gain control, maps the resulting image to a 255-step color map, etc. So there is quite a bit of data manipulation along the way, and all the data cannot fit in the PIC32MZ’s 16k data cache at once.

The last application I used was a relatively large demo program provided by Microchip that represents a complete application with extensive graphics, etc. This application was compiled just to see the relative code sizes the XC32 compiler produced, as I thought the CoreMark and my application were really too small for a realistic analysis of program size.

CoreMark Program Insights

The CoreMark site [2] posts results along with the compiler “Command Line Parameters” that were used to build each one. This information was very interesting, as you will soon see.

First, let’s take a look at the CoreMark execution speed versus the various XC32 compiler optimization levels. The CoreMark program, when compiled at -O1, is only 32 kBytes long and uses only 344 bytes of data memory, so it is quite small and probably runs completely in the PIC32MZ processor’s cache; speed of execution is all we can really look for here.

  
Figure 1 – This is the execution speed for the CoreMark with various optimization levels. All results were normalized to the -O1 level as this is the highest optimization for the free version of the XC32 Compiler. See the text for a discussion of each optimization.

I included the -O0 optimization level in Figure 1 just as a comparison, to show what a huge difference even the free -O1 optimization level makes on code performance. The difference between -O0 and -O1 in the CoreMark (and nearly every other application I have ever compiled) is nearly 1.5:1; no other optimization makes that big a jump. In fact, all the other optimizations and ‘tweaks’ only produce marginal gains over the -O1 level. As noted, the -O0 optimization level is really only useful when debugging; no one would ever build a final application at this level unless they really just don’t like their customers.

-Os is ‘optimize for size’ and, as expected, it results in a nearly 10% performance hit here. The CoreMark program has a very small memory footprint, so this option would be a waste of time in any small application like this one; it is included only as a reference.

-O1 is the default optimization for the XC32 compiler free version. All the other results are normalized to this result (100%).

-O2 and -O3 produce only marginal gains above the -O1 level. -O2 is 13% faster, -O3 is 17% faster.

-O2++ is the optimization that Microchip used for the benchmark results posted on the CoreMark site and, as might be expected, it includes some undocumented command line options and some specific tweaks to get the performance up as much as possible. Again, there is nothing that says anyone can’t optimize specifically for the benchmark, only that others should be able to duplicate the results, which I was able to do. Here are the command line options for -O2++ as I found them,

-g -O2 -G4096 -funroll-all-loops -funroll-loops -fgcse-sm -fgcse-las -fgcse -finline -finline-functions -finline-limit=550 -fsel-sched-pipelining -fselective-scheduling -mtune=34kc -falign-jumps=128 -mjals

The really interesting option here is “-mtune=34kc”. I could not find this option documented anywhere, and I really did not have time to search through megabytes of source code to try to find it. But “34Kc” is the designation MIPS uses for the core that the PIC32MZ is based on, so it is some sort of optimization for this specific core.

The bottom line is that these optimizations produce the fastest result, 30% faster than -O1 alone. But since that is only some 12% faster than the base -O3 optimization, it is only marginally better than -O3 by itself.

-O3++ is where I applied these same -O2++ command line options on top of the base -O3 optimization, just for fun. This produced a marginally faster result than -O3 alone, but it was still slower than -O2++.

It is interesting to see the results but, again, while CoreMark is more comprehensive than the 1980s ‘Sieve’ benchmark, it probably does not apply to your application at all; it is small and uses an incredibly small amount of data memory.

Image Processing Program Insights

This was an actual application that I wrote. It takes raw data from a 32 x 24 pixel image sensor, converts the raw data into useful values, up-interpolates the pixels to 320 x 240, and then scales and limits the values for display. The data formats were int32, int16, int8 and float. A large amount of data was processed, some 57,000 bytes for each image.

Reading the sensor and writing to the display are fixed, running as fast as the hardware interfaces allow, so nothing can be done about them. The experiment was to see if the central processing algorithms could be sped up enough to allow me to slow down the CPU clock and save on battery power. Running some benchmarks was a first step in that determination.

    while(1) {
        GetSensorData();             // Fixed by how fast the sensor can be read.
        ConvertDataArray();
        BiLinInterpolate();
        GetImageAdjustments();       // User / GUI Interaction – get settings.
        ApplyExpansion();
        ApplyContrastCurve();
        ApplyBrightnessAdjustment();
        DisplayImage();              // Fixed by how fast the display can be written to.
    }


Figure 2 – The central image processing ‘loop’ consists of routines like this. At each step the data arrays are processed in deterministic loops. Some of the processing is redundant, because multiple loops were used so that each step consists of one specific data operation. These loops could be combined if need be, as sketched below. But without some profiling first the effort might be in vain (see the conclusion); guessing almost never pays off in optimizing.
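To make the ‘combining loops’ idea concrete, here is a simplified sketch (illustrative only, not the actual application code). Fusing two single-purpose passes into one halves the number of trips through memory, which starts to matter once the data no longer fits in the cache:

    #include <stdint.h>

    #define NPIX (320 * 240)
    extern int16_t img[NPIX];
    extern int16_t gain, brightness;

    /* Two single-purpose passes: easy to read and verify, but the
       image is walked through memory twice. */
    void TwoPass(void)
    {
        for (int i = 0; i < NPIX; i++)
            img[i] = (img[i] * gain) >> 8;      /* expansion  */
        for (int i = 0; i < NPIX; i++)
            img[i] += brightness;               /* brightness */
    }

    /* Fused version: one trip through memory does both steps. */
    void OnePass(void)
    {
        for (int i = 0; i < NPIX; i++)
            img[i] = ((img[i] * gain) >> 8) + brightness;
    }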


Figure 3 – The simplest optimization is to use the compiler’s built-in ‘smarts’ to make the code faster. Here my simple but data-intensive image processing program was compiled with various optimization settings and the speed of execution was measured. The optimization level -O1 was normalized to 100%.

Figure 3 is pretty straightforward. I timed only the image processing portion of the code, excluding the hardware IO, as that is fixed by hardware constraints. Optimization level -O0 would never be used for released code; it is included here only as a comparison to show how aggressive the compiler gets even with the free -O1 optimization. Interestingly, option -O2 produced a slower result than option -O1 in this example; there is probably some data inefficiency going on with this option. As expected, however, option -O3 produced the fastest ‘standard’ result, but really only marginally faster than -O1, at around 10%.

XC32 also has some extra ‘switches’ that can be tweaked from the GUI. I set all of these for the -O1++, -O2++ and -O3++ level tests. These switches are shown in figure 4.


 
Figure 4 – XC32 allows you to set these switches to turn on more aggressive compiler options, as can be seen in figure 3 in the ‘-Ox++’ results. The result is faster, but not by a lot, typically less than 5%.

The bottom line here: the XC32 free version’s -O1 optimization was only slightly slower, by less than 15%, than the paid version’s maximum -O3++ optimizations. OK, but nothing to get too excited about.
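To put that in battery terms, using a rough rule of thumb (at a fixed core voltage, dynamic power scales approximately linearly with clock frequency): a 15% speedup would let the CPU clock drop by roughly 15% for the same throughput, for example from 200 MHz to about 170 MHz. A real saving, but a modest one.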

Large Application Insights

This large application is based on Microchip’s “Real Time FFT” application example program [3]. It is a fairly large application: it uses a lot of data and has an LCD and an extensive user interface, along with ADC and DAC drivers and FFT calculations. I don’t have the hardware to run this code, so I looked only at the generated program and data sizes. The data size held steady at 200 kBytes independent of the optimization level, which is expected for a large graphics-oriented program like this. The compiled program size was 279 kBytes at optimization level -O1.


Figure 5 – The Microchip application example “Real Time FFT” was compiled at various optimization levels and the resulting program SIZE is plotted here. This is a rather big application, at some 279 kBytes when compiled at -O1. As can be expected, when optimizing for absolute maximum speed (-O3++, see figure 4) the program grows dramatically; the speed comes at a huge cost in size.

As can be seen in Figure 5, the -Os optimization gave only a marginal size decrease of around 7% over the default -O1 optimization. -O3++, however, grew very large, probably mostly due to figure 4’s “Unroll Loops” switch, which forces the unrolling of all loops, even non-deterministic ones.
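For reference, loop unrolling trades program size for speed by replicating the loop body so that less time is spent on the compare-and-branch overhead of the loop itself. A hand-written sketch of what the compiler does mechanically (assuming, for simplicity, a trip count divisible by four):

    /* Original deterministic loop. */
    int SumBuf(const int *buf)
    {
        int sum = 0;
        for (int i = 0; i < 240; i++)
            sum += buf[i];
        return sum;
    }

    /* Unrolled by four: fewer compares and branches per element,
       but four copies of the body - code size grows accordingly. */
    int SumBufUnrolled(const int *buf)
    {
        int sum = 0;
        for (int i = 0; i < 240; i += 4)
            sum += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
        return sum;
    }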

This result is to be expected, as any compiler’s ‘money spec’ is execution speed, not program size, which for the majority of real-world applications is the proper trade-off. As my image processing application shows, the performance gain from -O1 to -O3 can be minimal, while the increase in program size may be excessive, especially if you are running out of space and can’t go to a larger memory device for some reason.

In a very large application, code size may be a real issue when you want to save money by using a smaller memory chip. The -Os option probably isn’t the silver bullet you are looking for, as the less-than-10% code size savings here are nearly insignificant. Microchip has an application note [4] that provides some interesting information on how -Os works, plus some more optimization tricks for code size. Those tricks, when applied, resulted in less than a 2% improvement over the -Os optimization alone, so while the note is interesting reading, it provided little further improvement.

Conclusion

The only good benchmark is the one of your actual running application code. That being said, one item that can be gleaned from the experiments here is that the free XC32 compiler, running at its maximum optimization level of -O1, produces code that is within 15-20% or so of the maximum optimizations possible with the paid XC32 compiler. The other item is that, by spending hours and hours hand-tweaking the compiler optimizations for the particular task at hand, you might coax out 20% or perhaps even 30% better performance; but you can probably do the same or better right in your code by improving your base algorithms.

It is always good to remember the first and second rules of code optimization, however:

     #1 - Write plain and understandable code first.
     #2 - Don’t hand optimize anything until you have profiled the code and proven that the
          optimization will help.

Guessing at what to optimize is almost never right, and it ends up wasting a lot of time for a very marginal performance increase and a highly likely increase in your code’s maintenance costs.

There are all sorts of interesting articles out there on optimization with GCC, and I have found that unless they show actual results they are mostly someone’s opinion on how things work, or on how things worked 15 years ago but don’t work today. For instance, I have found articles that state emphatically that ‘switch’ statements produce far faster code than ‘if’ statements, and I have found articles that say emphatically just the opposite. Naturally, neither kind of article shows any actual results, so I have learned to beware and check for myself. Which takes me back to rule #2 of optimizing: don’t do it until you have proven that you need to. And yes, I fight this temptation myself all the time!
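Checking for yourself is straightforward with any GCC-based compiler: write a small test case, compile it with -S at your release optimization level, and read the generated assembly (see the note in Appendix B). A sketch of such a test:

    /* Compile with something like:
         xc32-gcc -mprocessor=<your part> -O1 -S test.c
       then inspect test.s to see what the compiler actually emitted
       for each style. */
    int with_switch(int x)
    {
        switch (x) {
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        default: return 0;
        }
    }

    int with_ifs(int x)
    {
        if (x == 0) return 10;
        if (x == 1) return 20;
        if (x == 2) return 30;
        return 0;
    }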

And finally, yes, finally… if you don’t have the XC32 paid version, don’t fret too much. In most applications the paid version will only provide a marginal gain in performance and size over the free version’s -O1 optimization. But, as always: "Your mileage may vary".

Appendix A – General Optimization Settings – Best Practices with XC32

Best for Step-by-Step Debugging

Set the optimization level to -O0 and set all other options to their default (‘reset’) condition. This will compile the code almost exactly as you have written it, which allows an easy line-by-line program trace during debugging. This is very helpful in tracing and verifying the program’s logic.
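If rebuilding the whole project at -O0 is undesirable, the GCC 4.8 core that XC32 2.x is built on also supports the standard GCC ‘optimize’ function attribute, which drops the optimization level for just the function you are trying to step through. A sketch (the function itself is a made-up example):

    #include <stdint.h>

    /* Hypothetical function: only it is compiled at -O0 for easy
       single-stepping, while the rest of the file keeps the
       project-wide optimization setting. */
    uint32_t __attribute__((optimize("O0"))) decode_header(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 8) | p[1];
    }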

   

Best settings for Speed or Size

Use these settings for maximum ‘easy’ speed with the XC32 free compiler; they will get you to around 80 to 90% of the maximum speed possible. Any improvement over this will need to be accompanied by a very careful review of all the compiler options and their effects, actual timed profiles of the code, and possibly specific module optimizations.

Project Properties → xc32-gcc→ General



Project Properties → xc32-gcc→ Optimization
 
 

To Optimize a Specific Source File Individually

Right click on the source file and select ‘Properties’.
 
  



Then select “Override Build Options”, and from there you can set the optimization level of the specific file separately from the main application settings.

  

Then select xc32-gcc → Optimization and set the options as detailed above.
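As a source-level alternative to the GUI route, the underlying GCC also accepts a file-scope pragma. This is standard GCC behavior rather than anything XC32-specific, so treat it as a sketch:

    /* Everything following this pragma in the file is compiled at -O3,
       regardless of the project-wide optimization setting. */
    #pragma GCC optimize ("O3")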

Appendix B – The optimizations applied by XC32 versions 2.05/2.10

Optimizations vs. -Ox level, with all other settings at their defaults.

Note: Use “-S -fverbose-asm” to list every silently applied option (including the optimization ones) in the assembler output.
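For example (the processor designation here is just a placeholder; substitute your own part number):

    xc32-gcc -mprocessor=32MZ2048EFH144 -O1 -S -fverbose-asm main.c

This produces main.s containing an options listing like the ones below.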

# -g -O0 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcommon -fdebug-types-section
 # -fdelete-null-pointer-checks -fearly-inlining
 # -feliminate-unused-debug-types -ffunction-cse -ffunction-sections
 # -fgcse-lm -fgnu-runtime -fident -finline-atomics -fira-hoist-pressure
 # -fira-share-save-slots -fira-share-spill-slots -fivopts
 # -fkeep-static-consts -fleading-underscore -fmath-errno
 # -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return
 # -fpeephole -fprefetch-loop-arrays -fsched-critical-path-heuristic
 # -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
 # -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
 # -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
 # -fsigned-zeros -fsplit-ivs-in-unroller -fstrict-volatile-bitfields
 # -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-coalesce-vars
 # -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im
 # -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 # -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop
 # -ftree-slp-vectorize -ftree-vect-loop-version -funit-at-a-time
 # -fverbose-asm -fzero-initialized-in-bss -mbranch-likely
 # -mcheck-zero-division -mdivide-traps -mdouble-float -mdsp -mdspr2 -mel
 # -membedded-data -mexplicit-relocs -mextern-sdata -mfp64 -mfused-madd
 # -mgp32 -mgpopt -mhard-float -mimadd -mlocal-sdata -mlong32 -mno-mdmx
 # -mno-mips16 -mno-mips3d -mshared -msplit-addresses


# -g -O1 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcombine-stack-adjustments -fcommon -fcompare-elim
 # -fcprop-registers -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fearly-inlining
 # -feliminate-unused-debug-types -fforward-propagate -ffunction-cse
 # -ffunction-sections -fgcse-lm -fgnu-runtime -fguess-branch-probability
 # -fident -fif-conversion -fif-conversion2 -finline -finline-atomics
 # -finline-functions-called-once -fipa-profile -fipa-pure-const
 # -fipa-reference -fira-hoist-pressure -fira-share-save-slots
 # -fira-share-spill-slots -fivopts -fkeep-static-consts
 # -fleading-underscore -fmath-errno -fmerge-constants
 # -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 # -fpcc-struct-return -fpeephole -fprefetch-loop-arrays
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros
 # -fsplit-ivs-in-unroller -fsplit-wide-types -fstrict-volatile-bitfields
 # -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 # -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop
 # -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts
 # -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert
 # -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
 # -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc
 # -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr
 # -ftree-sra -ftree-ter -ftree-vect-loop-version -funit-at-a-time
 # -fvar-tracking -fvar-tracking-assignments -fverbose-asm
 # -fzero-initialized-in-bss -mbranch-likely -mcheck-zero-division
 # -mdivide-traps -mdouble-float -mdsp -mdspr2 -mel -membedded-data
 # -mexplicit-relocs -mextern-sdata -mfp64 -mfused-madd -mgp32 -mgpopt
 # -mhard-float -mimadd -mlocal-sdata -mlong32 -mno-mdmx -mno-mips16
 # -mno-mips3d -mshared -msplit-addresses


# -g -Os -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 # -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 # -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fdevirtualize -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -fforward-propagate -ffunction-cse -ffunction-sections -fgcse -fgcse-lm
 # -fgnu-runtime -fguess-branch-probability -fhoist-adjacent-loads -fident
 # -fif-conversion -fif-conversion2 -findirect-inlining -finline
 # -finline-atomics -finline-functions -finline-functions-called-once
 # -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const
 # -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots
 # -fira-share-spill-slots -fivopts -fkeep-static-consts
 # -fleading-underscore -fmath-errno -fmerge-constants
 # -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 # -foptimize-register-move -foptimize-sibling-calls -fpartial-inlining
 # -fpcc-struct-return -fpeephole -fpeephole2 -fprefetch-loop-arrays
 # -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap
 # -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types
 # -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
 # -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math
 # -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch
 # -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim
 # -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
 # -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre
 # -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
 # -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion
 # -ftree-tail-merge -ftree-ter -ftree-vect-loop-version -ftree-vrp
 # -funit-at-a-time -fuse-caller-save -fvar-tracking
 # -fvar-tracking-assignments -fverbose-asm -fzero-initialized-in-bss
 # -mbranch-likely -mcheck-zero-division -mdivide-traps -mdouble-float
 # -mdsp -mdspr2 -mel -membedded-data -mexplicit-relocs -mextern-sdata
 # -mfp64 -mfused-madd -mgp32 -mgpopt -mhard-float -mimadd -mlocal-sdata
 # -mlong32 -mmemcpy -mno-mdmx -mno-mips16 -mno-mips3d -mshared
 # -msplit-addresses


# -g -O2 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 # -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 # -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fdevirtualize -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -fforward-propagate -ffunction-cse -ffunction-sections -fgcse -fgcse-lm
 # -fgnu-runtime -fguess-branch-probability -fhoist-adjacent-loads -fident
 # -fif-conversion -fif-conversion2 -findirect-inlining -finline
 # -finline-atomics -finline-functions-called-once -finline-small-functions
 # -fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra
 # -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots
 # -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
 # -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
 # -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls
 # -foptimize-strlen -fpartial-inlining -fpcc-struct-return -fpeephole
 # -fpeephole2 -fprefetch-loop-arrays -fregmove -freorder-blocks
 # -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns -fschedule-insns2
 # -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
 # -fsplit-wide-types -fstrict-aliasing -fstrict-overflow
 # -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps
 # -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 # -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
 # -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
 # -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
 # -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre
 # -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
 # -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion
 # -ftree-tail-merge -ftree-ter -ftree-vect-loop-version -ftree-vrp
 # -funit-at-a-time -fuse-caller-save -fvar-tracking
 # -fvar-tracking-assignments -fverbose-asm -fzero-initialized-in-bss
 # -mbranch-likely -mcheck-zero-division -mdivide-traps -mdouble-float
 # -mdsp -mdspr2 -mel -membedded-data -mexplicit-relocs -mextern-sdata
 # -mfp64 -mfused-madd -mgp32 -mgpopt -mhard-float -mimadd -mlocal-sdata
 # -mlong32 -mno-mdmx -mno-mips16 -mno-mips3d -mshared -msplit-addresses


# -g -O3 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 # -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 # -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fdevirtualize -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -fforward-propagate -ffunction-cse -ffunction-sections -fgcse
 # -fgcse-after-reload -fgcse-lm -fgnu-runtime -fguess-branch-probability
 # -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2
 # -findirect-inlining -finline -finline-atomics -finline-functions
 # -finline-functions-called-once -finline-small-functions -fipa-cp
 # -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra
 # -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots
 # -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
 # -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
 # -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls
 # -foptimize-strlen -fpartial-inlining -fpcc-struct-return -fpeephole
 # -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -fregmove
 # -freorder-blocks -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns -fschedule-insns2
 # -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
 # -fsplit-wide-types -fstrict-aliasing -fstrict-overflow
 # -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps
 # -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 # -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
 # -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
 # -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im
 # -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 # -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc
 # -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr
 # -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter
 # -ftree-vect-loop-version -ftree-vectorize -ftree-vrp -funit-at-a-time
 # -funswitch-loops -fuse-caller-save -fvar-tracking
 # -fvar-tracking-assignments -fvect-cost-model -fverbose-asm
 # -fzero-initialized-in-bss -mbranch-likely -mcheck-zero-division
 # -mdivide-traps -mdouble-float -mdsp -mdspr2 -mel -membedded-data
 # -mexplicit-relocs -mextern-sdata -mfp64 -mfused-madd -mgp32 -mgpopt
 # -mhard-float -mimadd -mlocal-sdata -mlong32 -mno-mdmx -mno-mips16
 # -mno-mips3d -mshared -msplit-addresses

References

[1] Some of the original BYTE magazine articles dealing with the Sieve of Eratosthenes

A High-Level Language Benchmark by Jim Gilbreath
BYTE Sep 1981, p.180
https://archive.org/details/byte-magazine-1981-09/page/n181

Eratosthenes Revisited: Once More through the Sieve by Jim Gilbreath and Gary Gilbreath
BYTE Jan 1983, p.283
https://archive.org/details/byte-magazine-1983-01/page/n291

Benchmarking UNIX systems by David Hinnant
BYTE Aug 1984, p.132
https://archive.org/details/byte-magazine-1984-08/page/n137

[2] CoreMark Benchmark
www.eembc.org/coremark/

[3] Microchip Technology, example application. Located in the Harmony install directory at,
    .../apps/audio/real_time_fft

[4] Microchip Technology, “How to get the least out of your PIC32 C compiler”,
https://www.microchip.com/mymicrochip/filehandler.aspx?ddocname=en557154


Article By: Steve Hageman / www.AnalogHome.com

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

Note: This Blog does not use cookies (other than the edible ones).