Monday, June 1, 2020

Is a ‘Heroic’ Low Voltage Noise Amplifier Always Desirable?

(Random internet screen grab)

I recently ran across an "Ultra Low Noise Moving Magnet Phono Preamp" that was built with four paralleled LT1028 low noise OPAMPs, and immediately thought: "That will make the noise worse." I have built amplifiers like this before, and I knew that to achieve a low total system noise the source resistance would have to be extremely low, well below 1 ohm, which a Magnetic Phono Cartridge is not *.

Note: We are going to do a 'rough cut' analysis here, looking only at the main first order effects. For instance, we will ignore for the time being the resistance at the inverting input of the OPAMP(s); it adds noise, but is low enough to pale in comparison to the Phono Cartridge resistance. We will also ignore for the time being that the Phono Cartridge impedance is not the same at DC and 1 kHz. Finally, we will not include the effects of the RIAA filter that Phono Preamps always have. For a complete detailed analysis see the section: "Extra Bonus 2" below.

Parallel OPAMP’s For Lower Noise

You have probably seen this circuit before; it is even shown in the LT1028 Data Sheet [1]. It works on the principle that if you parallel uncorrelated noise sources, say four LT1028 OPAMPs, the combined RMS voltage noise will be reduced by Sqrt(N), where N is the number of devices paralleled. This has been used successfully with OPAMPs and Voltage References in ultra low noise circuits for decades. Figure 1 shows the basic configuration.

 

Figure 1 – Simplified schematic of the basic proposed "Ultra Low Noise Phono Preamp" using four LT1028 OPAMPs in parallel to get the lowest possible input voltage noise. But is this really the lowest noise in the actual application circuit?

What most people forget is that in the case of an OPAMP, the input current noise increases by the same Sqrt(N). We will see how this fits together a bit later on.

Source Resistance*

In the case of a Moving Magnet Phono Cartridge, the source resistance is the cartridge itself. While cartridges vary between manufacturers, they do have some common characteristics [2].

The stylus is attached to a magnet that moves as the stylus tracks the record grooves. This magnet moves next to a stationary coil, and the moving magnetic field induces a voltage in the coil; this voltage, when amplified, produces the sound that we hear. Figure 2 shows a first order model for a typical Moving Magnet Phono Cartridge and its specified load of 250pF || 47 kOhms for a flat frequency response.

 

Figure 2 – Simplified model of a Moving Magnet Phono Cartridge including the connecting Coax Cable and the Specified Load: 250pF || 47 kOhms. 100pF of the load capacitance is usually included in the Preamp.

These coils are wound with many turns of very fine magnet wire, as the inductance needs to be large to produce a usable voltage when playing a record. This long length of fine magnet wire has quite a large resistance; in fact it can be upwards of 1550 Ohms, as Table 1 shows.


   

Table 1 – Two representative Shure Moving Magnet Phono Cartridge types. These seem to capture the typical spread in values of commercial Phono Cartridges.

As can be seen, there is quite a large DC resistance in this type of design. Using the lower number of 630 Ohms from Table 1 for the rest of our examples, this resistance has a noise voltage all its own, which can be found from the familiar resistor thermal noise equation,

Vnoise_rms = Sqrt(4 * Kb * T * R) Resistor noise in a 1 Hz Bandwidth [3].

At room temperature this equation simplifies to,

nVnoise_rms = 0.13 * Sqrt(R) Resistor noise at 27 Deg C (equation 1)

The units here are "Nanovolts per Root Hertz", where the Nanovolts are RMS (Root Mean Square). We will write these units as nV/rt-Hz for the remainder of this article.

The "Root Hertz" simply reminds us that the values are normalized to a 1 Hz bandwidth, which makes the subsequent math easy. If we want to know the total integrated noise in, say, a 20 kHz bandwidth, we just multiply the value by Sqrt(20,000), or 141; the rt-Hz terms cancel and we are left with a value in Nanovolts RMS.
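As a quick sanity check, Equation 1 and the bandwidth integration can be coded up in a few lines of C. This is just a sketch; the function names are mine, and the Boltzmann constant is written out rather than using the 0.13 approximation,

    #include <stdio.h>
    #include <math.h>

    /* Thermal noise density of a resistor in V/rt-Hz: Sqrt(4 * Kb * T * R),
       evaluated at T = 300 K (27 Deg C). */
    static double resistor_noise_density(double r_ohms)
    {
        const double KB = 1.380649e-23;   /* Boltzmann constant, J/K */
        const double T  = 300.0;          /* 27 Deg C in Kelvin */
        return sqrt(4.0 * KB * T * r_ohms);
    }

    /* Total integrated RMS noise over a flat bandwidth:
       density (V/rt-Hz) times Sqrt(bandwidth in Hz). */
    static double integrated_noise(double density, double bw_hz)
    {
        return density * sqrt(bw_hz);
    }

    int main(void)
    {
        double d = resistor_noise_density(630.0);
        printf("630 Ohm density: %.2f nV/rt-Hz\n", d * 1e9);   /* ~3.2 */
        printf("In a 20 kHz BW:  %.2f uV RMS\n",
               integrated_noise(d, 20e3) * 1e6);               /* ~0.46 */
        return 0;
    }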

Applying Equation 1 to the Phono Cartridge DC resistances, we can see that the source itself, at low frequencies, produces a thermal noise voltage of,

0.13 * Sqrt(630) = 3.2 nV/rt-Hz
0.13 * Sqrt(1550) = 5.1 nV/rt-Hz

We just found out that the cartridge alone has a thermal noise of 3.2 to 5.1 nV/rt-Hz for these two representative cartridges.

Looking at the data sheet, we can also see that the LT1028 has a typical noise voltage of just 0.85 nV/rt-Hz at midband (1 kHz).

Disconnect 1 -

Armed with only the equivalent circuit of the Phono Cartridge and a single multiplication (Equation 1), we can see that the circuit noise floor is going to be set by the DC resistance of the Phono Cartridge, not by the LT1028 amplifier.

This is why the "First Audio OPAMP", the NE5534A produced by Philips / Signetics in the mid 1970's, was so popular: it was designed to have an input noise equivalent to what the application circuit demanded. The NE5534A had an input voltage noise of typically 3.5 nV/rt-Hz; it was designed that way for a purpose, as it matches the typical noise of a Magnetic Phono Cartridge.

The first disconnect is: Going to 'Heroic' lengths to lower the input amplifier's noise in this case is not going to improve the entire system's noise performance, because the noise floor is set by the sensor itself, and a single LT1028 is already more than 3 times quieter than probably the best Phono Cartridge.

Disconnect 2 -

One might ask: "Well, what does it matter if we parallel 4 preamps? It doesn't make the noise worse, does it?" Let's see...

Remember when we discussed what paralleling OPAMPs really does? It reduces the voltage noise by Sqrt(N) BUT it increases the current noise by the same Sqrt(N). In a normal OPAMP circuit, at low source resistances the voltage noise will dominate, and at high source resistances the current noise will dominate. In between these extremes there is an interaction with the source resistance itself.

An easy calculation to make is to divide the voltage noise by the current noise at a given frequency to come up with an equivalent noise resistance for the OPAMP. This is sometimes called Ropt, and it is the source resistance at which the OPAMP's voltage noise and current noise contributions are equal, producing a combined value that is 3 dB (or 1.41 times) higher than each one separately.

From the LT1028 data sheet, the noise values at 1 kHz are,

Vn = 0.85 nV/rt-Hz
In = 1.0 pA/rt-Hz

Hence Ropt at 1 kHz is found to be,

Ropt = 0.85e-9 / 1.0e-12 = 850 Ohms

We can say that for the LT1028 at 1 kHz:

A source resistance of << 850 Ohms and you will be limited by the voltage noise of the OPAMP
A source resistance of >> 850 Ohms and you will be limited by the current noise of the OPAMP

If we parallel 4 x LT1028’s we get the following Ropt,

Ropt_4x = (0.85e-9 / Sqrt(4)) / (1.0e-12 * Sqrt(4)) = 213 Ohms

Ropt has been lowered by N times from the value of a single amplifier; in this example N equals 4, and the new Ropt is 850/4 = 213 Ohms.

In this particular case of four paralleled LT1028s, you can see that the current noise will always dominate, since the lowest Phono Cartridge resistance that I found was 630 Ohms.

Ropt is a quick and useful calculation to see where your OPAMP selection stands in relation to the source resistance.
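The whole Ropt bookkeeping fits in a couple of lines of C. A small sketch (the function name is mine) using the LT1028's 1 kHz typicals from above,

    #include <stdio.h>
    #include <math.h>

    /* Ropt for N paralleled OPAMPs: the voltage noise drops by Sqrt(N)
       while the current noise rises by Sqrt(N), so Ropt drops by N. */
    static double ropt(double vn, double in, int n)
    {
        return (vn / sqrt((double)n)) / (in * sqrt((double)n));
    }

    int main(void)
    {
        printf("1 x LT1028 Ropt: %.0f Ohms\n", ropt(0.85e-9, 1.0e-12, 1));  /* ~850 */
        printf("4 x LT1028 Ropt: %.0f Ohms\n", ropt(0.85e-9, 1.0e-12, 4));  /* ~213 */
        return 0;
    }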

The disconnect #2 is: “Just using more amplifiers will not lead to an improvement in total system voltage noise if Ropt is above the sensor resistance.”

A Closer Look

A closer examination of the system and all its various noise sources shows why,

Note: When adding voltage noise terms together, we use the Root Sum Square method (RSS) or,
Result = Sqrt(Val1*Val1 + Val2*Val2)

 

Table 2 – The voltage noises add in RSS fashion; the current noise is multiplied by the source resistance to get its equivalent voltage noise effect. In the 1x preamp case: 1pA/rt-Hz * 630 = 0.63 nV/rt-Hz. The situation is even worse if we use the 1550 Ohm Phono Cartridge in the bottom example and the 4x configuration; then the total system noise would be a whopping 8.2 nV/rt-Hz.
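The arithmetic behind Table 2 is easy to reproduce. Here is a sketch of the RSS bookkeeping for the single amplifier row (the function name is mine; spot values at 1 kHz are from the text). Per the Sqrt(N) rule above, the 4x case uses en_amp divided by Sqrt(4) and in_amp multiplied by Sqrt(4),

    #include <stdio.h>
    #include <math.h>

    /* Total input referred noise density: RSS of the source resistance
       thermal noise, the OPAMP voltage noise, and the OPAMP current
       noise flowing through the source resistance. */
    static double total_noise(double en_src, double en_amp,
                              double in_amp, double r_src)
    {
        double en_ir = in_amp * r_src;   /* current noise expressed as a voltage */
        return sqrt(en_src * en_src + en_amp * en_amp + en_ir * en_ir);
    }

    int main(void)
    {
        /* 1 x LT1028 with the 630 Ohm cartridge: 1 pA/rt-Hz * 630 = 0.63 nV/rt-Hz */
        double t = total_noise(3.2e-9, 0.85e-9, 1.0e-12, 630.0);
        printf("Total: %.2f nV/rt-Hz\n", t * 1e9);   /* ~3.4 */
        return 0;
    }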


As can be seen in Table 2, while the voltage noise of 4 x LT1028s does indeed drop the Preamp's voltage noise by 50%, the additional current noise increases the noise developed across the sensor's resistance by 4 times, and the total system noise ends up 23% higher with 4 paralleled amplifiers. And this is for the low source resistance Phono Cartridge; the result gets even worse for the 1550 Ohm version.

We could have predicted this by taking a quick look at Ropt at the design stage. Since the Ropt of a single LT1028 is some 850 Ohms, smack in the middle of the range of our expected Phono Cartridges, we know that paralleling more of these amplifiers will not help in reducing the total voltage noise in the circuit, as the current noise will take over. Really, a single LT1028 is about as close to optimum as this design can get.

 

Figure 3 – Total system integrated noise, referred to the input, over a 10 Hz to 20 kHz bandwidth, as simulated for a 1x and 4x LT1028 Preamp when measuring the 630 Ohm Phono Cartridge and the specified load of Figure 2. As can be seen, the single LT1028 produces lower total integrated system noise (upper plot). Using four paralleled LT1028s in this example made the total integrated system noise worse (lower plot). This simulation includes the frequency effects of the Phono Cartridge source resistance.


Total integrated noise at 20 kHz – 1 x OPAMP - Upper Trace = 3.6 uV RMS
Total integrated noise at 20 kHz – 4 x OPAMPs - Lower Trace = 4.2 uV RMS

Bottom Line – Four amplifiers actually have 17% worse total system noise in a 20 kHz bandwidth, even for the lowest resistance Phono Cartridge. The situation is even worse if the 1550 Ohm cartridge is considered.

Side Note #1: These bipolar based low noise OPAMPs almost always have a higher 1/f corner frequency for the current noise than for the voltage noise. For the LT1028 the 1/f corner of the voltage noise is approximately 3.5 Hz, well below the start of the audio band, while the 1/f corner of the current noise is around 850 Hz. This means that the low frequency noise will be rising within the audio band if the OPAMP is operating where the current noise times the source resistance is the dominant noise source of the circuit.


Side Note #2: On bipolar OPAMPs with input bias current compensation, a large portion of the bias current noise can be due to the compensation current. This compensation current is generated by one transistor inside the OPAMP and then split to the two OPAMP inputs, so some of the input current noise is correlated between the two inputs [4]. This means that at higher source resistances, the total system noise may be lower in circuits that use balanced source resistances. The amount of correlation is never listed explicitly on data sheets, and only Linear Technology regularly puts this information in the performance curves of their data sheets; for all other OPAMPs in the world, you will just have to measure it for yourself. For the LT1028 the amount of current noise correlation has proven to be around 25%.

Conclusion -

Don't forget that total system noise depends not only on the voltage noise of the Source Resistance and of the OPAMP itself, but also on the contribution of the OPAMP's Current Noise times the Source Resistance.

Make sure that you know what the Source Resistance of the thing you are measuring is before picking and trying to optimize the OPAMP Preamp.

Use the very simple to calculate value "Ropt" to see where your prospective OPAMP fits in relation to the system source resistance. For most typical circuits where you are trying to minimize total voltage noise, you want Ropt to be 2 to 10 times higher than the system source resistance, if possible.

For this example Phono Preamp with its 630 to 1550 Ohm source resistance, a single LT1028, with its Ropt of 850 Ohms, is not the best choice, and you will pay a premium price for its ultra low noise performance. The NE5534A, with its Ropt of 8700 Ohms @ 1 kHz, is a perfectly reasonable and lower cost choice.

If you pick an OPAMP whose Ropt is exactly equal to the source resistance, the OPAMP's voltage and current noise contributions will be equal, and their combined value will be 3 dB (1.414 times, or about 41% in linear terms) higher than either one alone.

Bonus Curve -

I presented a single value for Ropt, the one we read off the data sheet from the given values at 1 kHz. However, Ropt is not a single value; it varies with frequency just as the Voltage Noise and Current Noise do.

A complete Ropt curve versus frequency for the LT1028 is shown in Figure A. This curve is derived from the typical voltage noise and current noise figures in the LT1028 data sheet. A curve like this is useful to have for all your low noise OPAMPs, as it is a handy reminder of how Ropt changes with frequency for a particular OPAMP. Many circuits do not operate at a single frequency; some may operate exclusively at extremely low frequencies or at high frequencies. With a plot of Ropt versus frequency you can make even better informed choices about your amplifier's interaction with your source resistance.

 

Figure A – Ropt is not a single number, but varies with frequency as the Voltage and Current Noise of the OPAMP vary. Here we can see that the midband value of 825 Ohms drops to as low as 200 Ohms at 10 Hz and rises to a high of 925 Ohms at 20 kHz. It is good to keep this in mind when designing circuits with very low or very high bandwidth centers, because not every circuit operates at 1 kHz, which is where the typical voltage and current noise values are given. I make these curves with the help of a template that I made for the LibreOffice Spreadsheet program [5].
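If you would rather generate an Ropt versus frequency curve in code instead of a spreadsheet, a crude first order 1/f model will get you in the ballpark. The Sqrt(1 + fc/f) shape below is my assumption, not a data sheet formula; the corner frequencies are the LT1028 values from Side Note #1,

    #include <stdio.h>
    #include <math.h>

    /* Crude 1/f model: flat (midband) density times Sqrt(1 + fc/f),
       where fc is the 1/f corner frequency. Assumed shape. */
    static double noise_1f(double flat, double fc, double f)
    {
        return flat * sqrt(1.0 + fc / f);
    }

    int main(void)
    {
        for (double f = 10.0; f <= 20000.0; f *= 10.0) {
            double en = noise_1f(0.85e-9, 3.5, f);     /* voltage noise, 3.5 Hz corner */
            double in = noise_1f(1.0e-12, 850.0, f);   /* current noise, 850 Hz corner */
            printf("f = %8.0f Hz   Ropt = %5.0f Ohms\n", f, en / in);
        }
        return 0;
    }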

Footnote:
* The term 'Source Resistance' as used here means the real part of the source impedance, a number that usually varies with frequency.

Extra Bonus 2 - 

Art Kay of Texas Instruments wrote a really excellent book on all things OPAMP noise; it even details how to simulate noise with Spice. Art has kindly posted a zip file with all the articles at the link below,
https://e2e.ti.com/support/amplifiers/f/14/t/435427?App-Notes-for-Op-Amp-Noise-by-Art-Kay


References:

[1] LT1028 Data Sheet www.analog.com

[2] Reference Shure Phono Cartridge Data Sheets:
M97xE User Guide, 2008
M91G Data Sheet, 1970


[4] Solomon, James, "The monolithic op amp: A tutorial study", IEEE Journal of Solid-State Circuits, vol. 9, pp. 314-332, December 1974. (Also available as an application note at ti.com)

[5] Open Source Office Software – www.LibreOffice.org


Article By: Steve Hageman / www.AnalogHome.com

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

Note: This Blog does not use cookies (other than the edible ones).



Thursday, March 19, 2020

How accurate is that GPS anyway?



There is always a lot of uncertainty when dealing with a new GPS system, that goes well beyond the immediate needs of powering up, interfacing to, and communicating with the hardware, such as:

     * How much 'Positional Noise' is there in the readings?
     * How accurate is the Altitude reading?
     * Can I correlate the HDOP (Horizontal dilution of precision) value to positional inaccuracy?
     * Can I correlate the Number of Satellites seen to positional inaccuracy?

Well, the Internet has many 'opinions' but few real answers or data... A good overview of the GPS signal accuracy as transmitted can be found in reference [1], but this does not easily relate to what the actual user experience on the receiving end will be. I will present my basic measurements here to see if we can answer the above questions.

I recently started a small GPS Tracking project and the GPS selected was the MediaTek MTK3339 [2], because this all-in-one GPS is fully integrated with a built in antenna, has good 'Data Sheet' performance, and Adafruit (as always) provides some really useful breakout boards, which made early prototyping and testing easy [3].

After cobbling together some prototype software to read, store and display positions and calculate distances, data was acquired for several long periods to answer the above questions.

Procedure:

I sat the GPS prototype in a location inside my Lab, which is a wood frame building that is pretty transparent to GPS signals – the GPS never sees fewer than 5 satellites at any time – and recorded around 9 hours of data over several days. I plotted this data on a map using GPS Visualizer [4] to see where the positional 'tracks' were, as shown in Figure 1.

 

Figure 1 – Using the Web Based Plotting Page "GPS Visualizer" I plotted 9 Hours of positional data with the GPS absolutely stationary. As you can see there is some positional ‘noise’ or ‘meandering’ in the readings. At the very bottom of this figure is a distance scale.

While the true position on the Earth down to a foot is difficult to ascertain unless you are a surveyor, I averaged all the data in Figure 1 and came up with an average position. On subsequent data runs I compared the new data to the previous average positions and the results were almost exactly the same; that is, the data overlays itself to within 0.5 feet. This suggests that if there are any systematic measurement offsets or errors, they are static day to day. The data certainly looks to be within a foot to the detail that I can see on Google Maps (assuming that Google Maps is accurate to a foot).

For my intended application a 50 foot accuracy is more than enough. The real worry is that since my tracking application is also recording total distance traveled, I need to know what the GPS positional noise is so that I can set a limit on how much movement is required before I log it. This prevents me from recording or integrating the noise into a false distance.

With the foregoing in mind, let's look at 9 hours of data recorded with the GPS receiver stationary in my Lab. The 'Delta Distance' is the instantaneous distance from any one reading to the average of all 9 hours of data acquired. If we assume that the long term mean is correct, which it looks to be, then this is a measure of the instantaneous noise. The GPS receiver is operated at a 1 sample per second rate and every fifteenth sample is recorded.
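For anyone reproducing these plots, the 'Delta Distance' between a fix and the long term average position can be computed with the standard haversine great circle formula. A sketch in C (the function and constant names are mine); it takes decimal degrees and returns feet,

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define EARTH_RADIUS_FT 20902231.0   /* Mean Earth radius in feet */

    /* Haversine great circle distance between two lat/lon points. */
    static double delta_distance_ft(double lat1, double lon1,
                                    double lat2, double lon2)
    {
        const double D2R = M_PI / 180.0;   /* Degrees to radians */
        double dlat = (lat2 - lat1) * D2R;
        double dlon = (lon2 - lon1) * D2R;
        double a = sin(dlat / 2.0) * sin(dlat / 2.0) +
                   cos(lat1 * D2R) * cos(lat2 * D2R) *
                   sin(dlon / 2.0) * sin(dlon / 2.0);
        return EARTH_RADIUS_FT * 2.0 * atan2(sqrt(a), sqrt(1.0 - a));
    }

Feeding each reading and the 9 hour average position through a function like this gives the instantaneous error plotted in Figures 2a and 2b.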



Figure 2a – Using the average of all the data as a reference point, the delta distance of each reading is plotted. The peak distance from the average can be quite large as can be seen. The Red Line is a one hour moving average which smooths the peak error by almost 3:1.




Figure 2b – Same setup as Figure 2a, but with the data captured on a different day.

More statistics for Figures 2a and 2b: The peak error was 54.1 feet for Figure 2a and 68.1 feet for Figure 2b; the RMS error of all points was 16.3 and 16.5 feet respectively. This 16 feet RMS error compares well with what reference [1] states as a typical expected user accuracy.

Now the question becomes: Can I use some other measure to get an idea of the probable error in any one particular measurement? All GPS units that I have seen supply an HDOP calculation [5], and you can also get the number of satellites that the GPS is currently basing its calculations on.



Figure 3 – A scatter plot to see if there is any correlation between delta distance and the number of satellites used in the calculations, as recorded by the GPS. As can be seen, there is no correlation here at all.



Figure 4 – A scatter plot to see if there is any correlation between delta distance and the HDOP as recorded by the GPS. As can be seen, there is no correlation here at all either.

 Conclusion:

Errors happen in every measuring instrument or system. With this GPS you can be reasonably certain that your maximum positional error will be less than 70 feet, as was verified by me over several 9 hour stints of data recording. Testing for long periods of time and on several different days gave plenty of time for the satellites to be in all sorts of positions, both good and bad for measurements.

As for using other data supplied by the GPS to get an 'instant' sense of how much any one particular reading is in error: at least for this particular receiver, neither the number of satellites nor the HDOP correlates to the instantaneous positional error that I can see. The bottom line is: You can't be sure of the errors until you acquire enough data (at least an hour's worth) in any stationary spot to tell what the errors likely are [6].

Hopefully this actual measured data gives some people a better basis for making informed decisions about the kind of 'typical' performance they can expect from a low cost GPS receiver in actual use.

Bonus Data:

No analysis would be complete without showing the histogram of the data. Beyond a plot of the data by itself, a histogram provides a view into the nature of the data distribution that is not always clear from looking at the raw data. With that in mind, Figure 5 plots the histogram of Figure 2a's data.

Figure 5 – A histogram of Figure 2a's data shows a familiar kind of skewed statistical distribution, resembling a Poisson distribution. At the very least we can see that the smoothed data (red line) follows some recognizable distribution and is not uniformly random.

As can be seen above, there is a nice 'classic' distribution that resembles a Weibull distribution. Other statistical data for Figure 2a's data set is given below (all units in feet),




Extra Bonus Data:

It is well known that the altitude data from a GPS is even less accurate; this has to do with the geometry of the satellites and the calculations involved [1]. From topographic maps I believe my Lab's true elevation to be approximately 141 feet (after accounting for the added height of the workbench, where the GPS antenna was, above the ground elevation of my Lab). 9 hours' worth of GPS altitude data is shown in Figure 6.



Figure 6 – Altitude data for 9 hours' worth of GPS data. The test location was at approximately 141 feet (from topographic maps). The average of this data was 136.4 feet, suggesting a 5 foot offset in the long term data; however, this could be within the uncertainty of the topographic data that I used. The peak to peak deviation of the data over 9 hours was 84.63 feet. Just interesting to see some real, actual data.

References:

[1] https://www.gps.gov/systems/gps/performance/accuracy/#how-accurate

[2] MediaTek Labs model: MTK3339

[3] Adafruit Ultimate GPS breakout board. PRODUCT ID: 746 . 

[4] GPS Visualizer web page for plotting GPS Data.

[5] HDOP discussion: https://en.wikipedia.org/wiki/Dilution_of_precision_(navigation)

[6] There are offline programs that can look at the satellite geometry that you are seeing and give you a better idea of the maximum error you might see at that very moment, but this is beyond what simple consumer GPS modules typically provide. These programs are used by GPS surveyors to time, or at least try to time, their measurements to coincide with the maximum positional accuracy.


Article By: Steve Hageman / www.AnalogHome.com

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

Note: This Blog does not use cookies (other than the edible ones).


Wednesday, January 8, 2020

The ground isn’t flat, so our PCB designs shouldn't be either…

  

In the 'bad old days' we used to spend a lot of time measuring and eyeballing PCB designs to see if everything would fit as intended. Naturally this led to many errors and iterations the first time anything new was tried.

Since Altium led the way with native 3D design capability in their PCB design software some 10 years ago, it has become an indispensable part of how modern PCBs are designed.

 

Figure 1 – The classic PCB view. It is great for routing traces, but it's all flat; you really can't tell whether your footprints or parts placements are going to collide until the first hardware gets built, and by then it's too late.

No more endless hours making detailed measurements and transferring them from one tool to another, guessing whether things will fit together. Now, just press the '3' key in Altium to look at the parts and PCB in 3D mode and see instantly how it all fits, then press '2' to get back to the flat PCB view to route traces. This is especially valuable with connectors, which have been a real source of confusion and errors for decades. Starting with a 3D model of the connector, placing it on the PCB, and then placing the mating connector on it will catch most reversed connector errors before the first hardware gets made wrong.



Figure 2 – A 3D view of a small GPS Logger prototype. Now that’s instant parts placement verification. Press the ‘3’ key in Altium to get an instant 3D view of your PCB design.




Figure 3 – The prototype GPS logger – it all fit together perfectly thanks to Altium’s native 3D capability.

About the project: A client needed a quick GPS Logger prototype to gain valuable system experience with and to start writing code for. Using a few off the shelf modules from Adafruit (their Ultimate GPS Breakout and FeatherWing Display breakout), plus a small custom processor and memory board, a prototype was put together in a week, allowing the system to be tested and the firmware to be developed before the design was finalized. This rapid prototyping and instant feedback really speeds up the completion of projects and leads to fewer errors and problems down the road.

Just say no to PCB tools that don’t give you instant / native 3D capability!

Article By: Steve Hageman / www.AnalogHome.com

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

Note: This Blog does not use cookies (other than the edible ones).

Monday, September 30, 2019

Make Your Code More Assertive!




Most if not all 32 bit processors have a software instruction to fire a 'Breakpoint', and the MIPS core used by the Microchip PIC32 processors is no different.

This can be used to make a “Debug Only Breakpoint” that is useful as a standard C ‘assert()’ alternative.

Assert is a method of adding check code into your program that can be used to check assumptions about the state of variables or program status to flag problems or errors.

Using assertions can dramatically reduce programming errors, especially the errors that occur when libraries are being used [1][2].

Asserts in a PC environment are pretty easy to use, as there is normally a console window or disk file available to log any asserted problems to.

In a small embedded system neither of these things is typically available. However, when developing and testing code a programmer / debugger is normally attached to the system, and this can be used as the window into the system's operation.

What is needed is an assert macro that fires a software breakpoint and halts the program when the system is being debugged, and that ideally generates no code when the system is built for production. When using MPLAB-X and XC32, Microchip's preferred development IDE and compiler, this is easy to do.

When building a Debug image, MPLAB-X via XC32 defines the name "__DEBUG" (two leading underscores). This defined name can be used to make the assert macro work both ways: when debugging, the macro generates the check, and when not debugging, the macro generates no code, as shown below.
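A minimal sketch of such a macro, assuming the MIPS core's 'sdbbp' (software debug breakpoint) instruction is reachable through inline assembly,

    /* Debug only assert: fires a MIPS software breakpoint when the
       expression is false and __DEBUG is defined (MPLAB-X defines
       __DEBUG automatically when building a debug image). */
    #ifdef __DEBUG
        #define assert_dbg(exp)                      \
            do {                                     \
                if (!(exp)) {                        \
                    __asm__ volatile ("sdbbp 0");    \
                }                                    \
            } while (0)
    #else
        #define assert_dbg(exp) ((void)0)   /* No code in production builds */
    #endif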




A simple macro that mimics the standard C assert() call for any PIC32 based project. The macro generates a check that fires the MIPS software breakpoint if the project was built with '__DEBUG' defined. This name is automatically defined by MPLAB-X when a debugging image is built.

You can put code similar to the above into a C header file. Our new macro then works just like any standard C assert() macro,

   - assert_dbg(exp) where exp is non-zero does nothing (a non-assertion).

   - assert_dbg(exp) where exp evaluates to zero, stops the program on a software
     breakpoint if debugging mode is active in MPALB-X.

I chose to name the macro "assert_dbg()" so that the association to a standard assert would be easy to remember (it works the same way), and I added the '_dbg' suffix as a reminder to the programmer that this is: "A standard assert, but slightly different".

Naturally you can use any name you like.

Bonus Macro

While reading an article by Niall Murphy on the Barr Group website [3], I ran across a comment by 'ronkinoz' that shared another clever and useful debugging macro – I will call it 'assert_compile()' here.

What assert_compile() does is check things that can be checked at compile time; if the assert fails, the compiler halts with an error.

The macro works by defining a typedef'd array with a size of one char if the assert passes; if the assert fails, the array will be sized as '-1', which GCC will halt on as a compile error.

This can be very useful in checking the size of known instances against known system constraints for instance.

One thing that often causes problems is making an EEPROM structure, like a Cal Constant table, larger than its allocated space. This often happens when a project is revised and revised again to add new features, or when it is discovered during development that the calibration routines need to change.

You can catch things like a table getting too large and possibly overwriting other EEPROM allocations, because the size of fixed objects is known to the compiler at compile time.

My first use is to peg my code to a compiler version. That way, in a year or so, if I forget to read my notes and try to compile the code under a different compiler version, I will get a rude warning to take a careful look before continuing.

The macro is presented here,
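A sketch of it, using the usual negative array size trick (the CAT() helpers paste in __LINE__ so the macro can be used more than once per file; the example names and the use of XC32's __XC32_VERSION macro are illustrative assumptions),

    /* Compile time assert: if 'exp' is false, the typedef declares a
       char array of size -1, which GCC rejects with a compile error. */
    #define CAT_(a, b) a##b
    #define CAT(a, b)  CAT_(a, b)
    #define assert_compile(exp) \
        typedef char CAT(compile_assert_at_line_, __LINE__)[(exp) ? 1 : -1]

    /* Example uses (hypothetical names):
         assert_compile(sizeof(CalTable_t) <= EEPROM_CAL_ALLOCATION);
         assert_compile(__XC32_VERSION == 2050);    // Peg to XC32 v2.05 */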



References:

[1] https://www.microsoft.com/en-us/research/publication/assessing-the-relationship-between-software-assertions-and-code-qualityan-empirical-investigation/

[2] McConnell, Steve, “Code Complete, A Practical Handbook of Software Construction”, 2nd edition, page 189: “Assertions”. Microsoft Press, 2004.

[3]  https://barrgroup.com/comment/7#comment-7



Article By: Steve Hageman / www.AnalogHome.com

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

Note: This Blog does not use cookies (other than the edible ones).

Saturday, August 10, 2019

Microchip XC32 Compiler Optimization


It is often quipped that the humorist Mark Twain once said,

    “There are lies, darn lies and statistics”

If Mr. Twain was a programmer he might have said something to this effect,

    “There are lies, darn lies and benchmarks”

Benchmarking has a long history of being bent into something that produces: “a good number, that will have nothing to do with your actual situation”. This is especially true now that we have embedded processors with caches and all sorts of other performance enhancing and limiting factors.

In the 1980's there was one benchmark test that really took hold. It was called the "Sieve of Eratosthenes" and it was a numerical / prime number test that was used quite a bit to see which compilers produced the 'fastest' code [1].

This benchmark really just looked at one aspect of performance: calculating prime numbers. It didn't test graphics, IO speed, or a host of other factors that are also important in real world applications, so it was very narrow minded in its scope.

It gained such prominence that the compiler writers of the era actually wrote optimizers that would detect this code sequence and apply special techniques to get the fastest possible performance for this specific benchmark test. There was nothing wrong in doing this – it is kind of like producing cars with a fast 0-60 MPH acceleration figure: they can detect the 'pedal to the metal' and change the shift points to get the fastest 0-60 time, which is great for bragging rights, but is a rarely used sequence in real life driving. In my opinion, most modern microprocessor benchmarks are much the same.

What's the point then?

This ‘Optimization’ project started for two reasons,

1) I was working on a battery powered image processing application and I wanted to see what the effects of the various XC32 optimization levels were, to determine if I could save a significant amount of power by switching to a higher optimization level and hence lowering the CPU clock speed.

2) To see generally what the various paid ('Pro') XC32 optimization levels do compared to the free version. It has been noted online that some people feel they are being 'cheated' out of 'significant' performance improvements with the free version of the XC32 compiler, so we'll take a look at that.

Note: You may know that Microchip provides a free compiler for its PIC32 processor series called XC32. It is currently based on GCC 4.8, and the free version provides the -O0 and -O1 optimization levels. The 'paid' version includes support and the other GCC optimization levels: -O2, -Os and -O3.

The XC32 optimization levels are not exactly the same as the standard GCC levels, but they roughly follow them. The XC32 2.0x manual states,


Some Notes

I used XC32 versions 2.05 and 2.10 for these tests (these versions perform the same in my tests; the version differences only add some new devices and fix some corner case defects, as can be seen in the release notes).

When debugging your program logic it is useful to set the optimization level to -O0, as this produces pretty much 1:1 code with your C code, which makes following the logic and looking at variables easy. Also, inlining of functions is disabled, so functions appear as you wrote them.

-O1 is the standard optimization level for the free version of XC32, and even this level inlines functions and aggressively eliminates variables that can be kept on the stack or in a register, making your code much faster and smaller, but also making debugging very hard to follow. This is the optimization that you most likely want to use for your 'Release' code; after all, who doesn't want faster, smaller code?

Many standard libraries, like the MIPS PIC32 DSP libraries, are written in hand optimized MIPS assembly language and hence bypass the C compiler completely, so you get the fastest possible performance even with the free version of the compiler.

-O2, -Os and -O3 optimizations are only available with the paid version of XC32, which is quite inexpensive at < $30.00 per month and a must for professional developers, if only for the expedited support that the license includes.

My hardware setup for all these tests was a PIC32MZ2048EFH processor running at a 200 MHz clock speed.

On to the Benchmarking

One of the standard benchmarks used with advanced 32 bit processors is the 'CoreMark' [2]. This is a multifaceted benchmark that tries to simulate many different aspects of an actual application, yet in the end produces a single performance number. When I looked at it, I found that the code is quite small and doesn't move a lot of data around, so in execution it probably spends most if not all of its time in any modern 32 bit processor's cache.

Another application that I used in my benchmarking was a custom image processing application that I wrote. It is not big in the sense of requiring a large program memory footprint, but it does use upwards of 57,000 bytes of data as it processes a 32 x 24 image up to 320 x 240 for display, and along the way: scales, normalizes, applies automatic gain control, maps the resulting image to a 255 step color map, etc. So there is quite a bit of data manipulation, and all the data cannot fit in the PIC32MZ's 16k data cache at once.

The last application I used was a relatively large application provided by Microchip as a demo program; it represents a complete application with extensive graphics, etc. This application was compiled just to see the relative code sizes that the XC32 compiler produced, as I thought the CoreMark and my application were really too small for a realistic analysis of program size.

CoreMark Program Insights

The CoreMark site [2] provides results along with the compiler "Command Line Parameters" that were used to compile the program. This information was very interesting, as you will soon see.

First, let's take a look at the CoreMark execution speed versus the various XC32 compiler optimization levels. The CoreMark program when compiled at -O1 is only 32k bytes long and uses only 344 bytes of data memory, so it is quite small and probably runs completely in the PIC32MZ processor cache; speed of execution is all we can really look for here.

  
Figure 1 – This is the execution speed for the CoreMark with various optimization levels. All results were normalized to the -O1 level as this is the highest optimization for the free version of the XC32 Compiler. See the text for a discussion of each optimization.

I included the -O0 optimization level in Figure 1 just as a comparison, to show what a huge difference even the free -O1 optimization level makes on code performance. The difference between -O0 and -O1 in the CoreMark (and nearly every other application I have ever compiled) is nearly 1.5:1; no other optimization makes that big of a jump. In fact, all the other optimizations and 'tweaks' only produce marginal gains over the -O1 optimization. As noted, optimization level -O0 is really only useful in debugging code; no one would ever build a final application with this optimization level unless they really just don't like their customers.

-Os is 'optimize for size' and, as expected, it results in a nearly 10% performance hit here. The CoreMark program has a very small memory footprint, so this option would be a waste of time in any small application like this one; it is included only as a reference.

-O1 is the default optimization for the XC32 compiler free version. All the other results are normalized to this result (100%).

-O2 and -O3 produce only marginal gains above the -O1 level. -O2 is 13% faster, -O3 is 17% faster.

-O2++ is the optimization that Microchip used for the benchmark results posted on the CoreMark site and, as might be expected, it includes some undocumented command line options and some specific tweaks to get the performance up as much as possible. Again, there is nothing that says anyone can't optimize specifically for the benchmark, only that others should be able to duplicate the results, which I was able to do. Here are the command line options for -O2++ as I found them,

-g -O2 -G4096 -funroll-all-loops -funroll-loops -fgcse-sm -fgcse-las -fgcse -finline -finline-functions -finline-limit=550 -fsel-sched-pipelining -fselective-scheduling -mtune=34kc -falign-jumps=128 -mjals

The really interesting option here is "-mtune=34kc". I could not find this option documented anywhere, and I really did not have time to search through megabytes of source code to try to find it. But "34Kc" is a designation that MIPS uses to describe the core that the PIC32MZ is based on, so it is some sort of optimization for this specific core.

The bottom line is that these optimizations produce the fastest result – 30% faster than -O1 alone, but only some 12% faster than the base -O3 optimization, so it is only marginally better than -O3 alone.

-O3++ is where I applied these same -O2++ command line options to the base -O3 optimization just for fun. This produced a marginally faster result than -O3 alone, but it was still slower than -O2++.

It is interesting to see the results, but again, while CoreMark is more comprehensive than the 1980's 'Sieve' benchmark, the CoreMark probably does not apply to your application at all; it's small and uses an incredibly small amount of data memory.

Image Processing Program Insights

This was an actual application that I wrote that takes raw data from a 32 x 24 pixel image sensor, converts the raw data into useful values, up-interpolates the pixels to 320 x 240, and then scales and limits the values for display. The data formats were int32, int16, int8 and float data types. A large amount of data was processed, some 57,000 bytes for each image.

Reading the sensor and writing to the display are fixed, running as fast as the hardware interfaces allow, so nothing can be done about that. The experiment was to see if the central processing algorithms could be sped up enough to allow me to slow down the CPU clock and save battery power. Running some benchmarks was a first step in that determination.

    while (1) {
        GetSensorData();           // Fixed by how fast the sensor can be read.
        ConvertDataArray();
        BiLinInterpolate();
        GetImageAdjustments();     // User / GUI Interaction – get settings.
        ApplyExpansion();
        ApplyContrastCurve();
        ApplyBrightnessAdjustment();
        DisplayImage();            // Fixed by how fast the display can be written to.
    }


Figure 2 – The central image processing 'loop' consists of routines like this. At each step the data arrays are processed in deterministic loops. Some of the processing is redundant because multiple loops were used, as each step consists of one specific data operation. These loops could be combined if need be, but without some profiling first the effort might have been in vain (see the conclusion); guessing almost never pays off in optimizing.
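The timing itself can be done with the PIC32 core timer. A minimal sketch of that measurement, assuming XC32's _CP0_GET_COUNT() macro (the CP0 Count register increments at half the CPU clock, so 100 MHz on this 200 MHz PIC32MZ); the function name is mine,

    #include <xc.h>        /* _CP0_GET_COUNT() on PIC32 */
    #include <stdint.h>

    extern void ConvertDataArray(void);   /* Routines from Figure 2 */
    extern void BiLinInterpolate(void);
    extern void ApplyExpansion(void);
    extern void ApplyContrastCurve(void);
    extern void ApplyBrightnessAdjustment(void);

    /* Time just the image processing section, excluding the hardware IO. */
    uint32_t TimeProcessing_us(void)
    {
        uint32_t start = _CP0_GET_COUNT();

        ConvertDataArray();
        BiLinInterpolate();
        ApplyExpansion();
        ApplyContrastCurve();
        ApplyBrightnessAdjustment();

        return (_CP0_GET_COUNT() - start) / 100u;   /* Core timer ticks to microseconds */
    }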


Figure 3 – The simplest optimization is to use the compiler's built in 'smarts' to make the code faster. Here my simple but data intensive image processing program was optimized using various compiler settings and the speed of execution was measured. The optimization level -O1 was normalized to 100%.

Figure 3 is pretty straightforward. I timed only the image processing portion of the code, excluding the hardware IO, as that is fixed by hardware constraints. Optimization level -O0 would never be used for released code; it is included here only as a comparison to show how aggressive the compiler gets even with the free -O1 optimization. Interestingly, option -O2 produced a slower result than option -O1 in this example; there is probably some data inefficiency going on with this option. As expected, however, option -O3 produced the fastest 'standard' result, but really only marginally faster than -O1, at around 10%.

XC32 also has some extra 'switches' that can be tweaked from the GUI. I set all of these for the -O1++, -O2++ and -O3++ level tests. These switches are shown in Figure 4.


 
Figure 4 – XC32 allows you to set these switches to turn on more aggressive compiler options, as can be seen in Figure 3 for the '-Ox++' results. It is faster, but not by a lot, typically less than 5%.

The bottom line here is: The XC32 free version's -O1 optimization was only slightly slower than the paid version's -O3++ maximum optimizations, by less than 15% – OK, but nothing really to get too excited about.

Large Application Insights

This large application is based on Microchip's "Real Time FFT" application example program [3]. This is a fairly large application: it uses a lot of data and has an LCD and an extensive user interface, along with ADC and DAC drivers and FFT calculations. I don't have the hardware to run this code, so I looked only at the generated code size and data size. The data size held steady at 200 kBytes independent of the optimization level, which is expected for a large graphics oriented program like this. The compiled program size was 279 kBytes when compiled with the optimization level of -O1.


Figure 5 – The Microchip application example "Real Time FFT" was compiled at various optimization levels, and the resulting program SIZE is plotted here. This is a rather big application, at some 279 kBytes when compiled at -O1. As can be expected, when optimizing for absolute maximum speed (-O3++, see Figure 4) the program gets much larger.

As can be seen in Figure 5, the -Os optimization gave only a marginal size decrease of around 7% over the default -O1 optimization. -O3++, however, grew very large, probably mostly due to the application of Figure 4's "Unroll Loops" switch. This switch forces the unrolling of all loops, even non-deterministic ones.

This result is to be expected, as any compiler's 'Money Spec' is execution speed, not program size, which for the majority of real world applications is the proper trade off. As my image processing application shows, the performance gains from -O1 to -O3 would be expected to be minimal, and the trade off in program size might be excessive, especially if you are running out of space and can't go to a larger memory device for some reason.

In a very large application, code size may be a real issue if you want to save money by using a smaller memory chip. The -Os option probably isn't the silver bullet you are looking for, as the < 10% code size savings is nearly insignificant. Microchip has an application note [4] that provides some interesting information on how -Os works, plus some more optimization tricks for code size. The tricks, when applied, result in less than a 2% improvement over the -Os optimization alone, so while this note is interesting reading, it provides little further improvement.

Conclusion

The only good benchmark is the one of your actual running application code. That being said, one thing that can be gleaned from the experiments here is that the free XC32 compiler running at its maximum optimization level of -O1 produces code that is within 15-20% or so of the maximum optimizations possible with the paid XC32 compiler. The other thing that can be gleaned is that by spending hours and hours hand tweaking the compiler optimizations for the particular task at hand, you might be able to coax out 20% or perhaps even 30% better performance, but you can probably do the same or better right in your code by improving your base algorithms.

It is always good to remember the first and second rules of code optimization, however,

     #1 - Write plain and understandable code first.
     #2 - Don't hand optimize anything until you have profiled the code and proven that the optimization will help.

Guessing at what to optimize is almost never right and ends up wasting a lot of time for a very marginal performance increase and a highly likely increase in your code's maintenance costs.

I have found that there are all sorts of interesting articles out there on optimization with GCC, and I have also found that unless they show actual results they are mostly someone's 'opinion' on how things work, or how things worked 15 years ago but don't work that way today. For instance, I have found articles that state emphatically that 'switch' statements produce far faster code than 'if' statements, and I have found articles that say emphatically just the opposite. Naturally, neither of these articles show any actual results, so I have learned to beware and check for myself. Which takes me back to rule #2 of optimizing: don't do it until you have proven that you need to do it. And yes, I fight this temptation myself all the time!

And finally, yes, finally... If you don't have the XC32 paid version, don't fret too much – in most applications the paid version will only provide a marginal gain in performance and size over the default and free XC32 -O1 optimization. But, as always: "Your mileage may vary".

Appendix A –  General Optimization Settings – Best Practices with XC32

Best for Step by step Debugging

Set the optimization level to -O0 and leave all other options at their default 'reset' condition. This will compile the code almost exactly as you have written it, allowing easy line by line tracing during debugging. This is very helpful in tracing and verifying the program's logic.

   

Best settings for Speed or Size

Use these settings for maximum 'easy' speed in the XC32 free compiler, meaning these settings will get you to around 80 to 90% of the maximum speed possible. Any improvement over this will need to be accompanied by a very careful review of all the compiler options and their effects, actual time profiles of the code, and possibly specific module optimizations.

Project Properties → xc32-gcc→ General



Project Properties → xc32-gcc→ Optimization
 
 

To Optimize a specific source file Individually

Right click on the source file and select 'Properties'.
 
  



Then select "Override Build Options", and from there you can set the optimization level of the specific file separately from the main application settings.

  

Then select xc32-gcc → Optimizations and have at it as detailed above.

Appendix B – Lists of the optimizations as applied by XC32 Version 2.05/2.10

Optimizations vs. -Ox level, All other settings at default levels.

Note: Use "-S -fverbose-asm" to list every silently applied option (including optimization ones) in the assembler output.

# -g -O0 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcommon -fdebug-types-section
 # -fdelete-null-pointer-checks -fearly-inlining
 # -feliminate-unused-debug-types -ffunction-cse -ffunction-sections
 # -fgcse-lm -fgnu-runtime -fident -finline-atomics -fira-hoist-pressure
 # -fira-share-save-slots -fira-share-spill-slots -fivopts
 # -fkeep-static-consts -fleading-underscore -fmath-errno
 # -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return
 # -fpeephole -fprefetch-loop-arrays -fsched-critical-path-heuristic
 # -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock
 # -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec
 # -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column
 # -fsigned-zeros -fsplit-ivs-in-unroller -fstrict-volatile-bitfields
 # -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-coalesce-vars
 # -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im
 # -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 # -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop
 # -ftree-slp-vectorize -ftree-vect-loop-version -funit-at-a-time
 # -fverbose-asm -fzero-initialized-in-bss -mbranch-likely
 # -mcheck-zero-division -mdivide-traps -mdouble-float -mdsp -mdspr2 -mel
 # -membedded-data -mexplicit-relocs -mextern-sdata -mfp64 -mfused-madd
 # -mgp32 -mgpopt -mhard-float -mimadd -mlocal-sdata -mlong32 -mno-mdmx
 # -mno-mips16 -mno-mips3d -mshared -msplit-addresses


# -g -O1 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcombine-stack-adjustments -fcommon -fcompare-elim
 # -fcprop-registers -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fearly-inlining
 # -feliminate-unused-debug-types -fforward-propagate -ffunction-cse
 # -ffunction-sections -fgcse-lm -fgnu-runtime -fguess-branch-probability
 # -fident -fif-conversion -fif-conversion2 -finline -finline-atomics
 # -finline-functions-called-once -fipa-profile -fipa-pure-const
 # -fipa-reference -fira-hoist-pressure -fira-share-save-slots
 # -fira-share-spill-slots -fivopts -fkeep-static-consts
 # -fleading-underscore -fmath-errno -fmerge-constants
 # -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 # -fpcc-struct-return -fpeephole -fprefetch-loop-arrays
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros
 # -fsplit-ivs-in-unroller -fsplit-wide-types -fstrict-volatile-bitfields
 # -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 # -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop
 # -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts
 # -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert
 # -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize
 # -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc
 # -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr
 # -ftree-sra -ftree-ter -ftree-vect-loop-version -funit-at-a-time
 # -fvar-tracking -fvar-tracking-assignments -fverbose-asm
 # -fzero-initialized-in-bss -mbranch-likely -mcheck-zero-division
 # -mdivide-traps -mdouble-float -mdsp -mdspr2 -mel -membedded-data
 # -mexplicit-relocs -mextern-sdata -mfp64 -mfused-madd -mgp32 -mgpopt
 # -mhard-float -mimadd -mlocal-sdata -mlong32 -mno-mdmx -mno-mips16
 # -mno-mips3d -mshared -msplit-addresses


# -g -Os -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 # -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 # -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fdevirtualize -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -fforward-propagate -ffunction-cse -ffunction-sections -fgcse -fgcse-lm
 # -fgnu-runtime -fguess-branch-probability -fhoist-adjacent-loads -fident
 # -fif-conversion -fif-conversion2 -findirect-inlining -finline
 # -finline-atomics -finline-functions -finline-functions-called-once
 # -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const
 # -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots
 # -fira-share-spill-slots -fivopts -fkeep-static-consts
 # -fleading-underscore -fmath-errno -fmerge-constants
 # -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
 # -foptimize-register-move -foptimize-sibling-calls -fpartial-inlining
 # -fpcc-struct-return -fpeephole -fpeephole2 -fprefetch-loop-arrays
 # -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap
 # -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types
 # -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields
 # -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math
 # -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch
 # -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim
 # -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
 # -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre
 # -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
 # -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion
 # -ftree-tail-merge -ftree-ter -ftree-vect-loop-version -ftree-vrp
 # -funit-at-a-time -fuse-caller-save -fvar-tracking
 # -fvar-tracking-assignments -fverbose-asm -fzero-initialized-in-bss
 # -mbranch-likely -mcheck-zero-division -mdivide-traps -mdouble-float
 # -mdsp -mdspr2 -mel -membedded-data -mexplicit-relocs -mextern-sdata
 # -mfp64 -mfused-madd -mgp32 -mgpopt -mhard-float -mimadd -mlocal-sdata
 # -mlong32 -mmemcpy -mno-mdmx -mno-mips16 -mno-mips3d -mshared
 # -msplit-addresses


# -g -O2 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 # -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 # -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fdevirtualize -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -fforward-propagate -ffunction-cse -ffunction-sections -fgcse -fgcse-lm
 # -fgnu-runtime -fguess-branch-probability -fhoist-adjacent-loads -fident
 # -fif-conversion -fif-conversion2 -findirect-inlining -finline
 # -finline-atomics -finline-functions-called-once -finline-small-functions
 # -fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra
 # -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots
 # -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
 # -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
 # -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls
 # -foptimize-strlen -fpartial-inlining -fpcc-struct-return -fpeephole
 # -fpeephole2 -fprefetch-loop-arrays -fregmove -freorder-blocks
 # -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns -fschedule-insns2
 # -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
 # -fsplit-wide-types -fstrict-aliasing -fstrict-overflow
 # -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps
 # -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 # -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
 # -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
 # -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon
 # -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre
 # -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink
 # -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion
 # -ftree-tail-merge -ftree-ter -ftree-vect-loop-version -ftree-vrp
 # -funit-at-a-time -fuse-caller-save -fvar-tracking
 # -fvar-tracking-assignments -fverbose-asm -fzero-initialized-in-bss
 # -mbranch-likely -mcheck-zero-division -mdivide-traps -mdouble-float
 # -mdsp -mdspr2 -mel -membedded-data -mexplicit-relocs -mextern-sdata
 # -mfp64 -mfused-madd -mgp32 -mgpopt -mhard-float -mimadd -mlocal-sdata
 # -mlong32 -mno-mdmx -mno-mips16 -mno-mips3d -mshared -msplit-addresses


# -g -O3 -ffunction-sections -ftoplevel-reorder -fverbose-asm
 # options enabled:  -faggressive-loop-optimizations -fauto-inc-dec
 # -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon
 # -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps
 # -fdebug-types-section -fdefer-pop -fdelayed-branch
 # -fdelete-null-pointer-checks -fdevirtualize -fearly-inlining
 # -feliminate-unused-debug-types -fexpensive-optimizations
 # -fforward-propagate -ffunction-cse -ffunction-sections -fgcse
 # -fgcse-after-reload -fgcse-lm -fgnu-runtime -fguess-branch-probability
 # -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2
 # -findirect-inlining -finline -finline-atomics -finline-functions
 # -finline-functions-called-once -finline-small-functions -fipa-cp
 # -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra
 # -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots
 # -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno
 # -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants
 # -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls
 # -foptimize-strlen -fpartial-inlining -fpcc-struct-return -fpeephole
 # -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -fregmove
 # -freorder-blocks -freorder-functions -frerun-cse-after-loop
 # -fsched-critical-path-heuristic -fsched-dep-count-heuristic
 # -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic
 # -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic
 # -fsched-stalled-insns-dep -fschedule-insns -fschedule-insns2
 # -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller
 # -fsplit-wide-types -fstrict-aliasing -fstrict-overflow
 # -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps
 # -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp
 # -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
 # -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
 # -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre
 # -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im
 # -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 # -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc
 # -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr
 # -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter
 # -ftree-vect-loop-version -ftree-vectorize -ftree-vrp -funit-at-a-time
 # -funswitch-loops -fuse-caller-save -fvar-tracking
 # -fvar-tracking-assignments -fvect-cost-model -fverbose-asm
 # -fzero-initialized-in-bss -mbranch-likely -mcheck-zero-division
 # -mdivide-traps -mdouble-float -mdsp -mdspr2 -mel -membedded-data
 # -mexplicit-relocs -mextern-sdata -mfp64 -mfused-madd -mgp32 -mgpopt
 # -mhard-float -mimadd -mlocal-sdata -mlong32 -mno-mdmx -mno-mips16
 # -mno-mips3d -mshared -msplit-addresses

References

[1] Some of the original BYTE magazine articles dealing with the Sieve of Eratosthenes

A High-Level Language Benchmark by Jim Gilbreath
BYTE Sep 1981, p.180
https://archive.org/details/byte-magazine-1981-09/page/n181

Eratosthenes Revisited: Once More through the Sieve by Jim Gilbreath and Gary Gilbreath
BYTE Jan 83 p.283
https://archive.org/details/byte-magazine-1983-01/page/n291

Benchmarking UNIX systems by David Hinnant
BYTE Aug 1984 p.132
https://archive.org/details/byte-magazine-1984-08/page/n137

[2] Coremark Benchmark
www.eembc.org/coremark/

[3] Microchip Technology, example application. Located in the Harmony install directory at,
    .../apps/audio/real_time_fft

[4] Microchip Technology, “How to get the least out of your PIC32 C compiler”,
https://www.microchip.com/mymicrochip/filehandler.aspx?ddocname=en557154


Article By: Steve Hageman / www.AnalogHome.com

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

Note: This Blog does not use cookies (other than the edible ones).