Sunday, July 23, 2017

When Microprocessors are a commodity – How do you choose?

Slightly over 20 years ago I needed a microprocessor for a project and had to keep the budget low. The 8 bit microprocessors of the time cost $15.00 in single quantities. That wasn’t an issue, but the tool chain cost was. I didn’t want to spend thousands on the traditional tool chains that many of the processors of the day used. I bought a hobby programmer for $69.00. The required UV EPROM eraser cost me $29.00 and the C compiler was $99.00. The tool chain was the most expensive part and that really drove the microprocessor vendor choice, which in this case was a Microchip PIC16C71 with all of 1k program memory and 36 bytes of RAM, running at a blazing 1 MIPS! The final project worked fine.

Thanks to Mr. Moore and his ‘Law’, we now have 32 bit processors with built in floating point units that have at least 512k of flash and 256k of RAM, all running with a 200 MHz system clock and costing less than $10.00 in single quantities.

The tool chains are free and based on the GCC compiler. Similarly the basic programmer / debuggers are less than $20.00.

I have experience with both Microchip PIC32MZ and ST Micro STM32F7 products. So when I had a new project come by, how would I choose the ‘best’ device?

One processor is MIPS core based and the other is an ARM design. For 99.9% of the code I needed to write, the C compiler hides the underlying processor details so I don’t need to know anything about the underlying differences between a MIPS and ARM core. GCC has optimizations for both types and does a fine job of making efficient code.

The client may care about what processor to use, but when they say this, what they really care about is the tool chain, and I agree. You simply can’t run a successful operation if every project requires a different tool chain. So if you are an “ARM” processor house, you are really an ST Micro, NXP, Atmel, et al. tool chain house that just happens to target ARM core processors from some manufacturer.

The choice as to what to pick comes down to slight features or preferences in tool chains or processor features. Everything else is pretty equal. Believe it or not, 32 bit microprocessors are a commodity.

My Current Project

The project at hand was a dual channel data acquisition and computation instrument that needed to drive two fast 16 bit ADC’s, buffer the samples in a large amount of on board RAM, then process the results with FFT’s and drive an Analog Output with some computed results. The design also needed a USB connection to a PC for setup and monitoring.

Normally this is done with at least three chips: An FPGA for ADC and Memory interfacing and a 32 Bit Processor acting as the DSP number cruncher and communication processor. For this project I needed to keep the chip count to one, the 32 Bit Processor. So speed and memory were primary considerations.

Speed was the first consideration: At least a 200 MHz system clock was required, just so I would not fail on the DSP computation part. I had enough benchmarking experience from previous projects to know that a 32 Bit processor running at 200 MHz or more would give me the desired number crunching performance.

The next constraint was that the ADC’s chosen were parallel output devices. I needed to be able to get a full 16 bits read in to the processor in a single chunk and ping-pong between the ADC’s to get both read in at a 2 MSPS rate. This is where demo boards came in. Both processors had (in a 64 pin package) at least one fully pinned out 16 bit port and writing some test bit-banging code I verified that both would also be able to manipulate the ADC’s and get the data to RAM fast enough. Both processors passed this constraint.
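To get a feel for how tight the bit-banged ADC read is, the clock cycle budget per sample can be estimated from the system clock and the sample rate. This is only a rough sketch: it treats the 2 MSPS figure as the combined rate for both ADC's (an assumption; if each ADC ran at 2 MSPS the budget would halve) and ignores wait states and interrupt overhead.

```python
def cycles_per_sample(cpu_hz: float, sample_rate_hz: float) -> float:
    """Return the number of CPU clock cycles available between samples."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return cpu_hz / sample_rate_hz


CPU_HZ = 200e6        # 200 MHz system clock from the text
COMBINED_RATE = 2e6   # assumption: 2 MSPS total across both ADC's

print(cycles_per_sample(CPU_HZ, COMBINED_RATE))      # 100.0
print(cycles_per_sample(CPU_HZ, 2 * COMBINED_RATE))  # 50.0
```

Either way there are only tens of cycles to toggle the ADC control lines, read the port and store the result, which is why testing on the demo boards was worthwhile.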

ADC interfacing: The ADC’s had 3.3V IO voltage levels and both processors also have 3.3V IO pin voltage levels. Both processors also use a single voltage for the core and IO pins, which makes the overall design simpler as only one voltage has to be supplied to the processor. So no advantage to either processor.

Next was RAM – I wanted the biggest amount of RAM possible, just to be safe. The initial design was to run 2 x 8k FFT’s with another 2 x 4k buffers for continuous averaging of the results. Both chips had many times this RAM, but you can never have too much RAM, can you?
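A quick way to sanity check a buffer plan like this is to add up the bytes. The element sizes below are assumptions for illustration only (16 bit samples for the FFT buffers, 32 bit accumulators for the averaging buffers, and "k" taken as 1024); the text does not state them.

```python
def buffer_bytes(count: int, points: int, bytes_per_point: int) -> int:
    """Total bytes used by `count` buffers of `points` elements each."""
    return count * points * bytes_per_point


fft = buffer_bytes(2, 8 * 1024, 2)  # assumption: 16 bit samples
avg = buffer_bytes(2, 4 * 1024, 4)  # assumption: 32 bit accumulators
total = fft + avg

print(total, total / 1024)  # 65536 64.0
```

Changing `bytes_per_point` (for example to 8 for complex 32 bit data) shows quickly how the budget grows with the data format.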

The PIC32MZ had a slight advantage here as that part has 512k of RAM versus the STM32F7 part’s 256k.

Program Memory: I always buy the biggest memory part available for prototyping – after all you want to get the design going fast, not save a few bucks and end up spending days trying to figure out how to make the program fit. The PIC32MZ also has a slight advantage here as it has an unbelievable 2048k of Flash program memory! The STM32F7 topped out at 512k – although to be honest I would never end up using all that Flash from either part for this application.

Note: If you are building a Web interface and will be serving up Web pages, all that RAM and ROM starts to look pretty small, pretty fast!

Speaking of FFT’s – Life is too short to be writing your own FFT and DSP routines and both chips passed this test by supplying a very complete and functional DSP library at no cost!

Speaking of Math – Both processors have Floating Point Units (FPU’s). The PIC32 however can do both single and double precision floating point, whereas the STM32F is a single precision unit only. The thought was that in the code I would do integer FFT’s and averaging, then convert the single FFT bin of interest to floating point to do the control math, then convert this back to integer for output to the DAC. Using floating point when the processor has a FPU has almost no speed penalty and just makes the code easier to understand (which my clients like). This gives the nod to the PIC32 for never leaving me out in the cold if I needed more precision than the single precision STM32F FPU provides.

Processor Package Size: The ideal was to use a 64 pin LQFP – Check on that as both families have that package available.

Tool chain: Both are usable, free, GCC based tool chains with very low cost programmer / debuggers. The STM32F7 tool chain is a little more common as it provides a set of tools to initialize and configure the peripherals through a HAL (Hardware Abstraction Layer) Library. The PIC32MZ uses what Microchip calls their Harmony configuration program. Harmony abstracts all the various PIC32 chips to a single very high level HAL.

Documentation wise: The PIC32 XC-32 Compiler and Standard Library documentation is very good and customized for the PIC32 GCC extensions. Likewise the online help in the IDE, while not perfect or complete, is more than just auto-generated listings of function calls, so it is quite useful.

With the STM32 you are left with the standard GCC documentation that can be found on the web. The STM32 HAL library documentation is basically just an auto-generated listing of the functions and their parameters with no other useful information.

Nod to the PIC32 for better compiler documentation.

Both programming IDE’s are easy to use and have the same amount of annoying little issues (nothing works perfectly, does it?), so that’s a dead heat. At least neither ever crashed on my computers. They just have the annoying bugs like the dreaded “red squiggle underline” under perfectly fine code and not syntax highlighting correctly all the time.

Processor Interrupts – The firmware design was such that the sampling clock would drive the ADC’s convert pin directly and an external processor interrupt pin to initiate the processor’s data collection function. Another external interrupt would be needed for an external trigger circuit. Both processors have fast interrupts available on all IO pins, so no advantage to either here.

Communication – The plan was to use a trusty FTDI USB to Serial converter to get the USB communication to the Control PC. Hence I wanted a USART that could run at 3 Million Baud (the maximum rate for a FT232R chip). Both processors have multiple USART’s and can support the 3 MBAUD rate. A tie here. The PIC32MZ has a slight tool chain advantage as the stdio functions like printf() are already wired up to USART 2 and don’t require any further setup. On the STM32F parts there is some user code required to route the standard out to the proper USART. Not a big deal, but it has to be done.
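For a sense of what 3 Million Baud buys in payload terms, the byte rate follows from the frame length. This sketch assumes the common 8N1 framing (one start bit, eight data bits, no parity, one stop bit), which the text does not state, and it ignores any USB side latency.

```python
def uart_bytes_per_second(baud: int, data_bits: int = 8,
                          parity_bits: int = 0, stop_bits: int = 1) -> float:
    """Peak payload rate of an asynchronous serial link in bytes per second."""
    bits_per_frame = 1 + data_bits + parity_bits + stop_bits  # 1 start bit
    return baud / bits_per_frame


print(uart_bytes_per_second(3_000_000))  # 300000.0
```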

Sampling Clock – The design was to use the 96 MHz system reference oscillator divided down to 24 MHz as the processor system clock input, this would be further divided to get an adjustable ADC sampling rate clock. The divider needed to be completely in hardware so that it would have no extra uncertainty jitter (An interrupt based timer/divider would have too much jitter and would not work).

The PIC32MZ has four independent Reference Clock Dividers that can be clocked from a variety of sources and can generate a divide ratio of 2-32768. This fit my needs perfectly. The STM32F7 does not have such a divider. This is the biggest difference between the parts. It did take two hours to figure out how to program this on the fly, as the core has to be unlocked to change the divide ratio, etc. This was well covered in the documentation; I just had to find it.
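Picking a divide ratio for a wanted sample rate is simple arithmetic. The sketch below is a generic integer divider model, not the actual PIC32MZ register formula; the 24 MHz input and the helper name `pick_divider` are assumptions for illustration, and the real peripheral may impose further constraints that the datasheet covers.

```python
def pick_divider(f_in_hz: float, f_target_hz: float,
                 min_div: int = 2, max_div: int = 32768):
    """Return (ratio, actual_hz) for the integer ratio nearest the target."""
    if f_in_hz <= 0 or f_target_hz <= 0:
        raise ValueError("frequencies must be positive")
    ratio = round(f_in_hz / f_target_hz)
    ratio = max(min_div, min(max_div, ratio))  # clamp to the 2-32768 range
    return ratio, f_in_hz / ratio


print(pick_divider(24e6, 2e6))  # (12, 2000000.0)
print(pick_divider(24e6, 1e3))  # (24000, 1000.0)
```

Because the division happens entirely in hardware, the only error is the rounding to an integer ratio, with no software jitter added.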

Analog Output – The result of all this sampling, FFT’s and math was a single number that could be output to a DAC. 12 bits was the minimum precision and 16 bits would be preferred. The initial thought was to use an external DAC on a SPI bus for this. As the update rate only had to be at 100 Hz, this would be easy to implement. The STM32 may have an advantage here – as it has two 12 bit DAC’s built-in whereas the PIC32 only has a low performance, 5 bit voltage divider built in. The worry here is that the internal DAC would be corrupted with noise from the processor core, but at 100 Hz output update rate I could have easily filtered off any noise. Slight advantage to the STM32 here.

Core Features – The MIPS core has an independent counter on the system clock. This counter runs at one half the system clock rate (100 MHz in this case) and can be easily used to make very precise delays and timing measurements.
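The arithmetic behind using such a counter for delays and timing is shown below as a minimal sketch. It assumes a 32 bit free running counter at 100 MHz; the function names are hypothetical, and on the real part the counter value would come from a register read rather than a function argument.

```python
COUNTER_HZ = 100_000_000  # half of the 200 MHz system clock
MASK32 = 0xFFFFFFFF       # assumption: 32 bit free running counter


def us_to_ticks(microseconds: float) -> int:
    """Convert a delay in microseconds to counter ticks."""
    return round(microseconds * COUNTER_HZ / 1_000_000)


def elapsed_ticks(start: int, now: int) -> int:
    """Ticks between two counter reads, correct across one wraparound."""
    return (now - start) & MASK32


print(us_to_ticks(10))                        # 1000
print(elapsed_ticks(0xFFFFFF00, 0x00000100))  # 512
```

Masking the difference to 32 bits is what keeps the measurement valid when the counter rolls over between the two reads.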

The STM32F7 has a DWT Cycle counter on the core clock and a SysTick counter, but it is not straightforward to start and use and there are typos in both the ARM and STM32 documentation. Also the STM32 configuration program does not configure the DWT counter for you; it must be done with bare metal code. That’s a negative for the STM32 tool chain, as the PIC32 Harmony configurator exposes every part of its chip for configuration.

Help – Both the PIC32MZ and STM32F7 processors have active and helpful forums on the interwebs, so no advantage to either part here.

Resume – Having ARM Core experience on your resume looks better than MIPS core experience, so points to the STM32F7 here. ARM cores are just more popular, go figure...

And the Winner is,

The choice was simple. I slightly prefer the STM32 tool chain over the PIC32, and if all other things were equal that would have been the deciding factor. But the fact that the PIC32 has one single little hardware peripheral, “The Reference Clock Divider”, ultimately drove the decision. The use of an external DAC for the PIC32 implementation was not as big a deal as the reference clock, which would have required much more extensive circuitry than a simple DAC to implement externally. The use of an external DAC also alleviated all fear of having processor noise on the output.

The PIC32’s larger ROM and especially RAM were simply icing on the cake.

Appendix – Why use an external USB chip when your processor implements USB?

A FTDI FT232RL costs around $4.50 in single units, so why put one in front of a microprocessor that has a built in USB interface?

Several reasons, actually.

#1 – Driver stability: Using a microprocessor’s built in USB interface usually means using the Windows CDC class driver. The CDC driver up until Windows 10 has been reported to have many quirky issues. The FTDI driver on the other hand is bullet proof. Unless cost is of paramount importance, I always do my customers (and myself) a favor and use the most bullet proof solution possible. I have just never had any issue with the FTDI drivers on any PC. They work very well.

#2 – Program development: When developing programs, frequent processor resets are the norm. Every time you program new code the processor restarts. A processor restart also resets the processor’s built in USB peripheral. This breaks any connection that you have with the PC at the time, forcing you to restart the PC control program as well.

When you use a FTDI chip, resetting the processor does nothing to the USB to PC connection, it does not reset and any PC programs keep running as if nothing happened. You might get a few garbage characters, but you won’t have to restart any PC program. This saves tremendous time and frustration during code development.

Even if your final design is going to use the processor’s built in USB, at least design your board to use a FTDI dongle for development purposes, then switch to the built in USB peripheral when the design is more stable.

#3 – Speed: USB is USB. The speed of the transfer is dependent on: How many bits are to be sent, the latency times and buffer size of the USB driver. The built in USB peripheral can’t be any faster than the FT232 chip. To maximize speed for any given application these two parameters need to be modifiable [1]. The FTDI driver exposes these two parameters in their excellent DLL so that they can be changed on the fly. It is a control panel / registry hack to accomplish this with the Windows CDC driver and I’m not sure that the values can change on the fly without disconnecting and reconnecting the device.

#4 – Ease of programming: With the FT232 chip the interface is through the processor’s USART and can directly use functions like printf(). If you use the built in USB peripheral you will be on your own to build packets and stuff them down the USB pipe optimally. This takes time to benchmark and analyze, and the speed gains will be marginal with un-optimized driver settings.

Do yourself and your customer a favor and give them the reliability of a FT232 USB connection. You won’t be sorry, and in the long run everyone will save money.

[1] For a very interesting overview of USB latency and buffer size and how they affect total transfer speed see: “AN232B-04 Data Throughput, Latency and Handshaking” published by FTDI Inc.

Copyright notice: MIPS, ARM, PIC32, STM32, Microchip, ST Micro, NXP, Atmel and FTDI logos and names are copyrighted by their respective owners.

Article By: Steve Hageman     

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.  

This Blog does not use cookies (other than the edible ones). 

Tuesday, March 14, 2017

Auto Generate a 'C' Function Prototype Header File

There is a lot of discussion on the proper way to use header files in Embedded C projects. I don't want to get into that discussion, rather I want to present a tool that is useful for my usage. If you don't agree with my use model, that's OK, you don't have to use this tool that way, it is very adaptable to any situation! [1]

I use header files for three things,

  1. One big one that has all the #includes in it so that every module has the proper references to things like: <stdio.h> and <math.h>.
  2. One that has the global program data in it.
  3. One that has the various public function prototypes (signatures) in it.
The first two are relatively easy to make and maintain. They also settle down quickly in a project as the tasks they maintain are defined early on and while there may be refactoring, they rarely change much.

#3 however is constantly changing and even in a small project it may grow to many, many function prototypes and refactoring causes constant updating.

It is tedious to have to change the data in two places when refactoring or when adding functionality through public functions in an embedded C program.

Auto Generation to the Rescue

I knew that someone somewhere must have written a utility that reads a directory, looks through all the '.c' files and auto-generates a header with all the public function prototypes in it.

Sure enough I found some C code written in 1993, by a Mr. Richard Hipp on the InterWebs that does just that [2].

It's a small program, all in one file and only several pages long. I brought it into Pelles C [3] to see if I could compile it under Win 7 and it compiled with only a few simple changes. For code written in 1993, that seems almost unbelievable to me!

What the program is supposed to do is read every '.c' (and or '.h') file in a directory, extract the public functions and write them as function prototypes into individual '.h' files or one big '.h' file that can be added to the project.

Functions are marked as local (i.e. not for exporting into the '.h' file) by either using the keyword: 'static' or by the use of a 'LOCAL' define (see the program documentation [2]).

The program is wonderfully written, easy to follow and worked straight up out of the box, with one exception.

To read all the '.c' files at once to make a single output file the program depends on the UNIX ability to do wildcard expansion on the command line. MSDOS does not have that capability, so I had to wrap the makeheaders.exe in a MSDOS Batch file to make it work the way I wanted to use it.

MakeHeaders is so complete, it has the ability to pipe in a file list of names on the command line and then make a single big '.h' file. Mr. Hipp thought of everything. The batch file that I used is below.

REM Make one big Header File
dir /B *.c > mkhdr_input.txt
makeheaders -h -f mkhdr_input.txt >AutoGeneratedPrototypes.h
del mkhdr_input.txt


Listing 1 – A MSDOS batch file that makes the MakeHeaders program work the way I wanted it to. Upon running, it makes a temporary file with the names of all the '.c' files in the current directory. Then it feeds this list into makeheaders.exe. The program then parses all the files, picking out the public function prototypes to write into one big '.h' file.

The batch file of Listing 1 first makes a file called: “mkhdr_input.txt” that contains just the file names. The “dir /B” switch is for the bare format which will list just the file names.

The makeheaders.exe is then fed with the mkhdr_input.txt file as input. The '-h' switch causes makeheaders to generate one big '.h' file and write it to the standard output, which I redirect to the file AutoGeneratedPrototypes.h with the “>” redirect command.

The '-f' switch tells makeheaders that it will get its input from the file specified, in this case: “mkhdr_input.txt”

Finally I just cleanup by deleting the: “mkhdr_input.txt” file.
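For anyone not on Windows, or who would rather not use a batch file, the same three steps can be sketched in Python. This is an illustrative equivalent of Listing 1, not part of the MakeHeaders distribution; it assumes `makeheaders` is on the PATH and uses only the '-h' and '-f' switches described above. The function names are hypothetical.

```python
import glob
import os
import subprocess


def build_command(list_file):
    """Command line matching Listing 1: one big header on standard output."""
    return ["makeheaders", "-h", "-f", list_file]


def generate(header="AutoGeneratedPrototypes.h", list_file="mkhdr_input.txt"):
    """Write the '.c' file list, run makeheaders, then delete the list."""
    sources = sorted(glob.glob("*.c"))
    with open(list_file, "w") as f:
        f.write("\n".join(sources) + "\n")
    try:
        with open(header, "w") as out:
            # Standard output is redirected into the header, like ">" above.
            subprocess.run(build_command(list_file), stdout=out, check=True)
    finally:
        os.remove(list_file)  # same cleanup as the "del" line
```

Calling `generate()` from the project directory should produce the same AutoGeneratedPrototypes.h as the batch file.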

There are many other options and ways to make the MakeHeaders program operate so be sure to check out the documentation [2]. For example the same list input can be used to make an individual '.h' file that corresponds to every '.c' file in the directory. This format is preferred by many and in very large programs may be preferable.

I just add the AutoGeneratedPrototypes.h in my master 'include .h' file and the rest is automatic.

Perfect Function Prototypes Every Time

Now, anytime I refactor or add public functions to any source file I can just double click on the MakeHeaders.bat file and a new AutoGeneratedPrototypes.h file is made all ready for compiling.

Extra Bonus

If you want to use MakeHeaders to create a single, individual '.h' file for every '.c' file just use the batch file below.

REM Make A Separate Header File for Each *.C file
for %%G IN (*.c) DO MakeHeaders %%G


[1] To find out more about the various ways to use header files, do a Google Search like,

Then pick a strategy that makes sense for you.

[2] Make Headers program. As of March 2017, the source code and documentation can be found at,

[3] Pelles C – A very good freeware C Compiler for Windows,

Article By: Steve Hageman     


Monday, March 6, 2017

Simple Circuits Add to Versatility of the AD9834 Direct Digital Synthesizer IC

The little AD9834 Direct Digital Synthesizer (DDS) made by Analog Devices is a powerhouse of a design that is found in all sorts of products from Radios to Power Supplies. It is the “NE555” of DDS chips in its popularity [1].

The AD9834 consists of a 28 bit programmable DDS Core and a 10 Bit Current Output DAC. The Nominal DAC output current is 0 to 3 mA. This current can conveniently be output directly into a 200 Ohm load to generate a 0 to 600 mV output voltage (See figure 1).

Figure 1 – Standard output circuit configuration for AD9834 with a FSADJ resistor of 6.81k provides a fixed 0 to 600 mV output.

Note: The following circuits are simplified and do not show power supply connections or proper bypassing. Please refer to the parts specific data sheets for complete usage information.

While 0 to 600 mV may be useful in many applications, it is not particularly useful if the output voltage needs to be user adjustable or bipolar. This is especially true when the DDS is used like a function generator where the end user needs an adjustable amplitude.

Adjustable DAC Full Scale Current

The first approach to adding an adjustable output to the AD9834 is to attack the DAC current setting resistor. This resistor is nominally 6.81k for a DAC current of 0 to 3mA. The voltage at pin 1 (FSADJ) is nominally 1.15 Volts and this generates a current in the 6.81k resistor of 0.1689 mA (1.15/6810). This current gets scaled by 18 times internally in the AD9834 to get to the final 3mA DAC Full Scale Current.
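The numbers above can be checked with a few lines of arithmetic. This sketch uses only the nominal values quoted in the text (1.15 Volts at FSADJ, the 18 times internal scaling, the 6.81k resistor and the 200 Ohm load); the function name is hypothetical and part tolerances are ignored.

```python
def dac_full_scale_current(r_set_ohms: float, v_fsadj: float = 1.15,
                           gain: float = 18.0) -> float:
    """Nominal DAC full scale current in amps for a given FSADJ resistor."""
    return gain * v_fsadj / r_set_ohms


i_fs = dac_full_scale_current(6810.0)
v_out = i_fs * 200.0  # voltage developed across the 200 Ohm load

print(round(i_fs * 1e3, 3), round(v_out, 3))  # 3.04 0.608
```

Raising the resistor value lowers the full scale current in proportion, which is the effect that the control circuit described next exploits.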

The internal design of this circuit lends itself to controlling the DAC full Scale current over a reasonable range and this can be a useful and inexpensive way to get an analog or digital adjustment on the DDS output voltage.

As shown in Figure 2, a simple precision OPAMP from Linear Technology [2] used in a scaling circuit has been added to the DDS to control the DDS output voltage over a 4:1 range. Maximum DDS output voltage for this circuit is achieved for a 0 Volt control input.

A potentiometer or any 0 to 5V DAC output can be used as the 0 to 5V input to allow complete digital control of the DDS output voltage.

Figure 2 – A simple OPAMP circuit added to the AD9834 can give the AD9834 a 4:1 output Adjustment Range for a 0 to 5 Volt input signal. The Input signal could be a Potentiometer or from a 0 to 5V DAC.

The limiting factor in the maximum achievable adjustment range of the circuit in Figure 2 is the AC performance of the DDS DAC. While the output can be adjusted down from its maximum value, the feedthrough glitches from the DAC switches will remain the same and the linearity of the DAC will suffer at lower output levels. Note also that any excess noise on the 0 to 5V control voltage will cause amplitude modulation on the DDS output, so add filtering as may be required by your application.

I have used this circuit in Figure 2 for a 4:1 adjustability with decent results. Your mileage may vary, so be sure to check the AC parameters that are important in your specific application.

Multiplying DAC on the DDS Output

For the ultimate in digital adjustability a Multiplying DAC  (MDAC) can be used at the output of the DDS to get 2^14 (16384:1) or better than 80 dB of adjustment range.
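The "better than 80 dB" figure follows directly from the voltage ratio. A minimal check, using the standard 20·log10 conversion for voltage ratios (the function name is just for illustration):

```python
import math


def ratio_to_db(voltage_ratio: float) -> float:
    """Convert a voltage ratio to decibels."""
    return 20.0 * math.log10(voltage_ratio)


print(round(ratio_to_db(2 ** 14), 1))  # 84.3
```

This is the ideal range set by the number of bits alone; feedthrough and noise in a real circuit will reduce what is usable.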

The AD5453 family from Analog Devices is a very high bandwidth Multiplying DAC [3] that comes in 8, 10, 12 and 14 bit resolutions. It takes in an AC reference voltage and outputs a Current that is scaled by a Digital Control Word.

Most MDAC's have a very low -3 dB bandwidth, on the order of 20 kHz. The High Speed AD5453, when used with a suitable OPAMP output, has a -3 dB bandwidth of 10 MHz or better. The maximum attenuation (or how low you can control the output) is flat to 300 kHz at 14 bits, rising to 1 MHz at 12 bits, 3 MHz at 10 bits and finally 10 MHz at 8 bits.

Figure 3 – Combining a high speed AD5453 MDAC and a LT1087 Dual OPAMP allows very complete control of the AD9834 DDS output.

Note: The 10 uF capacitor sets the low frequency roll off. With the 10 uF value shown the low frequency, -3 dB point is below 2 Hz. (The input resistance of the AD5453 VREF Pin is 7k Ohms minimum).

Note: The 1.5pF capacitors should be adjusted in the final circuit for maximum output flatness over frequency.

The circuit of Figure 3 provides a +/- 5 Volt output with low distortion to 1 MHz and provides 80 dB plus of output voltage adjustment range. Additionally the output can be offset from -5V to +5V with the addition of an offset adjustment control via a low cost DAC or POT (Offset Adjust Input).

DDS Output Control At Even Higher Frequencies

If you need to operate the AD9834 at even higher frequencies, closer to the maximum specified fundamental output of 37.5 MHz, or even operating in “Super Nyquist” mode [4], you should look at a 50 Ohm CMOS RF Attenuator like those manufactured by Peregrine Semiconductor [5]. The PE43711 has a frequency range down to 9 kHz and 31 dB of control with 0.25 dB steps all the way to 6 GHz. At higher frequencies you will probably be designing around 50 ohm circuit impedances anyway, so this should not be much of an issue. Multiple PE43711's can be connected in series to get more attenuation in 31 dB chunks.

Note: At lower RF frequencies, less than about 50 MHz, CMOS, SiGe and Silicon based IC's are preferred to GaAs IC's. This is because the GaAs IC's typically have worse harmonic distortion (Especially very bad 2nd harmonic distortion) at low RF frequencies.


[1] The NE555 timer is arguably the most popular linear IC of all time.

[2] Linear Technology LT1677 Precision and LT1087 High Speed OPAMPS are manufactured by Linear Technology Inc. Now a part of Analog Devices.

[3] Analog Devices Application Note: “Multiplying DACs Excel at Handling AC Signals”.

[4] Super Nyquist Mode, See: Analog Devices Application Note AN-939

[5] Peregrine Semiconductor, Inc

Article By: Steve Hageman 


Thursday, January 12, 2017

Numeric optimizations in C# for a faster DFT

Kirk: "More speed Scotty !!!"

Like Capt. Kirk, we engineers, it seems, are never satisfied and always want more speed!

How to get more speed out of DFT calculations in C#

Currently the open source project: "DSPLib" [1] calculates Discrete Fourier Transforms (DFT) in double precision math, so the question can be asked: Could using some other data type produce faster DFT calculations? And by fast I mean fast enough to be worthwhile spending the time doing – like 50% faster as 10% faster would likely not be worth the effort.

After reading many articles and blogs on how the .NET framework, the CLR (Common Language Runtime) and Intel Processors handle math, one will find seemingly convincing evidence and very simple benchmarks that Double and Float math calculation takes about the same time on a modern processor. There is also some circumstantial evidence that Floating Point Math is as fast as integer math because the Floating Point Unit is so fast in modern CPU's. Likewise Int64's should theoretically be about the same speed as Int32's because people say that Intel processors calculate everything as Int64's and truncate for shorter data types anyhow.

But... How would all this apply to an actual DFT algorithm? A DFT moves around a fair amount of data in arrays. Can a different numeric format speed up the DFT calculations that I specifically want to make?

I didn't know, but I remembered my Mark Twain. He popularized the saying:  

     "There are three kinds of lies: lies, darned lies, and statistics."

Mr. Twain was a humorist, but if he was a programmer he might have changed the saying to,
     "There are three kinds of lies: lies, darned lies, and benchmarks." 

This applies aptly to what I was faced with – one can find many very simple “Benchmarks” that people have written to show that one numeric type is faster than another or that the numeric format makes no difference. The trouble is these don't store intermediate results or access arrays over and over the way a DFT does. Are they really applicable? I set out to find out…

My DFT Benchmark

I roughed out a basic DFT (all the same loops, arrays and multiplies). I used look up tables instead of calculating the Sin/Cos terms in the loop, as that is the way I would use them anyway, and I did not do any further optimization or try to help or hinder the C# compiler from optimizing anything. I also did not invoke the task parallel extensions, which have been shown to immediately provide a 3x improvement in performance even with a dual core processor [1][2].

// Note: This is not a working DFT - For simplified timings only.
// It does have the same data structures and data movements as a real DFT.
public Int64[] DftInt64(Int64[] data)
{
    int n = data.Length;
    int m = n;
    Int64[] real = new Int64[n];
    Int64[] imag = new Int64[n];
    Int64[] result = new Int64[m];
    for (int w = 0; w < m; w++)
    {
       for (int t = 0; t < n; t++)
       {
          real[w] += data[t] * Int32Sin[t];
          imag[w] -= data[t] * Int32Cos[t];
       }
       result[w] = real[w] + imag[w];
    }
    return result;
}

Figure 1: The basic code that I used to benchmark my DFT. I wrote one of these for every numeric data type by changing the variables to the type tested. I initialized the input data and the Sin / Cos arrays (Int32Sin[] and Int32Cos[] above) with real Sin and Cos data once at the start of the program. I then called the DFT's over and over again until the elapsed time settled out, which I believe is an indication that the entire program was running in the processor's cache or at least a stable portion of memory. This procedure was repeated for each DFT Length.
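The "call it until the timing settles" procedure can be sketched as follows. This is a Python analog for illustration only, not the C# benchmark that produced the results: `time.perf_counter` stands in for the .NET timer, and the settling rule (two consecutive runs within 5%), the table scaling of 1000 and the function names are all assumptions.

```python
import math
import time


def dft_like(data, sin_tab, cos_tab):
    """Same loop structure as Figure 1; not a real DFT."""
    n = len(data)
    result = [0] * n
    for w in range(n):
        real = 0
        imag = 0
        for t in range(n):
            real += data[t] * sin_tab[t]
            imag -= data[t] * cos_tab[t]
        result[w] = real + imag
    return result


def time_until_stable(func, args, tolerance=0.05, max_runs=50):
    """Repeat func(*args) until two consecutive timings agree within tolerance."""
    previous = None
    for _ in range(max_runs):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        if previous is not None and abs(elapsed - previous) <= tolerance * previous:
            return elapsed
        previous = elapsed
    return previous  # never settled; report the last run


n = 64
sin_tab = [round(1000 * math.sin(2 * math.pi * t / n)) for t in range(n)]
cos_tab = [round(1000 * math.cos(2 * math.pi * t / n)) for t in range(n)]
data = list(range(n))
print(time_until_stable(dft_like, (data, sin_tab, cos_tab)))
```

Filling the tables once, outside the timed region, mirrors the setup described in the caption, so only the loop and array traffic are measured.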

I wrote DFT routines to calculate in: Double, Float, Int64, Int32 and Int16 data types and compiled release code in VS2015 for x86 and x64 architectures. The results of these tests are presented in figure 2 below.

Figure 2A – The raw results of my real world benchmark for DFT's. All the times are in Milliseconds and were recorded with .NET's Stopwatch class. Two program builds were recorded. One for the x86 and one for the x64 code compilation types in Visual Studio 2015.

Figure 2B – Raw timing numbers are a little hard to compare and comprehend. To improve on this, the view above normalizes each row to the Double data type time for that size DFT, which makes it easier to compare any speed gains in the results. For instance: In the X64 Release version, for a DFT size of 2000, the Int16 data type took 0.28:1 the time of the Double data type (10 ms / 36 ms = 0.28).
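The normalization itself is a one-line division per entry. The sketch below uses only the two timings quoted in the caption; the function name and the dictionary layout are hypothetical.

```python
def normalize_to_double(row_ms):
    """Divide every timing in a row by that row's Double timing."""
    base = row_ms["Double"]
    return {name: round(t / base, 2) for name, t in row_ms.items()}


row = {"Double": 36, "Int16": 10}  # x64, DFT size 2000, times in ms
print(normalize_to_double(row))    # {'Double': 1.0, 'Int16': 0.28}
```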

Int16 Compared to Int32 
The Int16 data type timing is comparable to that of the Int32 for both the x86 and x64 compiled programs. An interesting note is that the Int16, compiled as an x86 program, actually starts to take longer than the Int32 as the DFT size gets really big. This is probably due to the compiled code having to continually cast and manipulate the Int16 values to keep them properly at 16 bits long.

The bottom line is: Int16's are no faster, and in some cases slower, than Int32's, so there is really no point in considering them further in this discussion of speeding up a DFT calculation.

Using an Int16 would also severely limit the dynamic range to less than 96 dB. Many digitizers have better dynamic range than this now. I would consider this a big limiting factor.

Int32 Compared to Double 
The Int32 is much faster than a Double for small DFT's at all compilations (x86 or x64). However, as the DFT size gets really big the speed difference disappears. This is especially true for the x86 compilation, where a large Int32 DFT is actually slower than the equivalent Double DFT. With x64 compilation, the Int32 is still faster even at large DFT's but it too shows this non-linear behavior as the DFT size increases. This non-linear behavior is probably due to the data arrays not fitting in the processor's fastest cache as the arrays get larger.

Interesting Point #1: Benchmark timings can be data array size dependent and in this particular case are non-linear.

Int32 compared to Int64 
For the x86 compilation the Int64 is definitely a non-starter. This makes sense, as all the 64 bit calculations would need to be stitched together from 32 bit registers. In the x86 case the Int64 is actually slower than the Double data type for all DFT sizes tested!

With the x64 compilation the Int32 and Int64 data types are comparable across all DFT sizes.

It probably does not make any sense to favor the Int64 over an Int32 even for the increased dynamic range. An Int32 calculation can yield a 192 dB dynamic range. This is way more than most applications can use or will ever require.

Interesting Point #2: Somewhat unsurprisingly, when using the x86 compilation, the use of Int64's is very time consuming, especially when compared to the Int32. More surprising is that the Int64 actually takes longer than the equivalent Double.

Float compared to Double 
In both x86 and x64 compilation the Float is quicker than the Double data type. It is twice as fast at smaller DFT sizes, and the speeds become comparable at very large DFT sizes. That is not the common result that you can glean from the internet. Most people's benchmarks and discussions suggest the Double and Float data types will have the same execution time.

Interesting Point #3: Despite what the Internet says, the Float data type is faster than Double, especially for small DFT sizes in this benchmark.

It is clear that for the fastest DFT calculations the x64 compilation should be used no matter what. This would not seem to be a hardship for anyone as 64 bit Windows 7 (and Win 10?) pretty much rules the world right now and everything from here on out will be at least 64 bits.

Even though the Float data type is usually faster than Double, the clear winner is to use the x64 compilation and the Int32 data type.

At the smaller DFT sizes the Int32 is nearly 3 times faster and even at really large DFT sizes the Int32 is still 25% faster than the Double data type.

Using an Int32 is not a resolution hardship in most real world cases. Most digitizers output their data as an integer data type anyway, and an Int32 data type can handle any realistic 8 to 24 bit digitizer in use today.

The Int32 data type can provide a dynamic range of 192 dB, which seems to be plenty for a while at any rate. As a comparison the Float data type provides about 150 dB of dynamic range.

Next Steps 
Now I have some real and verified benchmarks that can lead me to the best bet on improving the real time performance of a .NET based DFT.

The next obvious steps to optimize this further for the 'Improved' DSPLib library are:

1) Apply Task Parallel techniques to speed the Int32 DFT up for nearly no extra work as the DSPLib library already does [1].

2) Having separate Real[] and Imaginary[] intermediate arrays probably prevents the C# array bounds checker from being the most effective. Flattening these into one array with double the single dimension size will probably yield a good speed increase. This however needs to be verified (Again: Benchmark). References 3 and 4 provide some information on how to proceed here.
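
As a rough illustration of the flattened layout described in step 2, here is a minimal sketch. It is written in C rather than C#, so it says nothing about what the .NET bounds checker will actually do; it only shows the single interleaved array. The function name dft_int32_interleaved and its parameters are made up for this example, the sine and cosine tables are assumed to be precomputed by the caller as in the benchmark, and no overflow protection is included.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive O(n^2) DFT writing to one interleaved output array:
 * out[2*k] is the real part and out[2*k + 1] the imaginary part of bin k.
 * cos_tab[i] and sin_tab[i] must hold cos(2*pi*i/n) and sin(2*pi*i/n),
 * multiplied by whatever integer scale factor the caller has chosen.
 * Hypothetical helper: the caller must keep the sums within int32_t range. */
void dft_int32_interleaved(const int32_t *in,
                           const int32_t *cos_tab,
                           const int32_t *sin_tab,
                           int32_t *out,
                           size_t n)
{
    for (size_t k = 0; k < n; k++) {
        int32_t re = 0;
        int32_t im = 0;
        for (size_t j = 0; j < n; j++) {
            size_t idx = (k * j) % n;   /* table index for angle 2*pi*k*j/n */
            re += in[j] * cos_tab[idx];
            im -= in[j] * sin_tab[idx];
        }
        out[2 * k]     = re;            /* real and imaginary parts sit */
        out[2 * k + 1] = im;            /* side by side in one array    */
    }
}
```

With a 4 point table (cosine 1, 0, -1, 0 and sine 0, 1, 0, -1) and the input 1, 0, -1, 0, bins 1 and 3 each come back as 2 + 0j and the other two bins are zero. Whether this layout is actually faster in C# still has to be benchmarked.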

At least I have separated the simplistic Internet benchmarks from the real facts as it applies to my specific DFT calculation needs.

The takeaway from all this is: To really know how to optimize even a relatively simple calculation, some time needs to be spent benchmarking actual code and hardware that reflects the actual calculations and data movement required.


[1] DSPLib: An open source FFT / DFT Fourier Transform Library for .NET 4
DSPLib also applies modern multi-threading techniques to get the highest speed possible for any processor as explained in the article above. 

[2] The computer that I ran these benchmarks on is a lower end dual core i7-5500U processor with 8 GB of RAM running 64 Bit, Windows 7 Pro. Higher end processors generally have more on board Cache and will generally give faster results especially at larger DFT sizes.

[3] Dave Detlefs, “Array Bounds Check Elimination in the CLR”

[4] C# Flatten Array

NOTE: The C# Compiler is always being worked on for optimization improvements. C# V7 is just around the corner in January 2017 and it has some substantial changes, these changes may also improve or change the current optimization schemes. If in doubt - always check what the current compiler does.

Article By: Steve Hageman 
We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

This Blog does not use cookies (other than the edible ones). 

Sunday, June 19, 2016

DSPLib – An Open Source FFT Library for .NET 4

There are many Open Source and Commercial implementations around the web and in textbooks for computing Fourier Transforms. Unfortunately most are flawed in a number of ways:
  1. They produce an un-calibrated result that changes depending on the number of points transformed.
  2. They include no built in methods to scale for Windowing of the input data.
  3. They have no proper way to measure noise accurately.
  4. They don't size the returned spectrum to have just the real part of the spectrum.
  5. They implement their own Complex number type, ignoring .NET 4's built in Complex data type.
  6. They aren't complete. You have to add a bunch of helper routines every time.
  7. They have restrictive Open Source Licenses.

All of these things take hours of tweaking to get a usable FFT or DFT running from even the best of the currently available libraries.

I decided to solve this problem for myself and my clients by making a pretty complete Fourier Transform Library that implements:
  1. Properly Scaled Fast Fourier Transforms.
  2. Properly Scaled Discrete Fourier Transforms.
  3. Properly Scaled Data Windowing.
  4. Proper functions to scale for noise and signals in the correct manner.
  5. Signal Generation for Testing.
  6. Useful Array Math routines.

This work is the culmination of about 5 years worth of work using, revising and tweaking other libraries and implementations, both open source and commercial before I wrote my own.

DSPLib is the first Fourier Transform Library that can take any time domain signal input, like from an ADC, apply one of the 27 built in window types and produce a correctly scaled Spectrum Output for either signal or noise analysis with no code tweaking required at all.

The library is released under the very non restrictive MIT License and is essentially royalty free for any use, even commercial.

The complete write up is at – take a look and enjoy never needing to spend hours tweaking Fourier Transform code again.


Article By: Steve Hageman 


Monday, April 25, 2016

FFT's meet 200MHz, PIC32MZ Microprocessors


Lately I have been working on a series of Analytical Instruments that are roughly “Hybridized” Lock In Amplifiers[1]. These designs are Hybrids because the purely Analog Lock In Amplifier block diagram has been augmented to incorporate Digital Signal Processing functions by digitizing the waveforms then applying DSP Techniques (including FFT's) for noise reduction, reference signal phase comparisons and maintaining the quadrature phase lock of the synchronous detector function, which is also a digital processing block. In addition, sophisticated noise filtering can be accomplished all in software at very respectable update rates.

This is all made possible and simple because of the latest crop of single chip microprocessors that are capable of running at 200 MHz core frequencies and have specialized DSP functions like: DMA and DSP Processing instructions that allow them to beat even most specialized DSP Processors at their own game. The latest Microchip PIC32MZ processors even include free and quite capable DSP libraries so it really is one stop shopping now. All this power for less than the price of a good lunch, as these chips cost less than $10 each.

PIC32MZ FFT Performance

The MIPS core based PIC32MZ processors running at 200 MHz have the potential to do a really fast and big FFT. To find out just how well they do, I fired up my PIC32MZ EF / Connectivity Evaluation board from Microchip Technology (DM320007) to answer a few questions.

The first question is always: “How fast is the FFT?” There is, after all, no point in asking any other questions if a 1024 point FFT takes all day.

If the answer to the first question was reasonable, the next question should be: “How big of an FFT can I do?”, followed by: “What type of dynamic range can I expect?” and finally, perhaps more towards the implementation side is the question: “How is the FFT scaled, so I can get a calibrated response?”.

For the FFT implementation I used the DSP library included with version 1.07.01 (March 2016) of Microchip's Harmony Software suite. This library works on the well known Q15 or Q31 fixed point formats. If you are unfamiliar with Q15 and Q31, just think of them like signed 16 and 32 bit integers [2].
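
For anyone who wants something more concrete than "think of them like integers", the sketch below shows one common way to convert a fractional value in the range -1.0 to just under +1.0 into Q15 or Q31 and back. The helper names are made up for illustration and are not part of the Harmony DSP library; rounding to nearest and saturating at the limits are my assumptions about sensible behavior, not something the library documentation specifies here.

```c
#include <stdint.h>

/* Convert a fraction in [-1.0, 1.0) to Q15: 1 sign bit, 15 fractional bits.
 * Values outside the representable range are saturated. */
int16_t double_to_q15(double x)
{
    double scaled = x * 32768.0;                 /* 2^15 */
    if (scaled >= 32767.0)  return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    /* Round to nearest by nudging half a count away from zero. */
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Same idea for Q31: 1 sign bit, 31 fractional bits. */
int32_t double_to_q31(double x)
{
    double scaled = x * 2147483648.0;            /* 2^31 */
    if (scaled >= 2147483647.0)  return INT32_MAX;
    if (scaled <= -2147483648.0) return INT32_MIN;
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert a Q15 value back to a fraction. */
double q15_to_double(int16_t q)
{
    return (double)q / 32768.0;
}
```

Note that +1.0 itself is not representable, so a full scale positive peak saturates to 32767 in Q15 and 2147483647 in Q31.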

The Harmony libraries are easy to use and well documented. Only two functions need to be called to accomplish an FFT. An initialization routine must be called first (and only once) to load up the twiddle factors for the specific length of FFT. Then the FFT itself can be called to actually perform the FFT. To get more information on the libraries, open the Harmony Help and search for “DSP”.

The libraries contain both a Q15 and a Q31 version, so I benchmarked both. About the only guidance that Microchip gives in the documentation is that the Q15 version will probably be faster due to optimization, and my testing proved that to be true.

I used my PC and a quick C# program I wrote that uses a USB based Serial connection to load the test waveform down to the PIC's RAM and then to read the results back to the PC for final processing and display. The test waveform was a sine wave that was properly scaled to either 16 or 32 bits. Since the sine wave was calculated with .NET doubles, it was more accurate than the 16 or 32 bit values it was finally cast to. This means that the sine wave distortion was not the limiting factor in dynamic range in any test case.

For timing I used the MIPS Core Counter. This counter runs at ½ the core frequency or in this case 100 MHz. The Core timer was read at the start of the routine and after the routine finished, then the counts are subtracted to get the counts between calling the routine and its return. This count is converted to the equivalent time in microseconds for the timings.
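
The count to microseconds arithmetic described above can be sketched as follows. The function names are hypothetical, and the start and end counts are passed in as plain numbers because how the counter is read is specific to the hardware and compiler. The only facts taken from the text are that the counter ticks at half the core frequency (100 MHz for a 200 MHz core) and that the two readings are subtracted.

```c
#include <stdint.h>

/* Ticks between two readings of a free-running 32 bit counter.
 * Unsigned subtraction gives the right answer even if the counter
 * wrapped around once between the two readings. */
uint32_t core_timer_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to whole microseconds for a counter that runs at half
 * the core clock. With core_hz = 200000000 this is ticks / 100. */
uint64_t core_ticks_to_us(uint32_t ticks, uint32_t core_hz)
{
    /* ticks / (core_hz / 2) seconds, times 1e6, done in integer math. */
    return ((uint64_t)ticks * 2000000u) / core_hz;
}
```

At 100 MHz the 32 bit counter wraps roughly every 43 seconds, which is far longer than any of the FFT times measured here.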

No interrupts were enabled in the program and the PIC program was running in a big loop, where the FFT functions were called in a blocking fashion. I allocated the data arrays at compile time and didn't use anything like malloc() when the program was running. I also used the free version of Microchip's XC-32 compiler with the Harmony Framework and no optimizations turned on [3].

The Evaluation board is fitted with the largest PIC32MZ available, a PIC32MZ2048EFH144. This device has a whopping 2MB of Program Flash and 512kB of SRAM on board. This large amount of SRAM allows for some rather large FFT's to be run.

The whole program even with the command interpreter I built took less than 1% of the program flash memory.

Now - On with the show

The first test is naturally to find the FFT speed versus size of the FFT for both the Q15 and Q31 formats.

Table 1 – The FFT speed in micro-seconds was measured for each FFT size for both the Q15 and Q31 formats. The Compiler could only allocate enough memory for a 16k Q15 FFT and 8k Q31 FFT.

The Harmony documentation states that the FFT Initialization function is written in C code and as such, is slower. This is OK, as the initialization function only needs to be called once or if the FFT size changes.

Table 2 – The Q15 and Q31 Initialization function does take longer than the FFT itself, but this function only needs to be called once or if the FFT size changes. The table values are in micro-seconds.

Figure 1 – The Q15 and Q31 FFT times are plotted for a graphical comparison.

Determining the FFT Speed for various FFT sizes actually answered two questions. Eventually, as I increased the FFT size, I ran into a point where the program execution crashed when I tried to run the FFT. I was able to coax a 16k Q15 and an 8k Q31 FFT out of the processor before it would not run anymore.

If your application needs to apply a window to the time data, that is another vector array that would need to be maintained in RAM. Vector averaging and display buffering also require vector arrays and these can quickly eat up all available RAM so that even these FFT sizes may not be achievable in a full featured application as RAM gets gobbled up very quickly as the FFT sizes increase.
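
A quick way to see how fast the RAM disappears is to add up the buffers. The sketch below assumes each FFT point is stored as a complex pair (real and imaginary), with 2 bytes per component for Q15 and 4 bytes per component for Q31. The function names and the idea of simply counting equal sized buffers are illustrative assumptions; how many working arrays a particular library call really needs should be taken from its documentation.

```c
#include <stddef.h>

/* Bytes needed for one complex array of the given length.
 * bytes_per_component is 2 for Q15 data and 4 for Q31 data. */
size_t complex_buffer_bytes(size_t points, size_t bytes_per_component)
{
    return points * 2u * bytes_per_component;   /* real + imaginary */
}

/* Bytes needed for several equally sized complex arrays, for example
 * input, output, window and averaging buffers. */
size_t total_buffer_bytes(size_t points,
                          size_t bytes_per_component,
                          size_t buffer_count)
{
    return complex_buffer_bytes(points, bytes_per_component) * buffer_count;
}
```

Under these assumptions a 16k point Q15 array and an 8k point Q31 array are both 64 kB, so just four such arrays would use half of the 512 kB of SRAM.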

Dynamic Range – Or how many bits is that?

Dynamic range in the FFT processing is an important factor in determining whether a Q15 or Q31 format FFT is needed.

For instance a 12 bit ADC can have around 72 dB of dynamic range with no averaging or processing gain applied and a 24 Bit ADC can have a basic 144 dB of dynamic range.
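
These figures come from the usual rule of thumb that each bit is worth about 6 dB, since 20*log10(2) is roughly 6.02. A small sketch of that arithmetic follows; the function name is hypothetical, and it covers only the ideal bit count ratio, not the noise of any real converter.

```c
/* Ideal dynamic range in dB for a given number of bits, using
 * 20*log10(2^bits) = bits * 20*log10(2), which is about bits * 6.0206. */
double ideal_dynamic_range_db(unsigned bits)
{
    const double db_per_bit = 6.0206;   /* 20*log10(2), rounded */
    return db_per_bit * (double)bits;
}
```

For 12 bits this gives a little over 72 dB and for 24 bits a little over 144 dB, in line with the round numbers above.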

It wouldn't do you very much good to attempt to use a 24 Bit ADC with a 12 Bit dynamic range signal processing chain. You'd just be in the numerical noise with no way of averaging your way out of it.

To determine the FFT's dynamic range I generated a perfect sine wave with my PC and then scaled it to either Q15 or Q31 full scale format. Then I sent this waveform to the PIC and had the PIC do a FFT on the data. I returned the complex FFT result to the PC where I converted it to magnitude dB format for display.

Below are the results for a large FFT with a full scale signal for both the Q15 and Q31 format FFT's.

Figure 2 – The Q15 FFT has about a 74 dB full scale to numeric noise dynamic range.

Figure 3 – The Q31 format has a more 'Spurious' noise look to it but the full scale range is just around 140 dB. Good enough for 16, 18 and probably 24 bits depending on the application's exact needs. Note that the 'spurious looking' noise floor is real and not some internal limiting problem. I backed off the input signal amplitude to make sure that some calculation was not saturating and it had no effect on the amplitude of the noise spurs.

Anybody have a ruler?

As I mentioned at the start – only if an affirmative answer is given to the speed, size and dynamic range questions should you start worrying about the “How is the FFT Scaled?” question.

Actually the Harmony FFT implementation is quite easy to work with. Most FFT implementations that you will find scale the amplitude with the size of the FFT or 'N'. This means that you have to apply a scale factor proportional to 1/N to the result to get a constant amplitude regardless of the FFT size. Not here, the proportional to 1/N scaling has already been applied, meaning that you will get the same amplitude output for any size FFT. Points scored for whoever wrote this FFT!

The amplitude that you get back depends on the format that you used in the first place. If you use a Q15 format FFT, for a full scale peak to peak sine wave input, you will get out a peak magnitude of (2^15)/2 or 16,384. Similarly the output of a Q31 format FFT will be (2^31)/2 = 1,073,741,824.
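
The full scale numbers above can be turned into a simple calibration step: divide the returned magnitude by the expected full scale magnitude for the format. This is only a sketch of that arithmetic with made up helper names; it is not a Harmony library function, and it assumes no window or zero padding has been applied.

```c
#include <stdint.h>

/* Expected peak magnitude for a full scale sine wave input:
 * (2^frac_bits)/2, i.e. 16,384 for Q15 and 1,073,741,824 for Q31. */
int64_t fft_full_scale_magnitude(unsigned frac_bits)
{
    return ((int64_t)1 << frac_bits) / 2;
}

/* Magnitude of one bin as a fraction of full scale (1.0 = full scale). */
double fft_relative_amplitude(int64_t magnitude, unsigned frac_bits)
{
    return (double)magnitude / (double)fft_full_scale_magnitude(frac_bits);
}
```

For example, a Q15 bin magnitude of 8,192 corresponds to a sine wave at half of full scale. Because the library has already taken care of the 1/N scaling, the same divisor applies at every FFT size.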

The ½ effect is due to the fact that the actual output of a FFT is a mirror of positive and negative frequencies, with the sine wave's amplitude split equally between the two sides (so ½ the total power is in each). Since we are only normally concerned with the positive frequency side, we observe that the magnitude there is ½ of the input amplitude.

You can easily apply a constant multiplier to get any proper scale factor that you need to the FFT result. Remember that the input signal, if it is a sine wave will probably be measured as a peak to peak value, whereas the FFT Magnitude display shows the RMS value of the input signal at each specific frequency bin.

Also remember that any windowing applied to the time series, or zero padding, will also affect the amplitude of the resulting FFT output, and this must be accounted for by proper scaling. You can check out some previous articles for more about how to do that [4] [5].


It is simply unbelievable what we can do now with a single chip micro-controller when they cost less than lunch and run at 200 MHz. It wasn't too long ago that I had a 33 MHz, 386 Desktop PC and this little 32 Bit PIC can do a FFT faster than that PC could!

Article References

[1] Lock-In-Amplifiers, Wikipedia

[2] Q / Fixed Point Notation formats, Wikipedia

[3] Most of the DSP library that really needs speed is provided in object file format or coded in assembly already. Using the 'Pro' level compiler will probably not increase the speeds reported here.

[4] Hageman, Steve, EDN Online, June 19, 2012, “Understanding DFT & FFT Implementations”

[5] Hageman, Steve, EDN Online, August 6, 2015, “Real Spectrum Analysis with Octave & MATLAB”

Article By: Steve Hageman