Wednesday, August 9, 2017

FTDI / FT232R USB to Serial Bridge Throughput Optimization

Using a USB / UART Bridge IC like the FTDI FT232R is kind of a funny thing. What with "Latency Timers", "Packets", "Buffers" and whatnot, you soon find that it is not like a traditional RS232 port on a PC. Meaning: at higher BAUD rates you will probably notice that it is slower than your old PC with a dedicated serial port.

Problem is you can't get a PC with a serial port anymore so we are kind of forced to use the USB equivalent.

There are ways to speed up the USB / UART bridge however, so read on (for the remainder of the article I will call the USB / UART bridge by the simple name: Bridge).

USB Transfer Background

A very popular, reliable and low cost Bridge chip is the FTDI FT232R. These FTDI chips have various settings: Baud rate, Packet Size, Latency Timer, Data Buffer and Flow Control pins and these all conspire together to meter data flowing across the USB link [1].

Baud Rate – This is the rate at which the FT232R UART will communicate to the attached downstream serial port. Normally on the downstream side this will connect to a processor’s UART port. The maximum BAUD rate is dependent on the application. If the end user is going to use some prewritten application, like a terminal program, then there will be some constraints that you won’t be able to exceed. 115.2k is the minimum BAUD rate that every modern application would be expected to support. Most recent PC applications will support several BAUD rates above this, usually up to 921.6k.

The FTDI driver also has the ability to ‘alias’ higher rates to lower numbers to fool PC Applications into supporting faster BAUD rates. See reference 2 for more information on how to do this.

If you are going to write your own application and plan on using the FTDI DLL interface, instead of the Virtual Com Port (VCP) then you can use any BAUD rate that the processor and FT232R will mutually support. The maximum BAUD rate for the FT232R chip is 3M BAUD, which is easily supported on all modern 32 bit processors.

Note that there is no physical UART on the PC side, so Baud Rate means nothing on the PC side. This parameter is passed to the FTDI chip and that is how it sets its downstream BAUD rate to the attached processor, which does have a physical UART.

Packet Size – In these kinds of USB transfers the basic data packet size is 64 bytes. FTDI reserves 2 bytes for its own use, so the user gets 62 bytes of data to use in every USB packet transfer. The packet size is fixed by the way USB works and can’t be changed.
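To put numbers on that, here is a minimal sketch of the packet arithmetic, using only the 64 byte / 62 byte figures quoted above. The helper name usb_packets_needed is made up for this illustration and is not part of any FTDI API.

```c
#include <stddef.h>

/* Figures taken from the text: 64 byte USB packets, 2 bytes used by FTDI. */
#define USB_PACKET_BYTES   64u
#define FTDI_STATUS_BYTES   2u
#define PAYLOAD_BYTES      (USB_PACKET_BYTES - FTDI_STATUS_BYTES)   /* 62 */

/* Hypothetical helper: number of USB packets needed to carry
   data_bytes of user data, rounded up to a whole packet. */
size_t usb_packets_needed(size_t data_bytes)
{
    return (data_bytes + PAYLOAD_BYTES - 1u) / PAYLOAD_BYTES;
}
```

For example, a 6145 character reply needs 100 packets, because 99 packets only carry 99 x 62 = 6138 bytes.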

The Data Buffer – The data buffer is set to 4k bytes by default (in the driver .INF file) and its part in metering data is this: The driver requests a certain amount of data from the device. The driver will continue to receive data until the buffer fills or the latency timer times out, then a USB transfer to the upstream program in the PC will take place. Valid buffer sizes are from 64 to 65536 in steps of 64 bytes. In the FTDI DLL Driver [3] the buffer size can be set by the command,

    FT_SetUSBParameters (FT_HANDLE ftHandle, DWORD dwInTransferSize, DWORD dwOutTransferSize)

Note: Only the InTransferSize can be set, the OutTransferSize parameter is only a placeholder and with the FT232 does nothing [3].

Note: The data buffer is physically on the upstream PC Side and is held in the USB Host Controller Driver.

The Latency Timer – This timer is set to 16 milliseconds (mSec) by default. If there is data in the Bridge when the latency timer times out then this data is sent. So at worst (with default settings) there is a 16 mSec delay in getting small amounts of data across the link. Valid Latency values are from 2-255 mSec. In the FTDI DLL Driver [3] the latency can be set by the command,

    FT_SetLatencyTimer (FT_HANDLE ftHandle, UCHAR ucTimer)

Note: The Latency Timer is physically on the upstream PC Side and is implemented in the USB Host Controller Driver.
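Since both settings have hard limits, it can help to sanitize values before handing them to the driver. The following is only a sketch built on the ranges quoted above (64 to 65536 bytes in steps of 64, and 2 to 255 mSec). The helper names are hypothetical, and rounding down to the nearest step is my own assumption, not something the driver documentation prescribes.

```c
/* Limits quoted in the text above. */
#define BUF_MIN      64ul
#define BUF_MAX   65536ul
#define BUF_STEP     64ul
#define LAT_MIN_MS    2u
#define LAT_MAX_MS  255u

/* Hypothetical helper: round a requested buffer size down to a
   multiple of 64 bytes, then clamp it into the 64..65536 range. */
unsigned long clamp_in_transfer_size(unsigned long requested)
{
    unsigned long size = (requested / BUF_STEP) * BUF_STEP;
    if (size < BUF_MIN) size = BUF_MIN;
    if (size > BUF_MAX) size = BUF_MAX;
    return size;
}

/* Hypothetical helper: clamp a requested latency into 2..255 mSec. */
unsigned char clamp_latency_ms(unsigned int requested)
{
    if (requested < LAT_MIN_MS) return (unsigned char)LAT_MIN_MS;
    if (requested > LAT_MAX_MS) return (unsigned char)LAT_MAX_MS;
    return (unsigned char)requested;
}
```

The results would then be passed as dwInTransferSize to FT_SetUSBParameters and as ucTimer to FT_SetLatencyTimer.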

Flow Control Pins – For the FT232 chip, the control pins have special meaning and functions. If the downstream side changes the state of one of the flow control pins then the buffer, empty or not, is sent at the next possible moment. This can be used to advantage by the downstream processor to signal the end of a transmission and to get the data sent ASAP to the PC. The PC can also control the downstream flow control lines. For instance the DTR line can be connected to the DSR line at the FT232R chip. Then the PC can change the state of the DTR line, which will cause the DSR line to change state and the transfer will immediately be initiated. All of the downstream to upstream Flow Control pins operate at the same priority, so there is no advantage to using one over the other if you are using them to just initiate a data transfer quickly.

Note: If you are using the Virtual Com Port (VCP) interface the Baud Rate, Latency and Buffer size can be controlled by: Editing the USB Driver “.INF” file, changing the appropriate registry keys or by using the Device Manager. There is no simple programming interface that I am aware of. Yet another reason to use the FTDI DLL for interfacing instead of the VCP, especially for applications that use custom programming on the PC side.

Understanding The Problem

Maximum throughput occurs when the maximum number of packets are sent in the minimum time. As a designer you have control over the: Baud Rate, Latency Timer, Buffer Size and the Flow Control pins.

So it is obvious that some combination of these four parameters will result in the fastest possible data transfer rate. The question then becomes: What are the optimum settings?

First we should study how my particular downstream processor and the PC communicate.

My application for this example is the control of an external digitizer. The external digitizer instrument has a 32 bit processor that is connected to the FT232R chip through a UART in the processor. My command structure is always a “Command / Response” type of communication.

Case 1: To start a digitizing data capture I can send a trigger command from the PC like: “TRIG:IMM” the command string is terminated with a “Linefeed” character. This command is decoded in the instrument processor to signify that a TRIGger / IMMediate command should take place and the downstream processor starts the data acquisition process.

To keep the PC and instrument in sync, the digitizer then sends back an acknowledgment when the command has finished. The acknowledgment chosen is the same ‘Linefeed’ character.

This way the PC and the downstream processor can always stay in sync and they both know when the communication has finished because they both wait for the ‘Linefeed’ character before proceeding on. I typically don’t use any other handshaking (or flow control).

In my simplest case (as above) the command from the PC might be 1 to 20 characters long and the response is always just a single character (the ‘Linefeed’).

Case 2: This is when the digitizer has captured all the data and the data is sent back to the PC for analysis. Again a simple command is sent from the PC to the Instrument like: “TRAC:A?”, meaning: Get data TRACe for channel A. This is followed by the ‘Linefeed’ terminator and then the fun starts. There might be a lot of data captured in my instrument that has to be transferred back to the PC. The standard capture is 1024 samples of 16 bit ADC data. The ASCII values of the data are separated by commas, so a pretty worst case transfer might be something like,

    “65535,” repeated 1024 times and terminated with a ‘Linefeed’

This is 6 x 1024 + 1 characters or 6145 characters total. With a setting of 3 M BAUD the processor can pump this data out to the FT232 chip in a little over 20 mSec. This was confirmed with a scope: the processor can easily pass this amount of data in this time without interruption or gaps.

The minimum case would be if the ADC Data was all zeros. In this case the transfer would be,

    “0,” repeated 1024 times and terminated with a ‘Linefeed’

This is 2 x 1024 + 1 characters or 2049 characters total.

It can be seen that even with a fixed length of data points to send back to the PC, if leading zeros are suppressed then the data could be anywhere from 2049 to 6145 characters total. Any optimization would have to take this into account.
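The character counts and the roughly 20 mSec figure above can be checked with a few lines of arithmetic. This is a sketch only: it assumes 8N1 framing (one start bit, eight data bits, one stop bit, so 10 bits on the wire per character), which the text does not state, and the function names are invented for the example.

```c
#define SAMPLES_PER_TRACE 1024ul

/* Characters in one trace reply: each sample is `digits` ASCII digits
   plus a comma, and the whole reply ends with a single linefeed. */
unsigned long trace_chars(unsigned int digits)
{
    return ((unsigned long)digits + 1ul) * SAMPLES_PER_TRACE + 1ul;
}

/* Time on the UART wire in milliseconds.
   ASSUMPTION: 8N1 framing, i.e. 10 bits per character. */
double uart_time_ms(unsigned long chars, double baud)
{
    return (double)chars * 10.0 * 1000.0 / baud;
}
```

With five digits per sample this gives 6145 characters and about 20.5 mSec at 3 M BAUD. With one digit per sample it gives 2049 characters and about 6.8 mSec.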

Optimizing The Parameters

For Case 1: Where the command size is something around 10 characters and the return is simply the ‘Linefeed’ character, the buffer will never fill and the only way to minimize the transfer time is to set the latency to 2 mSec or use one of the Flow Control Lines to force a transfer.

For Case 2: Where the upstream data is large the proper choice of Buffer and Latency is not so clear.

Naturally as an engineer I read the FTDI optimization application note [1], took its suggestions for setting the latency and buffer size, and tested this by measuring and averaging 100 transfers. To my surprise the ‘Improved’ settings gave about the same average transfer speed as the default settings.

So then I started hacking settings by changing the parameters 20% either way and looking at the results. Still nothing conclusive, and I wasn’t able to converge on a higher transfer rate. I started to wonder: If the maximum transfer rate might be a sharp function of some combination of the latency and buffer size, how would I determine this? I would likely miss it by hacking a few values at random and this wasn’t getting me anywhere anyway.

I turned to my old friend “Monte Carlo”, as in the “Monte Carlo Method”. This is a trusty way of randomly picking a lot of values, applying them and then seeing what the result is. Monte Carlo analysis is useful when you don’t have a clear understanding of the underlying functions that are controlling something. You will be less likely to miss some narrow response if you randomly pick enough values than if you use an orderly method to step the values.

I wrapped a loop around my benchmarking routine and set it out to capture 5000 random Latency and Buffer Size parameter variations. I also set the test program to run as an EXE and not in the development environment, to remove that as a source of variation, and I didn’t use the test PC for anything else during the run.
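The random picking step might look something like the sketch below. This is not the actual test program. It assumes the picks are drawn uniformly over the full valid ranges given earlier (64 to 65536 bytes in steps of 64, and 2 to 255 mSec), and the struct and function names are invented for illustration.

```c
#include <stdlib.h>

/* One candidate pair of Bridge settings (hypothetical type). */
struct bridge_settings {
    unsigned long buffer_bytes;   /* 64..65536 in steps of 64 */
    unsigned int  latency_ms;     /* 2..255 */
};

/* Draw one random (buffer size, latency) pair.
   ASSUMPTION: uniform picks over the whole valid range. The small
   modulo bias of rand() does not matter for a coarse sweep like this. */
struct bridge_settings random_settings(void)
{
    struct bridge_settings s;
    s.buffer_bytes = 64ul * (unsigned long)(1 + rand() % 1024);  /* 1..1024 steps of 64 */
    s.latency_ms   = 2u + (unsigned int)(rand() % 254);          /* 2..255 */
    return s;
}
```

Each pick would be applied to the Bridge, the benchmark transfer run, and the settings and measured time logged for sorting and plotting afterwards.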

Just looking at the raw data, the fastest to the slowest transfer time was: 0.0306 to 0.348 Seconds or 11X speed difference. The Default data rate with a 16 mSec Latency Timer and 4k Buffer was: 0.054 Seconds.  Changing the default to the fastest setting could result in a 54/30 or 1.8X speed increase. That’s worthwhile pursuing.

Looking at the raw data some more, the fastest 21 transfer times all had buffer sizes of 64 bytes. There is a conclusion right there without even having to plot the data!

Being curious, I did plot the entire 5000 points of data and the results are shown in Figure 1. There are some outliers which can probably be explained by the Windows Operating System going off and doing something else during a data transfer, but the large majority of points fall in a small band of values.


Figure 1 – A random selection of points was made for the Buffer Size and Latency, at each point an average of 10, 6145 byte transfers was made and recorded (Vertical Axis). A few features can be seen: A ‘rift’ is visible along the Buffer Size axis. Generally the minimum transfer time is with small Buffer and Latency values (lower front corner of the plot).

The ‘rift’ is an interesting feature of Figure 1. Figure 2 is a zoomed in look at that feature with the plot rotated so that the Transfer Time variation is flattened.

Figure 2 – A zoomed-in and rotated look at Figure 1. Now the effect of buffer size on transfer speed (Vertical Axis) can be clearly seen. It does in fact have a minimum at around 6145 bytes and sub-multiples of that. However a minimum can also be seen at the smallest buffer sizes. Note: Since the 3D curve was rotated down to get all the slope out of the curve, the transfer time (Vertical Axis) is no longer valid as an absolute value. It is only a relative measure: Lower on the graph is a faster overall transfer time.

Figure 2 shows the 3D plot flattened on the Buffer Size Axis – A few clear trends are present, the overall transfer time is minimized as the transfer buffer is reduced, reaching a minimum at the right end of the scale or, 64 bytes. Also, there is a minimum at around 6145 bytes and sub-multiples of 6145 bytes. This is predicted by the FTDI application note [1].

Figure 3 – A zoomed in view of small buffer sizes and latency numbers with a 6145 character transfer. Here the minimum can be clearly seen. Setting the Buffer size to 64 bytes and the latency to less than 10 results in the minimum transfer time over the other cases by nearly 10%. The rightmost curve in the plot above tells the story.

Figure 3 shows a zoomed in portion of Figure 1, for small buffer sizes and small latencies. Here it can be seen that the lowest transfer time is when the Latency is set to the minimum. The lower total transfer time group is when the buffer is 64 Bytes (rightmost curve in figure 3).

To analyze these results fully, I sorted the data and found the 24 fastest transfers. These all had a 64 byte buffer, as figure 3 predicted. Then I plotted the transfer time versus latency as shown in figure 4.

Figure 4 – The 24 fastest transfers of 6145 characters all had a 64 byte buffer. Plotting this set of data versus Latency showed that there is a very small local minimum here, but the difference in transfer time from 2 to 7 mSec Latency setting is less than 1 part in 30.

Figure 4 showed that there is indeed a local minimum in transfer speed, but the difference is so small that there really isn’t any appreciable difference for any Latency Value from 2 to 10 mSec.

Figure 5 – As a verification, I also did a summary plot of 2049 character transfers to make sure that the optimization worked for the smallest typical data set too. This plot follows the same trend as figure 4. As before, any Latency Value from 2 to 10 mSec results in a very low transfer time.


It is really easy, for this example, to minimize the USB transfer time for the two use cases in my project just by setting the buffer size to 64 and the Latency to 2 mSec.

This is far simpler than what reference 1 would lead you to believe, but it has been proven by actual measurements. Using these settings also eliminates the need to use the control lines to force a transfer as this won’t shave any time off the transfer time when you have the latency set to its minimum anyway.

As for PC performance: Even on my low end i7 core based notebook, running Windows 7 / x64,  I don’t notice any operating system sluggishness or excessive processor loading using these settings. So there don’t seem to be any downsides to this.

If any sluggishness is noticed, the Latency can be set as high as 10 mSec (5X Higher) with no appreciable reduction in the large data transfer and only an 8 mSec penalty in response time for the single character case (Case 1), which may not be noticed or that important in the overall scheme of things.

If a higher Latency option was selected it might be wise to wire up one of the FT232R’s flow control lines and to have the downstream processor toggle this at the end of every command to maximize the speed of the single character transfer case.

As a final note: The FTDI Latency and Buffer size settings can be changed at any time after an FTDI USB device is opened for communication, and they take effect immediately. The elapsed time to set both parameters is less than 1 mSec so there is not much time penalty in actively managing the settings as a program runs.

This exercise lowered the transfer time in my application for 6145 characters from 0.052 to 0.031 Seconds. The speed improvement ranges from a factor of 8X for Case 1 (a small, 1 character upstream transfer) to 1.6X for Case 2 (6145 character upstream data transfers). I can now achieve an overall 1.9 Million Bits per Second transfer rate without changing the hardware at all. That’s time well spent tweaking a few software parameters.
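As a sanity check on those figures, the speed-up and effective bit rate can be worked out as below. This is a sketch with invented function names, and it assumes the bit rate is counted as 10 bits per character (8N1 framing on the UART side), which is an assumption on my part.

```c
/* Ratio of old transfer time to new transfer time. */
double speedup(double before_s, double after_s)
{
    return before_s / after_s;
}

/* Effective end-to-end rate in bits per second.
   ASSUMPTION: 10 bits per character (start + 8 data + stop). */
double effective_bits_per_second(unsigned long chars, double seconds)
{
    return (double)chars * 10.0 / seconds;
}
```

Going from 0.052 to 0.031 seconds is a ratio of about 1.68, and 6145 characters in 0.031 seconds works out to a little under 2 million bits per second under that assumption. Going from a 16 mSec to a 2 mSec latency for the single character case is the factor of 8.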

Caveat Emptor

It should be noted that the USB bus is a cooperative and possibly multi device system that has a limited overall bandwidth. My application requirements always specify that my devices are the only devices on the bus consuming any bandwidth, for maximum speed. This may not always be the case. For instance there may be a wireless Mouse attached or a Disk Drive, etc.

Even though you may think that your ‘widget’ is the only device in the world, you can never be sure what your customers may try to inter-operate with. It is wise therefore to test and to make sure that your application and its settings can withstand other things going on in the PC at the same time. I personally have written a USB disk drive file mover application that I run on the PC while doing stress testing of my applications. This application consumes a large amount of USB bandwidth by copying large files back and forth across the USB interface to a USB disk drive in the background while I run my application in the foreground looking for transfer issues.


[1] “AN232B-04 Data Throughput, Latency and Handshaking”, Published by: Future Technology Devices International Limited.

[2] “AN120 Aliasing VCP Baud Rates”, Published by: Future Technology Devices International Limited.

[3] “D2XX Programmers Guide”, Published by: Future Technology Devices International Limited.

Article By: Steve Hageman     
We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.  
This Blog does not use cookies (other than the edible ones). 

Sunday, July 23, 2017

When Microprocessors are a commodity – How do you choose?

Slightly over 20 years ago I had the need for a microprocessor to use in a project. I needed to keep the budget low. The 8 bit microprocessors of the time cost $15.00 in single quantities. That wasn’t an issue, but the tool chain cost was. I didn’t want to spend the thousands on a traditional tool chain that many of the processors of the day used. I bought a hobby programmer for $69.00. The required UV EPROM Eraser cost me $29.00, the C compiler was $99.00. The tool chain was the most expensive part and that really drove the Microprocessor vendor choice, which in this case was a Microchip PIC16C71 with all of 1k program memory and 36 bytes of RAM, running at a blazing 1 MIPS! The final project worked fine.

Thanks to Mr. Moore and his ‘Law’: Now we have 32 bit processors with built in floating point units that have at least 512k of flash and 256k of RAM, all running with a 200 MHz system clock, and they cost less than $10.00 in single quantities.

The tool chains are free and based on the GCC compiler. Similarly the basic programmer / debuggers are less than $20.00.

I have experience with both Microchip PIC32MZ and ST Micro STM32F7 products. So when I had a new project come by, how would I choose the ‘best’ device?

One processor is MIPS core based and the other is an ARM design. For 99.9% of the code I needed to write, the C compiler hides the underlying processor details so I don’t need to know anything about the underlying differences between a MIPS and ARM core. GCC has optimizations for both types and does a fine job of making efficient code.

The client may care about what processor to use, but if they say so, what they really care about is the tool chain, and I agree. You simply can’t run a successful operation if every project requires a different tool chain. So if you are an “ARM” processor house you are really a ST Micro, NXP, Atmel, et al. tool chain house that just happens to target ARM Core processors from some manufacturer.

The choice as to what to pick comes down to slight features or preferences in tool chains or processor features. Everything else is pretty equal. Believe it or not, 32 bit microprocessors are a commodity.

My Current Project

The project at hand was a dual channel data acquisition and computation instrument that needed to drive two fast 16 bit ADC’s, buffer the samples in a large amount of on board RAM, then process the results with FFT’s and drive an Analog Output with some computed results. The design also needed a USB connection to a PC for setup and monitoring.

Normally this is done with at least three chips: An FPGA for ADC and Memory interfacing and a 32 Bit Processor acting as the DSP number cruncher and communication processor. For this project I needed to keep the chip count to one, the 32 Bit Processor. So speed and memory were primary considerations.

Speed was the first consideration: At least 200 MHz system clock was required, just so I would not fail on the DSP computation part. I had enough bench-marking experience with some previous projects that I knew that a 32 Bit processor running at least 200 MHz would give me the desired number crunching performance.

The next constraint was that the ADC’s chosen were parallel output devices. I needed to be able to get a full 16 bits read in to the processor in a single chunk and ping-pong between the ADC’s to get both read in at a 2 MSPS rate. This is where demo boards came in. Both processors had (in a 64 pin package) at least one fully pinned out 16 bit port and writing some test bit-banging code I verified that both would also be able to manipulate the ADC’s and get the data to RAM fast enough. Both processors passed this constraint.

ADC interfacing: The ADC’s had 3.3V IO voltage levels and both processors also have 3.3V IO pin voltage levels. Both processors also use a single voltage for the core and IO pins, which makes the overall design simpler as only one voltage has to be supplied to the processor. So no advantage to either processor.

Next was RAM – I wanted the biggest amount of RAM possible, just to be safe. The initial design was to run 2 x 8k FFT’s with another 2 x 4k buffers for continuous averaging of the results. Both chips had many times this RAM, but you can never have too much RAM can you?

The PIC32MZ had a slight advantage here as that part has 512k of RAM versus the STM32F7 part’s 256k.

Program Memory: I always buy the biggest memory part available for prototyping – after all you want to get the design going fast, not save a few bucks and end up spending days trying to figure out how to make the program fit. The PIC32MZ also has a slight advantage here as it has an unbelievable 2048k of Flash program memory! The STM32F7 topped out at 512k – Although to be honest I would never end up using all that Flash from either part for this application.

Note: If you are building a Web interface and will be serving up Web pages, all that RAM and ROM starts to look pretty small, pretty fast!

Speaking of FFT’s – Life is too short to be writing your own FFT and DSP routines and both chips passed this test by supplying a very complete and functional DSP library at no cost!

Speaking of Math – Both processors have Floating Point Units (FPU’s). The PIC32 however can do both single and double precision floating point, whereas the STM32F is a single precision unit only. The thought was that in the code I would do integer FFT’s and averaging, then convert the single FFT bin of interest to floating point to do the control math, then convert this back to integer for output to the DAC. Using floating point when the processor has an FPU has almost no speed penalty and just makes the code easier to understand (which my clients like). This gives the nod to the PIC32 for never leaving me out in the cold if I needed more precision than the single precision STM32F FPU provides.

Processor Package Size: The ideal was to use a 64 pin LQFP – Check on that as both families have that package available.

Tool chain: Both are usable, free, GCC based tool chains with very low cost programmer / debuggers. The STM32F7 tool chain is a little more common as it provides a set of tools to initialize and configure the peripherals through a HAL (Hardware Abstraction Layer) Library. The PIC32MZ uses what Microchip calls their Harmony configuration program. Harmony abstracts all the various PIC32 chips to a single very high level HAL.

Documentation wise: The PIC32 XC-32 Compiler and Standard Library documentation is very good and customized for the PIC32 GCC extensions. Likewise the online help in the IDE, while not being perfect or complete is more than just auto-generated listings of function calls, so it is quite useful.

With the STM32 you are left with the standard GCC documentation that can be found on the web. The STM32 HAL library documentation is basically just auto generated listing of the functions and their parameters with no other useful information.

Nod to the PIC32 for better compiler documentation.

Both programming IDE’s are easy to use and have the same amount of annoying little issues (nothing works perfectly, does it?), so that’s a dead heat. At least neither ever crashed on my computers. They just have the annoying bugs like the dreaded “red squiggle underline” under perfectly fine code and not syntax highlighting correctly all the time.

Processor Interrupts – The firmware design was such that the sampling clock would drive the ADC’s convert pin directly and also an external processor interrupt pin to initiate the processor’s data collection function. Another external interrupt would be needed for an external trigger circuit. Both processors have fast interrupts available on all IO pins, so no advantage to either here.

Communication – The plan was to use a trusty FTDI USB to Serial converter to get the USB communication to the Control PC. Hence I wanted a USART that could run at 3 Million Baud (The maximum rate for a FT232R chip). Both processors have multiple USART’s and can support the 3 MBAUD rate. A tie here. The PIC32MZ has a slight tool chain advantage as the stdio functions like printf() are already wired up to USART 2 and don’t require any further setup. On the STM32F parts there is some user code required to route the standard out to the proper USART. Not a big deal, but it has to be done.

Sampling Clock – The design was to use the 96 MHz system reference oscillator divided down to 24 MHz as the processor system clock input, this would be further divided to get an adjustable ADC sampling rate clock. The divider needed to be completely in hardware so that it would have no extra uncertainty jitter (An interrupt based timer/divider would have too much jitter and would not work).

The PIC32MZ has four independent Reference Clock Dividers that can be clocked from a variety of sources and can generate a divide ratio of 2-32768. This fit perfectly for my needs. The STM32F7 does not have such a divider. This is the biggest difference between the parts. It did take two hours to figure out how to program this on the fly, as the core has to be unlocked to change the divide ratio, etc. This was well covered in the documentation; I just had to find it.
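To illustrate picking a divide ratio for a wanted sampling rate, here is a simplified sketch. It treats the divider as a plain integer ratio limited to the 2-32768 range quoted above and assumes the 24 MHz clock is the divider input. It is not the PIC32MZ register-level procedure, and the function names are made up for the example.

```c
#define DIV_MIN     2ul
#define DIV_MAX 32768ul

/* Hypothetical helper: nearest integer divide ratio for a target
   output rate, clamped to the 2..32768 range quoted in the text.
   ASSUMPTION: a plain integer divider, f_out = f_in / ratio. */
unsigned long sample_clock_divider(unsigned long ref_hz, unsigned long target_hz)
{
    unsigned long div;
    if (target_hz == 0ul) return DIV_MAX;            /* avoid divide by zero */
    div = (ref_hz + target_hz / 2ul) / target_hz;    /* round to nearest */
    if (div < DIV_MIN) div = DIV_MIN;
    if (div > DIV_MAX) div = DIV_MAX;
    return div;
}

/* Output rate actually produced by a given integer ratio. */
double actual_rate_hz(unsigned long ref_hz, unsigned long div)
{
    return (double)ref_hz / (double)div;
}
```

Under these assumptions a 24 MHz input and a 2 MHz target give a ratio of 12. Because the division is done purely in hardware, the output edges carry no software timing jitter.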

Analog Output – The result of all this sampling, FFT’s and math was a single number that could be output to a DAC. 12 bits was the minimum precision and 16 bits would be preferred. The initial thought was to use an external DAC on a SPI bus for this. As the update rate only had to be at 100 Hz, this would be easy to implement. The STM32 may have an advantage here – as it has two 12 bit DAC’s built-in whereas the PIC32 only has a low performance, 5 bit voltage divider built in. The worry here is that the internal DAC would be corrupted with noise from the processor core, but at 100 Hz output update rate I could have easily filtered off any noise. Slight advantage to the STM32 here.

Core Features – The MIPS core has an independent counter on the system clock. This counter runs at one half the system clock rate (100 MHz in this case) and can be easily used to make very precise delays and timing measurements.
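The tick arithmetic for such a counter is simple enough to sketch. The code below assumes the 200 MHz system clock from this project and a free-running 32 bit counter. Reading the counter itself is device specific and is left out, and the function names are illustrative only.

```c
#include <stdint.h>

#define SYSCLK_HZ      200000000ul          /* 200 MHz system clock (from the text) */
#define CORE_TIMER_HZ  (SYSCLK_HZ / 2ul)    /* counter runs at half rate: 100 MHz */

/* Convert a delay in microseconds to counter ticks (100 ticks per uSec).
   Valid for delays up to about 42 seconds before the product overflows. */
uint32_t us_to_ticks(uint32_t us)
{
    return us * (uint32_t)(CORE_TIMER_HZ / 1000000ul);
}

/* Ticks elapsed between two counter readings.
   ASSUMPTION: a 32 bit free-running counter. Unsigned subtraction
   gives the right answer even if the counter wrapped once in between. */
uint32_t ticks_elapsed(uint32_t start, uint32_t now)
{
    return now - start;
}
```

A delay loop would read the counter once, then spin until ticks_elapsed() reaches the value from us_to_ticks().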

The STM32F7 has a DWT Cycle counter on the core clock and a SysTick counter, but they are not straightforward to start and use and there are typos in both the ARM and STM32 documentation. Also the STM32 configuration program does not configure the DWT counter for you; it must be done with bare metal code. That’s a negative for the STM32 tool chain, as the PIC32 Harmony configurator exposes every part of its chip for configuration.

Help – Both the PIC32MZ and STM32F7 processors have active and helpful forums on the interwebs, so no advantage to either part here.

Resume – Having ARM Core experience on your resume looks better than MIPS core experience, so points to the STM32F7 here. ARM cores are just more popular, go figure...

And the Winner is,

The choice was simple, even though I slightly prefer the STM32 tool chain over the PIC32 and if all other things were equal that would have been the deciding factor, the fact that the PIC32 has one single little hardware peripheral: “The Reference Clock Divider” ultimately drove the decision.  The use of an external DAC for the PIC32 implementation was not as big a deal as the reference clock, which would have required much more extensive circuitry than a simple DAC to implement externally. The use of an external DAC also alleviated all fear of having processor noise on the output.

The PIC32’s larger ROM and especially RAM were simply icing on the cake.

Appendix – Why use an external USB chip when your processor implements USB?

A FTDI FT232RL costs around $4.50 in single units, so why put one in front of a microprocessor that has a built in USB interface?

Several reasons, actually.

#1 – Driver stability: Using a microprocessor’s built in USB interface usually means using the Windows CDC class driver. The CDC driver up until Windows 10 has been reported to have many quirky issues. The FTDI driver on the other hand is bullet proof. Unless cost is of paramount importance, I always do my customers (and myself) a favor and use the most bullet proof solution possible. I just have never had any issue with the FTDI drivers on any PC. They work very well.

#2 – Program development: When developing programs frequent processor resets are the norm. Every time you program new code the processor restarts. A processor restart also resets the processor’s built in USB peripheral. This breaks any connection that you have with the PC at the time, forcing you to restart the PC control program also.

When you use a FTDI chip, resetting the processor does nothing to the USB to PC connection, it does not reset and any PC programs keep running as if nothing happened. You might get a few garbage characters, but you won’t have to restart any PC program. This saves tremendous time and frustration during code development.

Even if your final design is going to use the processors built in USB, at least design your board to use a FTDI dongle for development purposes, then switch to the built in USB peripheral when the design is more stable.

#3 – Speed: USB is USB. The speed of the transfer is dependent on: How many bits are to be sent, the latency times and buffer size of the USB driver. The built in USB peripheral can’t be any faster than the FT232 chip. To maximize speed for any given application these two parameters need to be modifiable [1]. The FTDI driver exposes these two parameters in their excellent DLL so that they can be changed on the fly. It is a control panel / registry hack to accomplish this with the Windows CDC driver and I’m not sure that the values can change on the fly without disconnecting and reconnecting the device.

#4 – Ease of programming: With the FT232 chip the interface is through the processor’s USART and can directly use functions like printf(). If you use the built in USB peripheral you will be on your own to build packets and stuff them down the USB pipe optimally. This takes time to benchmark and analyze, and the speed gains will be marginal with un-optimized driver settings.

Do yourself and your customer a favor and give them the reliability of an FT232 USB connection. You won’t be sorry, and in the long run everyone will save money.

[1] For a very interesting overview of USB latency and buffer size and how they affect total transfer speed see: “AN232B-04 Data Throughput, Latency and Handshaking” published by FTDI Inc.

Copyright notice: MIPS, ARM, PIC32, STM32, Microchip, ST Micro, NXP, Atmel and FTDI logos and names are copyrighted by their respective owners.

Article By: Steve Hageman     

We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.  

This Blog does not use cookies (other than the edible ones). 

Tuesday, March 14, 2017

Auto Generate a 'C' Function Prototype Header File

There is a lot of discussion on the proper way to use header files in Embedded C projects. I don't want to get into that discussion; rather, I want to present a tool that suits the way I work. If you don't agree with my use model, that's OK. You don't have to use this tool that way, as it is very adaptable to any situation! [1]

I use header files for three things:

  1. One big one that has all the #includes in it so that every module has the proper references to things like: <stdio.h> and <math.h>.
  2. One that has the global program data in it.
  3. One that has the various public function prototypes (signatures) in it.

The first two are relatively easy to make and maintain. They also settle down quickly in a project, as the tasks they support are defined early on and, while there may be refactoring, they rarely change much.

#3, however, is constantly changing. Even in a small project it may grow to many, many function prototypes, and refactoring causes constant updating.

It is tedious to have to change the data in two places when refactoring or when adding functionality through public functions in an embedded C program.

Auto Generation to the Rescue

I knew that someone somewhere must have written a utility that reads a directory, looks through all the '.c' files and auto-generates a header with all the public function prototypes in it.

Sure enough I found some C code written in 1993, by a Mr. Richard Hipp on the InterWebs that does just that [2].

It's a small program, all in one file and only a few pages long. I brought it into Pelles C [3] to see if I could compile it under Win 7 and it compiled with only a few simple changes. For code written in 1993 to compile on Windows that easily seems almost unbelievable to me!

What the program is supposed to do is read every '.c' (and/or '.h') file in a directory, extract the public functions and write them as function prototypes into individual '.h' files or one big '.h' file that can be added to the project.

Functions are marked as local (i.e. not for exporting into the '.h' file) by either using the keyword: 'static' or by the use of a 'LOCAL' define (see the program documentation [2]).
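To make the idea concrete, here is a deliberately naive Python sketch of what such a tool has to do: find function definitions that start at the beginning of a line, skip the ones marked 'static' or 'LOCAL', and emit a prototype for the rest. It is only an illustration under simplifying assumptions (signatures on one line, the opening brace on the same or the following line, no comments or macros in the way). It is not how MakeHeaders itself works, and the names used are made up for the example.

```python
import re

# A function definition: return type, name, parameter list, then an
# opening brace on the same line or the next one.
DEFINITION = re.compile(
    r"^(?P<sig>[A-Za-z_][\w\s\*]*?\b[A-Za-z_]\w*\s*\([^;{}()]*\))\s*\{",
    re.MULTILINE,
)

SAMPLE = """\
static int helper(int x)
{
    return x * 2;
}

int PublicAdd(int a, int b)
{
    return a + helper(b);
}

LOCAL void Hidden(void) {
}

unsigned char *GetBuffer(void) {
    return 0;
}
"""


def extract_prototypes(c_source):
    """Return prototypes for the public functions found in c_source."""
    prototypes = []
    for match in DEFINITION.finditer(c_source):
        signature = " ".join(match.group("sig").split())
        first_word = signature.split()[0]
        if first_word in ("static", "LOCAL"):
            continue  # local function, keep it out of the header
        prototypes.append(signature + ";")
    return prototypes


if __name__ == "__main__":
    for line in extract_prototypes(SAMPLE):
        print(line)
```

For the sample source only PublicAdd and GetBuffer end up in the output; helper and Hidden are skipped because of their 'static' and 'LOCAL' markers.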

The program is wonderfully written, easy to follow and worked straight up out of the box, with one exception.

To read all the '.c' files at once and make a single output file, the program depends on the UNIX ability to do wildcard expansion on the command line. MSDOS does not have that capability, so I had to wrap makeheaders.exe in an MSDOS batch file to make it work the way I wanted to use it.

MakeHeaders is so complete that it can read a list of file names from a file named on the command line and then make a single big '.h' file. Mr. Hipp thought of everything. The batch file that I used is below.

REM Make one big Header File
dir /B *.c > mkhdr_input.txt
makeheaders -h -f mkhdr_input.txt >AutoGeneratedPrototypes.h
del mkhdr_input.txt


Listing 1 – An MSDOS batch file that makes the MakeHeaders program work the way I wanted it to. Upon running, it makes a temporary file with all the names of the '.c' files in the current directory. Then it feeds this list into makeheaders.exe. The program then parses all the files, picking out the public function prototypes to write into one big '.h' file.

The batch file of Listing 1 first makes a file called: “mkhdr_input.txt” that contains just the file names. The “dir /B” switch is for the bare format which will list just the file names.

The makeheaders.exe is then fed with the mkhdr_input.txt file as input. The '-h' switch causes makeheaders to generate one big header and write it to the standard output, which I redirect to the file AutoGeneratedPrototypes.h with the “>” redirect command.

The '-f' switch tells makeheaders that it will get its input from the file specified, in this case: “mkhdr_input.txt”.

Finally I just clean up by deleting the “mkhdr_input.txt” file.
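For anyone who prefers a script to a batch file, the same three steps could be sketched in Python roughly as follows. It uses only the '-h' and '-f' switches described above and assumes makeheaders is on the PATH. The function names are made up for illustration.

```python
import glob
import os
import subprocess


def build_command(list_file):
    """Command line equivalent to: makeheaders -h -f <list_file>."""
    return ["makeheaders", "-h", "-f", list_file]


def make_one_big_header(directory=".", output="AutoGeneratedPrototypes.h",
                        list_file="mkhdr_input.txt"):
    """Write the names of all '.c' files to a list, then run makeheaders."""
    names = sorted(os.path.basename(p)
                   for p in glob.glob(os.path.join(directory, "*.c")))
    list_path = os.path.join(directory, list_file)
    with open(list_path, "w") as f:
        f.write("\n".join(names) + "\n")
    try:
        with open(os.path.join(directory, output), "w") as out:
            # makeheaders -h prints the header on standard output
            subprocess.run(build_command(list_file), cwd=directory,
                           stdout=out, check=True)
    finally:
        os.remove(list_path)  # clean up the temporary list
    return names


if __name__ == "__main__":
    print(make_one_big_header())
```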

There are many other options and ways to make the MakeHeaders program operate so be sure to check out the documentation [2]. For example the same list input can be used to make an individual '.h' file that corresponds to every '.c' file in the directory. This format is preferred by many and in very large programs may be preferable.

I just add the AutoGeneratedPrototypes.h to my master 'include .h' file and the rest is automatic.

Perfect Function Prototypes Every Time

Now, anytime I refactor or add public functions to any source file I can just double click on the MakeHeaders.bat file and a new AutoGeneratedPrototypes.h file is made all ready for compiling.

Extra Bonus

If you want to use MakeHeaders to create a single, individual '.h' file for every '.c' file just use the batch file below.

REM Make A Separate Header File for Each *.C file
for %%G IN (*.c) DO MakeHeaders %%G


[1] To find out more about the various ways to use header files, do a Google Search like,

Then pick a strategy that makes sense for you.

[2] Make Headers program. As of March 2017, the source code and documentation can be found at,

[3] Pelles C – A very good freeware C Compiler for Windows,

Article By: Steve Hageman     


Monday, March 6, 2017

Simple Circuits Add to Versatility of the AD9834 Direct Digital Synthesizer IC

The little AD9834 Direct Digital Synthesizer (DDS) made by Analog Devices is a powerhouse of a design that is found in all sorts of products from Radios to Power Supplies. It is the “NE555” of DDS chips in its popularity [1].

The AD9834 consists of a 28 bit programmable DDS Core and a 10 Bit Current Output DAC. The Nominal DAC output current is 0 to 3 mA. This current can conveniently be output directly into a 200 Ohm load to generate a 0 to 600 mV output voltage (See figure 1).

Figure 1 – Standard output circuit configuration for the AD9834. An FSADJ resistor of 6.81k provides a fixed 0 to 600 mV output.

Note: The following circuits are simplified and do not show power supply connections or proper bypassing. Please refer to the part-specific data sheets for complete usage information.

While 0 to 600 mV may be useful in many applications, it is not particularly useful if the output voltage needs to be user adjustable or bipolar. This is especially true when the DDS is used like a function generator where the end user needs an adjustable amplitude.

Adjustable DAC Full Scale Current

The first approach to adding an adjustable output to the AD9834 is to attack the DAC current setting resistor. This resistor is nominally 6.81k for a DAC current of 0 to 3 mA. The voltage at pin 1 (FSADJ) is nominally 1.15 Volts and this generates a current in the 6.81k resistor of 0.1689 mA (1.15/6810). This current gets scaled by 18 times internally in the AD9834 to get to the final 3 mA DAC Full Scale Current.
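The arithmetic above is easy to wrap in a few lines for trying other resistor values. The sketch below simply restates the numbers from the text (1.15 V at FSADJ, an internal gain of 18, and the 200 Ohm load of Figure 1). The function names are illustrative only.

```python
FSADJ_VOLTS = 1.15      # nominal voltage at pin 1 (FSADJ)
CURRENT_GAIN = 18       # internal scaling of the FSADJ resistor current
LOAD_OHMS = 200.0       # output load resistor from Figure 1


def full_scale_current_ma(rset_ohms):
    """DAC full scale output current in mA for a given FSADJ resistor."""
    return CURRENT_GAIN * FSADJ_VOLTS / rset_ohms * 1000.0


def full_scale_output_mv(rset_ohms, load_ohms=LOAD_OHMS):
    """Full scale output voltage in mV developed across the load."""
    return full_scale_current_ma(rset_ohms) * load_ohms


if __name__ == "__main__":
    rset = 6810.0
    print(f"{full_scale_current_ma(rset):.2f} mA, "
          f"{full_scale_output_mv(rset):.0f} mV")
```

With the nominal 6.81k resistor this works out to about 3.04 mA and about 608 mV, consistent with the rounded 3 mA and 600 mV figures used in the text.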

The internal design of this circuit lends itself to controlling the DAC full Scale current over a reasonable range and this can be a useful and inexpensive way to get an analog or digital adjustment on the DDS output voltage.

As shown in Figure 2, a simple precision OPAMP from Linear Technology [2] used in a scaling circuit has been added to the DDS to control the DDS output voltage over a 4:1 range. Maximum DDS output voltage for this circuit is achieved for a 0 Volt control input.

A potentiometer or any 0 to 5V DAC output can be used as the 0 to 5V input to allow complete digital control of the DDS output voltage.

Figure 2 – A simple OPAMP circuit added to the AD9834 can give the AD9834 a 4:1 output Adjustment Range for a 0 to 5 Volt input signal. The Input signal could be a Potentiometer or from a 0 to 5V DAC.

The limiting factor in the maximum achievable adjustment range of the circuit in Figure 2 is the AC performance of the DDS DAC. While the output can be adjusted down from its maximum value, the feedthrough glitches from the DAC switches will remain the same and the linearity of the DAC will suffer at lower output levels. Note also that any excess noise on the 0 to 5V control voltage will cause amplitude modulation (AM) on the DDS output, so add filtering as may be required by your application.

I have used this circuit in Figure 2 for a 4:1 adjustability with decent results. Your mileage may vary, so be sure to check the AC parameters that are important in your specific application.

Multiplying DAC on the DDS Output

For the ultimate in digital adjustability a Multiplying DAC  (MDAC) can be used at the output of the DDS to get 2^14 (16384:1) or better than 80 dB of adjustment range.
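As a sketch of how a requested attenuation maps onto a control word: an ideal n-bit multiplying DAC scales its reference by code / 2^n, so the attenuation in dB is -20*log10(code / 2^n). The helper names below are hypothetical, and the model ignores the errors and feedthrough of a real part.

```python
import math


def attenuation_db(code, bits=14):
    """Attenuation of an ideal MDAC for a given control word, in dB."""
    if code <= 0:
        return float("inf")           # code 0 is (ideally) fully off
    return -20.0 * math.log10(code / 2 ** bits)


def code_for_attenuation(db, bits=14):
    """Nearest control word for a requested attenuation in dB."""
    code = round(2 ** bits * 10 ** (-db / 20.0))
    return max(0, min(2 ** bits - 1, code))


if __name__ == "__main__":
    print(f"1 LSB of 14 bits: {attenuation_db(1):.1f} dB")
    for db in (6.0, 20.0, 40.0, 60.0):
        code = code_for_attenuation(db)
        print(f"{db:5.1f} dB -> code {code:5d} "
              f"(actual {attenuation_db(code):.2f} dB)")
```

The smallest non-zero code of a 14 bit part sits about 84.3 dB below full scale, which is where the "better than 80 dB" figure comes from.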

The AD5453 family from Analog Devices is a very high bandwidth Multiplying DAC [3] that comes in 8, 10, 12 and 14 bit resolutions. It takes in an AC reference voltage and outputs a current that is scaled by a Digital Control Word.

Most MDACs have a very low -3 dB bandwidth, on the order of 20 kHz. The High Speed AD5453, when used with a suitable OPAMP output, has a -3 dB bandwidth of 10 MHz or better. The maximum attenuation (or how low you can control the output) is flat to 300 kHz at 14 bits, rising to 1 MHz at 12 bits, 3 MHz at 10 bits and finally 10 MHz at 8 bits.

Figure 3 – Combining a high speed AD5453 MDAC and a LT1087 Dual OPAMP allows very complete control of the AD9834 DDS output.

Note: The 10 uF capacitor sets the low frequency roll off. With the 10 uF value shown, the low frequency -3 dB point is around 2 Hz (the input resistance of the AD5453 VREF pin is 7k Ohms minimum).
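The corner frequency can be checked with the usual single-pole formula f = 1 / (2*pi*R*C). The sketch below assumes the coupling capacitor sees only the VREF input resistance, which is a simplification of the real circuit. Under that assumption the 7k minimum gives roughly 2.3 Hz, and any input resistance of about 8k or more puts the corner below 2 Hz.

```python
import math


def highpass_corner_hz(r_ohms, c_farads):
    """-3 dB frequency of a single-pole RC high-pass."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


if __name__ == "__main__":
    c = 10e-6   # 10 uF coupling capacitor
    for r in (7000.0, 8000.0, 10000.0):
        print(f"R = {r:6.0f} Ohm: {highpass_corner_hz(r, c):.2f} Hz")
```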

Note: The 1.5pF capacitors should be adjusted in the final circuit for maximum output flatness over frequency.

The circuit of Figure 3 provides a +/- 5 Volt output with low distortion to 1 MHz and provides 80 dB plus of output voltage adjustment range. Additionally the output can be offset from -5V to +5V with the addition of an offset adjustment control via a low cost DAC or POT (Offset Adjust Input).

DDS Output Control At Even Higher Frequencies

If you need to operate the AD9834 at even higher frequencies, closer to the maximum specified fundamental output of 37.5 MHz, or even operating in “Super Nyquist” mode [4], you should look at a 50 Ohm CMOS RF Attenuator like those manufactured by Peregrine Semiconductor [5]. The PE43711 has a frequency range down to 9 kHz and 31.75 dB of control with 0.25 dB steps all the way to 6 GHz. At higher frequencies you will probably be designing around 50 Ohm circuit impedances anyway, so this should not be much of an issue. Multiple PE43711s can be connected in series to get more attenuation in 31.75 dB chunks.

Note: At lower RF frequencies, less than about 50 MHz, CMOS, SiGe and Silicon based ICs are preferred to GaAs ICs. This is because GaAs ICs typically have worse harmonic distortion (especially 2nd harmonic distortion, which can be very bad) at low RF frequencies.


[1] The NE555 timer is arguably the most popular linear IC of all time.

[2] The LT1677 Precision and LT1087 High Speed OPAMPS are manufactured by Linear Technology Inc., now a part of Analog Devices.

[3] Analog Devices Application Note: “Multiplying DACs Excel at Handling AC Signals”.

[4] Super Nyquist Mode, See: Analog Devices Application Note AN-939

[5] Peregrine Semiconductor, Inc

Article By: Steve Hageman 


Thursday, January 12, 2017

Numeric optimizations in C# for a faster DFT

Kirk: "More speed Scotty !!!"

Like Capt. Kirk, we engineers, it seems, are never satisfied and always want more speed!

How to get more speed out of DFT calculations in C#

Currently the open source project "DSPLib" [1] calculates Discrete Fourier Transforms (DFT) in double precision math, so the question can be asked: Could using some other data type produce faster DFT calculations? And by fast I mean fast enough to be worth the time spent doing it: something like 50% faster, as 10% faster would likely not be worth the effort.

After reading many articles and blogs on how the .NET framework, the CLR (Common Language Runtime) and Intel Processors handle math, one will find seemingly convincing evidence and very simple benchmarks that Double and Float math calculations take about the same time on a modern processor. There is also some circumstantial evidence that Floating Point Math is as fast as integer math because the Floating Point Unit is so fast in modern CPUs. Likewise Int64s should theoretically be about the same speed as Int32s because people say that Intel processors calculate everything as Int64s and truncate for shorter data types anyhow.

But... How would all this apply to an actual DFT algorithm? A DFT moves around a fair amount of data in arrays. Can a different numeric format speed up the DFT calculations that I specifically want to make?

I didn't know, but I remembered my Mark Twain. He popularized the saying:  

     "There are three kinds of lies: lies, darned lies, and statistics."

Mr. Twain was a humorist, but if he had been a programmer he might have changed the saying to:

     "There are three kinds of lies: lies, darned lies, and benchmarks."

This applies aptly to what I was faced with. One can find many very simple “Benchmarks” that people have written to show that one numeric type is faster than another, or that the numeric format makes no difference. The trouble is that these don't store intermediate results or access arrays over and over the way a DFT does. Are they really applicable? I set out to find out…

My DFT Benchmark

I roughed out a basic DFT (all the same loops, arrays and multiplies). I used look up tables instead of calculating the Sin/Cos terms in the loop, as that is the way I would use them anyway, and I did not do any further optimization or try to help or hinder the C# compiler from optimizing anything. I also did not invoke the task parallel extensions, which have been shown to immediately provide a 3x improvement in performance even with a dual core processor [1][2].

// Note: This is not a working DFT - For simplified timings only.
// It does have the same data structures and data movements as a real DFT.
public Int64[] DftInt64(Int64[] data)
{
    int n = data.Length;
    int m = n;
    Int64[] real = new Int64[n];
    Int64[] imag = new Int64[n];
    Int64[] result = new Int64[m];

    for (int w = 0; w < m; w++)
    {
        for (int t = 0; t < n; t++)
        {
            real[w] += data[t] * Int32Sin[t];
            imag[w] -= data[t] * Int32Cos[t];
        }
        result[w] = real[w] + imag[w];
    }

    return result;
}

Figure 1: The basic code that I used to benchmark my DFT. I wrote one of these for every numeric data type by changing the variables to the type tested. I initialized the input data and the Sin / Cos arrays (Int32Sin[] and Int32Cos[] above) with real Sin and Cos data once at the start of the program. I then called the DFTs over and over again until the elapsed time settled out, which I believe is an indication that the entire program was running in the processor's cache or at least a stable portion of memory. This procedure was repeated for each DFT Length.
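For readers who want to see the same structure as a transform that actually produces a spectrum, here is a minimal sketch in Python. Unlike the timing stand-in above, it indexes the look up tables with (w * t) % n so that each output bin uses the right Sin/Cos phase, and it adds a small helper that runs a function repeatedly and records each elapsed time. The names are made up for illustration, and Python timings say nothing about how the C# numeric types compare; the point is only the loop structure and the repeat-until-stable idea.

```python
import math
import time


def make_tables(n):
    """Precompute one cycle of cosine and sine, n points each."""
    cos_table = [math.cos(2.0 * math.pi * k / n) for k in range(n)]
    sin_table = [math.sin(2.0 * math.pi * k / n) for k in range(n)]
    return cos_table, sin_table


def dft(data, cos_table, sin_table):
    """Naive O(n^2) DFT using table lookups instead of sin/cos calls."""
    n = len(data)
    real = [0.0] * n
    imag = [0.0] * n
    for w in range(n):
        for t in range(n):
            k = (w * t) % n          # table index for this bin and sample
            real[w] += data[t] * cos_table[k]
            imag[w] -= data[t] * sin_table[k]
    return real, imag


def time_repeated(func, repeats=5):
    """Run func several times and return each elapsed time in ms."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append((time.perf_counter() - start) * 1000.0)
    return times


if __name__ == "__main__":
    n = 200
    data = [math.sin(2.0 * math.pi * 5 * t / n) for t in range(n)]
    cos_table, sin_table = make_tables(n)
    print(time_repeated(lambda: dft(data, cos_table, sin_table)))
```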

I wrote DFT routines to calculate in: Double, Float, Int64, Int32 and Int16 data types and compiled release code in VS2015 for x86 and x64 architectures. The results of these tests are presented in figure 2 below.

Figure 2A – The raw results of my real world benchmark for DFTs. All the times are in milliseconds and were recorded with .NET's Stopwatch class. Two program builds were recorded: one for the x86 and one for the x64 code compilation type in Visual Studio 2015.

Figure 2B – Raw timing numbers are a little hard to compare and comprehend. To improve on this, the view above normalizes each row to the Double data type time for that size DFT, which makes it easier to compare any speed gains in the results. For instance: in the x64 Release version, for a DFT size of 2000, the Int16 data type took 0.28 times as long as the Double data type (10 ms / 36 ms = 0.28).
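The normalization itself is a one-line division per entry. A minimal sketch follows, using only the two timings quoted in the caption; the rest of the table is not reproduced here and the function name is made up for the example.

```python
def normalize_to_double(row_ms):
    """Divide every timing in a row by that row's Double timing."""
    reference = row_ms["Double"]
    return {name: ms / reference for name, ms in row_ms.items()}


if __name__ == "__main__":
    # x64 Release, DFT size 2000: only the values quoted in the text.
    row = {"Double": 36.0, "Int16": 10.0}
    for name, ratio in normalize_to_double(row).items():
        print(f"{name}: {ratio:.2f}")
```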

Int16 Compared to Int32 
The Int16 data type timing is comparable to that of the Int32 for both the x86 and x64 compiled programs. An interesting note is that the Int16, compiled as an x86 program, actually starts to take longer than the Int32 as the DFT size gets really big. This is probably due to the compiled code having to continually cast and manipulate the Int16 values to keep them properly at 16 bits long.

The bottom line is: Int16s are no faster, and in some cases slower, than Int32s, so there is really no point in considering them further in this discussion of speeding up a DFT calculation.

Using an Int16 would also severely limit the dynamic range to less than 96 dB. Many digitizers have better dynamic range than this now. I would consider this a big limiting factor.

Int32 Compared to Double 
The Int32 is much faster than a Double for small DFTs in both compilations (x86 and x64). However, as the DFT size gets really big the speed difference disappears. This is especially true for the x86 compilation, where a large Int32 DFT is actually slower than the equivalent Double DFT. With x64 compilation the Int32 is still faster even at large DFT sizes, but it too shows this non-linear behavior as the DFT size increases. This non-linear behavior is probably due to the data arrays not fitting in the processor's fastest cache as the arrays get larger.

Interesting Point #1: Benchmark timings can be data array size dependent and in this particular case are non-linear.

Int32 compared to Int64 
For the x86 compilation the Int64 is definitely a non-starter. In all cases tested the Int64 is slower than the equivalent Double calculation. This makes sense, as all the address and register calculations would need to be stitched together from 32 bit registers.

With the x64 compilation the Int32 and Int64 data types are comparable across all DFT sizes.

It probably does not make any sense to favor the Int64 over an Int32 even for the increased dynamic range. An Int32 calculation can yield a 192 dB dynamic range. This is way more than most applications can use or will ever require.

Interesting Point #2: Somewhat unsurprisingly, when using the x86 Compilation, the use of Int64s is very time consuming, especially when compared to the Int32. More surprising is that the Int64 actually takes longer than the equivalent Double.

Float compared to Double 
In both x86 and x64 compilation the Float is quicker than the Double data type. It is twice as fast at smaller DFT sizes, and the speeds become comparable at very large DFT sizes. That is not the common result that you can glean from the internet. Most people's benchmarks and discussions suggest the Double and Float data types will have the same execution time.

Interesting Point #3: Despite what the Internet says, The Float data type is faster than Double especially for small DFT sizes in this benchmark.

It is clear that for the fastest DFT calculations the x64 compilation should be used no matter what. This would not seem to be a hardship for anyone as 64 bit Windows 7 (and Win 10?) pretty much rules the world right now and everything from here on out will be at least 64 bits.

Even though the Float data type is usually faster than Double, the clear winner is to use the x64 compilation and the Int32 data type.

At the smaller DFT sizes the Int32 is nearly 3 times faster and even at really large DFT sizes the Int32 is still 25% faster than the Double data type.

Using an Int32 is not a resolution hardship in most real world cases. Most digitizers output their data as an integer data type anyway, and an Int32 data type can handle any realistic 8 to 24 bit digitizer in use today.

The Int32 data type can provide a dynamic range of 192 dB, which seems to be plenty for a while at any rate. As a comparison the Float data type provides about 150 dB of dynamic range.
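The integer dynamic range figures used in this article follow the usual rule of thumb of roughly 6 dB per bit, that is 20*log10(2^bits). A short sketch of that arithmetic is below. The Float line is an assumption: it counts only the 24 bit significand of a single precision value, which lands near 144 dB, in the same neighborhood as the "about 150 dB" quoted above. How best to define dynamic range for a floating point type is open to interpretation.

```python
import math


def dynamic_range_db(bits):
    """Ratio of full scale to one LSB for a given word length, in dB."""
    return 20.0 * math.log10(2 ** bits)


if __name__ == "__main__":
    for label, bits in (("Int16", 16), ("Int32", 32),
                        ("Float significand (assumed 24 bits)", 24)):
        print(f"{label}: {dynamic_range_db(bits):.1f} dB")
```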

Next Steps 
Now I have some real and verified benchmarks that can lead me to the best bet on improving the real time performance of a .NET based DFT.

The next obvious steps to optimize this further for the 'Improved' DSPLib library are:

1) Apply Task Parallel techniques to speed the Int32 DFT up for nearly no extra work as the DSPLib library already does [1].

2) Having separate Real[] and Imaginary[] intermediate arrays probably prevents the C# array bounds checker from being the most effective. Flattening these into one array with double the single dimension size will probably yield a good speed increase. This however needs to be verified (Again: Benchmark). References 3 and 4 provide some information on how to proceed here.
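To show what "flattening" means for the indexing, here is a minimal sketch that stores the result in a single array of twice the length, with the real part of bin w at index 2*w and the imaginary part at index 2*w + 1. This interleaved layout is just one possible choice, assumed here for illustration. The sketch is in Python, so it demonstrates only the layout; whether it helps the C# bounds checker still has to be benchmarked as noted above.

```python
import math


def dft_interleaved(data):
    """Naive DFT that returns [re0, im0, re1, im1, ...] in one list."""
    n = len(data)
    cos_table = [math.cos(2.0 * math.pi * k / n) for k in range(n)]
    sin_table = [math.sin(2.0 * math.pi * k / n) for k in range(n)]
    out = [0.0] * (2 * n)            # one array, twice the length
    for w in range(n):
        re = 0.0
        im = 0.0
        for t in range(n):
            k = (w * t) % n
            re += data[t] * cos_table[k]
            im -= data[t] * sin_table[k]
        out[2 * w] = re              # even index: real part of bin w
        out[2 * w + 1] = im          # odd index: imaginary part of bin w
    return out


if __name__ == "__main__":
    print(dft_interleaved([1.0, 0.0, -1.0, 0.0]))
```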

At least I have separated the simplistic Internet benchmarks from the real facts as they apply to my specific DFT calculation needs.

The takeaway from all this is: To really know how to optimize even a relatively simple calculation, some time needs to be spent benchmarking actual code and hardware that reflects the actual calculations and data movement required.


[1] DSPLib: An open source FFT / DFT Fourier Transform Library for .NET 4
DSPLib also applies modern multi-threading techniques to get the highest speed possible for any processor as explained in the article above. 

[2] The computer that I ran these benchmarks on is a lower end dual core i7-5500U processor with 8 GB of RAM running 64 Bit, Windows 7 Pro. Higher end processors generally have more on board Cache and will generally give faster results especially at larger DFT sizes.

[3] Dave Detlefs, “Array Bounds Check Elimination in the CLR”

[4] C# Flatten Array

NOTE: The C# Compiler is always being worked on for optimization improvements. C# V7 is just around the corner in January 2017 and it has some substantial changes. These changes may also improve or change the current optimization schemes. If in doubt, always check what the current compiler does.

Article By: Steve Hageman 

Sunday, June 19, 2016

DSPLib – An Open Source FFT Library for .NET 4

There are many Open Source and Commercial implementations around the web and in textbooks for computing Fourier Transforms. Unfortunately most are flawed in a number of ways:
  1. They produce an un-calibrated result that changes depending on the number of points transformed.
  2. They include no built in methods to scale for Windowing of the input data.
  3. They have no proper way to measure noise accurately.
  4. They don't size the returned spectrum to have just the real part of the spectrum.
  5. They implement their own Complex number type, ignoring .NET 4's built in Complex data type.
  6. They aren't complete. You have to add a bunch of helper routines every time.
  7. They have restrictive Open Source Licenses.

All of these things take hours of tweaking to get a usable FFT or DFT running from even the best of the currently available libraries.

I decided to solve this problem for myself and my clients by making a pretty complete Fourier Transform Library that implements:
  1. Properly Scaled Fast Fourier Transforms.
  2. Properly Scaled Discrete Fourier Transforms.
  3. Properly Scaled Data Windowing.
  4. Proper functions to scale for noise and signals in the correct manner.
  5. Signal Generation for Testing.
  6. Useful Array Math routines.
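To illustrate what "properly scaled" means for a signal measurement, here is a minimal Python sketch. It is not DSPLib code and does not use the DSPLib API; the function names are made up. It divides out the transform length and the window's coherent gain (the average of the window coefficients) and returns only the DC to Nyquist half of the spectrum, so that a sine wave of amplitude 1.0 reads 1.0 regardless of the number of points. It assumes an even number of samples and covers signal amplitude only; noise measurements need a different (noise bandwidth) correction that is not shown here.

```python
import cmath
import math


def hann(n):
    """Hann window coefficients."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def scaled_magnitudes(samples, window):
    """Single-sided amplitude spectrum, corrected for length and window."""
    n = len(samples)                         # assumed even
    coherent_gain = sum(window) / n          # average window value
    windowed = [s * w for s, w in zip(samples, window)]
    mags = []
    for k in range(n // 2 + 1):              # keep DC .. Nyquist only
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        scale = 1.0 if k in (0, n // 2) else 2.0
        mags.append(scale * abs(acc) / (n * coherent_gain))
    return mags


if __name__ == "__main__":
    for n in (64, 128):
        sine = [math.sin(2.0 * math.pi * 8 * t / n) for t in range(n)]
        print(n, round(scaled_magnitudes(sine, hann(n))[8], 6))
```

Both transform lengths report the same amplitude at bin 8, which is the calibration property the first list above says is usually missing.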

This work is the culmination of about 5 years' worth of using, revising and tweaking other libraries and implementations, both open source and commercial, before I wrote my own.

DSPLib is the first Fourier Transform Library that can take any time domain signal input, like from an ADC, apply one of the 27 built in window types and produce a correctly scaled Spectrum Output for either signal or noise analysis with no code tweaking required at all.

The library is released under the very non-restrictive MIT License and is essentially royalty free for any use, even commercial.

The complete write up is at – take a look and enjoy never needing to spend hours tweaking Fourier Transform code again.


Article By: Steve Hageman 
