Thursday, January 12, 2017

Numeric optimizations in C# for a faster DFT

Kirk: "More speed Scotty !!!"

Like Capt. Kirk, we engineers it seems, are never satisfied and always want more speed!

How to get more speed out of DFT calculations in C#

Currently the open source project: "DSPLib" [1] calculates Discrete Fourier Transforms (DFT) in double precision math, so the question can be asked: Could using some other data type produce faster DFT calculations? And by fast I mean fast enough to be worthwhile spending the time doing – like 50% faster as 10% faster would likely not be worth the effort.

After reading many articles and blogs on how the .NET framework, the CLR (Common Language Runtime) and Intel Processors handle math, one will find seemingly convincing evidence and very simple benchmarks that Double and Float math calculation takes about the same time on a modern processor. There is also some circumstantial evidence that Floating Point Math is as fast as integer math because the Floating Point Unit is so fast in modern CPU's. Likewise Int64's should theoretically be about the same speed as Int32's because people say that Intel processors calculate everything as Int64's and truncate for shorter data types anyhow.

But... How would all this apply to an actual DFT algorithm? A DFT moves around a fair amount of data in arrays. Can a different numeric format speed up the DFT calculations that I specifically want to make?

I didn't know, but I remembered my Mark Twain. He popularized the saying:  

     "There are three kinds of lies: lies, darned lies, and statistics."

Mr. Twain was a humorist, but if he was a programmer he might have changed the saying to,
     "There are three kinds of lies: lies, darned lies, and benchmarks." 

This applies aptly to what I was faced with – one can find many very simple “Benchmarks” that people have written to show that one numeric type is faster than another or that the numeric format makes no difference, the trouble is these don't store intermediate results or access arrays over and over the way a DFT does. Are they really applicable? I set out to find out…


My DFT Benchmark

I roughed out a basic DFT (All the same loops, arrays and multiples), I used look up tables instead of calculating the Sin/Cos terms in the loop as that is the way I would use them anyway and I did not do any further optimization or try to help or hinder the C# compiler from optimizing anything. I also did not invoke the task parallel extensions, which have been shown to immediately provide a 3x improvement in performance even with a dual core processor [1][2].


// Note: This is not a working DFT - For simplified timings only.
// It does have the same data structures and data movements as a real DFT.
public Int64[] DftInt64(Int64[] data)
{
    int n = data.Length;
    int m = n;
    Int64[] real = new Int64[n];
    Int64[] imag = new Int64[n];
    Int64[] result = new Int64[m];
    for (int w = 0; w < m; w++)
    {
       for (int t = 0; t < n; t++)
       {
          real[w] += data[t] * Int32Sin[t];
          imag[w] -= data[t] * Int32Cos[t];
       }
       result[w] = real[w] + imag[w];
    }
    return result;
}

Figure 1: The basic code that I used to benchmark my DFT. I wrote one of these for every numeric data type by changing the variables to the type tested. I initialized the input data and the Sin / Cos arrays (Int32Sin[] and Int32Cos[] above) with real Sin and Cos data once at the start of the program. I then called the DFT's over and over again until the elapsed time settled out, which I believe is an indication that the entire program was running in the processors cache or at least a stable portion of memory. This procedure was repeated for each DFT Length.

I wrote DFT routines to calculate in: Double, Float, Int64, Int32 and Int16 data types and compiled release code in VS2015 for x86 and x64 architectures. The results of these tests are presented in figure 2 below.

Figure 2A – The raw results of my real world benchmark for DFT's. All the times are in Milliseconds and were recorded with .NET's StopWatch() functionality. Two program builds were recorded. One for the x86 and one for the x64 code compilation types in Visual Studio 2015.

Figure 2B – Raw timing numbers are a little hard to compare and comprehend. To improve on this, the view above normalizes each row to the Double data type time for that size DFT, which makes it easier to compare any speed gains in the results. For instance: In the X64 Release version, for a DFT size of 2000, the Int16 data type took 0.28:1 the time of the Double data type (10mS / 36mS = 0.28).


Int16 Compared to Int32 
The Int16 data type timing is comparable to that of the Int32 for both the x86 and x64 compiled program. An interesting note is that the Int16, compiled as a x86 program actually starts to take longer than the Int32 as the DFT size gets really big. This is probably due to the compiled code having to continually cast and manipulate the Int16 values to keep them properly at 16 bits long.

The bottom line is: Int16's are no faster and in some cases longer than Int32's so there is really no point in considering them further I this discussion of speeding up a DFT calculation.

Using an Int16 would also severely limit the dynamic range to less than 96 dB. Many digitizers have better dynamic range than this now. I would consider this a big limiting factor.

Int32 Compared to Double 
The Int32 is much faster than a Double for small DFT's at all compilations (x86 or x64). However as the DFT size gets really big the speed difference disappears. This is especially true for the x86 Compilation where a large DFT is actually slower than the equivalent Double DFT. With x64 compilation, the Int32 is still faster even at large DFT's but it too shows this non-linear behavior as the DFT size increases. This non-linear behavior is probably due to the data arrays not fitting in the processors fastest cache as the arrays get larger.


Interesting Point #1: Benchmark timings can be data array size dependent and in this particular case are non-linear.


Int32 compared to Int64 
For the x86 compilation the Int64 is definitely a non-starter. In all cases tested the Int64 is slower than the equivalent Double calculation. This makes sense as all the address and register calculations would need to be stitched together 32 bit registers. In the x86 case the Int64 is actually slower than the Double data type for all DFT sizes tested!

With the x64 compilation the Int32 and Int64 data types are comparable across all DFT sizes.

It probably does not make any sense to favor the Int64 over an Int32 even for the increased dynamic range. A Int32 calculation can yield a 192 dB dynamic range. This is way more than most applications can use or will ever require.

Interesting Point #2: Somewhat unsurprisingly, when using the x86 Compilation, the use of Int64's is very time consuming, especially when compared to the Int32. More surprisingly is that the Int64 actually takes longer than the equivalent Double.



Float compared to Double 
In both x86 and x64 compilation the Float is quicker than the Double data type. Twice as fast at smaller DFT sizes, the speed gets comparable at very large DFT sizes. That is not the common result that you can glean from the internet. Most peoples benchmarks and discussions suggest the Double and Float data type will have the same execution time.


Interesting Point #3: Despite what the Internet says, The Float data type is faster than Double especially for small DFT sizes in this benchmark.



Conclusion 
It is clear that for the fastest DFT calculations the x64 compilation should be used no matter what. This would not seem to be a hardship for anyone as 64 bit Windows 7 (and Win 10?) pretty much rules the world right now and everything from here on out will be at least 64 bits.

Even though the Float data type is usually faster than Double, the clear winner is to use the x64 compilation and the Int32 data type.

At the smaller DFT sizes the Int32 is nearly 3 times faster and even at really large DFT sizes the Int32 is still 25% faster than the Double data type.

Using an Int32 is not a resolution hardship in most real world cases. Most digitizers output their data as an integer data type anyway and a Int32 bit data type can handle any realistic 8 to 24 bit digitizer in use today.

Since the Int32 data type can provide a dynamic range of 192 dB. That seems to be plenty for a while at any rate. As a comparison the Float data type provides about 150 dB of dynamic range.


Next Steps 
Now I have some real and verified benchmarks that can lead me to the best bet on improving the real time performance of a .NET based DFT.

The next obvious steps to optimize this further for the 'Improved' DSPLib library is,


1) Apply Task Parallel techniques to speed the Int32 DFT up for nearly no extra work as the DSPLib library already does [1].

2) Having separate Real[] and Imaginary[] intermediate arrays probably prevents the C# array bounds checker from being the most effective. Flattening these into one array with double the single dimension size will probably yield a good speed increase. This however needs to be verified (Again: Benchmark). References 3 and 4 provide some information on how to proceed here.


At least I have separated the simplistic Internet benchmarks from the real facts as it applies to my specific DFT calculation needs.

The takeaway from all this is: To really know how to optimize even a relatively simple calculation, some time needs to be spent benchmarking actual code and hardware that reflects the actual calculations and data movement required.



References:

[1] DSPLib: An open source FFT / DFT Fourier Transform Library for .NET 4

https://www.codeproject.com/articles/1107480/dsplib-fft-dft-fourier-transform-library-for-net
DSPLib also applies modern multi-threading techniques to get the highest speed possible for any processor as explained in the article above. 

[2] The computer that I ran these benchmarks on is a lower end dual core i7-5500U processor with 8 GB of RAM running 64 Bit, Windows 7 Pro. Higher end processors generally have more on board Cache and will generally give faster results especially at larger DFT sizes.

[3] Dave Detlefs, “Array Bounds Check Elimination in the CLR”


[4] C# Flatten Array

https://www.dotnetperls.com/flatten-array

NOTE: The C# Compiler is always being worked on for optimization improvements. C# V7 is just around the corner in January 2017 and it has some substantial changes, these changes may also improve or change the current optimization schemes. If in doubt - always check what the current compiler does.

Article By: Steve Hageman www.AnalogHome.com 
We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.

This Blog does not use cookies (other than the edible ones). 

No comments:

Post a Comment