Like Capt. Kirk, we engineers, it seems, are never satisfied and always want more speed!
How to get more speed out of DFT calculations in C#
Currently, the open source project "DSPLib" [1] calculates Discrete Fourier Transforms (DFTs) in double precision math, so the question can be asked: could using some other data type produce faster DFT calculations? And by faster I mean fast enough to be worth the effort spent – something like 50% faster; a 10% improvement would likely not be worth the trouble.
After reading many articles and blogs on how the .NET Framework, the CLR (Common Language Runtime) and Intel processors handle math, one can find seemingly convincing evidence, backed by very simple benchmarks, that Double and Float calculations take about the same time on a modern processor. There is also some circumstantial evidence that floating point math is as fast as integer math, because the floating point unit in a modern CPU is so fast. Likewise, Int64s should theoretically be about the same speed as Int32s, because Intel processors are said to calculate everything at 64 bits and truncate for the shorter data types anyway.
But... how would all this apply to an actual DFT algorithm? A DFT moves around a fair amount of data in arrays. Can a different numeric format speed up the DFT calculations that I specifically want to make?
I didn't know, but I remembered my Mark Twain. He popularized the saying:
"There are three kinds of lies: lies, darned lies, and statistics."
Mr. Twain was a humorist, but if he had been a programmer he might have changed the saying to:
"There are three kinds of lies: lies, darned lies, and benchmarks."
This applies aptly to what I was faced with – one can find many very simple "benchmarks" that people have written to show that one numeric type is faster than another, or that the numeric format makes no difference at all. The trouble is that these benchmarks don't store intermediate results or access arrays over and over the way a DFT does. Are they really applicable? I set out to find out…
My DFT Benchmark
I roughed out a basic DFT (all the same loops, arrays and multiplies). I used lookup tables instead of calculating the Sin/Cos terms in the loop, as that is the way I would use them anyway, and I did no further optimization, nor did I try to help or hinder the C# compiler from optimizing anything. I also did not invoke the Task Parallel extensions, which have been shown to immediately provide a 3x improvement in performance even on a dual core processor [1][2].
// Note: This is not a working DFT - For simplified timings only.
// It does have the same data structures and data movements as a real DFT.
public Int64[] DftInt64(Int64[] data)
{
    int n = data.Length;
    int m = n;
    Int64[] real = new Int64[n];
    Int64[] imag = new Int64[n];
    Int64[] result = new Int64[m];

    for (int w = 0; w < m; w++)
    {
        for (int t = 0; t < n; t++)
        {
            real[w] += data[t] * Int32Sin[t];
            imag[w] -= data[t] * Int32Cos[t];
        }
        result[w] = real[w] + imag[w];
    }
    return result;
}
Figure 1: The basic code that I used to benchmark my DFT. I wrote one of these for every numeric data type by changing the variables to the type being tested. I initialized the input data and the Sin/Cos arrays (Int32Sin[] and Int32Cos[] above) with real Sin and Cos data once at the start of the program. I then called the DFTs over and over again until the elapsed time settled out, which I believe is an indication that the entire program was running in the processor's cache, or at least a stable portion of memory. This procedure was repeated for each DFT length.
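For reference, here is a minimal sketch of the kind of timing harness described above. The table scaling, DFT sizes and iteration counts shown are illustrative assumptions, not the exact code used to produce the published numbers:

// Minimal sketch of the timing harness (illustrative only).
using System;
using System.Diagnostics;

class DftBenchmark
{
    // Lookup tables, filled once at startup and shared by the DFT under test.
    Int64[] Int32Sin;
    Int64[] Int32Cos;

    public void InitTables(int n)
    {
        Int32Sin = new Int64[n];
        Int32Cos = new Int64[n];
        for (int t = 0; t < n; t++)
        {
            // Scale the +/-1.0 Sin/Cos values into an integer range before storing
            // (the scale factor here is an arbitrary illustrative choice).
            Int32Sin[t] = (Int64)(Math.Sin(2.0 * Math.PI * t / n) * 32767.0);
            Int32Cos[t] = (Int64)(Math.Cos(2.0 * Math.PI * t / n) * 32767.0);
        }
    }

    // DftInt64() from Figure 1 goes here; this stub just keeps the sketch compiling.
    public Int64[] DftInt64(Int64[] data) { /* body from Figure 1 */ return new Int64[data.Length]; }

    public double TimeMilliseconds(int n, int iterations)
    {
        InitTables(n);
        Int64[] data = new Int64[n];        // input buffer; contents don't matter for timing

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            DftInt64(data);                 // call repeatedly until the elapsed time settles out
        sw.Stop();

        return sw.ElapsedMilliseconds / (double)iterations;
    }
}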
I wrote DFT routines to calculate in Double, Float, Int64, Int32 and Int16 data types and compiled release code in VS2015 for the x86 and x64 architectures. The results of these tests are presented in Figure 2 below.
Figure 2A – The raw results of my real world benchmark for DFTs. All the times are in milliseconds and were recorded with .NET's StopWatch() functionality. Two program builds were recorded: one for the x86 and one for the x64 code compilation types in Visual Studio 2015.
Figure 2B – Raw timing numbers are a little hard to compare and comprehend. To improve on this, the view above normalizes each row to the Double data type time for that size DFT, which makes it easier to compare any speed gains in the results. For instance: in the x64 release version, for a DFT size of 2000, the Int16 data type took 0.28:1 the time of the Double data type (10 mS / 36 mS = 0.28).
Int16 Compared to Int32
The Int16 data type timing is comparable to that of the Int32 for both the x86 and x64 compiled programs. An interesting note is that the Int16, compiled as an x86 program, actually starts to take longer than the Int32 as the DFT size gets really big. This is probably due to the compiled code having to continually cast and manipulate the Int16 values to keep them properly at 16 bits long.
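A quick illustration of where those extra casts come from: C# performs integer arithmetic on Int16 (short) operands as 32-bit int, so storing a result back into a 16-bit variable or array element requires an explicit cast, which shows up as extra conversion work in the inner loop. The small arrays below are purely hypothetical:

// C# promotes Int16 (short) operands to int before doing arithmetic,
// so results must be explicitly cast back down to fit a 16-bit destination.
short a = 1000;
short b = 30;
// short c = a * b;              // does not compile: "a * b" has type int
short c = (short)(a * b);        // explicit cast back to 16 bits is required

// The Int16 version of the benchmark inner loop needs the same kind of cast:
Int16[] real = new Int16[8];
Int16[] data = { 1, 2, 3, 4, 5, 6, 7, 8 };
Int16[] sin  = { 1, 1, 1, 1, 1, 1, 1, 1 };
real[0] = (Int16)(real[0] + data[0] * sin[0]);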
The bottom line is: Int16s are no faster, and in some cases slower, than Int32s, so there is really no point in considering them further in this discussion of speeding up a DFT calculation.
Using an Int16 would also severely limit the dynamic range to less than 96 dB. Many digitizers have better dynamic range than this now. I would consider this a big limiting factor.
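As a quick sanity check on the dynamic range numbers quoted here and below, the usual rule of thumb for an N-bit integer format is 20·log10(2^N), or roughly 6.02 dB per bit:

// Rough dynamic range of an N-bit integer format: 20 * log10(2^N), about 6.02 dB per bit.
using System;

class DynamicRange
{
    static double DynamicRangeDb(int bits)
    {
        return 20.0 * Math.Log10(Math.Pow(2.0, bits));
    }

    static void Main()
    {
        Console.WriteLine(DynamicRangeDb(16));   // ~96.3 dB  (Int16)
        Console.WriteLine(DynamicRangeDb(32));   // ~192.7 dB (Int32)
    }
}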
Int32 Compared to Double
The Int32 is much faster than a Double for small DFTs at all compilations (x86 or x64). However, as the DFT size gets really big the speed difference disappears. This is especially true for the x86 compilation, where a large Int32 DFT is actually slower than the equivalent Double DFT. With x64 compilation the Int32 is still faster even at large DFT sizes, but it too shows this non-linear behavior as the DFT size increases. This non-linear behavior is probably due to the data arrays not fitting in the processor's fastest cache as the arrays get larger.
Interesting Point #1: Benchmark timings can be data array size dependent, and in this particular case are non-linear.
Int32 compared to Int64
For the x86 compilation the Int64 is definitely a non-starter: for all DFT sizes tested, the Int64 is actually slower than the equivalent Double calculation. This makes sense, as all the address and register calculations would need to be stitched together from 32-bit registers.
With the x64 compilation the Int32 and Int64 data types are comparable across all DFT sizes.
It probably does not make any sense to favor the Int64 over an Int32, even for the increased dynamic range. An Int32 calculation can yield a 192 dB dynamic range, which is way more than most applications can use or will ever require.
Interesting Point #2: Somewhat unsurprisingly, when using the x86 compilation, the use of Int64s is very time consuming, especially when compared to the Int32. More surprising is that the Int64 actually takes longer than the equivalent Double.
Float compared to Double
In both x86 and x64 compilation the Float is quicker than the Double data type: twice as fast at smaller DFT sizes, becoming comparable at very large DFT sizes. That is not the common result you can glean from the internet, where most people's benchmarks and discussions suggest that the Double and Float data types will have the same execution time.
Interesting Point #3: Despite what the Internet says, the Float data type is faster than Double, especially for small DFT sizes, in this benchmark.
Conclusion
It is clear that for the fastest DFT calculations the x64 compilation should be used no matter what. This should not be a hardship for anyone, as 64 bit Windows 7 (and Win 10?) pretty much rules the world right now, and everything from here on out will be at least 64 bits.
Even though the Float data type is usually faster than Double, the clear winner is to use the x64 compilation and the Int32 data type. At the smaller DFT sizes the Int32 is nearly 3 times faster, and even at really large DFT sizes the Int32 is still 25% faster than the Double data type.
Using an Int32 is not a resolution hardship in most real world cases. Most digitizers output their data as an integer data type anyway, and an Int32 data type can handle any realistic 8 to 24 bit digitizer in use today.
The Int32 data type can provide a dynamic range of 192 dB, which seems to be plenty for a while at any rate. As a comparison, the Float data type provides about 150 dB of dynamic range.
Next Steps
Now I have some real and verified benchmarks that can lead me to the best bet on improving the real time performance of a .NET based DFT.
The next obvious steps to optimize this further for the 'Improved' DSPLib library are:
1) Apply Task Parallel techniques to speed up the Int32 DFT for nearly no extra work, as the DSPLib library already does [1].
2) Having separate Real[] and Imaginary[] intermediate arrays probably prevents the C# array bounds checker from being as effective as it could be. Flattening these into one array of double the single-dimension size will probably yield a good speed increase. This, however, needs to be verified (again: benchmark). References [3] and [4] provide some information on how to proceed here; a rough sketch of both ideas follows this list.
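Here is a rough sketch of what those two ideas might look like together: Parallel.For over the outer loop plus a single flattened array holding the real and imaginary accumulators. This is an assumption of how the improved code could be structured, not DSPLib's actual implementation, and it follows the same "not a working DFT" simplification as Figure 1:

// Sketch only: parallel outer loop plus one flattened Re/Im accumulator array.
using System;
using System.Threading.Tasks;

// Lookup tables, filled once at startup as in Figure 1.
static Int32[] Int32Sin;
static Int32[] Int32Cos;

public static Int64[] DftInt32Parallel(Int32[] data)
{
    int n = data.Length;
    // One flattened array: indexes [0..n-1] hold the real parts, [n..2n-1] the imaginary parts.
    Int64[] reIm = new Int64[2 * n];
    Int64[] result = new Int64[n];

    // Each output bin w is independent, so the outer loop can run in parallel safely.
    Parallel.For(0, n, w =>
    {
        for (int t = 0; t < n; t++)
        {
            reIm[w]     += (Int64)data[t] * Int32Sin[t];
            reIm[w + n] -= (Int64)data[t] * Int32Cos[t];
        }
        result[w] = reIm[w] + reIm[w + n];
    });
    return result;
}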
At least I have separated the simplistic Internet benchmarks from the real facts as they apply to my specific DFT calculation needs.
The takeaway from all this is: to really know how to optimize even a relatively simple calculation, some time needs to be spent benchmarking actual code and hardware that reflects the actual calculations and data movement required.
References:
[1] DSPLib – FFT / DFT Fourier Transform Library for .NET, https://www.codeproject.com/articles/1107480/dsplib-fft-dft-fourier-transform-library-for-net
DSPLib also applies modern multi-threading techniques to get the highest speed possible for any processor, as explained in the article above.
[2] The computer that I ran these benchmarks on is a lower end dual core i7-5500U processor with 8 GB of RAM, running 64 bit Windows 7 Pro. Higher end processors generally have more on board cache and will generally give faster results, especially at larger DFT sizes.
[3] Dave Detlefs, "Array Bounds Check Elimination in the CLR"
[4] C# Flatten Array, https://www.dotnetperls.com/flatten-array
NOTE: The C# compiler is always being worked on for optimization improvements. C# 7 is just around the corner (January 2017) and it has some substantial changes; these changes may also improve or change the current optimization schemes. If in doubt, always check what the current compiler does.
Article by: Steve Hageman, www.AnalogHome.com
We design custom: Analog, RF and Embedded systems for a wide variety of industrial and commercial clients. Please feel free to contact us if we can help on your next project.
This Blog does not use cookies (other than the edible ones).