Sample Integrated Fourier Transform (SIFT): An approach to low latency, low power, high performance FT for real time signal processing

Trang K. Ta1, Walter Pelton2, Pochang Hsu, Nipa Yossakda3
1Fujitsu Microelectronics, CA, USA – email: tta@fmi.fujitsu.com
2Agnitio Technologies, CA, USA – email: Agnitio@home.com
3Northwestern Polytechnic University, CA, USA
The Fast Fourier Transform (FFT) is widely used in DSP applications because of its computationally efficient algorithm. The significant advantage of the FFT over a direct implementation of the Discrete Fourier Transform (DFT) is the number of required Multiply-ACcumulate (MAC) cycles, as shown clearly in Table 1. However, the FFT is a batch process and requires more than N*Log(N) cycles [1, 2] to complete an N-point transform. Any FFT hardware implementation must provide a bit-reversal sorting mechanism, which is complicated [3] regardless of the number of points, and the batch process cannot begin until data collection is complete.
This paper introduces the SIFT paradigm, a synergistic algorithm-plus-architecture implementation of the Fourier coefficients based on the DFT. SIFT uses a transactional process that requires neither sample storage nor register-addressing hardware to compute the Fourier Transform, attaining very low latency, low power, and high performance. A 64-point SIFT was built from discrete ICs; a photograph of it is shown in Figure 14. An ASIC core of the design was completed and fully synthesized in a 0.18 µm process. The simulated results of the core are presented in the following sections. A 1024-point complex FT based on 32-point SIFT cells completes a transform in 3.2 µsec, performing 20 GMACs/s while dissipating 500 mW and requiring an area of 16 mm2 when fabricated in a 0.18 µm process.
An overview of DFT and FFT
The FFT, based on decimation, and the FFTW, based on divide and conquer, remain the most efficient algorithms for converting information from the time domain to the frequency domain on Von Neumann, Harvard, RISC, and DSP processors. This important analytical tool is used in many fields, such as acoustics, communications, and signal and image processing. An overview of the DFT, the FFT, and SIFT, together with implementation results, is presented in this paper.
Equations (1) and (2) are used to compute the coefficients of the DFT, the digital approximation to the Fourier series that is used in practice for most analog situations. The function S(xj) is a finite set of evenly spaced samples, assembled in a memory to be processed.
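Written out, with S(xj) the j-th sample and fi the i-th frequency (the form follows from the coefficient expansion given later in the paper):

    Ai = Σ (j = 1..N) S(xj) * cos(2π*fi*xj/N)    (1)
    Bi = Σ (j = 1..N) S(xj) * sin(2π*fi*xj/N)    (2)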
The number of samples in a group, N, is commonly a power of two, e.g. 8, 16, 32, 64, …, 1024.
Two coefficients are calculated for the frequency whose period equals the time of N samples (f0), called the base frequency. The same format repeats for 2*f0 up to N/2*f0, so the computation provides the same number of coefficients as the number of samples. For all cases where N is even and the samples are real-valued, at fi = 0 and fi = N/2*f0 the factor sin(2π*fi*xj/N) is zero for every sample index (and at fi = 0, cos(2π*fi*xj/N) = 1), so the corresponding Bi are zero by identity.
The Discrete Fourier Transform procedure is to multiply each sample by the sine and by the cosine of the value of the independent variable times the rate, and to sum over all of the samples. This requires N MACs for each of the N coefficients, or N^2 MACs per DFT.
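As a concrete illustration, a minimal Python sketch of this direct evaluation (illustrative only, not the paper's implementation) makes the nested loops, and hence the N^2 MAC count, explicit:

    import math

    def dft_coefficients(samples):
        """Direct DFT per equations (1) and (2): N MACs per coefficient, N^2 in total."""
        N = len(samples)
        A = [0.0] * N  # cosine coefficients
        B = [0.0] * N  # sine coefficients
        for i in range(N):                    # one pass per coefficient...
            for j, s in enumerate(samples):   # ...over all N samples
                angle = 2.0 * math.pi * i * j / N
                A[i] += s * math.cos(angle)   # one MAC for the cosine term
                B[i] += s * math.sin(angle)   # one MAC for the sine term
        return A, B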
Examination of the process shows that many of the values are the same, due to the periodicity of the trigonometric functions. The samples to be multiplied by equal values can be multiplied once, and the result reused in the locations that share that value. This is the basis of the butterfly, which reduces the number of multiplies from N^2 to N*Log(N) in the FFT.
Table 1. MAC cycles vs. number of points in the transform

    N (# of points)    DFT: N*N       FFT: N*Log(N)
    8                  64             24
    16                 256            64
    32                 1,024          160
    64                 4,096          384
    128                16,384         896
    256                65,536         2,048
    512                262,144        4,608
    1024               1,048,576      10,240
FFT Butterfly: bit-reversal sorting algorithm and sample dependency
In current systems, a multiply-add may require the same time as an add. The butterfly requires a complex structure [1, 2, 3] for storing and addressing the samples and the values of the functions of the independent variables. The complexity exists because the FFT butterfly operates by decomposing an N-point time domain signal into N single-point time domain signals using a bit-reversal sorting algorithm; the N frequency spectra are then combined in the exact reverse of the order in which the time domain decomposition took place. As a result, the FFT requires hardware to accommodate the sorting algorithm and more than N*Log(N) cycles to complete an N-point transform [1, 2]. The sorting hardware already exists in general purpose computing architectures. The graph in Figure 2 shows the number of cycles required by SIFT and by the FFT butterfly in hardware implementations. For a 32-point transform, completing 32*Log(32) = 160 calculations takes the FFT butterfly 733 cycles in single precision (the FFT(s) curve) and 2097 cycles in double precision [1] (the FFT(d) curve), versus 1024 cycles for SIFT. When the number of points falls below 16, SIFT requires the smallest number of cycles to complete a transform.
The sorting and switching hardware present in a general-purpose computer adds to the cycle time and power requirements. As will be shown in the 1024-point structure, parallelism comes easily with SIFT, and the number of cycles may be reduced dramatically.
Note that the multiplies in the FFT butterfly require a specific sample sequence that is very different from the order of arrival, and calculation of each coefficient requires input from all of the samples. Figure 1 shows that a load instruction is always required before an execution instruction in an FFT machine, whereas SIFT allows continuous execution. In an FFT design, all of the samples must be present before the process can begin [3]. This batch requirement derives from the FFT bit-reversal sorting algorithm. The time from the arrival of the last sample until the availability of the coefficients is referred to as the latency (hidden time).
Sample Integrated Fourier Transform paradigm
Expansion of equation (1) yields a set of N coefficients A1, A2, …, AN. A similar expansion can be done for equation (2):

A1 = S1 cos(2πf1*1/N) + S2 cos(2πf1*2/N) + … + SN cos(2πf1*N/N)
A2 = S1 cos(2πf2*1/N) + S2 cos(2πf2*2/N) + … + SN cos(2πf2*N/N)
…
AN = S1 cos(2πfN*1/N) + S2 cos(2πfN*2/N) + … + SN cos(2πfN*N/N)
Noticing that the first term of every coefficient A is contributed by the first sample, it is possible to compute and store the first component of all the coefficients upon the arrival of the first sample. If this can be done before the arrival of the second sample, it follows that each coefficient can be updated again before the arrival of the third sample. Extending this procedure to the Nth sample (Figure 5), we can complete and output the N coefficients by the time the first sample of the next set arrives; hence the name Sample Integrated Fourier Transform. The SIFT paradigm results in several advantages for hardware implementation.
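A minimal Python model of this transactional process (an illustrative sketch, not the ASIC datapath) shows how each arriving sample immediately adds its term to every coefficient:

    import math

    class StreamingFT:
        """Sketch of the SIFT update: every sample updates all N coefficients."""
        def __init__(self, N):
            self.N = N
            self.j = 0          # arrival index of the next sample
            self.A = [0.0] * N  # cosine coefficients
            self.B = [0.0] * N  # sine coefficients

        def push(self, sample):
            """Consume one sample on the fly; the sample is never stored."""
            for i in range(self.N):
                angle = 2.0 * math.pi * i * self.j / self.N
                self.A[i] += sample * math.cos(angle)
                self.B[i] += sample * math.sin(angle)
            self.j += 1
            return self.j == self.N  # True when the transform is complete

Provided the N updates inside push() finish within one sample period, the full transform is available as the last sample arrives, which is the low-latency property described above.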
The first advantage of this procedure is shown in Figure 3. If the available computational resource is just capable of completing these steps as the samples arrive, it will finish at the end of the Nth sample time. If the same resource is instead dedicated to a computation beginning after the last sample arrives, it will not finish until the last sample time of the next sample period. The new paradigm has lower latency than the FFT because it is a transactional process, whereas the FFT algorithm is a batch process.
Secondly, when a sample arrives it is used to update the N coefficients, and then that sample is no longer needed. There is no need to store or address samples except while each is the current sample. Because the samples are simply consumed on the fly, there is no need for elaborate special-purpose hardware to store, address, and process them. The situation is illustrated in Figure 11. This yields an algorithm with a smaller processing requirement.
Thirdly, each coefficient may be updated on the arrival of the next sample by subtracting the oldest contribution and adding the contribution from the current sample. This permits a sequence of complete transforms, updated at each sample time, without additional computational overhead. The situation is illustrated in Figure 6: the beginning of the window has moved over, so the convolution with the aperture is different, but the wave represented is the same. The information content is latent in this case; there is no loss of information compared to the classical Fourier Transform.
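Extending the earlier sketch illustrates this sliding update (again hypothetical code; note that the last N samples must be kept somewhere so that their old contributions can be retired):

    import math
    from collections import deque

    class SlidingFT:
        """Sliding-window variant: a complete transform at every sample time."""
        def __init__(self, N):
            self.N = N
            self.j = 0              # arrival index of the next sample
            self.window = deque()   # last N samples, kept only to retire old terms
            self.A = [0.0] * N
            self.B = [0.0] * N

        def push(self, sample):
            if len(self.window) == self.N:
                oldest = self.window.popleft()
                k = self.j - self.N  # arrival index of the retiring sample
                for i in range(self.N):
                    angle = 2.0 * math.pi * i * k / self.N
                    self.A[i] -= oldest * math.cos(angle)  # subtract oldest contribution
                    self.B[i] -= oldest * math.sin(angle)
            for i in range(self.N):
                angle = 2.0 * math.pi * i * self.j / self.N
                self.A[i] += sample * math.cos(angle)      # add current contribution
                self.B[i] += sample * math.sin(angle)
            self.window.append(sample)
            self.j += 1

Because cos(2πi(j-N)/N) = cos(2πij/N), the subtracted term is exactly the term that was added N samples earlier, so each push costs only 2N extra MACs rather than a full recomputation.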
Hardware Implementation and Results
Figure 4 shows the block diagram of the 64-point SIFT. The chip is divided into five units: the Control unit, the Aspect Generator, the Multiplier, the Adder/Subtractor, and the RAM. The memory is a 64-entry, 18-bit RAM [7]. The Control unit generates all the control signals for the chip. A 13-bit binary counter in the Control unit sequences the timing of the other blocks. The LSB of the counter is used as a timing signal in the design; counter bits [6:1] sequentially select which of the sixty-four coefficients is updated with the current sample, and bits [12:7] assign the sample number and the appropriate aspect function. The counter's maximum value enables the output of the coefficients as they complete their final updates. The Aspect Generator (Figure 9) consists of a six-bit counter that generates the sequences of cosine and sine values. The sample number from the main counter is used as the increment value for this six-bit counter: to generate the first sequence, the counter increments by one count; for the 2nd, 3rd, 4th, …, Nth sequences it increments by 2, 3, 4, …, N. The output of the Aspect Generator is thus called the "sequence" and is 6 bits wide. The two MSBs, bits 5 and 4, decode the quadrant of the aspect value. The Multiplier unit multiplies the aspect function by the absolute value of the sample, then divides the product by 128. The division is a seven-bit shift, achieved by a wiring offset; it normalizes the result, simplifies the hardware implementation, and allows the use of a pure integer multiplier. The Adder/Subtractor unit consists of a fast adder and a controlled inverter that performs 2's complement conversion arithmetic.
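A rough Python model of the Aspect Generator and Multiplier path follows. The quarter-wave table and its 128x scaling are our assumptions; the paper specifies only the six-bit counter, its per-sequence increment, the quadrant decode from bits 5 and 4, and the divide-by-128 shift:

    import math

    N = 64
    # Assumed quarter-wave table of aspect magnitudes, scaled by 128
    # (the divide-by-128 normalization suggests this scaling).
    QUARTER = [round(128 * math.cos(2.0 * math.pi * k / N)) for k in range(16)]

    def sequence(i, j):
        # Six-bit counter incremented by i on every sample j: (i*j) mod 64.
        return (i * j) % N

    def aspect_cos(seq):
        q, k = seq >> 4, seq & 0xF  # bits [5:4] decode the quadrant
        if q == 0:
            return QUARTER[k]
        if q == 1:
            return 0 if k == 0 else -QUARTER[16 - k]
        if q == 2:
            return -QUARTER[k]
        return 0 if k == 0 else QUARTER[16 - k]

    def aspect_sin(seq):
        return aspect_cos((seq - 16) % N)  # sin(x) = cos(x - pi/2)

    def mac_term(sample, seq):
        # Integer multiply, then divide by 128 via a 7-bit shift; in the chip,
        # the controlled inverter handles signs in 2's complement arithmetic.
        return (abs(sample) * aspect_cos(seq)) >> 7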
The simulated waveforms of the forward SIFT for sin(x) are shown in Figure 10. Figures 12 and 13 show the inverse FT waveforms of A*cos(2π*f1*x/N) + B*sin(2π*f2*x/N) at the frequencies (f1, f2) = (16, 63) and (31, 63), respectively. The functional results have been verified against MATLAB simulations and against hardware results from the discrete-IC SIFT board design shown in Figure 14. In a 0.18 µm process, the 64-point SIFT runs at 330 MHz, has zero latency, attains a 13 µsec execution time, and dissipates 8 mW. The 64-point SIFT core has an area of 0.21 mm2 and is shown in Figure 7.
Figure 8 shows the top-level diagram of a 1024-point complex SIFT. This design is built from 64 32-point SIFT cells. In a 0.18 µm CMOS process, the design delivers a sustained rate of 20 GMACs/s and occupies an area of 16 mm2. At the maximum throughput of 320 million samples per second it delivers 320 million coefficients per second, producing a full transform every 3.2 µsec. At this sample rate it dissipates 500 mW; at lower sample rates the power is proportionately less.
The structure operates by splitting the problem into two phases. The incoming samples are handled as 32 interleaved transforms; the resulting coefficients are then taken, level by level, as 32 interleaved transforms in the other direction, as in a two-dimensional Fourier Transform. This introduces a latency of one.
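Although the paper does not spell out the arithmetic between the two phases, a 32 x 32 two-phase structure matches the standard Cooley-Tukey factorization for N = 1024 = 32*32. Writing n = 32*n1 + n2 and k = k1 + 32*k2 (our reconstruction; the twiddle factors between the phases are implied, not stated):

    X[k1 + 32*k2] = Σ (n2 = 0..31) W1024^(n2*k1) * [ Σ (n1 = 0..31) x[32*n1 + n2] * W32^(n1*k1) ] * W32^(n2*k2),
    where WM = e^(-2πi/M)

The inner sums are the 32 interleaved transforms over the arriving samples (stream n2 takes every 32nd sample); the outer sums are the 32 transforms in the other direction.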
At all sample rates, this design is the smallest and most energy-efficient 1024-point Fourier Transform solution announced to date.
References:
[1] M. Hannah et al., "Implementation of the Double Precision Complex FFT for the TMS320C54x DSP," Texas Instruments Application Report SPRA554B, August 1999.
[2] G. R. L. Sohie et al., "Implementation of Fast Fourier Transform on Motorola's Digital Signal Processors," Motorola High Performance DSP Technology, APR4/D, rev. 3, sections 4, 6, 8.
[3] S. W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, 1997, pp. 228-232.
[4] B. M. Baas, "A 9.5 mW 330 µsec 1024-point FFT processor," CICC, 1998.
[5] R. N. Bracewell, The Fourier Transform and Its Applications, 2nd ed., New York, 1986.
[6] J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Mathematics of Computation, vol. 19, pp. 297-301, April 1965.
[7] T. K. Ta et al., "Dual Port SRAM Design: An Overview," IBM Circuit TTL Conference, 1991.
Figure 1: SIFT pipeline vs. FFT butterfly pipeline
Figure 2: Number of cycles vs. points for SIFT, single precision FFT(s), and double precision FFT(d)
Figure 3: SIFT allows continuous execution as samples arrive
Figure 4: Block diagram of the 64-point SIFT
Figure 5: Each sample contributes to the coefficients upon arrival
Figure 6: Coefficients can be updated by subtracting the oldest contribution
Figure 7: Micrograph of the 64-point FT core
Figure 8: Architecture of the 1024-point FT
Figure 9: Main portion of the Aspect Generator
Figure 10: FT of sin(x)
Figure 11: No cache or buffers required for SIFT, as opposed to the FFT butterfly
Figure 12: Inverse FT of A*cos(2πf1x/N) + B*sin(2πf2x/N) at frequencies (16, 63)
Figure 13: Inverse FT of E*cos(2πf1x/N) + F*sin(2πf2x/N) at frequencies (31, 63)
Figure 14: Photograph of the discrete-IC, 64-point SIFT design