## Written Examination TSTE87 2017-05-29

- 1. (a) Give two examples of use cases where a dedicated implementation may be advantageous compared to using general purpose DSP processors.
  - vantageous compared to using general purpose DSP processors. (2 p)
    (b) Define latency and execution time. (2 p)
  - (c) Name exactly two features that are typically found in general purpose DSP processors, but not in CPUs/microcontrollers. (2 p)
  - (d) Explain how a minimum signed-digit representation and the canonic signed-digit representation differs from the general signed-digit representation. (2 p)
  - (e) Name and describe one way to accelerate a carry-propagation adder. (2 p)
- 2. In an FFT, there is a need for bit-reversal of a sequence consisting of 16 samples, arriving two at a time. Hence, in time slot 1, samples 0 and 1 arrives, in time slot 2, samples 2 and 3 and so on. The outputs should be in bit-reversed order, so the first samples to leave the bit-reversal circuits is sample 0 (=0000 bit-reversed) and sample 8 (=0001 bit-reversed). After that, sample 4 (=0010 bit-reversed) and sample 12 (=0011 bit-reversed) should leave the circuit, and so on.
  - (a) Draw a memory life time graph for the bit-reversal circuit. (6 p)
  - (b) Divide the variables between a minimal number of memories. For each memory it is possible to read one variable and write one variable in each cycle. (4 p)
  - (c) For each memory, assign the variables to a minimal number of memory cells. It is possible to read from and write to the same memory position in a cycle. (4 p)
- 3. Consider the digital filter algorithm in Fig. 1.



Figure 1: Digital filter for use in Problems 3.

- (a) Determine the minimal sample period,  $T_{\min}$ . (2 p)
- (b) Determine the critical path,  $T_{cp}$ . (2 p)
- (c) Apply pipelining and retiming to obtain an algorithm with  $T_{cp} = T_{\min}$ . (4 p)

4. The digital filter algorithm in Fig. 2 shall be implemented on a shared-memory architecture with homogeneous non-preemptive processing elements. There are two types of processing elements: multipliers and adders. The latency of a multiplier is two clock cycles and the latency of an adder is one clock cycle. Both types of processing elements have an execution time of two clock cycles.



Figure 2: Digital filter for use in Problems 4 and 5.

- (a) Determine the minimal sample period,  $T_{\min}$ . (2 p)
- (b) Determine the critical path,  $T_{cp}$ . (2 p)
- (c) Determine the precedence graph. (6 p)
- (d) Schedule the algorithm for  $T_s = T_{\min}$  and a reasonably efficient resource utilization. Make sure that you indicate all the connections in the schedule. (8 p)
- 5. The filter in Fig. 2 is now to be implemented using distributed arithmetic. The filter coefficients are,  $a = -\frac{5}{8}$ ,  $b = -\frac{5}{4}$ , and  $c = \frac{3}{4}$ .
  - (a) Perform retiming to reduce the number of distributed arithmetic units (two are enough). (2 p)
  - (b) Determine the state-space representation of the retimed algorithm. (4 p)
  - (c) Draw the resulting architecture using building blocks such as shift-registers, shift-accumulators, and ROMs. (4 p)
  - (d) Determine the ROM contents in a named binary representation. (4 p)
- 6. Consider the implementation of a spectrometer for radio astronomy. Such a system consists of a long length FFT and for each FFT computation, the each individual FFT output (a bin/tap) is squared and accumulated to the sum of squared taps from the previous FFT. This means that for an N-point FFT, there will be N accumulated sums of squares that must be stored. The spectrometer should be implemented in an FPGA with 2016 multipliers and 38304 kb of memory. The amount of memory required to compute an N-point FFT is  $\approx N$  complex samples. Assume a data word length of 18 bits (18 bits real and 18 bits imaginary), both for the FFT computation and the sum of squares, that a complex multiplication requires four multipliers, a complex square two multipliers, and a clock frequency of 500 MHz for the FPGA.
  - (a) Determine the maximum FFT length (power of two) assuming a data rate of 4 GSa/s. (4 p)
  - (b) Depending on if the number of multipliers or amount of memory was limiting in the previous question: where is the break-even point in terms of data rate for the same FFT length? That is, the data rate where both the memory and the multipliers are fully used (within reasonable limits). (2 p)