## Exercise 1

The two signal flow graphs in Figure 1 realizes the same function. The delay for a multiplication and an addition is $4 t_{d}$ and $2 t_{d}$, respectively. The input signal $x(n)$ must be held at constant value $x(j)$ during a certain time, before the new value $x(j+1)$ is allowed to occur at the input. The theoretical minimal time (neglecting propagation delay through delay elements) between the values of $x(j)$ and $x(j+1)$ is

$$
T_{\text {min }}=\max _{i}\left\{t_{i} / N_{i}\right\}
$$

where $T_{i}$ is the total delay due to the arithmetic operations, and $N_{i}$ is the number of delay elements in the directed loop $i$.


Figure 1. Two different signal flow graphs of the same function.
a) Determine $T_{\min (a)}$ and $T_{\min (b)}$. Neglect the propagation delay through the delay elements.
b) Which signal flow graph results in the fastest circuit?
c) The lowest possible power supply voltage of the circuit realized from Figure 1a) is 2.0 V for a given demand of performance (throughput). Determine the lowest possible power supply voltage of the circuit realized from Figure 1b) when $V_{t}=0.35 \mathrm{~V}$ and $t_{d} \propto V_{d d} /\left(V_{d d}-V_{t}\right)^{2}$. Both circuits should have their delay elements placed as in Fig. 1.
d) Determine $P_{(b)} / P_{(a)}$.

## Exercise 2

a) Loop unroll the signal flow graph in Figure 2 and use arithmetic transformations to shorten the critical path.


Figure 2. Signal flow graph.
b) What are the advantages of unrolling?

## Exercise 3

In Figure 3, a shift register is shown. The number of registers is 64 and the clock frequency is 1.00 GHz . The D flip-flops are all positive edge triggered. The set up time $t_{s u}$ and the hold time $t_{h}$ for the D flip-flops are 0.17 ns and 0.05 ns , respectively. The new data is propagated to the output 0.14 ns after the rising clock edge (i.e. $t_{c q}=0.14 \mathrm{~ns}$ ). $V_{d d}$ $=2.5 \mathrm{~V} . V_{t}=0.30 \mathrm{~V}$. The propagation delay in the chosen process technology is in proportion to $V_{d d} /\left(V_{d d}-V_{t}\right)^{1.5}$.


Figure 3. Shift register.
a) Determine the lowest possible power supply voltage for the original circuit. How large is the relative power consumption?
b) Introduce interleaving in the shift register ( $2 \times 32$ ). You may use one multiplexer and two clock signals. The throughput should be the same as in the original circuit. Assume that the data stream $d=\left[d_{0}, d_{1}, d_{2}, d_{3}, \ldots\right]$ can be divided into two data streams $d_{a}=\left[d_{0}, d_{2}, d_{4}, \ldots\right]$ and $d_{b}=\left[d_{1}, d_{3}, d_{5}, \ldots\right]$ (How?). Sketch the circuit and the waveforms. Determine the relative power consumption using the same supply voltage as in a) compared with the original voltage. Neglect the propagation delay and the power consumption of the multiplexer.
c) Determine the lowest possible power supply voltage of the interleaved circuit. How large is the relative power consumption compared with the original circuit?

## Exercise 4

A four-input multiplexer realized by three two-input multiplexers is shown in Figure 4. The inputs are periodically selected in the order $n_{0}, n_{1}, n_{2}, n_{3}$. A new input is selected in each clock cycle. The propagation delay of one multiplexer is $t_{d}$.


Figure 4. Four-input multiplexer.
a) Determine the select signals $s_{0}, s_{1}, s_{2}$ and calculate the propagation delay.
b) If the input signal $n_{j}(j=0,1,2,3)$ always is valid a certain time $t_{n}>t_{d}$ before it is selected, a faster circuit can be designed with the same hardware. Sketch the new circuit and the select signals $s_{0}, s_{1}$, and $s_{2}$. Determine the propagation delay from the time when the select signals become valid.
c) How can the result in b) be used to save power?

## Exercise 5

a) In Figure 5 a multiplier is shown. The input signal $x$ is multiplied with 0.375 . How many inputs of the adder is connected to the sign bit $x_{0}$ ?


Figure 5. Multiplier realized by two hard wire shifts and one adder.
b) Show that the result shown in Figure 6 always is true.

Figure 6. Interesting result of two's complement addition.
c) How is it possible to save power with the use of the result in b)?

## Exercise 6

An $N$-times- $N$ matrix $A$ should be calculated. Each element in $A$ is calculated as shown in the equation below where $c_{1}(x, i)$ and $c_{2}(y, j)$ are constants and $B(i, j)$ are variables.

$$
A(x, y)=\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} c_{1}(x, i) c_{2}(y, j) B(i, j)
$$

a) How many multiplications are required to calculate $A$ ?
b) How many multiplications are required if $c_{1}(x, i) \cdot c_{2}(y, j)$ are precalculated to a new constant $C(x, y, i, j)$ ?
c) Rewrite the equations so that the number of multiplications is reduced.
d) Determine the number of memory access in b) and c).

## Exercise 7

The algorithm shown in Figure 7 accumulates an even number of samples $N$ by adding the samples in a sequence $y(n)=x(n)+y(n-1) ; n=0,1, \ldots, N-1$. The register is implicitly reset at $n=0$. The algorithm should be optimized for low energy consumption using algorithm transformations and supply voltage scaling. Assume a constant capacitive load of adders and registers in implementations, and that the adder and the register hardware operate at a common supply voltage in an implementation. In a direct implementation of the algorithm in Figure 7 the supply voltage is $V_{d d 0}=1.8 \mathrm{~V}$ after minimization with respect to the throughput requirement.


Figure 7. Signal-flow graph of an accumulate algorithm.
a) How large energy savings can be obtained with direct pipelining or interleaving of the algorithm?
b) Transform the algorithm using loop unrolling such that two samples are processed concurrently. Hence the new algorithm should perform the operations $y(2 n)=x(2 n)+y(2 n-1)$ and $y(2 n+1)=x(2 n+1)+y(2 n) ; n=0,1, \ldots, N / 2-1$. Draw the signal-flow graph of the new algorithm.
c) Use the associative and distributed arithmetic laws to reduce the latency of the critical loop of the algorithm in b). The final latency should equal the sum of one two-input adder and one register delay. Draw the signal-flow graph of the new algorithm.
d) Pipeline the algorithm in c) with as few registers as possible so that the throughput becomes limited by the critical loop.
e) Assume the same supply voltage $V_{d d 0}$ is used in direct implementations of the original algorithm in Figure 7 and the resulting algorithm of d). Estimate the relative energy consumption of the two implementations.
f) Minimize the energy consumption of resulting algorithm of d) by scaling the supply voltage. Assume the propagation delay $t_{p}$ is proportional to $V_{d d} /\left(V_{d d}-V_{t}\right)^{1.5}$ where $V_{t}=0.45 \mathrm{~V}$. Estimate the relative energy savings after supply voltage scaling compared with a direct implementation of the original algorithm in Figure 7.

## Solution 1

a)

$$
\begin{aligned}
& T_{\min (a)}=\max \left\{\frac{T_{0}}{N_{0}}, \frac{T_{1}}{N_{1}}, \frac{T_{2}}{N_{2}}\right\}=\max \left\{\frac{2 T_{a d d}+T_{m u l t}}{1}, \frac{3 T_{a d d}+T_{m u l t}}{2}, \frac{3 T_{a d d}+T_{m u l t}}{3}\right\} \\
& =2 T_{\text {add }}+T_{\text {mult }}=8 t_{d} \\
& T_{\min (b)}=\max \left\{\frac{T_{0}}{N_{0}}, \frac{T_{1}}{N_{1}}, \frac{T_{2}}{N_{2}}\right\}=\max \left\{\frac{T_{a d d}+T_{m u l t}}{1}, \frac{2 T_{a d d}+T_{m u l t}}{2}, \frac{3 T_{a d d}+T_{m u l t}}{3}\right\} \\
& =T_{\text {add }}+T_{m u l t}=6 t_{d}
\end{aligned}
$$

b) The signal flow graph in Figure 1 b)
c) $T_{\min (b)}=\frac{4}{3} T_{\min (a)} \Rightarrow T_{\operatorname{clock}(b)}=\frac{4}{3} T_{\operatorname{clock}(a)}$

$$
\Rightarrow \frac{V_{d d(b)}}{\left(V_{d d(b)}-V_{t}\right)^{2}}=\frac{4}{3} \frac{V_{d d(a)}}{\left(V_{d d(a)}-V_{t}\right)^{2}}
$$

where $V_{t}=0.35 \mathrm{~V}$ and $V_{d d(a)}=2.0 \mathrm{~V}$.

Solving the equation results in:
$V_{d d(b)}=k \pm \sqrt{k^{2}-V_{t}^{2}}$ where $k=V_{t}+\frac{3\left(V_{d d(a)}-V_{t}\right)^{2}}{8 V_{d d(a)}} \approx 0.8605$.
Hence $V_{d d(b)} \approx 1.647 \approx 1.65 \mathrm{~V}$
d) $V_{d d(b)}{ }^{2} / V_{d d(a)}{ }^{2} \approx 0.68$

## Solution 2

a) $y(2 n)=x(2 n)+a y(2 n-1)+b y(2 n-2)$ (I)

$$
y(2 n-1)=x(2 n-1)+a y(2 n-2)+b y(2 n-3)(\text { II })
$$

Use (II) in (I)

$$
\begin{aligned}
& y(2 n)=x(2 n)+a x(2 n-1)+a^{2} y(2 n-2)+a b y(2 n-3)+b y(2 n-2) \\
& y(2 n)=x(2 n)+a x(2 n-1)+\left(a^{2}+b\right) y(2 n-2)+a b y(2 n-3) \text { (III) }
\end{aligned}
$$

Use (II) and (III) and sketch the circuit.
b) The modified circuit process two data in each clock cycle. The critical path of the original circuit is (after the move of one delay element) one multiplication and one addition plus the delay of the register. The maximal throughput is $1 /\left(t_{\text {add }}+t_{\text {mult }}+t_{\text {reg }}\right)$.

If pipelining is used (where?) then the maximal througput is $2 /\left(2 t_{\text {add }}+t_{\text {mult }}+t_{\text {reg }}\right)$. By using loop unrolling and algebraic transformations the time margin can be increased which may be used for a higher throughput or for a lower power supply voltage which may result in power savings.

## Solution 3

a) $T_{c l k} \geq t_{s u}+t_{c q}=0.31 \mathrm{~ns}, T_{c l k}=1 \mathrm{~ns}$

$$
1=0.31\left(\frac{V_{D D(\text { new })}}{\left(V_{D D(\text { new })}-V_{T H}\right)^{1.5}} \cdot \frac{\left(V_{D D(\text { old })}-V_{T H}\right)^{1.5}}{V_{D D(\text { old })}}\right) \Rightarrow V_{D D(\text { new })} \approx 0.75 \mathrm{~V}
$$

Power consumption: $\sim 9 \%$ of the original circuit
b) Assuming that the divided data streams has the same transition activity as the original and neglecting the energy consumed by the multiplexer:
Half clock frequency $\Rightarrow$ power consumption: $4.5 \%$ of the original circuit.
c) $V_{D D(\text { new })} \approx 0.53 \mathrm{~V} \Rightarrow$ Power consumption: $2.2 \%$ of the original circuit. Note that we have not considered any overhead of using two clocks and the power consumption of the multiplexer.

## Solution 4

a) The table below shows how the select signals can be chosen ("-" = don't care). For example, when the input signal $s_{0}$ change value from 0 to $1, n_{1}$ must propagate through two multiplexers before the output of the circuit is valid. Hence, the propagation time for the circuit is $2 t_{d}$.

| $s_{0}$ | $s_{1}$ | $s_{2}$ | $f$ |
| :---: | :---: | :---: | :---: |
| 0 | - | 0 | $n_{0}$ |
| 1 | - | 0 | $n_{1}$ |
| - | 0 | 1 | $n_{2}$ |
| - | 1 | 1 | $n_{3}$ |

b) The figure below shows the four input multiplexer where two input signals have changed places. If the input signal $n_{j}(j=0,1,2,3)$ always is valid a certain time $t_{n}>$ $t_{d}$ before it is selected, the signal can propagate through the first multiplexer before it is selected. For example, consider the scenario where $n_{0}$ is selected after $n_{3}$. When $n_{3}$ is selected $s_{0}=0$, which means that $n_{0}$ is valid at the output of multiplexer $M_{0}$ before $n_{0}$ is selected. The "new" circuit results in a propagation delay of one multiplexer $t_{d}$. The extra time margin can be used for power supply scaling.


| $s_{0}$ | $s_{1}$ | $s_{2}$ | $f$ |
| :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | $n_{0}$ |
| 1 | 0 | 1 | $n_{1}$ |
| 1 | 1 | 0 | $n_{2}$ |
| 0 | 1 | 1 | $n_{3}$ |

## Solution 5

a) The sign bit $x_{0}$ is interconnected to 7 inputs of the adder while $x_{j}(j>0)$ is connected to two inputs. This makes the sign bit the most capacitively loaded bit.
b) Numbers:

$$
x=-a_{0} 2^{19}+\sum_{i=15}^{18} a_{0} 2^{i}+\sum_{j=0}^{14} a_{15-j} 2^{j}, y=-2^{19}+2^{18}+2^{17}+2^{16}+\left(1-a_{0}\right) 2^{15}+\sum_{j=0}^{14} a_{15-j} 2^{j}, z=2^{15}
$$

Show that sum is same:

$$
\begin{aligned}
y+z-x & =-2^{19}+2^{18}+2^{17}+2^{16}+2 \cdot 2^{15}-2^{15} a_{0}+a_{0} 2^{19}-\sum_{i=15}^{18} a_{0} 2^{i}= \\
& =0+a_{0}\left(2^{19}-2^{18}-2^{17}-2^{16}-2 \cdot 2^{15}\right)=0+a_{0} \cdot 0=0 \Rightarrow x=y+z
\end{aligned}
$$

c) Do not sign extend, but invert the sign bit and add ones in the positions given by b). The ones should not consume dynamic power since they are not switching, and the logic nets can be simplified for the constant case.

## Solution 6

a) $2 N^{4}$
b) $N^{4}$
c) With the following manipulation of the equation, the number of multiplications will be reduced:

$$
\begin{aligned}
& A(x, y)=\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} c_{1}(x, i) c_{2}(y, j) B(i, j)=\sum_{i=0}^{N-1} c_{1}(x, i) \sum_{j=0}^{N-1} c_{2}(y, j) B(i, j) \Rightarrow \\
& A(x, y)=\sum_{i=0}^{N-1} c_{1}(x, i) Z(i, y), \quad Z(i, y)=\sum_{j=0}^{N-1} c_{2}(y, j) B(i, j)
\end{aligned}
$$

Calculating $Z$ takes $N^{3}$ multiplications, $N$ multiplications for one summation, and $N^{2}$ elements in $Z$ gives totally $N^{3}$ multiplications. $A$ also takes $N^{3}$ multiplications which gives a total number of $2 N^{3}$ multiplications. The number of multiplications is reduced from $N^{4}$ to $2 N^{3}$.
d) Memory access before the manipulation:
$C-N^{4}$ (one memory access for every calculation)
$B-N^{4}$ (one memory access for every calculation)
$A-N^{2}$ (one memory access after one finished calculation)
Total memory access $=N^{2}+2 N^{4}$
Memory access after the manipulation:
$c_{2}-N^{3}, B-N^{3}, Z-N^{2}, c_{1}-N^{3}, Z-N^{3}, A-N^{2}$. Total memory access $=4 N^{3}+2 N^{2}$
For $N>2$ the last equation uses least memory access.


Memory access for both equations

## Solution 7

a) Pipelining and interleaving cannot be applied directly to a recursive algorithm, hence there is no energy to gain from these methods.
b) Signal-flow graph after loop unrolling: $x(2 n) x(2 n+1)$

c) To improve the latency of the critical loop, we need to transform operations out of the loop by reordering the additions. Start by simplifing the critical loop by moving the addition $x(2 n)=x(2 n)+y(2 n)$ out of the loop. It can made by duplicating the left-most addition in b) according to the figure to the right.


Reorder the additions such that the loop is shortened to one adder and one register. This is made in the figure to the right by swapping the order of adding $x(2 n+1)$ and the register output.

This signal-flow graph has a critical loop with only one adder and one register in it and does hence solve our
 problem.
d) The critical path of the algorithm in c) goes through one register and two adders. This path can be reduced to one register and one adder by introducing pipelining in the sequential parts of the algorithm, obtaining the same latency of the critical path as that of the loop. A cut $A-A^{\prime}$ suitable for inserting two register to obtain the pipelining is indicated in the figure to the right.

e) Assume the energy consumption of a register operation to be $E_{\text {reg }}$ and of an addition operation $E_{\text {add }}$. To perform an accumulation operation the algorithm in d) consumes the energy $E_{d}=N / 2\left(3 E_{\text {reg }}+3 E_{\text {add }}\right)=1.5 N\left(E_{\text {reg }}+E_{\text {add }}\right)$ while the original algorithm consumes $E_{0}=N\left(E_{\text {reg }}+E_{\text {add }}\right)$. Hence the algorithm in d) consumes $150 \%$ energy of the original algorithm if the same supply voltage is used.
f) The latency for computing an accumulation operation with an implementation based on the algorithm in d) is $t_{1}=\frac{N}{2} k V_{d d 1}\left(V_{d d 1}-V_{t}\right)^{-1.5}$, where $k$ is the proportionality constant, while an implementation based on the original algorithm has the latency $t_{0}=N k V_{d d 0}\left(V_{d d 0}-V_{t}\right)^{-1.5}$. With the same throughput requirement,
$t_{1}=t_{0} \Rightarrow \frac{V_{d d 1}}{2\left(V_{d d 1}-V_{t}\right)^{1.5}}=\frac{V_{d d 0}}{\left(V_{d d 0}-V_{t}\right)^{1.5}}$
Solving this equation numerically, we obtain $V_{d d 1} \approx 1.04 \mathrm{~V}$.
Assuming the load capacitances of a register and an adder are $C_{\text {reg }}$ and $C_{\text {add }}$, respectively, we then obtain the relative energy consumption

$$
\frac{E_{1}}{E_{0}}=\frac{\left.(N / 2) 3\left(E_{r e g}+E_{a d d}\right)\right|_{V_{d d}=V_{d d 1}}}{\left.N\left(E_{r e g}+E_{a d d}\right)\right|_{V_{d d}=V_{d d 0}}}=\frac{3}{2} \frac{\left(C_{r e g}+C_{a d d}\right) V_{d d 1}^{2}}{\left(C_{r e g}+C_{a d d}\right) V_{d d 0}^{2}}=\frac{3}{2}\left(\frac{V_{d d 1}}{V_{d d 0}}\right)^{2} .
$$

Numerically $E_{1} / E_{0} \approx 0.50$, and hence the energy savings become approx. $50 \%$.

