The precedence graph in Figure 1 describes the order in which a set of operations has to be carried out. The dashed lines denote time slots in the graph, each one time unit (t.u.) long. The graph contains some freedom: for example, the result of the addition on the second row is not needed until the fourth time slot. Since an addition takes 1 t.u. to complete, it could be delayed by 1 or 2 t.u. without altering the functionality of the circuit. The following data apply:  $V_{dd} = 2.5$  V,  $V_t = 0.35$  V,  $t_{add} = 1$  t.u.,  $t_{mult} = 2$  t.u.,  $P_{add} = P$ ,  $P_{mult} = 4P$ . All operations are implemented in static CMOS logic (2 µm process).



Figure 1. Precedence graph.

- a) Use multiple supply voltages for different operations to minimize the total power consumption. The propagation delay should still be 5 t.u. What voltages should be supplied and to which operations?
- b) How large is the relative power reduction?

Exercise 2

Four numbers should be added with three two-input adders according to Figure 2.



Figure 2. Adder with four inputs.

- a) How can the power consumption be reduced by using pipelining? What are the effects on the implementation?
- b) How can the power consumption be reduced by interleaving? What are the effects on the implementation?
- c) What will happen to the implementation if both pipelining and interleaving are applied to the adder?

A signal flow graph with a register and CSA-tree is shown in Figure 3. The input to the CSA-tree consists of 16 words where each word has a word length of 32 bits. The propagation delay of the register  $t_{reg}$  is measured from the triggering clock edge to the time the output is stable. The maximal power supply voltage for the used 0.18 µm process is 2.7 V and the propagation delay is in proportion to  $V_{dd}/(V_{dd} - V_t)^{1.7}$  where  $V_t \approx 0.32$  V. The output of the CSA-tree is intended to be connected to a fast adder via registers. Some data for the circuits are listed in Table 1.



|                               | CSA-tree | Register<br>(32 bit) | Multiplexer<br>(2x64 bit) |
|-------------------------------|----------|----------------------|---------------------------|
| Power supply                  | 1.8 V    | 1.8 V                | 1.8 V                     |
| Propagation delay             | 4.90 ns  | 127 ps               | 135 ps                    |
| Setup time                    | -        | 144 ps               | -                         |
| Power consumption (@ 250 MHz) | 29 mW    | 0.90 mW              | 2.8 mW *                  |

Figure 3. CSA-tree with register.

Table 1. Data for the circuits.

\* The power consumption for the multiplexer is estimated when select is a clock signal with a frequency of 125 MHz, yielding a data output rate of 250 MHz.

- a) Sketch the waveforms of the signals in Figure 3. Let the input data be represented by d0, d1, d2, d3, ..., and let the output data of the CSA-tree be represented by out0, out1, out2, ..., where outj is the result of input dj (j = 0, 1, 2, ...).
- b) Interleave the structure by the use of one register and a 2x64 bit multiplexer. The propagation delay of the multiplexer is  $t_m = 135$  ps. Sketch the signal flow graph and the waveforms of the signals.
- c) Pipeline the CSA-trees in the interleaved architecture. Sketch the signal flow graph and the waveforms of the signals. Assume that the CSA-tree can be split into two equally long data-paths.
- d) Estimate the minimal power consumption for the original circuit, the interleaved circuit, and the circuit that uses both interleaving and pipelining. Assume that the transition activity is not changed when one data stream is divided into two data streams. The throughput should be 250 Msamples/s.

In Figure 4, a digital circuit is shown. Circuit A, circuit B, and the registers are originally designed for a power supply voltage of 2.5 V. Some data for a multiplexer and the circuits in Figure 4 are listed in Table 1. The propagation delay in the current 0.18 µm process is proportional to  $V_{dd}/(V_{dd} - V_t)^{1.5}$ , where  $V_t \approx 0.34$  V.



Figure 4. Digital circuit.

|                                  | A                    | B                    | Register<br>(16 bit) | Multiplexer<br>(16 bit) |
|----------------------------------|----------------------|----------------------|----------------------|-------------------------|
| Power supply                     | 2.5 V                | 2.5 V                | 2.5 V                | 2.5 V                   |
| Propagation delay                | 9.0 ns               | 6.0 ns               | 0.25 ns              | 0.20 ns                 |
| Power consumption<br>(@ 100 MHz) | 11 mW                | 21 mW                | 2.3 mW               | 1.5 mW *                |
| Area                             | 0.45 mm<sup>2</sup>  | 0.65 mm<sup>2</sup>  | 0.005 mm<sup>2</sup> | 0.004 mm<sup>2</sup>    |

Table 1: Data for the digital subcircuits.

<sup>\*</sup>The power consumption for the multiplexer is estimated when select is a clock signal with a frequency of 50 MHz, yielding a data output rate of 100 MHz.

#### **Specification for the original digital circuit:**

- Available power supply voltages: 0.9 V, 1.3 V, 1.8 V
- Clock frequency: 75 MHz
- Max area: 2.0 mm<sup>2</sup>

- a) Introduce pipelining in the circuit in Figure 4 with the use of one extra register. Determine the lowest possible power consumption fulfilling the given specification. Sketch the new circuit.
- b) Redesign the circuit in Figure 4 for lowest possible power consumption fulfilling the specification with the use of the circuits listed in Table 1. Determine the power consumption and sketch the new circuit.

In Figure 5 a digital circuit is shown. The propagation delays of the different circuits are  $t_A = 2$  ns,  $t_B = 4$  ns,  $t_C = 8$  ns,  $t_D = 7$  ns,  $t_{reg} = 0.5$  ns.

- a) Determine the maximum clock frequency of the circuit in Figure 5.
- b) Pipeline the circuit and calculate the new maximum clock frequency.
- c) Modify the circuit so that the maximum clock frequency exceeds 220 MHz.



Figure 5. Digital circuit.

## Exercise 6

The arithmetic function f = a(b + c) + (a + b) should be implemented with three two-input adders and one two-input multiplier. The signals *a*, *b*, *c*, and *f* are connected to external registers. The circuits are implemented using static CMOS logic and have the propagation times  $t_0$ ,  $3t_0$ , and  $9t_0$  for a register, adder and multiplier, respectively, at the maximum supply voltage 2.5 V.

- a) Estimate the minimum propagation time of the circuit assuming  $V_{DD} = 2.5$  V.
- b) Now assume that multiple supply voltages may be used,  $|V_T| = 0.40$  V for all transistors, and the propagation time scales in proportion to  $V_{DD}/(V_{DD}-|V_T|)^{1.5}$ . Schedule the starting time and assign supply voltages for the operations such that minimum dynamic power dissipation is achieved.
- c) Pipelining should now be introduced in the circuit in a). Show how registers can be added in two of the signal paths of the circuit in a) such that the algorithmic latency increases with one clock cycle. Do not pipeline the operations internally.
- d) Suggest a new multiple supply voltage assignment that minimizes the power dissipation of the circuit in c) without decreasing the maximal throughput. Exclude the registers in the assignment, i.e., keep the maximum supply voltage for the registers.
- e) Consider the case of interleaving of two blocks that compute *f*. Discuss how this configuration would allow the voltage to be scaled compared with the pipelining case in d). How would the costs differ from the pipelining case?

The data path shown in Figure 6 should operate at a frequency of 125 MHz. The critical path is indicated for each component assuming a nominal power supply voltage of 1.5 V. Assume that a block consumes power  $P_0$  at nominal supply voltage,  $V_T$ =0.33 V, and r=2.



Figure 6. Data path.

- a) Perform power supply voltage scaling of the system and compute the relative power saving compared with the data path operating at nominal power supply voltage.
- b) Introduce one level of pipelining in the system using a register block, scale the voltage, and compute the power savings assuming that both parts operate at the same frequency. Neglect the power and delay of the register.
- c) Assume that all blocks consume the same amount of power at the nominal power supply voltage, including the register block introduced. Repeat b) assuming the register has a propagation delay of 1 ns.

a) The propagation time  $t_d$  in this CMOS circuit is proportional to  $V_{dd}/(V_{dd}-V_t)^2$ .

In the case of  $V_{dd} = 2.5$  V this gives  $V_{dd} / (V_{dd} - V_t)^2 \approx 0.54$ .

Since the result of the addition on the second row in the precedence graph is not needed until after 3 t.u., the latency of this operation can be increased by a factor of three. The latency of the multiplication on the third row can be doubled without destroying the functionality or increasing the total latency of the function. The new precedence graph is shown in Figure 7.

For the addition we have  $1.62 = V_{dd1}/(V_{dd1} - V_t)^2$ , giving  $V_{dd1} \approx 1.22$  V, and for the multiplication  $1.08 = V_{dd2}/(V_{dd2} - V_t)^2$ , resulting in a new supply voltage of  $V_{dd2} \approx 1.54$  V.



Figure 7. Modified precedence graph.
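The two supply voltages in a) can be cross-checked numerically. The following is a minimal sketch, assuming the delay model $V_{dd}/(V_{dd}-V_t)^2$ from a); the function name `vdd_for_delay_factor` is hypothetical and only serves this illustration. Because the exponent is 2, the equation reduces to a quadratic in $V_{dd}$.

```python
import math


def vdd_for_delay_factor(k, vdd_nom=2.5, vt=0.35):
    """Return the supply voltage at which the delay is k times the nominal delay.

    Assumes delay ~ Vdd / (Vdd - Vt)^2, as stated in a).
    """
    target = k * vdd_nom / (vdd_nom - vt) ** 2
    # target * (V - vt)^2 = V  <=>  a*V^2 + b*V + c = 0
    a = target
    b = -(2.0 * target * vt + 1.0)
    c = target * vt ** 2
    # The product of the roots is vt^2, so the smaller root lies below Vt
    # and only the larger root is physically meaningful.
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


vdd_add = vdd_for_delay_factor(3)   # addition stretched from 1 t.u. to 3 t.u.
vdd_mult = vdd_for_delay_factor(2)  # multiplication stretched from 2 t.u. to 4 t.u.
print(f"V_dd1 = {vdd_add:.3f} V, V_dd2 = {vdd_mult:.3f} V")
```

The computed values agree with the ones quoted above to within rounding of the intermediate factors 1.62 and 1.08.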

b) The power dissipation using the initial solution is  $P_1 = (4P_{add} + 2P_{mult}) = 12P$ , and using different power supply voltages

$$P_{2} = \frac{(3P_{add} + P_{mult})V_{dd}^{2} + P_{add}V_{dd1}^{2} + P_{mult}V_{dd2}^{2}}{V_{dd}^{2}} \approx 8.76P$$

The relative power reduction is  $1 - P_2/P_1 = 27\%$ .
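The arithmetic in b) can be reproduced with a few lines of code. This is a sketch under the assumption that dynamic power scales with $V_{dd}^2$ and that power is expressed in units of $P$; the function name `relative_power_reduction` is made up for this example.

```python
def relative_power_reduction(vdd=2.5, vdd1=1.22, vdd2=1.54, p_add=1.0, p_mult=4.0):
    """Return (P1, P2, reduction) in units of P for the precedence graph.

    P1: all six operations at vdd.
    P2: one addition at vdd1, one multiplication at vdd2, the rest at vdd.
    """
    p1 = 4 * p_add + 2 * p_mult
    p2 = (
        (3 * p_add + p_mult)               # operations left at the nominal supply
        + p_add * (vdd1 / vdd) ** 2        # slowed-down addition
        + p_mult * (vdd2 / vdd) ** 2       # slowed-down multiplication
    )
    return p1, p2, 1.0 - p2 / p1


p1, p2, reduction = relative_power_reduction()
print(f"P1 = {p1:.2f} P, P2 = {p2:.2f} P, reduction = {reduction:.1%}")
```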

#### Solution 2

a) The critical path limits the minimal clock period to  $T_0 = 2t_{add} + t_{reg}$  in the original circuit (assuming registers before and after the circuit). Pipelined circuit with two additional registers:



Figure 8. Pipelined adder.

Minimal clock period in the pipelined circuit:  $T_{pipe} = t_{add} + t_{reg}$ . That is, the minimal clock period is  $t_{add}$  shorter. For this case, the additional time margin can be traded for lower power dissipation if the overhead of the two registers is less than the gain of supply voltage scaling. This is because the dominant dynamic power dissipation is proportional to  $V_{dd}^2$ , while the throughput is approximately proportional to  $V_{dd}$ . Hence, by decreasing  $V_{dd}$  until the requirement on throughput is met, there will be large savings in power due to the square relation. The large savings possible allow us to afford certain overhead in terms of hardware, e.g., pipelining registers, with a net gain in power dissipation as long as the increase in throughput is enough.

b) Interleaved circuit with two identical blocks using four additional registers (assuming there is no need for registers before the circuit) and one multiplexer. Note that the power overhead of these registers is lower than that of normal registers, since they are clocked at half the frequency. The minimal clock period in the circuit path is  $T_{half} = 2t_{add} + t_{mux} + t_{reg}$ . However, since the circuit processes two four-input additions per clock cycle, the maximal throughput equals  $2/(2t_{add} + t_{mux} + t_{reg})$ . For this case, the extra time margin can be traded for lower power dissipation if the overhead of the registers and the multiplexer is less than the gain obtained with supply voltage scaling. Note that the required area when interleaving is more than doubled in comparison with the original circuit.



Figure 9. Interleaved adder.

c) If the methods in a) and b) are combined, then the minimal clock period is  $t_{add} + t_{reg}$  and the maximal throughput is  $2/(t_{add}+t_{reg})$  with potentially even more power savings compared with a) and b). However, since the overhead is also further increased the risk of having less gain than expected is increased.
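To compare the four variants side by side, the clock-period expressions from a), b), and c) can be tabulated. The sketch below uses made-up delay values (3.0, 0.5, and 0.2 ns) purely for illustration; they are assumptions and do not come from the exercise. The function name `adder_tree_timing` is likewise hypothetical.

```python
def adder_tree_timing(t_add, t_reg, t_mux):
    """Return {variant: (min_clock_period, max_throughput)} for the four-input adder."""
    variants = {
        # name: (minimal clock period, results produced per clock period)
        "original": (2 * t_add + t_reg, 1),
        "pipelined": (t_add + t_reg, 1),
        "interleaved": (2 * t_add + t_mux + t_reg, 2),
        # The expression given in c) omits t_mux; it is kept that way here.
        "pipelined + interleaved": (t_add + t_reg, 2),
    }
    return {name: (period, n / period) for name, (period, n) in variants.items()}


# Hypothetical delays in ns, chosen only to make the comparison concrete.
timing = adder_tree_timing(t_add=3.0, t_reg=0.5, t_mux=0.2)
for name, (period, throughput) in timing.items():
    print(f"{name:24s} T_min = {period:.1f} ns, throughput = {throughput:.3f} results/ns")
```

The throughput headroom relative to the original circuit is what can be traded for a lower supply voltage, provided the added registers and multiplexer do not consume the gain.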

a)



b)




[Figure: signal flow graph and waveforms of the interleaved and pipelined circuit. The graph contains registers of 16x32, 6x32, and 2x32 bits clocked on $\phi$ and $\overline{\phi}$ and a multiplexer; the waveforms show x(n), a(n), b(n), ya(n), yb(n), and y(n) with the timing annotations $t_{reg}$, $t_p/2$, and $t_m$.]

d) The original circuit will consume  $P_1 = 29 + 16 \cdot 0.9 = 43.4$  mW, when  $V_{dd} = 1.8$  V. Unfortunately this power supply voltage will not give the required throughput. Total propagation delay  $t_{tot1} = 4.9 + 0.127 + 0.144 = 5.171$  ns. The total propagation delay must be decreased to 4 ns (250 MHz). Hence,

$$\frac{V_{DDnew} / (V_{DDnew} - 0.32)^{1.7}}{1.8 / (1.8 - 0.32)^{1.7}} \le \frac{4}{5.171} = 0.774$$
$$V_{DDnew} = 2.32 \text{ V} \Rightarrow P_{new1} = \frac{P_1 V_{DDnew}^2}{V_{DDold}^2} = \frac{43.4 \cdot 2.32^2}{1.8^2} = 72 \text{ mW}$$

Total propagation delay ( $V_{dd} = 1.8$  V) with interleaving:  $t_{tot2} = 4.9 + 0.127 + 0.144 + 0.135$  ns = 5.306 ns

The internal clock frequency is 125 MHz corresponding to a clock period time of 8 ns. The circuit will consume  $P_2 = 2 \cdot \frac{29}{2} + 2 \cdot \frac{16 \cdot 0.9}{2} + 2.8 \text{ mW} = 46.2 \text{ mW}$ .

We can increase the propagation delay with a factor  $\frac{8}{5.306} = 1.508$ .

$$\frac{V_{DDnew} / (V_{DDnew} - 0.32)^{1.7}}{1.8 / (1.8 - 0.32)^{1.7}} \le 1.508 \implies V_{DDnew} \approx 1.27 \text{ V}$$

$$P_{new2} = \frac{P_2 V_{DDnew}^2}{V_{DDold}^2} = \frac{46.2 \cdot 1.27^2}{1.8^2} = 23 \text{ mW}$$

Total propagation delay of critical path of the circuit ( $V_{dd} = 1.8$  V) when both interleaving and pipelining is used:

$$t_{tot3} = \frac{4.9}{2} + 0.127 + 0.144 + 0.135 \text{ ns} = 2.856 \text{ ns}$$

The internal clock frequency is 125 MHz (clock period time of 8 ns). The circuit will consume  $P_3 = 2 \cdot \frac{29}{2} + 2 \cdot \frac{16 \cdot 0.9}{2} + 2 \cdot \frac{6 \cdot 0.9}{2} + 2.8 \text{ mW} = 51.6 \text{ mW}$ .

We can increase the propagation delay with a factor  $\frac{8}{2.856} = 2.801$ .

$$\frac{V_{DDnew} / (V_{DDnew} - 0.32)^{1.7}}{1.8 / (1.8 - 0.32)^{1.7}} \le 2.801 \implies V_{DDnew} \approx 0.84 \text{ V}$$

$$P_{new3} = \frac{P_3 V_{DDnew}^2}{V_{DDold}^2} = \frac{51.6 \cdot 0.84^2}{1.8^2} \text{ mW} \approx 11 \text{ mW}$$
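The three voltage-scaling steps in d) all solve the same equation, which has no closed form for the exponent 1.7. A minimal sketch using bisection is given below; it assumes the delay model $V_{dd}/(V_{dd}-V_t)^{1.7}$ and $P \propto V_{dd}^2$ from the text, and the helper names `delay_metric` and `min_vdd` are hypothetical.

```python
def delay_metric(vdd, vt=0.32, alpha=1.7):
    """Quantity the propagation delay is proportional to."""
    return vdd / (vdd - vt) ** alpha


def min_vdd(slack, vdd_ref=1.8, vt=0.32, alpha=1.7, vdd_max=2.7):
    """Lowest Vdd whose delay is at most `slack` times the delay at vdd_ref."""
    target = slack * delay_metric(vdd_ref, vt, alpha)
    lo, hi = vt + 0.05, vdd_max          # the metric decreases monotonically on this range
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if delay_metric(mid, vt, alpha) > target:
            lo = mid                      # still too slow: raise the voltage
        else:
            hi = mid
    return hi


# (power at 1.8 V in mW, delay at 1.8 V in ns, available time in ns)
cases = {
    "original": (43.4, 5.171, 4.0),
    "interleaved": (46.2, 5.306, 8.0),
    "interleaved + pipelined": (51.6, 2.856, 8.0),
}
results = {}
for name, (p_ref, t_ref, t_avail) in cases.items():
    vdd = min_vdd(t_avail / t_ref)
    results[name] = (vdd, p_ref * (vdd / 1.8) ** 2)
    print(f"{name:24s} Vdd = {results[name][0]:.3f} V, P = {results[name][1]:.1f} mW")
```

The solver returns the exact crossing point, whereas the solution above rounds each voltage upwards to two decimals, so the computed powers come out marginally lower than the quoted 72, 23, and 11 mW.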

#### Solution 4

a) The pipelined circuit:



Clock frequency of 75 MHz  $\Rightarrow$   $T_{clk} \approx 13.33$  ns.

|                     | A       | B       | Register<br>(16 bit) | Multiplexer<br>(16 bit) |
|---------------------|---------|---------|----------------------|-------------------------|
| $t_{delay}$ (2.5 V) | 9.0 ns  | 6.0 ns  | 0.25 ns              | 0.20 ns                 |
| $t_{delay}$ (1.8 V) | 11.7 ns | 7.78 ns | 0.32 ns              | 0.26 ns                 |
| $t_{delay}$ (1.3 V) | 15.8 ns | 10.5 ns | 0.44 ns              | 0.35 ns                 |
| $t_{delay}$ (0.9 V) | 24.5 ns | 16.4 ns | 0.68 ns              | 0.55 ns                 |

New propagation delay when the power supply is scaled:

$$t_{delay(new)} = t_{delay(old)} \left\{ \frac{V_{DD(new)}}{\left(V_{DD(new)} - V_{t}\right)^{1.5}} \cdot \frac{\left(V_{DD(old)} - V_{t}\right)^{1.5}}{V_{DD(old)}} \right\}$$
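The delay table can be regenerated directly from the scaling formula. The following sketch assumes the exponent 1.5 and $V_t = 0.34$ V given in the exercise; `scale_delay` is a hypothetical helper name.

```python
def scale_delay(t_old_ns, vdd_new, vdd_old=2.5, vt=0.34, alpha=1.5):
    """Scale a propagation delay from vdd_old to vdd_new."""
    return (
        t_old_ns
        * (vdd_new / (vdd_new - vt) ** alpha)
        * ((vdd_old - vt) ** alpha / vdd_old)
    )


# Delays at 2.5 V in ns, taken from the data table of the exercise.
delays_2v5 = {"A": 9.0, "B": 6.0, "register": 0.25, "multiplexer": 0.20}

table = {
    vdd: {name: scale_delay(t, vdd) for name, t in delays_2v5.items()}
    for vdd in (1.8, 1.3, 0.9)
}
for vdd, row in table.items():
    cells = ", ".join(f"{name} = {t:.2f} ns" for name, t in row.items())
    print(f"{vdd:.1f} V: {cells}")
```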

We choose as low power supply as possible.

Circuit A:  $V_{DDA} = 1.8$  V, Circuit B:  $V_{DDB} = 1.3$  V.

Propagation delay for one register and circuit A: (11.7 + 0.68) ns = 12.4 ns < 13.3 ns

Propagation delay for one register and circuit B: (10.5 + 0.68) ns = 11.2 ns < 13.3 ns

(Note that the power consumption is specified for  $f_{clk} = 100$  MHz while the clock frequency in the specification is 75 MHz)

$$P_{min} = \frac{1.8^2}{2.5^2} \cdot \frac{75}{100} \cdot 11 + \frac{1.3^2}{2.5^2} \cdot \frac{75}{100} \cdot 21 + 4 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{75}{100} \cdot 2.3 \approx 9.4 \text{ mW}$$
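The power estimate scales each reference value by the square of the voltage ratio and linearly by the frequency ratio. A small sketch of that bookkeeping follows, assuming four registers at 0.9 V as in the expression above; the helper `scaled_power` is a hypothetical name.

```python
VDD_REF = 2.5    # V, supply voltage at which the table values were measured
F_REF = 100.0    # MHz, clock frequency at which the table values were measured


def scaled_power(p_ref_mw, vdd, f_mhz):
    """Dynamic power scaled with Vdd^2 and linearly with frequency."""
    return p_ref_mw * (vdd / VDD_REF) ** 2 * (f_mhz / F_REF)


f_clk = 75.0  # MHz, from the specification
p_min = (
    scaled_power(11.0, 1.8, f_clk)        # circuit A at 1.8 V
    + scaled_power(21.0, 1.3, f_clk)      # circuit B at 1.3 V
    + 4 * scaled_power(2.3, 0.9, f_clk)   # four registers at 0.9 V
)
print(f"P_min = {p_min:.2f} mW")
```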

b) If both circuit A and circuit B are interleaved the area of the circuit will be larger than the specification allows. But, it is possible to interleave one of the circuits.

Interleaving of circuit A:



The clock frequency  $f_2$  for the interleaved circuit is 37.5 MHz  $\Rightarrow T_{clk2} \approx 26.7$  ns.

We choose as low power supply as possible. Circuit A:  $V_{DDA} = 0.9$  V, Circuit B:  $V_{DDB} = 1.3$  V

The power consumption due to the registers and the multiplexer:

$$4 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{37.5}{100} \cdot 2.3 + 2 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{75}{100} \cdot 2.3 + 1 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{75}{100} \cdot 1.5 \approx 1.04 \text{ mW}$$

Propagation delay of the interleaved circuit A plus one register and one multiplexer:  $0.68 + 24.5 + 0.55 \approx 25.7$  ns < 26.7 ns

Propagation delay for one register and circuit B: (10.5 + 0.68) ns = 11.2 ns < 13.3 ns

Power consumption due to the circuits  $A_1$ ,  $A_2$  and B:

$$2 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{37.5}{100} \cdot 11 + \frac{1.3^2}{2.5^2} \cdot \frac{75}{100} \cdot 21 \approx 5.33 \text{ mW}$$

Total power consumption:  $1.04 + 5.33 \approx 6.4$  mW

Interleaving of circuit B:



The clock frequency  $f_2$  for the interleaved circuit is 37.5 MHz  $\Rightarrow T_{clk2} \approx 26.7$  ns. We choose as low power supply as possible. Circuit A:  $V_{DDA} = 1.8$  V, Circuit B:  $V_{DDB} = 0.9$  V.

Propagation delay of the interleaved circuit *B* plus one register and one multiplexer:  $0.68 + 16.4 + 0.55 \approx 17.6$  ns < 26.7 ns

Propagation delay for one register and circuit A: (11.7 + 0.68) ns = 12.4 ns < 13.3 ns

The power consumption due to the registers and the multiplexer:

$$2 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{37.5}{100} \cdot 2.3 + 3 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{75}{100} \cdot 2.3 + 1 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{75}{100} \cdot 1.5 \approx 1.04 \text{ mW}$$

Power consumption due to the circuits A,  $B_1$ , and  $B_2$ :

$$\frac{1.8^2}{2.5^2} \cdot \frac{75}{100} \cdot 11 + 2 \cdot \frac{0.9^2}{2.5^2} \cdot \frac{37.5}{100} \cdot 21 \approx 6.32 \text{ mW}$$
  
Total power consumption:  $1.04 + 6.32 \approx 7.4 \text{ mW} > 6.4 \text{ mW}$ 

Hence, we choose to **interleave circuit** *A*.
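The choice between the two interleaving options can be checked against both the power totals and the area limit. The sketch below reuses the register counts that appear in the power expressions above (six registers when A is interleaved, five when B is interleaved); treating those counts as the area overhead is an assumption of this example, and the function names are hypothetical.

```python
VDD_REF, F_REF = 2.5, 100.0      # reference conditions of the data table
POWER = {"A": 11.0, "B": 21.0, "reg": 2.3, "mux": 1.5}       # mW
AREA = {"A": 0.45, "B": 0.65, "reg": 0.005, "mux": 0.004}    # mm^2
MAX_AREA = 2.0                                               # mm^2


def scaled_power(name, vdd, f_mhz):
    """Dynamic power of one block, scaled with Vdd^2 and frequency."""
    return POWER[name] * (vdd / VDD_REF) ** 2 * (f_mhz / F_REF)


def interleave_a():
    power = (
        2 * scaled_power("A", 0.9, 37.5) + scaled_power("B", 1.3, 75.0)
        + 4 * scaled_power("reg", 0.9, 37.5) + 2 * scaled_power("reg", 0.9, 75.0)
        + scaled_power("mux", 0.9, 75.0)
    )
    area = 2 * AREA["A"] + AREA["B"] + 6 * AREA["reg"] + AREA["mux"]
    return power, area


def interleave_b():
    power = (
        scaled_power("A", 1.8, 75.0) + 2 * scaled_power("B", 0.9, 37.5)
        + 2 * scaled_power("reg", 0.9, 37.5) + 3 * scaled_power("reg", 0.9, 75.0)
        + scaled_power("mux", 0.9, 75.0)
    )
    area = AREA["A"] + 2 * AREA["B"] + 5 * AREA["reg"] + AREA["mux"]
    return power, area


power_a, area_a = interleave_a()
power_b, area_b = interleave_b()
area_both = 2 * AREA["A"] + 2 * AREA["B"]   # logic only, already above the limit
print(f"Interleave A: {power_a:.2f} mW, {area_a:.3f} mm^2")
print(f"Interleave B: {power_b:.2f} mW, {area_b:.3f} mm^2")
print(f"Interleave both: at least {area_both:.2f} mm^2 (limit {MAX_AREA} mm^2)")
```

Under these assumptions both single-block options fit within the area limit, while duplicating both A and B exceeds it before any registers are counted.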

- a) Initial maximum frequency is 1/(2 ns + 4 ns + 8 ns + 7 ns + 0.5 ns) = 46.5 MHz.
- b) When pipelining, registers are introduced between the logic blocks. Note that placing a register between A and B will not help to speed up the circuit. This is because the combined delay of A and B is 6 ns, which is less than the delay of C (8 ns). Maximum clock frequency after pipelining: 1/(8 ns + 0.5 ns) = 117.6 MHz.



#### Pipelined solution

c) C needs to be modified in some way, if the system should use a higher frequency. A solution is to introduce interleaving for the path through C and D (see Figure 10). Maximum clock frequency: 1/(4 ns + 0.5 ns) = 222 MHz. The propagation time of the multiplexer is neglected.



Interleaved and pipelined solution
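The three clock frequencies follow from the slowest stage plus the register delay. The sketch below makes two assumptions that are not spelled out in the solution: in c) a register is placed between A and B (otherwise their combined 6 ns would limit the clock), and two-way interleaving is modelled as halving the effective delay of C and D, with the multiplexer neglected as stated. The function name `f_max_mhz` is hypothetical.

```python
T_REG = 0.5                                      # ns
T = {"A": 2.0, "B": 4.0, "C": 8.0, "D": 7.0}     # ns


def f_max_mhz(stage_delays_ns, t_reg_ns=T_REG):
    """Maximum clock frequency set by the slowest stage plus one register delay."""
    return 1e3 / (max(stage_delays_ns) + t_reg_ns)


# a) No pipelining: all blocks form a single stage.
f_a = f_max_mhz([sum(T.values())])

# b) Pipelined with A and B sharing a stage (6 ns is below the 8 ns of C).
f_b = f_max_mhz([T["A"] + T["B"], T["C"], T["D"]])

# c) Assumed: A and B in separate stages, C and D two-way interleaved.
f_c = f_max_mhz([T["A"], T["B"], T["C"] / 2, T["D"] / 2])

print(f"a) {f_a:.1f} MHz, b) {f_b:.1f} MHz, c) {f_c:.1f} MHz")
```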

## Solution 6

a) Computation graph



The minimum propagation time is  $15t_0$  (excluding external registers)

b) All operations must use the maximum supply voltage  $V_{DD} = 2.5$  V, except the operation (a+b), which may be assigned  $V_{DD1} < V_{DD}$ . The propagation time of (a+b) may increase by a factor of 12/3 = 4. The minimum  $V_{DD1}$  is then obtained from

$$\frac{V_{DD1}}{(V_{DD1} - |V_T|)^{1.5}} = \frac{12}{3} \cdot \frac{V_{DD}}{(V_{DD} - |V_T|)^{1.5}} \Rightarrow V_{DD1} \approx 0.79 \text{ V},$$

yielding the time and supply voltage schedule
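The value of $V_{DD1}$ in b) can be verified numerically. This is a minimal sketch assuming the delay model $V_{DD}/(V_{DD}-|V_T|)^{1.5}$; the start and end times listed in `schedule` are one schedule consistent with the $15t_0$ critical path (they are an assumption of this example), and the helper names are hypothetical.

```python
def delay_metric(vdd, vt=0.40, alpha=1.5):
    """Quantity the propagation time is proportional to."""
    return vdd / (vdd - vt) ** alpha


def vdd_for_slack(factor, vdd_max=2.5, vt=0.40, alpha=1.5):
    """Lowest Vdd at which the delay is at most `factor` times the delay at vdd_max."""
    target = factor * delay_metric(vdd_max, vt, alpha)
    lo, hi = vt + 1e-3, vdd_max
    for _ in range(60):                 # bisection on a monotonically decreasing metric
        mid = 0.5 * (lo + hi)
        if delay_metric(mid, vt, alpha) > target:
            lo = mid
        else:
            hi = mid
    return hi


# Assumed schedule in units of t0: (start, end, slack factor relative to 2.5 V).
schedule = {
    "b + c": (0, 3, 1),
    "a * (b + c)": (3, 12, 1),
    "a + b": (0, 12, 4),      # the only operation off the critical path
    "f": (12, 15, 1),
}
for op, (start, end, factor) in schedule.items():
    print(f"{op:12s} {start:2d}..{end:2d} t0, Vdd = {vdd_for_slack(factor):.2f} V")

vdd1 = vdd_for_slack(12 / 3)
adder_power_ratio = (vdd1 / 2.5) ** 2   # dynamic power of (a+b) relative to 2.5 V
```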



c) Pipelined circuit



d) Now the sum *f* can be computed during a whole clock cycle; hence this adder, besides (a+b), may also be assigned the supply voltage  $V_{DD1} \approx 0.79$  V.



e) Neglecting overhead, interleaving would allow the original algorithm to be computed in half the time, i.e.,  $15t_0/2 = 7.5t_0$ , which is less than the critical path of  $13t_0$  in d). This would allow for further reduced voltages for the interleaving case compared with pipelining, and hence likely larger power savings. However, interleaving has an overhead of six latches, one 2:1 multiplexer, three adders and one multiplier compared with two registers for pipelining which will require more area for interleaving and reduce the difference in power savings between the cases.

a) Critical path through c-d  $\Rightarrow$   $t_{old} = 4 + 3$  ns = 7 ns @  $V_{dd} = V_{old} = 1.5$  V.

Relation between old and new times and supply voltages:

$$\frac{t_{new}}{t_{old}} = \frac{V_{new}}{\left(V_{new} - V_T\right)^2} \cdot \frac{\left(V_{old} - V_T\right)^2}{V_{old}}$$

where

 $t_{new} = (125 \text{ MHz})^{-1} = 8 \text{ ns}, V_T = 0.33 \text{ V and } V_{new} \text{ is sought.}$ 

Solve e.g. by inserting numerical values

$$\frac{8}{7} = \frac{V_{new}}{(V_{new} - 0.33)^2} \cdot \frac{(1.5 - 0.33)^2}{1.5} \Leftrightarrow \frac{8 \cdot 1.5}{7 \cdot (1.5 - 0.33)^2} (V_{new} - 0.33)^2 - V_{new} = 0 \Longrightarrow$$
  
$$V_{new} \approx 1.38 \text{ V} \text{ (other solution is less than } V_T \text{)}$$

Power saving

$$S = 1 - \frac{P_{new}}{P_{old}} = 1 - \frac{fCV_{new}^2}{fCV_{old}^2} \approx 1 - \frac{1.38^2}{1.5^2} \approx 15\%$$

b) Pipelining according to e.g. cut in figure below



New critical path is through c-reg. Repeating calculations with  $t_{old} = t_{pipe} = 4+0$  ns yields  $V_{pipe} \approx 1.01$  V and  $S_{pipe} \approx 54\%$ .

c) Repeating calculations with  $t_{old} = t_{oh} = 4+1$  ns yields  $V_{oh} \approx 1.13$  V.

Using  $P_0 = fC_0 V_{old}^2 \Rightarrow C_0 = P_0 / (fV_{old}^2)$  yields the relative power saving

$$S_{oh} = 1 - \frac{5fC_0 V_{oh}^2}{4P_0} = 1 - \frac{5f \frac{P_0}{fV_{old}^2} V_{oh}^2}{4P_0} = 1 - \frac{5V_{oh}^2}{4V_{old}^2} \approx 29\%$$
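The three cases a), b), and c) differ only in the critical-path delay at 1.5 V and in the number of blocks that dissipate power. A short sketch that recomputes them is given below, assuming $r = 2$, $V_T = 0.33$ V, four equal blocks in the original data path, and one extra register block in c); the function names `scaled_vdd` and `saving` are hypothetical.

```python
import math


def scaled_vdd(t_old_ns, t_new_ns=8.0, v_old=1.5, vt=0.33):
    """Supply voltage that stretches the critical path from t_old to t_new (r = 2)."""
    k = (t_new_ns / t_old_ns) * v_old / (v_old - vt) ** 2
    # k * (V - vt)^2 - V = 0, solved as a quadratic; the larger root is above Vt.
    a, b, c = k, -(2.0 * k * vt + 1.0), k * vt ** 2
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def saving(v_new, v_old=1.5, blocks_new=4, blocks_old=4):
    """Relative power saving when every block consumes P0 at v_old."""
    return 1.0 - (blocks_new / blocks_old) * (v_new / v_old) ** 2


v_a = scaled_vdd(7.0)     # a) critical path c-d, 4 + 3 ns
v_b = scaled_vdd(4.0)     # b) pipelined, register delay neglected
v_c = scaled_vdd(5.0)     # c) pipelined, register delay 1 ns

s_a = saving(v_a)
s_b = saving(v_b)
s_c = saving(v_c, blocks_new=5)   # the register block adds a fifth P0

for label, v, s in (("a)", v_a, s_a), ("b)", v_b, s_b), ("c)", v_c, s_c)):
    print(f"{label} V = {v:.3f} V, saving = {s:.1%}")
```

Small differences from the quoted percentages come from rounding the voltages to two decimals before squaring.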