Lecture 4: The lab system and JPEG encoding



TSEA44: Computer hardware – a system on a chip

2022-11-09

### Agenda

- Array/memory hints
- Cache in a system
  - The effect of cache in combination with accelerator
- Introduce JPEG encoding of images
  - DCT transform
  - Data reduction



2022-11-09

3

### **Practical issues**

- Forming groups
  - Require pass on lab0
  - Send email to me (to get shared folder access)
  - Try to form groups of 2
    - Groups of 1 to 3 is ok; 3 is ok, 2 is prefered, 1 is ok
    - See exam webpage of course for list of students
    - Each group have their own directory to store files /courses/TSEA44/labs/labgrpXX



TSEA44: Computer hardware – a system on a chip

2022-11-09

4

### **Practical issues**

Next lecture is an invited lecture from AXIS

2022-11-09

5

## Some tips about arrays/memories

- FPGA memories can be created using
  - Flipflops; asynchronous read, synchronous write
  - Distributed using LUTs; asynchronous read, synchronous write, 16x1 each
  - BRAMs; synchronous read, synchronous write, 512x32, 1024x16 ....
- Memories can be designed
  - Using templates (BRAMs)
  - Inferred (distributed)



TSEA44: Computer hardware – a system on a chip

2022-11-09

6

### Some tips about arrays/memories

- Usually describe memory as arrays
- Two ways to describe arrays in SystemVerilog
  - Packed, e.g., logic [3:0] p;
    - Guaranteed to be continuous



- Typical for samples, values
- Unpacked, e.g., logic u [3:0];
- a[0] a[1] a[2]
- Support other data types
- Typical for multiple unit of same type



```
Unpacked arrays

wire [31:0] bus;

reg [7:0] mem [0:3]; // a 4-byte memory

...

assign bus[31:24] = mem[3];

7
0
mem[0][7:0]
mem[1][7:0]
mem[2][7:0]
mem[3][7:0]

bus[31:0]
```



2022-11-09 10

### **Caches**

- Essential! Required to get good (close to 1) instructions per clock cycle (IPC)
- Expect to fetch 1 instruction each clock cyle
  - Internal (FPGA ROM/RAM) memory have a latency of 3 clockcycles
  - External (SRAM/SDRAM/FLASH) have a latency of 4 clockcycles
- Size: (depending on FPGA) there are up to 120 x 2KB block RAMs
  - => Select 8KB each for IC and DC
- **Type**: direct mapped (or set associative)











2022-11-09

## Cache policy

Cacheline = 4 words = 16 Bytes

### Instruction cache

|      | hit | miss                                 |
|------|-----|--------------------------------------|
| read |     | fill (replace) cacheline from memory |

### Data cache

|       | hit                                       | miss                                 |
|-------|-------------------------------------------|--------------------------------------|
| read  | read from cache                           | fill (replace) cacheline from memory |
| write | write to cache<br>write thru to<br>memory | write to memory only                 |

LU LINKÖPING UNIVERSITY

TSEA44: Computer hardware - a system on a chip

### Or1200 store buffer

- In a write-through data cache is every write equivalent to a cache miss!
- A store (write) buffer is placed between CPU and memory
- · Writes are placed in a queue, so that the data cache is available on the next clock cycle





2022-11-09 18

### Accelerator interfacing

- Accelerator should implement functionality that is timeconsuming to run on the CPU
- Interfacing the accelerator require additional data moves
- Simplest case (for the processor)
  - CPU send data to accelerator
  - CPU gets data from accelerator
    - · Data available immediately, no waiting
    - · Usually difficult to implement, processing takes time



2022-11-09

# Accelerator interfacing, cont.

- More common case: Accelerator require some time to process data
  - CPU send data to accelerator
  - CPU waits for some time (N clock cycles)
    - No useful work performed by processor
  - CPU gets data from accelerator
  - Worse if time required to wait is unknown
    - Busy wait on the bus: Ask accelerator, but not get a respons for many clock cycles => Stalling CPU, locking bus



TSEA44: Computer hardware - a system on a chip

2022-11-09 20

# Accelerator interfacing, cont.

- Common for the accelerator to have large amount of data to receive, process, and return
- Simplest approach: Use CPU to feed accelerator with data

Mem->CPU

Feed data to accelerator, uses CPU

CPU-> Accelerator

...wait

Accelerator->CPU CPU -> Mem

Return data from accelerator, uses CPU

2022-11-09

### Accelerator interfacing, cont.

 Want to reduce load on CPU: let the accelerator do the data moves by itself: DMA! (Direct Memory Access)

CPU setups DMA controller in accelerator (startadress, length)

Mem -> Accelerator Feed data to accelerator, CPU do other things ...processing

Accelerator->Mem Return data from accelerator, CPU do other things

A drawback: Both accelerator and CPU compete for the bus Even worse if a number of accelerators work on data in sequence (Accelerator1 -> Accelerator2 ->...)



TSEA44: Computer hardware – a system on a chip

2022-11-09 22

### Accelerator interfacing, cont.

- Stop communication between accelerators from going over the bus
  - Use special memories interconnecting only accelerators
  - Remove bus use (increase availability for the CPU)
  - The memories are unavailable to the CPU

CPU->Accelerator (setup startadress, length etc.)

Mem->Accelerator1

- ... process in Accelerator1, store result in extra memory
- ... process in Accelerator2, read input from extra memory

Accelerator2->Mem





2022-11-09 24

### JPEG Introduction

- Joint Photographers Expert Group
- Image compression standard defined by JPEG
  - Remove things that we cannot see
  - Decoded image is slightly different from original
    - Lossy compression





2022-11-09 26

Problem definition
 JPEG compression of testbild.raw 512x400 pixels
 JPEG works on 8x8 blocks => 3200 blocks
 Unaccelerated JPEG takes more than 32 000 000 clock

TSEA44: Computer hardware - a system on a chip



TSEA44: Computer hardware – a system on a chip 2022-11-09 Z7

Color Conversion Y = 0.299 R + 0.587 G + 0.144 B  $Cb = -0.1687 R - 0.3313 G + 0.5 B + 2^{P_s - 1}$   $Cr = 0.5 R - 0.4187 G - 0.0813 B + 2^{P_s - 1}$ Y

Cb

Y = luminance
Cb/Cr = chrominance

Cr

Cr

LINKOPING
UNIVERSITY



2022-11-09

### 8-point 1-D DCT/IDCT

$$T(k) = c(k) \sum_{x=0}^{7} v(x) \cos\left(\frac{(2x+1)k\pi}{16}\right), \quad k = 0....7$$

$$v(x) = \sum_{k=0}^{7} c(k) T(k) \cos\left(\frac{(2x+1)k\pi}{16}\right), \quad x = 0....7$$

$$c(0) = \sqrt{\frac{1}{8}}$$

$$c(k) = \frac{1}{2}, \quad k \neq 0$$

$$C(x;k) = \cos\left(\frac{(2x+1)k\pi}{16}\right)$$
coord freq

LU LINKÖPING UNIVERSITY

TSEA44: Computer hardware - a system on a chip

2022-11-09

# 8x8-point 2-D DCT/IDCT

$$T(k,l) = c(k,l) \sum_{x=0}^{7} \sum_{y=0}^{7} v(x,y)C(y;l)C(x;k), \qquad k,l = 0...7$$

$$v(x,y) = \sum_{k=0}^{7} \sum_{l=0}^{7} c(k,l)T(k,l)C(y;l)C(x;k), \qquad x,y = 0...7$$

$$c(0,0) = \frac{1}{8} \qquad k = l = 0$$

$$c(k,l) = \frac{1}{4} \qquad else$$

$$C(x;k) = \cos\left(\frac{(2x+1)k\pi}{16}\right)$$

2022-11-09 3

### Simplifications

1. Separation in x and y

$$T(k,l) = c(k,l) \sum_{x=0}^{7} \left\{ \sum_{y=0}^{7} v(x,y)C(y;l) \right\} C(x;k)$$

$$= c(k,l) \sum_{x=0}^{7} B(x,l)C(x;k)$$



2. 1-D DCT can be simplified for N=8



TSEA44: Computer hardware – a system on a chip

2022-11-09

# Meaning of the transform

$$v(x,y) = \sum_{k=0}^{7} \sum_{l=0}^{7} T(k,l)c(k,l)C(y;l)C(x;k) =$$

$$= \sum_{k=0}^{7} \sum_{l=0}^{7} T(k,l)C(x,y;k,l)$$

C(x, y; 0, 0)

v(x,y)



C(x, y)

C(x, y; 7, 7)





2022-11-09 34

### **Data Reduction**

• Transform, rounded division result  $Y_q = \text{round (}DCT_2(Y)/Q_L)$ 

$$Y = \begin{bmatrix} 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 164 & 164 & 158 & 155 & 161 & 159 & 159 & 160 \\ 160 & 160 & 163 & 158 & 160 & 162 & 159 & 156 \\ 159 & 159 & 155 & 157 & 158 & 159 & 156 & 157 \end{bmatrix}$$

$$\mathcal{DCT}_2(Y) = \begin{bmatrix} 259 & 5 & 3 & 0 & 0 & -1 & -5 & 6 \\ 8 & -1 & 1 & -5 & 2 & 3 & -4 & 3 \\ -5 & 0 & -2 & 2 & -1 & 0 & 2 & -2 \\ 2 & 1 & 2 & 1 & -1 & -1 & 0 & 1 \\ -1 & -1 & 0 & -1 & 2 & 1 & -1 & -1 \\ 1 & 0 & -2 & 0 & -2 & 1 & 2 & 1 \\ -2 & 0 & 3 & 2 & 2 & -2 & -1 & -1 \\ 1 & 0 & -2 & -2 & -1 & 2 & 1 & 1 \end{bmatrix}$$

$$Q_L = \begin{bmatrix} 16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\ 12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\ 14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\ 14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\ 18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\ 24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\ 49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\ 72 & 92 & 95 & 98 & 112 & 100 & 103 & 99 \end{bmatrix}$$





2022-11-09

### RLE = run length encoding, Example:

Raw data: 0 0 0 1 1 1 1 0 0 0 0 1 0

Encoded as: 3:0, 4:1, 4:0, 1:1, 1:0

Alternative (if only zeroes are plentiful):

Raw data: 50007009000000

Encoded as: 5:3, 7:2, 9:DONE

Special code

LINKÖPING UNIVERSITY

TSEA44: Computer hardware – a system on a chip

2022-11-09 38

# Zigzag Pattern

- Increase possibility of zeros at the end of the sequence
  - Small energy in highest frequencies



2022-11-09 3

### Magnitude encoding (DC only)

| Encoded value | DC Value       | Range        |
|---------------|----------------|--------------|
| 0             | 0              |              |
| 1             | [-1]           | [1]          |
| 2             | [-3, -2]       | [2, 3]       |
| 3             | [-7, -4]       | [4,7]        |
| 4             | [-15, -8]      | [8, 15]      |
| 5             | [-31, -16]     | [16, 31]     |
| 6             | [-63, -32]     | [32, 63]     |
| 7             | [-127, -64]    | [64, 127]    |
| 8             | [-255, -128]   | [128, 255]   |
| 9             | [-511, -256]   | [256, 511]   |
| 10            | [-1023, -512]  | [512, 1023]  |
| 11            | [-2047, -1024] | [1024, 2047] |



TSEA44: Computer hardware – a system on a chip

2022-11-09 40

## An example of RLE

### 1) After Q

22 12 0 -12 0 0 0 0 0 -8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

### 2) After zig-zag

### 3) After RLE

Value raw bits (amplitude value) 05 10110 -12 => 1100 04 12-1, force MSB=0 13 100 => 0011 0011 0111 F0 8-1, force MSB=0 F0 => 0111 D1 00

Run of 0:s Magnitude (number of raw bits)

### 4) Huffman coding

- Value are HC (variable length) using table lookup
- raw bits are left untouched

LINKÖPING reg





2022-11-09 43

TSEA44: Computer hardware – a system on a chip

# Finally

- AC and DC values are treated differently
- Two Huffman LUTs are used
- DC
  - Differential, magnitude encoding, Huffman table lookup

| in   | code  | length |
|------|-------|--------|
| 0x00 | 1010  | 4      |
| 0x01 | 00    | 2      |
| 0x02 | 01    | 2      |
| 0x03 | 100   | 3      |
| 0x04 | 1011  | 4      |
| 0x05 | 11010 | 5      |
|      |       |        |

max length=16

- AC
  - As mentioned, raw bits left untouched, Huffman table lookup
- Example: value 04, raw bits 1100 => ....10111100....



www.liu.se

