Lecture 4: The lab system and JPEG acceleration



TSEA44: Computer hardware - a system on a chip

2016-11-10

\_

## Agenda

- Array/memory hints
- Cache in a system
  - The effect of cache in combination with accelerator
- Introduce JPEG encoding of images
  - DCT transform
  - Rata reduction
- Introduce lab 2 (JPEG acceleration)



2016-11-10

3

# Some tips about arrays/memories

- · FPGA memories can be created using
  - Flipflops; asynchronous read, synchronous write
  - Distributed using LUTs; asynchronous read,
     synchronous write, 16x1 each
  - BRAMs; synchronous read, synchronous write,
     512x32, 1024x16 ....
- Memories can be designed
  - Using templates (BRAMs)
  - Inferred (distributed)



TSEA44: Computer hardware - a system on a chip

2016-11-10

.

# Some tips about arrays/memories

- Usually describe memory as arrays
- Two ways to describe arrays in SystemVerilog
  - Packed, e.g., logic [3:0] p;
    - Guaranteed to be contiguous



- Unpacked, e.g., logic u [3:0];
  - Support other data types





```
Unpacked arrays

wire [31:0] bus;
reg [7:0] mem [0:3]; // a 4-byte memory

...

assign bus[31:24] = mem[3];

7
0
mem[0][7:0]
mem[1][7:0]
mem[2][7:0]
mem[3][7:0]

bus[31:0]
```

```
TSEA44: Computer hardware - a system on a chip
                                                        2016-11-10
Packed arrays
wire [31:0] bus;
                                   // a packed array
reg [3:0][7:0] mem;
                                    // so is this
                                    // both are
contiguous
assign bus = mem;
assign bus[31:16] = mem[3:2];
          mem[3][7:0] | mem[2][7:0] | mem[1][7:0] | mem[0][7:0]
         31
                                                          0
                              bus[31:0]
LINKÖPING
UNIVERSITY
```



2016-11-10

#### Caches

- Essential! Required to get good (close to 1) instructions per clock cycle (CPI)
- Expect to fetch 1 instruction each clock cyle
  - Internal (FPGA ROM/RAM) memory have a latency of 3 clockcycles
  - External (SRAM/SDRAM/FLASH) have a latency of 4 clockcycles
- Size: (depending on FPGA) there are up to 120 x 2KB block RAMs
  - => Select 8KB each for IC and DC
- Type: direct mapped (or set associative)



8









2016-11-10

## Cache policy

Cacheline = 4 words = 16 Bytes

#### Instruction cache

|      | hit             | miss                                 |
|------|-----------------|--------------------------------------|
| read | read from cache | fill (replace) cacheline from memory |

#### Data cache

|       | hit                                       | miss                                 |
|-------|-------------------------------------------|--------------------------------------|
| read  | read from cache                           | fill (replace) cacheline from memory |
| write | write to cache<br>write thru to<br>memory | write to memory only                 |



TSEA44: Computer hardware – a system on a chip

#### Or1200 store buffer

- In a write-through data cache is every write equivalent to a cache miss!
- A store (write) buffer is placed between CPU and memory
- Writes are placed in a queue, so that the data cache is available on the next clock cycle







2016-11-10

16

## Accelerator interfacing

- Accelerator should implement functionality that is timeconsuming to run on the CPU
- Interfacing the accelerator require additional data moves
- Simplest case (for the processor)
  - CPU send data to accelerator
  - CPU gets data from accelerator
    - · Data available immediately, no waiting
    - · Usually difficult to implement, processing takes time



16

2016-11-10

## Accelerator interfacing, cont.

- More common case: Accelerator require some time to process data
  - CPU send data to accelerator
  - CPU waits for some time (N clock cycles)
    - No useful work performed by processor
  - CPU gets data from accelerator
  - Worse if time required to wait is unknown
    - Busy wait on the bus: Ask accelerator, but not get a respons for many clock cycles => Stalling CPU, locking bus



TSEA44: Computer hardware - a system on a chip

2016-11-10

18

# Accelerator interfacing, cont.

- Common for the accelerator to have large amount of data to receive, process, and return
- Simplest approach: Use CPU to feed accelerator with data

Mem->CPU

**CPU-> Accelerator** 

Feed data to accelerator, uses CPU

...wait

Accelerator->CPU

CPU -> Mem

Return data from accelerator, uses CPU



2016-11-10

# Accelerator interfacing, cont.

 Want to reduce load on CPU: let the accelerator do the data moves by itself: DMA! (Direct Memory Access)

CPU setups DMA controller in accelerator (startadress, length)

Mem -> Accelerator Feed data to accelerator, CPU do other things ...processing

Accelerator->Mem Return data from accelerator, CPU do other things

A drawback: Both accelerator and CPU compete for the bus Even worse if a number of accelerators work on data in sequence (Accelerator1 -> Accelerator2 ->...)



TSEA44: Computer hardware – a system on a chip

2016-11-10

# Accelerator interfacing, cont.

- Stop communication between accelerators from going over the bus
  - Use special memories interconnecting only accelerators
  - Remove bus use (increase availability for the CPU)
  - The memories are unavailable to the CPU

CPU->Accelerator (setup startadress, length etc.) Mem->Accelerator1

... process in Accelerator1, store result in extra memory

... process in Accelerator2, read input from extra memory Accelerator2->Mem





2016-11-10

22

#### JPEG Introduction

- Joint Photographers Expert Group
- Image compression standard defined by JPEG
  - Remove things that we cannot see
  - Decoded image is slightly different from original
    - Lossy compression





2016-11-10

#### **Problem definition**

- JPEG compression of testbild.raw 512x400 pixels
  - JPEG works on 8x8 blocks => 3200 blocks
  - Unaccelerated JPEG takes more than 32 000 000 clock cycles => 1 block takes more than 10 000 clock cycles!





2016-11-10

#### **Color Conversion**

Y = 0.299 R + 0.587 G + 0.144 B $Cb = -0.1687 R - 0.3313 G + 0.5 B + 2^{P_s - 1}$ 

 $Cr = 0.5 R - 0.4187 G - 0.0813 B + 2^{P_s - 1}$ 

Υ



Cb



F

G

В

Y = luminance Cb/Cr = chrominance

Cr



LINKÖPING UNIVERSITY

TSEA44: Computer hardware – a system on a chip

2016-11-10

26

# Resampling

Υ



Cb



Cr



Data reduction 50%



2016-11-10

### 8-point 1-D DCT/IDCT

$$T(k)=c(k)\sum_{x=0}^{7}v(x)\cos\left(\frac{(2x+1)k\pi}{16}\right), \quad k=0....7$$

$$v(x)=\sum_{k=0}^{7}c(k)T(k)\cos\left(\frac{(2x+1)k\pi}{16}\right), \quad x=0...7$$

$$c(0)=\sqrt{\frac{1}{8}}$$

$$c(k)=\frac{1}{2}, \quad k\neq 0$$

$$C(x;k)=\cos\left(\frac{(2x+1)k\pi}{16}\right)$$

$$\cos\left(\frac{(2x+1)k\pi}{16}\right)$$

$$\cos\left(\frac{(2x+1)k\pi}{16}\right)$$

LINKÖPING UNIVERSITY

TSEA44: Computer hardware - a system on a chip

2016-11-10

# 8x8-point 2-D DCT/IDCT

$$T(k,l) = c(k,l) \sum_{x=0}^{7} \sum_{y=0}^{7} v(x,y)C(y;l)C(x;k), \qquad k,l = 0...7$$

$$v(x,y) = \sum_{k=0}^{7} \sum_{l=0}^{7} c(k,l)T(k,l)C(y;l)C(x;k), \qquad x,y = 0...7$$

$$c(0,0) = \frac{1}{8} \qquad k = l = 0$$

$$c(k,l) = \frac{1}{4} \qquad else$$

$$C(x;k) = \cos\left(\frac{(2x+1)k\pi}{16}\right)$$

LINKÖPING UNIVERSITY

2016-11-10

#### 29

# Simplifications

1. Separation in 
$$\sum_{x=0}^{7} \left[\sum_{y=0}^{7} v(x,y)C(y;l)\right] C(x;k)$$

$$= c(k,l) \sum_{x=0}^{7} B(x,l)C(x;k)$$

2. 1-D DCT can be simplified for N=8



TSEA44: Computer hardware - a system on a chip

2016-11-10

30

# Meaning of the transform

$$v(x,y) = \sum_{k=0}^{7} \sum_{l=0}^{7} T(k,l)c(k,l)C(y;l)C(x;k) =$$

$$= \sum_{k=0}^{7} \sum_{l=0}^{7} T(k,l)C(x,y;k,l)$$

$$C(x,y;0,0)$$

$$v(x,y)$$

$$C(x,y;0,0)$$

$$C(x,y;7,0)$$

$$C(x,y;7,0)$$

$$C(x,y;7,0)$$

$$C(x,y;7,0)$$

$$C(x,y;7,0)$$

$$C(x,y;7,0)$$

$$C(x,y;7,0)$$

LINKÖPING UNIVERSITY



2016-11-10

32

#### **Data Reduction**

Transform, rounded division result Y<sub>q</sub> = round ( DCT<sub>2</sub>(Y)/Q<sub>L</sub>)

$$Y = \begin{bmatrix} 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 162 & 162 & 162 & 161 & 162 & 157 & 163 & 161 \\ 164 & 164 & 158 & 155 & 161 & 159 & 159 & 160 \\ 160 & 160 & 163 & 158 & 160 & 162 & 159 & 156 \\ 159 & 159 & 155 & 157 & 158 & 159 & 156 \end{bmatrix}$$

$$\mathcal{DCT}_2(Y) = \begin{bmatrix} 259 & 5 & 3 & 0 & 0 & -1 & -5 & 6 \\ 8 & -1 & 1 & -5 & 2 & 3 & -4 & 3 \\ -5 & 0 & -2 & 2 & -1 & 0 & 2 & -2 \\ 2 & 1 & 2 & 1 & -1 & -1 & 0 & 1 \\ -1 & -1 & 0 & -1 & 2 & 1 & -1 & -1 \\ 1 & 0 & -2 & 0 & -2 & 1 & 2 & 1 \\ -2 & 0 & 3 & 2 & 2 & -2 & -1 & -1 \\ 1 & 0 & -2 & -2 & -1 & 2 & 1 & 1 \end{bmatrix}$$



2016-11-10

# Loefflers algoritm (fast DCT)

• 1-D 8-point DCT can be simplified

ied 
$$=>$$
 multiplication with  $\sqrt{2}$ 

$$x_{in} + y_{in}$$







$$x_{in} \rightarrow y_{in} \rightarrow cn$$
 $\Rightarrow x_{out} = x_{in} \cdot \cos n\pi/16 + y_{in} \cdot \sin n\pi/16$ 
 $y_{out} = -x_{in} \cdot \sin n\pi/16 + y_{in} \cdot \cos n\pi/16$ 



TSEA44: Computer hardware – a system on a chip

2016-11-10

34

#### Final modification



$$k_3(k_1x + k_2y) = k_3k_1 x + k_3k_2 y$$

precompute



2016-11-10

# RLE = run length encoding, Example:

Raw data: 0001111000010

Encoded as: 3:0, 4:1, 4:0, 1:1, 1:0

Alternative (if only zeroes are plentiful):

Raw data: 50007009000000

Encoded as: 5:3, 7:2, 9:DONE

Special code

LINKÖPING UNIVERSITY

TSEA44: Computer hardware - a system on a chip

2016-11-10

#### 36

## Zigzag Pattern

- Increase possibility of zeros at the end of the sequence
  - Small energy in highest frequencies





35

2016-11-10

# Magnitude encoding (DC only)

| Encoded value | DC Value       | Range        |
|---------------|----------------|--------------|
| 0             | 0              |              |
| 1             | [-1]           | [1]          |
| 2             | [-3, -2]       | [2,3]        |
| 3             | [-7, -4]       | [4,7]        |
| 4             | [-15, -8]      | [8, 15]      |
| 5             | [-31, -16]     | [16, 31]     |
| 6             | [-63, -32]     | [32, 63]     |
| 7             | [-127, -64]    | [64, 127]    |
| 8             | [-255, -128]   | [128, 255]   |
| 9             | [-511, -256]   | [256, 511]   |
| 10            | [-1023, -512]  | [512, 1023]  |
| 11            | [-2047, -1024] | [1024, 2047] |



TSEA44: Computer hardware – a system on a chip

2016-11-10 38

# An example of RLE

#### 1) After Q

```
22 12 0
       -12 0 0 0 0
    -8 0 0 0 0 0
4
  0 0
       0 0 0 0 0
0
           0 0 0 1
  0 0
        0
0
  0
           0
0
          0 0 0 0
  0
  0 0
        0 0 0 0 0
           0
             0 0 0
```

#### 2) After zig-zag

#### 3) After RLE

Value raw bits (amplitude value) 05 10110 -12 => 04 1100 12-1, force MSB=0 13 100 => 0011 24 0011 04 0111 -8 => F0 8-1, force MSB=0 F0 => 0111 D1 1 00 Run of 0:s Magnitude

#### 4) Huffman coding

- Value are HC (variable length) using table lookup
- raw bits are left untouched

LINKÖPING UNIVERSITY reg



2016-11-10

10

#### JFIF Format

- JPEG File Interchange Format
  - Markers
  - Data





2016-11-10

## Finally

- AC and DC values are treated differently
- Two Huffman LUTs are used
- DC
  - Differential, magnitude encoding, Huffman table lookup

| in   | code  | length |
|------|-------|--------|
| 0x00 | 1010  | 4      |
| 0x01 | 00    | 2      |
| 0x02 | 01    | 2      |
| 0x03 | 100   | 3      |
| 0x04 | 1011  | 4      |
| 0x05 | 11010 | 5      |
|      | ٠     |        |

max length=16

- AC
  - As mentioned, raw bits left untouched,
     Huffman table lookup
- Example: value 04, raw bits 1100 => ....10111100....



TSEA44: Computer hardware - a system on a chip

2016-11-10

#### Lab 2 - A JPEG accelerator

- 1. Design HW
- 2. Change existing software jpegfiles under uCLinux
  - a) insert your accelerator
  - b) insert your DMA
  - c) insert your instruction





2016-11-10

44

# Raw image format in memory



0x00ff00ff



8 bit pixels [0,255] 4 pixels/word Somewhere 128 must be subtracted from each pixel!

LINKÖPING UNIVERSITY



2016-11-10

46

## Testcases available

- Lab page of the course webpages
- Includes code for quantization





2016-11-10 4

#### DCT module

- Given to you
  - 1D DCT
    - 8 in ports (12 bits), 8 out ports (16 bits)
    - Fix point arithmetic
    - Straightforward implementation of Loeffler's algorithm



TSEA44: Computer hardware – a system on a chip

2016-11-10

...

# Proposed architecture









2016-11-10

#### **Block RAM**

- Different timing
- "Normal" SRAM
  - Asynchronous read
  - Asynchronous write
- Block RAM in Virtex 2
  - Synchronous read
  - Synchronous write



TSEA44: Computer hardware - a system on a chip

2016-11-10

--

# Proposed architecture





51







2016-11-10

#### Test benches - 2 alternatives

1) Simulate the whole computer - make sim



Insert some code
In the beginning of
the monitor mon2.c

There are some alternatives to uncomment

Tip: You can write to parport to make it easier to find things in ModelSim



2016-11-10

## Test benches - 2 alternatives

2) Simulate the accelerator - make sim\_jpeg

```
tests
1)Write a block to acc
2)Start acc
3)Wait for acc to finish
4)Read block from acc
5) print it out
```

LINKÖPING UNIVERSITY

TSEA44: Computer hardware – a system on a chip

2016-11-10

58

# wb\_tasks.sv

```
module wishbone_tasks(wishbone.master wb);
  int result = 0;
   reg oldack;
  reg [31:0] olddat;
   always @(posedge wb.clk) begin
      oldack <= wb.ack;
      olddat <= wb.dat_i;
   end
   task m_read(input [31:0] adr, output logic [31:0] data);
      @(posedge wb.clk);
      wb.adr <= adr;
      wb.stb <= 1'b1;
      wb.we <= 1'b0;
                                                                 wb.stb <= 1'b0;
      wb.cyc <= 1'b1;
                                                                 wb.we <= 1'b0;
wb.cyc <= 1'b0;
wb.sel <= 4'h0;
      wb.sel <= 4'hf;
      @(posedge wb.clk);
                                                                 data = olddat;
      while (!oldack) begin
                                                                 end
        @(posedge wb.clk);
                                                              endtask // m_read
              #1:
                                                          endmodule // wishbone_tasks
```



2016-11-10 59

TSEA44: Computer hardware - a system on a chip

#### uClinux

- Operating system
- Flat memory
  - User program can crash OS
- /mnt and /var writeable
  - /mnt/htdocs
- TFTP to transfer files
  - Jpegtest
  - Jcdctmgr
  - Jdct
  - jchuff

jpegtest

LINKÖPING UNIVERSITY

www.liu.se

