# TSEA44: Computer hardware – a system on a chip

Lecture 4: The lab system and JPEG acceleration




2017-11-13

### Agenda

- Array/memory hints
- Cache in a system
  - The effect of cache in combination with accelerator
- Introduce JPEG encoding of images
  - DCT transform
  - Data reduction
- Introduce lab 2 (JPEG acceleration)




### Practical issues

- Forming groups
  - Send email to me (to get shared folder access)
  - Try to form groups of 3
    - Groups of 2 or 3 are expected; 3 is preferred
    - See the course exam webpage for the list of students


### Some tips about arrays/memories

- FPGA memories can be created using
  - Flip-flops; asynchronous read, synchronous write
  - Distributed RAM using LUTs; asynchronous read, synchronous write, 16x1 each
  - BRAMs; synchronous read, synchronous write, 512x32, 1024x16, ...
- Memories can be designed
  - Using templates (BRAMs)
  - Inferred (distributed)
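The difference between asynchronous and synchronous read can be illustrated with a small behavioural model. This is only a sketch in Python, not hardware description code: the class names `AsyncReadMemory` and `SyncReadMemory` are hypothetical, and the choice that a synchronous read returns the old contents when reading and writing the same address on one edge is an assumption.

```python
class AsyncReadMemory:
    """Flip-flop or LUT RAM style: asynchronous read, synchronous write."""

    def __init__(self, depth):
        self.cells = [0] * depth

    def read(self, addr):
        # Combinational read: data follows the address without a clock edge
        return self.cells[addr]

    def clock(self, we=False, waddr=0, wdata=0):
        # Synchronous write: a cell changes only on a clock edge
        if we:
            self.cells[waddr] = wdata


class SyncReadMemory:
    """Block RAM style: synchronous read, synchronous write."""

    def __init__(self, depth):
        self.cells = [0] * depth
        self.dout = 0  # registered read data

    def clock(self, raddr=0, we=False, waddr=0, wdata=0):
        # Read data is captured on the clock edge (old contents, an assumption)
        self.dout = self.cells[raddr]
        if we:
            self.cells[waddr] = wdata


lut_ram = AsyncReadMemory(16)
lut_ram.clock(we=True, waddr=3, wdata=42)
async_value = lut_ram.read(3)           # available without another edge

bram = SyncReadMemory(512)
bram.clock(raddr=3, we=True, waddr=3, wdata=42)
value_after_write_edge = bram.dout      # still the old contents
bram.clock(raddr=3)
value_one_edge_later = bram.dout        # the written value
```

In this model the block RAM needs one extra clock edge before written data can be observed, which is the kind of latency a design built around BRAMs has to account for.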


### Caches

- Essential! Required to get good (close to 1) clock cycles per instruction (CPI)
- Expect to fetch 1 instruction each clock cycle
  - Internal (FPGA ROM/RAM) memory has a latency of 3 clock cycles
  - External (SRAM/SDRAM/FLASH) memory has a latency of 4 clock cycles
- Size: (depending on FPGA) there are up to 120 x 2KB block RAMs
  - => Select 8KB each for the instruction cache (IC) and data cache (DC)
- Type: direct mapped (or set associative)
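As a minimal sketch of how a direct-mapped cache locates a line, the following splits an address into tag, index, and byte offset. The 8 KB size comes from the slide; the 16-byte line size and the function name `split_address` are assumptions made for illustration.

```python
CACHE_BYTES = 8 * 1024   # 8 KB, as selected for IC and DC
LINE_BYTES = 16          # assumption: 16-byte cache lines

NUM_LINES = CACHE_BYTES // LINE_BYTES        # 512 lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 4 bits select a byte in the line
INDEX_BITS = NUM_LINES.bit_length() - 1      # 9 bits select the line


def split_address(addr):
    """Return (tag, index, offset) for a direct-mapped cache lookup."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


# Two addresses exactly one cache size apart share an index but differ in tag,
# so in a direct-mapped cache they evict each other.
a = split_address(0x1234)
b = split_address(0x1234 + CACHE_BYTES)
```

A set-associative cache would keep several tags per index, so the two addresses above could be resident at the same time.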


### Accelerator interfacing

- The accelerator should implement functionality that is time-consuming to run on the CPU
- Interfacing the accelerator requires additional data moves
- Simplest case (for the processor)
  - CPU sends data to accelerator
  - CPU gets data from accelerator
    - Data available immediately, no waiting
    - Usually difficult to implement, since processing takes time


### Accelerator interfacing, cont.

- More common case: the accelerator requires some time to process data
  - CPU sends data to accelerator
  - CPU waits for some time (N clock cycles)
    - No useful work performed by processor
  - CPU gets data from accelerator
  - Worse if the required waiting time is unknown
    - Busy wait on the bus: the CPU asks the accelerator but gets no response for many clock cycles => the CPU stalls and the bus is locked


### Accelerator interfacing, cont.

- It is common for the accelerator to have a large amount of data to receive, process, and return
- Simplest approach: use the CPU to feed the accelerator with data
  - Mem -> CPU, CPU -> Accelerator: feed data to accelerator, uses CPU
  - ... wait
  - Accelerator -> CPU, CPU -> Mem: return data from accelerator, uses CPU


### Accelerator interfacing, cont.

- To reduce the load on the CPU, let the accelerator do the data moves by itself: DMA (Direct Memory Access)
  - CPU sets up the DMA controller in the accelerator (start address, length)
  - Mem -> Accelerator: feed data to accelerator, CPU does other things
  - ... processing
  - Accelerator -> Mem: return data from accelerator, CPU does other things
- A drawback: the accelerator and the CPU compete for the bus
  - Even worse if a number of accelerators work on the data in sequence (Accelerator1 -> Accelerator2 -> ...)
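To make the benefit of DMA concrete, the transfers in the CPU-fed and DMA sequences can be counted. This is a rough sketch under stated assumptions: every arrow in the sequences costs one bus transfer per data word, instruction fetches and caches are ignored, and the DMA setup is taken to be two register writes (start address and length). The function names are hypothetical.

```python
def cpu_fed_transfers(words_in, words_out):
    """CPU copies every word: Mem->CPU, CPU->Accelerator, Accelerator->CPU, CPU->Mem."""
    bus = 2 * words_in + 2 * words_out
    cpu = bus  # the CPU takes part in every transfer
    return bus, cpu


def dma_transfers(words_in, words_out, setup_writes=2):
    """Accelerator moves the data itself: Mem->Accelerator, Accelerator->Mem.

    setup_writes is an assumption: one write each for start address and length.
    """
    bus = setup_writes + words_in + words_out
    cpu = setup_writes  # the CPU only programs the DMA controller
    return bus, cpu


# Example: one 8x8 block, assuming one word per sample in each direction
cpu_fed = cpu_fed_transfers(64, 64)
dma = dma_transfers(64, 64)
```

Under these assumptions DMA roughly halves the bus traffic and removes almost all CPU involvement, but the remaining transfers still compete with the CPU for the bus.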


### Accelerator interfacing, cont.

- Stop communication between accelerators from going over the bus
  - Use special memories that interconnect only the accelerators
  - Removes bus use (increases bus availability for the CPU)
  - The memories are unavailable to the CPU
- Sequence
  - CPU -> Accelerator (set up start address, length, etc.)
  - Mem -> Accelerator1
  - ... process in Accelerator1, store result in extra memory
  - ... process in Accelerator2, read input from extra memory
  - Accelerator2 -> Mem
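The saving from dedicated inter-accelerator memories can be estimated in the same spirit. The sketch below assumes that every stage produces as many words as it consumes and that each word moved to or from main memory costs one bus transfer; both function names are hypothetical.

```python
def chain_via_main_memory(num_accelerators, words):
    """Every stage reads its input from, and writes its output to, main memory."""
    return 2 * num_accelerators * words


def chain_via_private_memories(num_accelerators, words):
    """Only the first stage reads from, and the last stage writes to, main memory."""
    if num_accelerators < 1:
        raise ValueError("need at least one accelerator")
    return 2 * words


# Example: Accelerator1 -> Accelerator2 working on 64 words
over_bus = chain_via_main_memory(2, 64)
with_extra_memory = chain_via_private_memories(2, 64)
```

With extra memories the bus traffic no longer grows with the number of accelerators in the chain, at the price that the intermediate data cannot be inspected by the CPU.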


### JPEG Introduction

- Joint Photographic Experts Group
- Image compression standard defined by JPEG
  - Remove things that we cannot see
  - Decoded image is slightly different from original
    - Lossy compression


### 8x8-point 2-D DCT/IDCT

$$T(k,l) = c(k,l) \sum_{x=0}^{7} \sum_{y=0}^{7} v(x,y) C(y;l) C(x;k), \qquad k,l = 0...7$$

$$v(x,y) = \sum_{k=0}^{7} \sum_{l=0}^{7} c(k,l) T(k,l) C(y;l) C(x;k), \qquad x,y = 0...7$$

$$c(k,l) = \begin{cases} \frac{1}{8} & k = l = 0 \\ \frac{1}{4\sqrt{2}} & \text{exactly one of } k, l \text{ is } 0 \\ \frac{1}{4} & \text{otherwise} \end{cases}$$

$$C(x;k) = \cos\left(\frac{(2x+1)k\pi}{16}\right)$$
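The transform pair can be checked numerically with a direct, unoptimized implementation. This is a sketch for illustration only. The scale factor is taken as the separable product α(k)α(l)/4 with α(0) = 1/√2 and α = 1 otherwise, an assumption that gives 1/8 for k = l = 0, 1/4 when both indices are nonzero, and 1/(4√2) on the rest of the first row and column, which is what makes the round trip exact. The function names `dct2` and `idct2` and the test block are hypothetical.

```python
import math


def C(x, k):
    """Cosine basis C(x;k) = cos((2x+1) k pi / 16)."""
    return math.cos((2 * x + 1) * k * math.pi / 16)


def alpha(i):
    return 1 / math.sqrt(2) if i == 0 else 1.0


def c(k, l):
    """Scale factor, assumed separable: alpha(k) * alpha(l) / 4."""
    return alpha(k) * alpha(l) / 4


def dct2(v):
    """Forward transform: T(k,l) = c(k,l) * sum_x sum_y v(x,y) C(y;l) C(x;k)."""
    return [[c(k, l) * sum(v[x][y] * C(y, l) * C(x, k)
                           for x in range(8) for y in range(8))
             for l in range(8)] for k in range(8)]


def idct2(T):
    """Inverse transform: v(x,y) = sum_k sum_l c(k,l) T(k,l) C(y;l) C(x;k)."""
    return [[sum(c(k, l) * T[k][l] * C(y, l) * C(x, k)
                 for k in range(8) for l in range(8))
             for y in range(8)] for x in range(8)]


# Arbitrary 8x8 test block, indexed as block[x][y]
block = [[(x * 8 + y) % 17 for y in range(8)] for x in range(8)]
coeffs = dct2(block)
restored = idct2(coeffs)
max_error = max(abs(restored[x][y] - block[x][y])
                for x in range(8) for y in range(8))
```

Each of the 64 outputs needs 64 multiply-accumulate operations in this form, and `coeffs[0][0]` is simply the sum of all samples divided by 8.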


### Simplifications

1. Separation in $x$ and $y$:

$$T(k,l) = c(k,l) \sum_{x=0}^{7} \left( \sum_{y=0}^{7} v(x,y) C(y;l) \right) C(x;k) = c(k,l) \sum_{x=0}^{7} B(x,l) C(x;k)$$

2. The 1-D DCT can be simplified for N=8
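The separation can be verified by computing the double sum both ways. The sketch below compares only the unscaled sums, since the factor c(k,l) multiplies both sides identically; the function names and the test block are hypothetical.

```python
import math


def C(x, k):
    """Cosine basis C(x;k) = cos((2x+1) k pi / 16)."""
    return math.cos((2 * x + 1) * k * math.pi / 16)


def direct_sum(v, k, l):
    """Unscaled 2-D sum for one output: sum_x sum_y v(x,y) C(y;l) C(x;k)."""
    return sum(v[x][y] * C(y, l) * C(x, k)
               for x in range(8) for y in range(8))


def row_pass(v):
    """B(x,l) = sum_y v(x,y) C(y;l): a 1-D transform of every row."""
    return [[sum(v[x][y] * C(y, l) for y in range(8))
             for l in range(8)] for x in range(8)]


def column_pass(B):
    """sum_x B(x,l) C(x;k): a 1-D transform of every column of B."""
    return [[sum(B[x][l] * C(x, k) for x in range(8))
             for l in range(8)] for k in range(8)]


v = [[(3 * x + 5 * y) % 11 for y in range(8)] for x in range(8)]
separated = column_pass(row_pass(v))
max_diff = max(abs(separated[k][l] - direct_sum(v, k, l))
               for k in range(8) for l in range(8))

# Multiply-accumulate counts for one 8x8 block
DIRECT_MACS = 64 * 64          # 64 outputs, 64 terms each
SEPARATED_MACS = 2 * 64 * 8    # two passes, 64 outputs each, 8 terms each
```

Separation cuts the work from 4096 to 1024 multiply-accumulates per block before any simplification of the 1-D transform itself.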
























### Finally

- AC and DC values are treated differently
- Two Huffman LUTs are used
- DC
  - Differential, magnitude encoding, Huffman table lookup
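The differential and magnitude steps for DC values can be sketched as follows. The representation of negative differences follows the baseline JPEG convention; the function name `dc_magnitude_bits` is hypothetical, and the final Huffman lookup of the size is left out because the DC table is not reproduced here.

```python
def dc_magnitude_bits(prev_dc, dc):
    """Return (size, raw_bits) for the difference between two DC values."""
    diff = dc - prev_dc
    size = abs(diff).bit_length()   # number of raw bits needed
    if size == 0:
        return 0, ""                # a zero difference has no raw bits
    # Positive values are stored as they are; negative values are stored
    # as diff + 2**size - 1, i.e. the low `size` bits of diff - 1.
    value = diff if diff > 0 else diff + (1 << size) - 1
    return size, format(value, "0{}b".format(size))


rising = dc_magnitude_bits(10, 15)    # difference +5
falling = dc_magnitude_bits(15, 10)   # difference -5
flat = dc_magnitude_bits(7, 7)        # difference 0
```

The size is the symbol that goes through the Huffman table, and the raw bits are appended to the resulting code unchanged.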

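Huffman table excerpt (max length = 16):

| in | code | length |
|------|-------|--------|
| 0x00 | 1010 | 4 |
| 0x01 | 00 | 2 |
| 0x02 | 01 | 2 |
| 0x03 | 100 | 3 |
| 0x04 | 1011 | 4 |
| 0x05 | 11010 | 5 |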

- AC
  - As mentioned, raw bits left untouched, Huffman table lookup
- Example: value 04, raw bits 1100 => ....10111100....
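The example can be reproduced with a table lookup followed by the untouched raw bits. This is a minimal sketch: the dictionary holds only the six entries listed in the table excerpt, and the function name `encode` is hypothetical.

```python
# Symbol -> Huffman code, taken from the table excerpt.
# 0x01 is written as the 2-bit code "00" to match its listed length of 2.
HUFFMAN = {
    0x00: "1010",
    0x01: "00",
    0x02: "01",
    0x03: "100",
    0x04: "1011",
    0x05: "11010",
}


def encode(symbol, raw_bits):
    """Huffman code for the symbol followed by the raw bits, left untouched."""
    return HUFFMAN[symbol] + raw_bits


encoded = encode(0x04, "1100")
```

Here `encoded` is the bit string `10111100`, the same sequence as in the example.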




