Kent Palmkvist, Erik Bertilsson http://www.isy.liu.se/edu/kurs/TSEA44 Based on slides by Andreas Ehliar



TSEA44: Computer hardware - a system on a chip

2016-11-01

#### What is the course about?

- How to build a complete embedded computer using an FPGA and a few other components. Why?
  - Only one chip
  - The computer can easily be tailored to your needs.
    - Special instructions
    - Accelerators
    - DMA transfer
  - The computer can be simulated
  - A logic analyzer can be added in the FPGA
    - Add performance counters
  - It's fun!




#### Prerequisites (expected knowledge!)

- Digital logic design. You will design both a data path and a control unit for an accelerator.
- Binary arithmetic. Signed/unsigned numbers.
- VHDL or Verilog. SystemVerilog (SV) is the language used in the course.
- Computer Architecture. It is extremely important to understand how a CPU executes code. You will also design part of a DMA-controller. Bus cycles are central.
- ASM and C programming. Most of the programming is done in C, with a few cases of inline asm.




#### Course organisation

- Lab 0: learn enough Verilog, 4 hours
  - Individual work and demonstration
- Lab course: 4 mini projects
  - 6 groups \* 3 students in the lab
- Lectures: 8\*2 hours
  - 1 guest lecture from ARM
- Examination 6 credits:
  - 3 written reports/group
  - Oral individual questions



#### Lab course is based on an application 2004 - tracking

- RoboCup small size league: www.robocup.org/2012/06/robocup-the-small-sized-league/
- Funny/impressive YouTube video available

[Figure labels: our FPGA computer, lirobot BW]


LINKÖPING UNIVERSITY


#### Lab course is based on an application 2015-16 - JPEG compression

- Take 2-D DCT on 8x8-blocks
- Quantize = Divide and set small values to zero
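
The two steps above can be sketched in software as a reference model. This is a minimal sketch, not the lab code: the function names and the single uniform quantization step are assumptions made for illustration (a real JPEG encoder uses a quantization table with one step per coefficient and level-shifts the pixels first).

```python
import math

N = 8  # JPEG operates on 8x8 blocks


def dct2_8x8(block):
    """Naive 2-D DCT-II of an 8x8 block given as 8 rows of 8 numbers."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def quantize(coeffs, step):
    """Divide by `step` and round, so small coefficients become zero.

    `step` is a hypothetical uniform step size chosen for this sketch.
    """
    return [[round(value / step) for value in row] for row in coeffs]


# A flat block has all its energy in the DC coefficient out[0][0].
flat = [[8] * N for _ in range(N)]
q = quantize(dct2_8x8(flat), step=16)
print(q[0])
```

The four nested loops cost 64 multiply-accumulates per coefficient, which is one reason the transform is a natural candidate for a hardware accelerator.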






#### Lab tasks and examination

- Lab 0 (individual work and demonstration)
  - Build a UART in Verilog
  - Demonstration
- Lab 1 (in groups of two or three students)
  - Interface to the Wishbone bus
  - Demonstration (individual questions)
  - Written report




#### Lab tasks and examination, cont.

- Lab 2+3
  - Design a JPEG accelerator + DMA
  - Demonstration (with individual questions)
  - Written report
- Lab 4
  - Custom instruction
  - Demonstration (with individual questions)
  - Written report




#### Written report requirements

- A readable short report typically consisting of
  - Introduction
  - Design, where you explain with text and diagrams how your design works
  - Results that you have measured
  - Conclusions
  - Appendix: All Verilog and C code with comments!




#### Competition - fastest JPEG compression

- An unaccelerated JPEG compression (using jpegfiles) takes roughly 13.0 Mcycles, which at 25 MHz gives about 2 FPS (frames per second)
- Our record: ~ 100 000 cycles (everything in hardware)
- Goal: highest frame rate. Exception: above 25 FPS, the smallest implementation wins
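
The arithmetic behind these figures is simply clock frequency divided by cycles per image. A small sketch, assuming one compressed image counts as one frame (the function name is made up for illustration):

```python
CLOCK_HZ = 25_000_000  # 25 MHz system clock, from the slide


def frames_per_second(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frame rate when one frame costs `cycles_per_frame` clock cycles."""
    return clock_hz / cycles_per_frame


software = frames_per_second(13_000_000)  # unaccelerated compression
record = frames_per_second(100_000)       # everything in hardware
budget_25fps = CLOCK_HZ // 25             # cycles available per frame at 25 FPS

print(f"software: {software:.2f} FPS")
print(f"record:   {record:.0f} FPS")
print(f"25 FPS needs at most {budget_25fps} cycles per frame")
```

Any design that gets below one million cycles per image is therefore past the 25 FPS threshold, and from there on size decides.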



[Figure: wunderb.jpg, 320 x 240]




#### The hardware

6 boxes with FPGA boards










#### Processor core: OpenRISC 1200

- Initially developed within the OpenCores initiative
- Split into a new website
  - openrisc.io
- Complete RISC processor including synthesizable code, instruction set simulator, etc.




#### (System)Verilog

- The course uses SystemVerilog
- SystemVerilog is easy to learn if you know VHDL/C
- Our soft computer (80% downloaded from OpenCores) is written in Verilog
- It is possible to use both languages in a design
- You need to understand parts of the computer




#### (System) Verilog vs VHDL

An edge-triggered D flip-flop

SystemVerilog (C-like syntax):

```
module dff(
  input clk, d,
  output reg q);

  // q takes the value of d on every rising clock edge
  always_ff @(posedge clk)
    q <= d;
endmodule
```

VHDL (Ada-like syntax):

```
library ieee;
use ieee.std_logic_1164.all;

entity dff is
  port (clk, d : in std_logic;
        q : out std_logic);
end dff;

architecture firsttry of dff is
begin
  process (clk) begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;
end firsttry;
```




#### (System) Verilog vs VHDL

Using the D flip-flop: instantiation




#### Booting uClinux

uClinux/OR32
Flat model support (C) 1998,1999 Kenneth Albanowski, D. Jeff Dionne
Calibrating delay loop. ok - 2.00 BogoMIPS
Memory available: 53000k/62325k RAM, 0k/0k ROM (667892k kernel data, 2182k code)
Swansea University Computer Society NET3.035 for Linux 2.0
NET3: Unix domain sockets 0.13 for Linux NET3.035.
Swansea University Computer Society TCP/IP for NET3.034
IP Protocols: ICMP, UDP, TCP
uClinux version 2.0.38.1pre3 (olles@kotte) (gcc version 3.2.3) #180 Sat Sep 11 09:01:55 CEST 2004
Serial driver version 4.13p1 with no serial options enabled
ttyS00 at 0x90000000 (irq = 2) is a 16550A
Ramdisk driver initialized : 16 ramdisks of 2048K size
Blkmem copyright 1998,1999 D. Jeff Dionne
Blkmem copyright 1998 Kenneth Albanowski
Blkmem 0 disk images:
loop: registered device at major 7
eth0: Open Ethernet Core Version 1.0
RAMDISK: Romfs filesystem found at block 0
RAMDISK: Loading 1608 blocks into ram disk... done.
VFS: Mounted root (romfs filesystem).
Executing shell ..
Shell invoked to run file: /etc/rc
Command: #!/bin/sh
Command: setenv PATH /bin:/sbin:/usr/bin
Command: hostname bender
Command: #
Command: mount -t proc none /proc
... More of the same
Command: #
Command: # start web server
Command: /sbin/boa -d &
[12]




#### Web server






#### Lecture info

- 1 Course Intro, FPGA
- 2 Verilog (lab0)
- 3 A soft CPU
- 4 A soft computer (lab1)
- 5 HW acceleration (lab2), guest lecture from ARM
- 6 FPGAs
- 7 Test benches, SV
- 8 Custom instructions (lab4)




#### **Books**

- Lilja, Sapatnekar: *Designing Digital Computer Systems with Verilog*, Cambridge University Press
- Sutherland et al.: *SystemVerilog for Design*, Springer
- Spear: *SystemVerilog for Verification*, Springer




#### How we built our first FPGA computer

- 1. Download CPU OR1200, roughly 60 Verilog files
- 2. Download Wishbone bus, 3 Verilog files
- 3. Download UART 16550, 9 Verilog files
- 4. Figure out a computer




#### How we built our first FPGA computer

5. Write top file ("wire wrap in emacs")

Size 35 kB in Verilog, 13 kB in SV (Verilog does not have struct)

```
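module myfirstcomputer(clk,rst,rx,tx);
  input clk,rst,rx;
  output tx;

  // Wishbone interface bundles: masters (Mx) and slaves (Sx)
  wishbone Mx[0:1], Sx[0:1];

  or1200cpu cpu0(.iwb(Mx[0]), ...);  // CPU, instruction bus on master port 0
  wb_conbus wb0(clk, rst, Mx, Sx);   // bus interconnect between masters and slaves
  romram rom0(Sx[1]);                // ROM/RAM on slave port 1
  uart uart0(Sx[0], ...);            // UART on slave port 0
endmodule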
```


#### How we built our first FPGA computer

- 6. Download the cross compiler
- 7. Write a small monitor and place in ROM
- 8. ModelSim. Does it boot? Anything on tx?
- 9. Test with the simulator or32-uclinux-sim
- 10. Synthesize for 10 minutes (originally 40 minutes; note that simulation is quite important in this course)









#### Xilinx - Virtex II Overview

[Table: Virtex-II family overview, devices XC2V40 to XC2V8000, with rows for CLB array, 18Kb BRAM, multipliers, DCM, and max IOB; BRAM and multipliers are arranged in 2, 4, or 6 columns depending on the device]

Our FPGA has 5760 CLBs = 23040 slices = 46080 LUTs+FFs


#### Synthesis result

| Module      | LUT   | FF    | RAMB16 | MULT_18x18 | IOB |
|-------------|-------|-------|--------|------------|-----|
|             | 64    |       |        |            | 216 |
| cpu         | 5029  | 1345  | 12     | 4          |     |
| dvga        | 813   | 755   | 4      |            |     |
| eth3        | 3022  | 2337  | 4      |            |     |
| jpg0        | 2203  | 900   | 2      | 13         |     |
| leela       | 685   | 552   | 4      | 2          |     |
| pia         | 2     | 5     |        |            |     |
| pkmc_mc     | 218   | 122   |        |            |     |
| rom0        | 82    | 3     | 12     |            |     |
| sys_sig_gen |       | 6     |        |            |     |
| uart2       | 825   | 346   |        |            |     |
| wb_conbus   | 616   | 11    |        |            |     |
| Total       | 13559 | 6382  | 38     | 19         | 216 |
| Available   | 46080 | 46080 | 120    | 120        | 912 |
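
To read the table as utilization, divide the Total row by the Available row. A short sketch using only the numbers from the table (the dictionary names are chosen for illustration):

```python
# Totals and device capacity copied from the synthesis result table
used = {"LUT": 13559, "FF": 6382, "RAMB16": 38, "MULT_18x18": 19, "IOB": 216}
available = {"LUT": 46080, "FF": 46080, "RAMB16": 120, "MULT_18x18": 120, "IOB": 912}

# Percentage of each resource type consumed by the whole design
utilization = {name: 100 * used[name] / available[name] for name in used}

for name, percent in utilization.items():
    print(f"{name:<11}{used[name]:>6} / {available[name]:<6}{percent:5.1f} %")
```

Block RAM and LUTs are the most heavily used resources, at roughly 30 percent each, so there is room left for an accelerator.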




#### Floorplan from FPGA Editor



[Figure: floorplan, labels: Computer, CPU OR1200]




#### **CLB** contains four slices

- Each CLB is connected to one switch matrix
  - 1 slice = 2 LUT/FF + ...
- High level of logic integration
  - Wide-input functions
    - 16:1 multiplexer in 1 CLB
    - 32:1 multiplexer in 2 CLBs
  - Fast arithmetic functions
    - 2 look-ahead carry chains per CLB column
  - Addressable shift register in LUT
    - 16-bit shift register in 1 LUT
    - 128-bit shift register in 1 CLB
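
The per-CLB figures follow from the slice and LUT counts. A quick check, using the 5760 CLBs quoted for our FPGA (constant names are chosen for illustration):

```python
SLICES_PER_CLB = 4     # "CLB contains four slices"
LUTS_PER_SLICE = 2     # "1 slice = 2 LUT/FF + ..."
SRL_BITS_PER_LUT = 16  # one LUT used as a 16-bit shift register

luts_per_clb = SLICES_PER_CLB * LUTS_PER_SLICE
shift_bits_per_clb = luts_per_clb * SRL_BITS_PER_LUT

clbs = 5760            # CLB count of our FPGA
slices = clbs * SLICES_PER_CLB
luts = slices * LUTS_PER_SLICE

print(luts_per_clb, shift_bits_per_clb, slices, luts)
```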










#### **IOB** element

- Input path
  - Two DDR registers
- Output path
  - Two DDR registers
  - Two 3-state DDR registers
- Separate clocks for I & O
- Set and reset signals are shared
  - Separated sync/async
  - Separated Set/Reset attribute per register






#### Embedded 18 Kb Block RAM

- Up to 3 Mb on-chip block RAM
- High internal buffering bandwidth
- Clocked write and read

[Figure: 18 Kbit block RAM]

- Parity bit locations (parity in/out buses)
- Data width up to 36 bits
- 3 WRITE modes
- Output latches Set/Reset
- True Dual-Port RAM
  - Independent clock (async.) & control



#### True Dual-Port<sup>™</sup> configurations

- Configurations available on each port:

| Configuration | Depth | Data bits | Parity bits |
|---------------|-------|-----------|-------------|
| 16Kx1         | 16K   | 1         | 0           |
| 8Kx2          | 8K    | 2         | 0           |
| 4Kx4          | 4K    | 4         | 0           |
| 2Kx9          | 2K    | 8         | 1           |
| 1Kx18         | 1K    | 16        | 2           |
| 512x36        | 512   | 32        | 4           |

- Independent port A and B configuration.
  - Support for data width conversion including parity bits (same memory array!)
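
Every configuration addresses the same memory array, which a few lines of arithmetic can confirm. The sketch below is only a consistency check of the table, with helper names made up for illustration; the last line assumes a 9-bit port A and a 36-bit port B, as the RAMB16_S9_S36 primitive name suggests.

```python
# (depth, data bits, parity bits) per port, copied from the table
CONFIGS = {
    "16Kx1": (16 * 1024, 1, 0),
    "8Kx2": (8 * 1024, 2, 0),
    "4Kx4": (4 * 1024, 4, 0),
    "2Kx9": (2 * 1024, 8, 1),
    "1Kx18": (1024, 16, 2),
    "512x36": (512, 32, 4),
}


def describe(name):
    """Address width and total bit counts for one port configuration."""
    depth, data, parity = CONFIGS[name]
    return {
        "address_bits": depth.bit_length() - 1,  # depth is a power of two
        "data_bits_total": depth * data,
        "parity_bits_total": depth * parity,
    }


for name in CONFIGS:
    print(name, describe(name))

# Width conversion: one 36-bit word on port B covers four 9-bit words on port A
words_a_per_word_b = (32 + 4) // (8 + 1)
print(words_a_per_word_b)
```

All six configurations give 16 Kbit of data; the 9, 18, and 36 bit widths add 2 Kbit of parity, which makes up the full 18 Kbit.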






#### How to use Block RAM: Just Instantiate template







```
RAMB16_S9_S36 inmem
    (// port A
     // port B
```




#### Distributed RAM

- Virtex-II LUT can implement
  - 16 x 1-bit synchronous RAM
  - Synchronous write
  - Asynchronous read
    - D flip-flop in the same slice can register the output
- Allows fast embedded RAM of any width
  - Only limited by the number of slices in each device
  - Example: RAM 16 x 48-bit fits in 48 LUTs
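
The LUT cost of a distributed RAM can be estimated from its shape. This is a simplified model, assuming one LUT per stored bit column of depth 16 and one full copy of the storage per address port; it ignores the extra multiplexing needed when the depth exceeds 16. The function name is made up for illustration.

```python
import math

LUT_DEPTH = 16  # one Virtex-II LUT implements a 16 x 1-bit RAM


def luts_needed(depth, width, addresses=1):
    """Rough LUT count for a depth x width distributed RAM.

    Simplified estimate: storage LUTs only, duplicated per address port.
    """
    return math.ceil(depth / LUT_DEPTH) * width * addresses


print(luts_needed(16, 48))               # the 16 x 48-bit example
print(luts_needed(16, 8))                # 16x8, one address
print(luts_needed(16, 8, addresses=2))   # 16x8, two addresses
```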




#### How to use distributed RAM

Distributed RAM, 8 LUTs, one address, 16x8:

```
logic [7:0] mem0[0:15];

// Synchronous write
always_ff @(posedge clk)
  if (wr) begin
    mem0[addr] <= d_i;
  end

// Asynchronous read; the output is zero when rd is low
assign d_o = (rd) ? mem0[addr] : 8'h0;
```

Distributed RAM, 16 LUTs, two addresses, 16x8




#### 18 x 18 Multiplier

- Embedded 18-bit x 18-bit multipliers
  - 2's complement signed operation
- Multipliers are organized in columns
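
A software model of the signed multiply helps when checking accelerator results against a C or Python reference. The sketch below assumes a 36-bit product, which is the full width of an 18 x 18 two's complement multiplication; the function names are made up for illustration.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mult18x18(a, b):
    """Signed 18 x 18 multiply, returned as a 36-bit two's complement pattern."""
    product = to_signed(a, 18) * to_signed(b, 18)
    return product & ((1 << 36) - 1)


# 0x3FFFF is -1 as an 18-bit two's complement value
print(to_signed(mult18x18(0x3FFFF, 2), 36))
# The most negative input squared is the largest possible product, 2**34
print(mult18x18(-131072, -131072) == 2 ** 34)
```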




#### Counter

```
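module dec(
    input clk, rst,
    output reg u);

    reg [3:0] q;

    // Decade counter: q counts 0..9, with asynchronous reset
    always_ff @(posedge clk or posedge rst)
      if (rst)
        q <= 4'h0;
      else if (q == 9)
        q <= 4'h0;
      else
        q <= q+1;

    // Registered output: u is high for one cycle after q has been 9
    always_ff @(posedge clk)
      if (q == 9)
        u <= 1'b1;
      else
        u <= 1'b0;
endmodule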
```
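
The timing of `u` is easy to misread, so a cycle-by-cycle model can help before looking at waveforms. This is a behavioral sketch only: it treats reset as synchronous and assumes `q` and `u` both start at 0, whereas in the hardware `u` has no reset. The `step` function is made up for illustration.

```python
def step(q, u, rst=False):
    """State after one rising clock edge; both registers sample the old q."""
    next_q = 0 if rst or q == 9 else q + 1
    next_u = 1 if q == 9 else 0
    return next_q, next_u


q, u = 0, 0  # assumed state right after reset
trace = []
for _ in range(30):
    q, u = step(q, u)
    trace.append((q, u))

# Clock edges (counted from 1) after which u is high
pulses = [edge for edge, (_, u_val) in enumerate(trace, start=1) if u_val]
print(pulses)
```

Because `u` is registered, the pulse appears in the cycle where `q` has already wrapped to 0, not in the cycle where `q` is 9.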


#### Synthesized counter, logic description






