# DMA

- Lab 3
  - DMA
  - The task
  - (Bursts)

MMU, soft CPUs


Guest lecture: Image Processing on FPGAs. Johan Pettersson, Sick IVP. Time: Friday 5/12, 15:15-16:00. Place: Nollstället.

# Packed array

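Index order: left to right, starting with the dimensions to the right of the name.

`logic [11:0] tm1[0:7][0:7];` is an unpacked 8x8 array of 12-bit words.

`logic [0:7][0:7][11:0] tm2;` is fully packed: one contiguous vector of 8·8·12 = 768 bits.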



# Comment: Array slicing

The size of the part select or slice must be constant, but the position can be variable.

```
logic [31:0] b;
logic [7:0]  a1, a2;

a1 = b[x -: 8];  // OK, fixed width: bits x down to x-7
a2 = b[y +: 8];  // OK, fixed width: bits y up to y+7
d  = b[x:y];     // not OK, the width depends on x and y
```
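For readers more at home in C, the two legal forms correspond to a shift by a variable amount followed by a constant-width mask. This is only a sketch; the helper names are made up for illustration.

```c
#include <stdint.h>

/* b[x -: 8]: the 8 bits from x down to x-7 (requires 7 <= x <= 31). */
static uint8_t slice_down8(uint32_t b, unsigned x) {
    return (uint8_t)((b >> (x - 7)) & 0xFFu);
}

/* b[y +: 8]: the 8 bits from y up to y+7 (requires y <= 24). */
static uint8_t slice_up8(uint32_t b, unsigned y) {
    return (uint8_t)((b >> y) & 0xFFu);
}
```

In hardware terms the variable position becomes a shifter (or a multiplexer), while the constant width fixes the number of output wires.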

## Lab 3 - DMA





# Address generation



- We want to transfer block by block (8x8)
- Address generator must know format (width,height) of image
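A minimal sketch of the address calculation, assuming 8-bit pixels stored row-major starting at `base` and a width that is a multiple of 8. The function and parameter names are illustrative, not taken from the lab skeleton.

```c
#include <stdint.h>

/* Byte address of pixel (row, col) inside 8x8 block number blk.
   Blocks are numbered left to right, top to bottom. */
static uint32_t block_pixel_addr(uint32_t base, uint32_t width,
                                 uint32_t blk, uint32_t row, uint32_t col) {
    uint32_t blocks_per_row = width / 8;
    uint32_t bx = blk % blocks_per_row;   /* block column */
    uint32_t by = blk / blocks_per_row;   /* block row    */
    return base + (by * 8 + row) * width + bx * 8 + col;
}
```

Stepping from one line of a block to the next adds `width`, not 8, which is why the generator needs the image format.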

testbild.raw

# State diagram



The DMA accelerator has to release the bus regularly so that other components can access it. Do it for every line you read. When we finish the first block, we start the DCT accelerator.



Same as WAITREADY except that we go to the IDLE state when done.

The DMA module is fetching an 8x8 block. Once the block is fetched we go to the WAITREADY state and start the DCT transform.

In this state we wait until the program tells us that it has read the result of the transform by writing to the control register.
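The state descriptions above can be read as the following next-state function. This is a sketch of one possible reading, not the lab solution: only `IDLE` and `WAITREADY` are named in the text, so `GETBLOCK`, `RELEASEBUS`, `WAITREADY_LAST` and all input names are hypothetical.

```c
#include <stdbool.h>

/* Only IDLE and WAITREADY come from the text; the other names are hypothetical. */
typedef enum { IDLE, GETBLOCK, RELEASEBUS, WAITREADY, WAITREADY_LAST } dma_state;

typedef struct {
    bool start;        /* program started the DMA via the control register */
    bool line_done;    /* one line of the current block has been read      */
    bool block_done;   /* the whole 8x8 block has been read                */
    bool last_block;   /* the block just fetched is the last in the image  */
    bool result_read;  /* program wrote the control register: result read  */
} dma_inputs;

static dma_state dma_next(dma_state s, dma_inputs in) {
    switch (s) {
    case IDLE:
        return in.start ? GETBLOCK : IDLE;
    case GETBLOCK:
        if (in.block_done) return in.last_block ? WAITREADY_LAST : WAITREADY;
        if (in.line_done)  return RELEASEBUS;   /* give other masters a chance */
        return GETBLOCK;
    case RELEASEBUS:
        return GETBLOCK;
    case WAITREADY:
        return in.result_read ? GETBLOCK : WAITREADY;
    case WAITREADY_LAST:
        return in.result_read ? IDLE : WAITREADY_LAST;
    }
    return IDLE;
}
```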

#### A measurement: make sim\_jpeg





#### A closer look at the DMA



Release bus for m0, m1, m2 ⇒ If CPU is waiting it will get the bus

[Waveform: Wishbone master signals adr, dat\_o, dat\_i, stb, cyc, we, ack and sel during the DMA read. adr advances one 32-bit word at a time (00000000, 00000004, ... 0000001c) while dat\_i returns 01020304, ... 090a0b0c, ...; sel stays at 1111.]

## DCT => Mem (Software)

[Waveform (ModelSim wave window): clk and the slave 0 signals adr, dat, stb and s0\_ack, together with PARPORT, while software copies the DCT result to memory. adr values such as 4000319c and 96000804 are visible. The window shows 718253866 ps to 719251213 ps.]

## A hint



How long do these blocks take?

#### **Burst Read**







# burst - cycle types

| Signal                      | Value   | Description                  |
|-----------------------------|---------|------------------------------|
| cti (cycle type identifier) | 000     | Classic cycle                |
|                             | 001     | Constant address burst cycle |
|                             | 010     | Incrementing burst cycle     |
|                             | 011-110 | Reserved                     |
|                             | 111     | End of burst                 |
| bte (burst type extension)  | 00      | Linear burst                 |
|                             | 01      | 4-beat wrap burst            |
|                             | 10      | 8-beat wrap burst            |
|                             | 11      | 16-beat wrap burst           |
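As a sketch of what the bte values mean for a slave during an incrementing burst, the function below computes the address of the next beat. It assumes a 32-bit data bus with byte addresses, so one beat is 4 bytes; the enum and function names are illustrative.

```c
#include <stdint.h>

enum cti { CTI_CLASSIC = 0, CTI_CONST_ADDR = 1, CTI_INCR = 2, CTI_END = 7 };
enum bte { BTE_LINEAR = 0, BTE_WRAP4 = 1, BTE_WRAP8 = 2, BTE_WRAP16 = 3 };

/* Next byte address in an incrementing burst of 32-bit words. */
static uint32_t next_burst_addr(uint32_t adr, enum bte bte) {
    uint32_t next = adr + 4;
    if (bte == BTE_LINEAR)
        return next;
    /* A wrap burst stays inside an aligned window of 4, 8 or 16 words. */
    uint32_t span = 16u << ((unsigned)bte - 1u);   /* 16, 32 or 64 bytes */
    return (adr & ~(span - 1u)) | (next & (span - 1u));
}
```

With a 4-beat wrap burst starting at 0x108 the address sequence is 0x108, 0x10c, 0x100, 0x104, which is the order a cache line fill wants when the missing word is fetched first.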

# Changes in the slave







# Do you really wanna burst?



# Why not write? (acc->mem)



```
/* This is the main encoding loop */
void encode_image(void)
{
   int i;
   int MCU_count = width*height/DCTSIZE2;
   short MCU_block[DCTSIZE2];

   for (i = 0; i < MCU_count; i++) {
      forward_DCT(MCU_block);
      encode_mcu_huff(MCU_block);
   }
}
```

- 1) I/O is at 0x90, 0x91, ..., 0x99; other addresses go to the PKMC
- 2) Noncacheable data memory has addresses >= 0x8000\_0000. SDRAM is at 0x0, SRAM at 0x2000\_0000 or 0xc000\_0000
- 2) MCU\_block must be in a noncacheable area
- 3) Skip MCU\_block, let encode\_mcu\_huff read from the acc
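The first two points can be written as address predicates. This sketch assumes that "0x90 ... 0x99" refers to the top byte of a 32-bit address; the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: I/O is selected by the top address byte, 0x90 to 0x99. */
static bool is_io(uint32_t addr) {
    uint32_t top = addr >> 24;
    return top >= 0x90u && top <= 0x99u;
}

/* Data accesses at or above 0x8000_0000 bypass the data cache. */
static bool is_noncacheable(uint32_t addr) {
    return addr >= 0x80000000u;
}
```

A buffer that the accelerator writes behind the CPU's back should satisfy `is_noncacheable`, otherwise the CPU may read stale data from its cache.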

#### Ethernet controller



FIFOs 2 x 16\*32

## Ethernet controller

Figure: nested packet formats, each one carried in the data field of the one before.

Ethernet packet: sender MAC address, receiver MAC address, number of bytes, data.

IP packet: V, IHL, ToS, L, ID, FL, fo, ttl, Prot, CHs, sender IP address, receiver IP address, data.

TCP packet: sender port number, receiver port number, S#, Ack#, Fl, CHs, data.

- Transmits and receives Ethernet frames
- 10 Mbit/s and 100 Mbit/s
- Half duplex and full duplex

- Wishbone I/F "similar" to our JPEG acc
- Up to 128 buffer descriptors (rx and tx)

Buffer descriptor (BD) layout:

| Word   | Contents                 |
|--------|--------------------------|
| first  | length, control & status |
| second | address to buffer in mem |

- Tx: automatically reads (DMA) and transmits length bytes; adds length to address
- Proceeds to next BD
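The transmit side can be modelled in a few lines of C. This is a sketch under assumptions, not the controller's register map: the field widths, the `BD_READY` bit and the function name are hypothetical, and the actual DMA read is reduced to counting bytes.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t length;   /* number of bytes in the buffer   */
    uint16_t ctrl;     /* control & status bits           */
    uint32_t addr;     /* address of the buffer in memory */
} buf_desc;

#define BD_READY 0x8000u   /* hypothetical "ready to send" bit */

/* Walks the BD ring from index start and "transmits" every ready
   descriptor; returns the total number of bytes sent. */
static size_t tx_service(buf_desc *bd, size_t n_bd, size_t start) {
    size_t sent = 0;
    size_t i = start;
    while (n_bd > 0 && (bd[i].ctrl & BD_READY)) {
        sent += bd[i].length;               /* DMA read and transmit       */
        bd[i].ctrl &= (uint16_t)~BD_READY;  /* hand the BD back to software */
        i = (i + 1) % n_bd;                 /* proceed to next BD          */
    }
    return sent;
}
```

Software only fills in descriptors and sets the ready bit; the data movement itself needs no CPU involvement.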

## or1200 Memory Management



MMU needed for

- 1) Address translation virtual -> physical
- 2) Memory protection(OS protected from user processes, ...)
- 3) "each process runs in its own memory"

## Memory Management Unit

- Harvard model with split instruction and data MMU
- Instruction/data TLB (translation lookaside buffer)
   size scalable from 16 to 256 entries
- TLB organized as a direct-mapped cache
- Page size 8KB with per-page attributes
  - LS 13 address lines left untouched
  - MS (32-13) = 19 address lines translated
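The 13/19 split can be illustrated with a few helpers. This is a minimal sketch; the names are invented for illustration.

```c
#include <stdint.h>

#define PAGE_BITS 13u                        /* 8 KB pages */
#define PAGE_MASK ((1u << PAGE_BITS) - 1u)   /* 0x1FFF     */

/* Upper 19 bits: the virtual page number that gets translated. */
static uint32_t vpn(uint32_t va) { return va >> PAGE_BITS; }

/* Lower 13 bits: the offset inside the page, passed through untouched. */
static uint32_t page_offset(uint32_t va) { return va & PAGE_MASK; }

/* Physical address from a translated page number and the original offset. */
static uint32_t make_pa(uint32_t ppn, uint32_t va) {
    return (ppn << PAGE_BITS) | page_offset(va);
}
```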



# A sketchy explanation

- The OS administers a list of page translations for each process.
- These are kept in memory, in page tables.
- The translations are automatically loaded into the TLB when the process executes.







### I/D-TLB = Translation Lookaside buffer

Implemented as a direct mapped cache, 64 entries
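A direct-mapped lookup with 64 entries can be sketched as follows, assuming the 8 KB page size given for the or1200 MMU. The entry layout and names are illustrative and do not describe the or1200 registers.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64u
#define PAGE_BITS   13u

typedef struct { bool valid; uint32_t tag; uint32_t ppn; } tlb_entry;

/* Returns true on a hit and writes the physical address to *pa. */
static bool tlb_lookup(const tlb_entry tlb[TLB_ENTRIES], uint32_t va, uint32_t *pa) {
    uint32_t vpn   = va >> PAGE_BITS;
    uint32_t index = vpn % TLB_ENTRIES;   /* low 6 bits of the page number pick the entry */
    uint32_t tag   = vpn / TLB_ENTRIES;   /* the remaining 13 bits must match             */
    const tlb_entry *e = &tlb[index];
    if (!e->valid || e->tag != tag)
        return false;                     /* TLB miss */
    *pa = (e->ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1u));
    return true;
}
```

Because each page number maps to exactly one entry, two pages whose numbers differ by a multiple of 64 evict each other.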



## Does the MMU need an extra pipeline stage?



NO, the cache is physical and works in parallel with the MMU!

# What about DMA and MMU?



- µCLinux does not use the MMUs
- Real Linux must use MMUs
- System call to find the physical address for a virtual address
- DMA ctrl must handle scatter/gather
- DMA ctrl typically executes a linked list of commands
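A buffer that is contiguous in virtual memory may be spread over non-adjacent physical pages, so the controller is given one command per physically contiguous piece. The sketch below models such a command list in software; the struct layout and names are hypothetical, and `memcpy` stands in for the bus transfers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct dma_cmd {
    const uint8_t *src;           /* physical source of this piece      */
    uint8_t *dst;                 /* physical destination of this piece */
    size_t len;                   /* bytes to move                      */
    const struct dma_cmd *next;   /* next command, NULL ends the list   */
} dma_cmd;

/* Executes the whole list; returns the number of bytes moved. */
static size_t dma_run(const dma_cmd *cmd) {
    size_t total = 0;
    for (; cmd != NULL; cmd = cmd->next) {
        memcpy(cmd->dst, cmd->src, cmd->len);
        total += cmd->len;
    }
    return total;
}
```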





## Other soft CPUs

|        | OpenRISC             | Leon           | Nios    | MicroBlaze  |
|--------|----------------------|----------------|---------|-------------|
| who    | OpenCores            | Gaisler        | Altera  | Xilinx      |
| what   | Verilog              | VHDL           | netlist | netlist     |
| CPU    | RISC                 | RISC           | RISC    | RISC        |
| stages | 5                    | 5              | 6/5/1   | 3           |
| cache  | Direct-mapped IC/DC  | IC/DC          | IC/DC   | IC/DC       |
| MMU    | Split IMMU/DMMU      |                |         |             |
| bus    | Wishbone simple/Xbar | AMBA (AHB/APB) |         | LMB/OPB/FSL |

### leon - open source processor

#### www.gaisler.com

- The full source code is available under the GNU LGPL license
- LEON2 is a synthesisable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture
- SPARC V8 compliant integer unit with 5-stage pipeline
- Hardware multiply, divide and MAC units
- Separate instruction and data cache (Harvard architecture)
- Set-associative caches: 1-4 sets, 1-64 kbytes/set. Random, LRR or LRU replacement
- Data cache snooping
- AMBA-2.0 AHB and APB on-chip buses
- 8/16/32-bit memory controller for external PROM and SRAM
- 32-bit PC133 SDRAM controller
- On-chip peripherals such as UARTs, timers, interrupt controller and 16-bit I/O port

### leon



# leon has virtual caches!



- (+) Address translation only at cache miss!
- (-) Cache flush needed at task switch



# A 4-way 8kb instruction cache



A replacement policy is needed like:

- LRU = least recently used
- LRR = least recently replaced
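The difference between the two policies is only which timestamp is compared. The sketch below picks a victim way in one 4-way set; the bookkeeping with explicit timestamps is chosen for clarity and is an assumption, since real hardware uses a few state bits per set instead.

```c
#define WAYS 4

typedef struct {
    unsigned last_used[WAYS];     /* time of the most recent hit in each way */
    unsigned last_filled[WAYS];   /* time each way was last replaced         */
} set_state;

/* Index of the smallest (oldest) timestamp. */
static int oldest(const unsigned t[WAYS]) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (t[w] < t[victim])
            victim = w;
    return victim;
}

static int victim_lru(const set_state *s) { return oldest(s->last_used); }
static int victim_lrr(const set_state *s) { return oldest(s->last_filled); }
```

LRR ignores hits entirely, so it behaves like a FIFO per set and is cheaper to implement than LRU.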



## MicroBlaze Processor

- Thirty-two 32-bit general purpose registers
- 32-bit instruction word with three operands and two addressing modes
- Separate 32-bit instruction and data buses that conform to IBM's OPB (On-chip Peripheral Bus) specification
- Separate 32-bit instruction and data buses with direct connection to on-chip block RAM through a LMB (Local Memory Bus)
- 32-bit address bus
- Single issue pipeline
- Instruction and data cache
- Hardware debug logic
- FSL (Fast Simplex Link) support
- Hardware multiplier (in Virtex-II and subsequent devices)



## MicroBlaze Pipeline

|               | cycle 1 | cycle 2 | cycle 3 | cycle 4 | cycle 5 |
|---------------|---------|---------|---------|---------|---------|
| instruction 1 | Fetch   | Decode  | Execute |         |         |
| instruction 2 |         | Fetch   | Decode  | Execute |         |
| instruction 3 |         |         | Fetch   | Decode  | Execute |

- Execute stage will dominate the pipeline
- + No data hazards
- Delay slot still needed



# Nios (Altera)

#### Table 1: Key features of the Nios II family members

|                   | Nios II /f (Fast) | Nios II /s (Standard) | Nios II /e (Economy) |
|-------------------|-------------------|-----------------------|----------------------|
| Pipeline          | 6 Stage           | 5 Stage               | None                 |
| Multiplier *      | 1 Cycle           | 3 Cycle               | None                 |
| Branch Prediction | Dynamic           | Static                | None                 |
| Instruction Cache | Configurable      | Configurable          | None                 |
| Data Cache        | Configurable      | None                  | None                 |

\*Uses digital signal processing (DSP) blocks in Stratix and Stratix II FPGAs

## **Custom Instructions**



## HW Accelerator



# Zynq - a programmable SOC

#### PS = processing system



PL = programmable logic

# PS = processing system



SCU keeps cachelines in the two L1(D) synchronised (if they refer to the same mem position)

# Snoop Control Unit

For each cacheline we keep track of:

- Modified : Unique & Dirty (only in this cache, has been changed)
- Owned : Shared & Dirty
- Exclusive : Unique & Clean
- Shared : Shared & Clean
- Invalid : Nothing here yet
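The five states and the two attributes they encode can be written down directly. The transitions shown are the textbook MOESI reactions to snooped traffic, given here as a sketch; the exact behaviour of the SCU may differ, and all names are illustrative.

```c
#include <stdbool.h>

typedef enum { MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID } moesi;

/* Dirty: memory is stale, this cache must eventually write the line back. */
static bool is_dirty(moesi s)  { return s == MODIFIED || s == OWNED; }

/* Unique: no other cache holds a copy. */
static bool is_unique(moesi s) { return s == MODIFIED || s == EXCLUSIVE; }

/* Our copy after another cache is seen reading the same line. */
static moesi after_snooped_read(moesi s) {
    switch (s) {
    case MODIFIED:  return OWNED;    /* still dirty, now shared */
    case EXCLUSIVE: return SHARED;   /* still clean, now shared */
    default:        return s;
    }
}

/* Our copy after another cache is seen writing the same line. */
static moesi after_snooped_write(moesi s) {
    (void)s;
    return INVALID;
}
```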



The SCU listens to the bus, has a copy of the tag RAMs, ...

# PL = programmable logic

|                                                  | Z-7010           | Z-7015           | Z-7020           | Z-7030                            | Z-7045                            | Z-7100             |
|--------------------------------------------------|------------------|------------------|------------------|-----------------------------------|-----------------------------------|--------------------|
| Processor                                        | Dual core ARM Cortex-A9 with NEON and FPU extensions (all devices) |  |  |  |  |  |
| Max. processor clock frequency                   | 866MHz           | 866MHz           | 866MHz           | 1GHz                              | 1GHz                              | 1GHz               |
| Programmable Logic                               | Artix-7          | Artix-7          | Artix-7          | Kintex-7                          | Kintex-7                          | Kintex-7           |
| No. of FlipFlops                                 | 35,200           | 96,400           | 106,400          | 157,200                           | 437,200                           | 554,800            |
| No. of 6-input LUTs                              | 17,600           | 46,200           | 53,200           | 78,600                            | 218,600                           | 277,400            |
| No. of 36Kb Block RAMs                           | 60               | 95               | 140              | 265                               | 545                               | 755                |
| No. of DSP48 slices (18x25 bit)                  | 80               | 160              | 220              | 400                               | 900                               | 2020               |
| No. of SelectIO Input/Output Blocks <sup>a</sup> | HR: 100, HP: 0   | HR: 150, HP: 0   | HR: 200, HP: 0   | HR: 100, HP: 150                  | HR: 212, HP: 150                  | HR: 250, HP: 150   |
| No. of PCI Express Blocks                        | -                | 4                | -                | 4                                 | 8                                 | 8                  |
| No. of serial transceivers                       | -                | 4                | -                | 4                                 | 8 or 16 <sup>b</sup>              | 16                 |
| Serial transceivers maximum rate                 | -                | 6.25Gbps         | -                | 6.6Gbps/12.5Gbps <sup>c</sup>     | 6.6Gbps/12.5Gbps <sup>b</sup>     | 10.3Gbps           |

# Performance Soft vs Hard







## ACP = accelerator coherency port

