Multicore Reconfiguration Platform

Multicore Reconfiguration Platform —
A Research and Evaluation FPGA
Framework for Runtime Reconfigurable
Systems
Dipl.-Inf. Dominik Meyer
18. März 2015
Multicore Reconfiguration Platform —
A Research and Evaluation FPGA Framework
for Runtime Reconfigurable Systems
Dissertation approved by the Faculty of Electrical Engineering (Fakultät Elektrotechnik)
of the Helmut-Schmidt-Universität/Universität der Bundeswehr Hamburg
for the award of the academic degree of Doktor-Ingenieur

submitted by
Diplom-Informatiker Dominik Meyer
from Rendsburg

Hamburg 2015
Gutachter (reviewers): Prof. Dr. Bernd Klauer, Prof. Dr. Udo Zölzer
Vorsitzender der Prüfungskommission (chair of the examination committee): Prof. Dr. Gerd Scholl
Tag der mündlichen Prüfung (date of the oral examination): 16.03.2015
Printed with the kind support of the HSU - Universität der Bundeswehr Hamburg.
Curriculum Vitae
Personal information
Surname / First name: Meyer, Dominik
Email: dmeyer@hsu-hh.de
Nationality: German
Date of birth: June 17, 1976

Education
1993 - 1997: Abitur, Helene Lange Gymnasium Rendsburg, Germany
1998 - 2008: Diplom in Computer Science, Christian-Albrechts-Universität zu Kiel

Work experience
2000 - 2003: Technical advisor/manager, PcW KG. Buildup and management of the server infrastructure of an internet service provider and webhoster.
2003 - 2009: Technical manager, die Netzwerkstatt. Buildup and management of the server infrastructure of a webhoster. Development of firewall solutions.
2009 - now: Research assistant, Computer Engineering / Helmut Schmidt University Hamburg. Research in runtime reconfigurable systems.
Publications
[1] Dominik Meyer. Runtime reconfigurable processors. Presentation at the Chaos Communication Camp, 2011.
[2] Dominik Meyer. Introduction to processor design. Presentation at the 30th Chaos Communication Congress, 2013.
[3] Dominik Meyer and Bernd Klauer. Multicore reconfiguration platform: an alternative to RampSoC. SIGARCH Comput. Archit. News, 39(4):102–103, December 2011.
Acknowledgments
This thesis is the result of my work at the Institute of Computer Engineering at the
Helmut Schmidt University/ University of the Federal Armed Forces Hamburg.
I want to thank Prof. Dr. Bernd Klauer, my chair, for his support and the opportunity
to work on this thesis. I also want to thank the remaining members of my dissertation
committee, Prof. Dr. Scholl and Prof. Dr. Zölzer.
The discussions of my research results with my current and former colleagues at the
Helmut Schmidt University helped a lot. Therefore, I want to thank Marcel Eckert,
Rene Schmitt, Klaus Hildebrandt, Christian Richter and Jan Haase.
Finally, I want to thank my girlfriend, Sarah Zingelmann, for her understanding and
support during the last years.
Acronyms

AES: Advanced Encryption Standard
ALU: Arithmetical Logical Unit
AMBA: Advanced Microcontroller Bus Architecture
API: Application Programming Interface
BRAM: Block RAM
CAN: Controller Area Network
CDC: Clock Domain Crossing
CEB: Configurable Entity Block
CLB: Configurable Logic Block
CMT: Clock Management Tile
CPLD: Complex Programmable Logic Device
CPU: Central Processing Unit
CSMA/CD: Carrier Sense Multiple Access / Collision Detection
CSN: Circuit Switched Network
DDR: Double Data Rate
DIP: Dual Inline Package
DNF: Disjunctive Normal Form
DSP: Digital Signal Processor
FF: FlipFlop
FFT: Fast Fourier Transformation
FIFO: First In First Out
FPGA: Field Programmable Gate Array
FSM: Finite State Machine
GPIO: General Purpose Input Output
GPU: Graphical Processing Unit
HDL: Hardware Description Language
HSTL: High-Speed Transceiver Logic
HTTP: Hypertext Transfer Protocol
I2C: Inter-Integrated Circuit
IC: Integrated Circuit
ICAP: Internal Configuration Access Port
ILP: Instruction Level Parallelism
IOB: Input/Output Block
IP: Intellectual Property
ISA: Instruction Set Architecture
ISO: International Organization for Standardization
ITU: International Telecommunication Union
LAN: Local Area Network
LED: Light Emitting Diode
LUT: LookUpTable
LVDS: Low-Voltage Differential Signaling
LVTTL: Low-Voltage Transistor Transistor Logic
MAC: Media Access Control
MPSoC: Multi-Processor System-on-Chip
MPU: Multiplier Unit
MRP: Multicore Reconfiguration Platform
NOC: Network On Chip
OCSN: On Chip Switching Network
OS: Operating System
OSI: Open Systems Interconnection Model
PAL: Programmable Array Logic
PCI: Peripheral Component Interconnect
PCIe: Peripheral Component Interconnect Express
PE: Processing Element
PLA: Programmable Logic Array
POP3: Post Office Protocol Version 3
PR: Partial Reconfiguration
PRHS: Partial Reconfiguration Heterogeneous System
RAM: Random Access Memory
RampSoC: Runtime adaptive multiprocessor system-on-chip
RC: Reconfigurable Computing
RM: Reconfigurable Module
RO: Ring Oscillator
RS: Reconfigurable System
RTL: Register Transfer Level
SATA: Serial Advanced Technology Attachment
SCI: Scalable Coherent Interface
SoC: System on Chip
SPI: Serial Peripheral Interface
SRAM: Static Random Access Memory
TCP: Transmission Control Protocol
UART: Universal asynchronous receiver/transmitter
UDP: User Datagram Protocol
USB: Universal Serial Bus
VA: Virtual Architecture
VHDL: Very High Speed Integrated Circuits HDL
VR: Virtual Region
WAN: Wide Area Network
XDL: Xilinx Description Language
XML: Extensible Markup Language
List of Figures

1.1 History of the IC processing size[1] . . . 1
1.2 partitioning of an FPGA for the Xilinx PR design flow[2] . . . 3
2.1 and/or Matrix . . . 9
2.2 Halfadder implemented in an and/or Matrix . . . 10
2.3 4 to 1 Multiplexer . . . 11
2.4 Cascaded 4 to 1 Multiplexer . . . 12
2.5 Simple structure of an FPGA without interconnects . . . 13
2.6 Structure of two Virtex5 CLBs[3] . . . 14
2.7 simple PR example[2] . . . 16
3.1 example RAMPSoC Configuration[4] . . . 17
3.2 PRHS System Overview[5] . . . 19
3.3 Overview of the Convey HC1 architecture[6] . . . 21
3.4 Structure of an Intel Stellarton Processor, combined with an Altera FPGA . . . 22
3.5 Structure of the Xilinx Zynq architecture[7] . . . 23
3.6 COPACOBANA and RIVYERA interconnection overview . . . 24
4.1 Example mobile phone SystemOnChip (SoC) . . . 25
4.2 graphical representation of the ISO/OSI Model . . . 27
4.3 direct and indirect interconnection networks . . . 29
5.1 Example Ring network with eight nodes . . . 39
5.2 Example bus with 4 nodes . . . 40
5.3 Example grid networks with 16 nodes . . . 43
5.4 Example tree networks . . . 45
5.5 Example 4×4 crossbar networks . . . 46
6.1 Example granularity problem . . . 48
6.2 Example grouping solution configuration . . . 49
6.3 Example granularity solution configuration . . . 51
6.4 Area requirements of the different usage patterns . . . 52
7.1 Example MRP System Overview . . . 53
7.2 OCSN frame description . . . 56
7.3 OCSN network structure overview . . . 56
7.4 OCSN address structure . . . 57
7.5 Example support platform . . . 59
7.6 Example reconfiguration platform . . . 62
7.7 CEB Signal Interface . . . 63
7.8 CSN group . . . 65
7.9 full MRP design flow . . . 68
7.10 reduced MRP design flow . . . 69
8.1 Clock Domain Crossing (CDC) component interface . . . 71
8.2 Dual Port Block RAM interface . . . 72
8.3 SimpleFiFo interface . . . 73
8.4 Reception of one OCSN Frame . . . 73
8.5 OCSN physical transmission component . . . 74
8.6 OCSN physical reception component . . . 74
8.7 Flowchart of OCSN identification protocol . . . 75
8.8 Flowchart of OCSN flow control protocol . . . 76
8.9 OCSN IF signal interface . . . 77
8.10 OCSN IF implementation schematic . . . 78
8.11 Graph of the OCSN IF FSM . . . 79
8.12 signal interface of an OCSN Switch . . . 80
8.13 signal interface of the addr compare component . . . 80
8.14 OCSN switch implementation schematic . . . 81
8.15 OCSN application component basic schematic . . . 84
8.16 OCSN Ethernet Bridge FSMs . . . 85
8.17 OCSN Ethernet Discovery Protocol . . . 86
8.18 Crossbar Interconnection Schema . . . 87
8.19 CSN Crossbar Switch Signal Interface . . . 88
8.20 CSN Crossbar Switch Implementation Schematic . . . 89
8.21 CSN2OCSN Bridge Signal Interface . . . 90
10.1 MRP Measurement Configuration for Setup 1 . . . 101
10.2 Floorplan of the reconfiguration platform . . . 103
10.3 Floorplan with interconnects of the reconfiguration platform . . . 105
10.4 MRP CPU Configuration . . . 106
List of Tables

1.1 Configuration speed and time for a Xilinx xc5vlx330 FPGA . . . 2
1.2 Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0,25MB Data . . . 3
2.1 Truth table of a Halfadder . . . 10
2.2 different Boolean functions implemented with a 4 to 1 multiplexer . . . 11
2.3 Example LUT implementing ∧, ∨ and ⊕ . . . 13
5.1 Classification of a bidirectional ring . . . 39
5.2 Classification of a bus . . . 42
5.3 Classification of an open grid (mesh) with 4 × 4 nodes . . . 43
5.4 Classification of a closed grid (illiac) with 4 × 4 nodes . . . 44
5.5 Classification of a tree . . . 44
5.6 Classification of a crossbar network with n nodes . . . 46
7.1 variable speed of the OCSN . . . 55
8.1 Address to register mapping . . . 91
10.1 Area usage of the MRP . . . 98
10.2 Maximum clock rates within each switch . . . 101
10.3 Propagation Delay Matrix for all CEBs in ns . . . 102
A.1 used OCSN frame types . . . 113
Contents

List of Figures . . . xiii
List of Tables . . . xv

1 Introduction . . . 1
1.1 Reconfigurable Hardware . . . 2
1.1.1 Runtime Reconfiguration . . . 2
1.2 Hybrid Hardware Approaches . . . 3
1.2.1 Datapath Accelerators . . . 4
1.2.2 Bus Accelerators . . . 4
1.2.3 Multicore Reconfiguration . . . 4
1.3 Thesis Objectives . . . 5
1.4 Thesis Structure . . . 7

2 Reconfiguration Fundamentals . . . 9
2.1 Matrix Approach . . . 9
2.2 Multiplexer Approach . . . 11
2.3 Look Up Table Approach . . . 12
2.4 Field Programmable Gate Arrays . . . 13
2.4.1 Input/Output Blocks . . . 14
2.4.2 Configurable Logic Blocks . . . 14
2.4.3 Block RAM . . . 15
2.4.4 Special IO Components . . . 15
2.4.5 Interconnection Network . . . 15
2.5 Partial Reconfiguration . . . 16

3 Example Reconfigurable Systems . . . 17
3.1 Research Systems . . . 17
3.1.1 RampSoC . . . 17
3.1.2 PRHS . . . 18
3.1.3 Dreams . . . 20
3.2 Commercial Systems . . . 20
3.2.1 Convey HC1 . . . 20
3.2.2 Intel Stellarton . . . 21
3.2.3 Xilinx Zynq Architecture . . . 22
3.3 COPACOBANA and RIVYERA . . . 24

4 Interconnection Networks . . . 25
4.1 Open Systems Interconnection Model . . . 26
4.1.1 Application Layer . . . 27
4.1.2 Presentation Layer . . . 27
4.1.3 Session Layer . . . 27
4.1.4 Transport Layer . . . 28
4.1.5 Network Layer . . . 28
4.1.6 Data Link Layer . . . 28
4.1.7 Physical Layer . . . 28
4.2 Topology . . . 29
4.2.1 Interconnection Type . . . 29
4.2.2 Grade and Regularity . . . 30
4.2.3 Diameter . . . 31
4.2.4 Bisection Width . . . 31
4.2.5 Symmetry . . . 32
4.2.6 Scalability . . . 32
4.3 Interface Structure . . . 32
4.3.1 Direct Networks . . . 33
4.3.2 Indirect Networks . . . 33
4.4 Operating Mode . . . 33
4.4.1 Synchronous Connection Establishment . . . 33
4.4.2 Synchronous Data Transmission . . . 33
4.4.3 Asynchronous Connection Establishment . . . 33
4.4.4 Asynchronous Data Transmission . . . 33
4.4.5 Mixed Mode . . . 34
4.5 Communication Flexibility . . . 34
4.5.1 Broadcast . . . 34
4.5.2 Unicast . . . 34
4.5.3 Multicast . . . 34
4.5.4 Mixed . . . 34
4.6 Control Strategy . . . 35
4.6.1 Centralised Control . . . 35
4.6.2 Decentralised Control . . . 35
4.7 Transfer Mode and Data Transport . . . 35
4.8 Conflict Resolution . . . 36

5 Example Network On Chip Architectures . . . 39
5.1 Ring . . . 39
5.2 Bus . . . 40
5.2.1 Bus-Arbitration . . . 41
5.2.2 Data Transmission Protocol . . . 41
5.2.3 Classification . . . 42
5.3 Grid . . . 42
5.4 Tree . . . 43
5.5 Crossbar . . . 44

6 Granularity Problem of Runtime Reconfigurable Design Flow . . . 47
6.1 Solutions . . . 49
6.1.1 Grouping Solution . . . 49
6.1.2 Granularity Solution . . . 50
6.2 Granularity Problem and Hybrid Hardware Description . . . 51

7 Multicore Reconfiguration Platform . . . 53
7.1 On Chip Switching Network . . . 54
7.1.1 Physical Layer . . . 55
7.1.2 Data-link Layer . . . 55
7.1.3 Network Layer . . . 55
7.1.4 Transport Layer . . . 57
7.1.5 Session Layer . . . 58
7.1.6 Presentation Layer . . . 58
7.1.7 Application Layer . . . 58
7.2 Support Platform . . . 58
7.2.1 GPIO . . . 59
7.2.2 BRAM . . . 60
7.2.3 DDR3 RAM . . . 60
7.2.4 UART Bridge . . . 60
7.2.5 Ethernet Bridge . . . 61
7.2.6 Soft-core SoC . . . 61
7.3 Reconfiguration Platform . . . 61
7.3.1 ICAP . . . 62
7.3.2 CEB . . . 62
7.3.3 CSN . . . 64
7.3.4 IOB . . . 66
7.4 Operating System Support . . . 67
7.5 Design Flow . . . 68

8 Implementation of the Multicore Reconfiguration Platform . . . 71
8.1 General Components . . . 71
8.1.1 Clock Domain Crossing . . . 71
8.1.2 Dual Port Block RAM . . . 72
8.1.3 FiFo Queue Component . . . 72
8.2 OCSN . . . 73
8.2.1 OCSN Physical Interface Components . . . 73
8.2.2 OCSN Data-Link Interface Component . . . 75
8.2.3 OCSN Network Component . . . 80
8.2.4 OCSN Application Components . . . 82
8.3 CSN . . . 86
8.3.1 Physical Layer Implementation . . . 87
8.3.2 Network Layer Components . . . 87
8.3.3 Application Layer Components . . . 89

9 Operating System Support Implementation . . . 93
9.1 OCSN Network Driver . . . 94
9.2 OCSN Network Device Driver . . . 96

10 Evaluation . . . 97
10.1 Area Usage . . . 97
10.2 Maximum CSN Propagation Delay Measurement . . . 99
10.2.1 RO-Component . . . 99
10.2.2 ReRouter-Component . . . 100
10.2.3 Measuring Setup . . . 100
10.2.4 Measurement Results . . . 100
10.3 Example Microcontroller Implementation for MRP . . . 104

11 Conclusion . . . 109
11.1 Outlook . . . 110

Appendix . . . 113
A OCSN Frame Types . . . 113

Bibliography . . . 115
1 Introduction
Gordon E. Moore[8] stated in 1965, in the context of the growing Integrated Circuit (IC) market:
“The complexity for minimum component costs has increased at a rate of roughly a factor
of two per year.” The main conclusion of his paper is that the density of transistors on
an IC periodically doubles. This prediction still holds after 48 years, according to Intel
employees Mark T. Bohr, Robert S. Chau, Tahir Ghani and Kaizad Mistry[9].
ICs, such as general-purpose processors, are now produced in a 14nm technology
process. Figure 1.1 displays the history of processing sizes for ICs of the last decades.
With every doubling of the transistor density, more logic components can be placed
onto one IC . Processor designers are using this newly available space to add more and
more Central Processing Unit (CPU) and Graphical Processing Unit (GPU) cores to
processors. For example, the OpenSPARC T2 processor[10] has 8 CPU cores, and the
NVIDIA Fermi device[11] even has 512 GPU cores. This development is expected to
continue for a while, equipping general-purpose processors with more parallel computing
power. Systems on Chip (SoCs) are another product of the available space on ICs. They
feature single and multicore processors combined with a GPU and additional accelerator
hardware. This accelerator hardware improves the computing power with Digital Signal
Processors (DSPs) or other mathematical functions implemented in hardware.

Figure 1.1: History of the IC processing size[1] (process size in nm over the years 1970 to 2015)

File Size (MB) | Interface | Bit-width | Clk (MHz) | Speed (Mb/s) | Time (ms)
9,6 | SelectMap | 8 | 50 | 400 | 192
9,6 | SelectMap | 16 | 50 | 800 | 96
9,6 | SelectMap | 32 | 50 | 1600 | 48

Table 1.1: Configuration speed and time for a Xilinx xc5vlx330 FPGA
Beyond exploiting the available space with more and more static hardware, it can also
be used for adding reconfigurable hardware.
1.1 Reconfigurable Hardware
Reconfigurable hardware has the ability to change its function after chip assembly and
allows the configuration of every digital circuit, such as Advanced Encryption Standard (AES) and Fast Fourier Transformation (FFT) accelerators, other DSP-like instructions, and even specialised CPU cores. The industry has already reacted to the importance of reconfigurable hardware and produces different types of standalone ICs with this feature. One example is the Field Programmable Gate Array (FPGA). It features a large reconfigurable hardware area, some accelerator components like Arithmetical Logical Units (ALUs) and Multiplier Units (MPUs), and distributed Random Access Memory (RAM). Chapter 2 gives a more detailed introduction to reconfigurable hardware and
commercially available ICs. From now on, we will use FPGA as a synonym for reconfigurable hardware.
One important limitation of FPGAs was that they had to be reconfigured completely,
even for small system changes. Every computation taking place in hardware had to be
stopped and a programming file, representing the changed functionality, was loaded into
the FPGA. Even if only half of the reconfigurable area was computing and the other half was without functionality, the whole area had to be replaced. This was and still is a very time-intensive task. It takes many milliseconds for the reconfiguration process to
complete, depending on the size of the file and the configuration channel. This process
erases the internal states of all configured hardware components. Table 1.1 presents the
calculated minimal configuration times for a Xilinx FPGA and a 9,6MB configuration
file using the fastest available configuration interface.
1.1.1 Runtime Reconfiguration
Because of the configuration time limitation and to enable replacing one part of a design
while other parts are still doing computations, hardware vendors introduced the concept
of runtime reconfiguration. Runtime reconfiguration is also often referenced as dynamic
reconfiguration or partial runtime reconfiguration. Such a runtime reconfigurable project
is developed by dividing the FPGA into some Reconfigurable Modules (RMs) during
the design phase. Figure 2.7 shows an example partitioning of an FPGA for use with
the Xilinx Partial Reconfiguration (PR) design flow[2]. This design flow targets partial
reconfiguration for Xilinx FPGAs. Two different sized RMs are available, each connected
to some special “static” control hardware.

Figure 1.2: partitioning of an FPGA for the Xilinx PR design flow[2]
This feature does not speed up the configuration process itself, but through the partitioning of the reconfigurable area the size of the individual configuration stream shrinks,
which reduces the time for the reconfiguration process of one RM . For example, if you
can reduce the size of the configuration stream for one RM to 0,25 MB, you achieve
the configuration times of Table 1.2. This is an enormous speed-up, but it can only be achieved if the design is apportionable and the RMs can be reconfigured individually
rather than all at once.
The partitioning of an FPGA can only be altered by a full replacement of the configured logic. More benefits of PR are summarized by Kao[12].

File Size (MB) | Interface | Bit-width | Clk (MHz) | Speed (Mb/s) | Time (ms)
0,25 | SelectMap | 8 | 50 | 400 | 5
0,25 | SelectMap | 16 | 50 | 800 | 2,5
0,25 | SelectMap | 32 | 50 | 1600 | 1,25

Table 1.2: Configuration speed and time for a Xilinx xc5vlx330 FPGA with 0,25MB Data
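For illustration only, and not part of the thesis tool flow, the configuration times in Tables 1.1 and 1.2 follow directly from the bitstream size and the throughput of the SelectMap interface; a minimal sketch:

```python
def config_time_ms(file_size_mb: float, bit_width: int, clk_mhz: float) -> float:
    """Minimal configuration time estimate: bitstream size divided by throughput.

    Assumes the SelectMap interface transfers bit_width bits per clock cycle,
    i.e. a throughput of bit_width * clk_mhz Mb/s, as in Tables 1.1 and 1.2.
    """
    size_mbit = file_size_mb * 8                  # configuration file size in Mbit
    speed_mbit_per_s = bit_width * clk_mhz        # interface throughput in Mb/s
    return size_mbit / speed_mbit_per_s * 1000.0  # configuration time in ms

# full bitstream (9.6 MB) vs. small partial bitstream (0.25 MB)
for size_mb in (9.6, 0.25):
    for width in (8, 16, 32):
        print(f"{size_mb} MB over {width} bit: {config_time_ms(size_mb, width, 50.0)} ms")
```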
1.2 Hybrid Hardware Approaches
Systems combining a general-purpose von Neumann[13] CPU with some kind of configurable or reconfigurable area are often called Hybrid Hardware Systems.
The industry has already produced some hybrid systems, such as the Xilinx Zynq
architecture[7], the Intel Atom processor E6X5C series[14] and the Convey HC1/HC2[6].
The first combines an ARM Cortex A9 processor core with a Xilinx FPGA on the same
chip, but not on the same die. The next combines an Intel Atom processor with an
Altera FPGA in the same manner. The last interconnects one Intel Xeon processor with
four Xilinx FPGAs through the Intel co-processor interface. Still missing are hybrid
hardware systems combined on a single die.
Extending a static processor core with some kind of reconfigurable hardware has already been the focus of research. The following classes of combining strategies have
already been evaluated.
1.2.1 Datapath Accelerators
Hallmannseder[15], Dales[16], Hauser et al. [17] and Razdan[18] added reconfiguration
directly into processor cores by adding reconfigurable accelerator units to the datapath
of the processor. These units are small and cannot be merged to form larger ones. They
improve the processor performance by exploiting Instruction Level Parallelism (ILP)
through additional computational datapath units, or by extending the Instruction Set
Architecture (ISA) with special instructions. Examples of these special instructions are
cryptographic accelerators for AES and mathematical accelerators for FFT. Datapath accelerators can improve the performance most if they are tightly integrated into
the processor core without long interconnects.
1.2.2 Bus Accelerators
Bus accelerators are small to medium-sized reconfigurable components and can be configured with specialised hardware to improve the runtime of a specific part of a program.
They are connected through a bus or a network to the processor. These accelerators
have to work independently on some part of the data because of the high bus/network latency. This can relieve the static core(s) of some portion of parallel computable data.
Because of the independent nature of these accelerators, they have an internal state and
sometimes a connection to the main memory of the system. Bus Accelerators are a very
simple form of extending the performance of processor cores because existing buses,
like Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB), can be
used, but more tightly coupled interconnects are also possible.
1.2.3 Multicore Reconfiguration
The Runtime adaptive multiprocessor system-on-chip (RampSoC) framework of Göhringer
et al.[4, 19] evaluates the multicore reconfiguration approach. With Multicore Reconfiguration, multiple processor cores can be configured at system runtime. The system can
adjust itself to the nature of the current problem to solve. Some kind of dynamic or runtime reconfiguration design flow implements RMs, each containing one processor core.
These processor cores are called softcores because they are not statically implemented. The size of the largest one defines the size of the smallest RM, if every processor core shall fit into every RM. An alternative is to define differently sized RMs for differently sized processor cores, but this reduces the number of usable processor cores of the same size.
1.3 Thesis Objectives
Most of the research about hybrid hardware systems focuses on one combining class
only, always uses a fixed number of statically sized cores or units, and targets only high-performance computing applications. This is also true for industrial products.
These restrictions limit the number of application scenarios for each architecture. To
deploy hybrid hardware in a general-purpose environment and to support many applications, the number and the size of the components have to be variable. Example
applications benefiting from hybrid hardware in general-purpose computing are: image processing applications, simulation of electromagnetic fields, solid state physics and
computer games. Image processing applications could use hybrid hardware to accelerate
certain filter and transformation algorithms by uploading accelerator units into the reconfigurable hardware. The simulation of electromagnetic fields and solid state physics
can accelerate their computations by offloading certain calculations to the reconfigurable
hardware. Both fields already use modern graphic cards to accelerate their computations on general-purpose hardware. Reconfigurable hardware would enable developers
to use more specialised hardware and increase the calculation power even more. Computer games also use modern graphic cards to accelerate physical calculations for their
simulated world. Hence, with reconfigurable hardware, each computer game could bring
its own hardware for doing such calculations. All this reconfigurable hardware can be implemented as an accelerator unit or as multiple streaming processor cores. Individualising hardware for each computer application can increase the processing power or reduce
the power consumption of the whole system. Often, applications in a general-purpose
environment are running concurrently, inducing the requirement of a variable number
and a variable size of reconfigurable modules. These all-purpose computing capabilities require more flexible design rules than systems supporting just one combination class.
Computer systems are divisible into single-purpose computers, multipurpose computers and general-purpose computers. Single-purpose computers are designed for a specific
calculation. In these systems, reconfiguration is used to update the system and to fix development mistakes. This is already very common. Multipurpose computers are specialised
for a group of computations, such as audio and video processing. A typical multipurpose computer is a DSP. In some DSPs reconfigurable accelerator units are available.
They enable developers to extend the functionality or integrate new algorithms. The
last computation class, the general-purpose computers, lacks support for reconfigurable
hardware at the moment. This situation shall be changed by this thesis.
As mentioned earlier, the FPGA has to be partitioned into multiple modules to support
runtime reconfiguration. This partitioning is fixed after the initial system design stage.
This early-stage floorplanning leads to the granularity problem of the runtime reconfigurable design flow because differently sized components shall be runtime reconfigurable with maximum flexibility and a good area usage ratio. During floorplanning, the largest component determines the size of one module. This module size and the size of the FPGA determine the number of available reconfigurable modules, which leads to a very inefficient design if components with very different sizes are used. This granularity problem, and the solution proposed in this thesis, are described in more detail in Chapter 6.
Deploying hybrid hardware into general-purpose computing leads to another problem.
At the moment it is relatively easy to write platform-independent programs by using
a higher level programming language like C. Languages like Java are ignored because
the programs are running in a runtime virtual machine, not on the bare hardware[20].
Virtual machines could be another target for hardware support in general-purpose computers. One advantage of current general-purpose CPUs is that all of them are based on the von Neumann architecture[13]. This simplifies the development of platform-independent code because a compiler can be written for all architectures, with the same base assumptions, only differing in the ISA. Writing platform-independent programs for
hybrid hardware is much more complicated because these programs consist of software
and hardware parts. The reconfigurable hardware in such a system is called configware.
While the software part can still be written in C and is based on the von Neumann architecture, the different FPGA and CPU vendors have not agreed upon an architecture
for the hardware part yet. It cannot be expected that all these companies decide for the
same reconfiguration approach for their hybrid hardware system. This complicates the
development of the configware because developers have to describe hardware for different
reconfiguration approaches.
Both problems, the granularity problem and the development of platform-independent code, are addressed in this thesis by implementing a multi-FPGA framework called the Multicore Reconfiguration Platform (MRP). This framework uses a new floorplanning technique for partitioning the FPGAs, and a Circuit Switched Network (CSN) for interconnecting all the RMs. This combination of floorplanning and interconnection network enables the framework to support a variable number of differently sized reconfigurable components, limited only by the FPGA size, in contrast to all other currently available systems. This is achieved by dividing larger components into multiple smaller components, which fit into the RMs, and interconnecting them through the CSN. The framework also simplifies the development of platform-independent software and configware because it can be synthesised for any FPGA. It abstracts from the underlying FPGA and provides the same Application Programming Interface (API) for every hybrid hardware developer.
The proposed floorplanning technique of the MRP and the CSN generate a medium-sized hardware overhead. Because of this overhead, the FPGA size is a limiting factor in the evaluation process. To overcome this restriction, the MRP supports a flexible and easily extensible packet-switched network, called the On Chip Switching Network (OCSN). It allows intra-FPGA communication for configuring the RMs and programming the CSN, and also inter-FPGA communication, to combine multiple FPGAs to form a
larger hybrid hardware system. This feature is also a novelty, like the solution to the
granularity problem and the platform independence of configware.
1.4 Thesis Structure
1.4 Thesis Structure
The thesis is organised in eleven chapters. The introduction in Chapter 1 briefly describes the frame and the objectives of the thesis. To understand hybrid hardware, the
principles of reconfigurable hardware, FPGAs, and runtime/dynamic reconfiguration
are introduced in Chapter 2 and some example Reconfigurable Systems (RSs), related
to the MRP, are presented in Chapter 3. The MRP uses two different kinds of Network On Chips (NOCs), the CSN and the OCSN . Chapter 4 introduces the principles
of NOCs. It describes the Open Systems Interconnection Model (OSI) and presents a
network classification based on work done by Schwederski et al. [21] and Feng[22]. Some
important interconnection networks are described and rated according to this classification in Chapter 5. After the introduction of all basic principles, Chapter 6 explains
the granularity problem of the runtime reconfigurable design flow, which occurs if FPGAs are divided into multiple RMs to support flexible PR designs, and describes possible solutions to the problem. The main work of the thesis, the MRP, is presented in Chapter 7.
It introduces the CSN , OCSN and the design of the RMs. Chapter 8 describes the
implementation of the MRP in more detail. Because the MRP is designed as a hybrid
system, it needs support from the Operating System (OS). The required device drivers
are described in Chapter 9. The verification, proving that the MRP is usable and allows the reconfiguration of multiple different sized computing elements, is presented in
Chapter 10. It evaluates the MRP according to area usage, maximum clock speed and
example implementations. The conclusion of the thesis and an outlook on future work are given in Chapter 11.
2 Reconfiguration Fundamentals
Reconfigurable hardware describes some kind of electronic circuit, whose Boolean function can be changed or reconfigured after production of the circuit. Such hardware
supports the creation of variable and specialised components the moment they are required. Different approaches exist to build basic elements of reconfigurable hardware.
These basic elements can be combined to form larger systems and are produced as ICs,
such as FPGAs, Programmable Logic Arrays (PLAs), Complex Programmable Logic Devices (CPLDs) and Programmable Array Logics (PALs). The most important difference
between these systems is their basic reconfigurable component. FPGAs are built out of LookUpTables (LUTs), while PLAs, PALs and CPLDs use arrays of and/or matrices to configure Boolean functions. Another approach to reconfigurable hardware uses
multiplexers. All the reconfigurable ICs can be used to build RSs or hybrid hardware
systems. These systems often combine a general-purpose processor with some reconfigurable hardware to improve the computational power of the processor. This approach is
called Reconfigurable Computing (RC). The following sections give a short introduction
to reconfigurable hardware. Compton et al.[23] provides a more detailed overview of
reconfigurable hardware and related software.
2.1 Matrix Approach
The basis for the matrix approach is the and/or matrix. Figure 2.1 shows an example matrix.

Figure 2.1: and/or Matrix

On the left side, the and matrix prepares the connection of input signals, the
negated input signals, a zero and a one signal to some and-gates. None of the vertical
signals are connected to the horizontal ones at the moment. The intersections of these
signals are connected to a programmable switch, such as an electronic fuse or a Static
Random Access Memory (SRAM) cell. An electronic fuse makes the matrix one-time programmable, while SRAM or other memory types make it multiple-times programmable. On the right side, the or-matrix prepares the connection of the
and-gates to some or-gates. The intersections of the signals are used the same way as
the intersections of the and-matrix. To configure a Boolean function of type f : B^n → B
into this and/or matrix, the function is required in Disjunctive Normal Form (DNF). A
DNF is the normalisation of a logical function, displayed as a disjunction of conjunctive
clauses. Every logical function, without quantifiers, can be converted to DNF [24].
a | b | S | Cout
0 | 0 | 0 | 0
0 | 1 | 1 | 0
1 | 0 | 1 | 0
1 | 1 | 0 | 1

Table 2.1: Truth table of a Halfadder
Figure 2.2: Halfadder implemented in an and/or Matrix
Figure 2.2 displays an example implementation of a HalfAdder with the truth table
given in Table 2.1. The formulas for S and Cout can be read out of the truth table:
S = (a ∧ ¬b) ∨ (¬a ∧ b),
Cout = a ∧ b
Both are in DNF and can be directly implemented into an and/or Matrix. The nodes
in Figure 2.2 represent connections at the intersection points of the signals.
Three forms of expressions exist for the matrix approach.
• The and and the or matrix are programmable.
• Only the and matrix is programmable, the or matrix has a fixed programming.
• Only the or matrix is programmable, the and matrix has a fixed programming.
Different ICs use different expressions of the matrix approach.
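As a purely illustrative software model (not taken from the thesis), the programming of an and/or matrix can be pictured as choosing which intersections are connected: each and-gate row selects a subset of the literal columns, and each or-gate output selects a subset of the and-gate rows. The half adder of Figure 2.2 then corresponds to the following choices:

```python
# Illustrative software model of a programmable and/or matrix (hypothetical, not thesis code).
# The literal columns are x and ¬x for every input; an and-gate row is programmed by the
# set of literal columns connected to it, an or-gate output by the set of and-gate rows.

def and_or_matrix(and_rows, or_cols, inputs):
    literals = []
    for value in inputs:                    # build the literal columns: x, ¬x for each input
        literals.extend([value, 1 - value])
    products = [all(literals[i] for i in row) for row in and_rows]    # and plane
    return [int(any(products[i] for i in col)) for col in or_cols]    # or plane

# Half adder of Figure 2.2: S = (a ∧ ¬b) ∨ (¬a ∧ b), Cout = a ∧ b
# literal column indices: 0 = a, 1 = ¬a, 2 = b, 3 = ¬b
AND_ROWS = [[0, 3], [1, 2], [0, 2]]         # a∧¬b, ¬a∧b, a∧b
OR_COLS = [[0, 1], [2]]                     # S = row0 ∨ row1, Cout = row2

for a in (0, 1):
    for b in (0, 1):
        s, cout = and_or_matrix(AND_ROWS, OR_COLS, (a, b))
        print(f"a={a} b={b} -> S={s} Cout={cout}")
```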
2.2 Multiplexer Approach
A multiplexer is a small digital selector device. It routes one of n input signals to its
output. The number of input signals depends on the number of selection signals. If x
selection signals are available, the multiplexer can process 2^x input signals. Figure 2.3 shows a 4 to 1 multiplexer with data inputs e0 . . . e3 and selection inputs s0 and s1.
Figure 2.3: 4 to 1 Multiplexer
Simple Boolean functions f : B × B → B can be built out of this multiplexer by using s0 and s1 as the input variables and assigning each of the data inputs the result of the function. Table 2.2 shows how to implement the logic functions ∧, ∨ and ⊕ with a multiplexer.

e0 | e1 | e2 | e3 | function
0 | 0 | 0 | 1 | f(s0, s1) = s0 ∧ s1
0 | 1 | 1 | 1 | f(s0, s1) = s0 ∨ s1
0 | 1 | 1 | 0 | f(s0, s1) = s0 ⊕ s1

Table 2.2: different Boolean functions implemented with a 4 to 1 multiplexer

To make this approach reconfigurable to different Boolean functions, FlipFlops
(FFs) can be connected to e0 , . . . , e3 . By saving new values into these FFs, different
functions can be configured. This pattern can be extended to implement functions of type f : B^n → B by cascading multiplexers. An example is given in Figure 2.4.

Figure 2.4: Cascaded 4 to 1 Multiplexer

There are
two additional input variables available: s2 and s3. However, this pattern does not scale
because for every two input variables the required number of multiplexers quadruples.
Another method to increase the number of input variables is to increase the number
of selection signals, but this will not scale either due to signal fanning. For x selection
signals, 2^x input signals are required.
Functions of type f : B^n → B^m have to be split into m functions of type f : B^n → B to
be implementable with the multiplexer pattern.
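To illustrate the idea in software (a hypothetical model, not thesis code), a 4 to 1 multiplexer whose data inputs are driven by four FFs acts as a reconfigurable two-input gate; overwriting the FF contents reconfigures the function according to Table 2.2:

```python
# Hypothetical software model of a MUX-based reconfigurable cell (illustrative only).
class MuxCell:
    def __init__(self, e):
        self.e = list(e)                  # contents of the four FFs feeding e0..e3

    def reprogram(self, e):
        self.e = list(e)                  # "reconfiguration": overwrite the FF contents

    def __call__(self, s0, s1):
        return self.e[(s0 << 1) | s1]     # the MUX routes e[s0 s1] to the output

cell = MuxCell([0, 0, 0, 1])              # row "∧" of Table 2.2
assert cell(1, 1) == 1 and cell(1, 0) == 0
cell.reprogram([0, 1, 1, 0])              # row "⊕" of Table 2.2
assert cell(1, 1) == 0 and cell(0, 1) == 1
```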
2.3 Look Up Table Approach
A better solution to implement reconfigurable functions of type f : B^n → B is to use a small RAM or LUT. The address signals of the RAM are used as the input parameters
and the data words hold the result of the function. Table 2.3 displays the implementation
of the simple Boolean functions ∧, ∨ and ⊕ in a LUT with an address width of three
and a data width of eight. Because only two operands are required for these operations,
a1 and a2 are selected as the input variables. The result is encoded in the dataword,
starting from the first left bit for ∧.
It is obvious that the LUT approach supports the calculation of multiple functions of
type f : B^n → B concurrently by using different bits of the data word as the result. This approach is better suited for the calculation of f : B^n → B^m functions than any other presented approach because it only requires one LUT, as long as m is less than or equal to the size of one data word. For functions with m greater than the size of one data word, LUTs
can easily be chained together.
a0 | a1 | a2 | Dataword (8bit)
0 | 0 | 0 | 00000000
0 | 0 | 1 | 01100000
0 | 1 | 0 | 01100000
0 | 1 | 1 | 11000000
1 | 0 | 0 | 00000000
1 | 0 | 1 | 00000000
1 | 1 | 0 | 00000000
1 | 1 | 1 | 00000000

Table 2.3: Example LUT implementing ∧, ∨ and ⊕
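The following sketch (an illustrative software model, not thesis code) replays Table 2.3: the three input bits address the LUT, and the leftmost three bits of each data word deliver ∧, ∨ and ⊕ of a1 and a2 in parallel:

```python
# Illustrative model of the LUT of Table 2.3 (hypothetical helper, not thesis code):
# 3 address bits, 8-bit data words; a1 and a2 carry the operands, and the leftmost
# three data bits encode AND, OR and XOR of the operands in parallel.
LUT = {
    (0, 0, 0): "00000000",
    (0, 0, 1): "01100000",
    (0, 1, 0): "01100000",
    (0, 1, 1): "11000000",
    (1, 0, 0): "00000000",
    (1, 0, 1): "00000000",
    (1, 1, 0): "00000000",
    (1, 1, 1): "00000000",
}

def lut_eval(a1, a2, a0=0):
    word = LUT[(a0, a1, a2)]
    return int(word[0]), int(word[1]), int(word[2])   # (a1∧a2, a1∨a2, a1⊕a2)

for a1 in (0, 1):
    for a2 in (0, 1):
        print(a1, a2, lut_eval(a1, a2))
```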
2.4 Field Programmable Gate Arrays
To extend Boolean functions as explained in the previous subsections to Finite State Machines (FSMs) or even more complex circuits, it is necessary to have memory and interconnects.
Many ICs provide the required resources to configure digital circuits, such as FPGAs, PLAs, CPLDs and PALs. This section describes the general structure of FPGAs because they are used for the prototype system in this thesis. Many books provide this information, but this section is based on the book by Urbanski et al. [25]. In contrast to its name, an FPGA is not an array of gates, but an array of configurable basic elements, such as Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Block RAM (BRAM), small DSPs and Clock Management Tiles (CMTs). Figure 2.5 displays the
basic FPGA structure with CLBs and IOBs, and without interconnects. They are organised in an array structure to simplify the interconnection of the blocks. All components of the FPGA are vendor and device specific. The focus here is on Xilinx Virtex5 FPGAs. The following information is taken from the Xilinx Virtex5 User Guide[3].

Figure 2.5: Simple structure of an FPGA without interconnects
2.4.1 Input/Output Blocks
IOBs are the interface from the configured hardware to the input and output pins of the
FPGA. They are also configurable by the developer to support different voltage levels
and input/output signal standards, such as Low-Voltage Transistor Transistor Logik
(LVTTL), Low-Voltage Differential Signaling (LVDS), and High-Speed Transceiver Logic
(HSTL).
2.4.2 Configurable Logic Blocks
CLBs are the main reconfigurable elements of the Virtex5 FPGAs. Figure 2.6 displays
the structure of two CLBs. The switch matrix is already part of the FPGA's interconnection network. One CLB consists of two slices. These slices are tightly interconnected through carry lines to increase the operand size of Boolean functions. Two CLBs at a time are connected through a shift line to form large shift registers.

Figure 2.6: Structure of two Virtex5 CLBs[3]
Every slice contains four LUTs, which are the basic reconfigurable elements of FPGAs,
four storage elements, wide-function multiplexers, and carry logic[3].
The used LUTs have six independent inputs and two independent outputs. This
structure supports the configuration of one Boolean function of type f : B^6 → B or two Boolean functions of type f : B^5 → B if the two functions share the same input parameters. Three multiplexers are connected to the four LUTs in one slice to support combining two LUTs to increase the number of possible inputs to seven or eight. Functions
with more inputs are implemented by combining slices.
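As an illustration of these numbers (a simplified, hypothetical model rather than the exact Virtex5 slice behaviour), a six-input LUT is just a 64-entry truth table, and splitting that table in half yields two five-input functions over shared inputs:

```python
# Illustrative, simplified model of a 6-input, 2-output LUT (not the exact Virtex5 primitive).
class Lut6:
    def __init__(self, bits):
        assert len(bits) == 64           # 2**6 configuration bits store the truth table
        self.bits = bits

    def o6(self, *inputs):               # one function f: B^6 -> B
        assert len(inputs) == 6
        addr = int("".join(map(str, inputs)), 2)
        return self.bits[addr]

    def o5_o6(self, *inputs):            # two functions f: B^5 -> B sharing the same inputs
        assert len(inputs) == 5
        return self.o6(0, *inputs), self.o6(1, *inputs)   # lower / upper half of the table

# Example configuration: f(x0..x5) = AND of all six inputs
lut = Lut6([1 if i == 63 else 0 for i in range(64)])
assert lut.o6(1, 1, 1, 1, 1, 1) == 1 and lut.o6(1, 1, 1, 1, 1, 0) == 0
```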
D-type FFs provide storage functionality within each slice. Their input can be directly
driven from a LUT . Some special slices provide more storage capacity by merging LUTs
into a small RAM . Different merging strategies are supported.
2.4.3 Block RAM
FPGAs support BRAM to provide reconfigurable hardware with fast and area-inexpensive RAM. On Xilinx FPGAs, BRAM is provided in 36 Kbit blocks. They are placed in
columns on the FPGA. The number of available blocks is FPGA dependent. For Virtex5
devices the available BRAM ranges from 144 kbytes up to 2321 kbytes.
BRAM can be used as single port, dual port RAM , or as First In First Out (FIFO)
queues. Virtex5 FPGAs even provide dedicated hardware for asynchronous FIFO queues,
reducing space requirements of the reconfigurable hardware. Access times for BRAM are
very fast, compared to off-chip Double Data Rate (DDR) RAM . A dataword is available
one clock tick after issuing the address into the RAM, making it a good choice for fast
buffers or caches.
2.4.4 Special IO Components
Often, reconfigurable hardware requires special I/O components, such as Ethernet, Serial
Advanced Technology Attachment (SATA), PCI, etc. Implementing these I/O components in reconfigurable hardware is possible, but requires much FPGA space. Therefore, the FPGAs support some special non-reconfigurable I/O hardware. This hardware
implements common parts of I/O devices, which can be used to create the required
components. The Virtex5 FPGA family supports Ethernet MACs, and RocketIO GTP
Transceivers.
Ethernet MACs reduce the area usage for Ethernet devices because they implement
the Media Access Control (MAC) layer of the Ethernet protocol.
RocketIO GTP Transceivers support general components for high speed serial I/O like
8b/10b encoders/decoders and fast serialiser and deserialiser. These transceivers can be
used to implement the physical layer of the PCI or SATA bus. The correct working mode
can be set through special instructions in the Hardware Description Language (HDL).
2.4.5 Interconnection Network
The interconnection network and the CLBs are the most important parts of the FPGA.
Without the interconnection network, the CLBs cannot be combined and larger components cannot exchange data. FPGAs distinguish three different signal types, which
have to be routed through the interconnection network with different priorities and signal
latencies.
clock signals Clock signals require a fast distribution time throughout the FPGA
because they synchronise all the components to its rising or falling edge.
reset signals Reset signals are similar to clock signals. Through reset signals components are initialised at the same moment. This also requires a fast distribution
throughout the FPGA.
I/O signals For I/O signals a fast distribution is also important, but the maximum
clock rate a design can work at, is calculated using the I/O signal line latencies.
Another important requirement for I/O signals is their number. A normal design only
has around one to three different clock signals and about as many reset signals, but the number of I/O signals is very large.
Therefore, the FPGAs support two different interconnection networks: one for clock
and reset signals and one for all the I/O signals, required to exchange data between
components.
2.5 Partial Reconfiguration
PR is a feature and a design flow of Xilinx Virtex5, Virtex6, and Virtex7 FPGAs[2]. It
extends the normal configuration possibility of FPGAs with the ability to modify parts
of a running configuration, without interrupting the computation.
The design is divided into a static and a reconfigurable part during development. For
the static part special entities, called reconfiguration modules, are defined, which hold
the reconfigurable components. This definition includes a signal interface declaration
for communicating with the static part. There can be different reconfiguration modules
in one design with a variable number of instances. The reconfigurable part of the design consists of entity descriptions for every component, which should be configurable into one
module.
Figure 2.7: simple PR example[2]
The synthesis process creates some FPGA configuration files. The main file includes
the static design and a component for each instance of a reconfiguration module. For
every component and every instance an additional partial configuration file is created.
These files can be loaded into the FPGA after the main file to reconfigure certain reconfiguration module instances. Figure 2.7 shows a simple example of a reconfigurable
system. It features two reconfiguration module instances and four partial configuration
files per module. Instances can only be configured into the RMs for which they have
been synthesised, placed, and routed.
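A small illustrative sketch of this constraint (the file and module names are taken from Figure 2.7; the bookkeeping helper itself is hypothetical and not part of the Xilinx flow):

```python
# Illustrative bookkeeping (hypothetical, not the Xilinx tool flow): each partial
# bitstream is tied to the one RM instance it was synthesised, placed and routed for.
PARTIAL_BITSTREAMS = {
    "RM00.bit": "RM0", "RM01.bit": "RM0", "RM02.bit": "RM0", "RM03.bit": "RM0",
    "RM10.bit": "RM1", "RM11.bit": "RM1", "RM12.bit": "RM1", "RM13.bit": "RM1",
}

def can_load(bitstream: str, target_rm: str) -> bool:
    return PARTIAL_BITSTREAMS.get(bitstream) == target_rm

assert can_load("RM02.bit", "RM0")
assert not can_load("RM02.bit", "RM1")   # wrong RM: not placed and routed for it
```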
3 Example Reconfigurable Systems
3.1 Research Systems
3.1.1 RampSoC
A RampSoC is a Multi-Processor System-on-Chip (MPSoC) that can be adapted during
run-time by exploiting dynamically and partially reconfigurable hardware[4]. A special
design-flow is used, which combines the top-down and bottom-up approaches. The bottom-up approach is used during design time to set up the basic conditions of a RampSoC
according to the problem-space it should be used in. In the top-down approach the
software is optimised for this initial setup. Parts of this initial setup can be reconfigured
to meet arising needs of applications during runtime, such as a different processor core
or a special accelerator unit. Figure 3.1 shows a possible RampSoC configuration at
some point in time. Two types of processor cores are supported in this configuration, each having at least one accelerator unit. Switches connect the individual cores to the communication network.

Figure 3.1: example RAMPSoC Configuration[4]
The implementation of a RampSoC is done using the early access PR concept of Xilinx. This design flow is not supported by the Xilinx toolchain anymore. The early
access PR design flow requires that reconfigurable modules are defined before synthesis of the project. To reconfigure different cores, accelerators and the communication infrastructure, all reconfigurable parts have to be defined at the system design stage. The maximum number of accelerators and processor cores is fixed during runtime. The developer has to decide whether each type of core requires its own reconfiguration module or whether the biggest core size is selected as the size of the reconfiguration unit. He has to balance between space exploitation and flexibility. The RampSoC approach uses proprietary processor cores, such as Pico- and Microblaze cores from Xilinx. To these cores, accelerator units are connected, which can change their hardware function while
the processor is executing a program.
The RampSoC approach is a very flexible improvement compared to normal multicore
processors or MPSoCs. Its heterogeneous structure allows the optimal execution of
applications with different hardware requirements and can adapt to application needs
during runtime very easily. Processor cores can even be replaced with special FSMs
supporting calculations in special hardware components.
3.1.2 PRHS
The Partial Reconfiguration Heterogenous System (PRHS) developed by Eckert[5] also tries
to exploit the newly available space on ICs by reconfiguration. The PRHS is a softcore
SoC configured onto an FPGA. It features one RM of the Xilinx PR design flow. Different hardware components can be configured into the
available RM. The RM can accelerate
computations on the SoC, but its main purpose is virtualisation.
Virtualisation in this case means the instantiation of a full SoC running under the
supervision of the static core. The virtualised SoC also runs Linux as OS. Figure 3.2
displays this scenario. The static system on the right is running Linux as its OS. It
has full access to memory and memory mapped IO hardware components like Universal
asynchronous receiver/transmitters (UARTs) or timers. On the left a RM is available
and connected to the static system. The SoC configured at runtime into this RM has
only partial access to the memory. The accessible memory space is configured from the
static system before the virtualised system is started. A memory mapped IO component
interconnects the RM and the static system. It supports starting and stopping the
virtualised system, but not suspending it. Providing a virtualised hard-disk to the
reconfigurable system is another feature of the static system.
The PRHS is an interesting way of using tightly coupled reconfigurable hardware from
a static processor core. The virtualised processor cores can feature different ISAs and
run without performance losses compared to the static processor core.
Figure 3.2: PRHS System Overview[5]
3.1.3 Dreams
Dreams is not directly a RS, but a tool to build runtime reconfigurable systems.
It processes Xilinx Description Language (XDL) files, created by the Xilinx tools, and
provides a partial reconfiguration design flow on top of PR. While the Xilinx design flow
forces the developer to run the synthesis, place, and route process for every RM and
every implementation of a module, the Dreams design flow does not. It supports easy
relocation of RMs that have been synthesised, placed, and routed only once.
XDL is a human readable language for describing netlists. It is compatible with the
ncd netlist file format and Xilinx provides programs for easy conversion.
Dreams is developed by Otera et al.[26]. It tries to improve the Xilinx design flow in
four different ways:
1. Module relocation in any compatible region in the device
2. Independent design of modules and the static system
3. Hiding low level details from the designer
4. Enhanced module portability among different reconfigurable devices
Its design flow targets reconfigurable architectures built out of disjoint rectangular regions.
The system architecture enforced by the Dreams tool is divided into Virtual Regions
(VRs) and Virtual Architectures (VAs). A VR combines FPGA resources for use as a RM
or static module. The VA describes the full system, including static and reconfigurable
parts and how they are interconnected using the FPGA's interconnect. The VRs and
the VA description are provided by the developer as Extensible Markup Language (XML)
files.
Dreams is a very interesting tool. In the Xilinx PR design flow, very large reconfigurable
systems suffer from very long placement and routing times. Dreams could significantly
reduce these times and improve the development time of such systems.
3.2 Commercial Systems
3.2.1 Convey HC1
One commercially available RS is the Convey HC1[6]. It combines four Xilinx Virtex5
FPGAs with an Intel Xeon processor through the X86 co-processor interface. Figure 3.3
gives an overview of this architecture. The system contains two memories, one connected
to the processor cores and another one connected to the four FPGAs. Both are accessible
from the processor and the FPGA side. Hardware ensures cache-coherency between
them. The memory on the FPGA side is specially partitioned to support concurrent
access to different memory banks from different FPGAs to increase the overall memory
access speed.
Figure 3.3: Overview of the Convey HC1 architecture[6]
Communication with the FPGAs is implemented using the coprocessor interface of Intel processors. Software running on the Xeon processor can trigger hardware operations
running on one of the FPGAs by issuing special coprocessor instructions and writing
data required for the operation to special memory regions. Programs can change configurations in idle times of the FPGA. The Xilinx PR design flow is basically available,
but is not yet supported by Convey, resulting in long reconfiguration latencies and very
fixed FPGA designs. Still, the Convey HC1 is a very interesting platform for high performance computing. In high performance computing the accelerator hardware seldom
changes and one important factor is memory access. Memory access is very fast on the
HC1 because of its special memory layout.
3.2.2 Intel Stellarton
Another commercial RS is the Intel Stellarton processor and FPGA SoC[14]. It combines
a standard Intel Atom Stellarton processor core with an Altera FPGA on the same chip,
but not on the same die. Figure 3.4 gives an overview of its hardware structure. The SoC
contains all the standard components of the Intel Atom processor, like a DDR interface,
a graphics adaptor/accelerator, an audio component and a Peripheral Component Interconnect
Express (PCIe) bus interface.
The Altera FPGA[27] is connected to the processor by this PCIe bus. Through this
bus the FPGA is configurable, and application data can be exchanged between FPGA
and processor. The main purpose of this RS was to improve the performance of host
programs with accelerator hardware.
The production of the system has been discontinued, but a new approach by Intel
seems to be on its way, according to Diane Bryant[28]. Intel is working
on combining its Xeon server processors with FPGAs to improve the performance of
internet cloud services, such as Ebay, Amazon, etc.
Figure 3.4: Structure of an Intel Stellarton Processor, combined with an Altera FPGA
3.2.3 Xilinx Zynq Architecture
Zynq[7] is a very new hybrid hardware system produced by Xilinx. It features a dual-core
ARM Cortex-A9 processor connected to many peripherals and an FPGA through an
Advanced Microcontroller Bus Architecture (AMBA) bus. Figure 3.5 presents the overall
system structure. Processor core and FPGA share the same chip, but not the same die,
like the Intel Stellarton processor. It supports a lot of static hardware components to
connect to common embedded devices, such as an Inter-Integrated Circuit (I2C) controller,
a Serial Peripheral Interface (SPI) controller, or a Controller Area Network (CAN) controller. The FPGA is connected to the processor through an AMBA bus, a
very common bus in embedded devices. It supports general-purpose ports and
high-performance ports from the processor to the FPGA. The FPGA has access to high-speed
serial I/O transceivers going off-chip and to the AMBA bus. All other features of
a Virtex7 FPGA are also supported, including PR.
The Zynq architecture is an interesting system for embedded hardware developers.
On the ARM processor cores a standard embedded OS can run and the FPGA can
improve calculation performance for special applications, like audio and video editing,
radio transmissions, and cryptographic algorithms.
Figure 3.5: Structure of the Xilinx Zynq architecture[7]
3.3 COPACOBANA and RIVYERA
Figure 3.6: COPACOBANA and RIVYERA interconnection overview
The COPACOBANA and RIVYERA systems, developed by SciEngines, are hybrid hardware systems optimised for cryptanalysis and scientific computing.
Both systems consist of many interconnected FPGAs working together to solve a
problem. The host system is connected through 10 Gbit Ethernet cards, 4 Gb Fibre
Channel cards, or InfiniBand. The COPACOBANA can search the complete 56-bit DES key
space within 12.8 days. The RIVYERA is the successor of the COPACOBANA.
4 Interconnection Networks
Modern hardware design often requires the development of several interconnected components, and different interconnection network schemes are available today. If a more tightly
coupled system is required, these components are combined on a single chip. Such a
tightly coupled system is called a SoC.
Figure 4.1 displays an example mobile phone system with three different interconnection schemes. This system can be developed as a multi-chip system or as a SoC.
The shown mobile phone system consists of a CPU, memory, a DSP, a keypad, and a
radio transceiver.
Figure 4.1: Example mobile phone SystemOnChip (SoC): a) bus connection, b) P2P connection, c) NoC connection
These components interact in different ways to get the mobile phone
running. The interactions can be implemented using different kinds of interconnection
networks. Figure 4.1 shows three possible topologies. In a) all components are connected
to a bus, with the typical bus communication restrictions, such as exclusive bus access for
a single component and poor scalability. In b) all components are directly connected
with all components they interact with. This network topology supports very
flexible communication, but requires many interconnection links. The last displayed
topology is a packet-switched network built out of the components and switches. Networks of this
kind are called NOCs. NOCs are very similar to the communication infrastructure of computer networks, such as Local Area Networks (LANs) or Wide Area
Networks (WANs).
Many other network architectures exist. To distinguish these networks and to
easily highlight their differences and performance properties, a classification is necessary.
In this work, part of the classification by Schwederski et al. [21] is used, which is
based on research done by Feng[22].
The base for a classification is usually a mathematical representation of the entity of
interest. In this case finite graphs are a good representation of interconnection networks.
The edges of the graph model the interconnection links and the nodes are the Processing
Elements (PEs), connected to the network. A PE is the component doing calculations
and using the network for communication purposes, such as a processor core, a DSP, or
some other kind of device controller.
This chapter is organised as follows: Section 4.1 describes the OSI model, an industry
standardisation model for communication protocols that simplifies their development.
The distinguishing characteristics of NOCs are explained from Section 4.2 to Section 4.8.
4.1 Open Systems Interconnection Model
Communication systems mostly consist of more than just two communication partners.
These communication partners can be under the control of the same developer or company, but this is not always the case. Data is transmitted over multiple nodes to reach
its destination, and the underlying infrastructure can differ from node to node because of
different responsibilities. The transmitted data can be divided into a header, which contains
source and destination addresses, payload size, and quality-of-service information, and the
actual payload. The position of the header data and the payload has to be defined to
help every developer and manufacturer produce compatible hardware. Later in this
work, protocols will be described using the terminology of the OSI model.
The International Telecommunication Union (ITU) and the International Organization for Standardization (ISO)[29] developed the OSI model to simplify the definition
of communication protocols. Seven functionally distinct layers divide the communication
process. Figure 4.2 gives a graphical representation of these layers and the expected
protocol flow. The flow starts at either side of the network stack. If some data shall
be transmitted to another communication partner, the communication usually starts
at the application layer. Every layer processes the data and passes it down to the next
layer until reaching the physical layer. Each layer adds header information or transforms
the data according to the network requirements. Sometimes control messages are created,
passed down the layers and sent to their corresponding layer at the next communication
partner, to create a virtual connection between them.
The physical layer transmits the data through some kind of medium (wire, air, fibre
optic, . . . ) to the next node. After the transmission, the data passes up through the layers again.
If the node is just an intermediate one, the data only moves up to the network layer, where
it gets formatted for the transmission to the next node. If the data has arrived at its
destination, it gets passed up to the application layer.
In the following sections each of the seven layers is briefly described. More information
about the OSI model can be found in [29] or [30].
Figure 4.2: graphical representation of the ISO/OSI Model
4.1.1 Application Layer
The application layer is the interface between a program or application running on a PE
and the communication infrastructure. It defines the interaction between two or more
communication partners, such as how to request data or how to send data to the partner.
For this interaction the application does not require any information about the
underlying network; the destination address is sufficient. Very common application layer
protocols used in the Internet are Hypertext Transfer Protocol (HTTP) and Post Office
Protocol Version 3 (POP3).
4.1.2 Presentation Layer
Data can be represented in multiple forms. For example, some processor cores use big-endian
and others little-endian byte ordering when working with structures bigger than one byte.
A higher-level example is text encoding with ISO character sets or UTF-8.
To allow the application layer to simply use the passed data, the presentation layer
converts and transforms the data into the required representation.
The presentation layer can also be used to implement point-to-point encryption.
4.1.3 Session Layer
A communication session consists of the connection establishment, the transmission and
reception of multiple pieces of data, and the tear-down of the connection.
Not every communication requires the establishment of a session. For example, in a
network where all information is broadcast to every network member, it is not possible to establish a session. Sessions are always necessary if multiple requests belonging
to the same context have to be transmitted.
The session layer is responsible for establishing the connection before the data of a session
is transmitted and for tearing down the connection when the session is finished.
4.1.4 Transport Layer
The transport layer defines at least one protocol or method for transmitting data to
another node in the network. This protocol can be connectionless or connection-oriented.
For a connection-oriented protocol, the connection establishment, data transmission, and
connection tear-down have to be described. In this case the data transmission ensures
the reception of the data at the communication endpoint. For a connectionless protocol
only the data transmission is required, without acknowledgement of receipt.
Well known transport layer protocols are the User Datagram Protocol (UDP) and the
Transmission Control Protocol (TCP).
4.1.5 Network Layer
Networks can be built with different topologies. How data is transmitted from a source
node to a destination node depends on this topology because it specifies whether nodes are
directly connected and how many intermediate nodes exist between them. The network
layer is responsible for defining routing and path-finding algorithms for transmitting data
between the network nodes. If necessary, it creates an abstraction layer over all network
nodes with its own distinct address range. In this logical view the nodes appear to be
directly connected. Common network layer protocols are IPv4 and IPv6.
4.1.6 Data Link Layer
The data-link layer is responsible for ensuring that the entities forming the network can communicate reliably with each other. If the underlying physical connection is not very robust,
the data-link layer ensures error detection through some kind of checksum and, if possible, error correction. This is achieved by requesting a retransmission of the data from
the data-link layer on the other communication side or by recalculating lost data. If the
physical transmission has a maximum number of bits it can transmit at one time, the
data-link layer arranges the framing of the data.
4.1.7 Physical Layer
The physical layer of the OSI model transmits data from one network entity to another.
The structure of the data is not important at this layer because just bits are transferred.
The physical layer describes the electrical and physical specification for transmitting one
bit. It determines the modulation of the data and which transfer medium is used. It
offers the data-link layer an interface to transmit x bits of data.
4.2 Topology
The physical layer of the OSI model describes how bits are transferred between network entities.
These entities are organised in a specific structure, such as a star, ring, or cube. This
structure, represented by a finite graph, is called the network topology. Because the topology is
obviously a distinctive feature of a network and significantly influences its performance,
the following classification properties are very important. For all the properties
we assume that the network $N$ has $n$ interconnected PEs numbered $pe_0, \dots, pe_{n-1}$.
4.2.1 Interconnection Type
The network entities can be interconnected in different ways when forming a network.
The following values describe the interconnection type in this classification:
static
If entities are statically linked, the links cannot be changed during runtime of the network.
The network has to be recreated to change them. Such a network is called a static network.
An example of a static network is a ring.
dynamic
A dynamically linked network is called a dynamic network. It allows the alteration of
connection links between two components during runtime of the network. A good example of a dynamic network is a bus: the address signals of a bus allow the selection
of different communication partners.
direct
In a directly connected network (direct network) each network entity or PE is connected
to at least one other network entity through fixed links. No other component is required
for communication between entities.
Figure 4.3: direct and indirect interconnection networks: a) direct net, b) indirect net
If data needs to be transferred through
intermediate nodes to its destination, the network entities have to provide this functionality on their own. Figure 4.3 a) shows a direct network of five PEs.
indirect
The opposite of a directly connected network is an indirectly coupled one (indirect network). In this type of network the entities or PEs are connected through some kind
of network infrastructure, which is responsible for data routing, for example a network
switch or hub. The individual entities only possess uni- or bidirectional links to one
network infrastructure component. Such a network is displayed in Figure 4.3 b).
combination
The properties mentioned above exclude each other in pairs: a static network
cannot be a dynamic network at the same time, and the same holds for direct and indirect
networks. There could be special cases in which this is not true, but these will not
be considered in this work.
Combinations across the pairs are possible. For example, a static and indirect network
is a very common case when looking at the interconnection of computer systems. Another
example is a bus, which can be implemented as a dynamic and direct network.
4.2.2 Grade and Regularity
It is always important to know how much data can be transferred between PEs in
parallel and whether this value is the same between all network entities. These values
differ between network topologies.
The grade Γ of a PE is defined as:

$$\Gamma(pe_i) = \text{number of connections of } pe_i \quad \text{for } i \in \{0, \dots, n-1\}$$

The grade measures the density of interconnection links in a network. We define:

$$\delta(N) = \min_{i \in \{0, \dots, n-1\}} \Gamma(pe_i)$$

and

$$\Delta(N) = \max_{i \in \{0, \dots, n-1\}} \Gamma(pe_i)$$

The term regularity describes whether the structure of the interconnection links is the same
at all PEs of the network:

$$N \text{ is } r\text{-regular if } \delta(N) = \Delta(N) = r$$

This implies:

$$\Gamma(pe_i) = r \quad \forall i \in \{0, \dots, n-1\}$$
This characteristic is only important for direct networks because usually the PEs of
an indirect network just have one bidirectional connection to an infrastructure element.
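To make these definitions concrete, the following minimal Python sketch computes the grade and regularity of a direct network. The adjacency-list representation and the function names are assumptions made here for illustration; they are not part of the classification itself.

```python
# Sketch: grade and regularity of a direct network given as an
# adjacency list {pe: set_of_neighbouring_pes}.

def grade(network):
    """Gamma(pe_i): number of connections of each PE."""
    return {pe: len(neigh) for pe, neigh in network.items()}

def delta(network):
    return min(grade(network).values())   # delta(N)

def Delta(network):
    return max(grade(network).values())   # Delta(N)

def regularity(network):
    """Return r if the network is r-regular, otherwise None."""
    d, D = delta(network), Delta(network)
    return d if d == D else None

# Example: a bidirectional ring with four PEs is 2-regular.
ring4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(grade(ring4))       # {0: 2, 1: 2, 2: 2, 3: 2}
print(regularity(ring4))  # 2
```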
4.2.3 Diameter
The network diameter quantifies the maximum distance between network nodes. The
classification by Schwederski et al. [21] defines the diameter for direct networks only. But
the diameter is such an important characteristic that in this work it is also extended to
indirect networks.
direct networks
Let $N$ be a direct network with $n$ nodes numbered $0, \dots, n-1$. Let $d_{a,b}$ be the minimum
number of steps (connection links) between the nodes $a$ and $b$. The diameter is defined as:

$$\Phi(N) = \max(d_{a,b}) \quad \forall a, b \in N,\ 0 \le a < n,\ 0 \le b < n$$
indirect networks
An indirectly coupled network consists of at least one level of coupling elements. These
coupling elements take over the routing functions of the nodes in a direct network. Every
node or PE in an indirect network has one connection to a coupling element. Let $N$ be
an indirect network with $s$ levels of coupling elements and $n$ nodes numbered $0, \dots, n-1$.
Let $a, b \in N$ with $a$ connected to coupling element $X$ and $b$ connected to coupling element
$Y$. Let $d^{C}_{X,Y}$ be the minimum number of steps (connection links) between $X$ and $Y$. Now,
let $d_{a,b} = d^{C}_{X,Y} + 2$ be the minimum number of steps between the nodes $a$ and $b$. The
diameter is defined again as:

$$\Phi(N) = \max(d_{a,b}) \quad \forall a, b \in N,\ 0 \le a < n,\ 0 \le b < n$$
Dimension of the diameter
Sometimes it is not possible to calculate an exact number for the diameter. Still, it is
important to know the order of magnitude the diameter can take on. For this case we define:

$$\Phi(N) = \Theta(f(n))$$

for a function $f$ and a parameter $n$, meaning that the diameter of the network grows on
the order of $f(n)$.
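As an illustration (not from the thesis), the diameter of a small direct network can be computed with a breadth-first search from every node; for an indirect network the same computation would be applied to the graph of coupling elements and the result increased by two, following the definition above. The adjacency-list representation is an assumption of this sketch.

```python
# Sketch: diameter Phi(N) of a direct network via breadth-first search.
from collections import deque

def shortest_hops(network, start):
    """Minimum number of links d_{start,b} to every reachable node b."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neigh in network[node]:
            if neigh not in dist:
                dist[neigh] = dist[node] + 1
                queue.append(neigh)
    return dist

def diameter(network):
    """Phi(N): maximum over all node pairs of the minimum hop count."""
    return max(max(shortest_hops(network, a).values()) for a in network)

# Example: a bidirectional ring with 8 nodes has Phi = floor(8/2) = 4.
ring8 = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
print(diameter(ring8))  # 4
```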
4.2.4 Bisection Width
We still have our network $N$ with $n$ PEs. The bisection width partitions the network
into two halves and measures the minimum number of interconnection links between these halves.
The segmentation into $M_1$ and $M_2$ is done according to these equations:

$$|M_1| = \lfloor n/2 \rfloor \text{ PEs}$$

and

$$|M_2| = \lceil n/2 \rceil \text{ PEs}$$
The bisection width $W_k(M_1, M_2)$ of a single segmentation is given by:

$$W_k(M_1, M_2) = \text{minimum number of interconnection links between } M_1 \text{ and } M_2$$

The bisection width of the whole network $N$ is given by:

$$W(N) = \min_k W_k(M_1, M_2) \quad \forall \text{ segmentations } M_1, M_2$$

The bisection width is an important metric for the performance of networks because
many algorithms require that the nodes of one half of the network communicate with
corresponding nodes in the other half.
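A brute-force sketch of this definition for small networks follows; enumerating all balanced segmentations is exponential, so this is only meant to illustrate the definition, not to scale. The adjacency-list representation is again an assumption.

```python
# Sketch: bisection width W(N) by enumerating all balanced segmentations.
from itertools import combinations

def bisection_width(network):
    nodes = list(network)
    n = len(nodes)
    best = None
    for m1 in combinations(nodes, n // 2):   # M1 takes floor(n/2) PEs
        m1 = set(m1)
        # Count every link that crosses the cut exactly once.
        cut = sum(1 for a in m1 for b in network[a] if b not in m1)
        best = cut if best is None else min(best, cut)
    return best

# Example: a bidirectional ring with 8 nodes has W = 2.
ring8 = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
print(bisection_width(ring8))  # 2
```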
4.2.5 Symmetry
The symmetry of a network simplifies the writing of distributed algorithms. A network
can be asymmetric, node-symmetric, or link-symmetric. In a node-symmetric network,
the network structure looks the same from every PE. This symmetry allows the deployment of the same algorithm to all PEs in the network. In a link-symmetric network the
network looks identical from every link. This may simplify the scalability of the
network. If the network is asymmetric, every PE has to be considered individually.
4.2.6 Scalability
After deployment of a network, whether between some small hardware components
or between computer systems, scalability is always very important. If a SoC is
extended for a new revision, new components are added to the system and have to be
integrated into the NOC. If the NOC is not scalable, integrating the components will be
a very big problem, possibly leading to a complete redesign of the system.
A network is scalable if:
1. the topology mostly stays the same, if a new component is integrated. In the best
case all existing connections and nodes are fixed and only the new connections for
the PE have to be appended.
2. the communication performance does not suffer by increasing the number of nodes.
3. the increase of the network complexity is limited.
4.3 Interface Structure
The interface is the bridge between one PE and the network. Its structure determines
the communication between PEs. The requirements for such an interface differ in direct
and indirect networks, but the implementation varies within each network type too.
4.3.1 Direct Networks
The requirements for direct networks are very versatile because the PEs are directly
responsible for the network access. The interfaces in a direct network have to implement
the wire selection, path finding and data forwarding algorithms. These tasks require lots
of hardware, such as multiplexers for selecting the correct path or buffers to store data
before forwarding it.
4.3.2 Indirect Networks
Interfaces in indirect networks are normally very simple because one PE has only one
bidirectional connection to the network. The interface does not require any complex
multiplexer or router functionality. The hardware just transmits and receives data from
a network infrastructure component. At most a small buffer is necessary.
4.4 Operating Mode
The operating mode of networks refers to the connection establishment and the data
transmission of PEs. Both tasks can be executed synchronously or asynchronously.
4.4.1 Synchronous Connection Establishment
In this operating mode all PEs establish their network connection or communication link at the same time. The exact point in time is synchronised by a global clock
signal.
4.4.2 Synchronous Data Transmission
Data designated for transmission can be divided into individual bits or groups of bits,
such as one byte. These groups are transmitted on a global clock
tick, so every network interface transmits its own group of bits at the same time.
4.4.3 Asynchronous Connection Establishment
The PEs need not wait for a specific global clock signal or a number of clock ticks to be
allowed to establish communication. It can happen at any clock tick.
4.4.4 Asynchronous Data Transmission
As with synchronous data transmission, the data can be divided into groups of bits. But
in this case, handshake protocols are used to ensure the transmission of the data. For
example, the sender is only allowed to put the next group of bits onto the transmission
line if the receiver has acknowledged the reception of the current group.
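A minimal software simulation of such a request/acknowledge handshake is sketched below; the signal names and the step-wise evaluation are assumptions made for illustration and do not describe a concrete hardware implementation.

```python
# Sketch: asynchronous data transmission with a request/acknowledge
# handshake. The sender only drives the next group of bits after the
# receiver has acknowledged the current one.

def transmit(groups):
    received = []
    req = ack = False            # handshake lines, initially idle
    line = None                  # shared data signal
    for group in groups:
        line, req = group, True  # sender drives data and raises request
        if req:                  # receiver latches data, raises acknowledge
            received.append(line)
            ack = True
        if ack:                  # sender sees acknowledge, releases request
            req = False
        ack = False              # receiver releases acknowledge
    return received

print(transmit([0b1010, 0b0110, 0b1111]))  # [10, 6, 15]
```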
4.4.5 Mixed Mode
All these operating modes can be mixed. A very common mixture is the combination
of asynchronous connection establishment with synchronous data transmission. This
combination allows very simple transmission hardware, because it is controlled by
a central clock signal, and a flexible communication pattern, because PEs can start a
communication at any time.
4.5 Communication Flexibility
Communication within a network can follow different strategies or patterns. A network
can support all of them or just one. The level of communication flexibility depends
on how many and which of these strategies the network supports.
4.5.1 Broadcast
The simplest communication strategy in a network is a broadcast. If a PE wants to
transmit data to another PE, it sends the data to all other PEs. The receiving PE
recognises that the data is addressed to it and can use it. All the other PEs just drop the data. This
is not very flexible or efficient, but does not require a very complex routing algorithm.
4.5.2 Unicast
The unicast communication strategy is the opposite of a broadcast. PEs address exactly
one other PE and the data is only transmitted to this one. No other element in the
network receives the data.
4.5.3 Multicast
A broadcast is often too expensive because the data is transmitted to all PEs in the
network. To improve the flexibility and reduce the cost of the communication pattern, the
multicast strategy was developed. It allows the addressing of a subset of all the PEs in
the network. This improves the flexibility significantly because the network can be divided into
different groups, which can be addressed individually.
4.5.4 Mixed
All the strategies mentioned above can be combined within a network. For example, in
TCP/IP networks all of them are found. It is also very common to combine the
unicast and multicast strategies. This combination increases the flexibility of a network
a lot because individual PEs can be addressed on the one hand and groups of them on
the other.
4.6 Control Strategy
As mentioned earlier in this chapter, networks can be divided into static and dynamic
ones. If a network is dynamic, the control over the dynamic links can be organised in
different ways. This property is inapplicable for static networks because their links are
fixed.
4.6.1 Centralised Control
In a centrally controlled dynamic network a single control unit is responsible for the
selection of the source and destination of the interconnecting links.
This often requires much hardware because the central control unit needs to control
all components in the network that can switch the connection links. The configuration
of all the links requires a very complex algorithm too. This strategy is best used in an
environment with very few changes.
In such a network, however, all connected resources can be configured at once and in cooperation with all the others to achieve the best possible interconnection pattern for the
current workload.
4.6.2 Decentralised Control
The opposite of a centrally controlled network is a decentrally controlled network. In this
kind of network many network components exist which organise the connection links
for a small part of the network. These networks are also called self-routing networks
because, if data is transmitted through the network, the decentralised components need
to decide how to switch the connection links and route the data without a view of the
complete network.
This leads to a network without the optimal interconnection pattern, but it is very
flexible and adaptable to different communication requirements on the fly.
4.7 Transfer Mode and Data Transport
Two network transfer modes are common today. In a circuit-switched network a complete
link is established between two communicating PEs through every intermediate PE. This
can be done in a centralised or decentralised manner, as explained earlier in this chapter.
In a packet-switched network, data is grouped into packets. These packets contain the
source and destination address in a header section. In a direct network the PEs, and in
an indirect network some infrastructure component, forward these packets according to
an algorithm until they are received by their destination.
Detached from the actual hardware implementation, communication within a network
can be connection-oriented or connectionless. In a connection-oriented communication the source always establishes a connection with the destination first, which stays
active for the whole communication. In packet-switched networks this is always done
using some kind of virtual connection, where the destination is told when a connection
starts and when it ends. In a circuit-switched network a "real" connection can be established between both communication partners. In a connectionless communication
the source just sends data packets into the network. These packets travel along the
cheapest interconnection links. No preferred communication path exists. Connectionless
communication is only possible in a packet-switched network.
Depending on the underlying hardware and the connection type, different routing algorithms have to be used to get the data to its destination.
Store and Forward Routing This kind of routing is used in packet-switched networks
to forward packets between network entities as a whole. The packet is transmitted
completely and is saved in a buffer at the next component. If the link to the next
component is ready, it is forwarded again. This routing mechanism is very simple, but
consumes a lot of hardware: much buffer space is required at each network component.
Wormhole Routing Wormhole routing combines the advantages of packet- and circuit-switched
networks in environments where the data transport is done over intermediate nodes. The data packets are divided into smaller pieces, called flits. The first flit
contains the connection information. Each level in the network builds up the connection link when it receives the first flit. After this connection establishment there is a
complete link between source and destination and all flits of the packet are somewhere
in between. The last flit tears down the link. The advantage of this strategy is a
reduced latency between transmission and reception of a message. The disadvantage
is the possibility of deadlocks because one transfer locks multiple network components
at a time.
Virtual Cut Through Routing This routing scheme is related to wormhole routing. It is used in packet-switched networks. In each level of the network there is
enough buffer space available for saving a complete data packet. Packets are transferred into the network and each level forwards them to the next level. If the way to the
next level is blocked, the packet is detained. If the way is free, the forwarding of the
packet starts immediately, without waiting for the reception of the full packet.
As in wormhole routing, a packet may be spread across multiple levels of the network. Long blocking of the network is prevented by buffering packets if the way is
blocked.
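The latency advantage of the pipelined schemes can be made explicit with the usual back-of-the-envelope model (a standard textbook approximation added here for illustration, not taken from the thesis): for a packet of size L transmitted over h hops with link bandwidth B and flit size Lf, and ignoring contention,

```latex
% Approximate no-contention transfer times (standard model, assumed here):
\begin{align*}
T_{\text{store-and-forward}}      &\approx h \cdot \frac{L}{B}\\
T_{\text{wormhole / cut-through}} &\approx \frac{L}{B} + h \cdot \frac{L_f}{B}
\end{align*}
% For L >> L_f the pipelined schemes are almost independent of the hop count h.
```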
4.8 Conflict Resolution
Networks can differ in the way they resolve conflicts. The two main network conflicts
are output conflicts and internal conflicts.
output conflict These conflicts occur if messages are transferred from multiple sources
to one destination, but only one connection can be established between source and destination. This conflict cannot be resolved by changing the network topology because
the destination can only support one connection.
internal conflict Even if all messages are addressed to different destinations, an internal conflict can occur. In networks consisting of consecutively interconnected links,
a message can travel partly the same way as another message, leading to a conflict because only one message can pass a link at a time. This conflict is traffic induced and
can be resolved by changing the network topology, for example by creating redundant
links to bridge the part of the network with the bottleneck.
To resolve these conflicts without changing the topology, three resolution
methods are available.
Block Method If a message cannot be routed to the destination or the next network
level, the message has to wait at the source. This requires the source component to
have enough buffer space for at least one message.
Drop Method In this case, a non-routable message is discarded. No additional attempt to deliver the message will be made; the data is lost.
Modified Drop Method A small change can reduce the impact of the drop method.
In this mode packets are only dropped if buffer space is exhausted or the network has
been blocked for a certain duration.
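A minimal sketch of how these three methods differ when a message cannot be forwarded; the buffer capacity and the timeout are illustrative assumptions, not values from the thesis.

```python
# Sketch: conflict resolution for a message that cannot be routed right now.
from collections import deque

def handle(message, buffer, method, blocked_for=0, capacity=4, timeout=8):
    if method == "block":
        buffer.append(message)          # wait at the source
        return "queued"
    if method == "drop":
        return "dropped"                # discard immediately, data is lost
    if method == "modified-drop":       # drop only under pressure
        if len(buffer) >= capacity or blocked_for > timeout:
            return "dropped"
        buffer.append(message)
        return "queued"
    raise ValueError(method)

buf = deque()
print(handle("pkt0", buf, "modified-drop", blocked_for=2))   # queued
print(handle("pkt1", buf, "modified-drop", blocked_for=20))  # dropped
```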
5 Example Network On Chip Architectures
Many NOCs exist today. This chapter will introduce the reader to some simple NOCs,
which will later be used for comparison with the NOCs developed in this work. For information
about more complex NOCs the reader can refer to Schwederski et al. [21] or Bjerregaard et
al. [31]. The latter give a very interesting survey of research in NOC architectures.
5.1 Ring
Ring networks are among the simplest networks available. Their communication can be
unidirectional or bidirectional. Figure 5.1 shows an example bidirectional ring with
eight communication elements.
Figure 5.1: Example Ring network with eight nodes
Every one of these elements can transmit a message at the same
moment. A bidirectional ring can transmit data in both directions, a unidirectional ring
just in one. The structure of the ring allows very fast local communication between two
neighbouring nodes, but only slow global communication. Table 5.1 presents some
classification properties for a bidirectional ring with N nodes.
Type: direct-static
Grade: Γ = 2
Regularity: 2-regular
Diameter: Φ_RING = ⌊N/2⌋
Symmetry: node & link
Scalability: –
Bisection-Width: W_RING = 2
Table 5.1: Classification of a bidirectional ring
A ring is a static network because the communication partners are always fixed. In this case the communication
infrastructure is located in the PEs, and the ring is therefore a direct network. By moving
the communication infrastructure outside the PEs, it can become an indirect one. The
grade and the regularity indicate that the nodes in the network have a maximum of two
communication links and that all of them have the same number. The diameter is ⌊N/2⌋
in a bidirectional ring and N − 1 in a unidirectional ring.
The following are examples of a specific implementation of the ring architecture:
• Token-Ring[32]
• Register Insertion Rings[33]
• Scalable Coherent Interface (SCI) Ring[34]
5.2 Bus
A bus is a very simple and flexible network architecture. It is mostly used for accessing
components in a memory-like manner. The interconnection links are divided into data,
address, and control signals and are shared by all network nodes. Figure 5.2 shows an
example bus with four interconnected components.
Figure 5.2: Example bus with 4 nodes (8-bit data, 4-bit address, and 2-bit control signals)
Because the network uses a shared medium for data transfer, the maximum number of components is limited. The access to
the medium is implemented in a time-multiplexed way. The data transmission between
network nodes is more complicated than in a ring. First, the access to the interconnection
links, the bus arbitration, has to be organised. This can be implemented in a centralised
or decentralised style. The actual data transmission can be synchronous or asynchronous.
The destination of a transmission is selected by the value of the address signals. This
explicit address selection allows direct communication between two components. One
of the components, the initiator of the communication, controls the communication,
and the other, the responder, answers the request.
5.2.1 Bus-Arbitration
The bus arbitration decides which component is allowed access to the interconnecting
links. This is necessary because a bus uses a shared medium and only one active component is allowed on the bus. The access decision can be made by a central control
unit. Each network component has a bus-request and a bus-grant line to this central
control unit. This unit selects the bus component with the highest priority out of
all components requesting bus access.
If no central control unit is available, or not practical, the access decision can be made
decentrally. An example of a decentralised decision-making pattern is daisy chaining the network
components. With daisy chaining the bus-request signals are combined pairwise with an AND
operation. The resulting request line is combined with the next bus component
in the same way. This physical ordering of the network nodes determines the access priority.
Another decentralised access method is Carrier Sense - Multiple Access / Collision
Detection (CSMA/CD). This method requires the network nodes to listen on the interconnection lines all the time. If the lines are not in use, a node can start a transmission
of its own. If multiple components try to access the bus at the same time, the nodes can
recognise this by comparing the data on the bus with the data they transmit. If such
a collision is detected, the components stop transmitting and wait for a random time
before trying again.
These arbitration methods are not limited to busses. They can be used for any other
decentralised network too.
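The following minimal sketch illustrates the priority effect of daisy chaining in its common grant-propagation form; the assumption that the component closest to the arbiter has the highest priority is made here for illustration only.

```python
# Sketch: daisy-chain arbitration. The grant propagates along the chain and
# is consumed by the first requesting component, so the physical order of
# the components determines their priority.

def daisy_chain_grant(requests):
    """requests[i] is True if component i asserts its bus-request line.
    Returns the index of the component that gets the bus, or None."""
    grant = True                    # arbiter asserts the first grant input
    for i, req in enumerate(requests):
        if grant and req:
            return i                # this component takes the bus
        grant = grant and not req   # otherwise pass the grant further down
    return None

print(daisy_chain_grant([False, True, False, True]))    # 1
print(daisy_chain_grant([False, False, False, False]))  # None (bus idle)
```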
5.2.2 Data Transmission Protocol
While the bus arbitration is responsible for allowing access to the bus, protocols organise
the data transfer between two bus nodes. Two different kinds of protocols are common.
synchronous protocol
The synchronous protocol requires the data transmission to happen concurrently with a global clock
signal. This clock rate determines the transmission speed for all network components.
Because of the synchronicity to a global clock signal this transmission scheme is very
fast and very simple. The communication partners latch the applied signal values at the
rising edge of a clock tick.
asynchronous protocol
The asynchronous transmission protocol is more complex than the synchronous
one. The transmission is not controlled by a central clock signal, but by four additional
handshake signals. These signals work in pairs assigned to the communication
partners. Each pair consists of a request-start signal applied by the sender of a message
and a request-done signal applied by the receiver of the message. The data signal may
only be updated after the request-done signal has been applied. This handshaking allows
components to have different transmission speeds, but reduces the overall transfer speed.
5.2.3 Classification
Table 5.2 displays the classification of the described simple bus.
Type: direct-dynamic
Grade: Γ_BUS = 1
Regularity: 1-regular
Diameter: Φ_BUS = 1
Symmetry: node & link
Scalability: no
Bisection-Width: W_BUS = 1
Table 5.2: Classification of a bus
The interconnection type is direct-dynamic because the bus participants are responsible for the data transmission
and the bus arbitration, and the connections between two components can be changed
through the address signals. All network nodes have only one connection to the bus and,
if connected, the transmission is done without any intermediate nodes. The grade of the
bus is one and it is 1-regular. The diameter is one. The bus is not scalable because
the medium access gets more and more difficult the more components want to share it.
If another component shall be added to an existing bus, the central arbiter has to be
extended or the priorities in a decentrally controlled network have to be changed.
5.3 Grid
Grid networks arrange their nodes in a two- or more-dimensional array. Every node is
connected to its neighbours and supports direct communication with them. Figure 5.3
displays two different kinds of grid networks. The difference between both types is that
the mesh network is irregular because the edge and border nodes have a different grade
than the other nodes. The Illiac network is based on the famous Illiac computer[35]. The
simplest versions of grid networks are 2-dimensional. The nodes are arranged in rows
and columns with the same number of nodes, as displayed in Figure 5.3. In the more
general case the number of nodes per row or column can be different and the dimension
can be more than two.
The transmission of messages between nodes is much more complex than in a ring or
bus. Multiple shortest paths exist between the source and the destination of a message.
The selection of the path is a hard decision, but will not be part of this introduction.
Closed grids often have the ability to reconfigure the interconnection of their border
and edge nodes to adapt to required communication patterns.
The disadvantage of grid networks is their long diameter. This disadvantage can be
reduced by adding more dimensions to the network, at the cost of increasing the complexity of the
path-finding algorithm.
Table 5.3 and Table 5.4 show the classification of the grid networks presented in Figure 5.3.
Figure 5.3: Example grid networks with 16 nodes: (a) open grid (mesh), (b) Illiac network
The interconnection type of both networks is direct-static because the nodes are responsible for all the communication, including path finding, and there is no possibility of reconfiguring the interconnection network.
Type: direct-static
Grade: Γ_MESH = undefined
Regularity: irregular
Diameter: Φ_MESH = 6
Symmetry: asymmetric
Scalability: no
Bisection-Width: W_MESH = 2
Table 5.3: Classification of an open grid (mesh) with 4 × 4 nodes
The mesh network is irregular, as mentioned earlier, because of the different interconnection links at the border nodes. The
longest path between two nodes is six intermediate transfers. Because of the irregularity,
the network is asymmetric. In contrast to the mesh network, the Illiac network is
4-regular. Every node has connections to exactly four neighbours. This reduces the
network diameter to three.
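For a general k × k grid these values follow the well-known closed forms below (standard results stated here for illustration, not derived in the thesis); with k = 4 they reproduce the diameters given in Table 5.3 and Table 5.4.

```latex
% Diameter of a k x k open grid (mesh) and of an Illiac-style closed grid
% with N = k^2 nodes (standard results, assumed here):
\begin{align*}
\Phi_{MESH}(k)   &= 2(k-1)     & k=4:\;&  \Phi_{MESH}   = 6\\
\Phi_{ILLIAC}(N) &= \sqrt{N}-1 & N=16:\;& \Phi_{ILLIAC} = 3
\end{align*}
```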
5.4 Tree
A tree is an undirected, connected, acyclic graph. It has exactly one root node spreading
into multiple child nodes. A node without any children is a leaf node. The depth T of
a tree is the maximum number of edges from a leaf node to the root. Many distributed
algorithms prefer this topology because the structure of the algorithm can easily be
mapped onto the nodes of a tree network, such as "Divide and Conquer" algorithms[36].
Type: direct-static
Grade: Γ_ILLIAC = 4
Regularity: 4-regular
Diameter: Φ_ILLIAC = 3
Symmetry: node-symmetric
Scalability: no
Bisection-Width: W_ILLIAC = 4
Table 5.4: Classification of a closed grid (Illiac) with 4 × 4 nodes
Trees can also be classified by the number of children per node. If we name a tree, the
maximum number of children per node is given at the beginning. For example, a 2-tree
is a binary tree with a maximum of two children per node, and a 4-tree is a quadruple
tree with a maximum of four children per node. Figure 5.4 shows exactly these two
tree networks. A tree is called complete if all nodes have all their edges assigned,
except the leaves. Table 5.5 shows the classification of a simple tree.
Type: direct-static
Grade: Γ_TREE = undefined
Regularity: irregular
Diameter: Φ_TREE = 2T
Symmetry: asymmetric
Scalability: yes
Bisection-Width: W_TREE = 1
Table 5.5: Classification of a tree
It is a direct-static network because the communication infrastructure is located within each node and
the communication partners cannot be changed. The number of connections on the leaf
nodes differs from all the other nodes, leading to an irregular and asymmetric network.
The diameter is calculated through the maximum path between nodes in the network.
The longest path in a tree runs from a leaf node on the left side of the root node to a leaf node
on the right side, leading to a diameter of 2T. The Bisection-Width is determined by the
path through the root node.
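As a quick consistency check (added for illustration, not part of the thesis): a complete k-ary tree of depth T contains (k^(T+1) − 1)/(k − 1) nodes, and the longest leaf-to-leaf path runs through the root, which yields the diameter 2T.

```latex
% Node count and diameter of a complete k-ary tree of depth T
% (standard results, used here only to check Figure 5.4 and Table 5.5):
\begin{align*}
n(k, T)     &= \sum_{l=0}^{T} k^{l} = \frac{k^{T+1}-1}{k-1}
            & n(2,3) &= 15, \quad n(4,2) = 21\\
\Phi_{TREE} &= 2T
            & \text{binary tree, } T=3:\;& \Phi_{TREE} = 6
\end{align*}
```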
5.5 Crossbar
Crossbar networks are indirect networks built out of network nodes and a network
infrastructure component, the crossbar. The crossbar interconnects all output signals of
the nodes with all their input signals. Through the crossbar configuration the nodes can
be interconnected with each other, supporting all possible permutations.
Figure 5.4: Example tree networks: (a) binary tree of depth 3, (b) quadruple tree of depth 2
Figure 5.5 displays an example crossbar with four nodes. The boxes within the crossbar
are configuration elements. By turning them on, a connection between the horizontal and
the vertical signal lines can be established. Only one active element per vertical signal
line is allowed; otherwise a conflict results. By activating multiple elements
per horizontal signal line, broadcast and multicast communication can be implemented.
Table 5.6 shows the classification of an n-node crossbar. A crossbar is an indirect-static
network because the nodes are not responsible for the routing of data and the nodes are
always connected to the crossbar. Each node has only one bidirectional connection to
the crossbar, resulting in a 1-regular system. The diameter of the network is calculated
according to the definition of the diameter for indirect networks in Section 4.2.3. Because
the crossbar network has only one level of interconnection infrastructure, the diameter
is two. A crossbar is a very flexible and fast interconnection method, but requires many
hardware resources to implement. n × n configuration elements are required to build the
crossbar. These configuration elements are often multiplexers. A 4 × 4 crossbar requires
four 4-to-1 multiplexers. This does not scale for larger crossbars. Even adding another
node is not simple because all n-to-1 multiplexers have to be replaced with (n+1)-to-1
multiplexers.
Figure 5.5: Example 4×4 crossbar networks
Type: indirect-static
Grade: Γ_CROSSBAR = 1
Regularity: 1-regular
Diameter: Φ_CROSSBAR = 2
Symmetry: node-symmetric
Scalability: no
Bisection-Width: W_CROSSBAR = n
Table 5.6: Classification of a crossbar network with n nodes
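To make the scaling argument explicit (an illustrative estimate added here; exact numbers depend on the implementation): an n × n crossbar built from multiplexers needs n multiplexers with n inputs each, so the number of crosspoints grows quadratically with the number of nodes.

```latex
% Hardware cost of an n x n crossbar built from n-to-1 multiplexers:
\begin{align*}
\text{multiplexers} &= n \quad (\text{each } n\text{-to-}1)\\
\text{crosspoints}  &= n \cdot n = n^{2} \in \Theta(n^{2})
\end{align*}
```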
6 Granularity Problem of Runtime
Reconfigurable Design Flow
Dynamic or runtime reconfiguration is becoming more and more important in FPGA
design. It enables the designer to fit more hardware onto the chip than is physically
available by swapping components in and out as required by the system. Another possible use is the optimisation of the configured hardware to runtime requirements. The
communication stack within a network switch can be optimised for the negotiated speed
(10/100 Mbit or 1/10 Gbit), or CPU cores can be improved by configuring special accelerator units. Section 2.5 gives a more detailed introduction to the Xilinx PR design flow,
which is used in this thesis.
The general steps to create a partial runtime reconfigurable system with multiple
reconfiguration components are:
1. decide on the number of reconfigurable modules
2. decide the size of each reconfigurable module
3. decide where to place each reconfigurable module
4. decide which interconnection network to use
5. describe the static system and the interconnection network in a HDL
6. describe every reconfigurable system for placing into the reconfigurable modules
in a HDL
7. synthesise, place and route the static system
8. synthesise, place and route each reconfigurable system for every reconfigurable
module
Because of the fixed decision about the size, number and placement of RMs during the
first three steps of the design flow, repositioning or resizing is impossible during
runtime.
In many designs this fixed decision is not a problem. For example, in a design with one or two
RMs and nearly same-sized reconfigurable components it is rarely necessary to
resize or reposition the RMs during runtime.
But in designs with more RMs and many different-sized components, the fixed decision
limits the flexibility and creates much slack space in the RMs.
The granularity problem describes the difficulty of choosing the right size and number of
RMs in such a system.
If different-sized components shall fit into all available RMs, most developers will
choose the maximum component size as the RM size. This will reduce the number of
configurable smaller components, but allows the configuration of all components into any
RM. Figure 6.1 displays an example granularity problem. The FPGA is divided into four
same-sized RMs.
Figure 6.1: Example granularity problem
ARM and MIPS processor cores, PIC and ATmega microcontrollers,
FSMs, and Boolean functions are available as components to configure into these modules.
The displayed system tries to solve a problem by using one ARM/MIPS processor core,
one PIC/ATmega microcontroller, and one FSM. The components easily fit onto the
FPGA, but only the ARM/MIPS core exploits all the available space in its RM. The
unused space in the other RMs is wasted because it is linked to the modules and cannot
be configured independently.
The space on the FPGA could be exploited much more efficiently if the placement of
the components were more flexible and the RM boundaries did not exist. This
would possibly allow more than one system to do computations on the FPGA.
6.1 Solutions
The following sections describe two different solutions to reduce the effects of the granularity problem on runtime reconfigurable system design. They use different floorplanning
strategies to achieve this goal.
6.1.1 Grouping Solution
A very simple solution, reducing the consequences of the granularity problem, is having
groups of different-sized RMs on the FPGA. Figure 6.2 presents an example system
using the grouping solution. The FPGA is partitioned into three regions, each holding
RMs of a different size.
Figure 6.2: Example grouping solution configuration
In this case the sizes are chosen to fit two CPU-sized components, four medium-sized
FSMs, and twelve small Boolean function components onto the FPGA. The RMs
of each group feature the same signal interface and are interconnected statically.
Advantages
Because of the same signal interface and interconnection network within each group of
RMs, converting a design from the standard PR design flow to the grouping solution
is very easy. Every reconfigurable component can be reused without adaptations. The
static system requires some small changes to the interconnection and management part
to operate the groups concurrently. In comparison to the standard flow the overhead is
very small.
49
6 Granularity Problem of Runtime Reconfigurable Design Flow
The computable outline of the design is another advantage of this solution. An algorithm with the parameters number of groups, size of the RMs in each group, and number
of RMs per group can compute the outline of the RM groups very quickly. This greatly
speeds up and simplifies the whole development process.
Disadvantages
Despite the advantages of this system, the design process requires a decision about the size, number and position of the RMs, leading to the granularity problem at some size of the overall system. A change in these parameters requires a full re-synthesis of the whole system. After configuring the FPGA with the new partitioning, all running computations are stopped and their current state is lost. Within the regions the design is still bounded by the maximum number of RMs in it.
The structure of each RM is regular, but the full system is not. Each group of RMs enforces its own signalling interface. This prevents components from being configured in RMs outside their RM group. It even prevents the development of components fitting into all RMs.
6.1.2 Granularity Solution
The granularity solution partitions the FPGA into many same sized RMs. These RMs
have the same signal interface to the interconnection network. They can be combined to
form larger components by interconnecting them through the interconnection network.
The size of one RM is the only parameter required at design time. During runtime
configuration files belonging to a reconfigurable component can be placed into any RM
on the FPGA. These RMs are not required to be positioned next to each other. Figure 6.3
presents an example partitioning. The FPGA is divided into 7 × 6 RMs. The example design contains two differently sized CPU cores at the moment, an FSM and two differently sized Boolean functions. Still, there is more space available for additional components.
Advantages
It is obvious that the placement of the reconfigurable components in this solution is very flexible and does not create as much slack space as the standard PR design flow. The number of RMs is only bounded by the size of the FPGA. At design time the number of reconfigurable components fitting onto the FPGA is unknown. All the RMs can be used for one or two CPUs or for many small Boolean functions. Any component which can be divided into multiple smaller sub-components is possible.
The regular structure of the whole system enables each entity configurable into an RM to see the system the same way from any RM. This promotes the simple development of components. The same interface for all RMs supports this simple development too.
[Figure 6.3: Example granularity solution configuration — the FPGA divided into 7 × 6 equally sized RMs, currently holding two CPU cores (CPU1, CPU2), an FSM and two Boolean functions f1(B) and f2(B), with free RMs remaining]
Disadvantages
The disadvantages of the granularity solution start with the decomposition of the reconfigurable components into smaller components fitting into one RM. The decomposition
and the different signal interface prevent the re-usage of the reconfigurable components of
a standard PR design. The decomposition is also not a simple task. It is not guaranteed
that all components can be divided into smaller parts.
Another disadvantage is the interconnection network. It has to span the whole FPGA
connecting all RMs. This requires additional FPGA space. The number of RMs and the
used interconnection/management space has to be balanced to get a good design. The
path delay of the interconnection lines between the RMs can be another problem. They might not be fast enough to support the connection speeds required within reconfigurable components.
6.2 Granularity Problem and Hybrid Hardware
The granularity problem occurs on any runtime RS where multiple reconfigurable components of different sizes shall be used. This is also the case in the scenario of coupling processor cores and reconfigurable hardware introduced in Section 1.2. The standard methods to couple processors with reconfigurable hardware are datapath accelerators, bus accelerators and multicore reconfiguration. Datapath accelerators commonly use a very small area, bus accelerators are medium sized, and multicore reconfiguration requires much space on an FPGA. Figure 6.4 gives a graphical overview of these space requirements.
[Figure 6.4: Area requirements of the different usage patterns — (a) datapath accelerator, (b) bus accelerator, (c) multicore reconfiguration]
Each pattern has its unique type of use. Datapath accelerators are used to increase instruction flexibility: they allow appending different instructions to the processor's ISA. Bus accelerators are the most common usage pattern at the moment; they allow configuring different kinds of accelerators into the reconfigurable area and connecting them to the processor through a bus. With the multicore reconfiguration pattern, the reconfigurable area is used to instantiate multiple processor cores, which can run on their own or form a multicore system. In this work, all these connection methods shall be combined into one system, leading to the granularity problem.
7 Multicore Reconfiguration Platform Description
After introducing the basics of reconfiguration and NOCs and describing the granularity
problem of runtime reconfigurable design flows, this chapter presents the main part of
this thesis, the Multicore Reconfiguration Platform (MRP).
The MRP is a hybrid hardware system. In contrast to the existing research- and
commercially available systems, the MRP uses the Xilinx PR design flow to implement
its reconfigurability. The use of dynamic- or runtime reconfiguration helps to solve the
granularity problem by using the granularity solution presented in Section 6.1.2. This
granularity solution enables the MRP to support multiple different sized reconfigurable
components without taking component sizes into account at the initial floorplanning stage.
Inter-FPGA connections are another new feature of the MRP. A packet switched network, called OCSN, can interconnect multiple FPGAs. Figure 7.1 displays an overview of an example MRP system consisting of three FPGAs.
[Figure 7.1: Example MRP System Overview — a support platform with a soft-core and a connection to the host system, and two reconfiguration platforms, all interconnected through the OCSN]
By adding more FPGAs to
the OCSN , the reconfiguration area of the MRP is easily extensible. This extensibility
helps if applications require more reconfiguration space during runtime.
As Figure 7.1 shows, an MRP system is divided into support and reconfiguration platforms. The former provides access to system resources through the OCSN, such as BRAM, DDR RAM, General Purpose Input Output (GPIO), USB controllers and mass storage; the latter provides many RMs. This setup allows a maximum of reconfigurable
space, while still supporting additional hardware resources. The number of platforms is
only limited by the addressing space of the OCSN .
The platforms and the host system, such as a server or workstation, are also connected
through the OCSN. To support a high speed connection between the MRP and its host system, the connection is implemented using 1Gbit Ethernet as its physical layer. As an
alternative to a full featured host system, the support platform can provide a soft-core
SoC connected to the OCSN . This SoC can control the MRP and distribute hardware
applications.
Except for the Convey HC1, most of the other hybrid systems lack direct operating system support. The MRP is directly integrated into the Linux OS. The device
drivers provide a network API to communicate with all OCSN components and to
configure the RMs.
The remainder of this chapter introduces the OCSN in Section 7.1, the support platform in Section 7.2 and the reconfiguration platform in Section 7.3. Furthermore, it
describes the OS support in Section 7.4 and the design flow for working with the MRP
in Section 7.5.
7.1 On Chip Switching Network
The requirements for a NOC which interconnects the support and reconfiguration platforms are diverse.
First, the NOC has to support the interconnection of multiple FPGAs with different
physical connections and variable signal lengths. FPGA boards can be interconnected
by Ethernet, CAN, simple wires using some kind of serial protocol like SPI or RS232, or
other interconnection schemes.
Scalability is another very important requirement. Adding another platform or component should not lead to the reconstruction of the whole NOC.
The network should support broadcast and unicast connections because information
has to be distributed through the network very fast and certain components require a
lot of data transfer.
Because many components participate in this network, the hardware requirements for
connecting one component to the network should be as small as possible.
Most networks cannot satisfy all these requirements. For example, a bus is not scalable and does not permit multiple components to communicate concurrently. But a static indirect packet switched network fulfils all the requirements.
The OCSN is a static indirect packet switched network. It supports the interconnection of multiple FPGA boards by using bridges over different physical connections and different protocols. It is scalable, within limits, by adding components to network switches
and by increasing their number. Broadcast and unicast packet transmission is supported
by routing all broadcast packets to all outgoing connections of a network switch. The
usage of network switches for most of the network organisation reduces the interface size
in the network devices.
The OCSN uses the OSI model to divide functionality into layers to ease the adaptation to different hardware and software and to standardise the interconnection points. Therefore, the OCSN description starts with the definition of the physical layer, walking up to the application layer. All these layers are implemented in hardware, without the usage of additional micro-controllers, to save configuration space on the FPGAs.
Clock     Bit-width   Speed
200 MHz   8           1.267 Gbit/s
200 MHz   12          2.235 Gbit/s
200 MHz   26          4.843 Gbit/s
100 MHz   8           0.634 Gbit/s
100 MHz   12          1.118 Gbit/s
100 MHz   26          2.421 Gbit/s
Table 7.1: Variable speed of the OCSN
7.1.1 Physical Layer
At the physical layer, always exactly two network interfaces are connected to each other. Each interface transmits a full OCSN frame of 39 bytes in one transfer. Using such large frames in one transfer often leads to transmission errors. In this case, however, the network spans mostly over one FPGA, reducing the error probability approximately to zero. The simple approach of transmitting a full frame at once reduces the area usage for each network interface. In this case, the advantage of reduced area usage outweighs the disadvantage.
The 39bytes of each transfer are divided into a configurable number of bits, transmitted
concurrently at each clock tick. The allowed bit-widths are {x : 312 mod x = 0}bits
because 39bytes × 8bits = 312bits. Full duplex mode, by using dedicated transmission
and reception lines, is also supported. The typical clock rates at this layer are 100MHz
and 200MHz, resulting in the maximum network speed displayed in Table 7.1.
7.1.2 Data-link Layer
The data-link layer of the OCSN is responsible for detecting and identifying the remote
device. To prevent overflowing of the receive buffer, it implements hardware flow control
between the two directly coupled interfaces. If the receive buffer of one interface hits
an upper bound, it signals the other interface to stop transmitting. If, after stopping the transmission, a bottom bound is reached, the interface requests the continuation of the transmission.
The data-link layer of the OCSN does not provide any error detection/correction methods because the error probability, if configured on an FPGA, is very small. But this feature can easily be added if required.
7.1.3 Network Layer
The network layer defines everything required for routing OCSN frames through the network to the correct destination. Figure 7.2 displays the structure of one OCSN frame. It is built out of source and destination addresses, additional source and destination port fields, a frame type field and the payload of the frame. For the network layer, the 16bit source and destination addresses are of interest.
[Figure 7.2: OCSN frame description — 16-bit SRC and DST addresses, SRC and DST port fields, a frame type field and 31 bytes of DATA]
The network infrastructure components of the OCSN are OCSN switches. They are organised in a tree structure to reduce routing complexity. A grid network would be faster and more flexible because different routes between two components exist, but it would increase the routing overhead. A big disadvantage of a tree is its bisection width of one: regardless of how a network organised in a tree structure is divided, the maximum number of connections between the two halves is always one. This leads to a big bottleneck if components from one side have to communicate intensely with components on the other side. This disadvantage can be reduced by interconnecting all switches of one level in a ring, but this is not applicable in this network because the tree spans over multiple FPGAs. Furthermore, most of the components in this network will communicate with their direct neighbours. This communication will usually take place over one switch.
All of these OSI layers have to be implemented in hardware, without the usage of additional micro-controllers. To generate this hardware with a very small area footprint, the advantages of simple routing outweigh the bandwidth disadvantages in this case.
An example OCSN, consisting of OCSN switches only, is displayed in Figure 7.3.
[Figure 7.3: OCSN network structure overview — a root switch with address 1.0.0.0.0.0, second-level switches 1.1.0.0.0.0 and 1.2.0.0.0.0, and third-level switches 1.1.1.0.0.0, 1.1.2.0.0.0, 1.2.1.0.0.0 and 1.2.2.0.0.0]
The example network is organised as a binary tree, but more outgoing edges per OCSN switch are also possible. Switches are only specialised network devices. This flexible
design allows replacing switches by any other component and using switch ports for
switches and devices without reconfiguring the system.
To get routing working in this tree network, the 16bit network addresses have to correspond to the tree structure of the network. Therefore, the addresses are divided into
the six parts shown in Figure 7.4. To support broadcast and unicast in the network,
the first bit (r) of an address selects broadcast or unicast mode. The remaining bits are
partitioned into five groups of three bits each. In the figure these groups correspond to
the coloured characters a1a2a3 . . . e1e2e3. If the value of r is one, the address 1.0.0.0.0.0 identifies the root node of the tree. Looking at Figure 7.3, the root node is the top switch. The switches generate the tree, while devices are the leaves of the tree. Switches always own an address starting with a zero at their group.
The second group consists of the bits a1a2a3 and addresses all tree components directly connected to the root switch. They are the second-level components of the tree. The bits b1b2b3 identify all components directly connected to switches of the second level, as shown in Figure 7.3. This scheme continues until group e1e2e3, which identifies all components connected to switches of the fifth level. The sixth level cannot hold any more switches because there are no addresses left. This limitation can easily be removed by extending the address space.
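As a small illustration of this scheme, the sketch below splits a 16bit OCSN address into the broadcast/unicast bit r and the five 3-bit groups. It is only one reading of Figure 7.4: the assignment of r to the most significant bit, the exact bit positions of the groups and all names are assumptions for illustration, not part of the MRP sources.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch: decomposing a 16 bit OCSN address into its parts.
-- The mapping of r to the MSB and the group bit positions are assumptions.
entity ocsn_addr_split is
  port (
    idAddr   : in  std_logic_vector(15 downto 0);
    odR      : out std_logic;                     -- broadcast/unicast bit r
    odGroupA : out std_logic_vector(2 downto 0);  -- a1 a2 a3: second tree level
    odGroupB : out std_logic_vector(2 downto 0);  -- b1 b2 b3: third tree level
    odGroupC : out std_logic_vector(2 downto 0);
    odGroupD : out std_logic_vector(2 downto 0);
    odGroupE : out std_logic_vector(2 downto 0)   -- e1 e2 e3: sixth tree level
  );
end entity;

architecture rtl of ocsn_addr_split is
begin
  odR      <= idAddr(15);
  odGroupA <= idAddr(14 downto 12);
  odGroupB <= idAddr(11 downto 9);
  odGroupC <= idAddr(8 downto 6);
  odGroupD <= idAddr(5 downto 3);
  odGroupE <= idAddr(2 downto 0);
end architecture;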
This addressing scheme enables all switches in the network to identify their uplink and downlink ports by checking the addresses of all connected devices. One advantage of a tree is the existence of only one route from one component to another. This reduces the routing decision to identifying the uplink of a switch and calculating to which of the connected switches an address belongs. Frames with a broadcast destination are transmitted to all ports except the incoming one.
Because all frames in the OCSN have the same size of 39bytes, no framing or padding
is required.
7.1.4 Transport Layer
To access the interconnected components, the network has to transport frames. In
this scenario, the network is required to transmit configuration data, request status
information, or access some kind of RAM. Because of the small error probability and the fact that frames cannot be reordered while transmitted through the network, no connection oriented transport protocol is required. Instead, a connectionless, UDP-like protocol is responsible for the data transport within the OCSN. The protocol features 8bit source and destination ports (Figure 7.2) and an 8bit frame-type field to identify
the service at the destination. The maximum payload length is 31bytes. The frames
are routed from source to destination using the network layer. If a service is listening
at the destination on the destination port, the payload is processed and an answer is transmitted.
[Figure 7.4: OCSN address structure — one broadcast/unicast bit r (r = 0: broadcast address, r = 1: unicast address) followed by five 3-bit groups a1a2a3, b1b2b3, c1c2c3, d1d2d3 and e1e2e3]
7.1.5 Session Layer
The session layer starts and tears down connections in a connection oriented protocol.
Because the transport layer of the OCSN only specifies a connectionless protocol, the session layer is not required.
7.1.6 Presentation Layer
As in the TCP/IP suite, the presentation layer is merged into the application layer. The main purpose of the merged presentation layer is to ensure that all information in an OCSN frame is in big endian byte order.
7.1.7 Application Layer
Accessing components in the OCSN requires different application layer protocols. The main distinction between these protocols is whether they require an answer frame or not. Usually it is enough to send one frame to a destination device to set registers or to request information. Still, the application layer defines the structure of the payload.
Looking at the communication with an OCSN-connected RAM, the access mode (read, write), the access size (byte, word, double-word, . . . ) and the data for a write operation have to be encoded into the payload of an OCSN frame. In the case of a frame sent to a BRAM connected to the OCSN, the first byte of the payload identifies the operation to perform. Bytes eight downto five encode the RAM address and bytes twelve downto nine encode the data word. In the answer frame from the BRAM, the first byte signals what kind of answer this frame holds and bytes eight downto five encode the first data word. If more data words are requested from the BRAM, they are encoded after the first word.
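As a minimal sketch of this encoding, the following process packs a write request for an OCSN-connected BRAM into a 31-byte payload. The byte positions follow the description above; the mapping of payload bytes to bit positions, the opcode value and all names are assumptions for illustration, not the original MRP sources.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch: packing a BRAM write request into a 31 byte OCSN payload.
-- Byte 1 holds the operation, bytes 8 downto 5 the RAM address and
-- bytes 12 downto 9 the data word; the opcode value is an assumption.
entity bram_write_request is
  port (
    idAddr    : in  std_logic_vector(31 downto 0);        -- RAM address
    idData    : in  std_logic_vector(31 downto 0);        -- data word to write
    odPayload : out std_logic_vector(31*8 - 1 downto 0)   -- 31 byte payload
  );
end entity;

architecture rtl of bram_write_request is
  constant OP_WRITE4 : std_logic_vector(7 downto 0) := x"02";  -- assumed opcode
begin
  pack : process(idAddr, idData)
    variable payload : std_logic_vector(31*8 - 1 downto 0);
  begin
    payload := (others => '0');
    payload(7 downto 0)          := OP_WRITE4;   -- byte 1: operation to perform
    payload(8*8 - 1 downto 4*8)  := idAddr;      -- bytes 8 downto 5: RAM address
    payload(12*8 - 1 downto 8*8) := idData;      -- bytes 12 downto 9: data word
    odPayload <= payload;
  end process;
end architecture;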
7.2 Support Platform
The support platform combines all system resources of one FPGA board, including
off-board extensions, into one platform. Using a distinct FPGA board reduces the
space requirements for the reconfigurable platforms because no additional hardware is
required. The reconfigurable platforms can concentrate on providing reconfigurability.
Figure 7.5 presents an example support platform with all supported FPGA resources.
These resources are connected through an interface to the OCSN . At the moment the
following components are supported:
• GPIO
• BRAM
• DDR RAM
[Figure 7.5: Example support platform — an OCSN switch on the FPGA connecting BRAM, GPIO, DDR RAM and a soft-core SoC, with Ethernet/UART uplink and downlink bridges]
In addition, an uplink and a downlink device exist to connect a host system or other platforms to this FPGA. Two alternative devices are available: a UART-based and an Ethernet-based bridge.
7.2.1 GPIO
For querying and inserting debug data out of/into the OCSN, the GPIO component is very helpful. Outgoing GPIO signals can be set to certain values and drive, for example, Light Emitting Diodes (LEDs). By sending status request frames, the settings of a connected Dual Inline Package (DIP) switch can be checked using the polling approach. It would also be possible to implement interrupts by sending an OCSN frame out if a DIP switch changes its status.
7.2.2 BRAM
The FPGA used for the support platform has BRAM resources left after using much of them for buffers in the OCSN. These BRAMs can be combined to form a BRAM OCSN device. It allows access to the RAM from the OCSN with different access modes. The following access modes are supported at the moment:
READ{length} read a data word of length bytes
WRITE{length} write a data word of length bytes
SWAP{length} atomic swap of a data word of length bytes
The supported values for length are 4, 8, 16, 32, 64 and 128 bytes. For initialising
the RAM , two commands are available:
INIT ZERO initialise the RAM from a given start address and some 4 byte words
with “00000000000000000000000000000000”
INIT ONE initialise the RAM from a given start address and some 4 byte words with
“11111111111111111111111111111111”
The following commands are planned as future extensions to support concurrent access to the RAM from different OCSN devices.
LOCK lock the device for use by the source of this command only
UNLOCK unlock the device for use by everyone; only possible from the same device which sent the lock command, or from some master device, to prevent a deadlock
LOCK RANGE lock part of the address space for use by the source of this command
only
UNLOCK RANGE unlock a previously locked address space
LIST LOCKS list all enforced locks
7.2.3 DDR3 RAM
This component uses the same interface and access model as the BRAM device. The difference is the use of a DDR RAM controller instead of a BRAM one.
7.2.4 UART Bridge
To get a very simple option to connect additional off-board components and additional FPGA boards to a support or reconfiguration platform, the UART bridge is used. It is built out of one OCSN interface and a UART. The interface receives an OCSN frame and the UART transmits every byte of the frame through RS232 to the remote device.
In the other direction the UART receives exactly 39bytes and transmits these bytes as a
frame through the OCSN interface. The bridge sends end of frame synchronisation bytes
to the remote bridge through the UART by using the parity bit to distinguish between
data and control bytes. This interconnection method is very slow (max 2Mbps), but is
stable and requires only three wires.
7.2.5 Ethernet Bridge
For connecting the OCSN to the host system and other FPGA boards, a high speed
connection is essential. The Ethernet bridge encapsulates an OCSN frame into an Ethernet frame and transmits it over a 1Gbit Ethernet network device. Crossover cables and
switches between the Ethernet bridge and the remote station are supported. The maximum bandwidth of 1Gbit Ethernet cannot be achieved because the Ethernet packets
transmitted and received are always 60bytes long. The maximum Ethernet payload size
is 1500 bytes. Still, a maximum throughput of 465Mbit/s is possible.
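A rough plausibility check of this figure, under the assumption that each 39-byte OCSN frame travels in one minimum-size Ethernet frame (60 bytes plus 4 bytes FCS) together with the 8-byte preamble and the 12-byte inter-frame gap:

\[
1\,\text{Gbit/s} \times \frac{39\,\text{bytes payload}}{(60 + 4 + 8 + 12)\,\text{bytes on the wire}} \approx 464\,\text{Mbit/s},
\]

which matches the stated maximum throughput up to rounding.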
7.2.6 Soft-core SoC
A soft-core SoC consists of at least one processor core and additional components for
storing program code and data input/output. Soft-core SoCs, provided by the support
platform, can replace a full featured host system, such as a server or workstation, for controlling the MRP. The MRP supports only the PRHS SoC , written by Eckert[5], at the
moment. The integration into the OCSN has been done by Grebenjuk[37]. The PRHS
runs Linux as its OS. Access to the OCSN is implemented through a communicator
device and a network card device driver for Linux.
7.3 Reconfiguration Platform
The reconfiguration platform provides the reconfigurable resources for the MRP. The
prototype uses Xilinx Virtex5 FPGAs at the moment and requires the availability of
the Xilinx PR design flow. Figure 7.6 presents an example reconfiguration platform. It
is divided into a reconfiguration module, supplying many same sized RMs, and the infrastructure connecting host systems or additional FPGAs. The reconfiguration module
encapsulates all the structure required for runtime reconfiguration into one component.
This encapsulation simplifies the instantiation of the runtime reconfiguration on different
FPGAs because the FPGA specific requirements can be implemented without interfering
with the runtime reconfigurable implementation.
The connection infrastructure is basically the same as on the support platform.
Bridges to and from the OCSN are used to provide the interconnection functionality.
The reconfiguration module uses the granularity solution, presented in Section 6.1.2,
to reduce the effects of the granularity problem, while partitioning the FPGA into many
RMs. These RMs are called Configurable Entity Blocks (CEBs) because they can be configured with entities of the Register Transfer Layer (RTL), not only of the logical layer.
These CEBs are interconnected by a CSN for combining them into larger components.
[Figure 7.6: Example reconfiguration platform — the reconfiguration module with CEBs, CSN switches (SW) and IOBs, an ICAP, and OCSN switches with Ethernet/UART uplink and downlink bridges]
The Internal Configuration Access Port (ICAP) of Xilinx Virtex{5,6,7} devices is
used to configure the CEBs through the OCSN.
7.3.1 ICAP
Like the support platform with its resources, the reconfiguration platform has one important device, the ICAP. The ICAP configures the CEBs of the reconfiguration module during runtime of the system. It is connected to the OCSN and accepts up to seven 32bit configuration words in one OCSN frame. These configuration words are written to the ICAP at 50MHz at the moment, but the rate can be increased up to 100MHz. The maximum configuration speed is 381 MB/s at 100MHz.
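This value is consistent with one 32bit configuration word being written per clock cycle, reading the stated MB as binary megabytes (an assumption about the intended unit):

\[
32\,\text{bit} \times 100\,\text{MHz} = 400 \times 10^{6}\,\text{byte/s} \approx 381\,\text{MiB/s}.
\]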
7.3.2 CEB
The CEB is the main building block of the MRP. It is the one component providing
the reconfigurability of the system. Different components can be configured into a CEB.
All the CEBs in the reconfiguration module have the same size and provide the same
static signal interface to the interconnection network. Figure 7.7 describes this signal interface.
[Figure 7.7: CEB Signal Interface — clock inputs ic25MhzClk, ic50MhzClk, ic100MhzClk and ic200MhzClk, control inputs icReset and icEnabled, the 8-bit odID output, ocDebug, the 4-bit single-line ports idSingle and odSingle, and the 128-bit signal cluster ports idBus and odBus]
Every CEB has four different clock inputs, reducing the hardware complexity
in a CEB for additional clock dividers. A clock divider is only necessary if none of the
provided clock rates (25, 50, 100 and 200MHz) fit into the design. The clock signals are
generated on the FPGA for system wide usage. They are not distributed through the
CSN , but use the dedicated clock lines of the FPGA.
After the configuration of a component into a CEB, the state of the component is unknown. For setting it into a known state, a reset signal (icReset) exists.
During the configuration process the values of the input/output signals can fluctuate. To prevent flooding the whole MRP with invalid data, the components have to be disabled during the configuration process. All components developed to fit into a CEB have to react to the active high icEnabled signal. It also starts a component at a specific moment in time.
The MRP requires a way to evaluate which CEB is already configured and what kind of component is using the CEB. This is achieved through the eight bit odID signal. If the CEB is empty, the signal is not driven by any component. The signal is configured at the FPGA level with a pull-up, returning 0xFF in the empty state. Each possible component has been assigned a distinct id, which has to be put onto odID.
A debugging signal (ocDebug) is also available to connect one CEB to off-chip components, such as an LED or a logic analyser.
For receiving and transmitting data from and into a CEB, two kinds of input/output signals exist. The first are simple single lines: idSingle provides four single input lines and odSingle four single output lines in this example. The second kind of input/output signals are signal clusters. Signal clusters are useful for designing busses or register input/output. In this example the CEB supports four 32bit signal clusters (idBus, odBus). The number of signals is chosen as small as possible to be easily routable onto the FPGA and large enough to support a wide range of components.
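For reference, the signal interface of Figure 7.7 can be written down as a VHDL entity declaration roughly as follows; port names and widths are taken from the figure, while the port ordering and exact types are assumptions and the declaration is not the original MRP source.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the static CEB signal interface from Figure 7.7 (declaration only).
entity ceb is
  port (
    ic25MhzClk  : in  std_logic;   -- system wide clocks on dedicated clock lines
    ic50MhzClk  : in  std_logic;
    ic100MhzClk : in  std_logic;
    ic200MhzClk : in  std_logic;
    icReset     : in  std_logic;   -- puts a freshly configured component into a known state
    icEnabled   : in  std_logic;   -- active high; component must stay quiet while low
    odID        : out std_logic_vector(7 downto 0);   -- component id, pulled up to 0xFF when empty
    ocDebug     : out std_logic;   -- debug line to off-chip components
    idSingle    : in  std_logic_vector(3 downto 0);   -- four single input lines
    odSingle    : out std_logic_vector(3 downto 0);   -- four single output lines
    idBus       : in  std_logic_vector(127 downto 0); -- four 32 bit input signal clusters
    odBus       : out std_logic_vector(127 downto 0)  -- four 32 bit output signal clusters
  );
end entity;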
7.3.3 CSN
To interconnect CEBs within the reconfiguration module, different requirements have to be met. The signal interface currently requires four single signals and four clustered signals for each CEB, but this requirement can change in the future. Because of this possible change, the interconnection network should be scalable in the number of signal lines it can support.
Most larger components at the RTL synchronise with each other by using a global clock signal. To support such larger components on the MRP, low latency signal lines are very important because the largest latency determines the maximum achievable clock rate. In this case the clock signals use dedicated signal lines of the FPGA to connect to each CEB. Still, the data has to travel from one CEB to another, and the latency of these transmissions determines the usable clock rates.
The network may be divided into fast localised signals, tightly interconnecting a small group of CEBs, and long distance signals interconnecting these groups. The latter are allowed to have a slightly higher latency.
To form larger components one CEB possibly has to connect to multiple different
other CEBs or to connect to one other CEB multiple times. These connection schemes
require the network to support multipath links and multiple routes from a source to a
destination.
These requirements suggest a dynamic indirect circuit switched network. Through the dynamic part, connections can easily be changed, rerouted and even shared among CEBs. The indirect aspect reduces the space requirements for the network interface hardware, as done with the OCSN. To use single signals and signal clusters as the main kind of communication, a circuit switched network is best suited because the signal lines can simply be routed to their destination. It is not necessary to sample the signals and transmit the results in a multibyte frame. This reduces the latency for all signals.
The following sections describe this network in more detail, by using the OSI model.
Physical Layer
The physical layer of the CSN uses the communication infrastructure of the underlying FPGA. The FPGA provides a low latency network connecting all the CLBs. This network is best suited to serve as the physical layer for the CEB interconnection because it has the same base requirements. Additional parameters, enforced by the application in use, have to be implemented inside each CEB.
Data-link Layer
The data-link layer is not necessary in this network because no actual data is transmitted, just a direct connection established. If an application uses the CSN to transmit data, it has to implement its own data-link layer.
Network Layer
The CSN is an indirect network built out of crossbar switches. A crossbar interconnects
all inputs to its outputs (see Section 5.5). Only one permutation of these connections
is possible at one moment. In this network each input has a corresponding output and
two different kinds of inputs/outputs exist. The first kind are single signals and the
other clustered signals. The inputs/outputs are divided between the connected CEBs
and extension devices. The extension device inputs/outputs are used to interconnect
the switches. In Figure 7.6 four CEBs are connected to one switch and the switches
are interconnected in a grid (see Section 5.3). Because the connections at the end of
each row and column of the grid are open, this connection scheme is called a mesh. The
number of inputs/outputs of a switch can be easily increased to support more CEBs,
more extension devices or more inputs/outputs for each of them, at the cost of a higher area usage on the FPGA.
Figure 7.8 gives a more detailed view of the connection interface of one switch in
the example network. The inputs/outputs are numbered from 31 downto 0.
[Figure 7.8: CSN group — one CSN switch connecting CEB0 to CEB3 (ports 31..28, 27..24, 23..20 and 19..16) and extension ports (15..12, 11..8, 7..4 and 3..0)]
Signals 31 downto 28, 27 downto 24, 23 downto 20 and 19 downto 16 are always reserved for
connecting CEBs. All switches are programmable through the OCSN by sending configuration frames for single or clustered signals to them. Through status requests the MRP controller can read the current crossbar configuration and what kind of components are configured into a CEB. Through the programming interface the MRP controlling device
can select which input is connected to which output. By programming different switches
all CEBs connected to all the switches can be interconnected.
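To make the switching idea concrete, the following sketch shows a single-signal crossbar in which every output carries the input selected by a small routing register; loading these registers from OCSN configuration frames is omitted. The generic, the port names, the fixed 5-bit select width and the register layout are assumptions for illustration and do not reproduce the original CSN switch implementation.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch of a single-signal crossbar with per-output select registers.
entity csn_crossbar_single is
  generic (
    gen_ports : natural := 32                    -- number of single-signal ports
  );
  port (
    icClk    : in  std_logic;
    icWe     : in  std_logic;                    -- write one routing entry
    idPort   : in  unsigned(4 downto 0);         -- output port to (re)configure
    idSelect : in  unsigned(4 downto 0);         -- input routed to that output
    idIn     : in  std_logic_vector(gen_ports - 1 downto 0);
    odOut    : out std_logic_vector(gen_ports - 1 downto 0)
  );
end entity;

architecture rtl of csn_crossbar_single is
  type sel_array is array (0 to gen_ports - 1) of unsigned(4 downto 0);
  signal sel : sel_array := (others => (others => '0'));
begin
  -- routing table: one select register per output, written through icWe
  config : process(icClk)
  begin
    if rising_edge(icClk) then
      if icWe = '1' then
        sel(to_integer(idPort)) <= idSelect;
      end if;
    end if;
  end process;

  -- combinational crossbar: every output follows its selected input
  route : for o in 0 to gen_ports - 1 generate
    odOut(o) <= idIn(to_integer(sel(o)));
  end generate;
end architecture;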
Transport, Session, Presentation and Application Layer
All OSI layers above the network layer have to be implemented by the application/component using the CSN for interconnections. The CSN does not provide any interface
for a transport protocol or any application layer protocols.
7.3.4 IOB
Like any digital hardware component, the interconnected CEBs have to communicate
with the outside world at some point in time. Parameters and results of computations
have to be fed into and out of the components. This is done by using IOBs. The IOBs
of the MRP are very similar to the IOBs of FPGAs. On FPGAs they are connected
to the pins of the chip housing. They allow components on the FPGA to communicate
with off-chip components.
The MRP supports two different kinds of IOBs. Both are connected to the extension
ports of a CSN switch and to an OCSN switch.
CSN2OCSNsimple bridge The CSN2OCSN simple bridge maps the signals of the
extension ports to internal registers. These registers can be read and written using
OCSN network frames. By reading the registers, the values of the connected signal
lines can be identified and the outgoing signals can be set to special values. This
component is very useful for debugging the CSN because the value of every signal can
be read and written. The disadvantage of this bridge is that it cannot react to fast changing signals because the OCSN requires multiple clock ticks to transmit a frame.
CSN2OCSNbridge The CSN2OCSN bridge is the preferred IOB for the MRP. It maps a normal OCSN IF to the CSN physical layer. A component in a CEB is connected to the CSN2OCSN bridge with two 32bit input busses and two 32bit output busses. One input and one output bus are responsible for data transfer and the other
for control lines. The CEB can create a full OCSN frame by providing data at its
output bus and selecting, through the control lines, which part of the frame to set.
For example, to set the source and destination addresses of the OCSN frame, the
component writes the source address to the upper 16bit of the data bus and the
destination address to the lower 16bit. Then it selects the input zero, through the
control lines. Reading an OCSN frame works very similarly. The component selects which part of the frame to read through the control lines and can read the data
through the data input bus. All control signals from the OCSN IF component are
mapped to the control bus, within the CSN . All data signals are selectable through
the control signals and can be read and written through the data bus.
7.4 Operating System Support
A system like the MRP requires some kind of controlling master component, such as a
workstation, server or soft-core SoC . But providing the hardware is not enough. The OS
of these systems has to support the MRP and the concept of reconfigurable hardware.
For the host systems of the MRP, Linux was chosen as the OS because its source code is
available as open-source and it is running on most platforms, including the PRHS SoC .
Linux is a UNIX-like operating system[38]. It is built out of the Linux OS kernel and additional applications. Device drivers extend the Linux kernel and integrate additional
hardware and network protocols.
There are two interfaces from the MRP to the host system. An Ethernet bridge
(Section 8.2.4) and a native memory mapped OCSN device for the PRHS SoC . Both
have to be integrated into the Linux kernel for accessing the OCSN and the components configured into the CEBs.
The OS support is partitioned into the implementation of the network driver and
the device driver. The network driver is responsible for the socket interface. It is the
interface to the Linux user space. Programmers get access to the OCSN using socket
programming. The device driver is responsible for copying the OCSN frames from and
to the hardware. For the PRHS memory-mapped IO device, the driver copies data between memory addresses and internal kernel structures. For the Ethernet bridge
this is not necessary because device drivers for Ethernet cards are already available in
the kernel.
The implementation of the OS support is described in Chapter 9. Accessing the
components connected to the OCSN is done through user space programs at the moment.
The following programs are available:
lsocsn list all devices connected to the OCSN
ocsn-ping check if a device is alive and get its round trip time
ocsn-switch-status get the status of an OCSN switch (free/used ports, connected
devices, received/transmitted frames)
ocsn-file2icap copy a partial bitfile to an ICAP for configuration
ocsn-file2ram copy a file to a RAM device
ocsn-ram2file copy part of a RAM to a file
ocsn-print-ram print part of a RAM to the output
ocsn-init-ram initialise part of a RAM to a given value
lscebs list all CEBs connected to all CSN switches
ocsn-csn-status get the status of a CSN switch (connected CEBs, if active or not)
ocsn-csn-get-routing print the routing information of one CSN switch
ocsn-csn-set-single set the routing for a single signal
ocsn-csb-set-bus set the routing for a clustered signal
ocsn-csn-ceb-on activate a configured CEB
ocsn-csn-ceb-off deactivate a configured CEB
7.5 Design Flow
At the moment the MRP only supports the Xilinx PR design flow (see Section 2.5), which is the base for the MRP design flow. The MRP design flow can be divided into a full design flow, in which all components including the static MRP system are synthesised, placed and routed, and a reduced design flow, in which only the CEB components are synthesised, placed and routed. Figure 7.9 presents the eight-step full design flow.
1. create/adapt the static MRP system in Very High Speed Integrated Circuits
HDL (VHDL)
2. add VHDL entities for using as CEB components
3. create the netlist for the static system, using CEBs as black-boxes
4. place and route the static system
5. create bitfile for the whole system with CEBs as black-boxes
6. create netlists for all the CEB components
7. place and route the static system including one CEB component at a time
8. create bitfiles for the whole system, including one CEB component and partial
bitfiles for each CEB component and every CEB
Figure 7.9: full MRP design flow
The first five steps are required to create the bitfile for an MRP system without any CEB components. After configuring the FPGA with the created bitfile, all CEBs are empty. The last three steps create bitfiles
for all the CEB components. The normal Xilinx PR design flow would create all these
components successively. The MRP design flow uses a parallel approach.
The reduced design flow displayed in Figure 7.10 assumes that the MRP static system is already created and running on an FPGA. The already available placement and routing
information is used in the reduced design flow to place and route the components for
the CEBs only.
1. add VHDL entities for using as CEB components
2. create netlists for all the CEB components
3. place and route the static system including one CEB component at a time
4. create bitfiles for the whole system, including one CEB component and partial
bitfiles for each CEB component and every CEB
Figure 7.10: reduced MRP design flow
8 Implementation of the Multicore Reconfiguration Platform
After introducing the MRP in the previous chapter, this chapter describes the implementation of the important MRP components.
8.1 General Components
In the design process of digital circuits some components are reused constantly. These
components provide common functionality, like FIFO queues, small BRAM , decoders,
and encoders. The general components, used throughout the MRP, are described in the
following subsections.
8.1.1 Clock Domain Crossing
In larger digital circuit designs multiple different clock domains may exist. One clock
domain contains all the digital components running at one specific clock rate, for example
25MHz. Often data has to cross the boundary of two clock domains, differing in speed
and polarity. Special actions are required to ensure the integrity of the data. The
problem of clock domain crossing is described, among others, by Biddappa[39].
[Figure 8.1: Clock Domain Crossing (CDC) component interface — CDC_fifoIF with generic gen_data_size and ports icWriteClk, icReadClk, icReset, icWe, icRe, idData, odData, ocFull and ocDataAvail]
The CDC_fifoIF, displayed in Figure 8.1, is a simple component for clock domain
crossing, using the recommended solution of Biddappa. It uses a FIFO queue interface
to connect to other components, allowing it to replace FIFO queues, which are often used
to cross domain boundaries. The usage of FIFO queues is often very expensive because
they are built out of a scarce resource, BRAM. Not all designs/components require a queue at the domain boundaries. In these cases the CDC_fifoIF can replace them.
Internally a handshake protocol and multiple register stages move the data to the
other clock domain. The handshake protocol drives the external FIFO signals ocFull and
ocDataAvail. The sizes of the data signals (idData, odData) are configurable through a
generic, a VHDL parameter for configuring individual components.
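Written as a VHDL entity declaration, the interface of Figure 8.1 looks roughly as follows; the generic default and the exact port types are assumptions, only the names are taken from the figure.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the CDC_fifoIF interface from Figure 8.1 (declaration only).
entity CDC_fifoIF is
  generic (
    gen_data_size : natural := 32                 -- width of idData/odData (assumed default)
  );
  port (
    icWriteClk  : in  std_logic;                  -- clock of the writing domain
    icReadClk   : in  std_logic;                  -- clock of the reading domain
    icReset     : in  std_logic;
    icWe        : in  std_logic;                  -- FIFO style write enable
    icRe        : in  std_logic;                  -- FIFO style read enable
    idData      : in  std_logic_vector(gen_data_size - 1 downto 0);
    odData      : out std_logic_vector(gen_data_size - 1 downto 0);
    ocFull      : out std_logic;                  -- handshake still in progress
    ocDataAvail : out std_logic                   -- data waiting in the read domain
  );
end entity;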
8.1.2 Dual Port Block RAM
Dual ported BRAM provides two interfaces to a RAM. Through one interface a component writes data into it, while another component reads data from the RAM through the second interface. This is often useful while working on data streams or building FIFO queues. Figure 8.2 describes the signal interface of the dual_port_block_ram component.
[Figure 8.2: Dual Port Block RAM interface — dual_port_block_ram with generics gen_addr_size and gen_width and ports icClkA, icEnA, icWeA, idAddrA, idDataA, icClkB, icEnB, idAddrB and odDataB]
The Xilinx tools map the component onto an onboard BRAM if one is available on the used FPGA. Otherwise, the RAM is built out of logic cells. This kind of implementation allows the flexible usage of this component on any FPGA, without requiring available BRAM.
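A minimal sketch of such an inferable dual port RAM, following the generic and port names of Figure 8.2, could look like the code below; the body is an assumed textbook template, not the original MRP source.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of a simple dual port RAM in a style the synthesis tools can map to BRAM.
entity dual_port_block_ram is
  generic (
    gen_addr_size : natural := 10;
    gen_width     : natural := 32
  );
  port (
    icClkA  : in  std_logic;                                      -- write port clock
    icEnA   : in  std_logic;
    icWeA   : in  std_logic;
    idAddrA : in  std_logic_vector(gen_addr_size - 1 downto 0);
    idDataA : in  std_logic_vector(gen_width - 1 downto 0);
    icClkB  : in  std_logic;                                      -- read port clock
    icEnB   : in  std_logic;
    idAddrB : in  std_logic_vector(gen_addr_size - 1 downto 0);
    odDataB : out std_logic_vector(gen_width - 1 downto 0)
  );
end entity;

architecture rtl of dual_port_block_ram is
  type ram_t is array (0 to 2**gen_addr_size - 1)
    of std_logic_vector(gen_width - 1 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  write_port : process(icClkA)
  begin
    if rising_edge(icClkA) then
      if icEnA = '1' and icWeA = '1' then
        ram(to_integer(unsigned(idAddrA))) <= idDataA;
      end if;
    end if;
  end process;

  read_port : process(icClkB)
  begin
    if rising_edge(icClkB) then
      if icEnB = '1' then
        odDataB <= ram(to_integer(unsigned(idAddrB)));
      end if;
    end if;
  end process;
end architecture;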
8.1.3 FIFO Queue Component
FIFO queues are a very common component at the RTL. The queues can be used to cross clock boundaries (as described earlier in this section) or to implement buffers. They are often implemented using BRAM components available on certain FPGAs. This requires the creation of special Intellectual Property (IP) cores for each FPGA.
The SimpleFifo, shown in Figure 8.3, implements a simple FIFO using the techniques described by Cummings[40]. It uses the dual_port_block_ram component for saving the queue objects. To prevent buffer over- and underflow, the write and read addresses are converted into Gray code and propagated through two register stages into the other clock domain. In Gray code the code distance between two adjacent words is just one (only one
bit can change from one Gray count to the next)[40]. This ensures that all changing bits of the address are synchronized at the same clock tick into the other clock domain.
[Figure 8.3: SimpleFifo interface — generic gen_width; ports icWriteClk, icReadClk, icClkEnable, icReset, icWe, idData, icReadEnable, odData, ocFull, ocAfull, ocEmpty and ocAempty]
The
SimpleFifo can be synthesised for any FPGA without the need of a special IP core. The design of the dual_port_block_ram ensures that the Xilinx tools can use BRAM if available. It supports different read and write clock signals for clock domain crossing. Through the generics gen_width and gen_depth, the data width and the maximum number of queue elements can be selected. The thresholds for the ocAfull and ocAempty signals are selectable through the generics gen_a_full and gen_a_empty.
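The binary-to-Gray conversion used for moving the pointers across the clock boundary can be written, for example, as the small function below; the package and function names are illustrative and the code is a sketch of the textbook technique from Cummings[40], not the original SimpleFifo source.

library ieee;
use ieee.std_logic_1164.all;

package gray_pkg is
  function bin2gray(bin : std_logic_vector) return std_logic_vector;
end package;

package body gray_pkg is
  -- Convert a binary FIFO pointer to Gray code so that only one bit changes
  -- between adjacent counts when the pointer crosses the clock boundary.
  function bin2gray(bin : std_logic_vector) return std_logic_vector is
    variable g : std_logic_vector(bin'range);
  begin
    g(bin'high) := bin(bin'high);
    for i in bin'high - 1 downto bin'low loop
      g(i) := bin(i + 1) xor bin(i);
    end loop;
    return g;
  end function;
end package body;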
8.2 OCSN
The OCSN implementation is divided into multiple components, according to the OSI
model.
8.2.1 OCSN Physical Interface Components
The OCSN physical interface consists of the five signals idOCSNdataIN, odOCSNdataOUT, icOCSNctrlIN, ocOCSNctrlOUT and icOCSNclk. They are used to interconnect all the OCSN devices. Figure 8.4 shows the reception of a single OCSN frame through these five signals.
[Figure 8.4: Reception of one OCSN frame — waveform of icOCSNclk, icOCSNctrlIN and idOCSNdataIN during one transfer]
The transmission of a packet works alike.
icOCSNclk is the clock signal for the whole OCSN on one FPGA. icOCSNctrlIN and ocOCSNctrlOUT are active low signals for controlling when a transmission is taking place. The transmission in Figure 8.4 starts when icOCSNctrlIN goes from high to low and ends when it goes from low to high again. The number of required clock ticks varies according to the number of bits transmitted concurrently. The generic data_link determines this number of bits.
This simple interface is chosen in favour of a more sophisticated physical interface because it reduces the design complexity of the system. Using a high speed serial I/O physical interface would require many more components, such as high speed serialisers and deserialisers and a special transmission encoding like 8b/10b[41].
The interface to the data-link layer consists of 312bit data input/output signals, control signals for signalling the reception or transmission of the data, and a trigger signal for starting the transmission.
Implementation
The implementation of the OCSN physical layer is done through two components. The ocsn_write component is responsible for transmitting data and the ocsn_read component for the reception of data.
ocsn_write is a simple shift register implementing the OCSN physical output interface. Its signal interface is given in Figure 8.5.
[Figure 8.5: OCSN physical transmission component — OCSN_WRITE with ports icOCSNclk, icClkEnable, icReset, idData (312 bit), icSend, ocReady, odOCSNdata (data_link bits) and ocOCSNctrl]
In addition to the OCSN physical interface it features a 312bit data input for the OCSN frame and control signals to start
transmission and signal the end of transmission (icSend, ocReady).
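A minimal sketch of such a shift-register transmitter is given below; it loads the 312-bit frame on icSend and shifts data_link bits out per clock tick while holding ocOCSNctrl low. It follows the behaviour described in this section, but the internal structure, timing details and defaults are assumptions and may differ from the real ocsn_write component.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch of a shift register transmitter in the spirit of ocsn_write.
entity ocsn_write_sketch is
  generic (
    data_link : natural := 8                       -- bits transmitted per clock tick
  );
  port (
    icOCSNclk   : in  std_logic;
    icClkEnable : in  std_logic;
    icReset     : in  std_logic;
    idData      : in  std_logic_vector(311 downto 0);
    icSend      : in  std_logic;
    ocReady     : out std_logic;
    odOCSNdata  : out std_logic_vector(data_link - 1 downto 0);
    ocOCSNctrl  : out std_logic                     -- active low during a transfer
  );
end entity;

architecture rtl of ocsn_write_sketch is
  signal shift_reg : std_logic_vector(311 downto 0) := (others => '0');
  signal remaining : natural range 0 to 312 / data_link := 0;
begin
  odOCSNdata <= shift_reg(data_link - 1 downto 0);
  ocOCSNctrl <= '0' when remaining > 0 else '1';
  ocReady    <= '1' when remaining = 0 else '0';

  process(icOCSNclk)
  begin
    if rising_edge(icOCSNclk) then
      if icReset = '1' then
        remaining <= 0;
      elsif icClkEnable = '1' then
        if remaining = 0 and icSend = '1' then
          shift_reg <= idData;                      -- load a full 39 byte frame
          remaining <= 312 / data_link;             -- clock ticks for one frame
        elsif remaining > 0 then
          -- shift the next data_link bits towards the output
          shift_reg <= std_logic_vector(shift_right(unsigned(shift_reg), data_link));
          remaining <= remaining - 1;
        end if;
      end if;
    end if;
  end process;
end architecture;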
[Figure 8.6: OCSN physical reception component — OCSN_READ with ports icOCSNclk, icClkEnable, icReset, idOCSNdata (data_link bits), icOCSNctrl, odData (312 bit) and ocReceived]
ocsn_read is likewise a simple shift register, implementing the OCSN physical input interface. It works in the opposite direction to ocsn_write. Figure 8.6 displays its signal interface. A new OCSN frame is received and its data is only valid for the one clock tick during which the ocReceived signal is high.
8.2.2 OCSN Data-Link Interface Component
The data link layer is implemented in the OCSN IF component. It is responsible for
identifying the remote interface and for initiating flow control before the receive buffer overflows. The flowchart in Figure 8.7 describes the identification protocol used.
[Figure 8.7: Flowchart of the OCSN identification protocol — two interfaces IF0 and IF1 exchanging an identify request and an identity response]
Both
endpoints of the communication send an identification request to the OCSN physical
interface. If a remote interface is connected, it responds with an identity response.
Sending an identification request is repeated, with a short timeout, until an identification
response is received.
The flow control protocol is similarly simple. An example flow chart is given in Figure 8.8. IF1 is transmitting many OCSN frames to IF0. At some point the receive buffer of IF0 will hit an upper bound. At this moment IF0 transmits a wait request to IF1. IF1 stops sending frames as soon as it processes this wait request, but a few more frames can still be transmitted. Because of these frames, the upper bound cannot be the maximum FIFO queue depth. At some later point in time IF0 has processed most of the frames in its receive buffer and will hit a lower bound. At this moment it transmits a continue request and IF1 starts transmitting again.
Both protocols are identified through OCSN frame type zero and the first byte of the
payload. Appendix A gives an overview of all available OCSN frame types.
The OCSN IF encapsulates the components of the physical layer. Therefore, it provides the OCSN physical interface to the outside and passes it through to these components. Figure 8.9 displays the full signal interface of the OCSN IF component. In
addition to the OCSN physical interface, it has to provide an interface to the network layer.
[Figure 8.8: Flowchart of the OCSN flow control protocol — IF1 streams frames to IF0; when IF0's receive buffer reaches an upper bound it sends a wait request, and when it reaches a lower bound it sends a continue request]
This interface includes signals for controlling the status of the connection, for
working with OCSN frames, for controlling the transmission and reception of frames
and for resetting and running the component.
The following signals are used for controlling the status of the connection between two
connected OCSN IF components.
identity input for the 16bit OCSN address of the interface
icIdentity this active high control signal selects whether the identity is automatically set for each transmitted frame
odIdentity 16bit output of the OCSN address of the remote interface
ocIdvalid active high validity signal for odIdentity
The interface to the network layer consists of the frame and frame-control signals.
It simplifies the usage of OCSN frames by dividing them into individual signals for each
frame part.
{id,od}DST destination address of the OCSN frame
[Figure 8.9: OCSN IF signal interface — physical layer ports (idOCSNdataIN, icOCSNctrlIN, odOCSNdataOUT, ocOCSNctrlOUT, icOCSNclk), identification ports (identity, icIdentity, odIdentity, ocIdvalid), frame ports (idDST, idSRC, idType, idSrcPort, idDstPort, idData and the corresponding od* outputs), frame control ports (icSend, ocReady, ocDataAvail, icReadEn, icForward) and system ports (icClk, icClkEn, icReset)]
{id,od}SRC source address of the OCSN frame
{id,od}DstPort destination port of the OCSN frame
{id,od}SrcPort source port of the OCSN frame
{id,od}Type the frame type of this OCSN frame
{id,od}Data the 31byte payload of the OCSN frame
The frame control signals form a simple FIFO queue interface. The active high ocReady signal indicates whether the interface is ready to transmit a new frame. Through the icSend signal, the frame created in the frame part is transmitted. ocDataAvail indicates the availability of OCSN frames in the receive FIFO queue. icReadEn removes the first queue element.
The system interface consists of the main clock signal icClk, an active high asynchronous reset signal icReset and an active high clock enable signal icClkEn.
Implementation
The OCSN interface is built out of the components ocsn_write, ocsn_read, SimpleFifo, CDC_fifoIF and an FSM controlling all these components. Figure 8.10 displays a simplified block diagram of the OCSN IF buildup.
[Figure 8.10: OCSN IF implementation schematic — ocsn_read feeding a frame register and a SimpleFIFO, a CDC stage and a multiplexer in front of ocsn_write, all controlled by an FSM]
ocsn_read and ocsn_write are responsible
for the physical communication. If an OCSN frame is received it is cached in a register
and the FSM evaluates the frame at the same moment. If the frame belongs to the
identification or flow control protocol, the frame is not stored in the FIFO queue. If
the frame is a normal OCSN frame the FSM sets the write enable signal (icWe) of the
FIFO queue to append the frame. Through the multiplexer the FSM controls whether a frame from the outside or a control frame generated by the FSM itself is transmitted through ocsn_write. Figure 8.11 shows the FSM graph. The FSM starts with the state st_start
on the left side. After waiting for the ocsn_write component to become ready, the FSM switches to the st_identify state. In this state it transmits the identify request to the remote interface and switches to st_wait_id to wait for an identity response. The internal signals scSendIdentity and scIdentityReceived are control flags. The first flag requests that the interface should transmit its own identity and the other shows whether the remote identity has already been received. If the remote interface is identified, the FSM switches to the st_idle state. The st_idle state is the main state of the FSM. The states st_wait, st_cnt_send and st_wait_send are just intermediate states returning to the st_idle state as soon as an OCSN frame has successfully been sent to the network. All other states are only reachable from st_idle. If a new identify request is received, the FSM switches to the st_identify state. If a wait request is received from the remote interface, the FSM stays in the st_stop state until a continue request is received. If the FIFO queue is almost full, the FSM transmits a wait request in the st_wait state and, if the FIFO is almost empty again, a continue request in st_continue.
[Figure 8.11: Graph of the OCSN IF FSM — states st_start, st_identify, st_wait_id, st_idle, st_send, st_send_wait, st_wait_send, st_cnt_send, st_wait, st_continue, st_stop, st_identity and st_id_send with their transition conditions]
8.2.3 OCSN Network Component
The OCSN switch implements the network layer of the OCSN . It uses the OCSN IF
of the previous section to provide seven ports for interconnecting devices, including
additional switches. Because of the addressing scheme introduced in Section 7.1.3, seven
is the maximum number of ports at one switch. Figure 8.12 displays the signal interface of an OCSN switch.
[Figure 8.12: Signal interface of an OCSN switch (OCSN_Switch_7Port) — identity (16 bit), seven idOCSNdataIN/odOCSNdataOUT ports of data_link bits each, seven icOCSNctrlIN/ocOCSNctrlOUT lines, odLED, icOCSNclk, icClkEn and icReset]
Switches are devices of the OCSN too and, as such, require their own address, given by the identity signal. odLED is a debug interface showing at which
ports a remote interface has been detected. Devices are connected through the OCSN
physical signal interface. The switch implements the same interface as an OCSN IF, but has seven control signals and seven times data_link data signals. data_link is the number of data signals for one OCSN IF. The icOCSNclk is shared by all the OCSN devices.
The main task of a switch is routing incoming OCSN frames according to their destination address to another port. This includes forwarding frames to other connected
switches. Because of the tree structure, a switch has to identify its uplink switch, which
can be connected to any of the seven ports. A connected switch A is the uplink of a
switch B, if the address of B is a postfix of the address of A. The same comparison has
to be done for the destination address of each incoming OCSN frame.
The addr_compare component, shown in Figure 8.13, is responsible for this comparison process. Two OCSN addresses are fed into the component and it calculates whether idAddr2 is a postfix of idAddr1.
[Figure 8.13: Signal interface of the addr_compare component — inputs idAddr1 and idAddr2 (16 bit each) and isNet, output ocValid]
It uses a chain of multiplexers to compare every sub-part of the OCSN addresses, leading to very long signal propagation delays and reducing
the maximum clock rate for an OCSN switch. The alternative is to implement the component clock triggered and invest multiple clock cycles in the comparison. This would increase the complexity of the FSM controlling the OCSN switch. Furthermore, the comparison of two addresses could then require a different number of clock cycles, making it harder to calculate the actual switch throughput. The multiplexer approach is used in this work because a simpler implementation is better suited for a prototype system than the higher performance solution.
While forwarding OCSN frames, several problems can occur which have to be addressed by the switch. If multiple received frames have the same destination address, the switch has to select one of them at a time for transmission to prevent a deadlock. The transmission of the frames has to occur as soon as possible, and no starvation of interface ports may take place. No frame drop is allowed to occur on switches other than the root switch.
Figure 8.14: OCSN switch implementation schematic
Figure 8.14 gives a simplified overview of the OCSN switch implementation. Each of the seven OCSN IF components has an FSM connected. For each port, six addr_compare components (ac) calculate whether any incoming frame is designated for it. Another seven addr_compare components compare the remote interface addresses of each switch port with the address of the switch to identify the uplink port of this switch. The FSMs implement, together with the main FSM, a snapshot-based pulling algorithm.
The algorithm ensures fairness by saving the availability of incoming frames at each OCSN port in a snapshot. Every available incoming frame is pulled to its destination port in a round-robin manner. Once the snapshot is processed, another one is created. Listing 8.1 displays this algorithm in a C-like pseudo language.
Lines 3 to 6 take the snapshot by saving the data-available signal of each OCSN port and marking each port as not yet transmitted.
In lines 8 to 44, two nested for loops, with the indices s for the source and d for the destination port, walk through all port combinations. For each combination, the snapshot is tested for an available and not yet transmitted incoming frame.
If source and destination port are the same and the destination address of the frame is the address of the switch, the destination of the frame is the switch itself and the frame has to be processed appropriately. Processing such a frame only when source and destination port are the same ensures that it is processed exactly once.
If source and destination ports differ and the destination of the frame at source port
s is a sub-address of the remote address at destination port d, the frame is forwarded to
d.
If d is identified as the uplink port of the switch and the destination of the frame at
source port s is not a sub-address of any remote address, the frame is forwarded to d.
After working through all ports in the snapshot, all frames in the snapshot are removed from the incoming queues. Frames that have not been transmitted are dropped. This happens at the root switch only, because all other switches have an uplink port to which all frames that are not directly routable are sent.
The hardware implementation of this algorithm uses two different kinds of FSMs. The main FSM takes the snapshot and removes frames from the incoming queues. It synchronises the seven FSMs of the second type, each of which is responsible for one OCSN port. They test whether incoming frames in the snapshot from any port are destined for their assigned port and implement all the tests described in Listing 8.1, lines 8 to 44.
Through the partitioning of the algorithm into multiple FSMs, its implementation is straightforward and clear.
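The following C-like pseudocode, written in the same style as Listing 8.1, sketches this decomposition; the handshake flags snapshot_valid and done[] are illustrative stand-ins for the corresponding VHDL control signals.

// main FSM: take the snapshot, let the port FSMs work, then clean up
while (1) {
    take_snapshot();                 // lines 3 to 6 of Listing 8.1
    snapshot_valid = 1;              // release the seven port FSMs
    while (!all_done());             // wait until every port FSM has finished
    snapshot_valid = 0;
    remove_processed_frames();       // last loop of Listing 8.1
}

// port FSM for destination port d (one instance per OCSN port)
while (1) {
    while (!snapshot_valid);
    for (int s = 0; s < 7; s++)
        try_pull(s, d);              // the tests of lines 8 to 44, restricted to port d
    done[d] = 1;
}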
8.2.4 OCSN Application Components
The components of the OCSN application layer are connected to OCSN switches through OCSN interfaces. All of them have the same basic structure, consisting of an OCSN IF and an FSM that processes the incoming data. Figure 8.15 displays this basic structure. A device has the OCSN physical signal interface as its minimum input/output signals. More signals are added according to the application-specific hardware part, such as the GPIO pins of an OCSN GPIO device.
The FSM divides into a general and an application-specific part. The application-specific part implements actions for incoming OCSN frames specific to this device, such as reading and writing internal registers or RAM. The general part implements actions for OCSN frames which are common to all OCSN devices. At the moment, this includes reactions to ICMP ping requests only.
while (1) {
    // create the snapshot, save which ports have data available
    for (int i = 0; i < 7; i++) {
        snapshot[i].avail = port[i].dataAvail;
        snapshot[i].transmitted = 0;
    }
    // pull frames from source (s) to destination (d) ports
    for (int d = 0; d < 7; d++) {
        for (int s = 0; s < 7; s++) {
            // only do something if a frame is available and not transmitted yet
            if (snapshot[s].transmitted == 0 && snapshot[s].avail == 1) {
                // destination and source port are the same and the dest.
                // address is the same as the switch address of port d
                if (d == s && port[s].frame.dst == switch.address) {
                    // do something according to the frame type, destination port
                    // and payload, e.g. send a ping response
                } else
                // if destination and source port differ and the destination
                // address is a subaddr of the remoteAddr of port d
                if (subAddr(port[s].frame.dst, port[d].remoteAddr)) {
                    // forward frame to this port
                    send(d, port[s].frame);
                    snapshot[s].transmitted = 1;
                } else
                // if d is the uplink port and the frame is not destined for any
                // other port, forward it to d
                if (uplink(d) == 1 && (
                        !subAddr(port[s].frame.dst, port[(d + 1) % 7].remoteAddr) &&
                        !subAddr(port[s].frame.dst, port[(d + 2) % 7].remoteAddr) &&
                        !subAddr(port[s].frame.dst, port[(d + 3) % 7].remoteAddr) &&
                        !subAddr(port[s].frame.dst, port[(d + 4) % 7].remoteAddr) &&
                        !subAddr(port[s].frame.dst, port[(d + 5) % 7].remoteAddr) &&
                        !subAddr(port[s].frame.dst, port[(d + 6) % 7].remoteAddr)
                    )
                ) {
                    // forward frame to this port
                    send(d, port[s].frame);
                    snapshot[s].transmitted = 1;
                }
            }
        }
    }
    // remove frames in snapshot from fifo queue
    for (int i = 0; i < 7; i++) {
        if (snapshot[i].avail == 1) {
            snapshot[i].avail = 0;
            port[i].removeFromQueue();
        }
    }
}

Listing 8.1: Basic snapshot-based pulling algorithm
Figure 8.15: OCSN application component basic schematic
Through ICMP ping requests, the identity of an OCSN component can be determined.
OCSN BRAM device
The VHDL description of the application-specific part is very similar to the description of the dual-ported block RAM described earlier, but it uses only one port for read and write access. Each of the supported frames, as described in Section 7.2.2, corresponds to a state in the application-specific part of the FSM. Data read from or written to the BRAM has to be encoded into the payload of OCSN frames. The address to read from or to write to is also encoded into the payload. The main function of the FSM states is to read the requested number of bytes from the RAM and write them into the payload of the frame, or, the other way round, to write the given number of bytes from the frame to the RAM.
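As an illustration, the read path of this FSM corresponds to the following C-like sketch. The exact position of the address and length fields inside the payload is an assumption; the text above only states that both are encoded in the payload.

struct ocsn_frame { unsigned char payload[32]; };   /* 256-bit payload, cf. Table 8.1 */

void bram_read(const struct ocsn_frame *req, struct ocsn_frame *resp, const unsigned char *bram)
{
    unsigned addr = req->payload[0] | (req->payload[1] << 8);   /* assumed payload layout */
    unsigned len  = req->payload[2];

    for (unsigned i = 0; i < len && 3 + i < sizeof(resp->payload); i++)
        resp->payload[3 + i] = bram[addr + i];   /* copy requested bytes into the reply  */
    /* a write request works the other way round: payload bytes are copied to the BRAM */
}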
OCSN ICAP device
The ICAP device takes the number of bytes to write and the bytes from an OCSN frame.
The FSM always writes 32 bit data words to the ICAP component at 50MHz.
OCSN GPIO device
The GPIO device maps registers to external input and output pins. The FSM takes bytes from an OCSN frame and writes them into internal registers, changing the state of the GPIO output pins. If the status of the input pins is requested, the FSM returns the internal register connected to these pins.
OCSN PRHS device
The OCSN PRHS device connects the OCSN to the PRHS SoC through a memory
mapped input/output interface. The implementation is described by Grebenjuk[37].
OCSN Ethernet Bridge
The OCSN Ethernet Bridge device consists of the basic OCSN device structure, an Ethernet MAC IP core and two synchronised FSMs for controlling the transmission and reception of data. Figure 8.16 displays both FSMs. The numbers at the beginning of the transition labels set the priority of each transition. The FSMs implement a simple synchronisation protocol (shown in Figure 8.17) to ensure that the Ethernet MAC addresses of both endpoints are known to each other.
Figure 8.16: OCSN Ethernet Bridge FSMs ((a) Transmission FSM, (b) Reception FSM)
The OCSN2Ethernet bridge starts by sending discovery Ethernet frames through the Ethernet MAC IP core every second. If a host system is available on the other side of the connection or connected to the same Ethernet switch, it answers with a selection frame sent to the MAC address of the OCSN2Ethernet bridge. The OCSN2Ethernet bridge confirms the reception of the selection frame by sending a selection ack frame.
After this handshake, every OCSN frame is encapsulated into an Ethernet frame and transmitted to the remote device. The FSMs do not support answering OCSN ping frames.
Figure 8.17: OCSN Ethernet Discovery Protocol (Host and OCSN2Ethernet exchange discover, selection and selection ack frames)
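The following C sketch illustrates the Ethernet encapsulation and the host's part in the handshake. The field names are taken from Figure 8.16 (DST_MAC, FRAME_TYPE = 0x81fc, OCSN_OP); the field order and the operation-code values are assumptions.

#include <stdint.h>
#include <string.h>

#define ETH_P_OCSN       0x81fc
#define OP_DISCOVER      0        /* assumed encoding */
#define OP_SELECTION     1
#define OP_SELECTION_ACK 2
#define OP_OCSN_FRAME    3

struct ocsn_eth_frame {
    uint8_t  dst_mac[6];
    uint8_t  src_mac[6];
    uint16_t frame_type;          /* always 0x81fc                          */
    uint8_t  ocsn_op;             /* discover / selection / sel. ack / frame */
    uint8_t  payload[39];         /* encapsulated 312-bit OCSN frame        */
};

/* host side of the handshake: answer a discover frame with a selection frame */
void on_receive(const struct ocsn_eth_frame *rx, struct ocsn_eth_frame *tx, const uint8_t own_mac[6])
{
    if (rx->frame_type == ETH_P_OCSN && rx->ocsn_op == OP_DISCOVER) {
        memcpy(tx->dst_mac, rx->src_mac, 6);   /* reply to the bridge's MAC address */
        memcpy(tx->src_mac, own_mac, 6);
        tx->frame_type = ETH_P_OCSN;
        tx->ocsn_op    = OP_SELECTION;
    }
}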
OCSN UART Bridge
Like all application devices, the OCSN UART Bridge is based on the basic application device structure of Figure 8.15. The application-specific hardware consists of a UART component and another FSM, which controls the incoming data from the UART. No special handshake protocol is implemented. The device simply starts transmitting through the UART as soon as an OCSN frame arrives, and builds an OCSN frame out of the incoming data from the UART. Sending an end-of-frame byte, identified through the parity bit, is the only synchronisation method used between the local and the remote bridge component.
8.3 CSN
Like the description of the OCSN implementation, the implementation of the CSN is
divided into different components, according to the OSI model. Section 7.3.3 already
described the required OSI layers.
8.3.1 Physical Layer Implementation
The CSN uses the interconnection network of the underlying FPGA. This reduces the implementation complexity of the CSN physical layer. The signal interface for communicating through the CSN is the only implementation-specific part of it. It has already been described in Section 7.3.3.
8.3.2 Network Layer Components
The CSN is an indirect network with crossbar switches as its main network components. Application layer devices can be connected through the crossbar switches, as well as other crossbar switches to extend the network. Figure 8.18 displays the connection schema of one CSN crossbar switch.
Figure 8.18: Crossbar Interconnection Schema
There are dedicated ports for connecting CEBs and dedicated extension ports for connecting switches and application layer devices. Each device is connected with four single signal lines and four clustered or bus signal lines. One bus line is 32 bits wide.
The CSN crossbar switch requires a complex signal interface to support this kind of connection schema. Figure 8.19 presents this signal interface. The first six signals on the left side belong to the OCSN physical interface, because the routing table of the CSN crossbar switch is programmable through the OCSN. Additional status information concerning CEBs can be requested from the OCSN too.
icSWid identifies all connected switches. Its width is eight times the number of connectable switches. For every switch, eight identifier bits are available, limiting the number of switches in one CSN to 256. Each switch connects to this signal, starting with the “top” switch at bits 8 × nr_sw − 1 down to 8 × (nr_sw − 1).
ocResetCEB and ocEnabled are control signals to the CEBs. The first resets the component configured into the CEB to a known state; the second enables the clock for the component. Both signals have a bit width equal to the number of connectable CEBs.
Figure 8.19: CSN Crossbar Switch Signal Interface
icCEBid is the same as icSWid but identifies the connected CEBs. The eight bits of width per CEB limit the number of CEBs on a reconfiguration platform to 256, but this value is easily extended if necessary.
idCtrl, odCtrl, idBUS and odBUS are the data signals of the CSN. The first two have a bit width of 2^nr_ctrl_lines_single and the latter two of 2^nr_ctrl_lines_bus × bus_size. At the moment there are five control lines for single signal lines and five control lines for clustered or bus signal lines. The bus width is 32. Eight components can connect to one crossbar switch, leading to four signals of each type for one component. The components connect to the crossbar switch according to the connection schema of Figure 8.18.
Implementation
Figure 8.20 displays the main components of a CSN crossbar switch. Its main structure resembles the basic structure of an OCSN application layer component. An OCSN interface and an FSM manage the connection to the OCSN.
In this example, the number of single and cluster control lines is reduced to two. This simplifies the display of all required components. The more control lines there are, the more components are required.
With two control lines, four signal lines or signal clusters can be addressed. In this example, four outgoing single signal lines are shown on the left side and four outgoing clustered signals on the right. Each of these outputs is connected to the output port of a multiplexer. The incoming signal lines are connected to the input ports of the multiplexers. Through a connected routing register, the signal passing through to the output is selected.
The outgoing signals for resetting and enabling CEBs and the incoming signals for CEB and switch identifiers are connected to registers too.
All the available registers, except the identification registers, can be set by sending special OCSN frames to the switch, which programs the routing.
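The routing function itself amounts to one multiplexer selection per output. The following C sketch shows this behaviour for the bus lines; the array sizes assume the current configuration with five control lines (up to 32 selectable sources).

#include <stdint.h>

/* each output bus copies the input bus selected by its routing register */
void crossbar_route(const uint32_t id_bus[32], uint32_t od_bus[32], const uint8_t routing_reg[32])
{
    for (int o = 0; o < 32; o++)
        od_bus[o] = id_bus[routing_reg[o] & 0x1f];   /* 5-bit selector per output */
}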
Figure 8.20: CSN Crossbar Switch Implementation Schematic
8.3.3 Application Layer Components
The application layer components of the CSN divide into the CEBs and other extension
devices. At the moment only one extension device is available, the OCSN2CSN bridge
to communicate with the outside world.
CEB
The interface of the CEBs has already been described in Section 7.3. The implementation
is application specific and is not described here.
OCSN2CSNsimple Bridge
Both OCSN2CSN bridges are gateways between the packet-switched OCSN and the circuit-switched CSN. Therefore, they require a physical OCSN signal interface and a physical CSN signal interface. Figure 8.21 displays these signal interfaces.
Figure 8.21: CSN2OCSN Bridge Signal Interface
The OCSN interface is the same as for any other OCSN device and enables the bridge to connect to an OCSN switch or directly to any other OCSN application layer component.
The CSN signal interface is designed to connect directly to the extension ports of a CSN crossbar switch.
The OCSN2CSNsimple Bridge is implemented as an OCSN application layer device,
introduced in Section 8.2.4. It supports four different OCSN network frames.
readSingle returns the value of the idSingle lines
writeSingle sets the value of the odSingle lines
readBus returns the value of the idBus lines
writeBus sets the value of the odBus lines
The values returned are sampled at the moment the OCSN frame is processed by the
bridge.
OCSN2CSN Bridge
The structure of the OCSN2CSN bridge is nearly the same as that of the OCSN2CSNsimple bridge. The signal interface is the same as displayed in Figure 8.21, and it is also an OCSN application layer component. The difference is that the OCSN2CSN bridge enables a CEB to create and transmit a full OCSN frame and to receive a full OCSN frame. To create the OCSN frame, the following signal mapping on the CSN physical layer is used:
idBus(31 downto 0) data input from the CSN
odBus(31 downto 0) data output to the CSN
idBus(32) directly mapped to the OCSN IF icSend signal
idBus(33) directly mapped to the OCSN IF icReadEn signal
idBus(63 downto 60) selects the register to which the incoming data should be written
idBus(59 downto 56) selects the register to put on the output data bus
odBus(32) directly mapped to the OCSN IF ocIDvalid signal
odBus(33) directly mapped to the OCSN IF ocReady signal
odBus(34) directly mapped to the OCSN IF ocDataAvail signal
The CEBs can use this interface to create or read an OCSN frame. Table 8.1 describes
the selectable registers. New values are written to the register at the next clock tick.
Address  Register
0000     source address and destination address
0001     source port, destination port and frame type
0010     bits 31 downto 0 of OCSN payload
0011     bits 63 downto 32 of OCSN payload
0100     bits 95 downto 64 of OCSN payload
0101     bits 127 downto 96 of OCSN payload
0110     bits 159 downto 128 of OCSN payload
0111     bits 191 downto 160 of OCSN payload
1000     bits 223 downto 192 of OCSN payload
1001     bits 255 downto 224 of OCSN payload
rest     identity of the remotely connected OCSN device

Table 8.1: Address to register mapping
After creating an OCSN frame, it can easily be transmitted by setting the icSend signal
to high.
If an OCSN frame is available, it can also be read through this interface.
The interface is necessary because the CSN only features four 32-bit busses and four single lines for each connected component at the moment. One OCSN frame is 312 bits wide and has to be mapped to fewer signals.
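The register write sequence a CEB performs to assemble and send a frame can be sketched as follows. ceb_write() and set_icSend() are illustrative helpers: ceb_write() stands for driving the bridge's idBus(31 downto 0) with the value and idBus(63 downto 60) with the register address, which is latched at the next clock tick; set_icSend() drives idBus(32). How the addresses and the port/type fields are packed into the 32-bit register words is an assumption.

#include <stdint.h>

void send_frame(uint16_t src, uint16_t dst, uint8_t src_port, uint8_t dst_port,
                uint8_t type, const uint32_t payload[8])
{
    ceb_write(0x0, ((uint32_t)src << 16) | dst);                       /* reg 0000: addresses (assumed packing) */
    ceb_write(0x1, ((uint32_t)src_port << 16) | ((uint32_t)dst_port << 8) | type); /* reg 0001 */
    for (int i = 0; i < 8; i++)
        ceb_write(0x2 + i, payload[i]);                                 /* regs 0010..1001: 256-bit payload */
    set_icSend(1);                                                      /* idBus(32): transmit the frame    */
    set_icSend(0);
}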
One problem arises from the fact that each CEB can be operated at a different clock speed, and this clock speed is not required to match the clock speed of the OCSN2CSN bridge. If the clock signals do not match, the CDC problems described in Section 8.1.1 arise. Different solutions exist to ensure that the data is correctly saved into the internal registers:
• The interface can be extended by read and write acknowledge signals. These
acknowledge signals ensure that the data can correctly cross the clock boundaries,
like a CDC component does. It requires additional hardware in the CEBs and the
OCSN2CSN bridge for handling the acknowledge signals.
• Using clock speed selection lines instead of acknowledge signals would reduce the hardware requirements within a CEB, because no FSM is required to handle the acknowledge signals, but it would require the usage of special BUFGMUX components in the OCSN2CSN bridge. These special components are multiplexers dedicated to global clock lines of the FPGA and are limited in number. This approach is only feasible if the number of clock signals and the number of OCSN2CSN bridge components are very small.
• The simplest solution is to reduce the flexibility of the overall design and determine
one fixed clock rate for communication with OCSN2CSN bridges. This increases
the hardware requirements in the CEBs only, if the CEB is running at a different
clock rate than the OCSN2CSN bridge.
For the prototype of the MRP the last option is chosen because the implementation
complexity is very small and using a simple interface without additional control signals
reduces the error probability in CEB implementations. The determined clock rate is
25MHz at the moment.
9 Operating System Support Implementation
Section 7.4 described the overall idea of the OS support for the MRP. At the moment, only support for the OCSN is required to interact with the MRP, especially with the CEBs. Linux is chosen as the OS for the host system of the prototype. It is a UNIX-like OS [38] and divides into the Linux kernel and user applications. The current kernel version is 3.14.3.
The MRP operating system support requires adapting the Linux kernel and writing user applications for managing the different tasks of the MRP.
Robert Love[42] gives a good introduction to Linux Kernel Development. The Linux
OS has different ways of extending its functionality. The main, and most used, way is
writing device drivers. These device drivers interact with hardware devices connected
to the system, and integrate them into the Linux kernel as character, block or network
devices. Character and block devices are represented as ordinary files in the Linux device
tree and require the implementation of at least open, read, write and release callback
functions. The network device driver requires read, write and poll callbacks. The kernel
uses these callback functions to interact with the hardware devices.
Another extension point of the Linux kernel is network drivers. Network drivers are different from network device drivers. While the latter interact with hardware, network drivers implement the BSD socket API for a supported network protocol. This includes
creating a kernel structure, representing the addressing schema of the network, callbacks
for bind, connect, release, accept, listen, poll, sendmsg and recvmsg. The socket interface
allows user space applications to open sockets and transmit and receive data through
the network. Common network drivers of the Linux kernel are IPv4, IPv6, AppleTalk
and Ethernet.
All drivers of the Linux Kernel register at least one C structure with the kernel. These
C structures contain configuration parameters, like names and sizes of other structures,
and function pointers to callbacks.
The OS support for the MRP uses a device driver and a network driver. The network driver for the OCSN allows user applications to directly create, transmit and receive OCSN frames. The frames are en-/decapsulated by the network driver into/from Ethernet frames and transmitted/received using the Ethernet network driver. If the OCSN is connected natively to the host system, for example using the PRHS SoC, an OCSN network device driver interacts with the OCSN network interface hardware. The driver fetches received frames from the interface hardware and encapsulates them into Ethernet frames. The Ethernet frames are passed to the OCSN network driver, which delivers each frame to the corresponding user space process. A frame transmitted from a user space application is first processed by the OCSN network driver and then delivered to the network interface connected to the OCSN.
9.1 OCSN Network Driver
The first part of the network driver initialisation is registering a new network protocol
to the Linux kernel with its name and the size of its socket data structure (Listing 9.1).
static struct proto ocsn_proto = {
    .name     = "OCSN",
    .owner    = THIS_MODULE,
    .obj_size = sizeof(struct ocsn_sock),
};

Listing 9.1: OCSN protocol structure
The ocsn_sock structure represents a network socket. In the OCSN context it consists of the basic kernel socket structure, the src and dst addresses, the src and dst ports, and the application layer frame type, as presented in Listing 9.2.
struct ocsn_sock {
    struct sock    sk;
    unsigned short ocsn_dst;
    unsigned short ocsn_src;
    unsigned char  ocsn_src_port;
    unsigned char  ocsn_dst_port;
    unsigned char  protocol;
};

Listing 9.2: OCSN socket structure
The basic socket structure sk holds information about the incoming or outgoing network device and a queue for incoming network frames.
The second initialisation step is registering a new sub-packet of an Ethernet packet, with the fixed Ethernet frame type ETH_P_OCSN (0x81fc) and the callback function ocsn_rcv.
static struct packet_type ocsn_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_OCSN),
    .func = ocsn_rcv,
};

Listing 9.3: OCSN packet structure
This packet type is represented by the structure displayed in Listing 9.3. This step ensures that all incoming Ethernet frames of type ETH_P_OCSN are forwarded to this network driver by calling the ocsn_rcv function with the Ethernet frame as a parameter. The ocsn_rcv function is responsible for processing the incoming Ethernet frames, extracting the OCSN frame from the payload and finding the destination socket in a list of sockets by comparing the destination address and destination port of the incoming frame with every existing socket. If the OCSN is connected to the host system through an OCSN Ethernet bridge, ocsn_rcv also has to respond according to the handshake protocol described in Section 8.2.4.
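A much-simplified sketch of this receive path is shown below. The header structure ocsn_hdr and the lookup helper ocsn_find_sock() are illustrative; the actual driver differs in detail and also handles the Ethernet bridge handshake and error cases.

#include <linux/skbuff.h>
#include <linux/netdevice.h>

static int ocsn_rcv(struct sk_buff *skb, struct net_device *dev,
                    struct packet_type *pt, struct net_device *orig_dev)
{
    /* the OCSN frame starts directly behind the Ethernet header */
    struct ocsn_hdr *hdr = (struct ocsn_hdr *)skb->data;          /* illustrative header type   */
    struct sock *sk = ocsn_find_sock(hdr->dst, hdr->dst_port);    /* walk the list of sockets   */

    if (!sk) {
        kfree_skb(skb);
        return NET_RX_DROP;
    }
    sock_queue_rcv_skb(sk, skb);   /* frame is later handed to user space via recvmsg */
    return NET_RX_SUCCESS;
}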
The last step registers the socket interface of the network driver with the kernel. The implemented interface is described by the structure given in Listing 9.4.
static const struct proto_ops ocsn_dgram_ops = {
    .family     = PF_OCSN,
    .owner      = THIS_MODULE,
    .release    = ocsn_release,
    .bind       = ocsn_bind,
    .connect    = sock_no_connect,
    .socketpair = sock_no_socketpair,
    .accept     = sock_no_accept,
    .getname    = sock_no_getname,
    .poll       = datagram_poll,
    .ioctl      = sock_no_ioctl,
    .listen     = sock_no_listen,
    .shutdown   = sock_no_shutdown,
    .setsockopt = sock_no_setsockopt,
    .getsockopt = sock_no_getsockopt,
    .sendmsg    = ocsn_sendmsg,
    .recvmsg    = ocsn_recvmsg,
    .mmap       = sock_no_mmap,
    .sendpage   = sock_no_sendpage,
};

Listing 9.4: OCSN socket interface structure
Only the bind, release, poll, sendmsg and recvmsg callbacks are implemented, because the OCSN does not feature a connection-oriented transmission protocol.
bind The bind function creates a persistent OCSN socket with a fixed OCSN src port.
This src port identifies the user space application and every OCSN frame received
with the same destination address is delivered to this socket. The user application
can choose a new random src port or request a specific port, if it is available.
release The release function removes a previously created OCSN socket from the list
of sockets and frees its used memory.
poll Poll uses a standard datagram polling function.
sendmsg The sendmsg function creates an OCSN frame out of a given address structure and data buffer. It creates the kernel structure for transmitting Ethernet frames
and passes this structure to the network device for transmission.
recvmsg The recvmsg function is called for receiving data from an OCSN socket. It
fetches a received frame from the socket queue and creates an OCSN address structure
and data buffer from it. These are returned to the user application.
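From the user space side, these callbacks are reached through the ordinary BSD socket calls. The following fragment is an illustrative sketch only: the numeric value of PF_OCSN and the layout of the address structure (called sockaddr_ocsn here, derived from the fields of struct ocsn_sock) are assumptions, not part of the documented interface.

#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PF_OCSN 40                      /* assumed protocol family number */

struct sockaddr_ocsn {                  /* assumed address structure      */
    sa_family_t    family;
    unsigned short addr;                /* OCSN address                   */
    unsigned char  port;                /* OCSN port                      */
};

int main(void)
{
    int fd = socket(PF_OCSN, SOCK_DGRAM, 0);
    struct sockaddr_ocsn me  = { .family = PF_OCSN, .addr = 0, .port = 100 };
    struct sockaddr_ocsn dst = { .family = PF_OCSN, .addr = 0, .port = 1 };  /* placeholder address */
    char payload[32] = "hello";

    bind(fd, (struct sockaddr *)&me, sizeof(me));          /* fix our source port           */
    sendto(fd, payload, sizeof(payload), 0,
           (struct sockaddr *)&dst, sizeof(dst));           /* transmit one OCSN frame       */
    close(fd);
    return 0;
}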
9.2 OCSN Network Device Driver
The network device driver for the OCSN-PRHS-SoC memory-mapped I/O interface was written by Grebenjuk [37], and its implementation is only briefly described here.
The hardware OCSN network interface is connected to an OCSN IF on one side and to the memory bus of the PRHS SoC on the other side.
The network device driver is responsible for copying received OCSN frames from the memory-mapped registers to kernel space, encapsulating them into Ethernet frames and passing them to the Linux network stack for further processing. In the opposite direction, the network stack delivers Ethernet frames to the network device driver. The device driver extracts the OCSN frame and copies it to the memory-mapped I/O registers of the hardware interface.
10 Evaluation
The usability of the presented framework is evaluated along the two dimensions space and time, and with an example application. The space dimension is analysed by looking at the area usage of the MRP. For the time dimension, the maximum clock rates achievable by CEBs interconnected through the CSN are measured. For the example implementation, a small general-purpose processor is ported to the MRP.
10.1 Area Usage
The area required to support the MRP on the FPGA is a very important factor for how efficient designs using the MRP can be. The area is measured in FPGA LUTs (see Section 2.4).
The reconfiguration platform of the MRP is configured into a Xilinx xc5vlx330 Virtex5 FPGA providing 207360 LUTs divided into 51840 slices.
The CEBs consist of slices only. The integration of special-purpose hardware, such as DSPs and BRAM, is not supported at the moment. To use the available special-purpose hardware, the resource requirements of the complete MRP infrastructure would have to be determined first. The available resources would then have to be distributed evenly over all CEBs, and the CEBs would have to be placed on the FPGA in such a way that each of them encapsulates all the hardware resources it should support. The size of the used FPGA does not allow that. The MRP uses 156096 LUTs of the FPGA, including the area for the CEBs. This is roughly 75% of the available resources. Relocating the CEBs leads to an unroutable design. A larger FPGA could support the placement of CEBs with integrated special-purpose hardware.
Table 10.1 displays the area usage of the MRP system. The given percentage relates to the number of used LUTs, not to the maximum number available.
A CEB consists of 800 CLBs, which equals 3200 LUTs. All the CEBs together require 32.8% of the used FPGA area. The CSN switches differ in size because the components get optimised for area usage during design synthesis. Switches 3 and 1 only support two switch extension ports, while the others feature three. These additional ports and the number of used connections per port determine the size of each switch. The switches are roughly three times larger than a CEB and together require 21.86% of the used FPGA space. The IOB components are only half the size of a CEB. Most of the area is required by the OCSN. Altogether it requires 43.31% of the used FPGA space. The reason for this is the complex routing algorithm within the OCSN switches. A simple bus could replace the OCSN and reduce the area usage of the interconnection infrastructure, but it would limit the flexibility of communication, for example with resources like RAM, processor cores and additional FPGAs. Another drawback would be the limited size and extensibility of busses.
Component        Nr. LUTs  Nr. MUXFX  Nr. BRAM  Area Usage Percentage
clkManager             40          0         0                   0.03
OCSN-Switch0        11920       1153        35                   7.64
OCSN-Switch2        34627       2208        35                  22.18
OCSN-Switch1        14747       1351        35                   9.45
OCSN2BRAM            1834          4         6                   1.17
OCSNbridgeUART       2594          2         7                   1.66
OCSN2ICAP            1886          6         5                   1.21
CEB-0-0              3200          0         0                   2.05
CEB-0-1              3200          0         0                   2.05
CEB-0-2              3200          0         0                   2.05
CEB-0-3              3200          0         0                   2.05
CEB-1-0              3200          0         0                   2.05
CEB-1-1              3200          0         0                   2.05
CEB-1-2              3200          0         0                   2.05
CEB-1-3              3200          0         0                   2.05
CEB-2-0              3200          0         0                   2.05
CEB-2-1              3200          0         0                   2.05
CEB-2-2              3200          0         0                   2.05
CEB-2-3              3200          0         0                   2.05
CEB-3-0              3200          0         0                   2.05
CEB-3-1              3200          0         0                   2.05
CEB-3-2              3200          0         0                   2.05
CEB-3-3              3200          0         0                   2.05
CSN-Switch3          7840        801         5                   5.02
CSN-Switch2         10024       1157         5                   6.42
CSN-Switch1          7585        715         5                   4.86
CSN-Switch0          8682        781         5                   5.56
CSN2OCSN             1502         22         5                   0.96
CSN2OCSNsimple       1613          2         5                   1.03
Total              156096       8202       153                    100

Table 10.1: Area usage of the MRP
Looking only at the CSN and the CEBs, the hardware overhead is not that big because four switches provide interconnectivity for 16 CEBs. The overhead can be reduced even more by increasing the number of CEBs per switch and by improving the multiplexer implementation within the switches.
10.2 Maximum CSN Propagation Delay Measurement
The CSN is a very critical part of the MRP. It is an indirect network and has no
direct connections between network components, such as CEBs and IOBs. Virtual
paths through CSN switches have to be created to interconnect them. The propagation
delay of a path is an important factor in digital circuit design because it determines the
maximum clock rate of the overall system. At least two physical paths are necessary to
create a virtual path within the CSN because it has to connect a CEB or IOB to a CSN
switch, and this switch has to connect to the other CEB or IOB. If the second component
is connected to a different switch, more physical paths are necessary. It is obvious that
the propagation delay of the created virtual path is composed of the propagation delay of
the individual physical paths and the gate delay within each CSN switch. It is important
to analyse all the possible path delays within the CSN to determine the maximum overall clock frequency, and to identify areas with the same maximum clock frequency.
The measurement of propagation delays on a FPGA is difficult because the start and
endpoints are not directly accessible from outside. Routing both to I/O pins of the
FPGA would greatly distort the measurement result because the additional path to the I/O buffer and the I/O buffer itself affect the propagation delay by an unknown amount. Grinding open the FPGA to get access to the path is not feasible either. A
working solution to analyse the propagation delay of paths on a FPGA was published
by Ruffoni and Bogliolo[43]. They used two Ring Oscillators (ROs) R0 and R1 on the
FPGA. R1 was extended by the path p to analyse. They determined the periods T0
and T1 of the ROs. The period of a RO is twice the propagation delay of its loop[43].
Adding a path to the loop extends the period by twice the propagation delay of the path p: T1 = T0 + 2dp. Hence, the delay dp of the path is calculated by dp = (T1 − T0)/2. This
method has been adapted for the MRP.
10.2.1 RO-Component
A special RO component has been developed that can be configured into any of the CEBs. It consists of a RO whose path can be extended by using a control output and a control input of the CEB interface. The switching between the base and the extended path is implemented using a 2-1 multiplexer and a 2-1 demultiplexer. The control line of each of them is connected to the CEB's enable signal (see Figure 7.7). The RO drives the clock input of a 32-bit counter. The enable and reset signals of the counter are driven by an FSM clocked at 50 MHz. Both signals are passed into the clock domain of the RO using two FFs connected in a row. The FSM is responsible for measuring the number of RO ticks within a given amount of time. If it receives the start signal
from the outside, the FSM enables the counter, waits for a given number of 50 MHz clock cycles, and disables the counter. The counter's value is connected to an outgoing 32-bit bus connection. On reception of a reset signal from the outside, the FSM resets the counter. The component can be used to first measure the base period TB of the RO and afterwards the period TE of the RO with the extended path. The period in nanoseconds can be calculated from the measured number of ticks by

T = (ticks_f / (RO ticks × f[MHz])) × 1000

where ticks_f is the number of reference clock cycles of frequency f[MHz] during which the counter was enabled and RO ticks is the counted number of RO cycles. The propagation delay of the extended path p can then be calculated with

dp = (TE − TB)/2
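The two formulas translate directly into a small C helper, which evaluates one measurement:

/* period of the RO in ns, from the counted RO ticks and the number of reference
   clock cycles (ref_ticks) at f_mhz MHz during which the counter was enabled */
static double ro_period_ns(unsigned ro_ticks, unsigned ref_ticks, double f_mhz)
{
    return (double)ref_ticks / ((double)ro_ticks * f_mhz) * 1000.0;
}

/* propagation delay of the extended path: dp = (TE - TB) / 2 */
static double path_delay_ns(double t_base_ns, double t_ext_ns)
{
    return (t_ext_ns - t_base_ns) / 2.0;
}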
10.2.2 ReRouter-Component
Another component is required to measure the propagation delay of all paths within the CSN. The RO requires the extended path to start and end at itself. Therefore, a component is necessary which can route the incoming signals of a CEB back through its outputs. This component is called ReRouter. Its implementation is very simple because it just connects its inputs to its outputs.
10.2.3 Measuring Setup
To get as much information as possible out of the propagation delay measurement, all the paths between the CEBs are analysed. Figure 10.1 displays one configuration of the measurement setup. This configuration is used to measure all path delays between CEB0 at CSN switch 0 and any other CEB. Hence, the RO component is configured into CEB0 at CSN switch 0. All the other CEBs are configured with the ReRouter component. The red line shows one of the measured virtual paths. It consists of six physical paths (CEB0 to SW0, SW0 to SW2, SW2 to CEB0, CEB0 to SW2, SW2 to SW1, SW1 to CEB0). As can be seen, the round-trip time between the two CEBs is measured. Therefore, the result has to be divided by two to estimate the one-way time. First, the base period of the RO component is determined. After that, the CSN is programmed to every possible virtual path and the resulting period is measured. The last step is to calculate the individual virtual path propagation delays.
10.2.4 Measurement Results
Table 10.3 presents the propagation delay matrix for the full MRP. To keep the table compact, the column and row names are shortened. The format “x-y” denotes CEB y at CSN switch x. The measurement results are symmetric with small variations. The leading diagonal represents the propagation delay of each CEB to its own switch. The results are already divided by two to estimate the one-way time, not the round-trip time. There are a few deviations from symmetry in the matrix which need to be explained.
Figure 10.1: MRP Measurement Configuration for Setup 1
1. There is always at least a small variation in the propagation delay of the path to a CEB and back.
2. Sometimes the propagation delay from one CEB to another is shorter than the sum of their propagation delays to their switch. An example of this phenomenon is the path between CEB1-2 and CEB1-1. Their propagation delay is measured as 1.86 ns, while their propagation delays to their switch are measured as 3.15 ns and 2.39 ns.
The problem with measuring the propagation delay within the CSN is that the CSN is not regularly placed onto the FPGA. Figure 10.2 displays the placement of all four CSN switches. It is clearly visible that all the switches are distributed throughout the FPGA and are even entangled.
Switch  Clks (MHz)  Clkc (MHz)
0              135          67
1              150          75
2              162          81
3              159          79

Table 10.2: Maximum clock rates within each switch
CEB    0-0    0-1    0-2    0-3    1-0    1-1    1-2    1-3    2-0    2-1    2-2    2-3    3-0    3-1    3-2    3-3
0-0   2.36   5.61   5.34   5.07   7.75   8.58   9.61  10.82   8.70  10.41   9.41   8.25  10.33   9.82   9.65   9.65
0-1   5.72   2.90   7.37   6.32   9.28  10.11  11.14  12.35   9.74  11.45  10.45   9.29  11.37  10.85  10.69  11.37
0-2   5.32   7.22   3.07   5.82   7.46   8.29   9.32  10.53   8.24   9.94   8.94   7.78   9.86   9.35   9.19   9.86
0-3   5.05   6.19   5.85   2.31   9.12   9.95  10.98  12.18   8.43  10.14   9.14   7.97  10.06   9.54   9.38  10.05
1-0   7.57   9.36   7.39   8.19   1.83   4.15   5.25   5.46  10.42  12.12  11.12   9.96   9.62   9.13   8.89   9.70
1-1   8.63  10.62   8.65   9.45   4.60   2.39   1.86   1.91  11.68  13.38  12.38  11.22  10.32   9.83   9.60  10.40
1-2  10.00  11.80   9.82  10.62   5.50   6.65   3.15   6.05  12.85  14.56  13.56  12.40  10.76  10.27  10.04  10.84
1-3  10.68  12.47  12.47  11.30   5.40   6.48   5.74   2.70  13.53  15.23  14.24  13.07  10.50  10.01   9.78  10.58
2-0   8.86   9.79   8.43   8.60  10.72  11.51  12.90  13.90   1.87   5.22   6.04   4.62   9.45   8.78   8.32   9.25
2-1  10.56  11.49  10.13  10.31  12.43  13.22  14.60  15.61   5.22   3.01   6.16   5.45  10.04   9.38   8.91   9.85
2-2   9.38  10.31   8.95   9.12  11.24  12.03  13.42  14.43   5.86   5.99   2.44   1.33   9.08   8.42   7.95   8.89
2-3   8.34   9.26   7.90   8.08  10.20  10.99  12.38  13.38   4.55   5.38   6.07   2.63   9.38   8.72   8.25   9.19
3-0  10.06  10.99   9.63   9.80   9.96  10.21  10.86  10.92   9.50  10.09   9.31   9.51   3.24   6.19   6.10   6.03
3-1   9.46  10.39   9.03   9.21   9.54   9.79  10.43  10.50   8.91   9.50   8.72   8.92   6.26   3.00   4.67   5.84
3-2   8.60   9.53   8.17   8.35   8.92   9.17   9.82   9.88   8.04   8.63   7.85   8.05   5.78   4.28   2.17   4.67
3-3   9.81  10.74   9.38   9.55  10.00  10.24  10.89  10.96   9.25   9.84   9.06   9.26   5.98   5.72   4.95   2.70

Table 10.3: Propagation Delay Matrix for all CEBs in ns
Figure 10.2: Floorplan of the reconfiguration platform (yellow: CSN Switch 0, red: CSN Switch 1, green: CSN Switch 2, purple: CSN Switch 3)
This distribution leads to very different gate delays for different parts of the CSN switches. This can cause the second phenomenon, because the route through the used multiplexer to another CEB can be very short while the path back to the CEB itself is very long.
Another problem is the placement within each CEB area. The RO could be placed very near the I/O signals or very far away. The placement process is highly randomised, so this scenario is likely. Figure 10.3 shows the CEB to CSN switch 0 connections in orange and the connections from CSN switch 0 to switch 2 in pink. The lengths of these paths differ considerably, for example the paths to the left of CEB0-3.
The result of these measurements is that CEBs connected through one switch can be clocked at a higher frequency than CEBs connected at different switches. For example, components configured into the CEBs at switch 0 can be clocked at 135 MHz if sequential circuits are used, and at 67 MHz if a combinational circuit is required in at least one CEB. The clock frequencies are calculated using the worst-case propagation delay at one switch; for switch 0, the worst-case CEB-to-CEB delay in Table 10.3 is 7.37 ns, which gives 1/7.37 ns ≈ 135 MHz. The clock rates for the other switches are displayed in Table 10.2. Clks is the maximum achievable clock rate using sequential circuits only. Clkc is the maximum clock rate with at least one combinational circuit, but ignoring its gate delay. As soon as a CEB at a different switch is connected to a system, the clock rate is at least halved.
10.3 Example Microcontroller Implementation for MRP
Showing that the MRP can support complex digital components is very important for the
framework evaluation. Therefore, a small CPU has been ported to run as a distributed
core onto the MRP. The used processor core was developed for teaching purposes by
the Computer Engineering group of the Helmut Schmidt University in Hamburg. It
supports 16 32bit registers, a 32bit ISA, a 32bit databus, and a 16bit address bus. A
simple assembler is available for easier software development.
To port the processor core onto the MRP, it has to be divided into its core parts, such as the fetch and decode unit, the control unit, the register file, and the ALU. These components have to be encapsulated into the CEB signal interface. The fetch and decode unit has to be divided into two units. One unit is responsible for fetching data words from a RAM component within the OCSN using the CSN2OCSN bridge. The second one decodes the fetched words for the datapath of the processor core. The control unit was extended by two states in its FSM to use the additional fetch stage enforced by the OCSN access.
The fetch unit is accessible from the OCSN to select the address of the OCSN RAM component and its port. Additional command frames are available to start, stop, and reset the processor core. This is necessary because programs running on the MRP's host system shall manage the processor core and its software. Figure 10.4 presents the MRP configuration for the processor core. All components except the ALU fit into the CEBs of CSN switch 0. The ALU is configured into CEB 1 of switch 1. Without the MRP and configured as a SoC onto a Xilinx Virtex5 FPGA, the processor core can run at 30 MHz. Since communication with the OCSN2CSN bridge is fixed at 25 MHz (Section 8.3.3), 25 MHz is the maximum frequency of the core on the MRP.
Figure 10.3: Floorplan with interconnects of the reconfiguration platform (yellow: CSN Switch 0, red: CSN Switch 1, green: CSN Switch 2, purple: CSN Switch 3)
Figure 10.4: MRP CPU Configuration
Using the propagation delay matrix in Table 10.3, one can look up the maximum path delay between all components. The ALU is connected to the control unit, the decode unit and the register file. The maximum propagation delay between these components is 10.62 ns. We have to take into account that the ALU is a combinational circuit. So the maximum possible clock frequency is 1/(10.62 ns × 2) ≈ 47 MHz, but the processor cannot run at this speed.
The software running on the host system of the MRP is responsible for programming the fetch unit, starting the processor core, and stopping it after program execution. Furthermore, it emulates an OCSN RAM interface to supply the processor core with an easy-to-debug memory. At program start, the internal RAM buffer is filled from a file given on the command line. The program uses socket programming to communicate with the fetch unit through the OCSN. It programs the fetch unit to use the host system at OCSN port 100 as its RAM, and starts the processor core. After that, it waits for RAM requests from the fetch unit and serves the correct data.
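The serving loop of this host program can be sketched as follows. The layout of the DATA request (address and length at the start of the payload) is an assumption; the real format is defined by the OCSN DATA protocol.

#include <string.h>
#include <sys/socket.h>

static unsigned char ram[65536];               /* emulated 16-bit address space */

void serve(int ocsn_fd)
{
    unsigned char req[312 / 8], resp[312 / 8]; /* one 312-bit OCSN frame each   */
    struct sockaddr_storage from;
    socklen_t fromlen = sizeof(from);

    for (;;) {
        recvfrom(ocsn_fd, req, sizeof(req), 0, (struct sockaddr *)&from, &fromlen);
        unsigned addr = req[0] | (req[1] << 8);          /* assumed request layout      */
        unsigned len  = req[2];
        memcpy(resp, req, 3);                            /* echo the request header     */
        memcpy(resp + 3, &ram[addr], len);               /* append the requested bytes  */
        sendto(ocsn_fd, resp, sizeof(resp), 0, (struct sockaddr *)&from, fromlen);
    }
}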
Multiple programs were executed on the distributed processor core without any problems, such as a simple multiplication and printing the Fibonacci progression up to fib(33). The processor was also tested against the OCSN2BRAM component, which improves execution speed because the RAM is not emulated in software. Further performance improvements are possible, such as implementing a small cache in the fetch unit or extending the number of registers by adding another register file component.
This example system shows that it is possible to run complex distributed components on the MRP. The divided processor core easily fits into the five CEBs.
11 Conclusion
This thesis addresses the usage of partial runtime reconfiguration in a general-purpose environment, such as standard personal computers. Such hybrid-hardware systems are commonly used in high performance computing, single-purpose computers and multi-purpose computers, but not in general-purpose computers yet. Image processing applications, simulation of electromagnetic fields, solid state physics and computer games, among others, can benefit from this integration by bringing their own hardware accelerators. These accelerators can be simple filter algorithms implemented in hardware or many tightly interconnected streaming processors. The requirements for hybrid-hardware systems in general-purpose computing are different from those in high performance computing. Application software changes very fast in general-purpose computing, and the processing tasks are much more variable than in high performance computing. Therefore, many
components in many different sizes have to be configured into the runtime reconfigurable
hardware. This requirement leads to the granularity problem of runtime reconfigurable
design flows. The effects of this problem can be reduced using the grouping and the granularity solution presented in Chapter 6. Platform independence is another requirement
in general-purpose computing because many CPU and FPGA vendors exist. OS integration is also very important to get a wide acceptance of the reconfigurable hardware
by developers and users.
In this thesis a multi-FPGA framework, called MRP, is presented. It uses the granularity solution (Chapter 6) to build an easily extensible reconfigurable system for general-purpose computing. In contrast to many other reconfigurable systems it supports a
packet switched network spanning multiple FPGAs. This network features fast interconnection links up to 4.8Gbit/s. It supports a bridge to 1Gbit/s Ethernet. Through
the Ethernet it is connectable to offboard host systems, such as a workstation or server.
An onboard host system using a PRHS SoC is also available. Operating system support for the OCSN is available, enabling users and developers to access any component
connected to the OCSN using BSD socket programming. This easy access supports
the platform independence because it standardises hardware access to a common API .
No other RS has this kind of OS integration. The MRP is divided into support and
reconfiguration platforms. The first provides access to FPGA board resources like RAM
or storage devices, while the second provides the runtime reconfigurability. The reconfiguration platform is implemented using the PR design flow of Xilinx Virtex5 FPGAs.
Therefore, it is partitioned into many same-sized RMs, called CEBs. These CEBs are interconnected using a CSN and a common signal interface. Through this buildup they reduce the effects of the granularity problem. Components to be used on the MRP
have to be divided into smaller components fitting into a CEB. Through the CSN they
are interconnected to form the complex component again.
Chapter 10 evaluates the MRP according to its area usage, a maximum clock speed measurement and an example CPU-based application.
The example MRP system, presented in this thesis, requires 75% of a Xilinx xc5vlx330
Virtex5 FPGA. The OCSN uses most of this space (43.31%). But this investment in
area provides a very flexible and fast interconnection network with unique features. The
actual hardware providing the runtime reconfiguration uses 54.66% of the used area.
This area can be divided into 32.8% for the CEBs and 21.86% for the CSN . This is
a hardware overhead of 0.6, but there is still improvement potential by increasing the
number of CEBs per switch and optimizing the switch implementation.
Table 10.3 presents a matrix of the propagation delays of all possible CEB connections.
The minimum clock frequency for CEBs connected to one switch is 135MHz using sequential circuits only and 67MHz with at least one combinational circuit. The maximum
clock rates are 162MHz and 81MHz. Common clock rates for normal FPGA designs on
a Virtex5 range from 25MHz up to 200MHz for very optimised designs. Hence, the
measured minimum and maximum clock rates range in between. A reduced clock rate
is the price for the improved flexibility.
The last evaluation property is a complex example application. A 32-bit microcontroller for teaching purposes has been ported to the MRP. It is divided into five CEBs: fetch unit, decode unit, control unit, register file and ALU. The fetch unit requests data words from OCSN components providing RAM, such as the OCSN2BRAM
device. It is even possible to emulate a RAM on the host system using a user space
program. An application on the host system loads the microcontroller program into
some RAM , instantiates all the microcontroller components within the MRP and starts
it. Programs like a simple multiplication or calculating the fibonacci progression run on
this distributed microcontroller without any problems.
This evaluation shows that the MRP fullfils the requirements for a RS in a generalpurpose environment. The implementation of the MRP can be seen as a success.
11.1 Outlook
The development of the MRP is finished, but many development steps to integrate
runtime reconfiguration into general-purpose computing need to be done.
OS support for runtime reconfiguration needs to be improved. At the moment reconfiguration is not part of any modern OS. Most research concerning this topic is done to
evaluate reconfiguration speed and schedule reconfigurable hardware like processes, but
this approach is not feasible at the moment because reconfiguration times are not fast
enough (see Table 1.1). Therefore, a more general approach would be better suited, such
as looking at reconfigurable hardware more like a memory resource, not like a process.
In this way reconfigurable hardware could be requested in a malloc style.
The MRP provides many CEBs for configuration. These CEBs are very similar to the CLBs of the FPGA infrastructure. Another field of research could be to implement a synthesis, placement and routing environment based on the MRP. The first step would be to design a generic CEB component, which could be the target of the synthesis process. The source of this process could be a hardware description in an HDL or even a C program. Such a process would enable the developer to optimise the implementation from two different directions, from the hardware side and from the software side.
Another research topic could be to implement runtime reconfigurable processors on the MRP. Some basic approaches to runtime reconfigurable processors have been made by Dales [16], Hauser et al. [17], Razdan [18], Hallmanseder [15] and Niyonkuru [44]. These approaches could be advanced and tested on the MRP because it provides the basic infrastructure for this research. The implemented microcontroller system is divided into individually reconfigurable CEBs, which is a base requirement for all these reconfigurable processors.
Appendix
A OCSN Frame Types
Table A.1 shows all frame types assigned at the moment.

Type ID  Protocol  Description
0        MAC       used at the data-link layer for identifying remote interfaces and flow control
1        ICMP      used at the application layer for ping-like operation
2        LED       application layer protocol for communication with the LED component
3        DATA      application layer protocol for communication with RAM devices
4        CEB       application layer protocol for communication with CEBs
5        ICAP      application layer protocol for communication with ICAP devices
6        CSN SW    application layer protocol for communication with the CSN switch

Table A.1: Used OCSN frame types
Bibliography
[1] Wikipedia, “14 nanometer — Wikipedia, the free encyclopedia,” May 2014. [Online]. Available: http://en.wikipedia.org/w/index.php?title=14_nanometer&oldid=599971737
[2] Xilinx, Inc., Partial Reconfiguration User Guide, 2010, http://www.xilinx.com.
[3] ——, Virtex-5 FPGA User Guide, 2012, http://www.xilinx.com.
[4] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, “Runtime adaptive multiprocessor system-on-chip: Rampsoc,” in Parallel and Distributed Processing, 2008.
IPDPS 2008. IEEE International Symposium on, Apr. 2008, pp. 1 –7.
[5] M. Eckert, “Fpga-based system virtual machines,” Ph.D. dissertation, Helmut-Schmidt-Universität/Universität der Bundeswehr Hamburg, 2014.
[6] Convey Computer Corporation, Convey Personality Development Kit Reference
Manual, December 2010, http://www.conveycomputer.com.
[7] Xilinx Zynq Product brief, Xilinx Inc., Xilinx Inc., 2100 Logic Drive, San Jose,
CA 95124, USA. [Online]. Available: http://www.xilinx.com/products/silicondevices/soc/zynq-7000/
[8] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics,
vol. 38, no. 8, pp. 114–117, 1965.
[9] M. Bohr, R. Chau, T. Ghani, and K. Mistry, “The high-k solution,” Spectrum, IEEE, vol. 44, no. 10, pp. 29–35, Oct. 2007.
[10] Sun Microsystems, Inc., “OpenSPARC T2 processor design and verification user’s guide,” November 2008, https://www.opensparc.net/.
[11] NVIDIA Corporation, “Nvidia’s next generation cuda compute architecture:
Fermi,” 2009, http://www.nvidia.com/.
[12] C. Kao, “Benefits of partial reconfiguration,” Xcell journal, vol. 55, pp. 65–67, 2005.
[13] J. Von Neumann, “First draft of a report on the edvac,” IEEE Annals of the History
of Computing, vol. 15, no. 4, pp. 27–75, 1993.
[14] K. Williston, “Roving reporter: FPGA + Intel® Atom™ = configurable processor,” Dec. 2010. [Online]. Available: http://embedded.communities.intel.com/community/en/hardware/blog/2010/12/10/roving-reporter-fpga-intel-atom-configurable-processor
[15] D. Hallmannseder and B. Klauer, “Compilerunterstützung für die Dynamische Rekonfiguration eines Mikroprozessors,” in PII Workshop. Hamburg: Technische Informatik, Helmut-Schmidt-Universität, 2009.
[16] M. Dales, “The Proteus processor - a conventional CPU with reconfigurable functionality,” in FPL ’99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1999, pp. 431–437.
[17] J. R. Hauser and J. Wawrzynek, “Garp: A mips processor with a reconfigurable
coprocessor,” in Proceedings of the FCCM’97, 1997, pp. 12–21.
[18] R. Razdan, “Prisc: programmable reduced instruction set computers,” Ph.D. dissertation, Harvard University, Cambridge, MA, USA, 1994.
[19] D. Gohringer, M. Hubner, T. Perschke, and J. Becker, “New dimensions for multiprocessor architectures: On demand heterogeneity, infrastructure and performance through reconfigurability; the Rampsoc approach,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sep. 2008, pp. 495–498.
[20] B. Venners, Inside the Java Virtual Machine. New York, NY, USA: McGraw-Hill,
Inc., 1996.
[21] T. Schwederski and M. Jurczyk, Verbindungsnetze, ser. Leitfäden der Informatik. Teubner, 1996.
[22] T.-Y. Feng, “A survey of interconnection networks,” Computer, vol. 14, no. 12, pp.
12–27, 1981.
[23] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software,” ACM Computing Surveys, vol. 34, no. 2, pp. 171–210, 2002.
[24] H.-D. Ebbinghaus, J. Flum, and W. Thomas, Einführung in die mathematische Logik (5. Aufl.). Spektrum Akademischer Verlag, 2007.
[25] K. Urbanski and R. Woitowitz, Digitaltechnik: ein Lehr- und Übungsbuch, ser. Engineering online library. Springer, 2004.
[26] A. Otero, E. de la Torre, and T. Riesgo, “Dreams: A tool for the design of dynamically reconfigurable embedded and modular systems,” in Reconfigurable Computing
and FPGAs (ReConFig), 2012 International Conference on, 2012, pp. 1–8.
[27] Altera Product Catalog, Altera Inc. [Online]. Available: http://www.altera.com/
literature/sg/product-catalog.pdf
[28] D. Bryant, “Disrupting the data center to create the digital services economy,” June 2014. [Online]. Available: https://communities.intel.com/community/itpeernetwork/datastack/blog/2014/06/18/disrupting-the-data-center-to-create-the-digital-services-economy
[29] ITU-T, “X.200: Information technology - Open Systems Interconnection - Basic reference model: The basic model,” ISO/IEC 7498-1, p. 59, 1994. [Online]. Available: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=20269
[30] A. S. Tanenbaum, “Network protocols,” ACM Comput. Surv., vol. 13, no. 4, pp.
453–489, 1981.
[31] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surv., vol. 38, no. 1, 2006. [Online]. Available: http://doi.acm.org/10.1145/1132952.1132953
[32] K. C. Sevcik and M. J. Johnson, “Cycle time properties of the fddi token ring,”
IEEE Transactions on Software Engineering, vol. 13, 1987.
[33] W. H. Bahaa-El-Din and M. T. Liu, “Register-insertion: a protocol for the next
generation of ring local-area networks,” Computer networks and ISDN systems,
vol. 24, no. 5, pp. 349–366, 1992.
[34] H. Hellwagner and A. Reinefeld, SCI: Scalable Coherent Interface. Springer, 1999.
[35] G. Barnes, R. Brown, M. Kato, D. J. Kuck, D. Slotnick, and R. Stokes, “The illiac
iv computer,” Computers, IEEE Transactions on, vol. C-17, no. 8, pp. 746–757,
Aug 1968.
[36] R. Knecht, “Implementation of divide-and-conquer algorithms on multiprocessors,” in Parallelism, Learning, Evolution, ser. Lecture Notes in Computer Science, J. Becker, I. Eisele, and F. Mündemann, Eds. Springer Berlin Heidelberg, 1991, vol. 565, pp. 121–136. [Online]. Available: http://dx.doi.org/10.1007/3-540-55027-5_7
[37] N. Grebenjuk, “Connecting of OCSN to PRHS framework,” Bachelor’s thesis, Helmut Schmidt University, 2014.
[38] Wikipedia, “Linux — wikipedia, the free encyclopedia,” February 2014. [Online].
Available: http://en.wikipedia.org/w/index.php?title=Linux&oldid=597293747
[39] R. Biddappa, “Clock domain crossing,” The Cadence India Newsletter, pp. 2–8, May
2005. [Online]. Available: http://www.cadence.com/india/newsletters/icon 200505.pdf
[40] C. E. Cummings, “Simulation and synthesis techniques for asynchronous fifo design,” in SNUG 2002 (Synopsys Users Group Conference, San Jose, CA, 2002)
User Papers, 2002.
[41] A. Athavale and C. Christensen, High-speed serial I/O made simple.
[42] R. Love, Linux-Kernel-Handbuch: Leitfaden zu Design und Implementierung von
Kernel 2.6, ser. Open source library. Addison-Wesley, 2005.
[43] M. Ruffoni and A. Bogliolo, “Direct measures of path delays on commercial fpga chips,” in Signal Propagation on Interconnects, 6th IEEE Workshop on. Proceedings, May 2002, pp. 157–159.
[44] A. Niyonkuru and H. C. Zeidler, “Designing a runtime reconfigurable processor for
general purpose applications,” in IPDPS, 2004.