

Energy Efficient Digital Electronic Systems Design for Edge-Computing Applications, through Innovative RISC-V Compliant Processors.

By

Abdallah Cheikh عبدالله نبيل الشيخ

A Thesis Submitted to the Department of Information Electronics and Telecommunication Engineering (DIET) La Sapienza Università di Roma

DOCTOR OF PHILOSOPHY

February 2020



Prof Mauro Olivieri (Thesis Supervisor)

## Acknowledgements

These three years of doing research at the LSD lab at Sapienza have passed like the wind. Throughout my PhD journey, I slowly transformed from a High-Power Electrical Engineer into a Low-Power Computer Architect for IoT devices and Embedded Systems. The only thing I regret is the strain I placed on my eyes by staring at a computer screen for over 12 hours a day. However, I am much more thankful than regretful for the things I have experienced in this wonderful journey.

First and foremost, I am thankful to Almighty God الله for giving me everything I have asked for, and more. He facilitated my means to travel to Italy, He surrounded me with kind-hearted and supportive people who helped me in a foreign country whose language I barely understood, and He was always my guide in the good and the bad moments of life. I never had to beg anyone for anything, nor did I at any point struggle to sustain myself. God is great, and no matter how many times I thank Him, I feel that it is not enough and that I should be ever more thankful.

Needless to say, I am thankful to both my parents, who never batted an eye when I asked them for help. They covered any financial shortcomings I had throughout my PhD and provided me with every kind of support in order to help me succeed in my career. My parents have never once abandoned me, and they always prayed for my success. As they grow older, I wish to support them by giving back at least a fraction of what they have given me my entire life.

I would also like to thank my professor and thesis supervisor Mauro Olivieri, to whom I feel eternally grateful for his support throughout my PhD. I contacted Mauro in early July 2016 about the chance to pursue a PhD at Sapienza, and he quickly responded to my request, helping me every step of the way in the application process until I was finally admitted to the PhD program. Mauro continued his support and guidance throughout the years, providing me with opportunities to attend conferences in different parts of Italy and Europe. In May 2018 he again provided me with the unforgettable and wonderful opportunity to move to Barcelona and collaborate on the European Processor Initiative (EPI) project. In Barcelona, I met amazing people and gained a lot of experience in the field of computer architecture. To this day Mauro continues to be a great support, constantly providing me with wonderful opportunities at every turn, and for that I am always very grateful and have great respect for all that he has done for me.

Furthermore, I am grateful to my amazing friend and mentor Antonio Mastrandrea. Antonio was there for me from day one in Italy. He is the reason I managed to stay on my feet in Italy without getting lost or stranded. He helped me with literally anything I asked for, and was basically my guide for everything in Italy. Throughout my PhD he also continuously provided support in the various areas in which I lacked experience. Antonio is a great friend and a great support, and I really enjoyed his company throughout my PhD. Thank you, Antonio!

I have met a great many people in the past three years, among them my great friend Simone Ponzio. Simone volunteered to help me with my work continuously for more than 8 months. He helped me perform the earliest parts of the verification of my work in the second year, and he was also a great friend and a very fun guy to be around. Then came along my colleague and dear friend Stefano Sordillo. Without Stefano's amazing hard work and his collaboration on the areas of common interest in our research, my work would not have flourished as it has today. Stefano was the software developer who wrote the complex tests that benchmarked my work, and he was a great help at all times, even on the weekends. To all the people I have met, I want to say that you were all amazing, and thank you all for giving me the pleasure of meeting you.

As a final note, I would like to express my gratitude to the Italian government and its vision of providing a career opportunity to foreign and national students equally, by allowing them to pursue a PhD at its expense and without any bias in the selection process, be it race, nationality, gender, or religion. In that sense I consider Italy to be a model country, and I owe my thanks to all the Italians for their kindness and hospitality towards me and other foreign researchers as well.

# Table of Contents

- Table of Contents
- List of Figures
- List of Tables
- Abstract
- Organization of the Dissertation
- Chapter 1 Preface
  - 1.1. Internet of things
  - 1.2. Energy efficient IoT devices
  - 1.3. Artificial neural networks
- Chapter 2 RISC-V and the Klessydra Processor Family
  - 2.1. Motivation behind adopting RISC-V
  - 2.2. Background
  - 2.3. Instruction set architecture briefing
  - 2.4. Custom instruction set extensions
  - 2.5. RISC-V support in Klessydra
  - 2.6. Patches to the riscv-gnu-toolchain
  - 2.7. Concluding remarks
- Chapter 3 The PULPino Microcontroller Platform
  - 3.1. Motivation behind choosing PULPino
  - 3.2. Background
  - 3.3. PULPino native processor cores
  - 3.4. Embedding non-native Klessydra processing cores in PULPino
- Chapter 4 Klessydra T0 Architecture
  - 4.1. The Klessydra-T family
  - 4.2. Motivation for choosing interleaved multithreading
  - 4.3. Klessydra-T0 introduction and background information
  - 4.4. Choosing the optimal IMT pipeline organization
  - 4.5. Deeper pipeline organizations
  - 4.6. The T03 core
  - 4.7. Trap handling
  - 4.8. Thread synchronization
  - 4.9. Conclusion
- Chapter 5 Klessydra-T1 Architectures
  - 5.1. Background
  - 5.2. Motivation for augmenting the T03 core with a hardware accelerator
  - 5.3. Special Purpose Mathematical Unit Microarchitecture
  - 5.4. SPMU Implementations
  - 5.5. Performance evaluation of the SPMU implementations
  - 5.6. Area, Power, and Energy Reports
  - 5.7. Further Evaluations (memory test, GCC optimizations)
- Chapter 6 C Language Software Suite
  - 6.1. Instruction level testing
  - 6.2. Convolution tests
  - 6.3. Supplementary VGG16 libraries
- Conclusions
- Appendix A
- Appendix B
- Glossary
- Bibliography

# List of Figures

- Figure.1.1. Graph depicting Moore's Law that predicted the doubling of the transistors per die every two years
- Figure.1.2. Typical IoT devices in homes
- Figure.1.3. The bandwidth growth with the frequency growth
- Figure.1.4. Coverage area for a set of transmission frequencies
- Figure.1.5. Number of IoT devices to non-IoT and their projected growth
- Figure.1.6. Typical depiction of an IoT Embedded System
- Figure.1.7. Layers in an artificial neural network
- Figure.1.8. Accuracy versus number of operations single forward pass for a certain class of CNN
- Figure.2.1. Base Instruction Formats
- Figure.3.1. Propagation delay versus power supply voltage
- Figure.3.2. Architecture of PULPino
- Figure.3.3. Klessydra family roadmap
- Figure.4.1. Conceptual view of hardware context counter (harc) interleaved execution
- Figure.4.2. (a) Klessydra T033 datapath, three harts interleave from RF to WB,
- Figure.4.3. (a) Klessydra T044 datapath five pipeline stage but still works by interleaving only four harts
- Figure.4.4. Klessydra T033 block organization, interleaves three harts in the instruction pipeline
- Figure.5.1. Klessydra T133 block organization, interleaves three harts and has three execution units working in parallel
- Figure.5.2. SPMU Block Diagram
- Figure.5.3. Partial Adder Circuit in SIMD=4
- Figure.5.4. Partial Multiplier Circuit in SIMD=4
- Figure.5.5. Partial Right Shifter Circuit in SIMD=4
- Figure.5.6. Diagram of the Shared-SPMU, all accesses to the SPMU are shared by all the harts
- Figure.5.7. Diagram of dedicated SPI shared SPE model. Each hart has a dedicated set of scratchpads, busy signals will only block the hart belonging to the same SPMU
- Figure.5.8. Diagram of Dedicated-SPMU, each hart has a dedicated SPE and SPI, a busy signal will only block the hart belonging to the same SPMU
- Figure.5.9. Number of cycles taken to perform an arithmetic vector operation without the SPMU
- Figure.5.10. Cycle time using the SPMU with SIMD=1 and hardware loops disabled
- Figure.5.11. Cycle time using the SPMU with SIMD=1 and hardware loops enabled
- Figure.5.12. Cycle time using the SPMU with SIMD=4 and hardware loops enabled
- Figure.5.13. Speed boost from exploiting the DLP, TLP, and both together (Hybrid)
- Figure.5.14. Total execution time to perform convolutions when running at the maximum attainable frequency for accelerated and non-accelerated implementations
- Figure.5.15. Layers of the VGG16 deep convolutional neural network
- Figure.5.16. KlessydraT13 Shared-SPMU, Single Thread Vs Multithread cycle count per layer for VGG16
- Figure.5.17. KlessydraT13 Dedicated-SPMU SIMD-2, vs Zeroriscy cycle count per layer for VGG16 execution
- Figure.5.18. Dynamic Power Consumption of the T13 core running 32x32 convolutions
- Figure.5.19. Energy Consumption for running each implementation at the top frequency on the different convolution sizes
- Figure.5.20. Vector addition C test performed with GCC optimizations disabled
- Figure.5.21. Vector addition C test performed with GCC optimizations enabled
- Figure.6.1. Convolution of feature map on the left and kernel map on the right
- Figure.6.2. Convolution of feature map on the left and kernel map on the right
- Figure.6.3. Division of the sub-kernels. On the left shows the overlap with sub-kernel F
- Figure.6.4. Sub-Kernel F executed in the SPMU
- Figure.6.5. Discrete Kmemlds for zeropadded implementations
- Figure.6.6. Zero-Padded Convolution method using the SPMU instructions

## List of Tables

- Table.2.1. RISC-V mnemonics for RISC-V integer and floating point registers
- Table.2.2. RAS stack prediction hints
- Table.2.3. RISC-V base opcode map, inst[1:0] = 11, i.e. compressed instructions are not included in the table
- Table.4.1. Resource Utilization, and Minimum cycle time [ns]
- Table.4.2. Throughput at Maximum Frequency [MIPS] (N.A. = NOT APPLICABLE)
- Table.4.3. Average Dynamic Power at Maximum Clock Frequency [mW] (N.A. = NOT APPLICABLE)
- Table.4.5. Control and status registers supported by Klessydra cores
- Table.5.1. Type, and parallelism of the functional units in the SPE
- Table.5.2. Cycle number to execute a set of convolutions
- Table.5.3. Top frequency for each T13 configuration and Riscy Cores
- Table.5.4. T13 Area Utilization on FPGA for all SPMU Configurations
- Table.5.5. Size in Bytes of the program memory and data memory for different tests

## Abstract

The number of IoT devices has increased greatly over the years, to the point that they now pervade the electronics market. IoT describes device-to-device communication without human interaction. A large class of these devices is battery powered, and their energy consumption is considered critical.

Today's embedded IoT systems interface multiple peripherals, such as sensors that continuously monitor the environment around them and actuators that are controlled by the embedded system. They also interface wireless devices for data transmission. Part of their job is some basic pre-processing of the data before transmitting it over those wireless networks. Such pre-processing "on the edge of the network" minimizes the data to be transmitted over the wireless channels, so that only the desired outputs are transmitted.

In the face of the increasing demand to support pre-processing tasks, such as computer vision and voice recognition, on small embedded systems on the edge of the network, these systems cannot completely satisfy that demand because of their limited performance.

In this study we demonstrate the performance and energy efficiency of interleaved multithreaded architectures, which can be used in an embedded system on the edge of the IoT that interfaces multiple sensors and peripherals, each serviced by a different hardware thread. We show the optimal pipeline organization to use in such architectures. We then demonstrate how these architectures can be exploited to easily improve instruction level parallelism by integrating a convolutional neural network accelerator that performs very fast vector arithmetic operations. Finally, we benchmark this accelerator by running a custom implementation of the VGG16 convolutional neural network.

The microprocessors presented are part of a family of processing cores called *Klessydra*. The Klessydra microprocessors were written so that their pinout is 100 percent identical to that of the Riscy cores of the PULPino SoC. The subset of the Klessydra cores presented in this thesis is called *Klessydra-T*, the letter 'T' indicating that the cores are multithreaded. The Klessydra-T subset has two main implementations used throughout this thesis: *Klessydra-T03* and *Klessydra-T13*, or *T03* and *T13* for short.

The processor cores have been tested with the Modelsim / Questasim simulators. The cores have been synthesized on the 7-series FPGAs from Xilinx with the Vivado synthesis tool, and both synthesis and post-synthesis simulations have been performed. Dynamic power estimations were calculated by Vivado from the power report generated by Modelsim after simulating a post-synthesis Vivado netlist. FPGA synthesis was chosen as our target implementation because FPGAs provide high reconfigurability, which allows users to easily customize their own accelerator and adapt it to their specific applications.

In our assessment throughout this thesis we nominated the T03 interleaved multithreaded processor as our optimal and most balanced pipeline organization. The T03 core had many advantages over other architectures; however, it was only suitable for control applications. T13 solves this problem by implementing superscalar hardware accelerators. A hybrid implementation of the hardware accelerator, targeting thread level parallelism and slight data level parallelism, was the approach that yielded the highest performance while still maintaining a relatively low energy consumption for energy critical environments.

### Organization of the Dissertation:

- Chapter 1 The dissertation starts with a preface that provides a brief literature review of IoT devices and the convergence between cloud computing and embedded systems.
- Chapter 2 The second chapter gives an overview of the RISC-V ISA, focusing on the instruction sets implemented in Klessydra-T and the custom instructions appended to the native RISC-V ISA.
- Chapter 3 The third chapter provides an overview of the PULPino SoC and describes the modifications made to the PULPino environment that made it possible to integrate Klessydra.
- Chapter 4 The fourth chapter introduces the Klessydra-T0. In this chapter we investigate the optimal pipeline organization to adopt through a series of experimental and analytical studies. The building blocks of the Klessydra-T0 are then illustrated, and we show some basic libraries written to complement the hardware side with software code.
- Chapter 5 The fifth chapter introduces the Klessydra-T1 and presents the hardware accelerator added to the T1 core. The accelerator is then benchmarked in three different implementation approaches, and we deduce which approach is the most suitable. The accelerator is benchmarked with a VGG16 DCNN test, and the benchmarking procedure is described.
- Chapter 6 The sixth chapter presents the software suite of the tests that were used to benchmark the accelerator in chapter 5. It demonstrates how the convolutions were implemented on the accelerator and briefly shows how the different structures in the VGG16 test were written.
- Conclusion We conclude by summarizing the results presented in chapters four and five.
- Appendix A Contains the Klessydra technical manual detailing the implementation, the ISA support, the architecture, and the CSR instructions in the Klessydra-T cores.
- Appendix B Contains the RTL of the Klessydra-T. The T1 and the T0 implementations, as well as all the configurations detailed in chapter 5, can be generated from the PKG file.

## **Chapter 1 Preface**

This chapter is a preface to the work detailed in this study. In the first section we provide a brief introduction to IoT devices and their growth in the current electronics market. In the next section we discuss artificial neural networks, focusing on computer vision and convolutional neural networks. In the last section we show the convergence of AI applications from cloud computing to embedded low power IoT devices. We then discuss the energy efficient digital system developed in this study, which targets the IoT market and facilitates the execution of the CNNs that are steadily being embedded in IoT devices.

### 1.1. Internet of things

The MOSFET was the main driver for the rise of the Internet of things. The scaling of the MOSFET down to the nanoscale was accompanied by a corresponding reduction in power consumption. As of 2019, the smallest MOSFETs in production are 5nm FinFETs manufactured by Samsung and TSMC [1][2]. Gordon Moore observed the shrinking of the transistor and predicted that the number of transistors on an integrated circuit would approximately double every two years (figure 1.1), with the speed doubling every 18 months without increasing the power [3].



Figure.1.1. Graph depicting Moore's Law that predicted the doubling of the transistors per die every two years

However, the world was still far from becoming fully connected. Two main inventions provided the next milestone that facilitated the convergence towards an IoT world: the first was the development of high-performance multi-core processors, and the second was the emergence of high bandwidth wireless technologies.

The nanoscale parallel microprocessors were capable of processing large chunks of data with very little energy consumption. This in turn encouraged the incorporation of these smart technologies into all types of electronic devices, especially battery powered devices (figure 1.2). One main example of the use of these smart devices, other than the home automation domain shown in the figure, was their deployment for sensing and monitoring tasks, such as office monitoring, agricultural monitoring, traffic monitoring, defense monitoring, space monitoring, and even human monitoring through medical devices and wearable technologies. These areas were equipped with sensing instrumentation (for temperature, humidity, fire, air pollution, traffic jams, rain, wind, storms, etc.) [4].



Figure.1.2. Typical IoT devices in homes

However, these smart devices needed to be accessed over long distances, and this is where the emergence of wireless technologies played a key role, as they were capable of connecting two nodes over large distances. One main drawback of wireless transmission was that the larger the distance between the two nodes, the more transmission power was needed to keep them connected. Another challenge was the steep increase over the years in the bandwidth required by certain streaming applications; to provide these large bandwidths, wireless technologies needed to transmit at higher frequencies in the spectrum, as shown in figure 1.3.



However, the power required to transmit a given packet of data over a given distance 'X' at a higher frequency is much greater than the power required to transmit the same packet at a lower frequency, and figure 1.3 shows that larger bandwidths are broadcast at higher frequencies. The tradeoff between coverage area and frequency when transmitting at the same power is shown in figure 1.4.



Figure 1.4 shows that the coverage area when transmitting at the same power (dBm) but in different frequency ranges varies widely: transmitting at 700 MHz covers 3.5 times the distance covered when transmitting at 2.5 GHz.
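The relation between frequency and reach can be illustrated with a minimal sketch. It assumes the idealized free-space (Friis) path loss model with isotropic antennas, which the text does not state; real coverage also depends on terrain, obstacles, and antenna design. The function names and the 100 dB link budget are hypothetical, chosen only for illustration.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def fspl_db(distance_m: float, freq_hz: float) -> float:
    """Free-space path loss in dB (assumed model, isotropic antennas)."""
    return 20 * math.log10(4 * math.pi * distance_m * freq_hz / SPEED_OF_LIGHT)


def max_range_m(loss_budget_db: float, freq_hz: float) -> float:
    """Largest distance whose free-space loss stays within the budget."""
    return SPEED_OF_LIGHT * 10 ** (loss_budget_db / 20) / (4 * math.pi * freq_hz)


budget_db = 100.0  # hypothetical link budget, identical for both carriers
range_ratio = max_range_m(budget_db, 700e6) / max_range_m(budget_db, 2.5e9)
print(f"range at 700 MHz / range at 2.5 GHz = {range_ratio:.2f}")
```

Under this model the reachable distance is inversely proportional to the carrier frequency, so the ratio is 2500/700, roughly 3.6, which is in line with the factor of 3.5 quoted above.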

The challenge was to accommodate the demand to transmit high data bandwidths over very large distances while still maintaining low power consumption. Thus came the third milestone: connecting these smart devices to local gateways, either by wire or wirelessly, with the gateways in turn connected to a global system of interconnected nodes that communicate using the open TCP/IP protocol suite, otherwise known as the internet.

Providing internet connectivity to smart devices made them capable of transmitting very high data bandwidths at high frequencies to local wireless nodes that are only a few meters away from the transmitter. These communicating nodes form what are known as wireless local area networks (WLAN). The WLANs are then connected to the internet and provide global access to these smart devices. This connection of the various smart devices over the internet is what is now known as the Internet of Things (IoT).

However, not every device that has IP connectivity is considered IoT. For example, desktops, laptops, cellphones, tablets, and game consoles are not considered to be IoT [27]. The IoT is a network of devices that can communicate without human interaction. In other words, it is a network of things. Figure 1.5 shows the number of IoT devices available to date, and their projected growth over the next five to six years.

IoT encompasses only device-to-device interactions and connectivity. Although human interaction can be present at some endpoint of the IoT network, all the intermediate device communications are considered IoT. For example, a wearable smart watch interacts with cellphones over wireless personal area networks (WPAN) and with cellular mobile stations through LTE, and connects to GPS systems to provide continuous tracking. All these communications are part of the IoT network, and the final presentation to the human interface is the non-IoT human factor in this network [27].



Figure.1.5. Number of IoT devices to non-IoT and their projected growth

### 1.2. Energy efficient IoT devices

The integration of convolutional neural networks into low power embedded IoT devices, for image recognition and classification, has increased gradually in recent years [5]. IoT devices have been able to move AI algorithms from cloud computing down to edge computing [6]. IoT endpoint SoCs are a large class of microcontrollers that interface various classes of sensors on one end and a wireless device on the other. IoT end-nodes might contain specialized units for fast memory access, such as scratchpad units [22]. The IoT end-node design demands low-power specialized processors [24][25][26], which are used to collect and pre-process information from the peripheral devices and send the data over the wireless channel (figure 1.6). In many cases pre-processing includes speech and/or image recognition. This is why we developed a RISC-V processor that can serve IoT applications interfacing multiple peripheral devices, and that can also pre-process images quickly with high performance and energy efficient CNN accelerators.



Figure.1.6. Typical depiction of an IoT Embedded System

### 1.3. Artificial neural networks

#### 1.3.1. Background

The human brain is a collection of billions of neurons that are connected to each other through synapses and can pass a signal from one neuron to the next either electrically or chemically. Artificial neural networks (ANNs), although not identical to biological neural networks, were inspired by them. They aim to loosely imitate the behavior of the brain, emulating its learning ability in order to solve some of the problems the brain solves.

ANNs are collections of artificial neurons that connect to one another to form a large system. These systems are an aggregate of layers that are connected to each other. They are capable of learning through continuous feedback loop connections, or through algorithms in single-forward pass networks that modify the weights after the whole operation is done (as is the case in feed-forward networks like convolutional neural networks). During the learning process, the system adjusts the weights, which can either strengthen or weaken the connection between two neurons. Figure 1.7 shows the basic organization of an ANN.

The first layer, known as the input layer, takes the external data, performs a transformation on these data, and sends its output to the next layer. The final layer of the network is the output layer, which infers the final result from all the transformations of the previous layers. Between the input and the output layers, there might exist some intermediate layers, also known as hidden layers (figure 1.7).



Figure.1.7. Layers in an artificial neural network

The layers can be fully-connected, with every neuron in layer[i] connected to every neuron of layer[i+1], or they can use pooling connections, which connect a set of neurons in layer[i] to a single neuron in layer[i+1], thereby reducing the number of neurons in layer[i+1].
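The two connection patterns can be illustrated with a minimal C sketch. The function names, the row-major weight layout, the bias term, and the choice of *max* pooling over a one-dimensional window are assumptions made for illustration; they are not taken from the Klessydra software libraries.

```c
#include <stddef.h>

/* Fully-connected layer: every output neuron sums over every input neuron.
 * weights is row-major with n_out rows of n_in entries (an assumed layout). */
void fc_forward(const float *in, size_t n_in,
                const float *weights, const float *bias,
                float *out, size_t n_out)
{
    for (size_t o = 0; o < n_out; ++o) {
        float acc = bias[o];
        for (size_t i = 0; i < n_in; ++i)
            acc += weights[o * n_in + i] * in[i];
        out[o] = acc;
    }
}

/* Max pooling: each output neuron sees only one window of `window` inputs,
 * so the next layer has n_in / window neurons. Returns that count. */
size_t max_pool_1d(const float *in, size_t n_in, size_t window, float *out)
{
    if (window == 0)
        return 0;
    size_t n_out = n_in / window;
    for (size_t o = 0; o < n_out; ++o) {
        float best = in[o * window];
        for (size_t k = 1; k < window; ++k)
            if (in[o * window + k] > best)
                best = in[o * window + k];
        out[o] = best;
    }
    return n_out;
}
```

The fully-connected layer performs n_in × n_out multiply-accumulate operations, whereas the pooling layer performs none and shrinks the layer by the window size.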

#### 1.2.2. Learning in ANN

Learning is a continuous process of adjusting the connections between neurons by modifying the weights, so that the output converges towards the correct result with each iteration of running the network. Learning can be considered complete when the error rate reaches zero (the ideal case), or when further iterations no longer reduce the error rate. To avoid oscillations of the weights inside the neural network during learning, adaptive learning must be implemented in order to maintain a gradient ascent or descent of the weights.

Final results of the network are mapped into a probability distribution of predicted outputs by using normalizing functions such as *softmax*. However, the actual output might not be the desired output.
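For reference, the softmax function mentioned above is commonly defined as follows, where $z_i$ denotes the raw output of the $i$-th neuron of the output layer and $K$ the number of output classes (both symbols are introduced here for illustration):

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

Each $p_i$ lies between 0 and 1 and the values sum to 1, which is what allows the outputs to be read as a probability distribution.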

The error rate in an ANN does not typically reach zero, even after learning is done. A cost function compares the desired results with the actual results. If the error rate determined by the cost function is deemed too high, the network is not designed well, and redesigning it must be considered.

#### 1.2.3. Deep Convolutional Neural Networks and Deep Learning

A deep neural network (DNN) is a type of ANN with a large number of layers between the input and the output layers. The extra layers in a DNN enable the extraction of features from the previous layers. DNNs are feedforward in nature: they do not provide feedback to the previous layers, and the weights are adjusted at the end of the network after the probability distribution has been calculated.

One of the main fields of DNNs is convolutional, or deep convolutional, neural networks (CNN / DCNN), which are used in computer vision [28] and speech recognition. CNNs employ mathematical convolutions in order to transform the input data into the output, and they typically end with fully-connected layers, in which each neuron in one layer connects to all the neurons in the next. A large number of CNNs have been developed over the years. Figure 1.8 arranges them by accuracy versus the number of operations in a single forward pass, that is, how many operations (G-Ops) are required to transform the input data of the network into the output result. The size of the circles indicates the memory footprint of each network.



Figure.1.8. Accuracy versus number of operations single forward pass for a certain class of CNN
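To make the operation count of a forward pass concrete, the following minimal C sketch computes one convolutional layer and counts its multiply-accumulate operations. It assumes a single channel, integer data, stride 1, no padding, and a square kernel; the function name and signature are illustrative only.

```c
#include <stddef.h>

/* "Valid" 2D convolution as used in CNNs (no kernel flip), stride 1,
 * no padding. Arrays are row-major; the output has
 * (in_h - k + 1) x (in_w - k + 1) elements.
 * Returns the number of multiply-accumulate operations performed. */
size_t conv2d_valid(const int *in, size_t in_h, size_t in_w,
                    const int *kernel, size_t k, int *out)
{
    if (k == 0 || k > in_h || k > in_w)
        return 0;
    size_t out_h = in_h - k + 1, out_w = in_w - k + 1;
    size_t macs = 0;
    for (size_t r = 0; r < out_h; ++r) {
        for (size_t c = 0; c < out_w; ++c) {
            int acc = 0;
            for (size_t i = 0; i < k; ++i) {
                for (size_t j = 0; j < k; ++j) {
                    acc += in[(r + i) * in_w + (c + j)] * kernel[i * k + j];
                    ++macs;
                }
            }
            out[r * out_w + c] = acc;
        }
    }
    return macs;
}
```

The returned count grows with the output size times the kernel area, which is why deeper networks with larger feature maps sit further to the right in plots such as figure 1.8.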

## Chapter 2 RISC-V and the Klessydra Processor Family



#### 2.1. Motivation behind adopting RISC-V

The first step in building Klessydra, a largely open-source family of processing cores, was choosing an instruction set. Being a group of researchers with limited funding, we chose to adopt an open instruction set free from royalties.

Our motivation for adopting the RISC-V instruction set was essentially the same as that of the team at the University of California, Berkeley when they developed the RISC-V ISA: to make instruction sets free. A second reason was that RISC-V was designed to suit all types of architectures: in-order, out-of-order, embedded low-power, supercomputers, and so on. The third reason was that RISC-V provides encoding space for custom instructions, which has helped the research community flourish by allowing students, researchers, and industry to test and experiment with their own non-native instruction sets.

Also, comparing RISC-V with OpenRISC, RISC-V is a more revised and well-studied ISA, which made it a better option to adopt for several reasons. Most importantly, OpenRISC uses condition codes and branch delay slots, which complicate higher-performance implementations. In addition, OpenRISC's fixed-size 16-bit immediates leave little encoding space for the ISA to grow.

#### 2.2. Background

RISC-V is an open instruction set architecture. The project was started in 2010 at the University of California, Berkeley, and the ISA specification continues to expand to the present day.

The ISA is based on reduced instruction set computing, and it is specified in two reference manuals: the user-level ISA and the privileged architecture [7]. The main inspiration for an open instruction set was the availability of the open-source Linux operating system and the open networking protocols TCP/IP [8], which raised the question of why instruction sets cannot be free as well. This motivated the engineers at Berkeley to create an open, royalty-free ISA. Commercial ISAs from Intel, ARM, and IBM are proprietary, which limited research in computer architecture largely to those companies themselves. To adopt those ISAs, one must undergo a rigorous negotiation process that can take about six to twenty-four months.

To date, RISC-V has supported the computer architecture research and education community in developing their own proprietary or open-source processors. Currently there are tens of RISC-V implementations, such as Rocket, RI5CY, Ariane, Klessydra, BOOM, Taiga, and many more [9]. One of the project's main future goals is to have the instruction set adopted in industry implementations as well.

In the next sections of this chapter we briefly summarize the RISC-V instruction sets, then discuss one major advantage of RISC-V that has enabled researchers to innovate further in the computer architecture domain by giving more implementation freedom to the user. Finally, we discuss which architecture and ISA extensions were adopted in the Klessydra-T cores presented in this thesis.

#### 2.3. Instruction set architecture briefing

The foundation of RISC-V is the base integer ISA, which must be present in any implementation. The base integer ISA is the backbone of the entire standard; it delivers a minimal set of instructions sufficient to serve as a target for compilers, linkers, assemblers, and operating systems. The base integer ISA can be implemented for both 32-bit and 64-bit architectures.

The base integer ISA is labeled "I" and is prefixed by either "RV32" or "RV64". It supports 32 general-purpose registers, "x0" to "x31", with "x0" being a read-only register hardwired to 0. Table 2.1 shows the application binary interface (ABI) names of the integer and floating-point register files.

| Register | ABI Name | Description                       | Saver  |
|----------|----------|-----------------------------------|--------|
| x0       | zero     | Hard-wired zero                   |        |
| x1       | ra       | Return address                    | Caller |
| x2       | sp       | Stack pointer                     | Callee |
| x3       | gp       | Global pointer                    |        |
| x4       | tp       | Thread pointer                    |        |
| x5       | t0       | Temporary/alternate link register | Caller |
| x6-7     | t1-2     | Temporaries                       | Caller |
| x8       | s0/fp    | Saved register/frame pointer      | Callee |
| x9       | s1       | Saved register                    | Callee |
| x10-11   | a0-1     | Function arguments/return values  | Caller |
| x12-17   | a2-7     | Function arguments                | Caller |
| x18-27   | s2-11    | Saved registers                   | Callee |
| x28-31   | t3-6     | Temporaries                       | Caller |
| f0-7     | ft0-7    | FP temporaries                    | Caller |
| f8-9     | fs0-1    | FP saved registers                | Callee |
| f10-11   | fa0-1    | FP arguments/return values        | Caller |
| f12-17   | fa2-7    | FP arguments                      | Caller |
| f18-27   | fs2-11   | FP saved registers                | Callee |
| f28-31   | ft8-11   | FP temporaries                    | Caller |

Table.2.1. RISC-V mnemonics for RISC-V integer and floating point registers

The return address register "x1" is not written automatically by the hardware on function calls; rather, the jump instructions used for calls use register "x1" by convention to hold the return address. Each hardware thread or core has its own stack pointer "x2", which in RISC-V always points to the beginning of the stack, and loads and stores to the stack are addressed relative to this base address (i.e. the stack pointer in this case).

The base ISA has four instruction formats, as shown in figure 2.1. All instructions have a fixed 32-bit length and must be aligned on a 32-bit boundary.

| 31-25      | 24-20      | 19-15      | 14-12      | 11-7     | 6-0    | Format |
|------------|------------|------------|------------|----------|--------|--------|
| funct7     | rs2        | rs1        | funct3     | rd       | opcode | R-type |
| imm[11:5]  | imm[4:0]   | rs1        | funct3     | rd       | opcode | I-type |
| imm[11:5]  | rs2        | rs1        | funct3     | imm[4:0] | opcode | S-type |
| imm[31:25] | imm[24:20] | imm[19:15] | imm[14:12] | rd       | opcode | U-type |

Figure.2.1. Base Instruction Formats

The source (*rs1*, *rs2*) and destination (*rd*) register fields are always in the same positions in order to keep decoding simple. The immediates are always sign-extended, except for CSR immediates.
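The fixed field positions and the sign extension of the immediates can be sketched in C as follows. The helper names are illustrative and not part of any RISC-V toolchain header; the bit positions follow the base instruction formats of figure 2.1.

```c
#include <stdint.h>

/* Register and function fields sit at the same bit positions in every format. */
static inline uint32_t rv_opcode(uint32_t insn) { return insn & 0x7fu; }
static inline uint32_t rv_rd(uint32_t insn)     { return (insn >> 7) & 0x1fu; }
static inline uint32_t rv_funct3(uint32_t insn) { return (insn >> 12) & 0x7u; }
static inline uint32_t rv_rs1(uint32_t insn)    { return (insn >> 15) & 0x1fu; }
static inline uint32_t rv_rs2(uint32_t insn)    { return (insn >> 20) & 0x1fu; }
static inline uint32_t rv_funct7(uint32_t insn) { return insn >> 25; }

/* Sign-extend a 12-bit value held in the low bits of imm. */
static inline int32_t rv_sext12(uint32_t imm)
{
    return (int32_t)((imm & 0xfffu) ^ 0x800u) - 0x800;
}

/* I-type immediate: insn[31:20]. */
static inline int32_t rv_imm_i(uint32_t insn)
{
    return rv_sext12(insn >> 20);
}

/* S-type immediate: insn[31:25] supplies imm[11:5], insn[11:7] supplies imm[4:0]. */
static inline int32_t rv_imm_s(uint32_t insn)
{
    return rv_sext12(((insn >> 25) << 5) | ((insn >> 7) & 0x1fu));
}

/* U-type immediate: insn[31:12] forms the upper 20 bits, the low 12 bits are zero. */
static inline uint32_t rv_imm_u(uint32_t insn) { return insn & 0xfffff000u; }
```

For example, the word `0xfff10093` encodes `addi x1, x2, -1`: `rv_rd` yields 1, `rv_rs1` yields 2, and `rv_imm_i` yields -1.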

The base ISA is divided into six categories of instructions:

- The **integer computational instructions** comprise arithmetic, logic, and shift operations. Most of them use either the I-type or the R-type format; LUI and AUIPC use the U-type.
- The **control transfer instructions** comprise conditional and unconditional jumps. Conditional jumps are relative to the program counter and do not link any registers. Unconditional jumps can behave like a *goto* statement if there are no pushes to the return address stack (RAS), or they can behave like function calls or function returns by pushing to and popping from the RAS (Table 2.2).

| rd    | rs1   | rs1=rd | RAS action            |
|-------|-------|--------|-----------------------|
| !link | !link | -      | none                  |
| !link | link  | -      | pop                   |
| link  | !link | -      | push                  |
| link  | link  | 0      | push and pop          |
| link  | link  | 1      | push                  |

Table.2.2. RAS stack prediction hints

- The **load and store** instructions compute the memory address by adding the base address stored in rs1 to the immediate in the instruction. Load instructions use the I-immediate, and store instructions use the S-immediate. They can read or write bytes, half-words, and words.
- The **memory fence** instructions order memory accesses between harts: fencing ensures that one hart's memory accesses are performed before those of another hart.
- The **control and status instructions** access the CSRs and modify those that are not read-only. A large subset of the CSRs is used for performance counting.
- Finally, the **environment call and breakpoint** instructions transfer execution to a more privileged environment or to a debugger.
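The return address stack hints of Table 2.2 can be expressed as a small decision function. The sketch below assumes, following the RISC-V specification, that the hints apply to the indirect jump JALR and that x1 (ra) and x5 (t0) are the link registers; the type and function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { RAS_NONE, RAS_POP, RAS_PUSH, RAS_PUSH_AND_POP } ras_action_t;

/* x1 (ra) and x5 (t0, the alternate link register) count as link registers. */
static bool is_link(uint32_t reg) { return reg == 1u || reg == 5u; }

/* RAS hint for a jump with destination rd and source rs1, per Table 2.2. */
ras_action_t jalr_ras_hint(uint32_t rd, uint32_t rs1)
{
    bool rd_link = is_link(rd);
    bool rs1_link = is_link(rs1);

    if (!rd_link)
        return rs1_link ? RAS_POP : RAS_NONE;
    if (!rs1_link || rd == rs1)
        return RAS_PUSH;
    return RAS_PUSH_AND_POP;
}
```

A function return such as `jalr x0, 0(x1)` therefore maps to a pop, while a call through a function pointer such as `jalr x1, 0(x10)` maps to a push.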

RISC-V supports further extensions covering operations that are ubiquitous in the computing world. They include the M-extension for multiply/divide, the A-extension for atomic operations, which help ensure thread synchronization and memory region locks, the F/D-extensions for single- and double-precision floating-point instructions, and many more that are still being drafted.

#### 2.4. Custom instruction set extensions

RISC-V has been designed to support extensive customization by providing encoding space for custom instructions. Any custom implementation is considered part of the *non-standard* extensions. Table 2.3 shows the map of the base 7-bit opcodes and the spaces reserved for each.

| inst[4:2] | 000    | 001      | 010      | 011      | 100    | 101      | 110               | 111        |
|-----------|--------|----------|----------|----------|--------|----------|-------------------|------------|
| inst[6:5] |        |          |          |          |        |          |                   | (> 32b)    |
| 00        | LOAD   | LOAD-FP  | custom-0 | MISC-MEM | OP-IMM | AUIPC    | OP-IMM-32         | 48b        |
| 01        | STORE  | STORE-FP | custom-1 | AMO      | OP     | LUI      | OP-32             | 64b        |
| 10        | MADD   | MSUB     | NMSUB    | NMADD    | OP-FP  | reserved | custom-2/rv128    | 48b        |
| 11        | BRANCH | JALR     | reserved | JAL      | SYSTEM | reserved | custom-3/rv128    | ≥ 80b      |

Table.2.3. RISC-V base opcode map, inst[1:0] = 11 (compressed instructions are not included in the table)

As seen from table 2.3, there are four base opcode spaces reserved for custom instruction extensions: *custom-0, custom-1, custom-2,* and *custom-3.*
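The 7-bit value of each custom opcode follows directly from its row and column in table 2.3. A minimal C sketch, with illustrative names, assembles the opcode from inst[6:5], inst[4:2], and the fixed inst[1:0] = 11:

```c
#include <stdint.h>

/* Base opcode = inst[6:5] : inst[4:2] : inst[1:0], where inst[1:0] = 11
 * for every 32-bit (non-compressed) encoding. */
static inline uint32_t rv_base_opcode(uint32_t row_6_5, uint32_t col_4_2)
{
    return ((row_6_5 & 0x3u) << 5) | ((col_4_2 & 0x7u) << 2) | 0x3u;
}

/* The four custom opcode spaces, derived from their table positions. */
enum {
    OPC_CUSTOM_0 = 0x0b, /* row 00, column 010 -> 0001011 */
    OPC_CUSTOM_1 = 0x2b, /* row 01, column 010 -> 0101011 */
    OPC_CUSTOM_2 = 0x5b, /* row 10, column 110 -> 1011011 */
    OPC_CUSTOM_3 = 0x7b  /* row 11, column 110 -> 1111011 */
};
```

The values for *custom-0* and *custom-1* are the ones that reappear in the low seven bits of the Klessydra match constants.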

#### 2.5. RISC-V support in Klessydra

All Klessydra implementations to date support the 32-bit "I" base integer instruction set. The introduction of the later multithreaded Klessydra-T0 required at least minimal support for the atomic extension, provided by implementing the *AMOSWAP* instruction from the A-extension. The Klessydra-Fx implementation continued to support multithreading and thus kept the atomic support. The M-extension has also been added in later releases to provide fast multiplication, especially in the Klessydra-T1, to help execute small vectors quickly in convolutional neural networks.

The custom instruction set extensions were included only in the Klessydra-T1. The base opcodes used for the custom instructions are as follows:

- Custom memory instructions use the opcode space reserved for "*custom-0*", with opcode[6:0] = 7'b0001011.
- Custom vector arithmetic instructions use the opcode space reserved for "*custom-1*", with opcode[6:0] = 7'b0101011.

Table 2.4 shows the instructions added in Klessydra-T1; their descriptions can be found in appendix A.

| Name     | Binary format | Assembly syntax       | Opcode   |
|----------|---------------|-----------------------|----------|
| KMEMLD   | R             | kmemld rd, rs1, rs2   | custom-0 |
| KMEMSTR  | R             | kmemstr rd, rs1, rs2  | custom-0 |
| KBCASTLD | R             | kbcastld rd, rs1, rs2 | custom-0 |
| KADDV    | R             | kaddv rd, rs1, rs2    | custom-1 |
| KSUBV    | R             | ksubv rd, rs1, rs2    | custom-1 |
| KVMUL    | R             | kvmul rd, rs1, rs2    | custom-1 |
| KVRED    | R             | kvred rd, rs1, rs2    | custom-1 |
| KSVADDSC | R             | ksvaddsc rd, rs1, rs2 | custom-1 |
| KSVADDRF | R             | ksvaddrf rd, rs1, rs2 | custom-1 |
| KSVMULSC | R             | ksvmulsc rd, rs1, rs2 | custom-1 |
| KSVMULRF | R             | ksvmulrf rd, rs1, rs2 | custom-1 |
| KDOTP    | R             | kdotp rd, rs1, rs2    | custom-1 |
| KDOTPPS  | R             | kdotpps rd, rs1, rs2  | custom-1 |
| KSRLV    | R             | ksrlv rd, rs1, rs2    | custom-1 |
| KSRAV    | R             | ksrav rd, rs1, rs2    | custom-1 |
| KRELU    | R             | krelu rd, rs1, rs2    | custom-1 |
| KBCAST   | R             | kbcast rd, rs1        | custom-1 |
| KVCP     | R             | kvcp rd, rs1          | custom-1 |

Table.2.4. Klessydra K custom instruction set extensions

In addition to the instructions, custom CSR registers were also added; table 2.5 lists them.

| Name     | CSR_Addr | Type | Reg_Size                   | Description                                                        |
|----------|----------|------|----------------------------|--------------------------------------------------------------------|
| MVSIZE   | 0xBF0    | R/W  | Log<sub>2</sub>(SPM_Size)  | Contains the vector size, the maximum being the SPM size           |
| MVTYPE   | 0xBF8    | R/W  | 2-bits                     | Contains the type of data the vector has (8-bit, 16-bit, 32-bit)   |
| MPSCLFAC | 0xBE0    | R/W  | 5-bits                     | Post scaling factor for right shifts (used by kdotpps instruction) |

Table.2.5. Klessydra K custom CSR extensions

### 2.6. Patches to the riscv-gnu-toolchain:

Two simple modifications had to be made to the sources of the RISC-V GCC toolchain [35]. The first was to "*riscv-opc.c*", which holds the structures listing all RISC-V instructions, as seen below:

```c
/* Vector Extensions */
{"kmemld",   "I", "d,s,t", MATCH_K_MEMLD,   MASK_K_MEM,              match_opcode, 0 },
{"kmemstr",  "I", "d,s,t", MATCH_K_MEMSTR,  MASK_K_MEM,              match_opcode, 0 },
{"kbcastld", "I", "d,s,t", MATCH_K_BCASTLD, MASK_K_MEM,              match_opcode, 0 },
{"kaddv",    "I", "d,s,t", MATCH_K_ADDV,    MASK_K_ARITH,            match_opcode, 0 },
{"ksubv",    "I", "d,s,t", MATCH_K_SUBV,    MASK_K_ARITH,            match_opcode, 0 },
{"kvmul",    "I", "d,s,t", MATCH_K_VMUL,    MASK_K_ARITH,            match_opcode, 0 },
/* Two-operand forms also mask rs2, so that field must be zero to match. */
{"kvred",    "I", "d,s",   MATCH_K_VRED,    MASK_K_ARITH | MASK_RS2, match_opcode, 0 },
{"kdotp",    "I", "d,s,t", MATCH_K_DOTP,    MASK_K_ARITH,            match_opcode, 0 },
{"ksvaddsc", "I", "d,s,t", MATCH_K_SVADDSC, MASK_K_ARITH,            match_opcode, 0 },
{"ksvaddrf", "I", "d,s,t", MATCH_K_SVADDRF, MASK_K_ARITH,            match_opcode, 0 },
{"ksvmulsc", "I", "d,s,t", MATCH_K_SVMULSC, MASK_K_ARITH,            match_opcode, 0 },
{"ksvmulrf", "I", "d,s,t", MATCH_K_SVMULRF, MASK_K_ARITH,            match_opcode, 0 },
{"ksrav",    "I", "d,s,t", MATCH_K_SRAV,    MASK_K_ARITH,            match_opcode, 0 },
{"ksrlv",    "I", "d,s,t", MATCH_K_SRLV,    MASK_K_ARITH,            match_opcode, 0 },
{"kbcast",   "I", "d,s",   MATCH_K_BCAST,   MASK_K_ARITH | MASK_RS2, match_opcode, 0 },
{"krelu",    "I", "d,s",   MATCH_K_RELU,    MASK_K_ARITH | MASK_RS2, match_opcode, 0 },
{"kdotpps",  "I", "d,s,t", MATCH_K_DOTPPS,  MASK_K_ARITH,            match_opcode, 0 },
{"kvcp",     "I", "d,s",   MATCH_K_VCP,     MASK_K_ARITH | MASK_RS2, match_opcode, 0 },
```

The second modification was made to "riscv-opc.h", which holds the defines for the instruction masks and match values, as well as the CSR defines.

```c
/* Klessydra Extensions */
/* CSR Extensions */
#define CSR_MVSIZE 0xbf0
#define CSR_MVTYPE 0xbf8
#define CSR_MPSCLFAC 0xbe0

/* Vector Instructions Extensions */
#define MASK_K_MEM 0xfe00707f
#define MATCH_K_MEMLD 0xb
#define MATCH_K_MEMSTR 0x200000b
#define MATCH_K_BCASTLD 0x400000b
#define MASK_K_ARITH 0xfe00707f
#define MATCH_K_ADDV 0x200202b
#define MATCH_K_SUBV 0x400202b
#define MATCH_K_VMUL 0x800202b
#define MATCH_K_VRED 0xC00202b
#define MATCH_K_DOTP 0x1000202b
#define MATCH_K_SVADDSC 0x1800202b
#define MATCH_K_SVADDRF 0x1a00202b
#define MATCH_K_SVMULSC 0x1c00202b
#define MATCH_K_SVMULRF 0x1e00202b
#define MATCH_K_SRAV 0x2000202b
#define MATCH_K_SRLV 0x2200202b
#define MATCH_K_RELU 0x3000202b
#define MATCH_K_DOTPPS 0x3200202b
#define MATCH_K_BCAST 0x3c00202b
#define MATCH_K_VCP 0x3e00002b
```

```c
DECLARE_CSR(mvsize, CSR_MVSIZE)
DECLARE_CSR(mvtype, CSR_MVTYPE)
DECLARE_CSR(mpsclfac, CSR_MPSCLFAC)

DECLARE_INSN(kmemld, MATCH_K_MEMLD, MASK_K_MEM)
DECLARE_INSN(kmemstr, MATCH_K_MEMSTR, MASK_K_MEM)
DECLARE_INSN(kbcastld, MATCH_K_BCASTLD, MASK_K_MEM)
DECLARE_INSN(kaddv, MATCH_K_ADDV, MASK_K_ARITH)
DECLARE_INSN(ksubv, MATCH_K_SUBV, MASK_K_ARITH)
DECLARE_INSN(kvmul, MATCH_K_VMUL, MASK_K_ARITH)
DECLARE_INSN(kvred, MATCH_K_VRED, MASK_K_ARITH)
DECLARE_INSN(kdotp, MATCH_K_DOTP, MASK_K_ARITH)
DECLARE_INSN(ksvaddsc, MATCH_K_SVADDSC, MASK_K_ARITH)
DECLARE_INSN(ksvaddrf, MATCH_K_SVADDRF, MASK_K_ARITH)
DECLARE_INSN(ksvmulsc, MATCH_K_SVMULSC, MASK_K_ARITH)
DECLARE_INSN(ksvmulrf, MATCH_K_SVMULRF, MASK_K_ARITH)
DECLARE_INSN(ksrav, MATCH_K_SRAV, MASK_K_ARITH)
DECLARE_INSN(ksrlv, MATCH_K_SRLV, MASK_K_ARITH)
DECLARE_INSN(krelu, MATCH_K_RELU, MASK_K_ARITH)
DECLARE_INSN(kdotpps, MATCH_K_DOTPPS, MASK_K_ARITH)
DECLARE_INSN(kbcast, MATCH_K_BCAST, MASK_K_ARITH)
DECLARE_INSN(kvcp, MATCH_K_VCP, MASK_K_ARITH)
```
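How the match and mask constants relate to the R-type fields can be shown with a short, self-contained C sketch. The three constants are copied from the "riscv-opc.h" excerpt above; the helper functions `k_match`, `k_encode`, and `k_matches` are hypothetical names introduced for illustration. The comparison in `k_matches` is, to the best of our understanding, the same masked test that the `match_opcode` function in binutils applies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the riscv-opc.h excerpt. */
#define MASK_K_ARITH 0xfe00707fu
#define MATCH_K_ADDV 0x0200202bu
#define MATCH_K_SUBV 0x0400202bu

/* A match value is funct7 | funct3 | opcode of an R-type encoding. */
static inline uint32_t k_match(uint32_t funct7, uint32_t funct3, uint32_t opcode)
{
    return (funct7 << 25) | (funct3 << 12) | opcode;
}

/* Fill in the register fields, which the mask leaves free. */
static inline uint32_t k_encode(uint32_t match, uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    return match | ((rs2 & 0x1fu) << 20) | ((rs1 & 0x1fu) << 15) | ((rd & 0x1fu) << 7);
}

/* An instruction word matches when all masked bits equal the match value. */
static inline bool k_matches(uint32_t insn, uint32_t match, uint32_t mask)
{
    return ((insn ^ match) & mask) == 0;
}
```

With these helpers, `k_match(1, 2, 0x2b)` reproduces `MATCH_K_ADDV`, and `k_encode(MATCH_K_ADDV, 1, 2, 3)` gives the word `0x023120ab` for `kaddv x1, x2, x3`.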

### 2.7. Concluding remarks

In conclusion, RISC-V is not only an open-source ISA available for simulation; it is a real ISA suited to native hardware implementation. The standard is balanced so that it can be exploited by all types of architectures. It supports 32-bit and 64-bit address spaces and IEEE-standard floating point, it provides custom instruction encoding space that allows researchers to explore non-standard custom extensions and companies to integrate their own specialized instructions, and it still has great potential to become even more pervasive throughout the industry.

## Chapter 3 The PULPino Microcontroller Platform



#### 3.1. Motivation behind choosing PULPino

Having chosen to build a RISC-V processor, we also needed to choose an SoC. Designing our own SoC from scratch was not feasible, since our group of researchers was small. With RISC-V being an emerging technology, there were not many open SoCs to choose from. PULPino being part of the ultra-low-power projects was another good reason to adopt the system. Finally, our close relations and collaborations with the University of Bologna provided an ongoing communication channel for continuous support from their side. For these reasons, PULPino was our choice.

PULPino is an open-source System-on-Chip embedding a 32-bit RISC-V microprocessor. It targets embedded systems and ultra-low-power designs. The PULPino SoC has been adopted by a large group of researchers globally, for both research and commercial purposes.

#### 3.2. Background

PULPino is a smaller version of PULP, which stands for Parallel Ultra Low Power processor. The idea behind the PULP project was that, in order to achieve low dynamic power consumption, the processors need to operate at near-threshold voltage levels [10]. Speed drops rapidly at near-threshold voltages, since the delay follows a quadratic curve (figure 3.1). The PULP solution is to recover the lost speed by embedding several processors that work in parallel.
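As a first-order illustration of this trade-off, the standard CMOS switching-power model (a textbook relation, not taken from the PULP publications) gives the dynamic power as

$$
P_{dyn} = \alpha \, C \, V_{DD}^{2} \, f
$$

where $\alpha$ is the switching activity, $C$ the switched capacitance, $V_{DD}$ the supply voltage, and $f$ the clock frequency; all four symbols are introduced here for illustration. The quadratic dependence on $V_{DD}$ is what makes near-threshold operation attractive, while running several cores in parallel compensates for the lower $f$ that each core can sustain at the reduced voltage.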



PULP is a large project with a very wide scope, involving a large group of engineers and specialized experts. The project includes open-source processors, peripherals, communication buses, an integrated all-in-one environment to build and test the embedded cores and the entire SoC with Modelsim and Vivado, and a custom RISC-V toolchain.

PULPino is a miniaturized version of PULP which embeds only one core. PULPino is completely open source[17][18], and can be found on GitHub. Figure 3.2 shows the building blocks of PULPino.



PULPino targets RTL simulations, FPGAs, and ASICs. By default it has a 32 KB program memory and a 32 KB data memory. The boot ROM is 512 B. Peripherals are mapped in the upper region of the address map, and each is allocated 4 KB. The peripherals in PULPino communicate with the core by raising interrupts. All interrupts have an entry in an interrupt vector table (IVT). When servicing an interrupt, the core consults the IVT in order to jump to the appropriate interrupt handling routine.

Besides the peripherals, PULPino features an SPI slave port that can be used to pre-load programs into the memories without the help of the core. It is connected to the AXI interconnect as an AXI master, which allows external access to all memories and peripherals. PULPino also has a JTAG debugging interface that can access all peripherals and memories, and can halt and single-step the core.
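The idea of a vector-table lookup can be modeled in software with a table of function pointers. The sketch below is a host-side illustration only: the table size, the function names, and the example handler are hypothetical, and the real PULPino IVT is a hardware-defined table in program memory rather than a C array.

```c
#include <stddef.h>

#define IVT_ENTRIES 32 /* hypothetical table size */

typedef void (*irq_handler_t)(void);

static irq_handler_t ivt[IVT_ENTRIES]; /* NULL means no handler installed */

/* Install a handler; returns 0 on success, -1 for an out-of-range line. */
int ivt_register(size_t irq, irq_handler_t handler)
{
    if (irq >= IVT_ENTRIES)
        return -1;
    ivt[irq] = handler;
    return 0;
}

/* Look up and run the handler for `irq`; returns -1 if none is installed. */
int ivt_dispatch(size_t irq)
{
    if (irq >= IVT_ENTRIES || ivt[irq] == NULL)
        return -1;
    ivt[irq]();
    return 0;
}

/* Example handler (hypothetical): counts how often it has been serviced. */
static int example_irq_count;
static void example_handler(void) { ++example_irq_count; }
```

Registering `example_handler` on some line and then calling `ivt_dispatch` with the same line number runs the handler once; dispatching a line without a handler returns -1.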

#### 3.3. PULPino native processor cores

PULPino integrates two RISC-V processors, RI5CY and Zero-Riscy. RI5CY is an in-order processor with four pipeline stages. It supports the base integer instruction set RV32I, the compressed instructions RV32C, the multiplication extension RV32M, and the single-precision floating-point extension RV32F. RI5CY also implements other extensions to the ISA, such as hardware loops, bit manipulation instructions, MAC operations, packed SIMD instructions, and many more [52][53].

Zero-Riscy is an in-order, single-issue processor with only two pipeline stages. It supports the base integer instruction set RV32I, the compressed instructions RV32C, and the multiplication extension RV32M. The core can be configured to support the embedded extension RV32E, which reduces the register file to half its size. A tiny version of Zero-Riscy can be implemented by enabling the embedded extension (RV32E) and disabling the multipliers and dividers (RV32M). This implementation, called Micro-Riscy, is the smallest version supported.

### 3.4. Embedding non-native Klessydra processing cores in PULPino

Figure 3.3 shows the Klessydra and PULPino roadmap. The Klessydra cores target FPGA implementations, while the RISCY cores target ASIC implementations.



Figure.3.3 Klessydra family roadmap

In order to correctly embed the Klessydra cores and software libraries inside PULPino, changes had to be made to the PULPino environment on several levels:

- **Modifying the Klessydra RTL:** The pinout of Klessydra was made fully compatible with the RISCY cores from PULPino. The interrupt, exception, and event handling also had to be modified to pass the generic tests.
- **Modifying the PULPino RTL:** The SystemVerilog sources of the PULPino RTL and testbench were modified to add instances of the Klessydra cores and to pass the added generic parameter.
- **Modifying the Software Environment:** The CMake files were modified to include the generic Klessydra tests and software libraries. Together with a shell script, they were also modified to pass the arguments to the Tcl simulate scripts.
- **Modifying the Modelsim compile and simulate scripts:** In addition to the software environment and RTL, the compile scripts were modified to compile the different versions of Klessydra among the compiled PULPino libraries, and the simulate scripts were modified similarly.

## Chapter 4 Klessydra T0 Architecture



### 4.1. The Klessydra-T family

Klessydra is a processing core family that features full compliance with the RISC-V instruction set. Klessydra cores were designed to fit inside the PULPino SoC. The Klessydra family is composed of a single in-order core with two pipeline stages named Klessydra-S0 [11], a set of multithreaded cores named Klessydra-Tx, and a set of fault-tolerant cores named Klessydra-Fx [20][21]. This thesis covers the Klessydra-Tx family and its different variants. All the Tx cores have been synthesized and tested on FPGAs from Xilinx. FPGA synthesis is our main target because soft-cores are widely available in embedded systems [11]. A customizable embedded core is favorable since it can be reconfigured to adapt easily to the user's target applications.

Klessydra cores support the RISC-V ISA. All versions support the 32-bit base integer instruction set "RV32I" in bare metal. The *Tx* and *Fx* versions extend the ISA with the atomic instruction extension, some *Tx* variants further add the RISC-V multiplication and division extension, and some add a set of specialized custom instructions designed to accelerate convolutional neural network applications. The ports of the Klessydra cores are pin-to-pin compatible with the RISCY cores inside PULPino. The Tx versions of Klessydra support a multithreading paradigm called interleaved multithreading (IMT), also known as barrel processing. This chapter presents the early version of the Tx cores, known as the T0 cores, and their different variants. Chapter 5 upgrades the optimal T0 implementation adopted in this chapter and adds a specialized neural network accelerator specifically designed to exploit the IMT architecture. The upgraded version is known as the T1 core.

### 4.2. Motivation for choosing interleaved multithreading

A good guideline to follow in order to increase the energy consumption per instruction of an embedded processor, is through decreasing the idle time of the embedded systems by eliminating the pipeline stalls.

In-order architectures stall the processor's pipeline to fence between same-operand read and write access. These stalls are unfavorable as they degrade the performance of the processor, as well as decrease the energy efficiency by continuously accumulating the total idle time of the processor.

Out-of-order architectures can easily eliminate the pipeline stalls [49][50][51], however in order to do that, they employ highly complex dynamic scheduling logic to resolve the data dependency hazards. These data dependency eliminating schemes give rise to anti-dependency hazards, and again out-of-order architectures employ register renaming approaches to remove those anti-dependencies. In addition, these architectures being highly pipelined must integrate a well-advanced branch predicting logic, since branch miss prediction will greatly impact the overall performance. This type of architecture succeeded in greatly mitigating the pipeline stalls and improves the overall performance. However, these designs being very complex greatly increased the area and the power

consumption of those architectures. In other words, the performance was actually a tradeoff with the power and area.

One existing approach, named barrel processing or interleaved multithreading (IMT) [16], replaces the highly complex stall-mitigation logic of out-of-order processors with a more relaxed scheme: it employs hardware threads to utilize the idle time of the core and to fence between the registerfile read and write accesses.

An IMT architecture interleaves a hardware thread (hart) to fill the bubbles in the instruction pipeline in order to avoid Read-after-Write (RAW) data hazards. Doing so, it does not introduce a new class of anti-dependency hazards such as Write-after-Read (WAR) and Write-after-Write (WAW) as in the case of *OOO* architectures.

A basic IMT processor emulates a single-core single-issue processor with zero pipeline stalls. Because it can continuously issue instructions without data dependency stalls, an IMT processor can easily converge towards 1 IPC in a single-issue design, but only for certain classes of applications: the first being decoupled sequential applications, and the second being balanced parallel applications. Regarding sequential applications, if the program executes on only one hart while the other harts are idle, the overall performance will suffer from the overhead of interleaving the idle harts, and the more harts an IMT core has, the worse it performs when executing a single sequential program. Decoupled sequential applications avoid this by keeping every hart busy with its own program, such that the input data of one hart are completely independent from the output results of another hart. Such applications include, for example, a microcontroller that interfaces with multiple sensors, monitors the changes, and then transmits the data over a wireless channel to a human interface.

As for the second class of applications easily exploitable by IMT processors, one might quickly deduce that an IMT architecture performs well in applications with parallel workloads. That is only partly true: how an IMT core performs when running a parallel application depends mainly on how evenly the workload is divided between the harts. A balanced workload in a parallel program can have inter-thread dependencies that require thread synchronization; however, because the workload is balanced, the overhead of thread synchronization is unnoticeable. Parallel applications that are balanced and loosely coupled will perform better than balanced applications that are tightly coupled. Such application classes are very suitable for IMT architectures, since they utilize all the interleaving harts efficiently. Examples of such applications include data sorting, searching algorithms, Monte-Carlo simulations, computational fluid dynamics (CFD) simulations, and molecular modeling and simulation.

### 4.3. Klessydra-T0 introduction and background information

The Klessydra-T0 core is a basic IMT microprocessor which supports the RV32IMA instruction set extensions of RISC-V in bare metal. The 'T' symbol indicates that the core architecture is multithreaded. The multithreading paradigm supported is Interleaved Multithreading, or IMT. The Klessydra-T0 can be parametrized to run without the M-extension, and the registerfiles can be parametrized to support the Embedded E-extension instead, for area critical environments. Throughout this chapter, I will refer to the core as "T0" as an abbreviation of the name Klessydra-T0.

The T0 IMT is a single-issue in-order processor which is available in different variants, each of which has a different instruction pipeline organization. The variants are designated by the following abbreviation: "*T0ab*". The letter 'a' following the zero identifies the minimum number of hardware threads that need to be interleaved in a core in order to avoid inserting any bubbles in the pipeline; this is known as the *thread pool baseline*. The letter 'b' indicates the number of harts present in the current version of the core, otherwise known as the *thread pool size*.
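The naming scheme can be made concrete with a small helper. The following is only an illustrative sketch: the function name `parse_t0_codename` and the `T0Config` type are hypothetical and not part of the Klessydra sources, and the optional `_v2` suffix is accepted only because such codenames appear in the tables of this chapter.

```python
import re
from typing import NamedTuple


class T0Config(NamedTuple):
    baseline: int  # 'a': thread pool baseline (TPB)
    size: int      # 'b': thread pool size (TPS)


def parse_t0_codename(name: str) -> T0Config:
    """Split a codename such as 'T033' or 'T024_v2' into (baseline, size)."""
    match = re.fullmatch(r"T0(\d)(\d)(?:_v\d+)?", name)
    if match is None:
        raise ValueError(f"not a T0ab codename: {name!r}")
    return T0Config(baseline=int(match.group(1)), size=int(match.group(2)))


if __name__ == "__main__":
    for codename in ("T012", "T033", "T024_v2"):
        print(codename, parse_t0_codename(codename))
```

For instance, `parse_t0_codename("T012")` yields a baseline of 1 and a size of 2.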

In order to build an IMT architecture, the following entities must be replicated for each hart:

- Registerfile
- Program Counter
- CSR Unit

After having replicated the above units, a hardware context counter "harc" must be built. The harc interleaves between the harts in the IMT core, such that on every instruction grant, we send to the program memory a request from another hart.



Figure.4.1. Conceptual view of hardware context counter (harc) interleaved execution
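The interleaving performed by the harc can be mimicked in software. The sketch below is a behavioural model only, not the RTL: the names `harc_sequence` and `fetch_trace` are hypothetical, the counting direction and starting hart are assumptions, and the model ignores branches by advancing each hart's program counter by 4 bytes per granted instruction.

```python
from itertools import islice


def harc_sequence(thread_pool_size):
    """Yield the hart index selected on each instruction grant (round robin)."""
    harc = 0  # assumption: counting starts at hart 0 and counts upwards
    while True:
        yield harc
        harc = (harc + 1) % thread_pool_size


def fetch_trace(thread_pool_size, start_pcs, grants):
    """Return the (hart, pc) fetch requests issued over a number of grants."""
    pcs = list(start_pcs)  # one program counter per hart (replicated state)
    trace = []
    for hart in islice(harc_sequence(thread_pool_size), grants):
        trace.append((hart, pcs[hart]))
        pcs[hart] += 4  # assumption: sequential 32-bit instructions, no branches
    return trace


if __name__ == "__main__":
    for hart, pc in fetch_trace(3, [0x000, 0x100, 0x200], 6):
        print(f"hart {hart} fetches pc=0x{pc:03x}")
```

Each hart only sees its own program counter, which is why the program counter is among the entities replicated per hart.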

Klessydra-Tx cores have a parameterizable number of harts to interleave where the hart count is identified in the package file by a parameter called "THREAD\_POOL\_SIZE". The recommended number of harts to put in a core should be less than or equal to the thread pool baseline. In other words, *T0ab* is recommended to be configured such that 'b' is less than or equal to 'a'.

Configuring 'b' to be greater than 'a' is allowed; however, it will not give any performance boost. Rather, it will significantly slow down sequential applications. It will also degrade the performance of parallel applications by adding bigger stall overheads from idle harts, which remain idle until all the other harts have arrived at a thread synchronization barrier. In addition, the area of the architecture will grow, and as the layout grows, the FPGA elements selected during place and route will be placed farther away from each other, which in turn yields slower layouts due to larger net delays between the FPGA slices.

In order to find the thread pool baseline needed so that no data hazards arise, we have to know how many pipeline stages exist from the read port of the registerfile to the write port of the registerfile. For every pipeline stage separating the read and write ports, a hart must be interleaved. Otherwise, the user can choose to configure the core with a hart count below the baseline, in which case NOP operations are introduced in the pipeline to fence between instructions belonging to the same hart.

### 4.4. Choosing the optimal IMT pipeline organization:

In this section, we will demonstrate the framework that was followed in choosing the optimal pipeline organization to use in interleaved multithreaded processors [15]. At the end of the section we will show which *T0ab* organization was chosen as the most suitable processor for our research. This section is oriented around three main keywords:

- **TPS** or Thread pool size, which indicates the total number of hardware threads in the core.
- **TPB** or Thread pool baseline, which indicates the minimum number of harts needed to avoid data dependency stalls.
- **NT** or Number of active threads, which indicates the number of active harts M in a core with a TPS equal to N, such that always M ≤ N.

The exploration parameters of IMT architectures were first studied by implementing a set of pipeline organizations ranging from two to four stages [14], each being run with a different set of thread pool sizes. The pipeline implementations studied were as follows:

- a. F / RDEW (two pipeline stages)
- b. F / R / DEW (three pipeline stages)
- c. F / RD / EW (three pipeline stages)
- d. F / RD / E / W (four pipeline stages)
- e. F / R / DE / W (four pipeline stages)

In the pipeline schemes listed above, F designates the instruction fetch stage, R is the registerfile reading, D is decoding, E is executing, and W is registerfile writeback. Early T0 versions included a fetch stage, and flushing logic to discard the instruction of the same hart in the fetch stage when a branch is taken. However, later releases dropped this stage, and the requested instruction goes directly to the stage after F, i.e. the decode unit. These pipeline structures were designed to study the optimal pipeline organization to use in an interleaved multithreaded bare metal RISC-V processor. Synthesis runs were done on XILINX 7 Series FPGAs [3]. The synthesis timing constraints were set to a low target cycle time to make the Vivado compiler generate fast netlists.
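The baseline rule from section 4.3 (one interleaved hart for every pipeline stage from the registerfile read to the registerfile write) can be applied mechanically to the organization strings listed above. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the stage notation is parsed exactly as written in the list, and the number of fencing NOPs per instruction for an undersized thread pool is assumed to be the difference between the baseline and the hart count.

```python
def thread_pool_baseline(organization):
    """Count the stages from the one that reads the registerfile (R)
    up to and including the one that writes it back (W)."""
    stages = [stage.strip() for stage in organization.split("/")]
    read_stage = next(i for i, stage in enumerate(stages) if "R" in stage)
    write_stage = next(i for i, stage in enumerate(stages) if "W" in stage)
    return write_stage - read_stage + 1


def fence_nops(organization, thread_pool_size):
    """Assumed number of NOP slots needed between two instructions of the
    same hart when fewer harts than the baseline are interleaved."""
    return max(0, thread_pool_baseline(organization) - thread_pool_size)


if __name__ == "__main__":
    for org in ("F / RDEW", "F / R / DEW", "F / RD / EW",
                "F / RD / E / W", "F / R / DE / W"):
        print(f"{org:15s} baseline = {thread_pool_baseline(org)}")
```

Under these assumptions the five organizations give baselines of 1, 2, 2, 3 and 3, which is why the two-stage cores carry the codename prefix T01, the three-stage cores T02, and the four-stage cores T03.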

The FPGA element utilization from the synthesis runs of the set of configurations is shown in table 4.1, together with the minimum cycle time of each layout. For instance, the *T012* architecture has a TPS of 2 and a thread pool baseline of 1.

Table.4.1. Resource Utilization, and Minimum cycle time [ns]

| Architecture   | TPS | Codename | LUT  | FF   | Minimum cycle time [ns] |
|----------------|-----|----------|------|------|-------------------------|
|                | 2   | T012     | 3264 | 2410 | 12.7                    |
| F / RDEW       | 3   | T013     | 4018 | 3577 | 13.9                    |
|                | 4   | T014     | 4351 | 4744 | 15.9                    |
|                | 2   | T022     | 3211 | 2544 | 8.9                     |
| F / R / DEW    | 3   | T023     | 3892 | 3711 | 9.7                     |
|                | 4   | T024     | 4217 | 4882 | 9.5                     |
|                | 2   | T022_v2  | 3583 | 2653 | 9.6                     |
| F / RD / EW    | 3   | T023_v2  | 4461 | 3853 | 9.6                     |
|                | 4   | T024_v2  | 4608 | 5052 | 9.4                     |
|                | 2   | T032     | 3242 | 2679 | 8.6                     |
| F / R / DE / W | 3   | T033     | 4011 | 3914 | 8.9                     |
|                | 4   | T034     | 4187 | 5144 | 8.6                     |
|                | 2   | T032_v2  | 3635 | 2725 | 7.1                     |
| F / RD / E / W | 3   | T033_v2  | 4520 | 3958 | 7.3                     |
|                | 4   | T034_v2  | 4825 | 5189 | 7.4                     |

It is evident from table 4.1 that every hart added to the core (every increment of TPS) increased the flip-flop count by more than 1024 (32\*32), corresponding to one additional registerfile of 32 registers of 32 bits each. Every pipeline stage introduced increased the flip-flop count by 100~200, or 5% to 7%. For example, going from the pipeline organization T012 to T022 revealed only a 5% increase in the total flip-flop count and a slight decrease in the total LUT count, and going from the T023\_v2 organization to T033\_v2 increased the flip-flop count by 6% and the LUT count by 1%.
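The per-hart flip-flop cost can be checked directly against the flip-flop column of table 4.1. The following sketch simply copies those values (for TPS = 2, 3, 4 in each organization) and computes the difference between consecutive thread pool sizes; the dictionary and function names are hypothetical.

```python
# Flip-flop counts from table 4.1, listed for TPS = 2, 3, 4.
FLIP_FLOPS = {
    "F / RDEW":       [2410, 3577, 4744],
    "F / R / DEW":    [2544, 3711, 4882],
    "F / RD / EW":    [2653, 3853, 5052],
    "F / R / DE / W": [2679, 3914, 5144],
    "F / RD / E / W": [2725, 3958, 5189],
}

REGISTERFILE_BITS = 32 * 32  # 32 registers of 32 bits each


def hart_increments(counts):
    """Flip-flops added by each additional hart."""
    return [after - before for before, after in zip(counts, counts[1:])]


if __name__ == "__main__":
    for organization, counts in FLIP_FLOPS.items():
        increments = hart_increments(counts)
        extra = [inc - REGISTERFILE_BITS for inc in increments]
        print(f"{organization:15s} per-hart FF: {increments}, "
              f"beyond the registerfile: {extra}")
```

In every organization the increment exceeds the 1024 bits of the registerfile itself; the remainder is presumably due to the other replicated per-hart state, such as the program counter and the CSR unit.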

The cycle time of each organization is also shown in table 4.1. One concern we had was that the overhead of interleaving new harts would increase the area utilization in the FPGA such that, during the post-synthesis place and route phase, Vivado would place the elements far away from each other, making the net delay of the critical path much bigger. However, the Vivado timing reports [48] only showed evidence of that happening in the *F/RDEW* pipeline organization, where the cycle time increased from 12.7ns in the T012 to 13.9ns in the T013, and up to 15.9ns in the T014. These implementations are of little concern, since they were only control configurations used for comparison with the other T0 pipeline organizations.

The other implementations show only a small cycle time increase due to interleaving more harts, and a more significant cycle time decrease due to pipelining. Hence, we conclude from the timing reports that the overhead of adding a new hart to resolve the data dependency problems does not significantly impact the cycle time, and that with every added pipeline stage the maximum frequency of the core keeps increasing, such that the cycle time dropped sharply from 12.7ns in the T012 down to 7.4ns in the T034\_v2.

The throughput of an IMT processor running an integer arithmetic application at maximum frequency is shown in table 4.2. The table shows the number of MIPS for each TPS configuration in every pipeline organization, when the active number of threads NT is less than or equal to the TPS.

- When NT < TPB, the number of MIPS suffers from data dependencies and pipeline flushes.
- When NT = TPB, the number of MIPS suffers only due to pipeline flushes.
- When NT > TPB, the number of MIPS does not suffer from any pipeline flushes or data dependency stalls. However, the MIPS will also not increase with a further increase of NT.

Table.4.2. Throughput at Maximum Frequency [MIPS] for each number of active threads NT (N.A. = NOT APPLICABLE)

| Architecture   | TPS | TPB | Codename | NT=1 | NT=2 | NT=3  | NT=4  |
|----------------|-----|-----|----------|------|------|-------|-------|
|                | 2   |     | T012     | 67.9 | 78.8 | n.a.  | n.a.  |
| F / RDEW       | 3   | 1   | T013     | 61.9 | 71.8 | 71.8  | n.a.  |
|                | 4   |     | T014     | 54.4 | 63.1 | 63.1  | 63.1  |
|                | 2   |     | T022     | 69   | 96.4 | n.a.  | n.a.  |
| F / R / DEW    | 3   | 2   | T023     | 63.6 | 88.8 | 103   | n.a.  |
|                | 4   |     | T024     | 65   | 90.8 | 105.3 | 105.3 |
|                | 2   |     | T022_v2  | 64.6 | 90.2 | n.a.  | n.a.  |
| F / RD / EW    | 3   | 2   | T023_v2  | 64.2 | 89.6 | 104   | n.a.  |
|                | 4   |     | T024_v2  | 65.6 | 91.6 | 106.2 | 106.2 |
|                | 2   |     | T032     | 50.8 | 74.6 | n.a.  | n.a.  |
| F / R / DE / W | 3   | 3   | T033     | 49.1 | 72.2 | 100.8 | n.a.  |
|                | 4   |     | T034     | 50.6 | 74.3 | 103.8 | 120.4 |
|                | 2   |     | T032_v2  | 58.8 | 86.4 | n.a.  | n.a.  |
| F / RD / E / W | 3   | 3   | T033_v2  | 57.4 | 84.3 | 117.7 | n.a.  |
|                | 4   |     | T034_v2  | 56.6 | 83.1 | 116   | 134.6 |

Entries are marked not applicable (n.a.) where NT would be greater than TPS (NT > TPS), which is impossible.
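As a rough cross-check of table 4.2 against the cycle times in table 4.1, a single-issue core that sustains 1 IPC executes about 1000 / (cycle time in ns) million instructions per second. The sketch below uses two hypothetical helpers, `peak_mips` and `implied_ipc`, to compute this bound and the IPC implied by a reported throughput. The reported values do not all coincide exactly with the bound, so it should be read as an approximation and not as the method used to produce the table.

```python
def peak_mips(cycle_time_ns, ipc=1.0):
    """Throughput in MIPS of a single-issue core at the given cycle time."""
    frequency_mhz = 1000.0 / cycle_time_ns  # 1 / ns = GHz = 1000 MHz
    return ipc * frequency_mhz


def implied_ipc(reported_mips, cycle_time_ns):
    """IPC implied by a reported throughput at the maximum frequency."""
    return reported_mips * cycle_time_ns / 1000.0


if __name__ == "__main__":
    # (codename, cycle time [ns] from table 4.1, MIPS at NT = TPS from table 4.2)
    rows = [("T012", 12.7, 78.8), ("T024", 9.5, 105.3),
            ("T033_v2", 7.3, 117.7), ("T034_v2", 7.4, 134.6)]
    for codename, cycle_ns, mips in rows:
        print(f"{codename:8s} bound = {peak_mips(cycle_ns):6.1f} MIPS, "
              f"reported = {mips:6.1f} MIPS, "
              f"implied IPC = {implied_ipc(mips, cycle_ns):.2f}")
```

With these inputs, configurations running with NT above the TPB (T012, T024, T034\_v2) come out at an implied IPC close to 1, while the T033\_v2 at NT = TPB = 3 comes out lower, in line with the pipeline flushing penalty described above.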

Let's study one example from the table above. The T023\_v2 implementation has a TPB of 2. When NT is equal to 1, the reported MIPS shows the throughput of the core when it is affected by both data dependencies and pipeline flushing. Setting NT equal to the TPB, which is 2, shows the throughput with stalls due only to pipeline flushing. When NT becomes greater than the TPB (i.e. NT=TPS=3), the pipeline will **only have one instruction per hart** at any given time, which makes pipeline flushing unnecessary, and so the throughput reaches its top attainable value.

Table 4.3 and table 4.4 report, for each implementation running an integer arithmetic application at its maximum frequency, the average dynamic power consumption and the average energy the processor spends to execute one instruction. Static power consumption was not reported, since FPGAs consume a constant static power that is independent of the parameters or the test being run. Also, the designs do not provide any ad-hoc mechanisms to reduce the leakage currents [11][12].

Table.4.3. Average Dynamic Power at Maximum Clock Frequency [mW] for each number of active threads NT (N.A. = NOT APPLICABLE)

| Architecture   | TPS | TPB | Codename | NT=1  | NT=2  | NT=3  | NT=4  |
|----------------|-----|-----|----------|-------|-------|-------|-------|
|                | 2   |     | T012     | 43.57 | 45.67 | n.a.  | n.a.  |
| F / RDEW       | 3   | 1   | T013     | 38.44 | 40.29 | 40.29 | n.a.  |
|                | 4   |     | T014     | 37.2  | 38.99 | 38.99 | 38.99 |
|                | 2   |     | T022     | 53.43 | 58.43 | n.a.  | n.a.  |
| F / R / DEW    | 3   | 2   | T023     | 46.77 | 51.14 | 53.61 | n.a.  |
|                | 4   |     | T024     | 44.08 | 48.2  | 50.53 | 50.53 |
|                | 2   |     | T022_v2  | 45.72 | 50    | n.a.  | n.a.  |
| F / RD / EW    | 3   | 2   | T023_v2  | 45.44 | 49.69 | 52.08 | n.a.  |
|                | 4   |     | T024_v2  | 38.98 | 42.63 | 44.68 | 44.68 |
|                | 2   |     | T032     | 60.16 | 65.06 | n.a.  | n.a.  |
| F / R / DE / W | 3   | 3   | T033     | 49.16 | 53.17 | 58.14 | n.a.  |
|                | 4   |     | T034     | 52.49 | 56.76 | 62.07 | 65.06 |
|                | 2   |     | T032_v2  | 67.72 | 73.24 | n.a.  | n.a.  |
| F / RD / E / W | 3   | 3   | T033_v2  | 57.92 | 62.64 | 68.49 | n.a.  |
|                | 4   |     | T034_v2  | 61.05 | 66.02 | 72.2  | 75.68 |

Table.4.4. Average Energy Efficiency [nJ/instr] for each number of active threads NT (N.A. = NOT APPLICABLE)

| Architecture   | TPS | TPB | Codename | NT=1 | NT=2 | NT=3 | NT=4 |
|----------------|-----|-----|----------|------|------|------|------|
|                | 2   |     | T012     | 1.63 | 1.43 | n.a. | n.a. |
| F / RDEW       | 3   | 1   | T013     | 1.7  | 1.49 | 1.49 | n.a. |
|                | 4   |     | T014     | 1.92 | 1.68 | 1.68 | 1.68 |
|                | 2   |     | T022     | 1.75 | 1.3  | n.a. | n.a. |
| F / R / DEW    | 3   | 2   | T023     | 1.79 | 1.33 | 1.17 | n.a. |
|                | 4   |     | T024     | 1.71 | 1.27 | 1.12 | 1.12 |
|                | 2   |     | T022_v2  | 1.74 | 1.3  | n.a. | n.a. |
| F / RD / EW    | 3   | 2   | T023_v2  | 1.75 | 1.3  | 1.15 | n.a. |
|                | 4   |     | T024_v2  | 1.62 | 1.2  | 1.05 | 1.05 |
|                | 2   |     | T032     | 2.5  | 1.77 | n.a. | n.a. |
| F / R / DE / W | 3   | 3   | T033     | 2.37 | 1.66 | 1.24 | n.a. |
|                | 4   |     | T034     | 2.36 | 1.67 | 1.24 | 1.1  |
|                | 2   |     | T032_v2  | 2.29 | 1.62 | n.a. | n.a. |
| F / RD / E / W | 3   | 3   | T033_v2  | 2.18 | 1.54 | 1.15 | n.a. |
|                | 4   |     | T034_v2  | 2.26 | 1.6  | 1.2  | 1.06 |

It is evident from table 4.3 that, within the same implementation, tests with a smaller NT consume less dynamic power than tests with a bigger NT. However, that does not mean they are more energy efficient, since within the same implementation the tests that utilized more harts achieved the highest throughput, as shown previously in table 4.2. This is evident in table 4.4, where the tests running at a higher *NT* have the highest energy efficiency. Also, note that pipelining boosted the top frequency of the core such that the throughput increase was larger than the dynamic power consumption increase. Thus, as seen from table 4.4, the pipelined architectures were not only faster, but also more energy efficient than their non-pipelined counterparts.

The reported results in the preceding tables show that the most energy efficient implementations were the T024\_v2 and the T034\_v2. That is due to the T024\_v2 having a very low dynamic power consumption, and the T034\_v2 having the highest throughput. However, our choice for the optimal IMT implementation to be used in our research was the T033\_v2, which is slightly less energy efficient than the T034\_v2. One might ask why our choice did not directly follow the results in the tables. The reasons are the following:

- a) As we suggested at the beginning of this chapter, the recommended TPS in an IMT architecture should be set equal to the TPB. So, the best choices in each pipeline organization should be T011, T022, and T033.
- b) Fetch buffers were present in the reported results in order to demonstrate the impact of pipeline flushing on the performance and energy efficiency. They will be removed in the chosen T033\_v2 implementation. In the upgraded implementations of the T033\_v2, the fetched instruction goes directly to the decode stage, and no flushing is needed.
- c) Removing the fetch stage from the T033\_v2 will increase its throughput to match that of the T034\_v2, thus giving the T033\_v2 the highest energy efficiency.
- d) T033\_v2 is a better choice than T034\_v2 in parallel applications, since thread synchronization overheads will be smaller in the T033\_v2.
- e) The larger area of the T034\_v2 compared to the T033\_v2 tells us that, if the two implementations attain the same throughput at best, the area increase of the former does not justify its use over the latter.
- f) Finally, although it is not very evident in the pipelined organizations, the cycle time does slightly increase due to interleaving more harts.

All the reasons above justify that the best option is to use the most pipelined version in which TPS is set equal to TPB (TPS=TPB). Having chosen the T033\_v2 as our ideal implementation for a fast and energy efficient processor, in the next section we will see why deeper pipelines like T04 and T05 were not explored.

### 4.5. Deeper pipeline organizations

#### 4.5.1. Pipeline stages after registerfile read access:

Following the trend from the above tables, it was evident that deeper pipelines provided higher operating frequencies for the core, and that interleaving sufficient threads recovered the energy otherwise wasted in the core by allowing another hart to execute instead of having a delay slot. Figure 4.2 shows the datapath of the T0 in two different pipeline organizations. The first accesses the memory from the execute stage, while the second accesses the memory from a dedicated memory stage, with the memory address calculated in the previous pipeline stage.

Although deeper pipelines yielded better results, as shown in the previous section, one evident problem was the saturation of the cycle time decrease as the pipelining got deeper. Implementations such as the T044 in figure 4.2b might not really have higher operating frequencies, since the area overhead of the additional threads increases the net delay, and this can outweigh the frequency gained by decreasing the logic delay.

Also, there will definitely be a bigger overhead of stalls when synchronizing the hardware threads, or when there are idle harts, in the more pipelined implementations (T044). For example, a program running on a single hart in the T033 will execute one instruction on the first hart followed by **two** wait-for-interrupt (WFI) instructions that act like a NOP, while an application running on a single hart in the T044 will execute one instruction on the first hart followed by **three** WFI instructions. This additional overhead makes deeper pipeline implementations perform worse on single threaded sequential applications and on unbalanced parallel applications. For balanced parallel applications, they may not perform much better either, due to the saturation of the top frequency increase obtained by pipelining.
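The single-hart example above can be written out as an issue-slot sequence. The sketch below is a simplified model that follows only the description in this paragraph: every idle hart is assumed to occupy its slot with a WFI acting as a NOP, and the helper names `issue_slots` and `useful_fraction` are hypothetical.

```python
def issue_slots(thread_pool_size, active_harts, rounds=1):
    """Slot sequence over a number of interleaving rounds: the hart index for
    an active hart, or 'WFI' for an idle hart that only fills its slot."""
    one_round = [hart if hart < active_harts else "WFI"
                 for hart in range(thread_pool_size)]
    return one_round * rounds


def useful_fraction(thread_pool_size, active_harts):
    """Fraction of issue slots that carry a useful instruction."""
    slots = issue_slots(thread_pool_size, active_harts)
    return sum(slot != "WFI" for slot in slots) / len(slots)


if __name__ == "__main__":
    for codename, pool in (("T033", 3), ("T044", 4)):
        print(codename, issue_slots(pool, 1),
              f"useful slots: {useful_fraction(pool, 1):.2f}")
```

In this model a single-threaded program uses one slot in three on the T033 and one slot in four on the T044, which illustrates why the deeper organization is penalized on sequential code.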



Figure.4.2. (a) Klessydra T033 datapath, three harts interleave from RF to WB, (b) T044 datapath interleaves four harts between RF and WB

So, in order to have a balanced IMT architecture that is fast enough and does not burden the other harts with a big overhead, T033 remained our best choice, and post registerfile stage pipelining was not pursued.

#### 4.5.2. Pipeline stages before registerfile read access:

However, there are pipeline stages that can be added before the registerfile read access and that do not require the IMT to increase the thread pool baseline, as shown in figure 4.3. That is because the registerfile read and write accesses will still be fenced by the interleaving harts. The first option is to separate the decode and the registerfile access into separate stages, by placing the decode before the regfile access as seen in figure 4.3a. The second is to install a pre-fetch buffer as seen in figure 4.3b.

The T033 pipeline was written such that the registerfile access is completely independent of the decoding, meaning that both work in parallel. Hence, separating the decode and the registerfile stages does not give any performance boost. Introducing pre-fetch buffers, in turn, increases the number of instructions per hart in the core such that each hart has two instructions in the pipeline, and any taken branch then requires flushing logic to discard the instruction of the same hart that is present in the pre-fetch buffer. Re-introducing flushing is completely avoidable and, as demonstrated in the previous section, it has a significant impact on the throughput of the core, which makes this option unfavorable as well.



Figure.4.3. (a) Klessydra T044 datapath with five pipeline stages that still works by interleaving only four harts (b) Klessydra T044 with eight pipeline stages that still interleaves four harts, and needs flushing logic for branch misprediction

For the reasons mentioned earlier, pre-registerfile pipelining is avoided as well since it is either unnecessary or affects the processor's throughput by introducing branch delay slots, so we stick again with the T033 implementation.

#### 4.5.3. Conclusion:

We therefore chose to maintain the T033 as the optimal version of the core. In the remaining part of this chapter we will elaborate on the building blocks of the T033 and the software developed for it. One final note: from here on, any references to T033\_v2 and T022\_v2 will be made as 'T03' and 'T02' respectively, since our aim from the beginning was to use IMT cores with a TPS equal to the TPB (TPS=TPB).

### 4.6. The T03 core

In figure 4.4, we show the basic block organization of the T03 core. It is a balanced [23] four-stage in-order interleaved multithreaded processor. The pipeline stages are Fetch, Decode/Regfile, Execute, and Writeback. The Fetch stage does not have any buffers to hold the incoming instructions, hence incoming instructions pass directly to the Decode stage; but since the fetch has a one cycle latency, it is still considered a pipeline stage. The registerfile is read in the Decode/Regfile stage and written back in the last stage.

Registerfile reading and instruction decoding are ensured to be done in parallel, and all dependencies between the two processes are eliminated. A dependency between instruction decoding and regfile reading would result in a high logic path delay in that pipeline stage, placing the critical path in that particular stage.



Figure.4.4. Klessydra T033 block organization, interleaves three harts in the instruction pipeline

The principle of operation of each module in figure 4.4 will be illustrated over the next few pages. We start with the datapath in the instruction pipeline, and then move on to the remaining modules in the core.

• Fetch: the fetch unit is a simple finite state machine that sends a fetch request packet containing the program counter of the current active hart whenever the pipeline is not busy. The received instruction is sent to the decode stage. The RTL description of such a process is shown in the following code.

```
fsm_IF_nextstate : process(all)  -- acts as the control unit of the synchronous program memory
begin
  if busy_ID = '0' then  -- checks for a stall from the decode stage
    instr_req_o <= '1';  -- request next instruction
  else
    instr_req_o <= '0';  -- stall the instruction requests
  end if;
end process;

process(clk_i, rst_ni)
begin
  if rising_edge(clk_i) then
    if instr_gnt_i = '1' then  -- grant from the program memory
      -- pc propagation
      pc_ID <= pc_IF;  -- push the program counter of the incoming instruction to the decode stage
      -- harc propagation
      harc_ID <= harc_IF;  -- push the hart identifier to the decode stage
    end if;
  end if;
end process;
```

• **Decoder:** Fetched instructions go directly into the decoder. The time to fully decode an instruction is only one clock cycle. The decoded instruction is issued to the IE\_unit, or Instruction Execute unit, in which all the instructions are executed. The execution stage receives the type of the decoded instruction in the form of a *one-hot encoding*, in which the instruction to be executed corresponds to a single bit of the entire bit-vector. This decoding scheme might generate big vectors as the supported instruction set grows larger; however, it relieves the execution stage by leaving it only the simplest re-decoding of the instruction, so that it is limited to performing the execution. The pipeline is halted whenever the decoder receives a *busy\_IE* signal from the execution stage, which is raised for each instruction that requires more than one cycle to execute. The pseudo code below shows the one-hot decoding of the RISC-V instructions, and demonstrates how the one-hot patterns are encoded in the decode stage to be passed to the IE-stage.

```
fsm_ID_sync : process(clk_i, rst_ni, instr_word_ID_lat) -- synchronous single state process
begin
  if rst_ni = '0' then
    ......
  elsif rising_edge(clk_i) then
    if busy_IE = '1' then -- halt the decode if the IE-unit is busy
      ......
```

| 8  | elsif instr_rvalid_ID = '0' then halt if there is no incoming valid instruction                             |
|----|-------------------------------------------------------------------------------------------------------------|
| 9  |                                                                                                             |
| 10 | else else decode the incoming instruction                                                                   |
| 11 | Decode OF INSTRUCTION (BEGIN)                                                                               |
| 12 |                                                                                                             |
| 13 | ie_instr_req <= '1'; enable the IE stage                                                                    |
| 14 | case OPCODE_wires is                                                                                        |
| 15 | when OP_IMM =>                                                                                              |
| 16 | if(rd(instr_word_ID_lat) $\neq 0$ ) then instructions referencing $rd=x0$ instructions are executed as NOPs |
| 17 | case FUNCT3_wires is                                                                                        |
| 18 | when ADDI => ADDI instruction                                                                               |
| 19 | decoded_instruction_IE <= ADDI_pattern; assign the correct one hot pattern to the ADDI instruction          |
| 20 | when SLTI => SLTI instruction                                                                               |
| 21 | decoded_instruction_IE <= SLTI_pattern; assign the correct one hot pattern to the SLTI instruction          |
| 22 |                                                                                                             |
| 23 |                                                                                                             |
| 24 | when LUI => LUI instruction                                                                                 |
| 25 | if (rd(instr_word_ID_lat) /= 0) then                                                                        |
| 26 | decoded_instruction_IE <= LUI_pattern; assign the correct one hot pattern to the LUI instruction            |
| 27 | else R0_INSTRUCTION                                                                                         |
| 28 | decoded_instruction_IE <= NOP_pattern; assign the NOP pattern to the LUI instruction                        |
| 29 | end if;                                                                                                     |
| 30 | when AUIPC => AUIPC Instruction                                                                             |
| 31 |                                                                                                             |
| 32 |                                                                                                             |
| 33 | when others => ILLEGAL_INSTRUCTION                                                                          |
| 34 | decoded_instruction_IE <= ILL_pattern; assign illegal pattern to instructions with unrecognized opcode      |
| 35 | end case; OPCODE_wires cases                                                                                |
| 36 |                                                                                                             |
| 37 | Decode OF INSTRUCTION (END)                                                                                 |
| 38 | end if; instr. conditions                                                                                   |
| 39 | end if; <i>clk</i>                                                                                          |
| 40 | end process;                                                                                                |
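As a compact illustration of the one-hot scheme described above, the following self-contained sketch decodes a handful of RV32I instructions into a one-hot vector. The entity name `onehot_decode_sketch`, the vector width, and the bit positions are assumptions made for illustration and are not taken from the Klessydra sources; only the RISC-V opcode and funct3 encodings are standard. The handling of `rd = x0` as a NOP is omitted for brevity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch only: names, width and bit positions are assumptions.
entity onehot_decode_sketch is
  port (
    instr_i   : in  std_logic_vector(31 downto 0);
    decoded_o : out std_logic_vector(3 downto 0)  -- exactly one bit is set
  );
end entity onehot_decode_sketch;

architecture rtl of onehot_decode_sketch is
  -- hypothetical bit positions inside the one-hot vector
  constant ADDI_bit_position : natural := 0;
  constant SLTI_bit_position : natural := 1;
  constant LUI_bit_position  : natural := 2;
  constant ILL_bit_position  : natural := 3;
  -- standard RV32I major opcodes, instr(6 downto 0)
  constant OP_IMM : std_logic_vector(6 downto 0) := "0010011";
  constant LUI    : std_logic_vector(6 downto 0) := "0110111";
begin
  process (instr_i)
  begin
    decoded_o <= (others => '0');  -- default: no bit set
    case instr_i(6 downto 0) is
      when OP_IMM =>
        case instr_i(14 downto 12) is  -- funct3 field
          when "000"  => decoded_o(ADDI_bit_position) <= '1';
          when "010"  => decoded_o(SLTI_bit_position) <= '1';
          when others => decoded_o(ILL_bit_position)  <= '1';  -- not covered by this sketch
        end case;
      when LUI =>
        decoded_o(LUI_bit_position) <= '1';
      when others =>
        decoded_o(ILL_bit_position) <= '1';  -- unrecognized opcode
    end case;
  end process;
end architecture rtl;
```

With this encoding, the execute stage can select a functional unit by testing single bits, for example `decoded_o(ADDI_bit_position) = '1'`, rather than comparing opcode and funct3 fields again.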

- **Registerfile:** The T03 has a registerfile with two read ports and one write port (2Rd/1Wr), in which register 'x0' is hardwired to 0. The registerfile can be configured as a 32x32 registerfile following the RV32I instruction set, or as a 16x32 registerfile following the RV32E base instruction set. While an instruction is being decoded, its operands are read from the registerfile in parallel.
- **Comparators:** These are used to make branch decisions. Three comparators are needed to determine whether the operands satisfy a BEQ, BNE, BLT, BLTU, BGE, or BGEU. The comparators send a signal to the execute stage to indicate whether the branch will be taken or not. The comparators were separated from the execute stage in order to balance the decode and execute stages.

```
-- COMPARATORS ------
if (signed(regfile(harc_ID)(rs1(instr_word_ID_lat))(31 downto 0)) =
    signed(regfile(harc_ID)(rs2(instr_word_ID_lat))(31 downto 0))) then
  pass_BEQ_ID <= '1';
else
  pass_BNE_ID <= '1';
end if;
if (signed(regfile(harc_ID)(rs1(instr_word_ID_lat))(31 downto 0)) <
    signed(regfile(harc_ID)(rs2(instr_word_ID_lat))(31 downto 0))) then
  pass_BLT_ID <= '1';
else
  pass_BGE_ID <= '1';
end if;
if (unsigned(regfile(harc_ID)(rs1(instr_word_ID_lat))(31 downto 0)) <
    unsigned(regfile(harc_ID)(rs2(instr_word_ID_lat))(31 downto 0))) then
  pass_BLTU_ID <= '1';
else
  pass_BGEU_ID <= '1';
end if;
```
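The comparator excerpt only shows the assignments to '1'. As a minimal self-contained sketch of how three comparisons can drive all six branch decisions, the version below assigns every output in every case. The entity name, port names, and the concurrent coding style are assumptions for illustration and do not reproduce the Klessydra implementation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: entity and port names are assumptions.
entity branch_compare_sketch is
  port (
    rs1_data_i : in  std_logic_vector(31 downto 0);
    rs2_data_i : in  std_logic_vector(31 downto 0);
    pass_BEQ, pass_BNE   : out std_logic;
    pass_BLT, pass_BGE   : out std_logic;
    pass_BLTU, pass_BGEU : out std_logic
  );
end entity branch_compare_sketch;

architecture rtl of branch_compare_sketch is
  signal eq, lt_s, lt_u : std_logic;
begin
  -- the three comparators: equality, signed less-than, unsigned less-than
  eq   <= '1' when rs1_data_i = rs2_data_i else '0';
  lt_s <= '1' when signed(rs1_data_i) < signed(rs2_data_i) else '0';
  lt_u <= '1' when unsigned(rs1_data_i) < unsigned(rs2_data_i) else '0';

  -- each comparator yields a pair of complementary branch decisions
  pass_BEQ  <= eq;
  pass_BNE  <= not eq;
  pass_BLT  <= lt_s;
  pass_BGE  <= not lt_s;
  pass_BLTU <= lt_u;
  pass_BGEU <= not lt_u;
end architecture rtl;
```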

- **Execute:** The execute stage is controlled by an FSM with the following states:
  - **Reset State:** Initial state before the core begins executing instructions.
  - **Sleep State:** Idle state in which the core waits for a *fetch\_en\_i* signal or an interrupt.
  - **Debug State:** Indicates that the core is currently in debug mode.
  - **Data Valid Waiting State:** The core is waiting for data to be loaded from or stored into the memory.
  - **CSR Instruction Wait State:** Indicates that the core is handling CSR instructions.

The execute stage encapsulates all the functional units required to execute the RISC-V instructions. The functional units are shared by the instructions, and a mapper is included in order to correctly map the instruction operands to their corresponding FUs:

- ADDI, ADD, SUB, AUIPC, JAL, and JALR share the same adder.
- SLLI and SLL share the same left shifter.
- SRLI, SRAI, SRL, and SRA share the same right shifter.
- AND and ANDI, OR and ORI, and XOR and XORI each share their corresponding logical unit.
- MUL, MULH, MULHU, and MULHSU share the same multiplier.
- DIV, DIVU, REM, and REMU share the same divider.
- LOAD and STORE instructions have their own adder for address generation.
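To make the role of the operand mapper concrete, the following sketch shows one way a single adder could be shared by the instructions in the first item of the list. All names are hypothetical. The operand choices follow the RISC-V semantics of each instruction, and the use of the adder to compute the link address `pc + 4` for JAL and JALR is an assumption, not a statement about the Klessydra sources.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: all names are assumptions.
entity shared_adder_sketch is
  port (
    -- decoded instruction flags; when none is set, a plain ADD is performed
    is_sub, is_addi, is_auipc, is_link : in std_logic;
    rs1_data, rs2_data : in  std_logic_vector(31 downto 0);
    imm_i, imm_u, pc   : in  std_logic_vector(31 downto 0);
    result             : out std_logic_vector(31 downto 0)
  );
end entity shared_adder_sketch;

architecture rtl of shared_adder_sketch is
begin
  process (is_sub, is_addi, is_auipc, is_link,
           rs1_data, rs2_data, imm_i, imm_u, pc)
    variable op_a, op_b : signed(31 downto 0);
  begin
    -- operand mapper: default mapping corresponds to ADD (rs1 + rs2)
    op_a := signed(rs1_data);
    op_b := signed(rs2_data);
    if is_sub = '1' then
      op_b := -signed(rs2_data);     -- SUB: rs1 + (-rs2)
    elsif is_addi = '1' then
      op_b := signed(imm_i);         -- ADDI: rs1 + sign-extended I-immediate
    elsif is_auipc = '1' then
      op_a := signed(pc);            -- AUIPC: pc + U-immediate
      op_b := signed(imm_u);
    elsif is_link = '1' then
      op_a := signed(pc);            -- JAL/JALR: link address pc + 4 (assumption)
      op_b := to_signed(4, 32);
    end if;
    -- the single shared adder
    result <= std_logic_vector(op_a + op_b);
  end process;
end architecture rtl;
```

Only the operand multiplexers differ per instruction, so the cost of supporting an additional adder-based instruction is one more mapper entry rather than another 32-bit adder.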

Branch instructions update the program counter of the corresponding hart if the branch is taken. T03 implementations of the core do not need any flushing logic, since each hart has only one instruction in the pipeline at a time. The execute stage also handles CSR instructions: it puts the registerfile data on the CSR write bus and the CSR data on the read bus.

In addition, pending interrupts are served in the IE stage. Interrupt handling is described in more detail later in this chapter.

```
fsm_IE_sync : process(clk_i, rst_ni)
  -- pragma translate_off
  variable row : line; -- local variable for instruction tracing, not synthesizable
  -- pragma translate_on
begin
  if rst_ni = '0' then
    ...
  elsif rising_edge(clk_i) then
    case state_IE is -- stage state
      when normal =>
        -- check if there is a valid instruction and the thread it belongs to is not in a delay slot:
        if instr_rvalid_IE = '0' then
          instr_rvalid_WB <= '0'; -- do nothing and wait for valid instruction and finished delay slot
        elsif irq_pending(harc_IE) = '1' then
          instr_rvalid_WB <= '0'; -- in the sync process we don't need to do anything here
        else -- process the instruction
          -- EXECUTE OF INSTRUCTION (BEGIN)

          if decoded_instruction_IE(ADDI_bit_position) = '1' or
             decoded_instruction_IE(ADD7_bit_position) = '1' or
             decoded_instruction_IE(SUB7_bit_position) = '1' or
             decoded_instruction_IE(AUIPC_bit_position) = '1' or
             decoded_instruction_IE(JAL_bit_position) = '1' or
             decoded_instruction_IE(JALR_bit_position) = '1' then
            if (rd(instr_word_IE) = 0) then -- condition for JAL and JALR ops which execute when "rd = x0"
              IE_WB_EN <= '1';
            end if;
            IE_WB <= std_logic_vector(signed(add_op_A)+signed(add_op_B)); -- ADDER
          end if;

          if decoded_instruction_IE(SLLI_bit_position) = '1' or
             decoded_instruction_IE(SLLL_bit_position) = '1' then
            WB_EN <= '1';
            WB <= to_stdlogicvector(to_bitvector(sl_op_A) sll to_integer(unsigned(sl_op_B))); -- LEFT SHIFTER
          end if;
          ......

          if decoded_instruction_IE(SW_MIP_bit_position) = '1' then
            if sw_mip = '1' and halt_IE = '0' then
              core_busy_IE_wires := '1'; -- halt the core since the instruction takes more than one cycle
              nextstate_IE_wires := csr_instr_wait_state; -- software ints write to the MIP registers of the target hart
            end if;
          end if;
          ......

          -- EXECUTE OF INSTRUCTION (END)
        end if; -- instr_rvalid_IE values
      when csr_instr_wait_state =>
        ......
      when others =>
        ......
    end case; -- fsm_IE state cases
  end if; -- refers to reset signal
end process;
-- end of IE stage
```
| 60 |                                                                                                    |

- **Writeback:** The writeback stage writes the result from the IE stage back to the registerfile when it receives a "WB\_EN" from the IE stage. Since all the execution units are encapsulated in one entity, at most one result is produced per cycle. A hart can only write to its own registerfile, so each registerfile needs only one write port, since only one result will be ready at a time.
- **Program Counter:** A *pc\_updater* FSM updates the program counter of each hart. By default it increments the current *pc* address in order to fetch the next instruction. A program counter may also be updated by events coming from the following signals:
  - *set\_branch\_condition*: the event happens in case of unconditional jumps or taken branches.
  - *set\_except\_condition*: the event happens when executing an illegal instruction, a misaligned memory access, or an Environment Call (ECALL) instruction. The program counter is updated to jump to the exception handling routine.
  - *irq\_pending*: the event occurs due to incoming external or timer interrupts, or inter-thread software interrupts. The program counter is updated to jump to the interrupt handling routine.
  - *rst\_ni*: the event occurs only once at the startup time of execution, and updates the program counter with the boot\_pc that contains the boot pointer.

The program counter unit contains the hart interleaving unit, also known as the hardware context counter (harc). The harc selects, in an interleaved fashion, the hart whose program counter is updated.
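The interleaving performed by the harc can be sketched as a round-robin counter that selects which hart's program counter advances in a given cycle. The sketch below is a simplified illustration under stated assumptions: the generic `HART_COUNT`, the up-counting order, and the reset of every program counter to zero (rather than to the boot\_pc) are hypothetical, and branch, exception, and interrupt updates are left out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: names, counting order and reset values are assumptions.
entity harc_counter_sketch is
  generic (HART_COUNT : positive := 4);
  port (
    clk_i  : in  std_logic;
    rst_ni : in  std_logic;
    harc_o : out natural range 0 to HART_COUNT - 1;   -- hart selected this cycle
    pc_o   : out std_logic_vector(31 downto 0)        -- pc of the selected hart
  );
end entity harc_counter_sketch;

architecture rtl of harc_counter_sketch is
  type pc_array_t is array (0 to HART_COUNT - 1) of unsigned(31 downto 0);
  signal pc   : pc_array_t;
  signal harc : natural range 0 to HART_COUNT - 1;
begin
  process (clk_i, rst_ni)
  begin
    if rst_ni = '0' then
      harc <= 0;
      pc   <= (others => (others => '0'));
    elsif rising_edge(clk_i) then
      -- only the selected hart advances its program counter
      pc(harc) <= pc(harc) + 4;
      -- round-robin selection of the next hart
      if harc = HART_COUNT - 1 then
        harc <= 0;
      else
        harc <= harc + 1;
      end if;
    end if;
  end process;

  harc_o <= harc;
  pc_o   <= std_logic_vector(pc(harc));
end architecture rtl;
```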

- **CSR Unit:** The control and status register unit handles the execution of the CSR instructions, the automatic update of some registers due to certain events such as exceptions or interrupts (it also maps the inter-thread software interrupts to the appropriate CSR unit), and the MRET instructions. A subset of the CSR registers is supported in Klessydra; they are listed in Table 4.4. Each CSR unit has a unique identification number in the read-only MHARTID register. More details about the implementation of the CSR registers can be found in Appendix A.

Table 4.4. Control and status registers supported by Klessydra cores

| Name        | R/W | Description                      |
|-------------|-----|----------------------------------|
| MSTATUS     | R/W | status register                  |
| MEPC        | R/W | exception program counter        |
| MCAUSE      | R/W | trap cause                       |
| PCER        | R/W | performance counter enabler      |
| MESTATUS    | R/W | exception status register backup |
| MHPMCOUNTER | R/W | performance-monitoring counter   |
| MHPMEVENT   | R/W | performance-event selector       |
| MCPUID      | R   | cpu description                  |
| MIMPID      | R   | implementation description       |
| MHARTID     | R   | hardware thread integer id       |
| MIP         | R/W | interrupt pending type           |
| MTVEC       | R/W | trap-handler base address        |
| MIRQ        | R   | ext. interrupt request number    |
| MBADADDR    | R/W | misaligned address value         |
- **Debug Unit:** The core also includes a basic debug unit, which can halt the execution through a debug request or an EBREAK instruction. In debug mode the core can be in one of two states: the *halt state*, in which the core halts execution after the last fetched instruction, and the *single step mode*, in which the core steps through the instructions one at a time. In debug mode, the debug unit can read the registerfile contents of the hart in the execute stage. To read the contents of the other harts, the debug unit must single-step through the instructions until the desired hart arrives at the execute stage.

### 4.7. Trap handling

#### 4.7.1. Trap handling through hardware

When a trap occurs, the IE stage automatically sends a signal to the program counter so that it updates the *pc* value of the hart in the IE stage to jump to the machine trap vector address MTVEC. The CSR unit updates the corresponding CSR registers:

- **MCAUSE** is updated with the type of exception if the trap was due to an exception.
- **MIP** is updated with the type of interrupt if the trap cause was an interrupt request.
- **MEPC** is updated with the *pc* value of the instruction that was executing when the trap occurred.
- **MSTATUS** indicates that trap handling is in progress, and disables nested trap handling.
- **MESTATUS** stores a backup of the pre-trap MSTATUS value.
- **MBADADDR** holds the misaligned address if the trap was due to a misaligned access.

**Interrupts:** Klessydra cores support three types of interrupts: *external*, *timer*, and *software* interrupts. Timer and external interrupts are handled by hart 0 only, as shown in the code below.

```
-- synchronous assignment to MIP_internal bits:
-- this is Pulpino-specific assignment, i.e. the timer-related IRQ vector value
-- the h index refers to the hart, in this case hart 0 only enters the condition
if h = 0 and unsigned(irq_id_i) >= 28 and irq_i = '1' then -- the irq is a timer interrupt
  MIP_internal(h)(7) <= '1';
else
  MIP(h)(7) <= '0';
end if;
-- this detects the other IRQ vector values in Pulpino
if h = 0 and unsigned(irq_id_i) < 28 and irq_i = '1' then -- the irq is an external interrupt
  MIP(h)(11) <= '1';
else
  MIP(h)(11) <= '0';
end if; -- the MIP(h)(3), software interrupt bit is handled by all the harts
```

All harts, on the other hand, can send and receive software interrupts through a store word (SW) instruction to a specific address in the memory map. The address tag (the upper bits of the SW address) is checked in the ID stage, and if it matches the software interrupt address tag, the store instruction instead acts as a CSR instruction that writes to the MIP register of the target hart, as shown in the VHDL code below.

```
if decoded_instruction_IE(SW_MIP_bit_position) = '1' then -- a store word that writes to the MIP of a hart
  if sw_mip = '1' then -- the upper bits of the address are decoded in the ID stage to know if the SW is a SW_MIP
    csr_op_i <= CSRRW; -- set the type of CSR instruction
    csr_instr_req <= '1'; -- enable the CSR unit
    ie_csr_wdata_i <= RS2_Data_IE; -- put the data on the CSR bus
    csr_wdata_en <= '1'; -- enable csr write
    csr_addr_i <= MIP_ADDR; -- the csr address is the MIP register
    -- the lower address bits of the SW instruction are decoded to know which hart receives the software interrupt
    for i in harc_range loop
      if data_addr_internal_IE(3 downto 0) = std_logic_vector(to_unsigned((4*i),4)) then
        harc_to_csr <= i; -- harc_to_csr enables the target CSR unit
      end if;
    end loop;
  end if;
end if;
```

When a hart receives an interrupt of any type, the interrupt is serviced as soon as the hart arrives at the IE stage of the pipeline, and the instruction that is currently in the IE stage is not executed. The hart jumps to the interrupt service routine and, at the end of the routine, returns with an *MRET* instruction to the same address in order to execute the instruction that was discarded. If the discarded instruction happened to be a *WFI*, this is recorded when the trap occurs in bit 30 of the MCAUSE register of the hart indexed by *h*, and the MRET returns to "WFI\_ptr + 4" instead. This is essential in order to prevent the core from being stuck in an infinite loop. The following code briefly shows how the CSR unit updates the CSR registers for each type of event (interrupt or exception) and how MSTATUS is restored after the interrupt routine has been serviced.

-- it is the MEIP bit, ext. irq

<sup>--</sup> Interrupt-cause CSR updating ------

<sup>--</sup> note: PC just udpdated, MIP\_internals can't have been cleared yet.

if served\_irq(h) = '1' and MIP\_internal(h)(11) = '1' then

| 5<br>6<br>7<br>8<br>9 | MCAUSE_internal(h) <= "1" & std_logic_vector(to_unsigned(11, 31)); ext. irq<br>MESTATUS(h)(2 downto 1) <= MSTATUS_internal(h); push the MSTATUS to back MESTATUS register<br>if WFI_Instr = '1' then Indicates to the MEPC that the return address contains a WFI instruction<br>MCAUSE internal(h)(30) <= '1'; |
|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0                     | else                                                                                                                                                                                                                                                                                                            |
| 10                    | MCAUSE internal(h)(30) $\leq 0$ ;                                                                                                                                                                                                                                                                               |
| 11                    | end if;                                                                                                                                                                                                                                                                                                         |
| 12                    | MSTATUS internal(h)(0) $\leq 0'$ ; interrupt handling temporarily disabled,                                                                                                                                                                                                                                     |
| 13                    | MSTATUS internal(h)(0) <= 0, = internal(h)(0); trap handling pending in progress                                                                                                                                                                                                                                |
| 14                    | elsif served irq(h) = '1' and MIP internal(h)(3) = '1' then                                                                                                                                                                                                                                                     |
| 15                    | it is the MSIP bit, sw interrupt req                                                                                                                                                                                                                                                                            |
| 16                    | MCAUSE internal(h) $\leq$ "1" & std logic vector(to unsigned(3, 31)); sw interrupt                                                                                                                                                                                                                              |
| 17                    | MIP internal(h)(3) $\leq 0'$ ; we reset the sw int. request just being served                                                                                                                                                                                                                                   |
| 18                    | similar assignments as the ext irq                                                                                                                                                                                                                                                                              |
| 19                    | Similar assignments as the ext hy                                                                                                                                                                                                                                                                               |
| 20                    | elsif served $irq(h) = '1'$ and MIP $internal(h)(7) = '1'$ then                                                                                                                                                                                                                                                 |
| $\overline{21}$       | it is the MTIP bit, timer interrupt req                                                                                                                                                                                                                                                                         |
| 22                    | MCAUSE internal(h) $\leq$ "1" & std logic vector(to unsigned(7, 31)); timer interrupt                                                                                                                                                                                                                           |
| $\overline{23}$       | similar assignments as the ext irq                                                                                                                                                                                                                                                                              |
| 24                    |                                                                                                                                                                                                                                                                                                                 |
| 25                    | Exception-cause CSR updating                                                                                                                                                                                                                                                                                    |
| 26                    | elsif served except condition(h) = '1' then                                                                                                                                                                                                                                                                     |
| 27                    | if served ie except condition(h) = '1' then                                                                                                                                                                                                                                                                     |
| 28                    | MCAUSE_internal(h) <= ie_except_data; exception cause passed from IE Stage                                                                                                                                                                                                                                      |
| 29                    | end if;                                                                                                                                                                                                                                                                                                         |
| 30                    | MESTATUS(h)(2 downto 1) <= MSTATUS_internal(h); push the MSTATUS to backup register MESTATUS                                                                                                                                                                                                                    |
| 31                    | MEPC_internal(h) <= pc_except_value_wire(h);                                                                                                                                                                                                                                                                    |
| 32                    | $MSTATUS_internal(h)(0) \le '0'; interrupt handling temporarily disabled,$                                                                                                                                                                                                                                      |
| 33                    | MSTATUS_internal(h)(1) <= '1'; trap handling pending in progress                                                                                                                                                                                                                                                |
| 34                    | if misaligned_err = '1' then                                                                                                                                                                                                                                                                                    |
| 35                    | MBADADDR(h) <= data_addr_internal; store the misaligned address that caused the trap                                                                                                                                                                                                                            |
| 36                    | end if;                                                                                                                                                                                                                                                                                                         |
| 37                    |                                                                                                                                                                                                                                                                                                                 |
| 38                    | MRET-cause CSR updating                                                                                                                                                                                                                                                                                         |
| 39                    | elsif served_mret_condition(h) = '1' then                                                                                                                                                                                                                                                                       |
| 40                    | MSTATUS_internal(h)(1) <= '1'; re-enable the trap handling                                                                                                                                                                                                                                                      |
| 41                    | $MSTATUS_internal(h)(0) \le MSTATUS_internal(h)(1); indicate the core is no longer handling traps$                                                                                                                                                                                                              |
| 42                    | end if;                                                                                                                                                                                                                                                                                                         |
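
The update order in the MRET branch is easy to misread: both statements are VHDL signal assignments, so line 41 copies the value that bit 1 held before line 40 took effect. The C fragment below is a minimal behavioural sketch of listing lines 32 to 41 only and is not part of the Klessydra sources. The type `hart_csr_t` and the functions `csr_on_trap` and `csr_on_mret` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the CSR state of one hart.
   Only the fields touched by listing lines 32-41 are modelled. */
typedef struct {
    bool     mstatus_bit0;  /* MSTATUS_internal(h)(0) */
    bool     mstatus_bit1;  /* MSTATUS_internal(h)(1) */
    uint32_t mbadaddr;      /* MBADADDR(h) */
} hart_csr_t;

/* Listing lines 32-36: updates in the branch that precedes the MRET case. */
void csr_on_trap(hart_csr_t *c, bool misaligned_err, uint32_t data_addr)
{
    c->mstatus_bit0 = false;
    c->mstatus_bit1 = true;
    if (misaligned_err) {
        c->mbadaddr = data_addr;  /* remember the offending address */
    }
}

/* Listing lines 39-41: updates when an MRET is served. */
void csr_on_mret(hart_csr_t *c)
{
    /* Sample bit 1 first: a VHDL signal read inside the process still
       sees the old value, even though line 40 assigns it earlier. */
    bool old_bit1 = c->mstatus_bit1;
    c->mstatus_bit1 = true;
    c->mstatus_bit0 = old_bit1;
}
```

Starting from a hart with bit 0 set and bit 1 cleared, a call to `csr_on_trap` followed by `csr_on_mret` leaves both modelled bits set.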

### 4.7.2. Trap handling through software

The startup code contains an MTVEC label that marks the start of the routine executed on a trap. The routine compares the MCAUSE value against the codes in the table of trap handlers to determine which handler to execute. Once MCAUSE matches an entry in the trap table, the routine jumps to the corresponding trap handling routine defined by PULPino, and then returns to the execution environment. Below is a partial assembly snippet of the trap handling routine from the klessydra startup.S file.
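
Before reading the assembly, the comparison chain performed by the routine can be summarised in C. This is only a sketch under stated assumptions: the numeric cause codes are placeholders (the real values are defined in the Klessydra sources and are not reproduced here), `select_trap_target` and the `trap_target` enumeration are hypothetical names, and the external interrupt case is assumed to mirror the timer case, whose jump through `t4` is visible in the snippet.

```c
#include <stdint.h>

/* Cause codes named as in the assembly. The numeric values are
   placeholders and only need to be distinct for this sketch. */
enum cause_code {
    EXT_INTERRUPT_CODE = 1,
    SW_INTERRUPT_CODE_WFI,
    SW_INTERRUPT_CODE_NO_WFI,
    TIMER_INTERRUPT_CODE,
    ECALL_EXCEPT_CODE,
    ILLEGAL_INSN_EXCEPT_CODE,
    LOAD_ERROR_EXCEPT_CODE,
    STORE_ERROR_EXCEPT_CODE,
    LOAD_MISALIGNED_EXCEPT_CODE
};

/* Hypothetical tags for the branch targets reached by the routine. */
enum trap_target {
    TARGET_IRQ_ADDRESS,    /* jr t4: address previously read from MIRQ */
    TARGET_SOFTWARE_INSN,  /* sofware_insn_handler (spelled as in the source) */
    TARGET_ECALL_INSN,     /* ecall_insn_handler */
    TARGET_ILLEGAL_INSN,   /* illegal_insn_handler */
    TARGET_INVALID_ADDR,   /* invalid_addr_handler */
    TARGET_NOT_SHOWN       /* beyond the part of the routine shown here */
};

/* Mirrors the order of the comparisons in the assembly routine. */
enum trap_target select_trap_target(uint32_t mcause)
{
    if (mcause == EXT_INTERRUPT_CODE)       return TARGET_IRQ_ADDRESS; /* assumed */
    if (mcause == SW_INTERRUPT_CODE_WFI)    return TARGET_SOFTWARE_INSN;
    if (mcause == SW_INTERRUPT_CODE_NO_WFI) return TARGET_SOFTWARE_INSN;
    if (mcause == TIMER_INTERRUPT_CODE)     return TARGET_IRQ_ADDRESS;
    if (mcause == ECALL_EXCEPT_CODE)        return TARGET_ECALL_INSN;
    if (mcause == ILLEGAL_INSN_EXCEPT_CODE) return TARGET_ILLEGAL_INSN;
    if (mcause == LOAD_ERROR_EXCEPT_CODE)   return TARGET_INVALID_ADDR;
    if (mcause == STORE_ERROR_EXCEPT_CODE)  return TARGET_INVALID_ADDR;
    return TARGET_NOT_SHOWN;
}
```

The chain is evaluated in the same order as the branches in the assembly, so interrupt causes are tested before exception causes, and both software interrupt codes share one handler.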

| Label             | Instruction                             | Comment                                                                                                                |
|-------------------|-----------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| mtvec_routine:    |                                         |                                                                                                                        |
|                   | addi sp, sp, -KLESSYDRA_EXC_STACK_SIZE; | // decrement the stack pointer                                                                                         |
|                   | sw t4, 0x00(sp);                        | // save the registers to be modified on the stack                                                                      |
|                   | sw t5, 0x04(sp);                        |                                                                                                                        |
|                   | sw t6, 0x08(sp);                        |                                                                                                                        |
|                   | csrrs t5, mcause, x0;                   | // load the cause of the trap                                                                                          |
|                   | csrr t4, mirq;                          | // load the interrupt id                                                                                               |
|                   | li t6, EXT_INTERRUPT_CODE;              |                                                                                                                        |
|                   | bne t5, t6, no_ext_interrupt;           | // Check whether the trap was due to an external interrupt                                                             |
|                   | lw t5, 0x04(sp);                        |                                                                                                                        |
|                   | lw t6, 0x08(sp);                        |                                                                                                                        |
| no_ext_interrupt: |                                         |                                                                                                                        |
|                   | li t6, SW_INTERRUPT_CODE_WFI;           | // In klessydra, if we have a WFI, we write a "1" to the bit mcause(30), to return to the instruction following the WFI |
|                   | beq t5, t6, sofware_insn_handler;       |                                                                                                                        |
|                   | li t6, SW_INTERRUPT_CODE_NO_WFI;        | // Check whether the trap was due to a software interrupt                                                              |
|                   | beq t5, t6, sofware_insn_handler;       | // Both cases jump to the same routine, since mepc is incremented in hardware when the mepc return value is a WFI instruction |
|                   | li t6, TIMER_INTERRUPT_CODE;            | // Check whether the trap was due to a timer interrupt                                                                 |
|                   | bne t5, t6, exception_trap;             |                                                                                                                        |
|                   | lw t5, 0x04(sp);                        |                                                                                                                        |
|                   | lw t6, 0x08(sp);                        |                                                                                                                        |
|                   | jr t4;                                  |                                                                                                                        |
| exception_trap:   |                                         |                                                                                                                        |
|                   | li t6, ECALL_EXCEPT_CODE;               | // Check whether the trap was due to an ECALL instruction                                                              |
|                   | beq t5, t6, ecall_insn_handler;         |                                                                                                                        |
|                   | li t6, ILLEGAL_INSN_EXCEPT_CODE;        | // Check whether the trap was due to executing an illegal instruction                                                  |
|                   | beq t5, t6, illegal_insn_handler;       |                                                                                                                        |
|                   | li t6, LOAD_ERROR_EXCEPT_CODE;          | // Check whether the trap was due to a load error                                                                      |
|                   | beq t5, t6, invalid_addr_handler;       |                                                                                                                        |
|                   | li t6, STORE_ERROR_EXCEPT_CODE;         | // Check whether the trap was due to a store error                                                                     |
|                   | beq t5, t6, invalid_addr_handler;       |                                                                                                                        |
|                   | li t6, LOAD_MISALIGNED_EXCEPT_CODE;     | // Check whether the trap was due to a misaligned access
                  |
| <ul> <li>beq t5, t6, ecall_insn_handler;</li> <li>li t6, ILLEGAL_INSN_EXCEPT_CODE; // Check whether the trap was due to executing an illegal instructio</li> <li>beq t5, t6, illegal_insn_handler;</li> <li>li t6, LOAD_ERROR_EXCEPT_CODE; // Check whether the trap was due to a load error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |           |                                                           
                  |
| <ul> <li>li t6, ILLEGAL_INSN_EXCEPT_CODE; // Check whether the trap was due to executing an illegal instructio</li> <li>beq t5, t6, illegal_insn_handler;</li> <li>li t6, LOAD_ERROR_EXCEPT_CODE; // Check whether the trap was due to a load error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | CEPT_C    | // Check whether the trap was due to an ECALL instruction 
                  |
| <ul> <li>beq t5, t6, illegal_insn_handler;</li> <li>li t6, LOAD_ERROR_EXCEPT_CODE; // Check whether the trap was due to a load error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | sn_hand   |                                                           
                  |
| <ul> <li>beq t5, t6, illegal_insn_handler;</li> <li>li t6, LOAD_ERROR_EXCEPT_CODE; // Check whether the trap was due to a load error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6,LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |           | CODE; // Check whether the trap was due to executing an 
illegal instruction |
| <ul> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6,LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |           |                                                           
                  |
| <ul> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6, STORE_ERROR_EXCEPT_CODE; // Check whether the trap was due to a store error</li> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6,LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | OR EXC    | CODE; // Check whether the trap was due to a load error   
                  |
| 35beq t5, t6, invalid_addr_handler;36li t6,LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |           |                                                           
                  |
| <ul> <li>beq t5, t6, invalid_addr_handler;</li> <li>li t6,LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | OR EX     | CODE; // Check whether the trap was due to a store error  
                  |
| 36 li t6,LOAD_MISALIGNED_EXCEPT_CODE; // Check whether the trap was due to a misaligned access                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |           |                                                           
                  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |           |                                                           
                  |
| beq t5, t6, invalid addr handler;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |           |                                                           
                  |
| 38                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | · _       |                                                           
                  |
| 39 lw t4,0x00(sp); // recover the stack                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | recover i | лсk                                                       
                  |
| 40 $lw t5, 0x04(sp);$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |           |                                                           
                  |
| 41 lw t6, 0x08(sp); | | |
| 42 addi sp, sp, KLESSYDRA_EXC_STACK_SIZE; // recover the stack pointer | | |
| 43 mret; // return to the execution environment | | |

Klessydra-specific C functions have been integrated into the Pulpino libraries so that software interrupts can be sent quickly. The following is the body of the C function that sends a software interrupt to a target hart. The function takes one argument, the hart id, from which it generates the address of the target hart's MIP register, and then issues a store word to that address.

| int send_sw_irq(int targethart){ |
|---|
| int mip_data_send = 8; |
| int store_addr = 0xff00; // Base address of the software interrupt memory section |
| if(targethart >= THREAD_POOL_SIZE) return 0; // the hart to which the interrupt would be sent does not exist |
| else { store_addr = store_addr + (4*targethart); // MIP address generation |
| store_mem(mip_data_send, store_addr); // Send a store word to the generated MIP address |
| return 1;}} |
| |
| void store_mem(int data_send, int store_addr) { |
| asm ("sw %0, (%1);" |
| :/*no output register*/ |
| :"r"(data_send), "r"(store_addr) |
| :/*no clobbered register*/);} |
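The address generation performed by the function above can be checked on a host machine with a small model. The following is only a sketch under stated assumptions: `mip_address`, `send_sw_irq_model`, and the `mip_regs` array are hypothetical names that stand in for the memory-mapped MIP registers, and `THREAD_POOL_SIZE` is assumed to be 3. The value 8 corresponds to bit 3 of `mip`, which is the machine software interrupt pending bit in the RISC-V privileged specification.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_POOL_SIZE 3        /* assumption: three harts in the pool */
#define MIP_BASE_ADDR    0xff00u  /* base address used in the listing above */
#define MIP_MSIP         8u       /* bit 3 of mip: software interrupt pending */

/* Hypothetical stand-in for the memory-mapped MIP registers, one per hart. */
uint32_t mip_regs[THREAD_POOL_SIZE];

/* Returns the MIP address of the target hart, or 0 if the hart does not exist. */
uint32_t mip_address(int targethart) {
    if (targethart < 0 || targethart >= THREAD_POOL_SIZE) return 0;
    return MIP_BASE_ADDR + 4u * (uint32_t)targethart;
}

/* Host-side model of send_sw_irq: writes 8 into the modelled MIP register. */
int send_sw_irq_model(int targethart) {
    uint32_t addr = mip_address(targethart);
    if (addr == 0) return 0;                      /* target hart does not exist */
    uint32_t index = (addr - MIP_BASE_ADDR) / 4u; /* map address back to a hart */
    assert(index < THREAD_POOL_SIZE);
    mip_regs[index] = MIP_MSIP;
    return 1;
}
```

With these assumptions, hart 2 maps to address `0xff08`, and a request for hart 3 is rejected because it lies outside the thread pool.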

### 4.8. Thread synchronization.

### 4.7.3. Atomic Instruction Support:

The atomic extension was added to the instruction set supported by the Klessydra-T cores in order to support thread synchronization among the harts. Only a minimal integration was carried out, however: the only atomic instruction implemented is *amoswap*. Implementing *amoswap* is sufficient to synchronize threads and to implement region locks (acquire and release) on a memory location. Briefly, an amoswap instruction atomically loads the value stored at a memory location (the lock) and swaps it with a value held in a register (the key). For amoswap to work correctly, the pointers used by the instruction must address the shared .data section of the data memory, which is achieved by declaring the locks as global variables. They must not address the .stack section, since each hart has its own dedicated stack region in memory. The following are the bodies of the functions that acquire and release a lock on a memory region. Both functions take one argument, a pointer to the lock, which is a global variable.

| void klessydra_lock_acquire(int *lock){ |
|---|
| int temp0 = 1; |
| asm volatile ( |
| "1: " |
| "amoswap.w.aq %0, %0, (%1);" // Set the lock by swapping in the key '1'; the previous lock value is returned in temp0 |
| "bnez %0, 1b;" // loop until the lock is released, i.e. until the previous value was '0' |
| :"+r" (temp0) // temp0 is both read and written by the swap |
| :"r" (lock) |
| :"memory");} |
| void klessydra_lock_release(int *lock){ |
| asm volatile ( |
| "amoswap.w.rl x0, x0, (%0);" // Release lock by storing 0. |
| :/*no output*/ |
| :"r" (lock) |
| :"memory");} |
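For readers who want to try the locking scheme on a host machine, the same swap-based lock can be sketched with C11 atomics. This is a portable analogue and not the Klessydra library code: `atomic_exchange_explicit` plays the role of `amoswap.w.aq`, an atomic store of zero plays the role of `amoswap.w.rl`, and the names `lock_try_acquire`, `lock_acquire`, and `lock_release` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* One swap attempt: write 1 into the lock and inspect the previous value.
   Returns 1 if the lock was free (previous value 0), 0 otherwise. */
int lock_try_acquire(atomic_int *lock) {
    return atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0;
}

/* Spin until a swap finds the lock free, like the amoswap/bnez loop. */
void lock_acquire(atomic_int *lock) {
    while (!lock_try_acquire(lock)) {
        /* busy wait: another thread currently holds the lock */
    }
}

/* Release the lock by storing 0 with release ordering. */
void lock_release(atomic_int *lock) {
    assert(atomic_load(lock) == 1); /* releasing a free lock is a caller bug */
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

As in the original functions, the lock must live in memory shared by all threads, for example a global `atomic_int`, and never on a thread's private stack.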

#### 4.7.4. Barrier Functions:

The previous functions ensure safe access to shared memory regions by blocking the access of all the other harts. To synchronize the threads themselves, however, the Klessydra libraries include an additional set of sync barrier functions.

- sync\_barrier\_reset is used once at the beginning of the code, while the harts are in sync. The function performs a csrw to the MSTATUS register in order to enable the handling of interrupts (software interrupts in our case), and it initializes all the variables that are read by the following functions.
- sync\_barrier\_thread\_registration is used while the harts are in sync, and it registers every hart that calls it. This registration process is essential for knowing the total number of harts interleaving in the IMT core.
- sync\_barrier synchronizes the harts. The harts to be synchronized call the function as they reach it, and each hart records itself in an array to indicate that it has arrived at the barrier. A conditional structure compares the number of registered harts with the number of harts that have arrived at the barrier. If fewer harts have arrived than were registered, the hart that entered the barrier function goes to a WFI state. Once the last hart enters the function and records its arrival, the condition finds that all the harts have arrived, so this hart takes the 'else' branch and sends a software interrupt to every sleeping hart in the core. The harts then return from the function synchronized. One important note is that the barrier-arrival registration and the subsequent check of the number of arrived harts are performed atomically. Without atomicity, a hart might in some cases read inconsistent global variables, with the result that all the harts are sent to a WFI state.

The sync barrier function bodies are shown below.

```
void sync_barrier_reset(){
    int i;
    int key = 1;
    static int section = 0;
    int* ptr_section = &section;
    asm volatile
    (
        "csrrw zero, mstatus, 8;" // enable the interrupt handling
        "amoswap.w.aq %[key], %[key], (%[ptr_section]);"
        :[key] "+r" (key)                  // key is both read and written by the swap
        :[ptr_section] "r" (ptr_section));
    if (section == 0)
        for (i=0; i<THREAD_POOL_SIZE; i++) {
            sync_barrier_register[i] = 0; }}

void sync_barrier_thread_registration(){
    int my_hart;
    my_hart = Klessydra_get_coreID();
    arrived_at_barrier[my_hart] = 0;
    sync_barrier_register[my_hart] = 1;}

void sync_barrier(){
    int my_hart, i;
    int *ptr_key = &key_barr;
    my_hart = Klessydra_get_coreID();
    if (sync_barrier_register[my_hart] == 1) { // checks if the hart entering was registered
        klessydra_lock_acquire(ptr_key); // the following routine must be done atomically
        barrier_completed[my_hart] = 1; // assume that all harts arrived; cleared below otherwise
        arrived_at_barrier[my_hart] = 1; // notifies the core that the hart with the id in "my_hart" has arrived
        for (i=0; i<THREAD_POOL_SIZE; i++)
            if (arrived_at_barrier[i] == 0 && sync_barrier_register[i] == 1) {
                barrier_completed[my_hart] = 0;} // reset to zero, since not all the harts arrived at the barrier
        if (barrier_completed[my_hart] == 0) { // send the waiting threads to a WFI state
            klessydra_lock_release(ptr_key); // release lock acquired previously
            asm ("WFI;");} // put the hart to sleep with a WFI
        else{
            klessydra_lock_release(ptr_key); // release lock acquired previously
            for (i=0; i<THREAD_POOL_SIZE; i++){
                if (my_hart != i && sync_barrier_register[i] == 1) {
                    send_sw_irq(i);}
                sync_barrier_register[i]=0; } // unregister all of the registered harts
            barrier_completed[my_hart] = 0;}}}
```
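The arrival check at the heart of the barrier can be exercised without hardware using a single-threaded model. The sketch below is only an illustration: the `barrier_state` structure and the functions `barrier_register` and `barrier_arrive` are hypothetical, the lock, the WFI, and the software interrupts are deliberately left out, and the return value merely reports which branch the real function would take.

```c
#include <assert.h>

#define THREAD_POOL_SIZE 3  /* assumption: three harts in the pool */

typedef struct {
    int registered[THREAD_POOL_SIZE]; /* models sync_barrier_register */
    int arrived[THREAD_POOL_SIZE];    /* models arrived_at_barrier    */
} barrier_state;

/* Models sync_barrier_thread_registration for one hart. */
void barrier_register(barrier_state *b, int hart) {
    assert(hart >= 0 && hart < THREAD_POOL_SIZE);
    b->arrived[hart] = 0;
    b->registered[hart] = 1;
}

/* Models the atomic section of sync_barrier.
   Returns -1 if the hart was not registered (the barrier is skipped),
            0 if other registered harts are still missing (would WFI),
            1 if this hart is the last one (would wake the others). */
int barrier_arrive(barrier_state *b, int hart) {
    assert(hart >= 0 && hart < THREAD_POOL_SIZE);
    if (!b->registered[hart]) return -1;
    b->arrived[hart] = 1;
    for (int i = 0; i < THREAD_POOL_SIZE; i++) {
        if (b->registered[i] && !b->arrived[i]) return 0;
    }
    /* Last hart: unregister everyone, as the else branch of the listing does. */
    for (int i = 0; i < THREAD_POOL_SIZE; i++) {
        b->registered[i] = 0;
    }
    return 1;
}
```

In this model only the last of the registered harts sees a return value of 1, which corresponds to the hart that sends the software interrupts in the original function.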

### 4.9. Conclusion

Throughout this chapter we studied the IMT processors, and we carried out an experimental and analytical assessment in order to determine the optimal pipeline organization to adopt. Having chosen the T03 as our optimal IMT implementation, we integrated it inside Pulpino and adjusted its support for exceptions and interrupts to be compatible with the SoC. We also added a set of libraries to Klessydra that can be used to exploit the architecture. In the next chapter we will see how the T03 IMT core can be improved further.

# Chapter 5 Klessydra-T1 Architectures

## 5.1. Background

In the previous chapter we showed how an IMT processor can be easily exploited in two classes of applications: decoupled applications, each of which runs on a dedicated hart, and balanced parallel applications, which allocate equal or nearly equal workloads to every hart so that thread synchronization incurs only a small overhead. A good example of threads running dedicated applications is an SoC in which each hart interfaces with its own peripheral device, such as an I/O device, a sensor, or a wireless device. The study in chapter 4 showcased the performance of the T03 when executing basic control or integer arithmetic applications. This chapter shows that IMT cores can be used in broader areas, in which the harts work together to run specialized applications that are readily accelerated by superscalar hardware accelerators coupled with dedicated, low-latency, energy-efficient local scratchpad memories [36][37]. Our aim is to make IMT processors perform well across a broader part of the computing application spectrum by augmenting them with specialized hardware accelerators. The T03 version that supports specialized hardware acceleration is called the T13 core.

As mentioned in the previous chapter, T03 is shorthand for T033; likewise, in this chapter T13 is shorthand for T133. The T13 is part of the Klessydra open source project [31][32][33][34], and it expands the instruction set of the T03 with two extensions: the "M" (multiply/divide) extension, which is handled in the IE block, and the custom "K" instruction set extension, which is specifically designed to facilitate vector calculations and is managed by the SPMU. The ISA supported by the T13 core is therefore RV32IMAK. The T13 core was designed to allow superscalar execution while still interleaving only three harts in the core. Its superscalar execution is achieved without the highly multi-ported registerfiles found in Out-of-Order architectures. It parallelizes execution in IMT processors while maintaining the pipeline stages and the thread pool baseline of the T03. It demonstrates how simple it is to add a hardware accelerator, and it shows how to design accelerators in order to exploit thread level parallelism. Several hardware accelerator schemes have been implemented in order to see which approach yields the best performance, area, and energy efficiency.

This chapter starts by presenting the motivation for augmenting the T03 architecture with a hardware accelerator in section 5.2. Section 5.3 then describes the microarchitecture of the accelerator. Section 5.4 shows how our accelerator can be built in different implementations. Section 5.5 then presents a set of hardware accelerator schemes in order to identify the best choice for exploiting an IMT processor, followed by a performance benchmark of the different hardware accelerator schemes from section 5.4. Section 5.6 reports the FPGA synthesis results obtained when synthesizing the T13 core with the different accelerator schemes shown in section 5.3. In section 5.7, supplementary tests are made to further evaluate the T13 hardware accelerator.

### 5.2. Motivation for augmenting the T03 core with a hardware accelerator

The IMT core presented in this chapter is called the *Klessydra-T13* (*T13* for short). The T13 block organization is shown in figure 5.1. It maintains the same hart count as its predecessor, the T03. Unlike the T03, however, the T13 introduces superscalar execution, which makes it possible to have instructions from different harts in the execute stage at the same time, as seen below.



Figure.5.1. Klessydra T133 block organization, interleaves three harts and has three execution units working in parallel

A good practice when building a superscalar processor is to let each added execution unit write into its own memory. Take figure 5.1 as an example: the Load-Store Unit (LSU) allows superscalar execution with the other units only when the instruction it is handling is a store, since stores write to the external memory and not to the registerfile. Following the same concept, we have created a hardware accelerator called the Special Purpose Mathematical Unit (SPMU), which has its own execution units and its own dedicated local Scratchpad Memories (SPMs). The SPMU has its own custom instructions that can read from the SPMs or the registerfile; however, it only writes to the scratchpads and never to the registerfile. Working in this fashion, the SPMU can operate in parallel with the other execution units, since it never performs concurrent writes to shared memories.

Following this practice, hardware accelerators can easily be added to IMT architectures to increase their capabilities in targeting a large portion of the spectrum of computing applications.

### 5.3. Special Purpose Mathematical Unit Microarchitecture

The SPMU is the hardware accelerator. It was given the name "Special Purpose" because it performs a specific subset of mathematical operations designed to accelerate the execution of Convolutional Neural Network (CNN) applications. The SPMU comprises two main subsystems, as seen in figure 5.2: the Special Purpose Engine (SPE), which maps, controls, and executes the SPMU instructions, and the Scratchpad Memory Interface (SPI), which manages the SPE and LSU accesses to the scratchpad memories (not to be confused with the SPI "serial peripheral interface"). The SPMU is better compared to a vector processor than to a packed SIMD [46][47] processor, since it operates on data sets of variable vector length, unlike SIMD instructions, which have a fixed vector length. Throughout the rest of this chapter and the next, however, the word "SIMD" will be used to refer to the nature of the execution of the instructions and not to their type. The instructions are of vector type, not SIMD.

In the T13, the length of the vector processed by each instruction is set in a custom CSR called Machine Vector Size (MVSIZE). The SPMU also resembles a vector processor in that its data type is configurable: the supported data types are 8-bit, 16-bit, and 32-bit integers.
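The relation between the vector length and the data type can be illustrated with a short C helper. It is a sketch that assumes, as the comments in the exception handler listing later in this section suggest, that MVSIZE counts bytes and that the data type field encodes 8-bit, 16-bit, and 32-bit elements as 0, 1, and 2. The function name `vector_elements` and the `MVTYPE_*` constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding of the data type field: "00", "01", "10". */
enum { MVTYPE_INT8 = 0, MVTYPE_INT16 = 1, MVTYPE_INT32 = 2 };

/* Returns the number of elements in a vector of mvsize_bytes bytes,
   or -1 if the byte count is not a multiple of the element width. */
int vector_elements(uint32_t mvsize_bytes, int mvtype) {
    assert(mvtype >= MVTYPE_INT8 && mvtype <= MVTYPE_INT32);
    uint32_t width = 1u << mvtype;            /* 1, 2 or 4 bytes per element */
    if (mvsize_bytes % width != 0) return -1; /* illegal vector size */
    return (int)(mvsize_bytes / width);
}
```

Under these assumptions a 16-byte vector holds four 32-bit elements, while a 6-byte vector is legal for 16-bit elements but not for 32-bit ones.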



Figure.5.2. SPMU Block Diagram

### 5.3.1. Special Purpose Engine

The execution of the T13 custom "K" instruction set extension is done in the SPE. The SPE is composed of several sub-systems, which handle the configuring, fetching, mapping, executing, and writing stages of an instruction. At any point in time the SPE can be in one of the following two states:

- **SPE\_INIT:** The default state of the SPE, and also the initial state for every instruction. This state handles the configuration of the functional units, the exception checks, the fetching of the first data elements, and the buffering of the signals coming from the Decode and CSR units.
- **SPE\_Exec:** The SPE moves to this state if there are no exceptions. The *SPE\_Exec* state handles the hardware loops, the mapping, the fetching of the next elements, the execution of the operations, and the writing of the results. After the results have been written successfully, the SPE returns to the SPE\_INIT state.
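The two states can be summarized as a next-state function. The C model below is only a sketch of the transitions described in the list above; the function `spe_next_state` and its three input flags are hypothetical simplifications of the actual control signals.

```c
#include <assert.h>

typedef enum { SPE_INIT, SPE_EXEC } spe_state;

/* Next-state function of the simplified two-state model.
   instr_req:       a "K" instruction is requested in this cycle
   exception:       the first-cycle checks found an exception
   results_written: the last results of the instruction were written */
spe_state spe_next_state(spe_state current, int instr_req,
                         int exception, int results_written) {
    assert(current == SPE_INIT || current == SPE_EXEC);
    if (current == SPE_INIT) {
        /* Leave the initial state only for a requested, exception-free instruction. */
        return (instr_req && !exception) ? SPE_EXEC : SPE_INIT;
    }
    /* SPE_EXEC: stay until the whole vector has been processed and written. */
    return results_written ? SPE_INIT : SPE_EXEC;
}
```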

Each of the SPMU's sub-systems will be described in the following paragraphs detailing their functions, and also showcasing VHDL snippets of how they were implemented.

**The exception handler** is part of the initialization phase. It checks for any current exceptions, and anticipates any future ones, in the very first cycle of the execution of a custom instruction from the "K" extension. All of the exceptions concern SPM accesses.

The main reason for checking for exceptions in the first cycle is that, after the first cycle, the core enables the dispatch of instructions from the other harts, which can modify the state of the registerfile. If an exception is encountered in the first cycle, the core can therefore recover the state of the processor precisely as it was before the exception occurred, without the registerfile having been modified. Detecting exceptions after the first cycle would require a history file to recover the processor's state precisely when the program counter returns from the trap handling routine, which would be an inefficient procedure given that exceptions are, by their nature, exceptional.

The following is a list of the possible exception triggers in the SPMU:

1. **Out of bound SPM access**: one of the pointers to a data element points to an address that does not belong to any of the SPMs.
2. **Dual SPM read access**: an SPM has one read port, so an exception is raised when the two instruction operands point to the same SPM.
3. **Overflow on data read and write**: this happens when the SPM pointer plus the vector size overflows the address range of the SPM being indexed. This exception only traps when the operand being indexed is used as a vector, not as a scalar.
4. **Misaligned access**: SPMs are 32-bit word aligned, and any misaligned access triggers this exception.
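The four triggers listed above can be expressed as a chain of first-cycle checks. The C sketch below covers the two source operands only and rests on several assumptions: the `spm_operand` structure, the `SPM_ADDR_WIDTH` of 10 bits, the exception names, and the order of the checks are all hypothetical. The overflow test inspects the carry bit of the last byte address, which is one way to detect a vector that runs past the end of an SPM.

```c
#include <assert.h>
#include <stdint.h>

#define SPM_ADDR_WIDTH 10    /* hypothetical: each SPM holds 2^10 bytes */
#define SPM_NONE       (-1)  /* the pointer does not fall inside any SPM */

typedef enum {
    EXC_NONE, EXC_ILLEGAL_ADDRESS, EXC_READ_SAME_SPM,
    EXC_MISALIGNED, EXC_SPM_OVERFLOW
} spm_exception;

typedef struct {
    int      spm;       /* index of the addressed SPM, or SPM_NONE */
    uint32_t offset;    /* byte offset inside that SPM             */
    int      is_vector; /* 1 if the operand is read as a vector    */
} spm_operand;

/* The carry out of the SPM address bits signals that the last byte is outside. */
static int overflows(uint32_t offset, uint32_t mvsize) {
    assert(mvsize > 0);
    uint32_t mask = (1u << SPM_ADDR_WIDTH) - 1u;
    uint32_t last = (offset & mask) + mvsize - 1u;
    return (int)((last >> SPM_ADDR_WIDTH) & 1u);
}

spm_exception check_spm_reads(spm_operand rs1, spm_operand rs2, uint32_t mvsize) {
    if (mvsize == 0) return EXC_NONE;  /* nothing to execute */
    if ((rs1.is_vector && rs1.spm == SPM_NONE) ||
        (rs2.is_vector && rs2.spm == SPM_NONE))
        return EXC_ILLEGAL_ADDRESS;    /* 1. out of bound SPM access */
    if (rs1.is_vector && rs2.is_vector && rs1.spm == rs2.spm)
        return EXC_READ_SAME_SPM;      /* 2. dual SPM read access */
    if ((rs1.is_vector && (rs1.offset & 3u)) ||
        (rs2.is_vector && (rs2.offset & 3u)))
        return EXC_MISALIGNED;         /* 4. misaligned access */
    if ((rs1.is_vector && overflows(rs1.offset, mvsize)) ||
        (rs2.is_vector && overflows(rs2.offset, mvsize)))
        return EXC_SPM_OVERFLOW;       /* 3. overflow on data read */
    return EXC_NONE;
}
```

Scalar operands are exempt from every check in this sketch, which matches the remark above that the overflow exception only traps for operands used as vectors.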

Below is the RTL description of the exception handler in the SPMU.

```
------ Exception handler of SPE Unit ------
SPE_Excpt_Cntrl_Unit_comb : process(all)
begin
...
  if spe_instr_req = '1' or busy_SPE_internal_lat = '1' then
    case state_SPE is
      when SPE_init =>
        overflow_rs1_spm <= std_logic_vector('0' & unsigned(RS1_Data_IE(Addr_Width -1 downto 0)) +
                                             unsigned(MVSIZE(harc_EXEC)) -1);
        overflow_rs2_spm <= std_logic_vector('0' & unsigned(RS2_Data_IE(Addr_Width -1 downto 0)) +
                                             unsigned(MVSIZE(harc_EXEC)) -1);
        overflow_rd_spm  <= std_logic_vector('0' & unsigned(RD_Data_IE(Addr_Width -1 downto 0)) +
                                             unsigned(MVSIZE(harc_EXEC)) -1);
        if MVSIZE(harc_EXEC) = (0 to Addr_Width => '0') then -- don't execute instructions with zero vector elements
          null;
        elsif MVSIZE(harc_EXEC)(1 downto 0) /= "00" and MVTYPE(harc_EXEC)(3 downto 2) = "10" then
          except_condition_wires := '1'; -- Set exception if the number of bytes is not divisible by four
          except_data_wire       <= ILLEGAL_VECTOR_SIZE_EXCEPT_CODE;
        elsif MVSIZE(harc_EXEC)(0) /= '0' and MVTYPE(harc_EXEC)(3 downto 2) = "01" then
          except_condition_wires := '1'; -- Set exception if the number of bytes is not divisible by two
          except_data_wire       <= ILLEGAL_VECTOR_SIZE_EXCEPT_CODE;
        elsif (rs1_to_spm = "100" and vec_read_rs1_ID = '1') or
              (rs2_to_spm = "100" and vec_read_rs2_ID = '1') or
               rd_to_spm  = "100" then
          except_condition_wires := '1'; -- Set exception for non-scratchpad access
          except_data_wire       <= ILLEGAL_ADDRESS_EXCEPT_CODE;
        elsif rs1_to_spm = rs2_to_spm and vec_read_rs1_ID = '1'
                                      and vec_read_rs2_ID = '1' then
          except_condition_wires := '1'; -- Set exception for same read access
          except_data_wire       <= READ_SAME_SCARTCHPAD_EXCEPT_CODE;
        elsif (overflow_rs1_spm(Addr_Width) = '1' and vec_read_rs1_ID = '1') or
              (overflow_rs2_spm(Addr_Width) = '1' and vec_read_rs2_ID = '1') then
          except_condition_wires := '1'; -- Set exception if reading overflows the scratchpad's address
          except_data_wire       <= SCRATCHPAD_OVERFLOW_EXCEPT_CODE;
        elsif overflow_rd_spm(Addr_Width) = '1' and vec_write_rd_ID = '1' then
          except_condition_wires := '1'; -- Set exception if writing overflows the scratchpad's address
          except_data_wire       <= SCRATCHPAD_OVERFLOW_EXCEPT_CODE;
        else -- else we process the instruction
          if halt_hart = '0' then
            nextstate_SPE <= spe_exec;
          else
            nextstate_SPE <= spe_halt_hart;
          end if;
          busy_SPE_internal_wires := '1';
        end if;
      when others =>
        null;
```

The initialization block configures the functional units (FUs) so that they execute the instruction in flight correctly. Examples of such configuration include setting the FU controls for the data type being computed on (*chars*, *shorts*, or *ints*), transforming the input operands into their two's complement, and configuring the outputs to be either sign extended or zero extended.

| -- FU initialization phase |
|---|
| -- Set signals to enable correct virtual parallelism operation |
| if (decoded_instruction_SPE(KADDV_bit_position) = '1' or |
| decoded_instruction_SPE(KSVADDSC_bit_position) = '1') and |
| MVTYPE(3 downto 2) = "10" then |
| carry_pass <= "111"; -- pass all carry_outs |
| elsif decoded_instruction_SPE(KSVADDRF_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "10" then |
| carry_pass <= "111"; -- pass all carry_outs |
| rf_rs2 <= '1'; |
| .... |
| elsif decoded_instruction_SPE(KSUBV_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "10" then |
| carry_pass <= "111"; -- pass all carry_outs |
| twos_complement <= "0001000100010001000100010001"; |
| elsif decoded_instruction_SPE(KSUBV_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "01" then |
| carry_pass <= "101"; -- pass carries 9, and 25 |
| twos_complement <= "01010101010101010101010101010101010101 |
| elsif decoded_instruction_SPE(KSUBV_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "00" then |
| carry_pass <= "000"; -- don't pass carry_outs and keep addition 8-bit |
| twos_complement <= "111111111111111111111111111111111111 |
| |
| elsif decoded_instruction_SPE(KDOTP_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "10" then |
| FUNCT_SELECT_MASK <= (others => '1'); -- This enables 32-bit multiplication with the 16-bit multipliers |
| dotp <= '1'; |
| elsif decoded_instruction_SPE(KDOTP_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "01" then |
| dotp <= '1'; |
| MVTYPE(3 downto 2) = "00" then |
| dotpps <= '1'; |
| elsif decoded_instruction_SPE(KSVMULRF_bit_position) = '1' and |
| MVTYPE(3 downto 2) = "10" then |
| FUNCT_SELECT_MASK <= (others => '1'); |
| rf_rs2 <= '1'; |
| elsif (decoded_instruction_SPE(KVMUL_bit_position) = '1' or |
| decoded_instruction_SPE(KSVMULSC_bit_position) = '1') and |
| MVTYPE(3 downto 2) = "10" then |
| FUNCT_SELECT_MASK <= (others => '1'); |
| end if; |
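The effect of the `carry_pass` setting can be illustrated with a behavioural model of a 32-bit adder built from four 8-bit lanes. This is a sketch and not the RTL: the function `simd_add` is hypothetical, and it assumes that bit *i* of `carry_pass` controls whether the carry out of byte lane *i* propagates into lane *i+1*, so that the values 7, 5, and 0 correspond to the settings "111", "101", and "000" in the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two 32-bit words as one 32-bit, two 16-bit or four 8-bit values,
   depending on which inter-lane carries are allowed to pass. */
uint32_t simd_add(uint32_t a, uint32_t b, unsigned carry_pass) {
    assert(carry_pass <= 0x7u);  /* three inter-lane carries */
    uint32_t result = 0;
    unsigned carry = 0;
    for (int lane = 0; lane < 4; lane++) {
        unsigned x = (a >> (8 * lane)) & 0xFFu;
        unsigned y = (b >> (8 * lane)) & 0xFFu;
        unsigned sum = x + y + carry;
        result |= (uint32_t)(sum & 0xFFu) << (8 * lane);
        /* Forward the carry only if the corresponding pass bit is set. */
        carry = (lane < 3 && ((carry_pass >> lane) & 1u)) ? (sum >> 8) : 0u;
    }
    return result;
}
```

For example, adding 1 to `0x0000FFFF` gives `0x00010000` with all carries passed, `0x00000000` when the word is treated as two 16-bit values, and `0x0000FF00` when it is treated as four independent bytes.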

In the execute state of the SPE, the hardware-controlled loops, or **hardware loops** (hw-loops) for short, eliminate the overhead of software looping. They increment the source operand pointers to fetch the next elements of each operand, but only when the operand is defined as a vector source and not as a scalar source. The same applies to the writing of the results. The hw-loops also decrement the vector length continuously; when the remaining vector size reaches zero, the hw-loops stop and the instruction is considered done. A masking vector is created from the number of elements left, such that if fewer bytes remain than are processed in one cycle, the mask disables the upper bytes of the fetched elements. This is essential when the fetched elements are accumulated, because data that does not belong to the instruction must not be accumulated if the result is to be correct.
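The bookkeeping performed by the hw-loops, including the mask applied to the last partial fetch, can be modelled in a few lines of C. The sketch below accumulates the bytes of a vector and is only an illustration: the functions `byte_mask` and `hw_loop_accumulate` are hypothetical, and `SIMD_RD_BYTES` is assumed to be 4 bytes per cycle.

```c
#include <assert.h>
#include <stdint.h>

#define SIMD_RD_BYTES 4u  /* assumption: four bytes are processed per cycle */

/* Byte-enable mask for a fetch when `remaining` bytes of the vector are left. */
unsigned byte_mask(uint32_t remaining) {
    if (remaining >= SIMD_RD_BYTES) return (1u << SIMD_RD_BYTES) - 1u;
    return (1u << remaining) - 1u;  /* disable the upper bytes */
}

/* Sums the first mvsize bytes of spm, SIMD_RD_BYTES bytes per iteration,
   and reports how many iterations (modelled cycles) the loop needed. */
uint32_t hw_loop_accumulate(const uint8_t *spm, uint32_t mvsize,
                            unsigned *iterations) {
    assert(spm && iterations);
    uint32_t acc = 0;
    uint32_t addr = 0;
    *iterations = 0;
    while (mvsize > 0) {
        unsigned mask = byte_mask(mvsize);
        for (unsigned lane = 0; lane < SIMD_RD_BYTES; lane++) {
            if ((mask >> lane) & 1u) acc += spm[addr + lane];  /* masked lanes are skipped */
        }
        addr += SIMD_RD_BYTES;  /* pointer increment */
        mvsize = (mvsize >= SIMD_RD_BYTES) ? mvsize - SIMD_RD_BYTES : 0;
        (*iterations)++;
    }
    return acc;
}
```

With a 6-byte vector the loop runs twice, and the second iteration uses the mask `0x3`, so the two bytes that follow the vector in memory do not contaminate the accumulated result.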



| if halt_spe = '0' then -- the hardware loops work only when there is no halt from the SPI |
|-------------------------------------------------------------------------------------------|
| -- Increment the write address when we have a result as a vector |
| if vec_write_rd_lat = '1' and wb_ready = '1' then -- destination address increment |
| RD_Data_IE_lat <= std_logic_vector(unsigned(RD_Data_IE_lat) + SIMD_RD_BYTES); |
| end if; |
| if wb_ready = '1' then -- decrement by SIMD_BYTE Execution Capability |
| if to_integer(unsigned(MVSIZE_WRITE)) >= SIMD_RD_BYTES then |
| MVSIZE_WRITE <= std_logic_vector(unsigned(MVSIZE_WRITE) - SIMD_RD_BYTES); |
| else -- decrement the remaining bytes |
| MVSIZE_WRITE <= (others => '0'); |
| end if; |
| end if; |
| -- Increment the read addresses |
| if to_integer(unsigned(MVSIZE_READ)) >= SIMD_RD_BYTES and data_gnt_i = '1' then |
| if vec_read_rs1_lat = '1' then -- source 1 address increment |
| RS1_Data_IE_lat <= std_logic_vector(unsigned(RS1_Data_IE_lat) + SIMD_RD_BYTES); |
| end if; |
| if vec_read_rs2_lat = '1' then -- source 2 address increment |
| RS2_Data_IE_lat <= std_logic_vector(unsigned(RS2_Data_IE_lat) + SIMD_RD_BYTES); |
| end if; |
| end if; |
| -- Decrement the vector elements that have already been operated on |
| if data_gnt_i = '1' then -- decrement by SIMD_BYTE Execution Capability |
| if to_integer(unsigned(MVSIZE_READ)) >= SIMD_RD_BYTES then |
| MVSIZE_READ <= std_logic_vector(unsigned(MVSIZE_READ) - SIMD_RD_BYTES); |
| else -- decrement the remaining bytes |
| MVSIZE_READ <= (others => '0'); |
| end if; |
| end if; |
| spm_data_read_mask <= (others => '0'); |
| if data_gnt_i_lat = '1' then |
| if to_integer(unsigned(MVSIZE_READ_MASK)) >= SIMD_RD_BYTES then |
| spi_data_read_mask <= (others => '1'); |
| MVSIZE_READ_MASK <= std_logic_vector(unsigned(MVSIZE_READ_MASK) - SIMD_RD_BYTES); |
| else |
| MVSIZE_READ_MASK <= (others => '0'); |
| spi_data_read_mask(to_integer(unsigned(MVSIZE_READ_MASK))*8-1 downto 0) <= (others => '1'); |
| end if; |
| end if; |
| end if; |
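
The masking behaviour of the hw-loops can be illustrated with a small software model. The sketch below is not part of the RTL: it is a Python approximation that assumes 4 bytes are fetched per cycle (`SIMD_RD_BYTES = 4`, as for SIMD = 1) and uses one enable bit per byte instead of the bit-level mask of the hardware. The function names `read_masks` and `masked_sum` are hypothetical and chosen only for this illustration.

```python
SIMD_RD_BYTES = 4  # assumption: bytes fetched per cycle when SIMD = 1


def read_masks(mvsize_bytes, simd_rd_bytes=SIMD_RD_BYTES):
    """Yield one byte-enable mask per fetch cycle for a vector of mvsize_bytes.

    Bit i of a mask enables byte i of the fetched word. The last, partial
    fetch gets a mask that disables the upper bytes.
    """
    remaining = mvsize_bytes
    full_mask = (1 << simd_rd_bytes) - 1
    while remaining > 0:
        if remaining >= simd_rd_bytes:
            yield full_mask
            remaining -= simd_rd_bytes
        else:
            yield (1 << remaining) - 1  # keep only the low `remaining` bytes
            remaining = 0


def masked_sum(data, simd_rd_bytes=SIMD_RD_BYTES, pad=0xFF):
    """Accumulate the bytes of `data`, fetching simd_rd_bytes per cycle.

    The last word is padded with `pad` bytes to model unrelated data that
    sits next to the vector in the scratchpad memory.
    """
    padded = bytes(data) + bytes([pad]) * (-len(data) % simd_rd_bytes)
    total = 0
    for cycle, mask in enumerate(read_masks(len(data), simd_rd_bytes)):
        word = padded[cycle * simd_rd_bytes:(cycle + 1) * simd_rd_bytes]
        total += sum(b for i, b in enumerate(word) if (mask >> i) & 1)
    return total
```

For a 10-byte vector the model produces two full masks followed by a mask that enables only the two lowest bytes, so the padding bytes never reach the accumulated sum.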

The fetched input operands go into the **mapping unit**, which routes the fetched data to the corresponding functional units. Some instructions use multiple functional units, in which case the outputs of the first functional unit are re-routed to the next one. The operands can be either scalar or vector, and they can be fetched from the SPMs or the register file. The final outputs of the functional units return to the mapping unit, from which they are written back to the SPMs. Below is a brief snippet from the RTL of the input operand mapper; the output mapping uses similar assignments in the reverse direction.

| -- OPERAND MAPPING |
|--------------------|
| if (decoded_instruction_SPE_lat(KDOTP_bit_position) = '1' or -- dot product instruction |
| decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1') and -- dot product instruction with post scaling |
| (MVTYPE_SPE = "01" or MVTYPE_SPE = "10") then |
| mul_operands(0) <= spi_data_read(0) and spi_data_read_mask; |
| mul_operands(1) <= spi_data_read(1) and spi_data_read_mask; |
| if dotp = '1' then |
| accum_operands <= out_mul_results; |
| elsif dotpps = '1' then |
| shift_amount <= MPSCLFAC_SPE; |
| shifter_operand <= out_mul_results; |
| accum_operands <= out_shifter_results; |
| end if; |
| end if; |
|  |
| if decoded_instruction_SPE_lat(KADDV_bit_position) = '1' then -- vector-vector add instruction |
| adder_operands(0) <= spi_data_read(0); |
| adder_operands(1) <= spi_data_read(1); |
| end if; |
|  |
| if decoded_instruction_SPE_lat(KSVADDSC_bit_position) = '1' and -- vector-scalar add instruction |
| MVTYPE_SPE = "10" then |
| adder_operands(0) <= spi_data_read(0); |
| for i in 0 to SIMD-1 loop |
| adder_operands(1)(31+32*(i) downto 32*(i)) <= spi_data_read(1)(31 downto 0); |
| end loop; |
| end if; |
|  |
| if decoded_instruction_SPE_lat(KSRAV_bit_position) = '1' or -- right arithmetic shift instruction |
| decoded_instruction_SPE_lat(KSRLV_bit_position) = '1' then -- right logic shift instruction |
| shifter_operand <= spi_data_read(0); |
| shift_amount <= RS2_Data_IE_lat(4 downto 0); -- map the scalar value (shift amount) |
| end if; |
|  |
| if decoded_instruction_SPE_lat(KRELU_bit_position) = '1' then -- relu instruction |
| relu_operands <= spi_data_read(0); |
| end if; |
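
The scalar broadcast performed by the mapper for the vector-scalar addition can be sketched in software as follows. This is a simplified Python model under stated assumptions, not the RTL: words are represented as plain integers, each SIMD lane is 32 bits wide, and the function names `broadcast_scalar` and `map_vector_scalar_add` are hypothetical.

```python
def broadcast_scalar(scalar, simd):
    """Replicate a 32-bit scalar into every 32-bit lane of a SIMD-wide word."""
    scalar &= 0xFFFFFFFF  # keep only the low 32 bits, as the mapper does
    word = 0
    for i in range(simd):
        word |= scalar << (32 * i)  # lane i occupies bits 32*i+31 downto 32*i
    return word


def map_vector_scalar_add(vector_word, scalar, simd):
    """Return the two adder operands for a vector-scalar add.

    Operand 0 is the fetched vector word unchanged; operand 1 is the scalar
    copied into every lane so that the adder sees two equally wide words.
    """
    return vector_word, broadcast_scalar(scalar, simd)
```

With `simd = 2`, the scalar `0x12345678` becomes the 64-bit word `0x1234567812345678`, which mirrors the `for i in 0 to SIMD-1` loop of the mapper.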

The **control unit** issues the requests to fetch the input operands and to write the output results. It also halts the vector processor when the source SPMs are being accessed by the load-store unit. When the SPE receives a halt signal, all the data in the pipeline stages hold their state, and the hardware loops stop counting until the accessed SPM becomes free. The control for the vector addition and subtraction instructions (KADDV, KSUBV) and for the accumulating instructions (KVRED, KDOTP, KDOTPPS) is shown below. Other instructions have similar control.

| if decoded_instruction_SPE_lat(KADDV_bit_position) = '1' or -- control for KADDV and KSUBV instructions |
|----------------------------------------------------------------------------------------------------------|
| decoded_instruction_SPE_lat(KSUBV_bit_position) = '1' then |
| if adder_stage_3_en = '1' then |
| wb_ready <= '1'; -- the results of the final stage are ready to be written back |
| elsif recover_state = '1' then |
| wb_ready <= '1'; -- latch the writeback ready signal for as soon as the write is granted |
| end if; |
| if MVSIZE_READ > (0 to Addr_Width => '0') then -- keep on reading until all the data has been fetched |
| spe_to_spm(to_integer(unsigned(rs1_to_spi_lat)))(0) <= '1'; -- assign vs1 to the first SPI read port |
| spe_to_spm(to_integer(unsigned(rs2_to_spi_lat)))(1) <= '1'; -- assign vs2 to the second SPI read port |
| spi_req(to_integer(unsigned(rs1_to_spi_lat))) <= '1'; -- request vs1 |
| spi_req(to_integer(unsigned(rs2_to_spi_lat))) <= '1'; -- request vs2 |
| spi_read_addr(0) <= RS1_Data_IE_lat(Addr_Width - 1 downto 0); -- send the address of vs1 |
| spi_read_addr(1) <= RS2_Data_IE_lat(Addr_Width - 1 downto 0); -- send the address of vs2 |
| end if; |
| if MVSIZE_WRITE > (0 to Addr_Width => '0') then |
| nextstate_SPE <= spe_exec; -- latch the execute state of the SPE |
| busy_SPE_internal_wires := '1'; -- the SPE is considered busy until all the outputs are written |
| end if; |
| if wb_ready = '1' then -- first batch of the vector results becomes ready |
| spi_we(to_integer(unsigned(rd_to_spi_lat))) <= '1'; -- enable the writeback |
| spi_write_addr <= RD_Data_IE_lat; -- send the write address which is incremented by the hw loops |
| end if; |
| end if; |
|  |
| if decoded_instruction_SPE_lat(KVRED_bit_position) = '1' or -- control of the instructions that use the accumulator |
| decoded_instruction_SPE_lat(KDOTP_bit_position) = '1' or |
| decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1' then |
| if accum_stage_3_en = '1' then |
| wb_ready <= '1'; |
| elsif recover_state = '1' then |
| wb_ready <= '1'; |
| end if; |
| if MVSIZE_READ > (0 to Addr_Width => '0') then -- keep on reading until all the data has been fetched |
| if vec_read_rs2_SPE = '1' then |
| spi_req(to_integer(unsigned(rs2_to_spi_lat))) <= '1'; -- request vs2 |
| spe_to_spi(to_integer(unsigned(rs2_to_spi_lat)))(1) <= '1'; -- assign vs2 to the second SPI read port |
| spi_read_addr(1) <= RS2_Data_IE_lat(Addr_Width - 1 downto 0); -- send the address of vs2 |
| end if; |
| spi_req(to_integer(unsigned(rs1_to_spi_lat))) <= '1'; -- request vs1 |
| spe_to_spm(to_integer(unsigned(rs1_to_spi_lat)))(0) <= '1'; -- assign vs1 to the first SPI read port |
| spi_read_addr(0) <= RS1_Data_IE_lat(Addr_Width - 1 downto 0); -- send the address of vs1 |
| nextstate_SPE <= spe_exec; |
| busy_SPE_internal_wires := '1'; |
| elsif MVSIZE_WRITE = (0 to Addr_Width => '0') then |
| nextstate_SPE <= spe_init; -- return to the init state when the accumulation is done |
| else |
| nextstate_SPE <= spe_exec; -- latch the execute state until all the elements have accumulated |
| busy_SPE_internal_wires := '1'; -- the SPE is considered busy until all the values have been accumulated |
| end if; |
| if wb_ready = '1' then -- final scalar result is ready |
| spi_we(to_integer(unsigned(rd_to_spi_lat))) <= '1'; -- enable the writeback |
| spi_write_addr <= RD_Data_IE_lat; -- send the write address of the scalar value |
| end if; |
| end if; |

The SPE has five different functional units (FUs). All the units work with different data types (8-bit, 16-bit, and 32-bit), both signed and unsigned. Three of the FUs, namely the adder, the shifter, and the multiplier, work in partial mode. The partial FUs increase the parallelism for narrower data elements while maintaining a small area occupation. Table 5.1 shows how many operations each FU performs in one cycle for each data type; the factor in front of SIMD is the number of operations per cycle when the SIMD parameter is configured to be 1. Each doubling of the SIMD parameter doubles the parallelism of all the functional units.

| Functional Unit | FU Type | Data Width (bits) | Parallelism (operations per cycle) |
|-----------------|---------|-------------------|------------------------------------|
| Adder       | Partial | 32        | 1*SIMD      |
|             |         | 16        | 2*SIMD      |
|             |         | 8         | 4*SIMD      |
| Shifter     | Partial | 32        | 1*SIMD      |
|             |         | 16        | 2*SIMD      |
|             |         | 8         | 4*SIMD      |
| Multiplier  | Partial | 32        | 1*SIMD      |
|             |         | 16        | 2*SIMD      |
|             |         | 8         | 2*SIMD      |
| Accumulator | Normal  | 32        | 1*SIMD      |
|             |         | 16        | 2*SIMD      |
|             |         | 8         | 2*SIMD      |
| ReLU        | Normal  | 32        | 1*SIMD      |
|             |         | 16        | 2*SIMD      |
|             |         | 8         | 4*SIMD      |

Table 5.1. Type and parallelism of the functional units in the SPE
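
As a worked example of how the table is read, the following Python sketch transcribes the per-unit factors and scales them by the SIMD parameter. The dictionary `OPS_PER_LANE` and the functions `ops_per_cycle` and `ideal_cycles` are hypothetical names introduced for this illustration; `ideal_cycles` assumes ideal throughput and ignores pipeline latency and memory stalls, so it is only a lower bound.

```python
# Operations per cycle for SIMD = 1, transcribed from the table above.
OPS_PER_LANE = {
    "adder":       {32: 1, 16: 2, 8: 4},
    "shifter":     {32: 1, 16: 2, 8: 4},
    "multiplier":  {32: 1, 16: 2, 8: 2},
    "accumulator": {32: 1, 16: 2, 8: 2},
    "relu":        {32: 1, 16: 2, 8: 4},
}


def ops_per_cycle(unit, width_bits, simd):
    """Number of element operations a functional unit completes per cycle."""
    return OPS_PER_LANE[unit][width_bits] * simd


def ideal_cycles(unit, width_bits, simd, n_elements):
    """Lower bound on the cycles needed to process n_elements.

    Assumes one batch per cycle with no stalls (ceiling division).
    """
    per_cycle = ops_per_cycle(unit, width_bits, simd)
    return -(-n_elements // per_cycle)
```

For instance, with SIMD = 4 the adder performs 16 additions per cycle on 8-bit data, while the multiplier performs 8 multiplications per cycle on the same data type.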

Figure 5.3 shows the **partial adder**, which consists of four 8-bit adders cascaded together. To produce 8-bit sums, the initialization block configures the adders to block the carries propagated between the partial sums, giving four 8-bit sums as outputs. For 16-bit additions, only the first and the third adders are allowed to propagate their carries, giving two 16-bit outputs. For 32-bit sums, all the carries are allowed to propagate, giving one 32-bit output. As seen in Figure 5.3, the adders are split into two pipeline stages: the carry from the lower 16 bits reaches the upper 16 bits through a register rather than a wire.



Figure 5.3. Partial adder circuit for SIMD = 4
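
The carry-blocking principle of the partial adder can be checked with a short behavioral model. The Python sketch below is an illustration under simplifying assumptions, not the RTL: it models a single 32-bit lane, ignores the two pipeline stages and the subtraction input, and the function name `partial_add` is hypothetical.

```python
def partial_add(a, b, width):
    """Add two 32-bit words as packed elements of `width` bits (8, 16 or 32).

    Four 8-bit adders are chained. A carry crosses a byte boundary only when
    both bytes belong to the same element; otherwise it is blocked.
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    result = 0
    carry = 0
    for byte in range(4):
        if (byte * 8) % width == 0:
            carry = 0  # element boundary: block the incoming carry
        s = ((a >> (8 * byte)) & 0xFF) + ((b >> (8 * byte)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * byte)
        carry = s >> 8  # carry out of this 8-bit adder
    return result
```

Adding `0x00FF00FF` and `0x00010001` gives `0x00000000` in 8-bit mode, because every carry is blocked, and `0x01000100` in 16-bit mode, because the first and the third adders pass their carries to the second and the fourth.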

The RTL describing the behavior of the SIMD pipelined partial adders is shown below.

| 1 | for i in 0 to SIMD-1 loop |
|---|---------------------------|
| 2 | if (adder_stage_1_en = '1' or recover_state_wires = '1') then |
| 3 | add_8_0_wire(i) <= std_logic_vector('0' & unsigned(adder_ops(0)(7+8*(4*i) downto 8*(4*i))) + |
| 4 | unsigned(adder_ops(1)(7+8*(4*i) downto 8*(4*i))) + |
| 5 | twos_complement(0+(4*i))); |
| 6                                    | add_16_8_wire(i) <= std_logic_vector('0' & unsigned(adder_ops(0)(15+8*(4*i) downto 8+8*(4*i))) +                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| 7                                    | $udd_10_0 whe(i) < sta_iogie_vector(0 & unsigned(adder_ops(0)(15+6 (4+i) downto 5+6 (4+i))) + unsigned(adder_ops(1)(15+8*(4*i) downto 8+8*(4*i))) + (15+8*(4*i) downto 8+8*(4*i))) + (15+8*(4*i) downto 8+8*(4*i)) + (15+8*(4*i) downto 8+8*(4*i))) + (15+8*(4*i) downto 8+8*(4*i)))) + (15+8*(4*i) downto 8+8*(4*i))))) + (15+8*(4*i) downto 8+8*(4*i)))))))))))))))))))))))))))))))))))$ |
| 2                                    | carry 8 wire(i) +                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| י<br>ר                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| )                                    | twos_complement(1+(4*i)));<br>Carries are either passed or blocked for the 9-th, 17-th, and 25-th bits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|                                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 1                                    | $\operatorname{carry}_8 \operatorname{wire}(i) \le \operatorname{add}_8 \operatorname{o}_{\operatorname{wire}(i)(8)}$ and $\operatorname{carry}_pass(0)$ ; <i>carry_pass is configured in the init stage</i>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| 2                                    | $carry_{16}_{wire(i)} \le add_{16}_{8}_{wire(i)(8)}$ and $carry_{pass}(1)$ ; <i>carry_pass is configured in the init stage</i>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| 3 _                                  | end if;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|  | for i in 0 to SIMD-1 loop -- index 'i' is for the SIMD depth of the SPMU |
|  | if (adder_stage_2_en = '1' or recover_state_wires = '1') then |
|  | add_24_16_wire(i) <= std_logic_vector('0' & unsigned(adder_ops_lat(0)(7+8*(2*i) downto 8*(2*i))) + |
|  | unsigned(adder_ops_lat(1)(7+8*(2*i) downto 8*(2*i))) + |
|  | carry_16(i) + twos_complement(2+(4*i))); |
|  | add_32_24_wire(i) <= std_logic_vector('0' & unsigned(adder_ops_lat(0)(15+8*(2*i) downto 8+8*(2*i))) + |
|  | unsigned(adder_ops_lat(1)(15+8*(2*i) downto 8+8*(2*i))) + |
|  | carry_24_wire(i) + twos_complement(3+(4*i))); |
|  | -- All the 8-bit adders are lumped into one output write signal that will write to the scratchpads |
|  | -- Carries are either passed or blocked for the 9-th, 17-th, and 25-th bits |
|  | carry_24_wire(i) <= add_24_16_wire(i)(8) and carry_pass(2); -- carry_pass is configured in the init stage |
|  | end if; |
|  | end loop; |
|                                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| -                                    | if add en = '1' and halt spe lat = '0' then                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                                      | carry_16 <= carry_16_wire; latch the wires                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|                                      | add $\overline{8}$ 0 <= add $8$ 0 wire;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      | add $16^{\circ} 8 \le add 16^{\circ} 8$ wire;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                                      | for i in 0 to SIMD-1 loop – index 'i' is for the SIMD depth of the SPMU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      | if (adder stage 2 en = '1' or recover state wires = '1') then                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                                      | All the 8-bit adders are lumped into one output signal that will write to the scratchpads                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                      | out_adder_results $(31+32^*(i) \text{ down to } 32^*(i)) \le \text{add}_{32}_{24} \text{ wire}(i)(7 \text{ down to } 0) \& form the output result}$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|                                      | $add 24 \ 16 \ \text{wire(i)}(7 \ \text{downto } 0) \&$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      | add 16 8(i)(7 downto 0) &                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                                      | $add_8_0(i)(7 \text{ downto } 0);$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                                      | end if;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      | end loop;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                      | end if;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      | for i in 0 to SIMD-1 loop – index 'i' is for the SIMD depth of the SPMU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|                                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                                      | $adder_ops\_lat(j)(15 + 16*(i) \text{ downto } 16*(i)) \le adder_ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(31 + 32*(i) \text{ downto } 16 + 32*(i)); latch the ops(f)(j)(j)(j)(j)(j)(j)(j)(j)(j)(j)(j)(j)(j)$                                                                                                                                                                                                                                                                                                                                        |
|                                      | end loop;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                      | end loop;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| 5<br>6<br>7<br>8                     | for j in 0 to 1 loop index 'j' loops through the upper two 8-bit adders<br>adder_ops_lat(j)(15+16*(i) downto 16*(i)) <= adder_ops(f)(j)(31+32*(i) downto 16+32*(i)); latch the op                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
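
The listing above builds each 32-bit result from four 8-bit adder outputs. As a rough software illustration of that idea, the sketch below chains four 8-bit additions and lets the carry cross a byte boundary only when both bytes belong to the same data element. It is a behavioural model in Python, not a transcription of the VHDL: the function name `partial_add`, the `width` parameter, and the exact carry-gating rule are assumptions made for illustration.

```python
def partial_add(a, b, width):
    """Add two 32-bit words as packed 8-, 16- or 32-bit elements.

    Hypothetical model: four 8-bit adders whose carry is forwarded
    only inside an element of the selected width.
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    result = 0
    carry = 0
    for byte in range(4):
        if (byte * 8) % width == 0:
            carry = 0  # element boundary: block the incoming carry
        s = ((a >> (8 * byte)) & 0xFF) + ((b >> (8 * byte)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * byte)  # keep the 8-bit sum of this adder
        carry = s >> 8                      # carry out towards the next adder
    return result
```

For example, adding `0x00FFFFFF` and `0x00000001` gives a different word for each data width, because the carry stops at a different boundary in each case.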

For the 32-bit **multiplier** the partial multiplication structure is based on four 16-bit multipliers, according to the following implementation:

 $A_{31-0} * B_{31-0} = [(A_{31-16} << 16) + A_{15-0}] * [(B_{31-16} << 16) + B_{15-0}]$ 

This method can produce two 8-bit or two 16-bit MULs per cycle, or one 32-bit MUL per cycle. The multiplier circuit is shown in figure 5.4. If the data type is set to 8-bit or 16-bit, the middle multiplications (AL\*BH and AH\*BL) are masked with zeros, so that the partial multiplications are not accumulated into a single 32-bit output. The actual multiplier does not use shifters to produce the 16-bit offset of zeros; instead, it simply concatenates a 16-bit zero vector to the upper portions of the partial multiplications.
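Expanding the product above, with $A_H, A_L$ and $B_H, B_L$ denoting the upper and lower 16-bit halves, gives $A \times B = (A_H B_H \ll 32) + ((A_H B_L + A_L B_H) \ll 16) + A_L B_L$. The following Python sketch models that decomposition for a single 32-bit lane. It is a behavioural illustration only: the function names are hypothetical, operands are treated as unsigned, and the way the narrow-mode products are truncated and packed into the output word is deliberately left out because it is not specified here.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def partial_products(a, b):
    """Return the four 16x16 partial products (ll, lh, hl, hh) of two 32-bit words."""
    a_l, a_h = a & MASK16, (a >> 16) & MASK16
    b_l, b_h = b & MASK16, (b >> 16) & MASK16
    return a_l * b_l, a_l * b_h, a_h * b_l, a_h * b_h


def mul32_low(a, b):
    """32-bit mode: accumulate the partial products and keep the lower 32 bits."""
    ll, lh, hl, _hh = partial_products(a, b)
    # hh is offset by 32 bits, so it only affects the discarded upper half.
    return (ll + ((lh + hl) << 16)) & MASK32


def mul16_pair(a, b):
    """Narrow mode: the middle products are masked, leaving two independent products."""
    ll, _lh, _hl, hh = partial_products(a, b)
    # Assumption: results are returned unpacked as (upper-half product, lower-half product).
    return hh, ll
```

In 32-bit mode the model agrees with an ordinary multiplication truncated to 32 bits, while in the narrow mode the two outer products never interact, which is the effect that masking AL\*BH and AH\*BL is meant to achieve.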

This operation was not divided further into 8-bit multipliers because one DSP slice [45] is used in the FPGA whether it performs an 8-bit or a 16-bit multiplication. Consequently, our current multiplier implementation gives only a twofold speed-up for 8-bit data, rather than the fourfold speed-up obtained with the partial adders. Note also that the upper 32 bits of the multiplier output are ignored, so no 'MULH' operation is emulated, because it is not required in our applications.



Figure.5.4. Partial Multiplier Circuit in SIMD=4

| Synchronous Partial Multiplication Stage 1 |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| if halt_spe_lat = '0' then $-$ index 'i' is for the SIMD depth of the SPMU                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| if mul_en = '1' and (mul_stage_1_en = '1' or recover_state_wires = '1') then                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| for i in 0 to SIMD-1 loop                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| $mul_a(31+32^*(i) \text{ downto } 32^*(i)) \ll std_logic_vector(unsigned(mul_ops(0)(15+16^*(2^*i+1) \text{ downto } 16^*(2^*i+1))))) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^*(2^*i+1)) \approx 10^{-10} \text{ downto } 16^*(2^*i+1) \text{ downto } 16^$                         |
| unsigned(mul_ops(1)(15+16*(2*i+1) downto 6*(2*i+1))));                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| $mul_b(31+32^*(i) \text{ downto } 32^*(i)) \le std_logic_vector((unsigned(mul_ops(0)(16^*(2^*i+1) - 1 \text{ downto } 16^*(2^*i)))^*)) \le std_logic_vector((unsigned(mul_ops(0)(16^*(2^*i+1) - 1 \text{ downto } 16^*(2^*i)))^*)) \le std_logic_vector((unsigned(mul_ops(0)(16^*(2^*i+1) - 1 \text{ downto } 16^*(2^*i)))^*)) \le std_logic_vector((unsigned(mul_ops(0)(16^*(2^*i+1) - 1 \text{ downto } 16^*(2^*i))))) \le std_logic_vector((unsigned(mul_ops(0)(16^*(2^*i+1) - 1 \text{ downto } 16^*(2^*i))))))$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| unsigned(mul_ops(1)(15+16*(2*i+1) downto $16*(2*i+1))))$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| and unsigned(FUNCT_SELECT_MASK));                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| $mul_c(31+32^*(i) \text{ downto } 32^*(i)) \ll std_logic_vector((unsigned(mul_ops(0)(15+16^*(2^*i+1) \text{ downto } 16^*(2^*i+1))))) \approx 10^{-10} \text{ downto } 10^{-10}$ |
| unsigned(mul_ops(1)(16*(2*i+1) - 1 downto 16*(2*i))))                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| and unsigned(FUNCT_SELECT_MASK));                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| $mul_d(31+32^*(i) \text{ downto } 32^*(i)) \le std_logic_vector(unsigned(mul_ops(0)(16^*(2^*i+1) - 1 \text{ downto } 16^*(2^*i)))^*)$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| unsigned(mul_ops(1)( $16*(2*i+1) - 1$ downto $16*(2*i)$ ));                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                             |

| Synchronous Partial Multiplication Stage 2 |
|---|
| `if … (… or recover_state_wires = '1') and halt_spe_lat = '0' then` |
| `  for i in 0 to SIMD-1 loop` |
| `    out_mul_results((Data_Width-1)+Data_Width*(i) downto Data_Width*(i)) <=` |
| `      (std_logic_vector(unsigned(mul_tmp_a(i)) +` |
| `                        unsigned(mul_tmp_b(i)) +` |
| `                        unsigned(mul_tmp_c(i)) +` |
| `                        unsigned(mul_tmp_d(i))));` |
| `  end loop;` |
| `end if;` |

| Combinational Partial Multiplication |
|---|
| `if mul_en = '1' and (mul_stage_2_en = '1' or recover_state_wires = '1') then` |
| `  for i in 0 to SIMD-1 loop` |
| `    if MVTYPE_SPE /= "10" then` |
| `      mul_tmp_a(i) <= (mul_a(15+16*(2*i) downto 16*(2*i)) & x"0000");` |
| `      mul_tmp_d(i) <= (x"0000" & mul_d(15+16*(2*i) downto 16*(2*i)));` |
| `    elsif MVTYPE_SPE = "10" then` |
| `      mul_tmp_b(i) <= (mul_b(15+16*(2*i) downto 16*(2*i)) & x"0000"); -- (Ah*Bl)` |
| `      mul_tmp_c(i) <= (mul_c(15+16*(2*i) downto 16*(2*i)) & x"0000"); -- (Al*Bh)` |
| `      mul_tmp_d(i) <= (mul_d(31+32*(i) downto 32*(i))); -- (Al*Bl)` |
| `      -- The upper 32-bit results of the multiplication are discarded in the SPMU (Ah*Bh)` |
| `    end if;` |
| `  end loop;` |
| `end if;` |

The **partial right shifter** in the SPE works in the opposite manner (Figure 5.5). A single 32-bit logical right shifter shifts the input operand and produces one 32-bit shifted output. If the data width is 16 bits, the initial configuration sets up the masking of the bits that slide from one data element into the other. Execution proceeds as follows: the two 16-bit elements enter the right shifter together, and the shifter output is sent to the next stage. There, the lower bits of the **upper** 16-bit element that slid into the upper bits of the **lower** 16-bit element are masked with zeros if the shift is logical, or sign-extended if the shift is arithmetic. A similar approach is applied to 8-bit data types.
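The two-stage behaviour of the partial right shifter can be illustrated with a small software model. This is only a sketch in Python, not the VHDL of the design: the function name `partial_shift_right` and its parameters `width` and `arithmetic` are hypothetical, and the model assumes that the sign bit of each element is latched before the shift.

```python
def partial_shift_right(word: int, amount: int, width: int,
                        arithmetic: bool = False) -> int:
    """Shift each `width`-bit element packed in a 32-bit word right by `amount`.

    Hypothetical behavioral model: one 32-bit shift, then per-element masking.
    """
    if width not in (8, 16, 32) or not 0 <= amount < width:
        raise ValueError("unsupported width or shift amount")
    word &= 0xFFFFFFFF

    # Stage 1: shift the packed word as if it were a single 32-bit value.
    shifted = word >> amount

    # Stage 2: in every element, drop the bits that slid in from the
    # neighbouring upper element, then fill them with zeros or the sign.
    keep = (1 << (width - amount)) - 1      # bits that still belong to the element
    fill = ((1 << width) - 1) ^ keep        # the upper `amount` bits of the element
    result = 0
    for lane in range(32 // width):
        base = lane * width
        elem = (shifted >> base) & keep     # logical shift: masked bits become zero
        sign = (word >> (base + width - 1)) & 1  # sign bit of the unshifted element
        if arithmetic and sign:
            elem |= fill                    # arithmetic shift: sign extension
        result |= elem << base
    return result
```

For instance, packing the 16-bit values `0x8004` and `0x0010` into `0x80040010` and shifting by 4 gives `0x08004001` with a plain 32-bit shift, where the lower element is contaminated by the upper one. The model returns `0x08000001` for a logical shift and `0xF8000001` for an arithmetic shift.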

The SPMU does not include a left shifter; instead, the partial multipliers can be used for left shifting. The right shifter was implemented for pre-scaling and post-scaling of the input and output data used in convolutions.
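To make the left-shifting remark concrete, the sketch below models in Python how the lower 32 bits of a 32-bit product can be assembled from 16-bit partial products (Ah\*Bl, Al\*Bh and Al\*Bl, with Ah\*Bh discarded), and how a left shift then reduces to a multiplication by a power of two. The function names `mul32_low` and `shift_left_via_mul` are hypothetical, and the code is a software model under these assumptions rather than the SPMU implementation.

```python
MASK32 = 0xFFFFFFFF


def mul32_low(a: int, b: int) -> int:
    """Lower 32 bits of a*b, built from 16x16-bit partial products."""
    ah, al = (a >> 16) & 0xFFFF, a & 0xFFFF
    bh, bl = (b >> 16) & 0xFFFF, b & 0xFFFF

    # Cross products are shifted left by 16, so only their lower 16 bits
    # land inside the 32-bit result.
    tmp_b = ((ah * bl) & 0xFFFF) << 16   # (Ah*Bl) followed by 16 zero bits
    tmp_c = ((al * bh) & 0xFFFF) << 16   # (Al*Bh) followed by 16 zero bits
    tmp_d = al * bl                      # (Al*Bl), full 32 bits

    # (Ah*Bh) only affects bits 32 and above, so it is not computed.
    return (tmp_b + tmp_c + tmp_d) & MASK32


def shift_left_via_mul(x: int, amount: int) -> int:
    """Left shift of a 32-bit value expressed as a multiplication by 2**amount."""
    if not 0 <= amount < 32:
        raise ValueError("shift amount must be in 0..31")
    return mul32_low(x & MASK32, 1 << amount)
```

Under these assumptions, `shift_left_via_mul(0x00012345, 8)` returns `0x01234500`, the same value a dedicated 32-bit left shifter would produce.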



| 1 | Synchronous Partial Shifter Stage 1 |
|---|---|
| 2 | `if shift_en = '1' and (shifter_stage_1_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then` |
| 3 | `  for i in 0 to SIMD-1 loop` |
| 4 | `    shifter_op(31+32*(i) downto 32*(i)) <= to_stdlogicvector(to_bitvector(shifter_op(31+32*(i) downto 32*(i))) srl` |
| 5 | `      to_integer(unsigned(shift_amount))); -- shift as if it was a 32-bit value` |
| 6 | `  end loop;` |
| 7 | `  if MVTYPE_SPE = "00" then` |
| 8 | `    for i in 0 to 4*SIMD-1 loop -- latch the sign bits` |
| 9 | `      shifter_op_lat(7+8*i downto 8*i) <= (others => shifter_op(7+8*i)); -- latch 8-bit data sign bit for arith shifts` |
| 10 | `    end loop;` |
| 11 | `  elsif MVTYPE_SPE = "01" then` |
| 12     | for i in 0 to 2*SIMD-1 loop latch the sign bits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                             |
| 13     | $shifter_op_lat(15+16*i \text{ downto } 16*i) \le (others => shifter_op(15+16*i)); latch 16-bit data sign bit for arith shifts bits a sign bit for arith shifts bits bits a sign bit for a s$ |
| 14     | end loop;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                             |
| 15     | $elsif MVTYPE\_SPE = "10"$ then                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                             |
| 16     | for i in 0 to SIMD-1 loop latch the sign bits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                             |
| 17     | shifter_op_lat(31+32*i downto 32*i) <= (others => shifter_op(31+32*i)); latch 32-bit data sign bit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                             |
| 18     | end loop;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                             |
| 19     | end if;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                             |
| 20     | end if;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                             |
| 21     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                             |
|        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                             |

<pre>
----- Synchronous Partial Shifter Stage 2 -----
if shift_en = '1' and (shifter_stage_2_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then
  if MVTYPE_SPE = "10" then
    for i in 0 to SIMD-1 loop
      out_shifter_results(31+32*(i) downto 32*(i)) <= shifter_op_lat_wire(31+32*(i) downto 32*(i)) or
                                                      shifter_op(31+32*(i) downto 32*(i));
    end loop;
  elsif MVTYPE_SPE = "01" or (decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1' and
                              MVTYPE_SPE = "00") then
    -- KDOTPPS8 is added here because the number of elements loaded per cycle
    -- for mul ops is the same for the 8-bit and 16-bit types
    for i in 0 to 2*SIMD-1 loop
      out_shifter_results(15+16*(i) downto 16*(i)) <= shifter_op_lat_wire(15+16*(i) downto 16*(i)) or
                                                      (shifter_operand(15+16*(i) downto 16*(i)) and
                                                       shift_enabler(15 downto 0));
    end loop;
  elsif MVTYPE_SPE = "00" then
    for i in 0 to 4*SIMD-1 loop
      out_shifter_results(7+8*(i) downto 8*(i)) <= shifter_operand_lat_wire(7+8*(i) downto 8*(i)) or
                                                   (shifter_operand(7+8*(i) downto 8*(i)) and
                                                    shift_enabler(7 downto 0));
    end loop;
  end if;
end if;
</pre>

<pre>
-- Combinational Partial Shifter
if shift_en = '1' and halt_spe_lat = '0' then
  if MVTYPE_SPE = "01" then
    shift_enabler(15 - to_integer(unsigned(shift_amount(3 downto 0))) downto 0) <= (others => '1');
  elsif MVTYPE_SPE = "00" then
    shift_enabler(7 - to_integer(unsigned(shift_amount(2 downto 0))) downto 0) <= (others => '1');
  end if;
  if (decoded_instruction_SPE_lat(KSRAV_bit_position) = '1' or
      decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1') and
      MVTYPE_SPE = "10" then -- 32-bit sign extension for srl in stage 1
    for i in 0 to SIMD-1 loop
      shifter_op_lat_wire(31+32*(i) downto 31 - to_integer(unsigned(shift_amount(f)(4 downto 0)))+32*(i)) <=
        shifter_operand_lat(31+32*(i) downto 31 - to_integer(unsigned(shift_amount(f)(4 downto 0)))+32*(i));
    end loop;
  elsif (decoded_instruction_SPE_lat(KSRAV_bit_position) = '1' or
         decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1') and
         MVTYPE_SPE = "01" then -- 16-bit sign extension for srl in stage 1
    for i in 0 to 2*SIMD-1 loop
      shifter_operand_lat_wire(15+16*(i) downto 15 - to_integer(unsigned(shift_amount(3 downto 0)))+16*(i)) <=
        shifter_operand_lat(15+16*(i) downto 15 - to_integer(unsigned(shift_amount(3 downto 0)))+16*(i));
    end loop;
  elsif (decoded_instruction_SPE_lat(KSRAV_bit_position) = '1' or
         decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1') and
         MVTYPE_SPE = "00" then -- 8-bit sign extension for srl in stage 1
    for i in 0 to 4*SIMD-1 loop
      shifter_operand_lat_wire(7+8*(i) downto 7 - to_integer(unsigned(shift_amount(2 downto 0)))+8*(i)) <=
        shifter_operand_lat(7+8*(i) downto 7 - to_integer(unsigned(shift_amount(2 downto 0)))+8*(i));
    end loop;
  end if;
end if;
</pre>
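
The three shifter listings above cooperate to produce an arithmetic right shift: stage 1 latches the replicated sign bit of each lane, the combinational stage builds an enable mask and a sign-fill window from the shift amount, and stage 2 ORs the sign fill with the masked operand. The following Python sketch is a behavioural model of that scheme for a single lane and for a packed word. It assumes that the operand reaching stage 2 has already been shifted logically (zero fill), a step the listings do not show; the function names `sra_lane` and `sra_packed` are hypothetical and do not appear in the design.

```python
def sra_lane(value: int, shamt: int, width: int) -> int:
    """Arithmetic right shift of one `width`-bit lane via sign fill.

    Assumption: the operand entering stage 2 is the logically shifted
    value (zero fill); `width` is 8, 16 or 32.
    """
    full = (1 << width) - 1
    value &= full
    shamt %= width  # only the low log2(width) bits of the amount are used

    # Stage 1: replicate the lane's sign bit across the whole lane.
    sign_lat = full if (value >> (width - 1)) & 1 else 0

    # Combinational stage: enable mask for bits (width-1-shamt downto 0) ...
    enabler = (1 << (width - shamt)) - 1
    # ... and sign-fill window for bits (width-1 downto width-1-shamt).
    window = full & ~((1 << (width - 1 - shamt)) - 1)

    # Stage 2: OR the sign fill with the masked, logically shifted operand.
    srl = value >> shamt
    return (sign_lat & window) | (srl & enabler)


def sra_packed(word: int, shamt: int, width: int, lanes: int) -> int:
    """Apply sra_lane to every lane of a packed word (lane 0 is the LSBs)."""
    lane_mask = (1 << width) - 1
    out = 0
    for i in range(lanes):
        lane = (word >> (width * i)) & lane_mask
        out |= sra_lane(lane, shamt, width) << (width * i)
    return out
```

For example, `sra_packed(0x8010FF00, 4, 16, 2)` shifts the two 16-bit lanes `0x8010` and `0xFF00` independently and returns `0xF801FFF0`. The sign-fill window is one bit wider than the shift amount, which is harmless because the topmost bit kept from the logical shift is the original sign bit.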

The remaining two functional units are a **2-stage accumulator**, which accumulates an input vector source into a scalar output, and a **ReLu unit** that rectifies all negative vector elements to zero.
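
As a behavioural illustration of the rectification just described, the sketch below models a ReLu over a packed word in Python: each lane whose most significant bit is set is replaced by zero, and every other lane passes through unchanged. The function name `relu_packed` and its parameters are hypothetical; `width` plays the role of the element size selected by `MVTYPE_SPE`, and `lanes` the number of elements processed per cycle.

```python
def relu_packed(word: int, width: int, lanes: int) -> int:
    """Zero every lane whose sign bit (MSB) is set; keep the others."""
    lane_mask = (1 << width) - 1
    out = 0
    for i in range(lanes):
        lane = (word >> (width * i)) & lane_mask
        sign = (lane >> (width - 1)) & 1
        if sign == 0:
            # Non-negative element: copy it to the same lane of the result.
            out |= lane << (width * i)
        # Negative element: the lane stays zero.
    return out
```

With two 32-bit lanes, `relu_packed(0x800000017FFFFFFF, 32, 2)` keeps the positive low lane and clears the negative high lane, giving `0x000000007FFFFFFF`.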

<pre>
-- Two Stage Accumulator SIMD 2
if (decoded_instruction_SPE_lat(KDOTP_bit_position)   = '1' or
    decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1' or
    decoded_instruction_SPE_lat(KVRED_bit_position)   = '1') and
    MVTYPE_SPE = "10" then
  if (accum_stage_1_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then
    accum_partial_results_stg_1(31 downto 0) <= std_logic_vector(unsigned(accum_op(31 downto 0)) +
                                                                 unsigned(accum_op(63 downto 32)));
  end if;
  if (accum_stage_2_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then
    out_accum_results(f) <= std_logic_vector(unsigned(accum_partial_results_stg_1(31 downto 0)) +
                                             unsigned(out_accum_results));
  end if;
elsif (decoded_instruction_SPE_lat(KDOTP_bit_position)   = '1' or
       decoded_instruction_SPE_lat(KDOTPPS_bit_position) = '1' or
       decoded_instruction_SPE_lat(KVRED_bit_position)   = '1') and
      (MVTYPE_SPE = "01" or MVTYPE_SPE = "00") then
  if (accum_stage_1_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then
    accum_partial_results_stg_1(15 downto 0)  <= std_logic_vector(unsigned(accum_op(15 downto 0)) +
                                                                  unsigned(accum_op(31 downto 16)));
    accum_partial_results_stg_1(31 downto 16) <= std_logic_vector(unsigned(accum_op(47 downto 32)) +
                                                                  unsigned(accum_op(63 downto 48)));
  end if;
  if (accum_stage_2_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then
    spe_out_accum_results <= std_logic_vector(unsigned(accum_partial_results_stg_1(15 downto 0)) +
                                              unsigned(accum_partial_results_stg_1(31 downto 16)) +
                                              unsigned(out_accum_results));
  end if;
end if;
</pre>

```
-- Synchronous Single Stage ReLu
if (relu_stage_1_en = '1' or recover_state_wires = '1') and halt_spe_lat = '0' then
  if MVTYPE_SPE = "10" then -- ReLu for 32-bit data type
    for i in 0 to SIMD-1 loop
      if spe_in_relu_operands(31+32*(i)) = '1' then
        spe_out_relu_results(31+32*(i) downto 32*(i)) <= (others => '0');
      else
        spe_out_relu_results(31+32*(i) downto 32*(i)) <= spe_in_relu_operands(31+32*(i) downto 32*(i));
      end if;
    end loop;
  elsif MVTYPE_SPE = "01" then -- ReLu for 16-bit data type

  end if;
end if;
```

#### 5.3.2. Scratchpad Memory Interface

The engine is interfaced with a set of SPMs through the Scratchpad Memory Interface. Each SPM in the SPI has a read port and a write port, and every SPM line is split into banks that each hold a 32-bit word. The number of banks in an SPM depends on the chosen SIMD configuration. For example, a configuration with SIMD 4 has four banks; each bank has its own read and write port, so the total port width of the SPM is 128 bits (i.e. 32 bits\*4). When a fetch request is granted, the data is read on the next cycle. The RTL below illustrates the implementation of the SPMs in the T13.

```
----- Scratchpad Memory Generation -----
-- 3D array of memory: the 1st dimension defines the size of each word,
-- the 2nd is the number of words in a bank, and the 3rd is the number of banks.
signal mem : array_3d(SIMD*SPM_NUM-1 downto 0)(2**(Addr_Width-(SIMD_BITS+2))-1 downto 0)(Data_Width-1 downto 0);
attribute ram_style : string;
attribute ram_style of mem : signal is "block";

spm_banks : for h in 0 to SIMD*SPM_NUM-1 generate
  spm_logic : process(clk_i)
  begin
    if (clk_i'event and clk_i = '1') then
      sc_data_rd(h) <= mem(SIMD*SPM_NUM + h)(to_integer(unsigned(sc_addr_rd(h))));
      if sc_we(h) = '1' then -- write mode
        mem(SIMD*SPM_NUM + h)(to_integer(unsigned(sc_addr_wr(h)))) <= sc_data_wr(h);
      end if; -- we
    end if; -- clk
  end process;
end generate spm_banks;
```
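As a quick cross-check of the array bounds used in the SPM declaration, the sketch below recomputes the SPM geometry in Python. It is only a behavioral aid and not part of the T13 RTL: the function name `spm_geometry` and the example values (three SPMs, a 12-bit address) are hypothetical, and it assumes that `SIMD_BITS` equals log2(SIMD).

```python
def spm_geometry(simd, spm_num, addr_width, data_width=32):
    """Recompute the bounds of the `mem` array from its generics.

    Assumption: SIMD is a power of two and SIMD_BITS = log2(SIMD).
    """
    simd_bits = simd.bit_length() - 1
    return {
        # SIMD*SPM_NUM banks in total, SIMD banks per SPM
        "total_banks": simd * spm_num,
        # one data word from every bank of a single SPM forms a line
        "line_width_bits": simd * data_width,
        # depth of each bank, as in 2**(Addr_Width-(SIMD_BITS+2))
        "words_per_bank": 2 ** (addr_width - (simd_bits + 2)),
    }


if __name__ == "__main__":
    # Hypothetical configuration: SIMD 4, three SPMs, 12-bit byte address.
    print(spm_geometry(simd=4, spm_num=3, addr_width=12))
```

With SIMD 4 the computed line width is the 128 bits quoted in the text.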

An SPM read or write access fetches or writes an entire line in one cycle. If the fetch pointer does not point to the beginning of a line, the data is fetched partly from the indexed line and partly from the next line, so that one complete line is still fetched per cycle.

Misaligned fetches pass through a read-rotator circuit that makes the data appear as if it was fetched from the beginning of the line. The rotator adds one extra cycle of latency to the execution of the instruction. In this manner operand\_a[i] is always aligned with operand\_b[i] and both go to the same functional unit. Without rotation, a misaligned access might send operand\_a[i] and operand\_b[i+2] to the same functional unit, producing erroneous outputs. When the result is written, a write rotator rotates it back so that each word goes to the correct bank indexed by the write address.

```
-- Synchronous Write Rotator
for i in 0 to SIMD-1 loop -- index i loops the words inside each SPM
  if (to_integer(unsigned(spm_write_addr(SIMD_BITS+1 downto 0))) = 4*i) and (i /= 0) then
    wr_offset(i-1 downto 0) <= (others => '1');
  end if;
end loop;
for i in 0 to SIMD-1 loop -- index i loops the words inside each SPM
  if (to_integer(unsigned(spm_write_addr(SIMD_BITS+1 downto 0))) = 4*i) then
    for j in 0 to SIMD-1 loop
      if j <= (SIMD-1)-i then
        spm_data_write_int_wire(31+32*(j+i) downto 32*(j+i)) <= spm_data_write_wire(31+32*j downto 32*j);
      elsif j > (SIMD-1)-i then
        spm_data_write_int_wire(31+32*(j-(SIMD-1)+(i-1)) downto 32*(j-(SIMD-1)+(i-1))) <=
          spm_data_write_wire(31+32*j downto 32*j);
      end if;
    end loop;
  end if;
end loop;
```

```
-- Synchronous Read Rotator
for k in 0 to 1 loop -- index k loops between the two read data operands of the SPI
  for i in 0 to SIMD-1 loop -- index i loops the words inside each SPM
    if (to_integer(unsigned(spm_read_addr(k)(SIMD_BITS+1 downto 0))) = 4*i) and (i /= 0) then
      rd_offset(k)(i-1 downto 0) <= (others => '1');
    end if;
  end loop;
  for i in 0 to SIMD-1 loop -- index i loops the words inside each SPM
    if (to_integer(unsigned(spm_read_addr_lat(k))) = 4*i) then
      for j in 0 to SIMD-1 loop
        if j >= i then
          spm_data_read_wire(k)(31+32*(j-i) downto 32*(j-i))
            <= spm_data_read_int_wire(k)(31+32*j downto 32*j);
        elsif j < i then
          spm_data_read_wire(k)(31+32*((SIMD-1)-i+(j+1)) downto 32*((SIMD-1)-i+(j+1)))
            <= spm_data_read_int_wire(k)(31+32*j downto 32*j);
        end if;
      end loop;
    end if;
  end loop;
end loop;
```
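The index arithmetic of the two rotators can be checked with a small behavioral model. The Python sketch below is not the T13 RTL: the function names are hypothetical, and a line is modeled as a plain list with one entry per bank. It reproduces the word mapping of the listings (the read rotator moves bank `j` to lane `(j - i) mod SIMD`, the write rotator moves lane `j` to bank `(j + i) mod SIMD`) and shows that the write rotation undoes the read rotation.

```python
def bank_offset(byte_addr, simd):
    """Bank index selected by the low SIMD_BITS+2 bits of a byte address."""
    return (byte_addr % (4 * simd)) // 4


def read_rotate(bank_words, offset):
    """Read rotator: the word held in bank `offset` ends up in lane 0."""
    simd = len(bank_words)
    lanes = [None] * simd
    for j in range(simd):
        lanes[(j - offset) % simd] = bank_words[j]
    return lanes


def write_rotate(lane_words, offset):
    """Write rotator: lane 0 is written to bank `offset`."""
    simd = len(lane_words)
    banks = [None] * simd
    for j in range(simd):
        banks[(j + offset) % simd] = lane_words[j]
    return banks


if __name__ == "__main__":
    # SIMD 4, access starting at byte 8 of a line (bank 2).
    # Banks 0 and 1 already hold words of the next line.
    banks = ["w4", "w5", "w2", "w3"]
    off = bank_offset(8, simd=4)
    lanes = read_rotate(banks, off)
    print(lanes)
    print(write_rotate(lanes, off) == banks)
```

In the example, the operand words come out in element order starting from lane 0, which is what keeps operand\_a[i] and operand\_b[i] on the same functional unit.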

The SPI has a serialized access grant unit, in which the instruction that comes first in program order locks the read or write access of the scratchpad it uses. Since T13 is an in-order processor, the serialized access grant never gives rise to data hazards.

LSU accesses to the SPI read or write one bank at a time instead of an entire SPM line at once. A bank interleaver steps consecutively through the banks of the SPM; once it reaches the last bank of the line, it increments the read or write address and loops back to bank 0. The RTL describing the implementation of the bank interleaver is shown below.

```
--- Synchronous bank counter -----

-- Increments the bank count inside each spm memory
if data_rvalid_i = '1' then
  if spm_word_count = SIMD-1 then
    spm_word_count <= 0;
  else
    spm_word_count <= spm_word_count + 1;
  end if;
end if;

--- LSU read port -----

if ls_data_gnt_i(i) = '1' then -- LSU read port
  if harc_LS_wire = h then -- data reads the register from the bank counter to index
    ls_sc_data_read_wire_replicated(h) <= sc_data_rd(h)((SIMD)*i + sc_word_count(h));
  end if;
end if;

if ls_spi_req(i) = '1' then -- LSU read port
  if harc_LS_wire = h then -- address reads the wire from the bank counter to index
    spm_addr_rd(h)(spm_word_count_wire + (SIMD)*i) <= ls_spm_read_addr;
  end if;
end if;
```
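The access order produced by the bank interleaver can be illustrated with a short behavioral model. This Python sketch is an assumption-laden illustration rather than the T13 implementation: the class name `BankInterleaver` is hypothetical, the address is counted in whole lines, and the address increment on wrap-around follows the description in the text (the listing itself only shows the counter wrapping).

```python
class BankInterleaver:
    """Walks the banks of one SPM word by word, as an LSU access does."""

    def __init__(self, simd, start_line=0):
        self.simd = simd
        self.word_count = 0          # mirrors spm_word_count
        self.line_addr = start_line  # assumption: address counted in lines

    def step(self, rvalid=True):
        """Return the (line, bank) used in this beat, then advance if rvalid."""
        target = (self.line_addr, self.word_count)
        if rvalid:
            if self.word_count == self.simd - 1:
                self.word_count = 0   # loop back to bank 0
                self.line_addr += 1   # and move on to the next line
            else:
                self.word_count += 1
        return target


if __name__ == "__main__":
    il = BankInterleaver(simd=2)
    print([il.step() for _ in range(5)])
```

The design point this makes visible is that consecutive LSU words land in consecutive banks, so a later SPE access can read a whole line of them in one cycle.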

### 5.4. SPMU Implementations

This section explores a set of hardware accelerator schemes whose architecture was described in section 5.3, and describes how each one can be used in exploiting the T13 core.

#### 5.4.1. Shared-SPMU (Shared-SPI, Shared-SPE):

The first approach used when adding a hardware accelerator to the IMT architecture was a Shared-SPMU accessed by all the harts in the core. Figure 5.6 shows a block diagram of this scheme. The schematic is nearly identical to the one shown in figure 5.2. In order to access the Shared-SPMU, a request signal is sent from the decode stage. If the Shared-SPMU is busy, the pipeline is halted until it becomes free again.

In order to minimize the halts to the pipeline, the functional units can be set to execute in SIMD. Increasing the SIMD replicates the functional units in the core and increases the number of banks in each SPM. The core can be configured to process up to 256 bits of data in parallel per cycle (SIMD 8 max). Smaller data types perform even faster when the data level parallelism is boosted, since most of the functional units work in partial mode and can compute up to four results per unit, as seen in table 5.1. In the scheme in figure 5.6, all the harts share the same memory space and the same execution units. The SPMU can work in superscalar with other non-SPMU instructions; however, when an SPMU instruction is decoded while the SPMU is busy, the instruction pipeline is halted in this scheme.

The RTL describing the implementation of the Shared-SPMU is the same code that was shown in section 5.3.



Figure.5.6. Diagram of the Shared-SPMU, all accesses to the SPMU are shared by all the harts

#### 5.4.2. Dedicated-SPI Shared-SPE

The second hardware accelerator scheme is called the Dedicated-SPI Shared-SPE. Its implementation is shown in figure 5.7. In this scheme, every hart in the T13 core has its own dedicated memory space, but all harts share the same functional units. It can be compared to a multi-threaded hardware accelerator, in which the threads share access to the logical elements [38]. In the Dedicated-SPI Shared-SPE, any contention for a functional unit is resolved by a contention handler that determines which hart requested the access first. Since the hart instructions are issued in order, there are never simultaneous requests, and therefore no race conditions. An SPMU busy signal in this scheme only blocks SPMU instructions belonging to the same hart, which greatly reduces the pipeline halts caused by the SPMU. Note that there is a buffer to hold the instruction data **for each hart**. This gives a great speed advantage over the Shared-SPMU approach, as it exploits thread level parallelism while still maintaining minimal architectural complexity, since no instruction issue logic is needed to issue out of order.

Every hart can load data to its own SPI, and not to any other SPI. In this manner, the SPMs of each hart can have overlapping memory addresses. For example, hart 2 can perform burst loads (*kmemld*) from the main memory to SPI(2) only, and hart 1 can do the same to its own SPI using the same pointers that hart 2 used in its *kmemld* instructions. The decoding of the entire SPM address space becomes much easier to handle, which also makes things easier for the programmer managing the SPMU address space. In a similar manner, all SPMU arithmetic instructions read from and write to their own SPIs only.



Figure.5.7. Diagram of dedicated SPI shared SPE model. Each hart has a dedicated set of scratchpads, busy signals will only block the hart belonging to the same SPMU

If the user needs to broadcast some input data to all the SPIs, they can execute another type of load instruction called broadcast load, *kbcastld*. When the user sends input data to SPM(i) with *kbcastld*, the data is broadcast to the SPM(i) of every SPI. This broadcasting operation relieves the core from having to fill three memories sequentially.

The changes to the RTL required to handle this are minimal. First, every SPI is replicated with a "for generate" structure, and a signal is added to indicate that the load is a broadcast, as shown below.

```
SPM_replicated : for h in accl_range generate -- The index 'h' now refers to each SPI
  SPI_Unit_comb : process(all)
  begin
    . . .

    if data_rvalid_i = '1' then -- LS write port
      if ls_spi_req(i) = '1' and ls_spi_we(i) = '1' and ls_spi_wr_gnt = '1' then
        if harc_LSU_wire = h or spm_bcast = '1' then -- spm_bcast indicates we have kbcastld, and we always enter the 'if'
          spm_we(h)((SIMD)*i + spm_word_count(h)) <= '1';
          spm_data_wr(h)(spm_word_count(h) + (SIMD)*i) <= lsu_data_write_wire(31 downto 0);
          spm_addr_wr(h)(spm_word_count(h) + (SIMD)*i) <= lsu_write_addr;
        end if;
      end if;
    end if;
    . . .

  end process;

end generate SPM_replicated;
```
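The effect of the `harc_LSU_wire = h or spm_bcast = '1'` condition can be seen in a small behavioral model. The Python sketch below is hypothetical (class and method names are not from the T13 sources) and models each SPI as a dictionary from address to data. It shows that a normal load only reaches the SPI of the issuing hart, so harts can reuse the same addresses, while a broadcast load reaches every SPI.

```python
class DedicatedSPIs:
    """One private scratchpad address space per hart."""

    def __init__(self, num_harts):
        self.spis = [dict() for _ in range(num_harts)]

    def load(self, harc, addr, data, bcast=False):
        """Write `data` at `addr`; `bcast` models the kbcastld case."""
        for h, spi in enumerate(self.spis):
            # same selection as: harc_LSU_wire = h or spm_bcast = '1'
            if h == harc or bcast:
                spi[addr] = data


if __name__ == "__main__":
    spmu = DedicatedSPIs(num_harts=3)
    spmu.load(harc=2, addr=0x10, data=7)               # kmemld-like load from hart 2
    spmu.load(harc=1, addr=0x10, data=9)               # same pointer, different hart
    spmu.load(harc=0, addr=0x20, data=5, bcast=True)   # kbcastld-like load
    print(spmu.spis)
```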

In addition to the SPI, the SPE must have a new mapping unit, each hart must have its own hardware loops, and there must be a functional unit contention access handler.

The RTL below gives a brief sketch of how the new mapping unit is implemented. One disadvantage of this implementation is that mapping the input operands of every SPI to the different functional units requires a large set of multiplexers to route the inputs and outputs appropriately.

```
-- Input Mapping
-- The index 'h' refers to the dedicated SPI in the core, and maps them to the adder
if decoded_instruction_SPE_lat(h)(KADDV_bit_position) = '1' then
  adder_ops(0) <= spi_data_read(h)(0);
  adder_ops(1) <= spi_data_read(h)(1);
end if;

-- Output Mapping
-- The output results of the adder are again mapped to the appropriate SPI indexed in 'h'
if decoded_instruction_SPE_lat(h)(KADDV_bit_position) = '1' then
  spe_sc_data_write_wire_int(h) <= out_adder_results;
end if;
```

The FU contention handler, on the other hand, is a bit more complex to implement. The RTL below shows the logic behind the functional unit grant handler. Every functional unit has its own handler, and any reservation on a busy functional unit stores the ID of the requesting hart inside a buffer. The buffer write-pointer is incremented as soon as the request is registered, so that another hart can reserve access to the busy functional unit at the new write-pointer value. As soon as the functional unit becomes free, the buffer is read, the read-pointer is incremented, and the grant is given to the hart ID stored in the buffer.


```
. . .
        if fu_gnt_en(h)(i) = '1' then
          if unsigned(fu_rd_ptr(i)) = THREAD_POOL_SIZE - 2 then
            fu_rd_ptr(i) <= (others => '0');
          else
            fu_rd_ptr(i) <= std_logic_vector(unsigned(fu_rd_ptr(i)) + 1); -- increment the read pointer
          end if;
        end if;
      end if;
    end loop;
  end loop;
```

```
-- Combinational FU access handler
for h in accl_range loop
  fu_gnt_wire(h) <= (others => '0');
  fu_gnt_en(h) <= (others => '0');
  if add_en_pending_wire(h) = '1' and busy_add_wire = '0' then
    fu_gnt_en(h)(0) <= '1';
  end if;
  if shift_en_pending_wire(h) = '1' and busy_shf_wire = '0' then
    fu_gnt_en(h)(1) <= '1';
  end if;

  for i in 0 to 4 loop -- loops through the five functional units (add, shift, mul, acc, relu)
    if fu_gnt_en(h)(i) = '1' then
      -- give a grant to fu_gnt(h)(i), such that the 'h' index points to the thread in "fu_issue_buffer"
      fu_gnt_wire(to_integer(unsigned(fu_issue_buffer(i)(to_integer(unsigned(fu_rd_ptr(i))))))(i) <= '1';
    end if;
  end loop;

end loop;
```
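The reservation scheme of the contention handler behaves like a small circular FIFO of hart IDs per functional unit. The following Python sketch models that behavior under stated assumptions: the class name `FuGrantQueue` is hypothetical, only harts that find the unit busy are stored in the buffer, and the buffer depth is taken as THREAD\_POOL\_SIZE - 1, which is inferred from the read pointer wrapping at THREAD\_POOL\_SIZE - 2.

```python
class FuGrantQueue:
    """Pending-access buffer of one functional unit (first come, first served)."""

    def __init__(self, thread_pool_size):
        # Assumption: at most thread_pool_size - 1 harts can wait on a busy unit.
        self.depth = thread_pool_size - 1
        self.buffer = [None] * self.depth   # mirrors fu_issue_buffer(i)
        self.wr_ptr = 0
        self.rd_ptr = 0                     # mirrors fu_rd_ptr(i)
        self.count = 0

    def reserve(self, hart_id):
        """Register a request from `hart_id` on a busy functional unit."""
        if self.count == self.depth:
            raise OverflowError("more pending harts than buffer entries")
        self.buffer[self.wr_ptr] = hart_id
        self.wr_ptr = (self.wr_ptr + 1) % self.depth
        self.count += 1

    def grant(self, busy):
        """Return the hart that gets the unit now, or None if nothing is granted."""
        if busy or self.count == 0:
            return None
        hart_id = self.buffer[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % self.depth
        self.count -= 1
        return hart_id


if __name__ == "__main__":
    adder = FuGrantQueue(thread_pool_size=3)
    adder.reserve(1)
    adder.reserve(2)
    print(adder.grant(busy=True), adder.grant(busy=False), adder.grant(busy=False))
```

Because requests arrive one at a time from an in-order pipeline, a single write pointer is enough and no arbitration between simultaneous writers is modeled.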

Note that the Dedicated-SPI Shared-SPE approach, which already exploits the thread level parallelism of the T13 core, can still be configured to exploit data level parallelism by configuring the SPMU with larger SIMD settings.

#### 5.4.3. Dedicated-SPMU (Dedicated SPI, Dedicated-SPE)

The Dedicated-SPMU approach, as the name implies, assigns a dedicated hardware accelerator to each hart. Just as the previous implementation was compared to a multi-threaded accelerator, this implementation can be compared to a multi-core accelerator, where the term multi-core is used in the sense of the CUDA cores in NVIDIA Tesla [39]. Each SPMU has its own SPI and SPE, and no contention handler is needed at all, since each hart has its own set of functional units. Figure 5.8 shows the implementation of this approach. Its advantage over the Dedicated-SPI Shared-SPE approach is that it further decreases the stalls of the instruction pipeline, since there is never contention over functional units. The mapping unit is also much less complex, since it does not need the large crossbar that maps the operands to the functional units, and its implementation follows that of the Shared-SPMU.

Like the Dedicated-SPI Shared-SPE approach, this unit has one instruction buffer for each hart. A pipeline stall only happens when the decode stage has an SPMU instruction belonging to the hart of a busy SPMU. Similarly, the SPI implementation of the Dedicated-SPMU approach is exactly the same as that of the previous approach, and it still supports the broadcast load instruction. However, a disadvantage of this approach is that it might occupy a large area, since all the pipelined functional units are replicated.



Figure.5.8. Diagram of Dedicated-SPMU, each hart has a dedicated SPE and SPI, a busy signal will only block the hart belonging to the same SPMU

The brief RTL below illustrates how the signals in the SPMU must be changed relative to the Shared-SPMU approach: all the signals now have a new dimension called *accl\_range*, which ranges over the number of hardware accelerators in the core. If the SPMU is replicated, as in this case, *accl\_range* is equal to THREAD\_POOL\_SIZE, while if the replication is disabled, *accl\_range* becomes zero. As seen in the RTL, a "for generate" must also be added to replicate the assignments in the SPE, just as the SPI assignments were replicated in the previous approach. This way, each process assigns to its own dimension indexed by 'h'.

```
signal wb_ready      : std_logic_vector(accl_range);
signal SIMD_RD_BYTES : array_2d_int(accl_range);
signal MVSIZE_WRITE  : array_2d(accl_range)(Addr_Width downto 0);

SPE_replicated : for h in accl_range generate -- The h index loops through the accl_range above
  if wb_ready(h) = '1' then
    if to_integer(unsigned(MVSIZE_WRITE(h))) >= SIMD_RD_BYTES(h) then
      MVSIZE_WRITE(h) <= std_logic_vector(unsigned(MVSIZE_WRITE(h)) - SIMD_RD_BYTES(h)); -- decrement the remaining bytes
    else
      MVSIZE_WRITE(h) <= (others => '0');
    end if;
  end if;
  . . .
end generate;
```
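The per-accelerator indexing can be mimicked in software by turning every scalar state variable into a list with one entry per accelerator. The Python sketch below is a hypothetical model of the `MVSIZE_WRITE` update only (the function name is not from the T13 sources): each entry is decremented independently by the bytes written per cycle and saturates at zero, as in the listing.

```python
def step_mvsize_write(mvsize_write, simd_rd_bytes, wb_ready):
    """One update of the remaining-bytes counters, indexed by accelerator 'h'."""
    updated = []
    for h in range(len(mvsize_write)):
        remaining = mvsize_write[h]
        if wb_ready[h]:
            if remaining >= simd_rd_bytes[h]:
                remaining -= simd_rd_bytes[h]   # decrement the remaining bytes
            else:
                remaining = 0                   # saturate at zero
        updated.append(remaining)
    return updated


if __name__ == "__main__":
    # Three accelerators (one per hart); only the first two write back this cycle.
    print(step_mvsize_write([40, 8, 100], [16, 16, 16], [True, True, False]))
```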


### 5.5. Performance evaluation of the SPMU implementations.

In order to benchmark the performance of the T13 core when executing vector operations, various tests have been developed. The first batch is a basic series of instruction level tests. These tests benchmark the performance contribution of the different techniques provided in the SPMU to boost the execution of vector-arithmetic operations. The second batch is a set of matrix convolutions executed with the SPMU, in order to show how the hardware schemes introduced in section 5.4 perform. Lastly, we show the results of running entire layers of a DCNN on the SPMU, and we compare its performance to the T03 and to the Riscy cores from PULPino. Details about the implementation of the tests are laid out in chapter 6.

#### 5.5.1. Instruction Level Testing:

In order to benchmark some implementations in the SPMU, a set of basic arithmetic tests was performed to see which implementations provided the largest performance boost. Figure 5.9 shows the number of clock cycles taken to perform an arithmetic vector operation in the T13 without using any hardware accelerator, but still utilizing all the harts in the core.



Figure.5.9. Number of cycles taken to perform an arithmetic vector operation without the SPMU

In figure 5.10, the same vector-arithmetic operations were performed with the SPMU with the different data types (8, 16 and 32 bits). However, they were performed using software loops instead of zero-overhead loops (hardware loops). The tests were run on the Shared-SPMU scheme configured with no data level parallelism (SIMD=1). Figure 5.10 shows the advantage of using the low latency local scratchpad memories.

Figure.5.10. Cycle time using the SPMU with SIMD=1 and hardware loops disabled

Comparing figures 5.9 and 5.10, the boost is not very evident for small vector sizes. However, as the vector sizes grow, the cycle count of the tests running on the SPMU with sw-loops and SIMD equal to one grows with a smaller slope than that of the non-accelerated test. This test clearly outlines the advantage of using low latency scratchpad memories over using the register file: the total number of cycles dropped by more than 40% for vectors of sixty elements.

The reason for the speedup is that the non-accelerated operations, which write to the register file, have to push old data to the stack memory to make way for the newly computed results, and then load the data back from the stack when it needs to be read. The SPMU, instead, loads the input data once from the main memory with a burst load instruction, and stores the final results back to the main memory with a burst store at the end of the operation.

Smaller data widths such as 16 and 8 bits performed even better, since they are more parallel than the 32-bit operations even though the SIMD of the SPMU is set to one. Because the SPMU uses partial functional units and replicates the non-partial ones, these data types show a much smaller slope relative to the 32-bit operations.

Figure 5.11 shows the advantage of using the zero-overhead loops, or hardware loops, in the SPMU. The hardware loops relieve the core of the following instruction overhead:

- Incrementing the address of source operand 1.
- Incrementing the address of source operand 2.
- Decrementing the number of elements left to execute.
- Branching to the beginning of the loop if the number of elements is not zero.
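The weight of the four overhead steps listed above can be made concrete with a simple instruction count. The Python sketch below is only an illustration under stated assumptions: it counts issued instructions rather than cycles, it assumes each listed step costs exactly one instruction per loop pass, and the function name is hypothetical. It does not reproduce the measurements reported in the figures.

```python
# The four overhead steps a software loop executes on every pass.
LOOP_OVERHEAD = (
    "increment address of source operand 1",
    "increment address of source operand 2",
    "decrement number of elements left",
    "branch to the beginning of the loop",
)


def loop_instruction_count(passes, hardware_loops):
    """Instructions issued for `passes` loop passes of one vector operation."""
    overhead = 0 if hardware_loops else len(LOOP_OVERHEAD)
    return passes * (1 + overhead)


if __name__ == "__main__":
    sw = loop_instruction_count(60, hardware_loops=False)
    hw = loop_instruction_count(60, hardware_loops=True)
    print(sw, hw, (sw - hw) / sw)
```

Under these assumptions, four out of every five instructions issued in the software loop are bookkeeping that the hardware loops absorb.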



Figure.5.11. Cycle time using the SPMU with SIMD=1 and hardware loops enabled

Enabling the hardware loops in the SPMU boosted the performance for all vector sizes: compared to the sw-loop approach, the speed boost was over 170% for large vectors and almost 100% for small vectors. Compared to the non-accelerated approach of figure 5.9, the speed boost exceeds 350% for large vectors.

Figure.5.12. Cycle time using the SPMU with SIMD=4 and hardware loops enabled

Finally, figure 5.12 reports the cycle time when executing the same test, but with the data level parallelism increased by setting the SIMD equal to four.

Boosting the data level parallelism contributed the least to the performance out of all the implementations: the speed boost was barely visible for small vectors, and about 15% over the previous approach for large vectors. In addition, the area increase from replicating the functional units and the registers that hold the data in their pipelines, together with the larger read and write SPM rotators, can be regarded as considerable for such a small performance contribution.

Area utilization is discussed further in section 5.6.

#### 5.5.2. Routine Level Testing

Libraries have been written using the SPMU instructions in order to perform matrix convolutions. Details about the implementation of the convolutions are included in chapter 6. The matrix convolutions were run on square matrices of different sizes, namely 4x4, 8x8, 16x16, and 32x32. Only 32-bit integers were used as the data type, because the neural network test uses this data type as well. The convolution tests have been run on the hardware schemes introduced in section 5.4. Each hardware scheme was configured with different SIMD configurations (1, 2, 4 and 8) to show the contribution of the data level parallelism in each. Table 5.2 reports the cycle count for each matrix convolution on each SPMU hardware scheme, as well as on the non-accelerated version of the T13 and on the native PULPino Riscy cores.

As we delve into the evaluation of the different hardware schemes from section 5.4, I will use the following terminology to refer to the schemes for brevity:

- *DLP approach:* means increasing the data level parallelism in the Shared-SPMU such that we go from SIMD-1 to SIMD-8.
- *TLP approach:* means that we go from the Shared-SPMU SIMD-1 scheme to the Dedicated-SPMU SIMD-1 or Dedicated-SPI\_Shared-SPE SIMD-1 schemes that exploit thread level parallelism.
- *Hybrid approach:* means that we go from the Shared-SPMU SIMD-1 scheme to the Dedicated-SPMU SIMD-8 or Dedicated-SPI\_Shared-SPE SIMD-8 schemes that exploit both data level parallelism and thread level parallelism.

The evaluation starts from the small matrix convolutions. Looking at table 5.2, small matrix convolutions (4x4) performed by the different SPMU configurations gave approximately 2-3 times the speed-up relative to performing the convolutions on the non-accelerated T13 core (NO\_ACCL\_RV32IM), more than 2 times the speed-up compared to the Riscy core, and 4-7 times compared to the Zeroriscy core. Riscy achieves a low cycle count because it exploits hardware loops and custom DSP extensions; its instruction count decreases as much of the software overhead is performed in hardware.

| Core                 | Scheme                   | SIMD | Cycle count<br>4x4 | Cycle count<br>8x8 | Cycle count<br>16x16 | Cycle count<br>32x32 |
|----------------------|--------------------------|------|------|-------|--------|--------|
| Klessydra T13        | Shared SPMU              | 1    | 1105 | 3060  | 9727   | 34201  |
| Klessydra T13        | Shared SPMU              | 2    | 895  | 2245  | 6261   | 20374  |
| Klessydra T13        | Shared SPMU              | 4    | 824  | 1768  | 4607   | 13444  |
| Klessydra T13        | Shared SPMU              | 8    | 824  | 1613  | 3692   | 10069  |
| Klessydra T13        | Dedicated SPMU           | 1    | 626  | 1493  | 3887   | 13536  |
| Klessydra T13        | Dedicated SPMU           | 2    | 629  | 1190  | 3123   | 8681   |
| Klessydra T13        | Dedicated SPMU           | 4    | 560  | 1190  | 2543   | 7148   |
| Klessydra T13        | Dedicated SPMU           | 8    | 560  | 1152  | 2543   | 6006   |
| Klessydra T13        | Dedicated SPI Shared SPE | 1    | 663  | 1521  | 4153   | 13565  |
| Klessydra T13        | Dedicated SPI Shared SPE | 2    | 638  | 1274  | 3280   | 9167   |
| Klessydra T13        | Dedicated SPI Shared SPE | 4    | 573  | 1213  | 2688   | 7473   |
| Klessydra T13        | Dedicated SPI Shared SPE | 8    | 573  | 1079  | 2580   | 6285   |
| Klessydra T13        | NO_ACCL (RV32IM)         | NA   | 1819 | 5737  | 20714  | 79230  |
| Klessydra T13        | NO_ACCL (RV32EM)         | NA   | 2355 | 7821  | 28927  | 111891 |
| Klessydra T13        | NO_ACCL (RV32I)          | NA   | 4883 | 17877 | 69087  | 272394 |
| Klessydra T13        | NO_ACCL (RV32E)          | NA   | 5568 | 20707 | 80478  | 318084 |
| RISCY                |                          | NA   |      | 4247  | 15088  | 57020  |
| ZeroRiscy            |                          | NA   |      | 8111  | 29583  | 113793 |
| ZeroRiscy (no RV32M) |                          | NA   | 6406 | 23601 | 91233  | 360081 |
| MicroRiscy           |                          | NA   | 7380 | 27385 | 106271 | 419618 |

 
 Table.5.2. Cycle number to execute a set of convolutions for different SPMU configurations

Comparing the SPMU schemes to the T13 cores that did not use the accelerator, the acceleration became more evident with bigger convolutions: 32x32 convolutions achieved 5-7 times the speed-up using the DLP or TLP approach alone, while hybrid approaches exploiting both DLP and TLP gained up to 16 times the speed-up. Compared to the PULPino cores, the speed-up also grew on bigger convolutions: hybrid SPMU approaches had up to 12 times the speed-up relative to the Riscy core, and 10 times the speed-up relative to Zeroriscy.

Moving on to comparing the SPMU schemes among themselves on the bigger matrix convolutions, the DLP approach alone gave about 3.4 times the speed-up, while the TLP approach alone gave approximately 2.5 times the speed-up. Exploiting both DLP and TLP gave 5.7 times the speed-up. In bigger matrix convolutions, not only did the TLP and DLP approaches give higher speed-ups than in the smaller matrix convolutions, but the DLP speed-up also grew faster than the TLP speed-up, so that for bigger matrices the DLP approach appeared preferable to the TLP approach.
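These ratios follow directly from the 32x32 column of table 5.2. As a quick check, the short sketch below recomputes them; the helper name `speedup` and the enum names are introduced here only for illustration, while the cycle counts are copied from the table.

```c
/* Speed-up of a scheme relative to a baseline, from cycle counts. */
double speedup(unsigned baseline_cycles, unsigned scheme_cycles)
{
    return (double)baseline_cycles / (double)scheme_cycles;
}

/* 32x32 cycle counts copied from table 5.2 */
enum {
    SHARED_SIMD1    = 34201, /* baseline: Shared-SPMU, SIMD-1       */
    SHARED_SIMD8    = 10069, /* DLP approach                        */
    DEDICATED_SIMD1 = 13536, /* TLP approach (Dedicated-SPMU)       */
    DEDICATED_SIMD8 = 6006   /* Hybrid approach (Dedicated-SPMU)    */
};
```

With these values, `speedup(SHARED_SIMD1, SHARED_SIMD8)` is about 3.4 (DLP), `speedup(SHARED_SIMD1, DEDICATED_SIMD1)` is about 2.5 (TLP), and `speedup(SHARED_SIMD1, DEDICATED_SIMD8)` is about 5.7 (Hybrid).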

Several other observations can be made from table 5.2. First, the Dedicated-SPI\_Shared-SPE approach achieved between 94% and 99% of the speed of the Dedicated-SPMU approach. This shows that sharing the resources impacts the speed only slightly as far as matrix convolutions are concerned.

Second, the speed-up of both approaches exploiting TLP (*Dedicated-SPMU* and *Dedicated-SPI\_Shared-SPE*) over the Shared-SPMU shows how much the pipeline stalls affect the speed in the Shared-SPMU.

Third, the embedded approaches (RV32E implementations), which were aimed at decreasing the registerfile footprint in the IMT architectures, had somewhat discouraging performance results. Comparing NO\_ACCL\_RV32EM to NO\_ACCL\_RV32IM showed a speed degradation of 30% in small matrix convolutions, and the degradation went up to 41% in the large convolutions. This nonlinear degradation on bigger convolutions is mostly due to the increase in memory transfers to the stack section of the data memory, since the registerfile in RV32E has very little space for saved registers compared to the normal registerfile in RV32I.

Figure 5.13 shows the speed boost from exploiting the DLP, the TLP, and the Hybrid approach where both DLP and TLP are exploited. As expected, the Hybrid approach had the biggest boost. Comparing DLP and TLP alone, we saw that for small matrices TLP gave the higher performance; as the matrices grew larger (i.e. beyond 16x16), the TLP boost remained the same while the DLP boost overtook it.



Figure.5.13. Speed boost from exploiting the DLP, TLP, and both together (Hybrid)

The reasons for not seeing much speed-up from DLP with small vectors are:

• The nature of the SPMU being already superscalar with the other non-SPMU execution units does well in hiding the latencies of its instructions.

• The size of the vectors is small such that doubling the functional units can save only a few cycles and not much more.
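The second point can be illustrated with a toy cost model. The sketch below is an assumption made purely for illustration, not the measured timing of the SPMU: it charges a fixed start-up overhead per vector instruction plus one cycle per group of `simd` elements. The function name `vector_cycles` and the overhead value used in the usage note are hypothetical.

```c
/* Toy model (an assumption, not measured data): a vector instruction
 * costs a fixed start-up overhead plus one cycle for each group of
 * `simd` elements that are processed in parallel. */
unsigned vector_cycles(unsigned n_elements, unsigned simd, unsigned overhead)
{
    return overhead + (n_elements + simd - 1) / simd; /* ceil(n / simd) */
}
```

With a hypothetical overhead of 10 cycles, a 4-element vector goes from 14 cycles at SIMD-1 to 12 cycles at SIMD-2, a saving of only 2 cycles, whereas a 1024-element vector goes from 1034 to 522 cycles.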

Table 5.3 shows the top frequency of the T13 and the PULPino Riscy cores after post-synthesis implementation. The timing constraint used in the synthesis was 1 ns, a tight constraint that compels Vivado to synthesize the fastest layouts possible.

| Core                 | Scheme                   | SIMD | Top Frequency<br>(MHz) |
|----------------------|--------------------------|------|------------------------|
| Klessydra T13        | Shared SPMU              | 1    | 165.29                 |
| Klessydra T13        | Shared SPMU              | 2    | 151.17                 |
| Klessydra T13        | Shared SPMU              | 4    | 141.16                 |
| Klessydra T13        | Shared SPMU              | 8    | 129.99                 |
| Klessydra T13        | Dedicated SPMU           | 1    | 156.35                 |
| Klessydra T13        | Dedicated SPMU           | 2    | 130.58                 |
| Klessydra T13        | Dedicated SPMU           | 4    | 111.51                 |
| Klessydra T13        | Dedicated SPMU           | 8    | 108.35                 |
| Klessydra T13        | Dedicated SPI Shared SPE | 1    | 140.06                 |
| Klessydra T13        | Dedicated SPI Shared SPE | 2    | 131.04                 |
| Klessydra T13        | Dedicated SPI Shared SPE | 4    | 116.80                 |
| Klessydra T13        | Dedicated SPI Shared SPE | 8    | 102.31                 |
| Klessydra T13        | NO_ACCL (RV32IM)         | NA   | 206.31                 |
| Klessydra T13        | NO_ACCL (RV32EM)         | NA   | 209.60                 |
| Klessydra T13        | NO_ACCL (RV32I)          | NA   | 185.53                 |
| Klessydra T13        | NO_ACCL (RV32E)          | NA   | 216.64                 |
| RISCY                |                          | NA   | 91.36                  |
| ZeroRiscy            |                          | NA   | 117.23                 |
| ZeroRiscy (no RV32M) |                          | NA   | 133.08                 |
| MicroRiscy           |                          | NA   | 146.11                 |

Table.5.3. Top frequency for each T13 configuration and Riscy Cores

Vivado was able to generate fast layouts for all the hardware schemes in the SIMD 1 and SIMD 2 configurations. However, the top frequency dropped more sharply as the DLP grew larger (SIMD 4 and SIMD 8), especially for the hybrid schemes exploiting both TLP and DLP. For the Dedicated-SPMU approach, the area overhead became large enough that the FPGA slices were placed farther away from each other, thus increasing the net delay between them.

The Dedicated-SPI-Shared-SPE approach witnessed an even larger drop in the top frequency for large SIMD configurations. Looking at the timing report from Vivado, we saw that the crossbar that maps the Dedicated-SPI input data buses to the shared SPE functional units became the critical path in the SPMU for both the SIMD 4 and SIMD 8 implementations. One way to make this scheme faster is to pipeline the crossbar and so divide the critical path. However, we will see in section 5.6.1 why this is not a very favorable approach.

Figure 5.14 shows the execution time needed to run the convolutions on all the schemes from table 5.3 when operating at the maximum frequency. The figure is separated into two sides: the left side shows the SPMU hardware schemes, while the right side shows the non-accelerated implementations of the T13 and Riscy cores. They were separated so that the very high values on the right side do not obscure the improvements of the TLP and DLP in the SPMU schemes on the left side.

Beginning with our evaluations, increasing the DLP in bigger convolutions such as 16x16 and 32x32 did provide a decrease in the execution time for all the SPMU schemes. Smaller convolutions actually got slower when increasing the DLP, because the sharp drop in the top frequency seen in table 5.3 outweighed the reduction in the cycle count.
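The interplay between cycle count and top frequency can be checked with a one-line calculation: the execution time is the cycle count from table 5.2 divided by the top frequency from table 5.3. The sketch below is only illustrative; the helper name `exec_time_us` is hypothetical.

```c
/* Execution time in microseconds: cycles divided by frequency in MHz. */
double exec_time_us(unsigned cycles, double freq_mhz)
{
    return (double)cycles / freq_mhz;
}
```

For the Dedicated-SPMU scheme, a 4x4 convolution takes `exec_time_us(626, 156.35)`, about 4.0 µs, at SIMD-1, but `exec_time_us(560, 108.35)`, about 5.2 µs, at SIMD-8, so the deeper SIMD is slower. For a 32x32 convolution the same scheme goes from about 86.6 µs (13536 cycles at 156.35 MHz) down to about 55.4 µs (6006 cycles at 108.35 MHz).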

One conclusion can be made here: although increasing the DLP multiplies the processor's ability to process data in parallel and thus decreases the cycle count, the processor might in turn perform slightly worse, especially when the vectors being worked on are small (figure 5.14, convolution 4x4).

Figure.5.14. Total execution time to perform convolutions when running at the maximum attainable frequency for accelerated and non-accelerated implementations

Comparing the non-accelerated T13 schemes to the Riscy cores, the T13 cores highly outperformed the Riscy cores, since not only do they have a good cycle count, but they also attain a very high top frequency in comparison with the other cores.

- The good cycle count comes as a result of the T13 cores having zero data-dependency pipeline stalls, zero pipeline flushing, and a low-latency multiplication instruction.
- The high frequency is attained from pipelining and hardware simplicity.

The fact that the non-accelerated implementations of T13 outperformed the PULPino Riscy cores makes us confident that, as far as CNN accelerators are concerned, it is better to use an IMT architecture than an in-order execution processor.

One final note is that, once again, implementations using the embedded extension RV32E had somewhat discouraging results, which did not convince us that migrating towards an IMT architecture with smaller registerfiles is better than using the normal registerfile size as defined in the RV32I ISA.

# 5.5.3. VGG16 Deep Convolutional Neural Networking Application

In order to further evaluate our SPMU accelerator when executing neural network applications, we had to make the SPMU execute an entire CNN. For that, we have chosen the well-known VGG16 DCNN [40]. VGG16 is a very successful DCNN that can achieve accuracies of up to 92.7%, and it is used in many classification tasks [41][42][43]. The layers of the VGG16 network are shown in figure 5.15. In order to fully support the convolution layers of the VGG16, the matrix convolutions from the previous sections were combined with other libraries that perform pre-scaling, post-scaling, addbias, and ReLU, as well as a set of libraries for the fully-connected layers. The remaining parts of the network did not undergo acceleration (e.g. *softmax*, *maxpool*). After having built a single VGG16 test to run on the various implementations of the SPMU, we ran a particular set of tests to evaluate the performance of the T13 IMT architecture.
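To make the role of two of these helper libraries concrete, the sketch below gives a plain scalar version of the add-bias and ReLU steps on 32-bit integers. It is a minimal illustration only: the function names `add_bias_ref` and `relu_ref` and their signatures are hypothetical and are not the SPMU-accelerated routines of the thesis.

```c
#include <stdint.h>

/* Add one bias value to every element of a feature map (scalar sketch). */
void add_bias_ref(int32_t *data, int n, int32_t bias)
{
    for (int i = 0; i < n; i++)
        data[i] += bias;
}

/* ReLU: negative values are clamped to zero, all others pass through. */
void relu_ref(int32_t *data, int n)
{
    for (int i = 0; i < n; i++) {
        if (data[i] < 0)
            data[i] = 0;
    }
}
```

Applied in sequence to the values `{-5, 0, 3}` with a bias of 2, the two steps give `{-3, 2, 5}` and then `{0, 2, 5}`.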

Two tests are shown in figures 5.16 and 5.17. The first shows the difference in performance between running the VGG16 using one hart only and dividing the workload over all the harts in the core. The second compares the IMT Dedicated-SPMU with all harts active against an in-order architecture, Zeroriscy.

The difference between the single-thread test (1 hart active) and the multi-thread test (all harts active) outlines one very important aspect of IMT architectures. Both implementations interleave three harts in the core. However, the single-thread implementation shows how poorly an IMT core performs when the other harts are idle. When all the harts become active and the workload is divided among them, we see a large drop in the cycle count, as is evident in figure 5.16.

From the results of the previous sub-section, we chose the Dedicated-SPMU SIMD-2 as a very fast yet balanced option to be compared with an in-order architecture such as Zeroriscy. A few layers were developed to execute on that version of the SPMU, and they were compared with the Zeroriscy core as shown in figure 5.17.
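One simple way to divide a workload among the harts is to give each hart a contiguous chunk of the data. The sketch below is only an illustration of that idea; how the thesis libraries actually partition the work is not shown here, and the function name `hart_chunk` is hypothetical.

```c
/* Split n items among n_harts harts as evenly as possible (sketch).
 * Hart `hart_id` processes the index range [*start, *end). */
void hart_chunk(int hart_id, int n_harts, int n, int *start, int *end)
{
    int base = n / n_harts; /* items that every hart receives         */
    int rem  = n % n_harts; /* the first `rem` harts receive one more */

    *start = hart_id * base + (hart_id < rem ? hart_id : rem);
    *end   = *start + base + (hart_id < rem ? 1 : 0);
}
```

For 50 items and three harts, hart 0 receives indices 0 to 16, hart 1 receives 17 to 33, and hart 2 receives 34 to 49, so that no hart is left idle while another is still working on a much larger share.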



From figure 5.16 we can still affirm that, when running real-life applications such as the VGG16, the SPMU accelerator indeed maintains the fast trend that was displayed back in figure 5.13.



Figure.5.16. Klessydra T13 Shared-SPMU, Single Thread vs Multithread cycle count per layer for VGG16



Figure.5.17. Klessydra T13 Dedicated-SPMU SIMD-2 vs Zeroriscy cycle count per layer for VGG16 execution

To conclude the performance evaluation, we saw the difference between an IMT core and an in-order processor. An IMT processor certainly performed better when the applications were decoupled. Synthesis results showed that IMT processors had very high top frequencies. Attaching the different SPMU schemes showed the contribution of each scheme to the performance, and showed how DLP and TLP benefit the processor differently for small and big vector computations. Finally, layers of the VGG16 neural network were run, and they showed how the SPMU accelerator fared in real-life applications.

# 5.6. Area, Power, and Energy Reports

# 5.6.1. Area Utilization

Table 5.4 reports the area utilization on the FPGA when synthesizing for the Genesys2 board [29]. We can clearly see that the area increase due to the DLP was significant, especially in the Hybrid approaches exploiting both DLP and TLP. One small conclusion can be made here: the speed boost from the DLP shown in the previous section was on average smaller than the TLP speed boost, and yet the DLP-exploiting schemes (Shared-SPMU SIMD-8) consumed a larger area than the TLP-exploiting schemes (Dedicated-SPMU SIMD-1 and Dedicated-SPI Shared-SPE).

An additional important note to take from these results is that the crossbar in the Dedicated-SPI-Shared-SPE version is large enough that its LUT utilization is very similar to that of the Dedicated-SPMU version; the reduction in element utilization was only in the FFs and the DSP slice count. Pipelining the crossbar to get a higher top frequency is possible, but it would increase the FF utilization in the Dedicated-SPI-Shared-SPE, and hence the FFs saved by sharing the FUs would be spent on pipelining the crossbar, making this approach of little use relative to the Dedicated-SPMU approach. Still, the Dedicated-SPI-Shared-SPE approach remains worth considering since, as seen from the results, sharing the functional units saves a large number of DSP slices.

| Core                 | Scheme                   | SIMD | FF    | LUT   | BRAM | DSP |
|----------------------|--------------------------|------|-------|-------|------|-----|
| Klessydra T13        | Shared SPMU              | 1    | 6552  | 10655 | 6    | 8   |
| Klessydra T13        | Shared SPMU              | 2    | 6907  | 12835 | 6    | 12  |
| Klessydra T13        | Shared SPMU              | 4    | 7587  | 15807 | 6    | 20  |
| Klessydra T13        | Shared SPMU              | 8    | 9064  | 21423 | 12   | 36  |
| Klessydra T13        | Dedicated SPMU           | 1    | 7782  | 14344 | 18   | 16  |
| Klessydra T13        | Dedicated SPMU           | 2    | 8875  | 13017 | 18   | 28  |
| Klessydra T13        | Dedicated SPMU           | 4    | 10903 | 28309 | 18   | 52  |
| Klessydra T13        | Dedicated SPMU           | 8    | 15223 | 46861 | 36   | 100 |
| Klessydra T13        | Dedicated SPI Shared SPE | 1    | 7234  | 14229 | 18   | 9   |
| Klessydra T13        | Dedicated SPI Shared SPE | 2    | 8009  | 18803 | 18   | 12  |
| Klessydra T13        | Dedicated SPI Shared SPE | 4    | 9167  | 27150 | 18   | 20  |
| Klessydra T13        | Dedicated SPI Shared SPE | 8    | 11460 | 48081 | 36   | 36  |
| Klessydra T13        | NO_ACCL (RV32IM)         | NA   | 5639  | 7975  | 0    | 4   |
| Klessydra T13        | NO_ACCL (RV32EM)         | NA   | 4165  | 8120  | 0    | 4   |
| Klessydra T13        | NO_ACCL (RV32I)          | NA   | 5424  | 7674  | 0    | 0   |
| Klessydra T13        | NO_ACCL (RV32E)          | NA   | 3890  | 7414  | 0    | 0   |
| RISCY                |                          | NA   | 2527  | 7674  | 0    | 6   |
| ZeroRiscy            |                          | NA   |       | 3275  | 0    | 1   |
| ZeroRiscy (no RV32M) |                          | NA   |       | 2832  | 0    | 0   |
| MicroRiscy           |                          | NA   | 1279  | 2434  | 0    | 0   |

Table.5.4. T13 Area Utilization on FPGA for all SPMU Configurations


Comparing the Riscy and Zeroriscy cores with the non-accelerated T13 cores, we clearly see a larger area occupation in the non-accelerated T13 cores. The obvious reason is that, in order for the T13 core to be an IMT architecture, we had to replicate the registerfile, the CSR unit and the program counter. One way to decrease the overhead that IMT architectures have is to disable the performance counters in the CSR unit. Doing that saved approximately 1200 LUTs from the LUT count listed above. Another is to use the embedded extension RV32E, which halves the size of the registerfile. However, we saw how badly that affected the performance, and thus trading performance for a smaller registerfile area is not a favorable step in this case.

# 5.6.2. Dynamic Power Consumption and Energy Efficiency

The average dynamic power consumption for running the convolutions on each hardware scheme is reported in figure 5.18. As expected, the power consumption increases as the area gets bigger, but the curve rises very sharply for the SIMD 8 configurations. Deep SIMD configurations proved to be less power efficient (especially in FPGA synthesis), as they consume a lot of power, particularly in the hardware schemes exploiting the TLP. SIMD 2 configurations showed, for all hardware schemes, only a slight increase in dynamic power consumption on the one hand and a greater increase in performance on the other, which makes SIMD 2 a balanced choice.

Besides having a small area footprint, the Riscy cores also all consumed less dynamic power than the non-accelerated T13 cores. The RV32E implementations showed larger drops in dynamic power consumption as well.

The static power is not reported, since for FPGAs the static power does not change with the area utilization, but rather depends on the technology of the FPGA itself.



Figure.5.18. Dynamic Power Consumption of the T13 core running 32x32 convolutions

Figure 5.19 shows the total energy consumption for running the different convolutions. The results were again divided into two sides: the left side for the accelerated schemes, and the right side for the non-accelerated ones. They were separated because the non-accelerated implementations had a very high energy consumption compared to their accelerated counterparts; if placed together, the non-accelerated results would have obscured the differences between the accelerated schemes.



Figure.5.19. Energy Consumption for running each implementation at the top frequency on the different convolution sizes
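Assuming the energy is obtained as the average dynamic power multiplied by the execution time at the top frequency, the relation can be sketched as below. The helper name `energy_uj` is hypothetical, and the power value used in the usage note is a placeholder, not a measured result from this work.

```c
/* Energy in microjoules = average dynamic power (mW) * time (us) / 1000,
 * where the execution time is cycles / frequency (MHz). */
double energy_uj(double power_mw, unsigned cycles, double freq_mhz)
{
    double time_us = (double)cycles / freq_mhz;
    return power_mw * time_us / 1000.0;
}
```

For example, 8681 cycles at 130.58 MHz (the Dedicated-SPMU SIMD-2 scheme on a 32x32 convolution) with a placeholder power of 100 mW would give about 6.65 µJ. The relation also shows why a scheme with a higher power draw can still be more energy efficient, provided that its execution time shrinks by a larger factor.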

Many conclusions can be drawn from these results. First, using the SPMU accelerator not only gives faster results, but is also more energy efficient than not using it.

Second, comparing the SPMU accelerators, we can see that the Shared-SPMU has the worst results, and that both TLP-exploiting approaches gave much better results than the Shared-SPMU.

Third, the Dedicated-SPMU and the Dedicated-SPI\_Shared-SPE approaches almost overlap in energy consumption, just as they did in performance. This is a very good result, since it shows that sharing the SIMD functional units saves a large amount of area at a very small cost in performance and energy consumption.

Finally, comparing the non-accelerated implementations with each other, we see that the T13 is slightly less energy efficient than both Riscy and Zeroriscy. Zeroriscy has a very low dynamic power consumption, while Riscy has a low cycle count; both factors contributed heavily to their energy efficiency.

# 5.7. Further Evaluations (memory test, GCC optimizations)

A few additional tests were performed to check the consistency of the performance when using the GCC optimization flag "-O2". Figure 5.20 shows the cycle count to perform vector addition when compiling the C tests without any GCC optimizations, while figure 5.21 shows the same results with GCC optimizations enabled.



Figure.5.20. Vector addition C test performed with GCC optimizations disabled



Figure.5.21. Vector addition C test performed with GCC optimizations enabled

The results above show that disabling the GCC optimizations affected the performance of both operations. However, for the operations using the accelerator, the cycle count increase is a constant offset, while for the non-accelerated vector addition the increase is a variable offset that grows linearly with the vector size.

Another evaluation was made to show the memory impact of performing the same operation in different ways (table 5.5). The first test does not use the SPMU accelerator. The second performs the same operation using the SPMU. For the SPMU, two memory tests were made: the first performs all the SPMU operations in a single function call, while the other performs the same operations across multiple function calls.

| Vector<br>Size | Normal Addition Test<br>With GCC Optimization<br>Data mem size | Normal Addition Test<br>With GCC Optimization<br>Program mem size | Normal Addition Test<br>Without GCC Optimization<br>Data mem size | Normal Addition Test<br>Without GCC Optimization<br>Program mem size | SPMU Single Funct Call Test<br>With GCC Optimization<br>Data mem size | SPMU Single Funct Call Test<br>With GCC Optimization<br>Program mem size | SPMU Single Funct Call Test<br>Without GCC Optimization<br>Data mem size | SPMU Single Funct Call Test<br>Without GCC Optimization<br>Program mem size | SPMU Multi Funct Call Test<br>With GCC Optimization<br>Data mem size | SPMU Multi Funct Call Test<br>With GCC Optimization<br>Program mem size | SPMU Multi Funct Call Test<br>Without GCC Optimization<br>Data mem size | SPMU Multi Funct Call Test<br>Without GCC Optimization<br>Program mem size |
|----|------|------|------|------|------|------|------|------|------|------|------|------|
| 1  | 1326 | 3059 | 1300 | 3533 | 1378 | 3477 | 1352 | 3705 | 1378 | 3230 | 1352 | 3591 |
| 10 | 2730 | 3211 | 2704 | 3572 | 2782 | 3477 | 2756 | 3705 | 2782 | 3230 | 2756 | 3591 |
| 20 | 4290 | 3211 | 4264 | 3572 | 4342 | 3477 | 4316 | 3705 | 4342 | 3230 | 4316 | 3591 |

Table.5.5. Size in Bytes of the program memory and data memory for different tests

The results from the memory tests show that using the SPMU has little impact on the memory size: the results are similar to those of the non-SPMU test. For the data memory, the only impact on the size came from increasing the vector size, regardless of whether the SPMU was used or not.

# Chapter 6 C Language Software Suite

This chapter shows the implementation of the software suite used in benchmarking the T13 microprocessor. All the tests were written in C and compiled by a patched RISCV-GCC compiler. The first section shows the instruction level testing of the custom SPMU instructions. The second section shows how the custom instructions were used to make convolutions. The third section mentions the additional libraries needed in order to accelerate the convolution and fully-connected layers of the VGG16 DCNN application.

# 6.1. Instruction Level Testing

For every custom instruction in the SPMU, a C test has been written to check whether the SPMU executes the instruction correctly. All the tests check whether the SPMU outputs match the non-SPMU outputs, and benchmark the performance of the SPMU for all data types (8, 16, and 32 bits).

The example test shown in the code below takes the number of elements in each vector and a time variable used as the random seed, and fills the vectors with random data using the *rand* function. The test sets the MVTYPE and then calls a C function that uses all the harts in the core to load the vectors and compute the results. The cycle count to perform the arithmetic operation is counted and saved. The output results are checked for correctness, and then the performance is compared to that of the non-accelerated test.

The code below shows how the vector addition instruction KADDV is tested for 32-bit data types. Other data types and instructions are not shown because the code sequence is repetitive; their implementation can be inferred from this one.

```
/* ------ KADDV Test ------ */
/* standard headers; the Klessydra runtime headers are not shown */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NumOfThreads 3
#define NumOfElements 50
#define TIME 10

int32_t vect32_1[NumOfElements], vect32_2[NumOfElements];
int32_t testres32[NumOfElements];
int32_t *res32;
int32_t result32[NumOfElements];
int size32 = NumOfElements * sizeof(int);
int testperf, perf32[NumOfThreads];

int main() {
    srand(TIME);
    for (int i = 0; i < NumOfElements; i++) {
        vect32_1[i] = rand() % (0x8000000 - 0x1) + 1;
        vect32_2[i] = rand() % (0x8000000 - 0x1) + 1;
    }
    int add_pass = 0;
    int perf = 0;
    int *ptr_perf = &perf;

    /* 32-bit KADDV here */
    VECT_ADD_32:
    sync_barrier();
    // ENABLE COUNTING ------
    __asm__("csrrw zero, 0x7A0, 0x0000001");
    //-----
    // SET MVTYPE -----
    __asm__("csrrw zero, mvtype, 0x0000002"); // set the data type to 32-bits
    //-----
```

```
    // TEST KADDV(32) ------
    /* call the function that performs the KADDV operation */
    res32 = kless_vector_addition_mth((void*) result32, (void*) vect32_1, (void*) vect32_2, size32);
    //-----
    // DISABLE COUNTING AND SAVE MCYCLE OF EACH THREAD ------
    __asm__("csrrw zero, 0x7A0, 0x0000000;"
            "csrrw %[perf], mcycle, zero;"
            "sw %[perf], 0(%[ptr_perf]);"
            :
            :[perf] "r" (perf), [ptr_perf] "r" (ptr_perf)
            );
    if (Klessydra_get_coreID()==0) perf32[0]=perf; // store the cycle count of thread 2
    if (Klessydra_get_coreID()==1) perf32[1]=perf; // store the cycle count of thread 1
    if (Klessydra_get_coreID()==2) perf32[2]=perf; // store the cycle count of thread 0
    //------

    // Test 32-bit addition result -----
    if (Klessydra_get_coreID()==1) {
        __asm__("csrrw zero, 0x7A0, 0x00000001;"); // enable counting
        for (int i = 0; i < NumOfElements; i++) {
            testres32[i] = vect32_1[i] + vect32_2[i]; // perform the addition without acceleration
        }
        __asm__("csrrw zero, 0x7A0, 0x00000000;" // disable counting and save the cycle count
                "csrrw %[perf], mcycle, zero;"
                "sw %[perf], 0(%[ptr_perf]);"
                :
                :[perf] "r" (perf), [ptr_perf] "r" (ptr_perf)
                );
        testperf = perf;
        for (int i = 0; i < NumOfElements; i++) {
            if (res32[i] == testres32[i]) { // check every element
                add_pass++;
            }
            else {
                goto FAIL_VECT_ADD_32; // if an error is encountered goto the error label
            }
        }
        if (add_pass == NumOfElements) {
            printf("\nPASSED KADDV32 32-bit vector addition"); // all outputs are correct print pass
        }
    }
    if (Klessydra_get_coreID()==1) {
        printf("\n\nNumber of Elements:%d\n", NumOfElements);
        for (int i = 0; i < 3; i++) {
            printf("Th%d KADDV32 Speed: %d Cycles\n", i, perf32[i]); // print cycle count of SPMU
        }
        printf("ADDV32 Speed: %d Cycles\n", testperf); // print the cycle count and end the program
        return 0;
    }
    __asm__("csrrw zero, mstatus, 8;" "wfi;"); // stall the harts that finish
83
              // ----- Fail Section ------
                                                 _____
84
              FAIL VECT ADD 32: // error label
85
              printf("\nFAILED KADDV32 32-bit vector addition\n"); // print fail
86
              return 1;
```

The function "*kless\_vector\_addition\_mth*" performs the KADDV using all the harts in the T13 core, as seen in the code below. The first hart that enters atomically loads the vector *vs1* and then exits the routine. The second hart atomically loads the second vector *vs2* to the SPMs and exits the function. The third hart performs the vector addition, stores the result back in main memory, then exits the function.

```
void* kless_vector_addition_mth(void *result, void* src1, void* src2, int size){
    int SPMADDRA = spmaddrA; // base address of spmA
    int SPMADDRB = spmaddrB; // base address of spmB
    int SPMADDRC = spmaddrC; // base address of spmC
    int key = 1; // the key locks some routines from being executed
    static int section1 = 0;
    static int section2 = 0;
    int* psection1 = &section1;
    int* psection2 = &section2;
    asm volatile(
        "   amoswap.w.aq %[key], %[key], (%[psection1]);"
        "   bnez %[key], SCP_copyin_vect_2;"
        "SCP_copyin_vect_1:"
        "   kmemld %[SPMADDRA], %[srcA], %[sz];" // load vector vs1
        "   j END;"
        "SCP_copyin_vect_2:"
        "   amoswap.w.aq %[key], %[key], (%[psection2]);"
        "   bnez %[key], END;"
        "   kmemld %[SPMADDRB], %[srcB], %[sz];" // load vector vs2
        "   csrw 0xBF0, %[sz];" // set the vector size
        "   kaddv %[SPMADDRC], %[SPMADDRA], %[SPMADDRB];" // KADDV operation
        "   kmemstr %[result], %[SPMADDRC], %[sz];" // store back the result in memory
        "END:"
        :
        :[key] "r" (key), [psection1] "r" (psection1), [psection2] "r" (psection2),
         [sz] "r" (size),
         [SPMADDRA] "r" (SPMADDRA), [srcA] "r" (src1),
         [SPMADDRB] "r" (SPMADDRB), [srcB] "r" (src2),
         [SPMADDRC] "r" (SPMADDRC), [result] "r" (result)
    );
    return result;
}
```
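
The section locks in the routine above behave like a test-and-set: a hart swaps the key into the section flag, and only the hart that reads back zero owns that section. The following is a minimal portable sketch of that idea, assuming C11 `atomic_exchange_explicit` as a stand-in for `amoswap.w.aq`. The names `claim_section` and `pick_job` are hypothetical and are not part of the Klessydra libraries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true for exactly one caller per flag.
 * The exchange writes 1 and returns the previous value, like amoswap. */
bool claim_section(atomic_int *section)
{
    assert(section != NULL);
    return atomic_exchange_explicit(section, 1, memory_order_acquire) == 0;
}

/* Returns the section the caller has claimed: 1, 2, or 0 if both are taken. */
int pick_job(atomic_int *section1, atomic_int *section2)
{
    if (claim_section(section1)) return 1; /* first hart in: owns section 1 */
    if (claim_section(section2)) return 2; /* next hart in: owns section 2 */
    return 0;                              /* later harts skip to the end   */
}
```

Calling `pick_job` three times on two zero-initialized flags hands out section 1, then section 2, then nothing, which mirrors the order in which the harts fall through the two `amoswap` instructions.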

Another function that does the above routine with a single thread only is shown below.

```
void* kless_vector_addition_sth(void *result, void* src1, void* src2, int size){
    int SPMADDRA = spmaddrA; // base address of spmA
    int SPMADDRB = spmaddrB; // base address of spmB
    int SPMADDRC = spmaddrC; // base address of spmC
    asm volatile(
        "   kmemld %[SPMADDRA], %[srcA], %[sz];" // load vector vs1
        "   kmemld %[SPMADDRB], %[srcB], %[sz];" // load vector vs2
        "   csrw 0xBF0, %[sz];" // set the vector size
        "   kaddv %[SPMADDRC], %[SPMADDRA], %[SPMADDRB];" // KADDV operation
        "   kmemstr %[result], %[SPMADDRC], %[sz];" // store back the result in memory
        "END:"
        :
        :[sz] "r" (size),
         [SPMADDRA] "r" (SPMADDRA), [srcA] "r" (src1),
         [SPMADDRB] "r" (SPMADDRB), [srcB] "r" (src2),
         [SPMADDRC] "r" (SPMADDRC), [result] "r" (result)
    );
    return result;
}
```

An additional function was created in the SPMU libraries to benchmark the speed of the hardware loops. It executes the SPMU instructions repeatedly inside a software loop (a *for loop*), and the results are then compared with those of the hardware-loop version. The body of that function is shown below.

```
void* kless_vector_addition_sth_sw_loop(void *result, void* src1, void* src2, int size, int SIMD_BYTES){
    int SPMADDRA = spmaddrA; // base address of spmA
    int SPMADDRB = spmaddrB; // base address of spmB
    int SPMADDRC = spmaddrC; // base address of spmC
    int size_temp = size;    // total vector size
    asm volatile(
        "   kmemld %[SPMADDRA], %[srcA], %[size_temp];" // load vector vs1
        "   kmemld %[SPMADDRB], %[srcB], %[size_temp];" // load vector vs2
        "   csrw 0xBF0, %[SIMD_BYTES];" // set the vector size
        :
        :[size_temp] "r" (size_temp), [SIMD_BYTES] "r" (SIMD_BYTES),
         [SPMADDRA] "r" (SPMADDRA), [srcA] "r" (src1),
         [SPMADDRB] "r" (SPMADDRB), [srcB] "r" (src2)
    );
    for (int i=0; i<size_temp; i=i+SIMD_BYTES){ // loop through the vector elements
        size = size_temp-i; // remaining vector size
        if (size >= SIMD_BYTES){
            asm volatile(
                "   kaddv %[SPMADDRC], %[SPMADDRA], %[SPMADDRB];" // KADDV operation
                :
                :[SPMADDRA] "r" (SPMADDRA),
                 [SPMADDRB] "r" (SPMADDRB),
                 [SPMADDRC] "r" (SPMADDRC)
            );
            SPMADDRA+=SIMD_BYTES; // increment source A pointer
            SPMADDRB+=SIMD_BYTES; // increment source B pointer
            SPMADDRC+=SIMD_BYTES; // increment the destination pointer
        }
        else {
            /* if there is no need to loop anymore, then re-write the vector size and execute the last SPM line */
            asm volatile(
                "   csrw 0xBF0, %[size];"
                "   kaddv %[SPMADDRC], %[SPMADDRA], %[SPMADDRB];"
                :
                :[SPMADDRA] "r" (SPMADDRA),
                 [SPMADDRB] "r" (SPMADDRB),
                 [SPMADDRC] "r" (SPMADDRC),
                 [size] "r" (size)
            );
        }
    }
    SPMADDRC=spmaddrC;
    asm volatile(
        "   kmemstr %[result], %[SPMADDRC], %[size_temp];"
        :
        :[size_temp] "r" (size_temp), [SIMD_BYTES] "r" (SIMD_BYTES),
         [SPMADDRC] "r" (SPMADDRC), [result] "r" (result)
    );
    return result;
}
```
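
The chunking performed by the software loop can also be illustrated in plain C, without any SPMU instruction. The sketch below is only a scalar model under stated assumptions: `addv_chunk` is a hypothetical stand-in for one `kaddv` over a given number of bytes, and `addv_sw_loop` walks the vector in chunks of `simd_bytes`, with a shorter final chunk when the size is not a multiple of the chunk width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for one kaddv over `bytes` bytes of int32 data. */
void addv_chunk(int32_t *dst, const int32_t *a, const int32_t *b, size_t bytes)
{
    for (size_t k = 0; k < bytes / sizeof(int32_t); k++)
        dst[k] = a[k] + b[k];
}

/* Adds two vectors of `size` bytes in chunks of `simd_bytes` bytes.
 * Returns the number of chunk operations that were issued. */
int addv_sw_loop(int32_t *dst, const int32_t *a, const int32_t *b,
                 size_t size, size_t simd_bytes)
{
    assert(simd_bytes > 0 && simd_bytes % sizeof(int32_t) == 0);
    assert(size % sizeof(int32_t) == 0);
    int calls = 0;
    for (size_t i = 0; i < size; i += simd_bytes) {
        size_t remaining = size - i;                       /* bytes left */
        size_t chunk = remaining >= simd_bytes ? simd_bytes : remaining;
        size_t off = i / sizeof(int32_t);                  /* element offset */
        addv_chunk(dst + off, a + off, b + off, chunk);
        calls++;
    }
    return calls;
}
```

For a 20-byte vector and an 8-byte chunk width the loop issues three operations: two full chunks and one 4-byte tail.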

# 6.2. Convolution tests:

The convolution test comes with a set of functions called convolution2D (conv2D for short). In order to fully explain the algorithm of the conv2D functions we will demonstrate how a convolution is performed the conventional way, and how the algorithm was transformed to fit on the SPMs.

#### 6.2.1. Convolutions (traditional method)

The convolutions in neural networks are performed by sliding the kernel map, positioned by its central element, over every pixel of the feature map (also known as the input matrix). The kernel maps in our convolutions are 3x3 (as in the VGG16 filters). Consider the convolution of such a kernel with a 4x4 feature map, as shown in figure 6.1.

|   |   |   |   |   |     |     |     |
|---|---|---|---|---|-----|-----|-----|
| 1 | 2 | 1 | 1 |   | [0] | [1] | [2] |
| 2 | 2 | 1 | 2 | * | [3] | [4] | [5] |
| 2 | 1 | 2 | 1 |   | [6] | [7] | [8] |
| 1 | 1 | 1 | 2 |   |     |     |     |

Figure.6.1. Convolution of feature map on the left and kernel map on the right

When the kernel map starts sliding over the feature map from the top left corner, some elements of the kernel map do not overlap any element of the feature map. To overcome this, the feature map is padded with zeros around its entire perimeter, so that when the kernel map slides, each of its elements overlaps either the feature map or the padded zeros, as seen in figure 6.2.

| [0] | [1]   | [2]   |   |   |              | [0] 0 | [1] 0 | [2] 0 | 0 | 0 | 0 |
|-----|-------|-------|---|---|--------------|-------|-------|-------|---|---|---|
| [3] | [4] 1 | [5] 2 | 1 | 1 |              | [3] 0 | [4] 1 | [5] 2 | 1 | 1 | 0 |
| [6] | [7] 2 | [8] 2 | 1 | 2 | Zero-padding | [6] 0 | [7] 2 | [8] 2 | 1 | 2 | 0 |
|     | 2     | 1     | 2 | 1 |              | 0     | 2     | 1     | 2 | 1 | 0 |
|     | 1     | 1     | 1 | 2 |              | 0     | 1     | 1     | 1 | 2 | 0 |
|     |       |       |   |   |              | 0     | 0     | 0     | 0 | 0 | 0 |

Figure.6.2. Convolution of feature map on the left and kernel map on the right

Each kernel position gives one output pixel of the output map. When the kernel has passed over the entire feature map and produced all the output pixels, the convolution2D is complete.
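
As a plain-C reference for the procedure described above, the sketch below zero-pads the 4x4 feature map of figure 6.1 and slides a 3x3 kernel over it. It is an illustration only and not the SPMU library code; the function name `conv2d_padded` and the macros `FM` and `PAD` are assumptions made for this example. The kernel is indexed [0] to [8] row by row, as in the figures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FM  4          /* feature map order, as in figure 6.1 */
#define PAD (FM + 2)   /* order after zero-padding            */

/* Reference 3x3 convolution of an FM x FM map using a zero-padded copy. */
void conv2d_padded(int32_t in[FM][FM], const int32_t k[9], int32_t out[FM][FM])
{
    assert(k != NULL);
    int32_t p[PAD][PAD];
    memset(p, 0, sizeof p);                     /* zero border */
    for (int r = 0; r < FM; r++)
        for (int c = 0; c < FM; c++)
            p[r + 1][c + 1] = in[r][c];         /* map goes inside the border */

    for (int r = 0; r < FM; r++)
        for (int c = 0; c < FM; c++) {
            int32_t acc = 0;
            for (int kr = 0; kr < 3; kr++)
                for (int kc = 0; kc < 3; kc++)
                    acc += k[3 * kr + kc] * p[r + kr][c + kc];
            out[r][c] = acc;                    /* one pixel per kernel position */
        }
}
```

With the feature map of figure 6.1 and an illustrative kernel holding the values 1 to 9, the top left output pixel only receives contributions from kernel elements [4], [5], [7] and [8], exactly the overlap drawn in figure 6.2.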

### 6.2.2. Convolutions (sub-kernel method):

One drawback of the traditional method of performing a matrix convolution is the addition of zero-padding around the whole perimeter of the feature maps. This presents a few challenges, such as:

- High memory consumption. For example, a zero-padded 32x32 matrix of integers cannot fit in a 4KB scratchpad memory: it needs an extra 528 bytes of memory space, which is about 12.9% of the size of the original matrix.
- Slower memories in ASIC implementations. FPGAs have fixed-size BRAMs [44], so on FPGAs this might only affect the FF- or LUT-based memories. In an ASIC, however, zero-padding requires an 8KB memory for a 4KB feature map, a 4KB memory for a 2KB feature map, and so on. Bigger memories are usually slower than smaller memories, or have higher latencies.
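
The 528-byte figure follows directly from the matrix dimensions. A minimal sketch of the arithmetic, assuming 32-bit integers and one padding element on each side (the function name `padding_overhead_bytes` is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Extra bytes needed to zero-pad an n x n map of 32-bit integers
 * with a border that is one element wide on every side. */
size_t padding_overhead_bytes(size_t n)
{
    assert(n > 0);
    size_t plain  = n * n * sizeof(int32_t);
    size_t padded = (n + 2) * (n + 2) * sizeof(int32_t);
    return padded - plain;
}
```

For `n = 32` the plain map takes 4096 bytes and the padded map 4624 bytes, so the overhead is 528 bytes.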

The conv2D function therefore had to be re-written in order to avoid zero-padding. The key idea was to divide the conv2D function into separate functions, each of which performs a set of convolutions with a sub-kernel on a different region of the feature map, as shown in figure 6.3. We will demonstrate how the convolution with sub-kernel F was performed. The other sub-kernel implementations follow a similar pattern and thus will not be elaborated.



Figure.6.3. Division of the sub-kernels. On the left shows the overlap with sub-kernel F

The sub-kernels include only the overlapping parts between the kernels and the feature maps. In the above figure with the 4x4 matrix, we can see nine regions, each overlapped by a different part of the kernel map. Thus, performing a convolution requires calling nine functions, each performing the routine with a different sub-kernel.

The functions are divided into four groups. When the kernel centroid lands on one of the corners, we perform the A-C-G-I routines. Sliding the centroid between the corners on the first and last rows uses the B-H group. Likewise, sliding between the corners on the first and last columns uses the D-F group. When the kernel fully overlaps the feature map, the operation belongs to group E, which is the default convolution case.
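
The grouping can be summarized as a small selector that maps the position of the kernel centroid to a sub-kernel letter. This is a sketch only: the function name `sub_kernel_for` is hypothetical, and the layout (A, B, C on the first row; D, E, F on the inner rows; G, H, I on the last row) is an assumption inferred from the grouping described in the text.

```c
#include <assert.h>

/* Returns the sub-kernel letter ('A'..'I') for a centroid at (row, col)
 * of an n x n feature map. Assumed layout:
 *   A B C   first row
 *   D E F   inner rows
 *   G H I   last row                                                  */
char sub_kernel_for(int row, int col, int n)
{
    assert(n >= 3 && row >= 0 && row < n && col >= 0 && col < n);
    int r = (row == 0) ? 0 : (row == n - 1) ? 2 : 1;
    int c = (col == 0) ? 0 : (col == n - 1) ? 2 : 1;
    return (char)('A' + 3 * r + c);
}
```

Under this assumption a centroid at element (1,3) of the 4x4 map selects sub-kernel F, and the four corners select A, C, G and I.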

Considering the convolution with sub-kernel F, the output pixel is calculated as follows:

$$output \ pixel \mathrel{+}= ([0] * 1) + ([1] * 1) + ([3] * 1) + ([4] * 2) + ([6] * 2) + ([7] * 1)$$

The "+=" sign is present because the convolutions always accumulate into the output pixel. In addition, since our convolutions are performed with a fixed-point implementation, the outputs need to be post-scaled by a right shift of *ps* bits. Hence the equation would actually look like this.

$$\begin{aligned} output \ pixel \mathrel{+}= {} & (([0] * 1) \gg ps) + (([1] * 1) \gg ps) + (([3] * 1) \gg ps) \\ & + (([4] * 2) \gg ps) + (([6] * 2) \gg ps) + (([7] * 1) \gg ps) \end{aligned}$$
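
The post-scaled accumulation can be checked with a few lines of scalar C. The sketch below hard-codes the six feature map values overlapped by sub-kernel F at element (1,3) of figure 6.1; the function name `sub_kernel_f_pixel` and the kernel values used with it are illustrative assumptions, and non-negative operands are assumed so that the right shift is well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Contribution of sub-kernel F at centroid (1,3) of the figure 6.1 map.
 * Kernel elements 0,1,3,4,6,7 overlap the feature values 1,1,1,2,2,1.
 * Each product is shifted right by ps before it is accumulated. */
int32_t sub_kernel_f_pixel(const int32_t k[9], int ps)
{
    assert(k != NULL && ps >= 0 && ps < 31);
    static const int     idx[6] = {0, 1, 3, 4, 6, 7};
    static const int32_t fm[6]  = {1, 1, 1, 2, 2, 1};
    int32_t acc = 0;
    for (int t = 0; t < 6; t++)
        acc += (k[idx[t]] * fm[t]) >> ps;   /* post-scale, then accumulate */
    return acc;
}
```

The returned value is what would be added to the output pixel by the "+=" in the equation.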

The snippet of the code in figure 6.4 shows how to perform the convolution with sub-kernel F using the SPMU instructions.

```
// Sub-Kernel F
CSR_MVSIZE(2*SIZE_OF_INT);
kern_offset = 0;
fm_offset = (size-1-1);
for (int i=1; i<size-1; i++) {
  dest_in_C = (void*)spmaddrCoff + SIZE_OF_INT*(size*i) + SIZE_OF_INT*(1)*(size-1);
  dest_in_D = (void*)spmaddrDoff + SIZE_OF_INT*(size*i) + SIZE_OF_INT*(1)*(size-1);
  // previous row
  kdotpps32(dest_in_D,
            (void*)((int*)spmaddrAoff + (i-1)*size + fm_offset),
            (void*)((int*)spmaddrBoff + (0)*jump_kr_row + kern_offset));
  kaddv32(dest_in_C, dest_in_C, dest_in_D);
  // current row
  kdotpps32(dest_in_D,
            (void*)((int*)spmaddrAoff + (i)*size + fm_offset),
            (void*)((int*)spmaddrBoff + (1)*jump_kr_row + kern_offset));
  kaddv32(dest_in_C, dest_in_C, dest_in_D);
  // next row
  kdotpps32(dest_in_D,
            (void*)((int*)spmaddrAoff + (i+1)*size + fm_offset),
            (void*)((int*)spmaddrBoff + (2)*jump_kr_row + kern_offset));
  kaddv32(dest_in_C, dest_in_C, dest_in_D);
}
```

Figure.6.4. Sub-Kernel F executed in the SPMU

As figure 6.3 suggests, when the centroid overlaps the element (1,3), three different rows of two integers each are highlighted, hence the vector length of 2.

### 6.2.3. Choosing the best convolutions algorithm:

Although the sub-kernel method had a memory advantage over the zero-padded method, it suffered in cycle count because it did not exploit the SIMD nature of the SPMU very well. The zero-padded implementation, while consuming more memory, exploited the SIMD implementation in the SPMU very well, but it suffered in the memory loads because it was loading many zeros.

So instead of doing one burst load for the entire matrix with a "*kmemld*" instruction, we found that the optimal solution was to use the zero-padded method with a set of burst loads that load the discrete data lines of the matrix without the padded zeros. This removes the overhead of unnecessary memory transfers of zeros. In the scratchpad, consecutive data lines are separated by the padded zeros that lie between them. Figure 6.5 shows how it is done.

```
// loop the discrete kmemlds
for (int i=0; i<A_ORDER; i++) {
    kmemld(
        (void*)((int*)spmaddrA + ((i+1)*Z_ORDER + 1)),
        (void*)((int*)matA0+ (i*A_ORDER)),
        SIZE_OF_INT*(A_ORDER)
    );
}
```
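
The address arithmetic of the discrete loads can be tried out on a host machine with `memcpy` standing in for each `kmemld` burst. This is a sketch under stated assumptions: `Z_ORDER` is taken to be `A_ORDER + 2` (one zero on each side), the scratchpad is modelled as a flat array that is already zeroed, and the function name `load_rows_padded` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define A_ORDER 4               /* order of the source matrix          */
#define Z_ORDER (A_ORDER + 2)   /* assumed order of the padded matrix  */

/* Copies each row of mat into the interior of the padded buffer spm.
 * The border elements of spm are never written, so they stay zero.
 * One memcpy models one kmemld burst of A_ORDER integers. */
void load_rows_padded(int32_t *spm, const int32_t *mat)
{
    assert(spm != NULL && mat != NULL);
    for (int i = 0; i < A_ORDER; i++)
        memcpy(spm + (i + 1) * Z_ORDER + 1,   /* skip top border and left zero */
               mat + i * A_ORDER,
               sizeof(int32_t) * A_ORDER);
}
```

Only `A_ORDER * A_ORDER` integers are transferred, while the destination still ends up laid out as a zero-padded matrix.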



Figure 6.6 shows how the zero-padded convolutions are done using the SPMU instructions.

```
for (int i=1; i<size-1; i++)
{
    k_element=0;
    for (int rw_pt=-1; rw_pt<2; rw_pt++)
    // rw_pt is an index I use to point to the correct row; this loop is executed three times
    // instead of making 9 different ksvmulrf
    {
        ksvmulsc((void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
                 (void*)((int*)spmaddrAoff + (i+rw_pt)*size + 0),
                 (void*)((int*)spmaddrBoff + k_element++));
        ksrav((void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
              (void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
              (int*)conv2D_out_scal);
        ksvmulsc((void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
                 (void*)((int*)spmaddrAoff + (i+rw_pt)*size + 1),
                 (void*)((int*)spmaddrBoff + k_element++));
        ksrav((void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
              (void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
              (int*)conv2D_out_scal);
        ksvmulsc((void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
                 (void*)((int*)spmaddrAoff + (i+rw_pt)*size + 2),
                 (void*)((int*)spmaddrBoff + k_element++));
        ksrav((void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
              (void*)((int*)(spmaddrDoff)) + SIZE_OF_INT*(size*i) + 1*SIZE_OF_INT,
              (int*)conv2D_out_scal);
    }
    // CSR_MVSIZE(size*size*SIZE_OF_INT);
    ksvmulrf((void*)spmaddrDoff, (void*)spmaddrDoff, (void*)zero);
}
```

Figure.6.6. Zero-Padded Convolution method using the SPMU instructions

# 6.3. Supplementary VGG16 libraries

Having built libraries capable of doing the matrix convolutions, there was still the need to supplement the VGG16 libraries with a few more functions in order to have it ready to accelerate the network.

First, functions were made for the AddBias and ReLu operations. Adding the bias to the output matrix is done with the following function:

"ksvaddsc\_v2 (dest, source1, source2, size);"

The function above sets the MVSIZE CSR to be equal to *size*, and calls the *ksvaddsc* SPMU instruction which adds the vector *source1* with the scalar *source2*, and stores the result in *dest*.

The operation is followed by calling a ReLu function that rectifies all the negative values.

"krelu((void\*)dest, (void\*)source);"

The function above rectifies the source vector in *source*, and places the output vector in *dest*.
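
For comparison, the two operations have the following scalar meaning. The sketch below is a plain-C reference and not the SPMU implementation; the names `add_bias_ref` and `relu_ref` are hypothetical, and 32-bit elements are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the bias step: dest[i] = src[i] + bias. */
void add_bias_ref(int32_t *dest, const int32_t *src, int32_t bias, size_t n)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i] + bias;
}

/* Scalar reference for ReLu: negative values are replaced by zero. */
void relu_ref(int32_t *dest, const int32_t *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i] > 0 ? src[i] : 0;
}
```

Applying `add_bias_ref` and then `relu_ref` to the same buffer reproduces the AddBias followed by ReLu sequence described above, one element at a time.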

What remains after this is the fully connected layer which can be simply implemented by one instruction called '*kdotpps*'.

Operations in VGG16 not handled by the SPMU are the following:

- The Maxpool layer halves the size of its input matrices by pooling the maximum value in a 2x2 filter that slides vertically and horizontally across the input matrix.
- As for the last part, layer\_22 is implemented using the *softmax()* function, which implements the *non-linear* function *softmax* for producing the probability distribution of all the possible outcomes.
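
The pooling step can be sketched in a few lines of C. This is an illustrative reference only, assuming a square row-major matrix of even order and a stride of 2, which is what halving the matrix size implies; the function name `maxpool2x2` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 2x2 max pooling with stride 2 over an n x n row-major matrix (n even).
 * The output matrix has order n/2. */
void maxpool2x2(const int32_t *in, int32_t *out, int n)
{
    assert(in != NULL && out != NULL && n > 0 && n % 2 == 0);
    int half = n / 2;
    for (int r = 0; r < half; r++)
        for (int c = 0; c < half; c++) {
            int32_t m = in[(2 * r) * n + 2 * c];   /* top left of the window */
            for (int dr = 0; dr < 2; dr++)
                for (int dc = 0; dc < 2; dc++) {
                    int32_t v = in[(2 * r + dr) * n + (2 * c + dc)];
                    if (v > m) m = v;
                }
            out[r * half + c] = m;
        }
}
```

A 4x4 input therefore produces a 2x2 output, each element being the maximum of one non-overlapping 2x2 window.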

With this, the libraries have become complete and can be used to accelerate the VGG16. The performance results were already reported in chapter 5.

# Conclusions

In this thesis we introduced the Klessydra-T branch of the Klessydra family of microprocessors. The Klessydra cores fully support the 32-bit RISC-V instruction set. The Klessydra-T cores support the base integer instructions "I", the atomic extension "A", and the multiplication/division extension "M". The T1 sub-branch of the Klessydra-T further extends the native RISC-V ISA with a set of custom specialized instructions for accelerating convolutional neural network applications. The motivation behind forming the Klessydra-T branch was to obtain higher energy efficiency and performance in IoT embedded systems, and the motivation behind adding a hardware accelerator in the T1 was to allow an easy migration of CNNs towards embedded systems.

Our study started by determining the optimal pipeline organization in interleaved multithreaded processors through an experimental assessment. We showed that pipelining the core consistently improved the performance, while interleaved multithreading kept the core free of delay slots, thus improving both the overall performance and the energy efficiency of executing a single instruction.

We further showed in an analytical assessment that deeper pipelines between the registerfile read and write ports are unfavorable (e.g. T04, T05, etc.), since the critical path would improve only slightly in soft-core implementations due to the growth of the net delay between FPGA elements, while the area would continue to grow linearly with every new hart. We also noted that the cycle count would become worse when executing practical applications on these deeply pipelined IMT architectures, so that overall performance would degrade in sequential single-hart applications and in tightly coupled parallel applications that require constant thread synchronization.

In another analytical assessment, we saw that introducing pipeline stages before the registerfile read ports does not increase the performance but rather degrades it, since it requires the IMT core to implement instruction flushing logic that was not needed previously, thus re-introducing the branch delay slots.

The spectrum of target applications covered in our earlier assessments showed that the applications that can benefit from the IMT approach are only a small portion of the entire spectrum. So, we attempted to develop an IMT processor coupled with a hardware accelerator that can serve more target applications. Since neural networks were becoming a hot topic in embedded systems, this drove us to develop a neural network accelerator called the SPMU.

In our basic evaluations of the SPMU, we saw how much both the low-latency scratchpad memories and the hardware loops (zero-overhead loops) contribute to the cycle count of the SPMU when executing different vector sizes. Further evaluations tested the cycle count improvement obtained by increasing the data level parallelism for small and large vectors. We determined that data level parallelism can improve the cycle count greatly for large vectors and only slightly for small vectors, because the T13 core hides the latency of the SPMU instructions almost completely when the vectors are small, and only moderately when the vectors are large.

Two more complex SPMU hardware schemes were employed. These two schemes exploited instruction level parallelism by increasing the thread level parallelism. The first scheme sets dedicated memory subsystems (SPI) for every hart, and dedicated functional units (SPE) as well. The other scheme employs dedicated memories (SPI) for every hart, but a shared set of functional units (SPE) to be used by all the harts. Both approaches decreased the cycle count even further than the basic Shared-SPMU approach. However, the Dedicated-SPMU scheme lost clock speed because of the large area overhead and the increase in the net delay, while the scheme containing shared functional units suffered in the top operating frequency because the crossbar connecting the SPI memories to the shared SPE functional units was very large. The speed drop becomes much more obvious in higher-level SIMD configurations. Nevertheless, the Dedicated-SPI\_Shared-SPE approach overlapped with the Dedicated-SPMU in overall performance, which was a good sign that only a tiny amount of performance was traded for a large saving in area.

The Dedicated-SPMU was further evaluated with a more practical test, namely executing the layers of the VGG16 deep convolutional neural network. The first test showcased the performance of the T13 IMT architecture when having only one active hart, and when having all the harts active. We further evaluated the performance of the Dedicated-SPMU versus the Zero-riscy core when executing the layers of the VGG16 test with both large and small vectors. The Dedicated-SPMU continued to show superior performance even in these real-life applications.

Area evaluations were made, and we showed how much the DLP impacts the area compared with the TLP. We also saw how big the crossbar was in the Dedicated-SPI\_Shared-SPE scheme. Finally, we saw how much area overhead the T13 IMT core has over the in-order Riscy and Zeroriscy cores.

Finally, the dynamic power consumption and the energy consumption were shown for all the SPMU configurations. We saw that the dynamic power increased considerably, especially in the SIMD-8 configurations, and that the SPMU schemes had a high power consumption. However, when it came to energy consumption, we saw that the Hybrid approach was the most energy efficient: the Dedicated-SPMU SIMD-2 and the Dedicated-SPI\_Shared-SPE SIMD-2 had the lowest energy consumption among all the hardware schemes.

Our study of the T13 showed how to easily build a high-performance and energy-efficient hardware accelerator for a well-balanced IMT architecture that interleaves a moderate number of harts. By simply adding a hardware accelerator that writes to its own dedicated memory, we allow superscalar execution between the instructions that write to different memories without stalls due to data dependencies, while still maintaining the same thread pool baseline and without needing to interleave any additional harts to fence between the memory accesses. The study can be generalized to any hardware accelerator for IMT architectures, and not only convolution engines.

# Appendix A

# Klessydra Technical Manual

# Chapter 1 Architecture overview

# 1.1 Features

The Klessydra processing core family is a set of processors featuring full compliance with the RISC-V instruction set and intended to be placed within the Pulpino microprocessor platform. To date, the Klessydra family includes

- a minimal gate count single-thread core, **Klessydra S0**. The S0 core is not maintained as open-source;
- a class of multi-threaded cores, **Klessydra T0**, available in different implementations called Klessydra T0ab;
- a class of extended versions of the T0 cores, named **Klessydra T1** cores, featuring an SPMU hardware accelerator;
- a class of fault-tolerant versions of the T0 cores, named **Klessydra F0x**, featuring fault-tolerance mechanisms for harsh environment applications.

The Klessydra core family features:

- Full compliance with the RISC-V architecture specification (instruction set, control and status registers, interrupt handling mechanism and calling convention);
- Compliance with the standard RISC-V compilation toolchain;
- Interleaved multi-threaded execution of RISC-V harts (hardware threads);
- Easy and standardized multi-threading programming interface;
- Core synthesis on FPGA (presently, Xilinx Series 7 implementations have been tested);
- Hardware compliance with the Pulpino microprocessor platform, as a pin-to-pin compatible alternative to the Pulpino RI5CY core;
- Software compliance with the Pulpino microprocessor platform, as compatible I/O memory map, interrupt handler memory map, program/data memory map;
- Extension of the Pulpino software test suite with custom tests designed specifically for the Klessydra cores.

### 1.2 Naming convention

The different cores available in the Klessydra family follow the naming convention depicted in Fig. 1.1.





#### **1.3 Supported Instruction Set**

To date, all the Klessydra cores implement the 32-bit integer RISC-V machine mode instruction set, namely user-level RV32I base integer instruction set version 2.1 and M-mode privileged instruction set version 1.1. T0 and T1 cores support the RV32IME set.

The T0 and T1 cores support the atomic instruction AMOSWAP.W from the RVA atomic instruction extension.

The T1 core extends the instruction set with non-native **custom** vector instructions for memory-to-scratchpad transfers and vector arithmetic operations. Vector instructions come in three different variants supporting different data widths (8-bit, 16-bit and 32-bit).

Only M-mode operation is supported, so that no operating system support is implemented. Yet, the Klessydra family comes with a baseline runtime system software layer that implements part of the interrupt handling features and part of the multi-threaded programming model.

#### 1.4 Multi-threading model

**Klessydra S0** core supports single thread execution (RISC-V hart) only, with the following features:

- The hart can be interrupted by a trap such as an external interrupt or instruction exception. Software interrupts are supported, although their use is expected to be impractical in a single-thread execution environment. When the trap handling routine ends the core resumes the original execution thread (see Chapter "Exception and Interrupts" for details);
- The core can enter an idle state by means of the WFI instruction; when an external interrupt arrives at the core, the core starts the execution of the interrupt handling routine as the new hart of execution.

- The hart can be halted and resumed by means of the *Fetch\_en* core interface signal.

**Klessydra T0**x, **T1**x and **F0**x cores implement interleaved multi-threading. At each clock cycle, a new instruction is fetched from a different hart (Fig. 1.2).



Fig. 1.2. Conceptual view of hardware thread (hart) interleaved execution

The execution has the following features:

- Each hart in the hardware thread pool can be either active or idle.
- An idle hart can be activated by an interrupt request directed to the hart. The core executes the interrupt handling routine within the hart. When the interrupt handling routine ends, the hart becomes idle again (see Chapter "Exception and Interrupts" for details).
- An active hart can be interrupted by instruction exceptions or interrupt requests. When the interrupt/exception handling routine ends and the signal fetch\_enable\_i is high, the core resumes the interrupted execution hart. (see Chapter "Exception and Interrupts" for details);
- An active hart can become idle by executing the WFI instruction;
- The maximum number of active harts is an architecture characteristic parameter called Thread Pool Size.
- Each hart is identified by an integer number ranging from 0 up to Thread Pool Size − 1.
- There is also a minimum number of active harts, needed to avoid data hazards between threads during the pipelined execution, called Thread Pool Baseline. The Thread Pool Baseline value is an architecture characteristic parameter related to the instruction pipeline organization implemented in the hardware microarchitecture of the core.
- When the number of active threads is less than the Thread Pool Baseline, one or more idle hart runs in the pipeline as NOP instructions.
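The interleaving rules listed above can be illustrated with a small behavioural sketch. It is not taken from the Klessydra sources: the function name `fetch_schedule`, the round-robin order by hart number, and the placement of the NOP slots after the active harts are all assumptions made for illustration only.

```python
def fetch_schedule(active_harts, thread_pool_baseline, cycles):
    """Return, cycle by cycle, which hart (or "NOP") occupies the fetch slot.

    Assumption: active harts are served round-robin in increasing hart
    number, and missing harts up to the Thread Pool Baseline are filled
    with NOP slots placed after the active ones.
    """
    slots = sorted(active_harts)
    # Pad with NOP slots when fewer harts than the baseline are active.
    slots += ["NOP"] * max(0, thread_pool_baseline - len(slots))
    if not slots:
        return ["NOP"] * cycles
    return [slots[c % len(slots)] for c in range(cycles)]


# Three active harts on a core with Thread Pool Baseline 3: no NOP slots.
full_pool = fetch_schedule({0, 1, 2}, 3, 6)
# One active hart on the same core: two NOP slots per rotation.
single_hart = fetch_schedule({0}, 3, 6)
```

Under these assumptions, a single active hart on a baseline-3 core issues one useful instruction every three cycles, which is why the full thread pool is needed to reach peak throughput.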

As a general note, a higher Thread Pool Baseline value corresponds to a higher sustainable clock frequency and generally indicates a higher performance when running at full thread pool. For example, a T03 core will significantly outperform a T02 core when executing 4 harts.

### 1.5 Core Interfaces

The core interface is signal-to-signal compatible with the Pulpino microprocessor platform, and as such it is the same as Pulpino RI5CY cores. The detailed description follows.

| Name       | Direction | Width | Notes                         |
|------------|-----------|-------|-------------------------------|
| clk_i      | In        | 1     | Core clock signal             |
| clock_en_i | In        | 1     | Core clock enable             |
| rst_ni     | In        | 1     | Core reset signal, active low |
| test_en_i  | In        | 1     | Core test enable (unused)     |

#### Table.1.1 Clock, reset active low, test enable

#### Table 1.2 Initialization signals

| Name         | Direction | Width | Notes              |
|--------------|-----------|-------|--------------------|
| boot_addr_i  | In        | 32    | Boot address value |
| core_id_i    | In        | 4     | Core id number     |
| cluster_id_i | In        | 6     | Cluster id number  |

#### Table 1.3 Program memory interface

| Name           | Direction | Width | Notes                                                  |
|----------------|-----------|-------|--------------------------------------------------------|
| instr_req_o    | Out       | 1     | Request signal, must stay high until accepted          |
| instr_gnt_i    | In        | 1     | Request accepted, address may change in the next cycle |
| instr_rvalid_i | In        | 1     | Instruction valid, stays high for exactly one cycle    |
| instr_addr_o   | Out       | 32    | Address                                                |
| instr_rdata_i  | In        | 32    | Instruction read from memory                           |

#### Table 1.4 Data memory interface

| Name          | Direction | Width | Notes                                                  |
|---------------|-----------|-------|--------------------------------------------------------|
| data_req_o    | Out       | 1     | Request signal, must stay high until accepted          |
| data_gnt_i    | In        | 1     | Request accepted, address may change in the next cycle |
| data_rvalid_i | In        | 1     | Data valid, stays high for exactly one cycle           |
| data_we_o     | Out       | 1     | Write enable, high = write, low = read                 |
| data_be_o     | Out       | 4     | Byte selection                                         |
| data_addr_o   | Out       | 32    | Address                                                |
| data_wdata_o  | Out       | 32    | Data to be written to memory                           |
| data_rdata_i  | In        | 32    | Data read from memory                                  |
| data_err_i    | In        | 1     | Memory error signal                                    |

#### Table 1.5 Interrupt request / acknowledge

| Name      | Direction | Width | Notes                                       |
|-----------|-----------|-------|---------------------------------------------|
| irq_i     | In        | 1     | Interrupt request signal                    |
| irq_id_i  | In        | 5     | Interrupt request vector value              |
| irq_ack_o | Out       | 1     | Interrupt acknowledge signal                |
| irq_id_o  | Out       | 5     | Interrupt acknowledge vector value (unused) |

#### Table 1.6 Debug interface

| Name           | Direction | Width | Notes                            |
|----------------|-----------|-------|----------------------------------|
| debug_req_i    | In        | 1     | Debug request                    |
| debug_gnt_o    | Out       | 1     | Debug request granted            |
| debug_rvalid_o | Out       | 1     | Debug data valid                 |
| debug_addr_i   | In        | 15    | Debug location address           |
| debug_we_i     | In        | 1     | Debug write enable               |
| debug_wdata_i  | In        | 32    | Debug data to be written to core |
| debug_rdata_o  | Out       | 32    | Debug data read from core        |
| debug_halted_o | Out       | 1     | Debug halt acknowledge           |
| debug_halt_i   | In        | 1     | Debug halt request               |
| debug_resume_i | In        | 1     | Debug resume signal              |

| Name                | Direction | Width | Notes                                        |
|---------------------|-----------|-------|----------------------------------------------|
| fetch_enable_i      | In        | 1     | Fetch enable, stops the core                 |
| core_busy_o         | Out       | 1     | Core busy signal                             |
| ext_perf_counters_i | In        | 1     | External performance counter signal (unused) |

# Chapter 2 Memory model and protocol

#### 2.1 Instruction Fetch

The instruction fetch stage of the core is called FSM\_IF and is able to supply one instruction to the instruction decode stage per cycle, if the program memory is able to serve one instruction per cycle. Instructions are word aligned, meaning that the two least significant bits in the PC are always set to 0, and the PC value is incremented by 4 units at each new fetch when no branch occurs. Compressed instruction format is not supported. No prefetch logic is present.

#### 2.2 Memory Access Protocol

The program and data memory access protocol is pin-to-pin compatible with the Pulpino microprocessor platform, and as such it is the same as that of the RI5CY / Zeroriscy cores. The protocol used to access the data memory works as follows. The program memory follows the same protocol except for the absence of write operation support.

The core provides a valid address in *data\_addr\_o* and sets *data\_req\_o* high. The memory then answers with *data\_gnt\_i* set high as soon as it is ready to serve the request. This may happen in the same cycle as the request is sent or any number of cycles later. After a grant is received, the address may be changed in the next cycle by the core. In addition, the *data\_wdata\_o*, *data\_we\_o* and *data\_be\_o* signals may be changed. After receiving a grant, the memory answers with *data\_rvalid\_i* set high if *data\_rdata\_i* is valid. This may happen one or more cycles after the grant has been received. The signal *data\_rvalid\_i* must also be set when a write operation is performed, although *data\_rdata\_i* has no meaning in this case. Figure 2.1, Figure 2.2 and Figure 2.3 show examples of the protocol timing.



Figure 2.1 Basic Memory Transaction (reprinted from RI5CY manual, rel. Jan 2017)



Figure 2.2 Back-to-Back Memory Transaction (reprinted from RI5CY manual, rel. Jan 2017)



Figure 2.3 Slow Response Memory Transaction (reprinted from RI5CY manual, rel. Jan 2017)
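The request/grant/valid handshake described in this section can be traced cycle by cycle with a short behavioural sketch. The function name `memory_transaction` and its two delay parameters are hypothetical and only serve to illustrate the timing rules stated in the text; this is not a model of the actual memory controller.

```python
def memory_transaction(grant_delay, rvalid_delay):
    """Trace one transaction as a list of (cycle, req, gnt, rvalid) tuples.

    grant_delay:  cycles between raising req and receiving gnt (0 or more).
    rvalid_delay: cycles between gnt and rvalid (1 or more).
    """
    if grant_delay < 0 or rvalid_delay < 1:
        raise ValueError("grant_delay must be >= 0 and rvalid_delay >= 1")
    trace = []
    for cycle in range(grant_delay + rvalid_delay + 1):
        req = cycle <= grant_delay                      # held high until granted
        gnt = cycle == grant_delay                      # memory accepts the request
        rvalid = cycle == grant_delay + rvalid_delay    # high for exactly one cycle
        trace.append((cycle, int(req), int(gnt), int(rvalid)))
    return trace


# Grant in the same cycle as the request, data valid one cycle later.
basic = memory_transaction(grant_delay=0, rvalid_delay=1)
# Memory needs two extra cycles before it can grant the request.
slow = memory_transaction(grant_delay=2, rvalid_delay=1)
```

The first call corresponds to the shortest transaction the protocol allows; the second shows the request being held high while the memory is not yet ready.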

#### 2.3 Misaligned Accesses

The core hardware does not perform misaligned accesses natively (i.e. accesses that are not aligned on natural word boundaries). If a misaligned memory access is requested by an instruction, the core raises an exception. The core includes no hardware to split a misaligned access into multiple aligned accesses. In compliance with the RISC-V specification, misaligned accesses are therefore not guaranteed to be atomic.

#### 2.4 Memory Address Map

Harts (i.e. hardware threads) running on a Klessydra core share the memory map illustrated in Fig. 2.4, which is compliant with the Pulpino SoC platform specification. The MIP CSRs, one for each hart, are memory mapped starting at address 0x0000ff00 and allow for inter-thread interrupts, in compliance with the RISC-V specification. (Other CSRs are not memory mapped.)

Each hart has its own stack, and the stack size and starting address are customizable at software level in the runtime system startup routine. The remaining memory space is available for inter-thread data communication.

For information about the addresses from 0x00 to 0x90, see the vector table in chapter 5. Address 0x94 is reserved to MTVEC.



Fig. 2.4 Klessydra Memory Map (assuming 4 Threads, 2 KB stack per thread)

# Chapter 3 Architecture Registers

#### 3.1 Register File

Klessydra has 32x32-bit wide registers which form the registers x0 to x31. Register x0 is statically bound to 0 and can only be read. Writes to register x0 have no effect. The register file can be reduced to 16x32-bit registers if the RV32E embedded extension is enabled.

#### 3.2 Control and Status Registers

Klessydra cores implement a subset of the control and status registers specified in the RISC-V privileged specification, limited to the registers needed for M-mode operation and to the functionalities implemented in the core. Klessydra cores also implement some additional CSRs specifically needed for the core operations and/or for compliance with the Pulpino microprocessor platform. This extended CSR sub-set is composed of the MIRQ, PCER, PCMR registers. The whole set of CSRs implemented in the Klessydra cores is as follows:

Table 3.1 CSR Registers

| Name        | CSR Address                      | Reset Value | R/W | Description                                   |
|-------------|----------------------------------|-------------|-----|-----------------------------------------------|
| MSTATUS     | 0x300                            | 0x0000_1808 | R/W | Machine Status                                |
| MEPC        | 0x341                            | 0x0000_0000 | R/W | Machine Exception Program Counter             |
| MCAUSE      | 0x342                            | 0x0000_0000 | R/W | Machine Trap Cause                            |
| PCER        | 0x7A0                            | 0xFFFF_FFFF | R/W | Performance Counter Enable                    |
| MHPMCOUNTER | 0xB00, 0xB02, 0xB03, 0xB06-0xB0A | 0x0000_0000 | R/W | Machine Performance-Monitoring Counter        |
| MHPMEVENT   | 0x323, 0x326-0x32A               | 0x0000_0000 | R/W | Machine Performance-Monitoring Event Selector |
| MCPUID      | 0xF00                            | 0x0000_0100 | R   | CPU Description                               |
| MIMPID      | 0xF01                            | 0x0000_8000 | R   | Implementation ID                             |
| MHARTID     | 0xF10                            | -           | R   | Hardware Thread ID                            |
| MIP         | 0x344                            | -           | R/W | Interrupt Pending                             |
| MTVEC       | 0x305                            | 0x0000_0094 | R/W | Trap-Handler Base Address                     |
| MBADADDR    | 0x343                            | 0x0000_0000 | R/W | Misaligned Address Container                  |
| MIRQ        | 0xFC0                            | -           | R   | Interrupt Request                             |
| MVSIZE      | 0xBF0                            | 0x0000_0001 | R/W | Set Vector Size unit (T1)                     |
| MVTYPE      | 0xBF8                            | 0x0000_0002 | R/W | Set the data type (T1)                        |
| MPSCLFAC    | 0xBE0                            | 0x0000_0000 | R/W | Set the post scaling factor (T1)              |

### • MSTATUS Register bit map

Table.3.1.1 MSTATUS bits

| Bit # | R/W | Description |
|-------|-----|-------------|
| 3     | R/W | <b>Interrupt Enable:</b> When an exception is encountered, Interrupt Enable is set to 1'b0 and its previous state is stored in bit 7. When the <i>mret</i> instruction is executed, the original value of Interrupt Enable is restored from bit 7. If you want to enable interrupt handling in your exception handler, set Interrupt Enable to '1' inside your handler code. |
| 7     | R/W | <b>Interrupt Previous Enable:</b> Takes the state of bit 3 when an interrupt is served. When an <i>mret</i> is executed, it returns bit 3 to its original value and stays latched to 1. |
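The interplay between bits 3 and 7 of MSTATUS can be sketched as two pure functions on the register value. This is an illustrative model of the behaviour described above, not code from the Klessydra runtime; the names `trap_entry` and `mret` and the constants are chosen for this example only.

```python
MIE = 1 << 3    # Interrupt Enable (bit 3)
MPIE = 1 << 7   # Interrupt Previous Enable (bit 7)


def trap_entry(mstatus):
    """Model trap entry: save bit 3 into bit 7, then clear bit 3."""
    saved = MPIE if mstatus & MIE else 0
    return (mstatus & ~(MIE | MPIE)) | saved


def mret(mstatus):
    """Model mret: restore bit 3 from bit 7, then latch bit 7 to 1."""
    restored = MIE if mstatus & MPIE else 0
    return (mstatus & ~MIE) | restored | MPIE


# Starting from the documented reset value, interrupts are enabled.
in_handler = trap_entry(0x0000_1808)   # interrupts masked inside the handler
after_return = mret(in_handler)        # interrupts enabled again
```

A handler that needs nested interrupts would set bit 3 itself after entry, as noted in the table.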

### • MEPC Register

When an exception is encountered, the current program counter is saved in MEPC, and the core jumps to the MTVEC address. When an MRET instruction is executed, the value from MEPC replaces the current program counter, unless the saved address points to a WFI instruction, in which case execution resumes at the instruction following the WFI.

### • MCAUSE Register bit map

Table.3.1.2 MCAUSE bits

| Bit # | R/W | Description                                                                                              |
|-------|-----|----------------------------------------------------------------------------------------------------------|
| 31    | R   | <b>Interrupt:</b> This bit is set when the exception was triggered by an interrupt.                      |
| 30    | R   | <b>WFI:</b> This bit indicates that the last instruction before entering the subroutine was a <i>WFI</i> |
| 4:0   | R   | <b>Trap Cause:</b> "0011" for SW IRQ, "0111" for Timer IRQ, "1011" for External IRQ.                     |
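As a reading aid for the MCAUSE layout, the following sketch splits a raw register value into its three documented fields. The function name `decode_mcause` and the returned dictionary keys are hypothetical; only the bit positions and the three interrupt cause codes come from the table above.

```python
IRQ_CAUSES = {0b0011: "software", 0b0111: "timer", 0b1011: "external"}


def decode_mcause(mcause):
    """Split a 32-bit MCAUSE value into its documented fields."""
    is_interrupt = bool((mcause >> 31) & 1)   # bit 31
    after_wfi = bool((mcause >> 30) & 1)      # bit 30
    code = mcause & 0x1F                      # bits 4:0
    # Interrupt causes are named; any other code is reported as a number.
    cause = IRQ_CAUSES.get(code, code) if is_interrupt else code
    return {"interrupt": is_interrupt, "wfi": after_wfi, "cause": cause}


external_irq = decode_mcause(0x8000_000B)
timer_after_wfi = decode_mcause(0xC000_0007)
```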

### • PCER Register bit map

Each bit in the PCER register controls one performance counter. If the bit is 1, the counter is enabled and starts counting events. If it is 0, the counter is disabled and its value won't change.

Table.3.1.3 PCER bits

| Bit # | Description                                                  |
|-------|--------------------------------------------------------------|
| 9     | Branch Taken Counter Enable                                  |
| 8     | Branch Counter Enable                                        |
| 7     | Jump Counter Enable                                          |
| 6     | Store Counter Enable                                         |
| 5     | Load Access Counter Enable                                   |
| 4     | Instruction Miss Counter Enable (currently not implemented)  |
| 3     | Jump Access Stall Counter Enable (currently not implemented) |
| 2     | Load/Store Access Stall Counter Enable                       |
| 1     | Instruction Counter Enable                                   |
| 0     | Cycle Counter Enable                                         |
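A value to write into PCER can be assembled from the bit positions in the table above. The sketch below is only an illustration: the event names used as dictionary keys and the helper `pcer_mask` are invented for this example and do not appear in the Klessydra software.

```python
# Bit positions taken from Table.3.1.3; the key names are hypothetical.
PCER_BITS = {
    "cycle": 0,
    "instruction": 1,
    "load_store_stall": 2,
    "load": 5,
    "store": 6,
    "jump": 7,
    "branch": 8,
    "branch_taken": 9,
}


def pcer_mask(*events):
    """Build the PCER value that enables exactly the named counters."""
    mask = 0
    for event in events:
        mask |= 1 << PCER_BITS[event]
    return mask


# Enable only the cycle and instruction counters.
cycles_and_instructions = pcer_mask("cycle", "instruction")
# Enable the two branch-related counters.
branches = pcer_mask("branch", "branch_taken")
```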

### • MHPMCOUNTER Registers

Klessydra cores include an MCYCLE counter, an MINSTRET counter and 6 additional event counters, MHPMCOUNTER3 and MHPMCOUNTER6-MHPMCOUNTER10, of which only the first eight are used. The names of the registers are compliant with RISC-V, but the counters are not divided into 32 lower bits and 32 higher bits. Only MCYCLE and MINSTRET are extended to 64 bits by the registers CYCLEH and MINSTRETH. Each counter value is a 32-bit unsigned integer.

Table.3.1.4 MHPMCOUNTER bits

| Register      | Description                                                    |
|---------------|----------------------------------------------------------------|
| MCYCLE        | Counts the number of cycles the core was active (not sleeping) |
| MINSTRET      | Counts the number of instructions executed                     |
| MHPMCOUNTER3  | Number of load/store data hazards                              |
| MHPMCOUNTER4  | currently not used                                             |
| MHPMCOUNTER5  | currently not used                                             |
| MHPMCOUNTER6  | Number of data memory loads executed                           |
| MHPMCOUNTER7  | Number of data memory stores executed                          |
| MHPMCOUNTER8  | Number of unconditional jumps                                  |
| MHPMCOUNTER9  | Number of branches. Counts taken and not taken branches        |
| MHPMCOUNTER10 | Number of taken branches                                       |

### MHPMEVENT Registers

In each MHPMEVENT register all the bits are statically bound to 0 except for the bit related to the counter that must be enabled. If that bit is 1, the counter is active and starts counting events. For instance, to enable MHPMCOUNTER3 the user sets bit #2 (the 3<sup>rd</sup> bit) of MHPMEVENT3 to 1. This procedure is equivalent to setting PCER (3) to 1. The core includes 6 registers, MHPMEVENT3 and MHPMEVENT6-MHPMEVENT10.

| Register                        | Not Bound Bit # |
|---------------------------------|-----------------|
| MHPMEVENT3                      | 2               |
| MHPMEVENT4 (currently not used) | -               |
| MHPMEVENT5 (currently not used) | -               |
| MHPMEVENT6                      | 5               |
| MHPMEVENT7                      | 6               |
| MHPMEVENT8                      | 7               |
| MHPMEVENT9                      | 8               |
| MHPMEVENT10                     | 9               |

Table.3.1.5 MHPMEVENT bits

#### MCPUID Register

The value of this register is fixed to 256 and cannot be changed. By using the CPUID opcode, software can determine the processor type and the presence of features.

### MIMPID Register

The value of this register is fixed to 32768 and cannot be changed. MIMPID provides a unique encoding of the version of the processor implementation.

#### • MHARTID Register

This register contains the integer ID of the hardware thread running the code. Its value depends on the cluster and core external signals and can only be read.

Table.3.1.6 MHARTID bits

| Bit # | Description                       |
|-------|-----------------------------------|
| 9:4   | ID of the Cluster                 |
| 3:0   | ID of the core within the cluster |
### • MIP Register

The MIP register contains information about the type of pending interrupts. Bits #11 and #7 are set according to the external interrupt bits, while bit #3 is set to 1 to activate the SW interrupt routine.

Table.3.1.7 MIP bits

| Bit # | R/W | Interrupt Type     |
|-------|-----|--------------------|
| 11    | R   | External Interrupt |
| 7     | R   | Timer Interrupt    |
| 3     | R/W | Software Interrupt |

### • MTVEC Register

When an exception or an interrupt occurs, PC is loaded with the value of this register. MTVEC is the standard RISC-V base trap vector.

### • MIRQ Register

This register records which interrupt has been raised. Its value is four times the index of the asserted interrupt bit. For instance, if *irq\_i(3)* is set, MIRQ is loaded with 12. If no interrupt is set, the MIRQ value is 65535, which is just an arbitrary number.
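The MIRQ encoding rule can be written as a one-line function. This is a sketch for illustration; the name `mirq_value` and the use of `None` to mean "no interrupt asserted" are assumptions of this example.

```python
NO_IRQ = 65535  # arbitrary value reported when no interrupt is set


def mirq_value(irq_index):
    """Return the MIRQ content for the asserted interrupt bit index.

    irq_index is the bit position of the asserted interrupt line,
    or None when no interrupt is set.
    """
    return NO_IRQ if irq_index is None else 4 * irq_index


third_line = mirq_value(3)
idle = mirq_value(None)
```

The factor of four matches the size of one 32-bit word, which is consistent with using MIRQ as a byte offset into a table of one-word entries, although the text does not state this explicitly.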

### • MBADADDR Register

When an instruction-fetch, load or store address-misaligned or access exception occurs, MBADADDR is written with the faulting address.

### • MVSIZE Register

Setting this register will set the vector size to be used by the mathematical unit. The biggest size should not exceed the SPM size, since overflow bits will be ignored.

### • MPSCLFAC Register

Contains the post scaling factor that determines the shift amount in KDOTPPS custom Klessydra instruction.

# Chapter 4 Pipeline Organization

#### 4.1 General concepts

Klessydra cores implement pipelined instruction processing. The number of pipeline stages differs among the cores as reported below. In the following, **F** indicates instruction fetch, **D** indicates operand read from register file and instruction decoding, **E** indicates operation execution, **W** indicates result writeback to the register file. In all cores, the **F** stage latency is equal to the latency of program memory access, and variable latency program memory is supported (as for the case of instruction cache memory). The **F** stage latency is 1 in case of single-cycle-access program memory. For other pipeline stages, the latency may be fixed or depend on external events (e.g. data memory latency, contention on CSR updating in case of interrupt requests). When a stage latency takes more than 1 cycle, the hardware stalls the preceding stage by local handshake signals. Similarly, each stage locally signals the succeeding stage when a new item is ready.

The generic microarchitecture for T0 cores is depicted in Fig. 4.1.

Each thread is identified by a positive integer number *harc* (hardware context). The *harc* counter changes the *harc* value at each new instruction fetch, and the *harc* value associated with an instruction is passed through the pipeline stages. Most of the logic in the pipeline control section is replicated on a per-thread basis, and the *harc* value is used to properly index the logic units. Conversely, none of the logic in the processing pipeline is replicated per thread, with the sole exception of the data register file. In the S0 core, per-thread replication and the *harc*-related logic are natively absent.



Fig. 4.1 – Generic pipeline microarchitecture scheme implemented in Klessydra T0 cores.

The specialized microarchitecture of T1 cores is represented in Fig. 4.2.



Fig. 4.2 – Datapath sketch of T1 cores.

T1 cores feature an execution stage that is split into a mathematical acceleration unit, scratchpad memory unit and a regular execution unit.

#### 4.2 S0 core pipeline

The Klessydra S0 core implements a 2-stage pipeline according to the model **F** / **DEW**. The latency scheme is as follows:

|                             | F        | DEW      |
|-----------------------------|----------|----------|
| Load and store instructions | $\geq 1$ | $\geq 2$ |
| CSR instructions            | $\geq 1$ | $\geq 2$ |
| All other instructions      | $\geq 1$ | 1        |

Branch instructions are predicted as not-taken and are executed with a delay slot of 1 cycle; in case of taken branch the hardware flushes any wrongly fetched instruction from the pipeline.

Data hazards never occur.

#### 4.3 T02x core pipeline

The Klessydra T02x cores implement a 3-stage pipeline according to the model **F** / **D** / **EW**. The latency scheme is as follows:

|                             | F        | D | EW       |
|-----------------------------|----------|---|----------|
| Load and store instructions | $\geq 1$ | 1 | $\geq 2$ |
| CSR instructions            | $\geq 1$ | 1 | $\geq 2$ |
| Atomic memory operations    | $\geq 1$ | 1 | $\geq 4$ |
| All other instructions      | $\geq 1$ | 1 | 1        |

Branch instructions are predicted as not-taken and are executed with a delay slot of 2 cycles; in case of taken branch the hardware flushes any wrongly fetched instruction, belonging to the branching thread, from the pipeline. No pipeline flush occurs if at least 3 threads are interleaved in the pipeline.

Data hazards never occur, provided that at least 2 threads (Thread Pool Baseline) are interleaved in the pipeline.

#### 4.4 T03x / T13x / Fxxx core pipeline

The Klessydra T03x/T13x cores implement a 4-stage pipeline according to the model F / D / E / W. The latency scheme is as follows:

|                             | F        | D | E        | W |
|-----------------------------|----------|---|----------|---|
| Load and store instructions | $\geq 1$ | 1 | $\geq 2$ | 0 |
| CSR instructions            | $\geq 1$ | 1 | $\geq 2$ | 0 |
| Atomic memory operations    | $\geq 1$ | 1 | $\geq 4$ | 0 |
| All other instructions      | $\geq 1$ | 1 | 1        | 1 |
| Specialized vector instructions | $\geq 1$ | 1 | $\geq 2$ | 0 |

Branch instructions are predicted as not-taken and are executed with a delay slot of 3 cycles; in case of taken branch the hardware flushes any wrongly fetched instruction, belonging to the branching thread, from the pipeline. No pipeline flush occurs if at least 3 threads are interleaved in the pipeline.

Data hazards never occur, provided that at least 2 threads (Thread Pool Baseline) are interleaved in the pipeline.

# Chapter 5 Exceptions and Interrupts

Klessydra cores implement exceptions on illegal instructions, on load and store instructions to invalid addresses, on misaligned memory accesses, and on ECALL instruction execution.

Klessydra cores implement vectorized interrupts, specifically supporting 32 separate interrupt service routines. There are three types of interrupt:

- Software Interrupt
- External Interrupt
- Timer Interrupt

The interrupt/exception vector table supported by Klessydra cores is compliant with the Pulpino platform interrupt vector table, as follows:

Table.5.1 Interrupt Handler address map

| Except Code | Exception                       |
|-------------|---------------------------------|
| 0x0000_0002 | ILLEGAL_INSN_EXCEPT_CODE        |
| 0x0000_0005 | LOAD_ERROR_EXCEPT_CODE          |
| 0x0000_0007 | STORE_ERROR_EXCEPT_CODE         |
| 0x0000_000B | ECALL_EXCEPT_CODE               |
| 0x0000_0004 | LOAD_MISALIGNED_EXCEPT_CODE     |
| 0x0000_0006 | STORE_MISALIGNED_EXCEPT_CODE    |
| 0x0000_0100 | ILLEGAL_VECTOR_SIZE_EXCEPT_CODE |
| 0x0000_0101 | ILLEGAL_ADDRESS_EXCEPT_CODE     |
| 0x0000_0102 | SCRATCHPAD_OVERFLOW_EXCEPT_CODE |

Interrupt handling is accomplished in the core hardware by jumping to the address contained in MTVEC, in compliance with the RISC-V specification; the pre-compiled startup software routine located at the MTVEC address implements the interrupt vector table as shown above, jumping to the right handler routine address. The interrupt handlers are to be written by the end user according to the target application.

Interrupts can be enabled/disabled on a global basis through the MSTATUS register; they cannot be individually enabled/disabled. Exceptions cannot be disabled.

When entering an interrupt routine, the core saves the current value of MIE (bit 3) to MPIE (bit 7) in the MSTATUS register; the state of MIE is restored after returning from the interrupt service routine.

If multiple interrupt requests arrive in the same cycle, the order of service is external interrupt first, then software interrupt, timer interrupt and exceptions (in compliance with the RISC-V specification).
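The service order stated above amounts to a fixed-priority arbiter. The sketch below models it with plain strings; the names `SERVICE_ORDER` and `next_to_serve` are hypothetical and the model ignores per-hart routing.

```python
# Highest priority first, as stated in the text.
SERVICE_ORDER = ("external", "software", "timer", "exception")


def next_to_serve(pending):
    """Return the highest-priority pending request, or None if there is none."""
    for kind in SERVICE_ORDER:
        if kind in pending:
            return kind
    return None


# A timer and a software interrupt arrive in the same cycle.
winner = next_to_serve({"timer", "software"})
```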

In T0 cores, external interrupts are always re-directed to hart number 0. Software interrupts can be directed from any active hart, to any active or idle hart. Software interrupts allow inter-hart service requests.

In T0 cores and in T1 cores, as all status registers are replicated on a per-thread basis, the interrupt/exception handling mechanism is implemented referring to the status registers of the interrupted thread.

T1 cores introduce five more exceptions related to scratchpad handling. An exception is raised if the Math Accelerator unit operands come from non-scratchpad addresses, if a read or write would access an address that overflows a scratchpad, or if dual writes or dual reads target the same scratchpad, as when the LSU and the Math Accelerator unit work simultaneously.

# Chapter 6 Scratchpads and mathematical unit (T1 version only)

#### 6.1 Scratchpad memory subsystem

Klessydra T1 cores include scratchpad memories, with a configurable number of scratchpads and banks, and configurable scratchpad size and address mapping. The configuration can be modified in the PKG file of the synthesizable Klessydra suite. Each scratchpad memory (SPM) is composed of a set of memory banks; the number of banks available in each SPM is defined by the "SIMD" parameter value set in the PKG file.

Each bank address holds a 32-bit word. An SPM data line is composed of as many words as the "SIMD" value. As addresses remain byte-aligned, the address distance between consecutive SPM data lines is SIMD\*4 bytes. Each word on the line has its own address and can be independently accessed (with a 4-byte aligned address).

Any SPM bank can be read or written. On a read access that starts at a bank other than bank0, a read rotator rotates the data read in SIMD fashion so that it appears as if it came from bank0. On a write access to a bank other than bank0, a write rotator rotates the data to its correct destination bank. The rotators serve to align the two input source operands.
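The line/bank organization and the two rotators can be illustrated with a small software model. Everything here is a sketch under stated assumptions: consecutive 32-bit words are assumed to map to consecutive banks, the rotation direction is inferred from the description, and the function names are invented for this example.

```python
def spm_locate(byte_offset, simd):
    """Map a 4-byte aligned offset inside an SPM to (line, bank).

    Assumption: word k of the scratchpad lives in bank k % simd.
    """
    if byte_offset % 4:
        raise ValueError("SPM word addresses must be 4-byte aligned")
    word = byte_offset // 4
    return word // simd, word % simd


def read_rotate(bank_words, start_bank):
    """Rotate bank outputs so the word from start_bank lands in lane 0."""
    return bank_words[start_bank:] + bank_words[:start_bank]


def write_rotate(lane_words, dest_bank):
    """Rotate lane data so lane 0 is steered to dest_bank."""
    k = dest_bank % len(lane_words)
    return lane_words[-k:] + lane_words[:-k] if k else list(lane_words)


# With SIMD = 4, a line holds 4 words, so lines are 16 bytes apart.
location = spm_locate(20, simd=4)
lanes = read_rotate(["b0", "b1", "b2", "b3"], start_bank=1)
banks = write_rotate(lanes, dest_bank=1)
```

In this model the write rotation is the inverse of the read rotation, so data read from a given starting bank and written back to the same bank returns to its original position.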

Each scratchpad has one read port, and one write port. Each port has size SIMD\*32bits (e.g. for SIMD=4, we have 4\*32 = 128 bits).

The SPMs can be accessed by the SPMU or the LSU. When a dual read (or dual write) access is requested to the same SPM on the same port by two different units (SPMU and LSU), priority is given to the unit that requested the access first, and the other unit is halted until the operation is finished. (Due to the in-order single-issue pipeline of the Klessydra cores, the two units cannot request access to the SPM in the same cycle.)

All transfers to and from the scratchpads go through an interface called **SPI**. Both SPM reads and writes happen through this SPI wrapper. The LSU and SPMU are the only units that interact with the scratchpad memories, always through the SPI.

#### 6.2 Mathematical accelerator unit

The Mathematical unit was designed to execute custom Klessydra instructions targeting vector, DSP-like and CNN-inference-like operations.

The Mathematical unit interfaces to the SPI unit by means of two read ports, for the operands coming from the scratchpad memory interface, and one write port to the scratchpad memory interface. The read and write port widths depend on the SIMD parameter value set in the PKG file.

The custom instructions executed in the Mathematical unit are listed in Table 7.1. Many of them come in several variants. Table 7.1 also shows the SIMD capability of the Mathematical unit. A composition of partial functional units has been adopted to enhance SIMD execution and to reduce the area of the Mathematical unit. Addition instructions combine 8-bit adders to perform 8-bit, 16-bit, and 32-bit additions. Multiplication instructions combine 16-bit multipliers to perform 8-bit, 16-bit, and 32-bit multiplications<sup>1</sup>.

<sup>1</sup> 16-bit multipliers were chosen over 8-bit multipliers because performing a 32-bit multiplication with 8-bit multipliers would be inefficient, and because 16-bit multipliers are the best fit for the DSP blocks of presently available FPGAs.

There are no 32-bit arithmetic units in the Mathematical unit except for the 32-bit shifters, which can be configured to perform 8-bit, 16-bit, or 32-bit arithmetic or logical right shifts, the accumulators (32-bit adders, and 16-bit adders for both 8-bit and 16-bit elements), and the Rectified Linear Unit (ReLU).
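
The composition of narrow functional units described above can be sketched in software. The model below is an illustration under stated assumptions, not the actual RTL: `simd_add` chains four 8-bit adder slices and cuts the carry at element boundaries, and `mul32_from_mul16` builds the low 32 bits of an unsigned 32-bit product from 16-bit partial products. Both function names are hypothetical.

```python
def simd_add(a: int, b: int, elem_bits: int) -> int:
    """Add two 32-bit words as packed 8-, 16- or 32-bit elements using four 8-bit slices."""
    if elem_bits not in (8, 16, 32):
        raise ValueError("element width must be 8, 16 or 32")
    result = 0
    carry = 0
    for i in range(4):  # one iteration per 8-bit adder slice
        if (i * 8) % elem_bits == 0:
            carry = 0  # element boundary: the carry chain is cut here
        s = ((a >> (8 * i)) & 0xFF) + ((b >> (8 * i)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * i)
        carry = s >> 8
    return result


def mul32_from_mul16(a: int, b: int) -> int:
    """Low 32 bits of an unsigned 32x32 product built from 16x16 partial products."""
    al, ah = a & 0xFFFF, (a >> 16) & 0xFFFF
    bl, bh = b & 0xFFFF, (b >> 16) & 0xFFFF
    cross = (ah * bl + al * bh) & 0xFFFF  # only the low half of the cross terms survives
    return (al * bl + (cross << 16)) & 0xFFFFFFFF
```

The same four slices serve all three element widths; only the points where the carry is cut change, which is what keeps the area low.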

When working on vectors, the Mathematical unit exploits a built-in hardware loop (zero-overhead loop), executing the following steps in hardware:

- a. Increment the source and destination vector pointers to fetch the next element chunk;
- b. Decrement the remaining number of elements to process;
- c. Evaluate a conditional branch to check whether the number of remaining elements reached zero.
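
The three steps above can be sketched as a software model of one vector addition. This is a simplified illustration: the scratchpad is a flat list of 32-bit words, pointers are word indices, and the function name is hypothetical.

```python
def hw_loop_vector_add(spm, src1, src2, dst, n_elems, simd):
    """Model the hardware-loop steps for an element-wise vector add.

    spm is a flat list of 32-bit words; src1, src2 and dst are word indices.
    Returns the number of loop iterations (element chunks) processed.
    """
    remaining = n_elems
    iterations = 0
    while remaining > 0:                      # step c: check for loop exit
        chunk = min(simd, remaining)
        for lane in range(chunk):             # SIMD lanes work on one chunk together
            spm[dst + lane] = (spm[src1 + lane] + spm[src2 + lane]) & 0xFFFFFFFF
        src1 += chunk                         # step a: advance source pointers
        src2 += chunk
        dst += chunk                          # step a: advance destination pointer
        remaining -= chunk                    # step b: decrement remaining elements
        iterations += 1
    return iterations
```

In hardware none of this bookkeeping costs instructions, which is why the loop is called zero-overhead; in the model it is ordinary Python control flow.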

The Mathematical unit can operate in parallel with the other execution units. Since the custom Klessydra instructions never have dependencies on the standard RISC-V instructions, the IE unit and the LSU can work in parallel with the Mathematical unit.

The Mathematical unit recovers its state when a halt occurs due to dual read/write access.

# Chapter 7 Fault Tolerance Support (F0x versions only)

Klessydra core versions F0x support several fault tolerance mechanisms targeting aerospace and safety-critical applications. Most of the implemented mechanisms address tolerance to single event upsets (SEUs) in memory elements (registers and memories).

#### 7.1 Basic mechanisms

Supported standard FT mechanisms are Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR), both based on the replication of functional modules and the comparison of their outputs through a voting system.

DMR uses two replicas of combinational or sequential logic. It can only detect errors, and it has low area occupation and power consumption but a high implementation time.

Basic-TMR triplicates the combinational or sequential logic and adds a majority voter. It has the same implementation time as DMR, with a further increase in area and power consumption.

Full-TMR applies triple redundancy to both logic and registers, at the cost of area and power consumption, and uses cross voters to guarantee high error-correction capability.

Global-TMR is based on Full-TMR, but it can be automated through synthesis tools.
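
The difference between detection (DMR) and correction (TMR) can be shown with a minimal sketch of the two comparison functions. The bitwise majority expression is the standard voter equation; the function names are hypothetical.

```python
def dmr_check(a: int, b: int) -> bool:
    """DMR: return True when the two replicas disagree (detection only)."""
    return a != b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority of three replicas; masks any single faulty copy."""
    return (a & b) | (a & c) | (b & c)
```

With two replicas a mismatch shows that an error occurred but not which copy is wrong, whereas the majority of three replicas still returns the correct value when one copy is corrupted.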

#### 7.2 F03a: Fully TMR – Partial TMR Design

The protection of control and state registers is a priority because they contain vital information about core operation and they are written only once at first run, so a TMR must be used.

Counter registers are less critical because they are constantly refreshed. Each core has a dedicated counter with many 32-bit registers, so it's suggested to use alternative techniques:

- MSB-TMR: triple redundancy only of N most significant bits, reducing area impact.
- DMR: the detection of an error triggers a trap, identified by a code, which is managed by the software.
- Software protection: no hardware protection, the software periodically reads and compares counters.
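
As an illustration of the MSB-TMR option, the sketch below votes only on the N most significant bits of a counter and takes the remaining bits from a single unprotected copy. The storage layout (three copies of the upper bits, one copy of the lower bits) and the function name are assumptions made for the example.

```python
def msb_tmr_read(msb_copies, low_bits_value, n_msb, width=32):
    """Rebuild a counter value whose n_msb upper bits are stored in triplicate.

    msb_copies: three copies of the upper n_msb bits (assumed layout).
    low_bits_value: the single, unprotected copy of the lower bits.
    """
    a, b, c = msb_copies
    voted = (a & b) | (a & c) | (b & c)          # majority vote on the upper bits only
    low_width = width - n_msb
    upper = (voted & ((1 << n_msb) - 1)) << low_width
    lower = low_bits_value & ((1 << low_width) - 1)
    return upper | lower
```

An upset in one copy of the upper bits is corrected, while an upset in the lower bits only perturbs the counter by a small amount, which is the trade-off that reduces the area impact.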

Pipeline robustness is fundamental in TMR because redundancy does not protect registers from loading wrong values, which irrecoverably corrupt code execution. Redundancy therefore has to be applied to all registers between pipeline stages and to the state registers of state machines. Register files are dedicated to each thread, so applying TMR to them has the highest impact in terms of area.

The voting system also lengthens the critical path, which lowers the maximum clock frequency. The program counter unit has a dedicated 32-bit register for each thread and some flip-flops that store events and conditions to be resolved by the unit. Corruption of these flip-flops does not lead to loss of control, because the PC update is defined by signals coming from the pipeline.

An error on the exception-service flip-flop is more critical because it requires a response from the CSR.

Since its few registers and flip-flops occupy a small area, protecting the PC unit with TMR is suggested.

#### 7.3 F03b: Double Pipeline Design with Check&Restore

In this version, a protection technique is used that does not allow error correction, but the area occupied by the TMR version is reduced by a third, without losing reliability. This design implies a change of the internal architecture of the core. The structure is based on a new Processing Unit composed of two Pipelines, the CRU and a new CSR unit derived from the TMR version. In this architecture, a checkpoint is created before critical portions of code or periodically. Checkpoint control is managed by the Check Restore Unit (CRU) and the CSR.

The two pipelines are the same as in the Klessydra T03 version. The input and output signals of the pipelines are controlled by the CRU, which can drive them exclusively to start checkpoint or restore procedures, which are not natively implemented in the pipeline. To allow the CRU to operate and check for errors, there is an internal register called CRSTATUS. This critical register is protected by TMR and can be read (but not written) via CSR instructions. The check-and-restore system requires dedicated instructions for management and control. These instructions are:

- Checkpoint start instruction
- Instruction to activate thread dependency
- Instruction to restart the restore manually
- Instruction to deactivate protected mode

The double pipeline structure adds two operating modes to the system:

- the normal mode allows the clock of the unused pipeline to be deactivated in order to reduce the dynamic power consumption of the core;
- the "single pipeline" mode allows the core's life to be extended in critical environments. This mode is integrated and supported by the hardware but requires that the code is written appropriately.

A portion of the software is used to check the correct functioning of the hardware: if a pipeline is damaged, it can be disabled. The core is then used with a single unprotected pipeline. Robustness must then be ensured by the software, which is executed redundantly at the expense of processing speed.

The CRU is the heart of the DoublePipe architecture protection system. The system is based on the comparison of the pipeline outputs. In case of a discrepancy between the outputs, the CRU raises a flag that indicates the presence of an error. In the next execution phase of the thread with an active flag, the CRU takes control of the outputs, simulating an illegal instruction with a specific cause code. At this point, a software routine takes care of recovering the values of the register files previously saved in memory. At the end of the illegal instruction routine, the PC unit loads the program counter with the value saved during checkpoint creation. The return to a checkpoint does not deactivate the protected mode or eliminate the checkpoint. The CRU also manages part of the dedicated instructions of this architecture and the internal control register. The control register contains information on the configuration of the CRU, on the execution status of the core, and on any pending errors. This register can be read (but not written) by the user.
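
The check-and-restore flow described above can be summarised with a toy behavioural model. It is only a sketch: the class, its methods, and the way the two pipeline outputs are passed in are hypothetical, and the real unit operates on pipeline signals rather than Python values.

```python
class CheckRestoreModel:
    """Toy model of checkpoint creation, output comparison and restore."""

    def __init__(self, regs, pc):
        self.regs = list(regs)
        self.pc = pc
        self.saved = None          # (register copy, pc) of the active checkpoint
        self.error_flag = False

    def checkpoint(self):
        """Save the register file and the PC, as done at checkpoint creation."""
        self.saved = (list(self.regs), self.pc)

    def commit(self, out_a, out_b, rd):
        """Compare both pipeline outputs; write back only when they agree."""
        if out_a != out_b:
            self.error_flag = True  # served later through the trap routine
            return False
        self.regs[rd] = out_a
        self.pc += 4
        return True

    def restore(self):
        """Trap routine: reload registers and PC; the checkpoint stays valid."""
        if self.saved is None:
            raise RuntimeError("no checkpoint available")
        self.regs, self.pc = list(self.saved[0]), self.saved[1]
        self.error_flag = False
```

After a mismatch, calling `restore()` brings the model back to the saved registers and PC while keeping the checkpoint, mirroring the statement that returning to a checkpoint does not eliminate it.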

The DoublePipe CSR, unlike the original version, includes a specific register which is used to back up the PC. The register is not addressable, and writes to it are managed by the CSR during the execution of the pseudo-instructions developed for this architecture. Other registers are instead extended in use and functionality compared to the RISC-V standard: writing particular values to some registers is interpreted as an instruction. The CSR, together with the CRU, takes care of serving the instructions for starting a checkpoint and restoring it in the event of an error.

The DoublePipe program counter unit is substantially identical to the original version in terms of functionality. Since the activation system of a checkpoint is based on the start of a particular software interrupt, it is necessary to add a condition in the PC unit that allows the service of this type of interrupt despite the interrupts being disabled.

#### 7.4 F03c: Shadow Thread Double Pipe

The F03c architecture is based on the possibility to correct errors, by using a double pipeline and a redundant execution of the thread instructions. To obtain corrections of the errors, 3 copies of the same result are necessary:

- two contemporary copies obtained through pipeline redundancy;
- a third copy obtained through a temporally out of phase processing.

A Shadow Control Unit (SCU) handles the pipelines and the synchronization of the architecture. During shadow processing, the SCU handles two different instructions, which are executed in the pipelines. To solve latency differences between instructions, the SCU can put a pipeline on hold so that the execute phase completes in a synchronized way. Register file management is left to the SRU, which constantly communicates with the SCU. The SCU provides information about the processing in the pipelines, indicating to the SRU whether the instructions require access to the registers. In case of error, the SRU performs a memory access and retrieves the value of the register. If the error is detected in conjunction with an instruction that requires reading the same register, the SRU sends a signal to the SCU, which blocks the pipelines. Once the data has been received from memory, a triple comparison is made, and the correct data is sent to the pipelines. When this phase is completed, the SCU unlocks the pipelines. The write to memory is started in conjunction with the WB phase. The copy of the regfile has priority over a possible memory access instruction, which is put on hold, locking the pipelines.

The CSR ST does not contain dedicated registers but is equipped with additional input signals. The operation of the unit depends on:

- main processing: the CSR executes the commands received from the pipeline processing, previously controlled by the SCU. It also reacts to any hardware routines to serve exceptions and interrupts or following a return instruction.
- shadow processing: the CSR simulates the execution of the access instructions by supplying the values contained in the registers. The values contained in the registers are not modified unless explicitly commanded by the SCU. No interrupt or exception affects the CSR at this step. Any pending interrupt is served at the next main processing phase.

Writing to the internal registers can be disabled at any time by the SCU, which keeps a constant check on the CSR. This architectural difference allows the value sent to the registers to be blocked at any time. Loading incorrect values (in the TMR registers) is always prevented. The triple redundancy technique completely loses its effectiveness in the event of an error in the logic that sends the data.

The architecture of the PC ST unit differs from the original version of the core as it must guarantee the functioning of the shadow structure that requires up to two PCs simultaneously. The portion that manages the PC during the fetching phase of the shadow processing is located inside the SCU. The PC ST unit, in addition to supplying the correct PC value to the SCU, must manage the correct updating of the internal PC registers. This is done by receiving information on the location of the shadow thread within the Pipeline. Thanks to this information the unit can execute or block the updating of the PC registers. In the event of interrupts, exceptions or jumps during the execution phase of the Shadow processing, the unit locks the PC update system waiting for the main processing phase. If the conditions that triggered an interrupt, an exception or any request to change the program flow remain, the PC will update and start the procedure. This allows the SCU to stop updating the PC at any time in case of error. In this case indeed, the TMR protection of the inside registers is not able to correct the loading of an incorrect value.

### Chapter 8 Debug Support

Klessydra core supports common baseline debug features: halting the program flow, reading data register file, reading the PC value and enabling a single step execution. Software breakpoints are implemented by the RISC-V instruction EBREAK.

The debug operations are intended at core level and not per-thread. When entering debug mode, the whole core (i.e. with all its threads) enters debug mode. The internal debug unit accesses information related to the thread whose instruction is in the execution stage of the core pipeline in the current clock cycle.

The debug hardware interface is the same as the memory interface, but on separate buses. Every access to debug facilities is done by an access to debug registers.

To halt the core, the external debug unit has to set the DBG\_CTRL[16] (HALT) bit. If DBG\_CTRL[0] (SSTE) is set, the core is in single-step mode, so clearing the DBG\_HIT[0] bit enables the execution of a single instruction.

Debug registers are always accessible. The program counter and register file are accessible only when the core is halted. The register of the register file requested by the external debug unit is selected by bits [6:2] of the address.
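
The use of address bits [6:2] as a register selector can be made concrete with a small sketch. It takes the 0x400 base of the general purpose register window from Table 8.1; the helper names are hypothetical.

```python
DBG_GPR_BASE = 0x400  # first general purpose register address listed in Table 8.1


def gpr_debug_address(reg_index: int) -> int:
    """Debug address used to access register x<reg_index> while the core is halted."""
    if not 0 <= reg_index <= 31:
        raise ValueError("register index must be in 0..31")
    return DBG_GPR_BASE | (reg_index << 2)


def gpr_index_from_address(address: int) -> int:
    """Recover the register index from bits [6:2] of a debug address."""
    return (address >> 2) & 0x1F
```

Register x31 maps to 0x47C, the last address of the window, because 31 shifted left by two bits is 0x7C.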

Table.8.1 Debug Registers Description

| Address     | Name        | Description               |
|-------------|-------------|---------------------------|
| 0x00        | DBG_CTRL    | Debug Control             |
| 0x04        | DBG_HIT     | Debug Hit                 |
| 0x2000      | DBG_NPC     | Next PC                   |
| 0x2004      | DBG_PPC     | Previous PC               |
| 0x400-0x47C | GPR(x0-x31) | General Purpose Registers |

Table.8.2 Debug Control register bit map

| Bit # | R/W | Description                                                                                          |
|-------|-----|------------------------------------------------------------------------------------------------------|
| 16    | R/W | HALT bit: When set to '1', the core enters debug mode, when reset to '0', the core exits debug mode. |
| 0     | R/W | SSTE bit: Single-step enable bit.                                                                    |

Table.8.3 Debug Hit register bit map

| Bit # | R/W | Description                                                                                                       |
|-------|-----|-------------------------------------------------------------------------------------------------------------------|
| 0     | R/W | SSTH: Single-step hit, sticky bit that must be cleared by external debugger in order to execute next instruction. |

Table.8.4 Debug Next Program Counter register bit map

| Bit # | R/W | Description                 |
|-------|-----|-----------------------------|
| 31:0  | R/W | NPC: Next PC to be executed |

Table.8.5 Debug Previous Program Counter register bit map

| Bit # | R/W | Description                        |
|-------|-----|------------------------------------|
| 31:0  | R/W | PPC: Previous PC, already executed |

# Chapter 9 Instruction Set

#### 9.1 Integer Register-Immediate operations

Table.9.1 Register-Immediate operations

| Name                                | Binary format type | Assembly syntax     |
|-------------------------------------|--------------------|---------------------|
| ADDI - add immediate                | I                  | ADDI rd, rs1, imm   |
| SLTI - set if less immediate        | I                  | SLTI rd, rs1, imm   |
| SLTIU - set if less imm. uns.       | I                  | SLTIU rd, rs1, imm  |
| ANDI - and immediate                | I                  | ANDI rd, rs1, imm   |
| ORI - or immediate                  | I                  | ORI rd, rs1, imm    |
| XORI - excl. or immediate           | I                  | XORI rd, rs1, imm   |
| SLLI - shift left logical imm.      | I                  | SLLI rd, rs1, shamt |
| SRLI - shift right logical imm.     | I                  | SRLI rd, rs1, shamt |
| SRAI - shift right arithm. imm.     | I                  | SRAI rd, rs1, shamt |
| LUI - load upper immediate          | U                  | LUI rd, imm         |
| AUIPC - add upper imm. to pc        | U                  | AUIPC rd, imm       |

- ADDI adds the sign-extended 12-bit immediate to register rs1. Arithmetic overflow is ignored and the result is simply the low 32 bits of the result. ADDI *rd*, *rs1*, 0 can be used to implement a register move operation.
- SLTI places the value 1 in register *rd* if register rs1 is less than the sign-extended immediate when both are treated as signed numbers, else 0 is written to rd. SLTIU is similar but compares the values as unsigned numbers.
- ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and XOR on register *rs1* and the sign-extended 12-bit immediate and place the result in *rd*. Notably, XORI *rd*, *rs1*, *-1* performs a bitwise logical inversion of register *rs1*.
- SLLI is a logical left shift (zeros are shifted into the lower bits); SRLI is a logical right shift (zeros are shifted into the upper bits); and SRAI is an arithmetic right shift (the original sign bit is copied into the vacated upper bits). The operand to be shifted is in *rs1*, and the shift amount is encoded in the lower 5 bits of the I-immediate field.
- LUI is used to build 32-bit constants. LUI places the U-immediate value in the top 20 bits of the destination register *rd*, filling in the lowest 12 bits with zeros.
- AUIPC is used to build PC-relative addresses. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the PC, then places the result in register *rd*.
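
Because ADDI sign-extends its 12-bit immediate, building an arbitrary 32-bit constant with LUI followed by ADDI requires the upper part to be adjusted whenever bit 11 of the constant is set. The sketch below illustrates that arithmetic; the function names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def split_constant(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into (LUI immediate, ADDI immediate)."""
    value &= 0xFFFFFFFF
    lo = sign_extend(value, 12)            # ADDI sign-extends its 12-bit immediate
    hi = ((value - lo) >> 12) & 0xFFFFF    # compensate when lo is negative
    return hi, lo


def lui_addi(hi: int, lo: int) -> int:
    """Value left in rd after LUI rd, hi followed by ADDI rd, rd, lo."""
    return ((hi << 12) + lo) & 0xFFFFFFFF
```

For example, `split_constant(0x12345FFF)` returns `(0x12346, -1)`: the upper immediate is one larger than the top 20 bits of the constant, and the ADDI of -1 brings the value back down.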

#### 9.2 Integer Register-Register Operations

Table.9.2 Register-Register Operations

| Name                         | Binary format type | Assembly syntax   |
|------------------------------|--------------------|-------------------|
| ADD - add                    | R                  | ADD rd, rs1, rs2  |
| SLT - set if less            | R                  | SLT rd, rs1, rs2  |
| SLTU - set if less unsigned  | R                  | SLTU rd, rs1, rs2 |
| AND - and                    | R                  | AND rd, rs1, rs2  |
| OR - or                      | R                  | OR rd, rs1, rs2   |
| XOR - exclusive or           | R                  | XOR rd, rs1, rs2  |
| SLL - shift left logical     | R                  | SLL rd, rs1, rs2  |
| SRL - shift right logical    | R                  | SRL rd, rs1, rs2  |
| SUB - subtract               | R                  | SUB rd, rs1, rs2  |
| SRA - shift right arithmetic | R                  | SRA rd, rs1, rs2  |
- ADD and SUB perform addition and subtraction respectively. Overflows are ignored and the low 32 bits of results are written to the destination.
- SLT and SLTU perform signed and unsigned compares respectively, writing 1 to rd if *rs1* < *rs2*, 0 otherwise. Note, SLTU *rd*, *x0*, *rs2* sets *rd* to 1 if *rs2* is not equal to zero, otherwise sets *rd* to zero (assembler pseudo-op SNEZ *rd*, *rs*).
- AND, OR, and XOR perform bitwise logical operations.
- SLL, SRL, and SRA perform logical left, logical right, and arithmetic right shifts on the value in register *rs1* by the shift amount held in the lower 5 bits of register *rs2*.
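
The difference between the signed and unsigned comparisons and between the logical and arithmetic shifts can be checked with a small reference model. Register values are held as unsigned 32-bit integers; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit register value as a two's complement number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def slt(rs1: int, rs2: int) -> int:
    return int(to_signed(rs1) < to_signed(rs2))


def sltu(rs1: int, rs2: int) -> int:
    return int((rs1 & MASK32) < (rs2 & MASK32))


def sll(rs1: int, rs2: int) -> int:
    return (rs1 << (rs2 & 0x1F)) & MASK32      # only the low 5 bits of rs2 are used


def srl(rs1: int, rs2: int) -> int:
    return (rs1 & MASK32) >> (rs2 & 0x1F)      # zeros enter from the top


def sra(rs1: int, rs2: int) -> int:
    return (to_signed(rs1) >> (rs2 & 0x1F)) & MASK32  # sign bit is replicated
```

Calling `sltu(0, x)` returns 1 exactly when `x` is non-zero, which is the SNEZ idiom mentioned above.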

#### 9.3 Unconditional Jumps

Table.9.3 Unconditional Jumps

| Name                        | Binary format type | Assembly syntax   |
|-----------------------------|--------------------|-------------------|
| JAL - jump and link         | UJ                 | JAL rd, imm       |
| JALR - jump to reg and link | I                  | JALR rd, rs1, imm |

- The jump and link (JAL) instruction uses the J-immediate to encode a signed offset in multiples of 2 bytes. The offset is sign-extended and added to the pc to form the jump target address. Jumps can therefore target a  $\pm 1$  MiB range. JAL stores the address of the instruction following the jump (PC+4) into register *rd*. Plain unconditional jumps are encoded as a JAL with rd = x0.
- The indirect jump instruction JALR (jump and link register) obtains the target address by adding the 12-bit signed I-immediate to the register *rs1*, then setting the least-significant bit of the result to zero. The address of the instruction following the jump (PC+4) is written to register rd. Register *x0* can be used as the destination if the result is not required.
- The JAL and JALR instructions will generate a misaligned instruction fetch exception if the target address is not aligned to a four-byte boundary.
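
The target and link address computations for the two jumps can be sketched as follows. The offsets are assumed to be already decoded from the instruction word, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def jal(pc: int, offset: int) -> tuple[int, int]:
    """Return (target, link) for JAL; offset is the decoded J-immediate in bytes."""
    if offset % 2 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("JAL offset must be even and within +/-1 MiB")
    return (pc + offset) & MASK32, (pc + 4) & MASK32


def jalr(pc: int, rs1: int, imm12: int) -> tuple[int, int]:
    """Return (target, link) for JALR; the least-significant target bit is cleared."""
    target = (rs1 + sign_extend(imm12, 12)) & MASK32 & ~1
    return target, (pc + 4) & MASK32


def is_misaligned(target: int) -> bool:
    """True when the target would raise a misaligned instruction fetch exception."""
    return target % 4 != 0
```

Clearing bit 0 in JALR does not guarantee four-byte alignment: a target such as 0x2002 still fails the `is_misaligned` check.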

#### 9.4 Conditional Branches

Table.9.4 Branches

| Name                                        | Binary format type | Assembly syntax    |
|---------------------------------------------|--------------------|--------------------|
| BEQ - branch if equal                       | SB                 | BEQ rs1, rs2, imm  |
| BNE - branch if not equal                   | SB                 | BNE rs1, rs2, imm  |
| BLT - branch if less                        | SB                 | BLT rs1, rs2, imm  |
| BGE - branch if greater or equal            | SB                 | BGE rs1, rs2, imm  |
| BLTU - branch if less, unsigned             | SB                 | BLTU rs1, rs2, imm |
| BGEU - branch if greater or equal, unsigned | SB                 | BGEU rs1, rs2, imm |

- BEQ and BNE take the branch if registers *rs1* and *rs2* are equal or unequal respectively.
- BLT and BLTU take the branch if *rs1* is less than *rs2*, using signed and unsigned comparison respectively.
- BGE and BGEU take the branch if *rs1* is greater than or equal to *rs2*, using signed and unsigned comparison respectively.

- All branch instructions use the 12-bit B-immediate to encode signed offsets in multiples of 2, and add the offset to the current PC to give the target address. The conditional branch range is ±4 KiB.
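
The branch conditions and the offset range can be captured in a short reference model. Register values are unsigned 32-bit integers and the offset is assumed to be already decoded; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit register value as a two's complement number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def branch_taken(op: str, rs1: int, rs2: int) -> bool:
    """Evaluate the condition of a conditional branch instruction."""
    u1, u2 = rs1 & MASK32, rs2 & MASK32
    s1, s2 = to_signed(rs1), to_signed(rs2)
    conditions = {
        "BEQ": u1 == u2,
        "BNE": u1 != u2,
        "BLT": s1 < s2,
        "BGE": s1 >= s2,
        "BLTU": u1 < u2,
        "BGEU": u1 >= u2,
    }
    return conditions[op]


def branch_target(pc: int, offset: int) -> int:
    """Target address of a taken branch; offset is in bytes."""
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError("branch offset must be even and within +/-4 KiB")
    return (pc + offset) & MASK32
```

The value 0xFFFFFFFF makes the signed and unsigned variants diverge: it is -1 for BLT and BGE but the largest possible value for BLTU and BGEU.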

#### 9.5 Memory access Instructions

Table.9.5 Load-Store Instructions

| Name                        | Binary format type | Assembly syntax  |
|-----------------------------|--------------------|------------------|
| LB - load byte              | I                  | LB rd, rs1, imm  |
| LH - load half word         | I                  | LH rd, rs1, imm  |
| LW - load word              | I                  | LW rd, rs1, imm  |
| LBU - load byte unsigned    | I                  | LBU rd, rs1, imm |
| LHU - load half word unsig. | I                  | LHU rd, rs1, imm |
| SB - store byte             | S                  | SB rs1, rs2, imm |
| SH - store half word        | S                  | SH rs1, rs2, imm |
| SW - store word             | S                  | SW rs1, rs2, imm |
- Load and store instructions transfer a value between the registers and memory. Loads are encoded in the I-type format and stores are S-type. The effective byte address is obtained by adding register rs1 to the sign-extended 12-bit offset. Loads copy a value from memory to register rd. Stores copy the value in register rs2 to memory.
- The LW instruction loads a 32-bit value from memory into rd. LH loads a 16-bit value from memory, then sign-extends it to 32 bits before storing it in rd. LHU loads a 16-bit value from memory but then zero-extends it to 32 bits before storing it in rd. LB and LBU are defined analogously for 8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and 8-bit values from the low bits of register rs2 to memory.
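
The sign and zero extension rules can be illustrated with a small byte-addressed memory model. It uses little-endian byte order, as in the RISC-V base ISA; the `load` and `store` helpers are hypothetical names.

```python
def load(mem, addr: int, size: int, signed: bool) -> int:
    """Model LB/LH/LW (signed=True) and LBU/LHU (signed=False); size is in bytes."""
    raw = int.from_bytes(mem[addr:addr + size], "little")
    if signed and raw & (1 << (8 * size - 1)):
        raw -= 1 << (8 * size)            # sign-extend from the loaded width
    return raw & 0xFFFFFFFF               # value as held in the 32-bit register rd


def store(mem: bytearray, addr: int, size: int, rs2: int) -> None:
    """Model SB/SH/SW: write the low `size` bytes of rs2 to memory."""
    low_bits = rs2 & ((1 << (8 * size)) - 1)
    mem[addr:addr + size] = low_bits.to_bytes(size, "little")
```

With the bytes 0x80, 0xFF at address 0, a signed byte load yields 0xFFFFFF80 while the unsigned byte load yields 0x00000080.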

#### 9.6 CSR Instructions (Read-Set-Clear)

| Name                      | Binary format type | Assembly syntax     |
|---------------------------|--------------------|---------------------|
| CSRRW - csr read/write    | I                  | CSRRW rd, csr, rs1  |
| CSRRS - csr read & set    | I                  | CSRRS rd, csr, rs1  |
| CSRRC - csr read & clear  | I                  | CSRRC rd, csr, rs1  |
| CSRRWI - csr rd/wr. imm.  | I                  | CSRRWI rd, csr, imm |
| CSRRSI - csr rd & set imm | I                  | CSRRSI rd, csr, imm |
| CSRRCI - csr rd & clr imm | I                  | CSRRCI rd, csr, imm |
- The CSRRW instruction atomically swaps values in the CSRs and integer registers. CSRRW reads the old value of the CSR, zero-extends the value to 32 bits, then writes it to integer register rd. The initial value in rs1 is written to the CSR. If rd=x0, then the instruction shall not read the CSR and shall not cause any of the side-effects that might occur on a CSR read.
- The CSRRS instruction reads the value of the CSR, zero-extends the value to 32 bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be set in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be set in the CSR, if that CSR bit is writable. Other bits in the CSR are unaffected (though CSRs might have side effects when written).
- The CSRRC instruction reads the value of the CSR, zero-extends the value to 32 bits, and writes it to integer register rd. The initial value in integer register rs1 is treated as a bit mask that specifies bit positions to be cleared in the CSR. Any bit that is high in rs1 will cause the corresponding bit to be cleared in the CSR, if that CSR bit is writable. Other bits in the CSR are unaffected.

- For both CSRRS and CSRRC, if *rs1*=x0, then the instruction will not write to the CSR at all, and so shall not cause any of the side effects that might otherwise occur on a CSR write, such as raising illegal instruction exceptions on accesses to read-only CSRs. Note that if rs1 specifies a register holding a zero value other than x0, the instruction will still attempt to write the unmodified value back to the CSR and will cause any attendant side effects.
- The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and CSRRC respectively, except that they update the CSR using a 32-bit value obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field encoded in the rs1 field instead of a value from an integer register. For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these instructions will not write to the CSR, and shall not cause any of the side effects that might otherwise occur on a CSR write. For CSRRWI, if rd=x0, then the instruction shall not read the CSR and shall not cause any of the side effects that might occur on a CSR read.
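
The read, set, and clear semantics can be condensed into one small function. This is a simplified sketch: it models only the data path, not the suppression of reads or writes when rd or rs1 is x0, and the `writable_mask` parameter is a hypothetical way of expressing which CSR bits are writable.

```python
def csr_op(csr_value: int, op: str, operand: int, writable_mask: int = 0xFFFFFFFF):
    """Return (value written to rd, new CSR value) for CSRRW, CSRRS or CSRRC.

    operand is the value of rs1, or the zero-extended uimm for the immediate forms.
    """
    old = csr_value & 0xFFFFFFFF
    operand &= 0xFFFFFFFF
    if op == "CSRRW":
        new = operand
    elif op == "CSRRS":
        new = old | operand             # set the bits that are high in the mask
    elif op == "CSRRC":
        new = old & ~operand            # clear the bits that are high in the mask
    else:
        raise ValueError("unknown CSR operation: " + op)
    new = (old & ~writable_mask) | (new & writable_mask)  # read-only bits keep their value
    return old, new & 0xFFFFFFFF
```

In every case the old CSR value is what reaches rd, so a CSRRS with a zero mask acts as a plain CSR read.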

#### 9.7 CSR Privileged Instructions

| Name                     | Binary format type | Assembly syntax |
|--------------------------|--------------------|-----------------|
| ECALL – environment call |                    | ECALL           |
| EBREAK – break to envir. |                    | EBREAK          |
| WFI – wait for IRQ       |                    | WFI             |
| MRET – machine return    |                    | MRET            |

- The ECALL instruction is used to make a request to the supporting execution environment, which is usually an operating system. The ABI for the system will define how parameters for the environment request are passed, but usually these will be in defined locations in the integer register file.
- The EBREAK instruction is presently implemented in the S0 core only (future update in T0 cores and T1 cores).
- The WFI is a wait for interrupt instruction, that latches the thread in an idle state until an interrupt arrives.
- The MRET instruction updates the program counter with the address of the instruction that was being executed before entering the trap handling routine. If that instruction was a WFI, execution instead returns to the address after it.

#### **9.8 Atomic Instructions**

Table.9.8 Atomic Instructions

| Name         | Binary format type | Assembly syntax           |
|--------------|--------------------|---------------------------|
| AMOSWAP.W.AQ | R                  | AMOSWAP.W.AQ rd, rs1, rs2 |
| AMOSWAP.W.RL | R                  | AMOSWAP.W.RL rd, rs1, rs2 |

• The atomic memory operations AMOSWAP.W atomically load a data value from the address in *rs1*, place the value into register *rd*, apply a swap between the loaded value and the original value in *rs2*, then store the swapped value to the address in *rs1*.

The implementation follows "release consistency". The AMOSWAP.W.AQ instruction implements a read-modify-write operation suited to lock acquiring, while the AMOSWAP.W.RL instruction implements a read-modify-write operation suited to lock releasing.
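
The lock acquire and release pattern built on an atomic swap can be sketched with a single-threaded model. Memory is a dictionary from address to word, the lock convention (0 free, 1 taken) is an assumption of the example, and the function names are hypothetical; the acquire and release ordering itself is not modelled.

```python
def amoswap_w(mem: dict, addr: int, rs2_value: int) -> int:
    """Swap the word at addr with rs2_value and return the old word (goes to rd)."""
    old = mem[addr]
    mem[addr] = rs2_value & 0xFFFFFFFF
    return old


def try_acquire(mem: dict, lock_addr: int) -> bool:
    """AMOSWAP.W.AQ usage: write 1 and succeed only if the lock was free (0)."""
    return amoswap_w(mem, lock_addr, 1) == 0


def release(mem: dict, lock_addr: int) -> None:
    """AMOSWAP.W.RL usage: write 0 to free the lock."""
    amoswap_w(mem, lock_addr, 0)
```

A hart that fails `try_acquire` has not disturbed the lock, since it wrote 1 over an existing 1, so it can simply retry.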

The S0 core does not support Atomic Instructions.

**Table.9.9 Klessydra custom extensions**

| Name     | Binary format type | Assembly syntax       |
|----------|--------------------|-----------------------|
| KMEMLD   | R                  | kmemld rd, rs1, rs2   |
| KMEMSTR  | R                  | kmemstr rd, rs1, rs2  |
| KADDV    | R                  | kaddv rd, rs1, rs2    |
| KSUBV    | R                  | ksubv rd, rs1, rs2    |
| KVMUL    | R                  | kvmul rd, rs1, rs2    |
| KVRED    | R                  | kvred rd, rs1, rs2    |
| KDOTP    | R                  | kdotp rd, rs1, rs2    |
| KSVADDSC | R                  | ksvaddsc rd, rs1, rs2 |
| KSVADDRF | R                  | ksvaddrf rd, rs1, rs2 |
| KSVMULSC | R                  | ksvmulsc rd, rs1, rs2 |
| KSVMULRF | R                  | ksvmulrf rd, rs1, rs2 |
| KDOTPPS  | R                  | kdotpps rd, rs1, rs2  |
| KSRLV    | R                  | ksrlv rd, rs1, rs2    |
| KSRAV    | R                  | ksrav rd, rs1, rs2    |
| KRELU    | R                  | krelu rd, rs1, rs2    |
| KBCAST   | R                  | kbcast rd, rs1        |
| KVCP     | R                  | kvcp rd, rs1          |

#### **9.9 Klessydra Custom Extensions (T1 version only)**

- KMEMLD: loads the number of bytes specified by 'rs2' in the scratchpad memory at address 'rd', from the address 'rs1' in the main memory.
- KMEMSTR: stores the number of bytes specified by 'rs2' in the main memory at address 'rs1', from the address 'rd' in the scratchpad memory.
- KADDV: adds the operands in the scratchpad at addresses in 'rs1' and in 'rs2' and stores the result as a vector at the address 'rd' in the scratchpad memory.
- KSUBV: subtracts the operands in the scratchpad at addresses in 'rs1' and in 'rs2' and stores the result as a vector at the address 'rd' in the scratchpad memory.
- KVMUL: multiplies the elements of the vectors at the addresses in 'rs1' and 'rs2' and stores the resulting vector at the address in 'rd'.
- KVRED: performs a vector reduction on the elements at the addresses in 'rs1' and 'rs2', and stores the resulting scalar at the address in 'rd'.
- KDOTP: multiplies the operands at the addresses in 'rs1' and 'rs2', accumulates the intermediate products, and stores the final result as a scalar at the address in 'rd'.
- KDOTPPS: performs a post-scaling dot product on the elements at the addresses in 'rs1' and 'rs2' and stores the result at the address in 'rd'. The multiplication result is shifted by the value set in the CSR register 'MPSCLFAC'.
- KSVADDSC/RF: adds the scalar operand in the scratchpad (SC) or register file (RF) address in 'rs1' to a scalar value that is in 'rs2'. The result is stored as a vector at the address in 'rd'. (A faster alternative to using KBCAST.)
- KSVMULSC/RF: multiplies the scalar operand in the scratchpad (SC) or register file (RF) in 'rs1' by a scalar value that is in 'rs2'. The result is stored as a vector at the address in 'rd'. (A faster alternative to using KBCAST.)
- KSRLV/KSRAV: performs a logical/arithmetic right shift on the vector at the address in 'rs1' by the shift amount in 'rs2' and stores the resulting vector at the address in 'rd'.
- KRELU: applies linear rectification to the vector at the address in 'rs1', replacing its negative values with zero, and stores the rectified vector at the address in 'rd'.

- KBCAST: does a vector broadcast of the scalar value contained in scalar register 'rs1' to the vector at the address in 'rd'.
- KVCP: copies the vectors starting at the address in 'rs1' to the address in 'rd'. Both addresses are in scratchpad memory space.

All logical-arithmetic vector instructions should be used in conjunction with the CSR register 'MVSIZE', which specifies the size of the vector to be processed by the operation.
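To make the operand roles of the vector instructions more concrete, the following behavioural sketch models four of them (KBCAST, KADDV, KRELU, KDOTP) on a scratchpad represented as a plain Python list. It is an illustration only, with simplifying assumptions that do not come from the hardware: addresses are element indices rather than byte addresses, `mvsize` counts elements rather than bytes, elements are unbounded Python integers (no 8/16/32-bit wrap-around), and the function names simply mirror the mnemonics.

```python
def kbcast(spm, rd, scalar, mvsize):
    """Broadcast a scalar register value to the vector starting at spm[rd]."""
    for i in range(mvsize):
        spm[rd + i] = scalar


def kaddv(spm, rd, rs1, rs2, mvsize):
    """Element-wise addition of the vectors at rs1 and rs2, result vector at rd."""
    for i in range(mvsize):
        spm[rd + i] = spm[rs1 + i] + spm[rs2 + i]


def krelu(spm, rd, rs1, mvsize):
    """Linear rectification: negative elements become zero."""
    for i in range(mvsize):
        spm[rd + i] = max(spm[rs1 + i], 0)


def kdotp(spm, rd, rs1, rs2, mvsize):
    """Dot product: accumulate the element products, store one scalar at rd."""
    acc = 0
    for i in range(mvsize):
        acc += spm[rs1 + i] * spm[rs2 + i]
    spm[rd] = acc


# Usage: the vector length plays the role of the MVSIZE CSR.
spm = [0] * 16
spm[0:4] = [1, -2, 3, -4]
kbcast(spm, 4, 2, 4)      # spm[4:8]  holds the broadcast scalar
kaddv(spm, 8, 0, 4, 4)    # spm[8:12] holds the element-wise sum
krelu(spm, 8, 8, 4)       # rectify the sum in place
kdotp(spm, 12, 0, 4, 4)   # spm[12]   holds the dot product
```

In the real core the same sequence would be preceded by a CSR write to 'MVSIZE', since every call above shares the same vector length.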

### Appendix B

### T13 VHDL Code

This appendix includes some of the main RTL files of the Klessydra T13. Not all files have been included, in order to keep this thesis more compact. The language is VHDL-2008.

One important note: the term DSP refers to the SPMU. Earlier implementations of the unit were designed to make a DSP; however, the term was later changed to SPMU.

Another note: SC is the earlier abbreviation of scratchpad memory, which is now known as SPM.

The sources included are the package file and the SPE, SPI, and SPM entities. All sources can be found on GitHub [31][32][33].

### 1. Package file Parameters

```
library ieee;
use ieee.math_real.all;
use ieee.std_logic_1164.all;

package thread_parameters_klessydra is

  type array_2d     is array (integer range <>) of std_logic_vector;
  type array_3d     is array (integer range <>) of array_2d;
  type array_2d_int is array (integer range <>) of integer;

  constant THREAD_ID_SIZE       : integer := 4;
  constant THREAD_POOL_SIZE     : integer := 3; -- Changing the TPS to less than "number of pipeline stages-1" is not allowed. And making it bigger than "pipeline stages-1" is okay but not recommended
  constant NOP_POOL_SIZE        : integer := 2; -- should be static and not touched, unless the number of pipeline stages changes; presently unused
  constant BRANCHING_DELAY_SLOT : integer := 3; -- should be static and not touched, unless the number of pipeline stages changes
  constant HARC_SIZE            : integer := THREAD_POOL_SIZE; -- for the moment we do not implement "nop" threads

  subtype harc_range is integer range THREAD_POOL_SIZE - 1 downto 0; -- will be used for replicated units in the core

  constant RF_SIZE               : natural := 32; -- Regfile size, can be set to 32 for RV32I or 16 for RV32E
  constant RV32M                 : natural := 0;  -- Enable the M-extension of the risc-v instruction set
  constant accl_en               : natural := 0;  -- Enable the generation of the special purpose accelerator
  constant replicate_accl_en     : natural := 0;  -- Set to 1 to replicate the accelerator for every thread
  constant multithreaded_accl_en : natural := 0;  -- Set to 1 to let the replicated accelerator share the functional units (note: replicate_accl_en must be set to '1')
  constant SPM_NUM               : natural := 4;  -- The number of scratchpads available "Minimum allowed is two"
  constant Addr_Width            : natural := 14; -- This address is for scratchpads. Setting this will make the size of the spm to be: "2^Addr_Width -1"
  constant SPM_STRT_ADDR         : std_logic_vector(31 downto 0) := x"1000_0000"; -- This is the starting address of the spms, it shouldn't be bigger than 2^32, and shouldn't overlap any sections in the memory map
  constant SIMD                  : natural := 1;  -- Changing the SIMD would change the number of the functional units in the dsp, and the number of banks in the spms (can be power of 2 only e.g. 1,2,4,8)
  constant MCYCLE_EN             : natural := 0;  -- Can be set to 1 or 0 only. Setting to zero will disable MCYCLE and MCYCLEH
  constant MINSTRET_EN           : natural := 0;  -- Can be set to 1 or 0 only. Setting to zero will disable MINSTRET and MINSTRETH
  constant MHPMCOUNTER_EN        : natural := 0;  -- Can be set to 1 or 0 only. Setting to zero will disable all performance counters except "MCYCLE/H" and "MINSTRET/H"

  constant RF_CEIL      : natural := integer(ceil(log2(real(RF_SIZE))));
  constant TPS_CEIL     : natural := integer(ceil(log2(real(THREAD_POOL_SIZE))));
  constant TPS_BUF_CEIL : natural := integer(ceil(log2(real(THREAD_POOL_SIZE-1))));
  constant SPM_ADDR_WID : natural := integer(ceil(log2(real(SPM_NUM+1))));
  constant SIMD_BITS    : natural := integer(ceil(log2(real(SIMD))));
  constant Data_Width   : natural := 32;
  constant SIMD_Width   : natural := SIMD*Data_Width;
  constant XLEN         : natural := 32; -- use this instead of Data_Width, the name is shorter and more convenient

  constant ACCL_NUM : natural := (THREAD_POOL_SIZE - (THREAD_POOL_SIZE-1)*(1-replicate_accl_en));
  constant FU_NUM   : natural := (ACCL_NUM - (ACCL_NUM-1)*(multithreaded_accl_en));

  subtype accl_range is integer range ACCL_NUM - 1 downto 0; -- will be used for replicated accelerators in the core
  subtype fu_range   is integer range FU_NUM - 1 downto 0;   -- will be used for replicated accelerators in the core

  type fsm_IE_states is (sleep, reset, normal, csr_instr_wait_state, debug);
  type mul_states    is (mult, accum);
  type div_states    is (init, divide);
  type fsm_LS_states is (normal, data_valid_waiting);

  constant dsp_init      : std_logic_vector(1 downto 0) := "00";
  constant dsp_halt_hart : std_logic_vector(1 downto 0) := "01";
  constant dsp_exec      : std_logic_vector(1 downto 0) := "10";

end package;
```
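The derived constants of the package follow mechanically from the configuration constants, and the `ACCL_NUM`/`FU_NUM` expressions are compact enough to be easy to misread. The sketch below recomputes them in Python so their behaviour can be checked by hand. It is an illustrative model, not part of the RTL: the function name `derived_parameters` and its keyword arguments are hypothetical, the defaults copy the values listed in the package, and the data width is fixed at 32 as in the listing.

```python
from math import ceil, log2


def derived_parameters(thread_pool_size=3, rf_size=32, spm_num=4, simd=1,
                       replicate_accl_en=0, multithreaded_accl_en=0):
    """Recompute the derived package constants from the configuration constants."""
    # One accelerator in total, or one per thread when replication is enabled.
    accl_num = thread_pool_size - (thread_pool_size - 1) * (1 - replicate_accl_en)
    # Replicated accelerators collapse to one shared set of functional units
    # when the multithreaded (shared-FU) mode is enabled.
    fu_num = accl_num - (accl_num - 1) * multithreaded_accl_en
    return {
        "RF_CEIL": ceil(log2(rf_size)),
        "TPS_CEIL": ceil(log2(thread_pool_size)),
        "TPS_BUF_CEIL": ceil(log2(thread_pool_size - 1)),
        "SPM_ADDR_WID": ceil(log2(spm_num + 1)),
        "SIMD_BITS": ceil(log2(simd)),
        "SIMD_Width": simd * 32,
        "ACCL_NUM": accl_num,
        "FU_NUM": fu_num,
    }


default_cfg = derived_parameters()
replicated_cfg = derived_parameters(replicate_accl_en=1)
shared_fu_cfg = derived_parameters(replicate_accl_en=1, multithreaded_accl_en=1)
```

With three threads, the three calls give (`ACCL_NUM`, `FU_NUM`) pairs of (1, 1), (3, 3), and (3, 1), which correspond to a single shared accelerator, a fully replicated accelerator, and replicated accelerator front ends sharing one set of functional units.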

### 2. SPE Unit

```
-- ieee packages ------
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_misc.all;
use ieee.numeric_std.all;
use std.textio.all;

-- local packages -----
use work.riscv_klessydra.all;
use work.thread_parameters_klessydra.all;

-- DSP pinout -----
entity DSP_Unit is
  port (
    -- Core Signals
    clk_i, rst_ni           : in  std_logic;
    -- Processing Pipeline Signals
    rs1_to_sc               : in  std_logic_vector(SPM_ADDR_WID-1 downto 0);
    rs2_to_sc               : in  std_logic_vector(SPM_ADDR_WID-1 downto 0);
    rd_to_sc                : in  std_logic_vector(SPM_ADDR_WID-1 downto 0);
    -- CSR Signals
    MVSIZE                  : in  array_2d(harc_range)(Addr_Width downto 0);
    MVTYPE                  : in  array_2d(harc_range)(3 downto 0);
    MPSCLFAC                : in  array_2d(harc_range)(4 downto 0);
    dsp_except_data         : out array_2d(accl_range)(31 downto 0);
    -- Program Counter Signals
    dsp_taken_branch        : out std_logic_vector(accl_range);
    dsp_except_condition    : out std_logic_vector(accl_range);
    -- ID_Stage Signals
    decoded_instruction_DSP : in  std_logic_vector(DSP_UNIT_INSTR_SET_SIZE-1 downto 0);
    harc_EXEC               : in  harc_range;
    pc_IE                   : in  std_logic_vector(31 downto 0);
    RS1_Data_IE             : in  std_logic_vector(31 downto 0);
    RS2_Data_IE             : in  std_logic_vector(31 downto 0);
    RD_Data_IE              : in  std_logic_vector(Addr_Width -1 downto 0);
    dsp_instr_req           : in  std_logic_vector(accl_range);
    spm_rs1                 : in  std_logic;
    spm_rs2                 : in  std_logic;
    vec_read_rs1_ID         : in  std_logic;
    vec_read_rs2_ID         : in  std_logic;
    vec_write_rd_ID         : in  std_logic;
    busy_dsp                : out std_logic_vector(accl_range);
    -- Scratchpad Interface Signals
    dsp_data_gnt_i          : in  std_logic_vector(accl_range);
    dsp_sci_wr_gnt          : in  std_logic_vector(accl_range);
    dsp_sc_data_read        : in  array_3d(accl_range)(1 downto 0)(SIMD_Width-1 downto 0);
    dsp_we_word             : out array_2d(accl_range)(SIMD-1 downto 0);
    dsp_sc_read_addr        : out array_3d(accl_range)(1 downto 0)(Addr_Width-1 downto 0);
    dsp_to_sc               : out array_3d(accl_range)(SPM_NUM-1 downto 0)(1 downto 0);
    dsp_sc_data_write_wire  : out array_2d(accl_range)(SIMD_Width-1 downto 0);
    dsp_sc_write_addr       : out array_2d(accl_range)(Addr_Width-1 downto 0);
    dsp_sci_we              : out array_2d(accl_range)(SPM_NUM-1 downto 0);
    dsp_sci_req             : out array_2d(accl_range)(SPM_NUM-1 downto 0);
    -- tracer signals
    state_DSP               : out array_2d(accl_range)(1 downto 0)
  );
end entity;

architecture DSP of DSP_Unit is

  signal nextstate_DSP : array_2d(accl_range)(1 downto 0);

  -- Virtual Parallelism Signals
  signal relu_en               : std_logic_vector(accl_range); -- enables the use of the ReLU unit
  signal shift_en              : std_logic_vector(accl_range); -- enables the use of the shifters
  signal add_en                : std_logic_vector(accl_range); -- enables the use of the adders
  signal mul_en                : std_logic_vector(accl_range); -- enables the use of the multipliers
  signal accum_en              : std_logic_vector(accl_range); -- enables the use of the accumulator
  signal relu_en_wire          : std_logic_vector(accl_range); -- enables the use of the ReLU unit
  signal shift_en_wire         : std_logic_vector(accl_range); -- enables the use of the shifters
  signal add_en_wire           : std_logic_vector(accl_range); -- enables the use of the adders
  signal mul_en_wire           : std_logic_vector(accl_range); -- enables the use of the multipliers
  signal accum_en_wire         : std_logic_vector(accl_range); -- enables the use of the accumulators
  signal add_en_pending_wire   : std_logic_vector(accl_range); -- signal to preserve the request to access the adder "multithreaded mode" only
  signal shift_en_pending_wire : std_logic_vector(accl_range); -- signal to preserve the request to access the shifter "multithreaded mode" only
  signal mul_en_pending_wire   : std_logic_vector(accl_range); -- signal to preserve the request to access the multiplier "multithreaded mode" only
  signal accum_en_pending_wire : std_logic_vector(accl_range); -- signal to preserve the request to access the accumulator "multithreaded mode" only
  signal relu_en_pending_wire  : std_logic_vector(accl_range); -- signal to preserve the request to access the ReLU "multithreaded mode" only
  signal add_en_pending        : std_logic_vector(accl_range); -- signal to preserve the request to access the adder "multithreaded mode" only
  signal shift_en_pending      : std_logic_vector(accl_range); -- signal to preserve the request to access the shifter "multithreaded mode" only
  signal mul_en_pending        : std_logic_vector(accl_range); -- signal to preserve the request to access the multiplier "multithreaded mode" only
  signal accum_en_pending      : std_logic_vector(accl_range); -- signal to preserve the request to access the accumulator "multithreaded mode" only
  signal relu_en_pending       : std_logic_vector(accl_range); -- signal to preserve the request to access the ReLU "multithreaded mode" only
  signal busy_add              : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_mul              : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_shf              : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_acc              : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_rel              : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_add_wire         : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_mul_wire         : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_shf_wire         : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_acc_wire         : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal busy_rel_wire         : std_logic; -- busy signal active only when the FU is shared and currently in use
  signal halt_hart             : std_logic_vector(accl_range); -- halts the thread when the requested functional unit is in use
  signal fu_req                : array_2D(accl_range)(4 downto 0); -- Each thread has request bits equal to the total number of FUs
  signal fu_gnt                : array_2D(accl_range)(4 downto 0); -- Each thread has grant bits equal to the total number of FUs
  signal fu_gnt_wire           : array_2D(accl_range)(4 downto 0); -- Each thread has grant bits equal to the total number of FUs
  signal fu_gnt_en             : array_2D(accl_range)(4 downto 0); -- Enable the giving of the grant to the thread pointed at by the issue buffer
  signal fu_rd_ptr             : array_2D(4 downto 0)(TPS_BUF_CEIL-1 downto 0); -- five rd pointers each has a number of bits equal to ceil(log2(THREAD_POOL_SIZE-1))
  signal fu_wr_ptr             : array_2D(4 downto 0)(TPS_BUF_CEIL-1 downto 0); -- five wr pointers each has a number of bits equal to ceil(log2(THREAD_POOL_SIZE-1))
  -- five buffers for each FU times the "TPS-1" and not "TPS" since there is always one thread active, and not needing a buffer. Each buffer holds the thread ID "TPS_CEIL"
  signal fu_issue_buffer       : array_3D(4 downto 0)(THREAD_POOL_SIZE-2 downto 0)(TPS_CEIL-1 downto 0);

  -- Functional Unit Ports ---
  --signal dsp_in_sign_bits              : array_2d(accl_range)(4*SIMD-1 downto 0); -- vivado unsynthesizable, but more efficient alternative
  signal dsp_in_shifter_operand          : array_2d(fu_range)(SIMD_Width -1 downto 0);
  signal dsp_in_shifter_operand_lat      : array_2d(fu_range)(SIMD_Width -1 downto 0); -- 15 bits because i only want to latch the signed bits
  signal dsp_in_shifter_operand_lat_wire : array_2d(fu_range)(SIMD_Width -1 downto 0);
  signal dsp_int_shifter_operand         : array_2d(fu_range)(SIMD_Width -1 downto 0);
  signal dsp_out_shifter_results         : array_2d(fu_range)(SIMD_Width -1 downto 0);
```


```
  signal dsp_in_relu_operands       : array_2d(fu_range)(SIMD_Width-1 downto 0);
  signal dsp_in_mul_operands        : array_3d(fu_range)(1 downto 0)(SIMD_Width-1 downto 0);
  signal dsp_out_mul_results        : array_2d(fu_range)(SIMD_Width-1 downto 0);
  signal dsp_out_relu_results       : array_2d(fu_range)(SIMD_Width-1 downto 0);
  signal dsp_in_accum_operands      : array_2d(fu_range)(SIMD_Width-1 downto 0);
  signal dsp_out_accum_results      : array_2d(fu_range)(31 downto 0);
  signal dsp_in_adder_operands      : array_3d(fu_range)(1 downto 0)(SIMD_Width-1 downto 0);
  signal dsp_in_adder_operands_lat  : array_3d(fu_range)(1 downto 0)(SIMD_Width/2 -1 downto 0); -- data_Width divided by the number of pipeline stages
  signal dsp_out_adder_results      : array_2d(fu_range)(SIMD_Width-1 downto 0);
  signal carry_8_wire               : array_2d(fu_range)(SIMD-1 downto 0); -- carry-out bit of the "dsp_add_8_0" signal
  signal carry_16_wire              : array_2d(fu_range)(SIMD-1 downto 0); -- carry-out bit of the "dsp_add_16_8" signal
  signal carry_16                   : array_2d(fu_range)(SIMD-1 downto 0); -- carry-out bit of the "dsp_add_16_8" signal
  signal carry_24_wire              : array_2d(fu_range)(SIMD-1 downto 0); -- carry-out bit of the "dsp_add_24_16" signal
  signal dsp_add_8_0                : array_3d(fu_range)(SIMD-1 downto 0)(8 downto 0); -- 9-bits, contains the results of 8-bit adders
  signal dsp_add_16_8               : array_3d(fu_range)(SIMD-1 downto 0)(8 downto 0); -- 9-bits, contains the results of 8-bit adders
  signal dsp_add_8_0_wire           : array_3d(fu_range)(SIMD-1 downto 0)(8 downto 0); -- 9-bits, contains the results of 8-bit adders
  signal dsp_add_16_8_wire          : array_3d(fu_range)(SIMD-1 downto 0)(8 downto 0); -- 9-bits, contains the results of 8-bit adders
  signal dsp_add_24_16_wire         : array_3d(fu_range)(SIMD-1 downto 0)(8 downto 0); -- 9-bits, contains the results of 8-bit adders
  signal dsp_add_32_24_wire         : array_3d(fu_range)(SIMD-1 downto 0)(8 downto 0); -- 9-bits, this should be 8 if we choose to discard the overflow of the addition of the upper byte
  signal mul_tmp_a                  : array_3d(fu_range)(SIMD-1 downto 0)(Data_Width-1 downto 0);
  signal mul_tmp_b                  : array_3d(fu_range)(SIMD-1 downto 0)(Data_Width-1 downto 0);
  signal mul_tmp_c                  : array_3d(fu_range)(SIMD-1 downto 0)(Data_Width-1 downto 0);
  signal mul_tmp_d                  : array_3d(fu_range)(SIMD-1 downto 0)(Data_Width-1 downto 0);
  signal dsp_mul_a                  : array_2d(fu_range)(SIMD_Width -1 downto 0); -- Contains the results of the 16-bit multipliers
  signal dsp_mul_b                  : array_2d(fu_range)(SIMD_Width -1 downto 0); -- Contains the results of the 16-bit multipliers
  signal dsp_mul_c                  : array_2d(fu_range)(SIMD_Width -1 downto 0); -- Contains the results of the 16-bit multipliers
  signal dsp_mul_d                  : array_2d(fu_range)(SIMD_Width -1 downto 0); -- Contains the results of the 16-bit multipliers
  signal carry_pass                 : array_2d(accl_range)(2 downto 0);  -- carry enable signal, depending on its configuration, we can do KADDV8, KADDV16, KADDV32
  signal FUNCT_SELECT_MASK          : array_2d(accl_range)(31 downto 0); -- when the mask is set to "FFFFFFFF" we enable KDOTP32 execution using the 16-bit muls
  signal twos_complement            : array_2d(accl_range)(31 downto 0);
  signal dsp_shift_enabler          : array_2d(accl_range)(15 downto 0);
  signal dsp_in_shift_amount        : array_2d(accl_range)(4 downto 0);
  signal dsp_sc_data_write_wire_int : array_2d(accl_range)(SIMD_Width-1 downto 0);
  signal dsp_sc_data_write_int      : array_2d(accl_range)(SIMD_Width-1 downto 0);
  signal MVTYPE_DSP                 : array_2d(accl_range)(1 downto 0);
  signal vec_write_rd_DSP           : std_logic_vector(accl_range); -- Indicates whether the result being written is a vector or a scalar
  signal vec_read_rs1_DSP           : std_logic_vector(accl_range); -- Indicates whether the operand being read is a vector or a scalar
  signal vec_read_rs2_DSP           : std_logic_vector(accl_range); -- Indicates whether the operand being read is a vector or a scalar
  signal dotp                       : std_logic_vector(accl_range); -- indicator used in the pipeline handler to switch functional units
  signal dotpps                     : std_logic_vector(accl_range); -- indicator used in the pipeline handler to switch functional units
  signal wb_ready                   : std_logic_vector(accl_range);
  signal halt_dsp                   : std_logic_vector(accl_range);
  signal halt_dsp_lat               : std_logic_vector(accl_range);
  signal recover_state              : std_logic_vector(accl_range);
  signal recover_state_wires        : std_logic_vector(accl_range);
  signal dsp_data_gnt_i_lat         : std_logic_vector(accl_range);
  signal shifter_stage_1_en         : std_logic_vector(accl_range);
  signal shifter_stage_2_en         : std_logic_vector(accl_range);
  signal shifter_stage_3_en         : std_logic_vector(accl_range);
  signal adder_stage_1_en           : std_logic_vector(accl_range);
  signal adder_stage_2_en           : std_logic_vector(accl_range);
  signal adder_stage_3_en           : std_logic_vector(accl_range);
  signal mul_stage_1_en             : std_logic_vector(accl_range);
  signal mul_stage_2_en             : std_logic_vector(accl_range);
  signal mul_stage_3_en             : std_logic_vector(accl_range);
  signal relu_stage_1_en            : std_logic_vector(accl_range);
  signal relu_stage_2_en            : std_logic_vector(accl_range);
  signal accum_stage_1_en           : std_logic_vector(accl_range);
  signal accum_stage_2_en           : std_logic_vector(accl_range);
  signal accum_stage_3_en           : std_logic_vector(accl_range);
  signal dsp_except_data_wire       : array_2d(accl_range)(31 downto 0);
  signal decoded_instruction_DSP_lat : array_2d(accl_range)(DSP_UNIT_INSTR_SET_SIZE -1 downto 0);
  signal overflow_rs1_sc            : array_2d(accl_range)(Addr_Width downto 0);
  signal overflow_rs2_sc            : array_2d(accl_range)(Addr_Width downto 0);
  signal overflow_rd_sc             : array_2d(accl_range)(Addr_Width downto 0);
  signal dsp_rs1_to_sc              : array_2d(accl_range)(SPM_ADDR_WID-1 downto 0);
  signal dsp_rs2_to_sc              : array_2d(accl_range)(SPM_ADDR_WID-1 downto 0);
  signal dsp_rd_to_sc               : array_2d(accl_range)(SPM_ADDR_WID-1 downto 0);
  signal dsp_sc_data_read_mask      : array_2d(accl_range)(SIMD_Width-1 downto 0);
```

```
  signal RS1_Data_IE_lat       : array_2d(accl_range)(31 downto 0);
  signal RS2_Data_IE_lat       : array_2d(accl_range)(31 downto 0);
  signal RD_Data_IE_lat        : array_2d(accl_range)(Addr_Width -1 downto 0);
  signal MVSIZE_READ           : array_2d(accl_range)(Addr_Width downto 0); -- Bytes remaining to read
  signal MVSIZE_READ_MASK      : array_2d(accl_range)(Addr_Width downto 0); -- Bytes remaining to read
  signal MVSIZE_WRITE          : array_2d(accl_range)(Addr_Width downto 0); -- Bytes remaining to write
  signal MPSCLFAC_DSP          : array_2d(accl_range)(4 downto 0);
  signal busy_dsp_internal     : std_logic_vector(accl_range);
  signal busy_DSP_internal_lat : std_logic_vector(accl_range);
  signal rf_rs2                : std_logic_vector(accl_range);
  signal SIMD_RD_BYTES_wire    : array_2d_int(accl_range);
  signal SIMD_RD_BYTES         : array_2d_int(accl_range);

  component ACCUMULATOR
    port (
      clk_i                       : in  std_logic;
      rst_ni                      : in  std_logic;
      MVTYPE_DSP                  : in  array_2d(accl_range)(1 downto 0);
      accum_stage_1_en            : in  std_logic_vector(accl_range);
      accum_stage_2_en            : in  std_logic_vector(accl_range);
      recover_state_wires         : in  std_logic_vector(accl_range);
      halt_dsp_lat                : in  std_logic_vector(accl_range);
      state_DSP                   : in  array_2d(accl_range)(1 downto 0);
      decoded_instruction_DSP_lat : in  array_2d(accl_range)(DSP_UNIT_INSTR_SET_SIZE -1 downto 0);
      dsp_in_accum_operands       : in  array_2d(fu_range)(SIMD_Width-1 downto 0);
      dsp_out_accum_results       : out array_2d(fu_range)(31 downto 0)
    );
  end component;

----- DSP BEGIN -----
begin

  busy_dsp <= busy_dsp_internal;

  DSP_replicated : for h in accl_range generate

  ----- Sequential Stage of DSP Unit -----
  DSP_Exec_Unit : process(clk_i, rst_ni) -- single cycle unit, fully synchronous
  begin
    if rst_ni = '0' then
      rf_rs2(h)        <= '0';
      dotpps(h)        <= '0';
      dotp(h)          <= '0';
      recover_state(h) <= '0';
    elsif rising_edge(clk_i) then
      if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
        case state_DSP(h) is
          when dsp_init =>
            FUNCT_SELECT_MASK(h) <= (others => '0');
            twos_complement(h)   <= (others => '0');
            rf_rs2(h)            <= '0';
            dotpps(h)            <= '0';
            dotp(h)              <= '0';
            -- Set signals to enable correct virtual parallelism operation
            if (decoded_instruction_DSP(KADDV_bit_position) = '1' or
                decoded_instruction_DSP(KSVADDSC_bit_position) = '1') and
                MVTYPE(h)(3 downto 2) = "10" then
              carry_pass(h) <= "111"; -- pass all carry_outs
            elsif decoded_instruction_DSP(KSVADDRF_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "10" then
              carry_pass(h) <= "111"; -- pass all carry_outs
              rf_rs2(h)     <= '1';
            elsif (decoded_instruction_DSP(KADDV_bit_position) = '1' or
                   decoded_instruction_DSP(KSVADDSC_bit_position) = '1') and
                   MVTYPE(h)(3 downto 2) = "01" then
              carry_pass(h) <= "101"; -- pass carries 9, and 25
            elsif decoded_instruction_DSP(KSVADDRF_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "01" then
              carry_pass(h) <= "101"; -- pass carries 9, and 25
              rf_rs2(h)     <= '1';
            elsif (decoded_instruction_DSP(KADDV_bit_position) = '1' or
                   decoded_instruction_DSP(KSVADDSC_bit_position) = '1') and
                   MVTYPE(h)(3 downto 2) = "00" then
              carry_pass(h) <= "000"; -- don't pass carry_outs and keep addition 8-bit
            elsif decoded_instruction_DSP(KSVADDRF_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "00" then
              carry_pass(h) <= "000"; -- don't pass carry_outs and keep addition 8-bit
              rf_rs2(h)     <= '1';
            elsif decoded_instruction_DSP(KSUBV_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "10" then
              carry_pass(h)      <= "111"; -- pass all carry_outs
              twos_complement(h) <= "00010001000100010001000100010001";
            elsif decoded_instruction_DSP(KSUBV_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "01" then
            elsif decoded_instruction_DSP(KSUBV_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "00" then
              carry_pass(h) <= "000"; -- don't pass carry_outs and keep addition 8-bit
            elsif decoded_instruction_DSP(KDOTP_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "10" then
              -- KDOTP32 does not use the adders of KADDV instructions but rather adds the mul_acc results using its own adders
              FUNCT_SELECT_MASK(h) <= (others => '1'); -- This enables 32-bit multiplication with the 16-bit multipliers
              dotp(h) <= '1';
            elsif decoded_instruction_DSP(KDOTP_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "01" then
              dotp(h) <= '1';
            elsif decoded_instruction_DSP(KDOTP_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "00" then
              dotp(h) <= '1';
            elsif decoded_instruction_DSP(KDOTPPS_bit_position) = '1' and
                  MVTYPE(h)(3 downto 2) = "10" then
              FUNCT_SELECT_MASK(h) <= (others => '1'); -- This enables 32-bit multiplication
```
with the 16-bit multipliers  $dotpps(h) \le '1';$ elsif decoded instruction DSP(KDOTPPS bit position) = '1' and MVTYPE(h)(3 downto 2) = "01" then $dotpps(h) \le '1';$ elsif decoded instruction DSP(KDOTPPS bit position) = '1' and MVTYPE(h)(3 downto 2) = "00" then $dotpps(h) \le '1';$ elsif decoded instruction DSP(KSVMULRF bit position) = '1' and MVTYPE(h)(3 downto 2) = "10" thenFUNCT SELECT MASK(h) <= (others => '1'); rf rs2(h) <= '1': elsif decoded instruction\_DSP(KSVMULRF\_bit\_position) = '1' and MVTYPE(h)(3 downto 2) = "01" thenrf rs2(h) <= '1'; elsif decoded instruction DSP(KSVMULRF bit position) = '1' and MVTYPE(h)(3 downto 2) = "00" then $rf rs2(h) \le 1'$ : elsif (decoded instruction DSP(KVMUL bit position) = '1' or decoded\_instruction\_DSP(KSVMULSC\_bit\_position) = '1') and MVTYPE(h)(3 downto 2) = "10" thenFUNCT\_SELECT\_MASK(h) <= (others => '1'); end if: -- We backup data from decode stage since they will get updated MVSIZE READ MASK(h) <= MVSIZE(harc EXEC); MVSIZE WRITE(h) <= MVSIZE(harc\_EXEC); MPSCLFAC\_DSP(h) <= MPSCLFAC(harc\_EXEC);

MVSIZE\_WRITE(h) <= MVSIZE(harc\_EXEC); MPSCLFAC\_DSP(h) <= MPSCLFAC(harc\_EXEC); MVTYPE\_DSP(h) <= MVTYPE(harc\_EXEC)(3 downto 2); decoded\_instruction\_DSP\_lat(h) <= decoded\_instruction\_DSP; vec\_write\_rd\_DSP(h) <= vec\_write\_rd\_ID; vec\_read\_rs1\_DSP(h) <= vec\_read\_rs1\_ID; vec\_read\_rs2\_DSP(h) <= vec\_read\_rs2\_ID; dsp\_rs1\_to\_sc(h) <= rs1\_to\_sc; dsp\_rs2\_to\_sc(h) <= rs2\_to\_sc; dsp\_rd\_to\_sc(h) <= rd\_to\_sc; RD\_Data\_IE\_lat(h) <= RD\_Data\_IE; -- Increment the read addresses

if dsp data gnt i(h) = '1' then if vec read rs1 ID = '1' then RS1 Data IE lat(h) <= std logic vector(unsigned(RS1 Data IE) + SIMD RD BYTES wire(h)); -- source 1 address increment else RS1 Data IE  $lat(h) \leq RS1$  Data IE; end if. if vec read rs2 ID = '1' then RS2 Data IE lat(h) <= std\_logic\_vector(unsigned(RS2\_Data\_IE) + SIMD\_RD\_BYTES\_wire(h)); -- source 2 address increment else RS2 Data IE  $lat(h) \leq RS2$  Data IE; end if; - Decrement the vector elements that have already been operated on if unsigned(MVSIZE(harc EXEC)) >= SIMD RD BYTES wire(h) then MVSIZE\_READ(h) <= std\_logic\_vector(unsigned(MVSIZE(harc\_EXEC)) - SIMD\_RD\_BYTES\_wire(h)); -- decrement by SIMD\_BYTE Execution Capability else MVSIZE READ(h)  $\leq$  (others = '0'); -- decrement the remaining bytes end if; else RS1 Data IE  $lat(h) \le RS1$  Data IE; RS2 Data IE lat(h) <= RS2 Data IE; MVSIZE\_READ(h) <= MVSIZE(harc\_EXEC); end if; when dsp exec => recover state(h) <= recover state wires(h); if halt\_dsp(h) = '1' and halt\_ $dsp_lat(h) = '0'$  then dsp\_sc\_data\_write\_int(h) <= dsp\_sc\_data\_write\_wire\_int(h); end if; if halt dsp(h) = '0' then -- Increment the write address when we have a result as a vector if vec write rd DSP(h) = '1' and wb ready(h) = '1' then RD\_Data\_IE\_lat(h) <= std\_logic\_vector(unsigned(RD\_Data\_IE\_lat(h)) + SIMD\_RD\_BYTES(h)); -- destination address increment end if; if wb ready(h) = '1' then if to integer(unsigned(MVSIZE WRITE(h))) >= SIMD RD BYTES(h) then MVSIZE\_WRITE(h) <= std\_logic\_vector(unsigned(MVSIZE\_WRITE(h)) - SIMD\_RD\_BYTES(h)); -- decrement by SIMD BYTE Execution Capability else MVSIZE WRITE(h)  $\leq$  (others  $\geq$  '0'); -- decrement the remaining bytes end if; end if; -- Increment the read addresses  $if to integer(unsigned(MVSIZE_READ(h))) >= SIMD_RD_BYTES(h) and dsp_data_gnt_i(h) = '1' then -- Increment the addresses untill all other states are straightforward and the states are straightforward and the states are 
straightforward and the st$ the vector elements are operated fetched if vec read rs1 DSP(h) = '1' then RS1\_Data\_IE\_lat(h) <= std\_logic\_vector(unsigned(RS1\_Data\_IE\_lat(h)) + SIMD\_RD\_BYTES(h)); -- source 1 address increment end if: if vec read rs2 DSP(h) = '1' then RS2 Data IE lat(h) <= std logic vector(unsigned(RS2 Data IE lat(h)) + SIMD RD BYTES(h)); -- source 2 address increment end if: end if; -- Decrement the vector elements that have already been operated on if dsp\_data\_gnt\_i(h) = '1' then if to integer(unsigned(MVSIZE\_READ(h))) >= SIMD\_RD\_BYTES(h) then  $MVSIZE READ(h) \le std logic vector(unsigned(MVSIZE READ(h)) - SIMD RD BYTES(h));$ -- decrement by SIMD BYTE Execution Capability else MVSIZE READ(h)  $\leq$  (others = '0'); -- decrement the remaining bytes end if; end if: dsp\_sc\_data\_read\_mask(h) <= (others => '0'); if dsp\_data\_gnt\_i\_lat(h) = '1' then if to integer(unsigned(MVSIZE READ MASK(h))) >= SIMD RD BYTES(h) then dsp\_sc\_data\_read\_mask(h) <= (others => '1'); MVSIZE READ MASK(h) <= std logic vector(unsigned(MVSIZE READ MASK(h)) - SIMD RD BYTES(h)); -- decrement by SIMD BYTE Execution Capability else MVSIZE READ MASK(h)  $\leq$  (others = '0'); dsp\_sc\_data\_read\_mask(h)(to\_integer(unsigned(MVSIZE\_READ\_MASK(h)))\*8 - 1 downto 0) <= (others => '1'); end if; end if: end if; when others => null:

```
        end case;
      end if;
    end if;
  end process;

  ------------ Combinational Stage of DSP Unit ------------
  DSP_Excpt_Cntrl_Unit_comb : process(all)
    variable busy_DSP_internal_wires    : std_logic;
    variable dsp_except_condition_wires : replicated_bit;
    variable dsp_taken_branch_wires     : replicated_bit;
  begin
    busy_DSP_internal_wires       := '0';
    dsp_except_condition_wires(h) := '0';
    dsp_taken_branch_wires(h)     := '0';
    wb_ready(h)                   <= '0';
    halt_dsp(h)                   <= '0';
    nextstate_DSP(h)              <= dsp_init;
    recover_state_wires(h)        <= recover_state(h);
    dsp_except_data_wire(h)       <= dsp_except_data(h);
    overflow_rs1_sc(h)            <= (others => '0');
    overflow_rs2_sc(h)            <= (others => '0');
    overflow_rd_sc(h)             <= (others => '0');
    dsp_we_word(h)                <= (others => '0');
    dsp_sci_req(h)                <= (others => '0');
    dsp_sci_we(h)                 <= (others => '0');
    dsp_sc_write_addr(h)          <= (others => '0');
    dsp_sc_read_addr(h)           <= (others => (others => '0'));
    dsp_to_sc(h)                  <= (others => (others => '0'));
    if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
      case state_DSP(h) is
        when dsp_init =>
          overflow_rs1_sc(h) <= std_logic_vector('0' & unsigned(RS1_Data_IE(Addr_Width -1 downto 0)) + unsigned(MVSIZE(harc_EXEC)) -1);
          overflow_rs2_sc(h) <= std_logic_vector('0' & unsigned(RS2_Data_IE(Addr_Width -1 downto 0)) + unsigned(MVSIZE(harc_EXEC)) -1);
          overflow_rd_sc(h)  <= std_logic_vector('0' & unsigned(RD_Data_IE(Addr_Width -1 downto 0)) + unsigned(MVSIZE(harc_EXEC)) -1);
          if MVSIZE(harc_EXEC) = (0 to Addr_Width => '0') then
            null;
          elsif MVSIZE(harc_EXEC)(1 downto 0) /= "00" and MVTYPE(harc_EXEC)(3 downto 2) = "10" then
            -- Set exception if the number of bytes is not divisible by four
            dsp_except_condition_wires(h) := '1';
            dsp_taken_branch_wires(h)     := '1';
            dsp_except_data_wire(h)       <= ILLEGAL_VECTOR_SIZE_EXCEPT_CODE;
          elsif MVSIZE(harc_EXEC)(0) /= '0' and MVTYPE(harc_EXEC)(3 downto 2) = "01" then
            -- Set exception if the number of bytes is not divisible by two
            dsp_except_condition_wires(h) := '1';
            dsp_taken_branch_wires(h)     := '1';
            dsp_except_data_wire(h)       <= ILLEGAL_VECTOR_SIZE_EXCEPT_CODE;
          elsif (rs1_to_sc = "100" and vec_read_rs1_ID = '1') or
                (rs2_to_sc = "100" and vec_read_rs2_ID = '1') or
                 rd_to_sc = "100" then
            -- Set exception for non scratchpad access
            dsp_except_condition_wires(h) := '1';
            dsp_taken_branch_wires(h)     := '1';
            dsp_except_data_wire(h)       <= ILLEGAL_ADDRESS_EXCEPT_CODE;
          elsif rs1_to_sc = rs2_to_sc and vec_read_rs1_ID = '1' and vec_read_rs2_ID = '1' then
            -- Set exception for same read access
            dsp_except_condition_wires(h) := '1';
            dsp_taken_branch_wires(h)     := '1';
            dsp_except_data_wire(h)       <= READ_SAME_SCARTCHPAD_EXCEPT_CODE;
          elsif (overflow_rs1_sc(h)(Addr_Width) = '1' and vec_read_rs1_ID = '1') or
                (overflow_rs2_sc(h)(Addr_Width) = '1' and vec_read_rs2_ID = '1') then
            -- Set exception if reading overflows the scratchpad's address
            dsp_except_condition_wires(h) := '1';
            dsp_taken_branch_wires(h)     := '1';
            dsp_except_data_wire(h)       <= SCRATCHPAD_OVERFLOW_EXCEPT_CODE;
          elsif overflow_rd_sc(h)(Addr_Width) = '1' and vec_write_rd_ID = '1' then
            -- Set exception if reading overflows the scratchpad's address, scalar writes are excluded
            dsp_except_condition_wires(h) := '1';
            dsp_taken_branch_wires(h)     := '1';
            dsp_except_data_wire(h)       <= SCRATCHPAD_OVERFLOW_EXCEPT_CODE;
          else
            if halt_hart(h) = '0' then
              nextstate_DSP(h) <= dsp_exec;
            else
              nextstate_DSP(h) <= dsp_halt_hart;
            end if;
            busy_DSP_internal_wires := '1';
          end if;
          if rs1_to_sc /= "100" and spm_rs1 = '1' and halt_hart(h) = '0' then
            dsp_sci_req(h)(to_integer(unsigned(rs1_to_sc)))  <= '1';
            dsp_to_sc(h)(to_integer(unsigned(rs1_to_sc)))(0) <= '1';
            dsp_sc_read_addr(h)(0) <= RS1_Data_IE(Addr_Width-1 downto 0);
          end if;
          -- Do not send a read request if the second operand accesses the same spm as the first
          if rs2_to_sc /= "100" and spm_rs2 = '1' and rs1_to_sc /= rs2_to_sc and halt_hart(h) = '0' then
            dsp_sci_req(h)(to_integer(unsigned(rs2_to_sc)))  <= '1';
            dsp_to_sc(h)(to_integer(unsigned(rs2_to_sc)))(1) <= '1';
            dsp_sc_read_addr(h)(1) <= RS2_Data_IE(Addr_Width-1 downto 0);
          end if;
        when dsp_halt_hart =>
          if halt_hart(h) = '0' then
            nextstate_DSP(h) <= dsp_exec;
          else
            nextstate_DSP(h) <= dsp_halt_hart;
          end if;
          busy_DSP_internal_wires := '1';
        when dsp_exec =>
          if (dsp_sci_wr_gnt(h) = '0' and wb_ready(h) = '1') then
            halt_dsp(h)            <= '1';
            recover_state_wires(h) <= '1';
          elsif unsigned(MVSIZE_WRITE(h)) <= SIMD_RD_BYTES(h) then
            recover_state_wires(h) <= '0';
          end if;
          if vec_write_rd_DSP(h) = '1' and dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) = '1' then
            if unsigned(MVSIZE_WRITE(h)) >= (SIMD)*4+1 then
              dsp_we_word(h) <= (others => '1');
            elsif unsigned(MVSIZE_WRITE(h)) >= 1 then
              for i in 0 to SIMD-1 loop
                if i <= to_integer(unsigned(MVSIZE_WRITE(h))-1)/4 then -- Four because of the number of bytes per word
                  if to_integer(unsigned(dsp_sc_write_addr(h)(SIMD_BITS+1 downto 0))/4 + i) < SIMD then
                    dsp_we_word(h)(to_integer(unsigned(dsp_sc_write_addr(h)(SIMD_BITS+1 downto 0))/4 + i)) <= '1';
                  elsif to_integer(unsigned(dsp_sc_write_addr(h)(SIMD_BITS+1 downto 0))/4 + i) >= SIMD then
                    dsp_we_word(h)(to_integer(unsigned(dsp_sc_write_addr(h)(SIMD_BITS+1 downto 0))/4 + i - SIMD)) <= '1';
                  end if;
                end if;
              end loop;
            end if;
          elsif vec_write_rd_DSP(h) = '0' and dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) = '1' then
            dsp_we_word(h)(to_integer(unsigned(dsp_sc_write_addr(h)(SIMD_BITS+1 downto 0))/4)) <= '1';
          end if;
          if decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' then
            -- KBCAST signals are handled here
            if MVSIZE_WRITE(h) > (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
            wb_ready(h) <= '1';
            dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
            dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
          end if;
          if decoded_instruction_DSP_lat(h)(KVCP_bit_position) = '1' then
            -- KVCP signals are handled here
            if adder_stage_3_en(h) = '1' then
              wb_ready(h) <= '1';
            elsif recover_state(h) = '1' then
              wb_ready(h) <= '1';
            end if;
            if MVSIZE_READ(h) > (0 to Addr_Width => '0') then
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))(0) <= '1';
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))  <= '1';
              dsp_sc_read_addr(h)(0) <= RS1_Data_IE_lat(h)(Addr_Width - 1 downto 0);
            end if;
            if MVSIZE_WRITE(h) > (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
```

```
            if wb_ready(h) = '1' then
              dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
              dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
            end if;
          end if;
          if decoded_instruction_DSP_lat(h)(KRELU_bit_position) = '1' then
            -- KRELU signals are handled here
            if relu_stage_2_en(h) = '1' then
              wb_ready(h) <= '1';
            elsif recover_state(h) = '1' then
              wb_ready(h) <= '1';
            end if;
            if MVSIZE_READ(h) > (0 to Addr_Width => '0') then
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))(0) <= '1';
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))  <= '1';
              dsp_sc_read_addr(h)(0) <= RS1_Data_IE_lat(h)(Addr_Width - 1 downto 0);
            end if;
            if MVSIZE_WRITE(h) > (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
            if wb_ready(h) = '1' then
              dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
              dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
            end if;
          end if;
          if decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or
             decoded_instruction_DSP_lat(h)(KSRLV_bit_position) = '1' then
            -- KSRAV signals are handled here
            if shifter_stage_3_en(h) = '1' then
              wb_ready(h) <= '1';
            elsif recover_state(h) = '1' then
              wb_ready(h) <= '1';
            end if;
            if MVSIZE_READ(h) > (0 to Addr_Width => '0') then
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))(0) <= '1';
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))  <= '1';
              dsp_sc_read_addr(h)(0) <= RS1_Data_IE_lat(h)(Addr_Width - 1 downto 0);
            end if;
            if MVSIZE_WRITE(h) > (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
            if wb_ready(h) = '1' then
              dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
              dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
            end if;
          end if;
          if decoded_instruction_DSP_lat(h)(KADDV_bit_position) = '1' or
             decoded_instruction_DSP_lat(h)(KSUBV_bit_position) = '1' then
            -- KADDV and KSUBV signals are handled here
            if adder_stage_3_en(h) = '1' then
              wb_ready(h) <= '1';
            elsif recover_state(h) = '1' then
              wb_ready(h) <= '1';
            end if;
            if MVSIZE_READ(h) > (0 to Addr_Width => '0') then
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))(0) <= '1';
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs2_to_sc(h))))(1) <= '1';
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))  <= '1';
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs2_to_sc(h))))  <= '1';
              dsp_sc_read_addr(h)(0) <= RS1_Data_IE_lat(h)(Addr_Width - 1 downto 0);
              dsp_sc_read_addr(h)(1) <= RS2_Data_IE_lat(h)(Addr_Width - 1 downto 0);
            end if;
            if MVSIZE_WRITE(h) > (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
            if wb_ready(h) = '1' then
              dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
              dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
            end if;
          end if;
          if decoded_instruction_DSP_lat(h)(KVRED_bit_position)   = '1' or
             decoded_instruction_DSP_lat(h)(KDOTP_bit_position)   = '1' or
             decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1' then
            -- KDOTP signals are handled here
            if accum_stage_3_en(h) = '1' then
              wb_ready(h) <= '1';
            elsif recover_state(h) = '1' then
              wb_ready(h) <= '1';
            end if;
            if MVSIZE_READ(h) > (0 to Addr_Width => '0') then
              if vec_read_rs2_DSP(h) = '1' then
                dsp_sci_req(h)(to_integer(unsigned(dsp_rs2_to_sc(h))))  <= '1';
                dsp_to_sc(h)(to_integer(unsigned(dsp_rs2_to_sc(h))))(1) <= '1';
                dsp_sc_read_addr(h)(1) <= RS2_Data_IE_lat(h)(Addr_Width - 1 downto 0);
              end if;
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))  <= '1';
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))(0) <= '1';
              dsp_sc_read_addr(h)(0) <= RS1_Data_IE_lat(h)(Addr_Width - 1 downto 0);
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            elsif MVSIZE_WRITE(h) = (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_init;
            else
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
            if wb_ready(h) = '1' then
              dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
              dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
            end if;
          end if;
          if decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
             decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1' or
             decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
             decoded_instruction_DSP_lat(h)(KSVADDSC_bit_position) = '1' or
             decoded_instruction_DSP_lat(h)(KSVADDRF_bit_position) = '1' then
            -- KMUL signals are handled here
            if mul_stage_3_en(h) = '1' or adder_stage_3_en(h) = '1' then
              wb_ready(h) <= '1';
            elsif recover_state(h) = '1' then
              wb_ready(h) <= '1';
            end if;
            if MVSIZE_READ(h) > (0 to Addr_Width => '0') then
              dsp_sci_req(h)(to_integer(unsigned(dsp_rs1_to_sc(h)))) <= '1';
              if rf_rs2(h) = '0' then -- if the scalar does not come from the regfile
                dsp_sci_req(h)(to_integer(unsigned(dsp_rs2_to_sc(h))))  <= '1';
                dsp_to_sc(h)(to_integer(unsigned(dsp_rs2_to_sc(h))))(1) <= '1';
                dsp_sc_read_addr(h)(1) <= RS2_Data_IE_lat(h)(Addr_Width - 1 downto 0);
              end if;
              dsp_to_sc(h)(to_integer(unsigned(dsp_rs1_to_sc(h))))(0) <= '1';
              dsp_sc_read_addr(h)(0) <= RS1_Data_IE_lat(h)(Addr_Width - 1 downto 0);
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            elsif MVSIZE_WRITE(h) = (0 to Addr_Width => '0') then
              nextstate_DSP(h) <= dsp_init;
            else
              nextstate_DSP(h) <= dsp_exec;
              busy_DSP_internal_wires := '1';
            end if;
            if wb_ready(h) = '1' then
              dsp_sci_we(h)(to_integer(unsigned(dsp_rd_to_sc(h)))) <= '1';
              dsp_sc_write_addr(h) <= RD_Data_IE_lat(h);
            end if;
          end if;
        when others =>
          null;
      end case;
    end if;
    busy_DSP_internal(h)    <= busy_DSP_internal_wires;
    dsp_except_condition(h) <= dsp_except_condition_wires(h);
    dsp_taken_branch(h)     <= dsp_taken_branch_wires(h);
  end process;
```

fsm\_DSP\_pipeline\_controller : process(clk\_i, rst\_ni) begin

```
    if rst_ni = '0' then
      dsp_data_gnt_i_lat(h)    <= '0';
      adder_stage_1_en(h)      <= '0';
      adder_stage_2_en(h)      <= '0';
      adder_stage_3_en(h)      <= '0';
      shifter_stage_1_en(h)    <= '0';
      shifter_stage_2_en(h)    <= '0';
      mul_stage_1_en(h)        <= '0';
      mul_stage_2_en(h)        <= '0';
      mul_stage_3_en(h)        <= '0';
      accum_stage_1_en(h)      <= '0';
      accum_stage_2_en(h)      <= '0';
      accum_stage_3_en(h)      <= '0';
      relu_stage_1_en(h)       <= '0';
      relu_stage_2_en(h)       <= '0';
      busy_DSP_internal_lat(h) <= '0';
      state_DSP(h)             <= dsp_init;
    elsif rising_edge(clk_i) then
      dsp_data_gnt_i_lat(h) <= dsp_data_gnt_i(h);
      adder_stage_1_en(h)   <= dsp_data_gnt_i_lat(h) and add_en(h);
      adder_stage_2_en(h)   <= adder_stage_1_en(h);
      adder_stage_3_en(h)   <= adder_stage_2_en(h);
      mul_stage_1_en(h)     <= dsp_data_gnt_i_lat(h) and mul_en(h);
      mul_stage_2_en(h)     <= mul_stage_1_en(h);
      mul_stage_3_en(h)     <= mul_stage_2_en(h);
      relu_stage_1_en(h)    <= dsp_data_gnt_i_lat(h) and relu_en(h);
      relu_stage_2_en(h)    <= relu_stage_1_en(h);
      accum_stage_2_en(h)   <= accum_stage_1_en(h);
      accum_stage_3_en(h)   <= accum_stage_2_en(h);
      if dotpps(h) = '1' then
        shifter_stage_1_en(h) <= mul_stage_2_en(h);
        shifter_stage_2_en(h) <= shifter_stage_1_en(h);
        accum_stage_1_en(h)   <= shifter_stage_2_en(h);
      elsif dotp(h) = '1' then
        accum_stage_1_en(h)   <= mul_stage_2_en(h);
      else
        shifter_stage_1_en(h) <= dsp_data_gnt_i_lat(h) and shift_en(h);
        shifter_stage_2_en(h) <= shifter_stage_1_en(h);
        shifter_stage_3_en(h) <= shifter_stage_2_en(h);
        accum_stage_1_en(h)   <= dsp_data_gnt_i_lat(h) and accum_en(h);
      end if;
      halt_dsp_lat(h)          <= halt_dsp(h);
      state_DSP(h)             <= nextstate_DSP(h);
      busy_DSP_internal_lat(h) <= busy_DSP_internal(h);
      SIMD_RD_BYTES(h)         <= SIMD_RD_BYTES_wire(h);
      dsp_except_data(h)       <= dsp_except_data_wire(h);
    end if;
  end process;

  DSP_FU_ENABLER_SYNC : process(clk_i, rst_ni)
  begin
    if rst_ni = '0' then
      shift_en(h)         <= '0';
      add_en(h)           <= '0';
      relu_en(h)          <= '0';
      accum_en(h)         <= '0';
      mul_en(h)           <= '0';
      add_en_pending(h)   <= '0';
      shift_en_pending(h) <= '0';
      mul_en_pending(h)   <= '0';
      accum_en_pending(h) <= '0';
      relu_en_pending(h)  <= '0';
    elsif rising_edge(clk_i) then
      shift_en(h)         <= shift_en_wire(h);
      add_en(h)           <= add_en_wire(h);
      relu_en(h)          <= relu_en_wire(h);
      accum_en(h)         <= accum_en_wire(h);
      mul_en(h)           <= mul_en_wire(h);
      add_en_pending(h)   <= add_en_pending_wire(h);
      shift_en_pending(h) <= shift_en_pending_wire(h);
      mul_en_pending(h)   <= mul_en_pending_wire(h);
      accum_en_pending(h) <= accum_en_pending_wire(h);
      relu_en_pending(h)  <= relu_en_pending_wire(h);
    end if;
```

```
end process;
```

end generate DSP\_replicated;

```
FU_HANDLER_MC : if multithreaded_accl_en = 0 generate
  DSP_FU_ENABLER_comb : process(all)
  begin
    for h in accl_range loop
      shift_en_wire(h) <= shift_en(h);
      add_en_wire(h)   <= add_en(h);
      relu_en_wire(h)  <= relu_en(h);
      accum_en_wire(h) <= accum_en(h);
      mul_en_wire(h)   <= mul_en(h);
      halt_hart(h)     <= '0';
      if add_en(h) = '1' and busy_DSP_internal(h) = '0' then
        add_en_wire(h) <= '0';
      end if;
      if mul_en(h) = '1' and busy_DSP_internal(h) = '0' then
        mul_en_wire(h) <= '0';
      end if;
      if shift_en(h) = '1' and busy_DSP_internal(h) = '0' then
        shift_en_wire(h) <= '0';
      end if;
      if accum_en(h) = '1' and busy_DSP_internal(h) = '0' then
        accum_en_wire(h) <= '0';
      end if;
      if relu_en(h) = '1' and busy_DSP_internal(h) = '0' then
        relu_en_wire(h) <= '0';
      end if;
      if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
        case state_DSP(h) is
          when dsp_init =>
            -- Set signals to enable correct virtual parallelism operation
            if decoded_instruction_DSP(KADDV_bit_position)    = '1' or
               decoded_instruction_DSP(KSVADDSC_bit_position) = '1' or
               decoded_instruction_DSP(KSVADDRF_bit_position) = '1' or
               decoded_instruction_DSP(KSUBV_bit_position)    = '1' or
               decoded_instruction_DSP(KVCP_bit_position)     = '1' then
              add_en_wire(h) <= '1';
            elsif decoded_instruction_DSP(KDOTP_bit_position) = '1' then
              mul_en_wire(h)   <= '1';
              accum_en_wire(h) <= '1';
            elsif decoded_instruction_DSP(KDOTPPS_bit_position) = '1' then
              mul_en_wire(h)   <= '1';
              shift_en_wire(h) <= '1';
              accum_en_wire(h) <= '1';
            elsif decoded_instruction_DSP(KVRED_bit_position) = '1' then
              accum_en_wire(h) <= '1';
            elsif decoded_instruction_DSP(KSVMULRF_bit_position) = '1' or
                  decoded_instruction_DSP(KSVMULSC_bit_position) = '1' or
                  decoded_instruction_DSP(KVMUL_bit_position)    = '1' then
              mul_en_wire(h) <= '1';
            elsif decoded_instruction_DSP(KSRAV_bit_position) = '1' or
                  decoded_instruction_DSP(KSRLV_bit_position) = '1' then
              shift_en_wire(h) <= '1';
            elsif decoded_instruction_DSP(KRELU_bit_position) = '1' then
              relu_en_wire(h) <= '1';
            end if;
          when others =>
            null;
        end case;
      end if;
    end loop;
  end process;
end generate FU_HANDLER_MC;

FU_HANDLER_MT : if multithreaded_accl_en = 1 generate
  DSP_FU_ENABLER_comb : process(all)
  begin
    for h in accl_range loop
      shift_en_wire(h)         <= shift_en(h);
      add_en_wire(h)           <= add_en(h);
      relu_en_wire(h)          <= relu_en(h);
      accum_en_wire(h)         <= accum_en(h);
      mul_en_wire(h)           <= mul_en(h);
      add_en_pending_wire(h)   <= add_en_pending(h);
      shift_en_pending_wire(h) <= shift_en_pending(h);
      mul_en_pending_wire(h)   <= mul_en_pending(h);
      accum_en_pending_wire(h) <= accum_en_pending(h);
      relu_en_pending_wire(h)  <= relu_en_pending(h);
      fu_req(h)                <= (others => '0');
      halt_hart(h)             <= '0';
      if add_en(h) = '1' and busy_DSP_internal(h) = '0' then
        add_en_wire(h) <= '0';
      end if;
      if mul_en(h) = '1' and busy_DSP_internal(h) = '0' then
        mul_en_wire(h) <= '0';
      end if;
      if shift_en(h) = '1' and busy_DSP_internal(h) = '0' then
        shift_en_wire(h) <= '0';
      end if;
      if accum_en(h) = '1' and busy_DSP_internal(h) = '0' then
        accum_en_wire(h) <= '0';
      end if;
      if relu_en(h) = '1' and busy_DSP_internal(h) = '0' then
        relu_en_wire(h) <= '0';
      end if;
      if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
        case state_DSP(h) is
          when dsp_init =>
            -- Set signals to enable correct virtual parallelism operation
            if decoded_instruction_DSP(KADDV_bit_position)    = '1' or
               decoded_instruction_DSP(KSVADDSC_bit_position) = '1' or
               decoded_instruction_DSP(KSVADDRF_bit_position) = '1' or
               decoded_instruction_DSP(KSUBV_bit_position)    = '1' or
               decoded_instruction_DSP(KVCP_bit_position)     = '1' then
              if busy_add = '0' and add_en_pending = (accl_range => '0') then
                add_en_wire(h) <= '1';
              else
                add_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(0) <= '1';
              end if;
            elsif decoded_instruction_DSP(KDOTP_bit_position) = '1' then
              if busy_mul = '0' and busy_acc = '0' and
                 mul_en_pending = (accl_range => '0') and accum_en_pending = (accl_range => '0') then
                mul_en_wire(h)   <= '1';
                accum_en_wire(h) <= '1';
              else
                mul_en_pending_wire(h)   <= '1';
                accum_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(2) <= '1';
                fu_req(h)(3) <= '1';
              end if;
            elsif decoded_instruction_DSP(KDOTPPS_bit_position) = '1' then
              if busy_mul = '0' and busy_acc = '0' and busy_shf = '0' and
                 mul_en_pending = (accl_range => '0') and accum_en_pending = (accl_range => '0') and
                 shift_en_pending = (accl_range => '0') then
                mul_en_wire(h)   <= '1';
                shift_en_wire(h) <= '1';
                accum_en_wire(h) <= '1';
              else
                mul_en_pending_wire(h)   <= '1';
                shift_en_pending_wire(h) <= '1';
                accum_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(2) <= '1';
                fu_req(h)(1) <= '1';
                fu_req(h)(3) <= '1';
              end if;
            elsif decoded_instruction_DSP(KVRED_bit_position) = '1' then
              if busy_acc = '0' and accum_en_pending = (accl_range => '0') then
                accum_en_wire(h) <= '1';
              else
                accum_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(3) <= '1';
              end if;
```

```
            elsif decoded_instruction_DSP(KSVMULRF_bit_position) = '1' or
                  decoded_instruction_DSP(KSVMULSC_bit_position) = '1' or
                  decoded_instruction_DSP(KVMUL_bit_position)    = '1' then
              if busy_mul = '0' and mul_en_pending = (accl_range => '0') then
                mul_en_wire(h) <= '1';
              else
                mul_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(2) <= '1';
              end if;
            elsif decoded_instruction_DSP(KSRAV_bit_position) = '1' or
                  decoded_instruction_DSP(KSRLV_bit_position) = '1' then
              if busy_shf = '0' and shift_en_pending = (accl_range => '0') then
                shift_en_wire(h) <= '1';
              else
                shift_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(1) <= '1';
              end if;
            elsif decoded_instruction_DSP(KRELU_bit_position) = '1' then
              if busy_rel = '0' and relu_en_pending = (accl_range => '0') then
                relu_en_wire(h) <= '1';
              else
                relu_en_pending_wire(h) <= '1';
                halt_hart(h) <= '1';
                fu_req(h)(4) <= '1';
              end if;
            end if;
          when dsp_halt_hart =>
            if fu_gnt(h)(0) = '1' then
              add_en_wire(h)         <= '1';
              add_en_pending_wire(h) <= '0';
            elsif add_en_pending(h) = '1' and fu_gnt(h)(0) = '0' then
              halt_hart(h) <= '1';
            end if;
            if fu_gnt(h)(1) = '1' then
              shift_en_wire(h)         <= '1';
              shift_en_pending_wire(h) <= '0';
            elsif shift_en_pending(h) = '1' and fu_gnt(h)(1) = '0' then
              halt_hart(h) <= '1';
            end if;
            if fu_gnt(h)(2) = '1' then
              mul_en_wire(h)         <= '1';
              mul_en_pending_wire(h) <= '0';
            elsif mul_en_pending(h) = '1' and fu_gnt(h)(2) = '0' then
              halt_hart(h) <= '1';
            end if;
            if fu_gnt(h)(3) = '1' then
              accum_en_wire(h)         <= '1';
              accum_en_pending_wire(h) <= '0';
            elsif accum_en_pending(h) = '1' and fu_gnt(h)(3) = '0' then
              halt_hart(h) <= '1';
            end if;
            if fu_gnt(h)(4) = '1' then
              relu_en_wire(h)         <= '1';
              relu_en_pending_wire(h) <= '0';
            elsif relu_en_pending(h) = '1' and fu_gnt(h)(4) = '0' then
              halt_hart(h) <= '1';
            end if;
          when others =>
            null;
        end case;
      end if;
    end loop;
  end process;

  FU_Issue_Buffer_sync : process(clk_i, rst_ni)
  begin
    if rst_ni = '0' then
      fu_rd_ptr <= (others => (others => '0'));
      fu_wr_ptr <= (others => (others => '0'));
 elsif rising_edge(clk_i) then
  fu_gnt <= fu_gnt_wire;
  for h in accl_range loop
   for i in 0 to 4 loop -- Loop index 'i' is for the total number of different functional units (regardless what SIMD config is set)
    if fu_req(h)(i) = '1' then -- if a reservation was made, to use a functional unit
     --to_integer(unsigned(fu_issue_buffer(i)(to_integer(unsigned(fu_wr_ptr(i))))) <= h; -- store the thread_ID in its corresponding buffer at the fu_wr_ptr position
     --fu_issue_buffer(to_integer(unsigned(fu_wr_ptr(i))))(i) <= std_logic_vector(unsigned(h)); -- store the thread_ID in its corresponding buffer at the fu_wr_ptr position
     fu_issue_buffer(i)(to_integer(unsigned(fu_wr_ptr(i)))) <= std_logic_vector(to_unsigned(h,TPS_CEIL));
     if unsigned(fu_wr_ptr(i)) = THREAD_POOL_SIZE - 2 then -- increment the pointer wr logic
      fu_wr_ptr(i) <= (others => '0');
     else
      fu_wr_ptr(i) <= std_logic_vector(unsigned(fu_wr_ptr(i)) + 1);
     end if;
    end if;
    case state_DSP(h) is
     when dsp_halt_hart =>
      if fu_gnt_en(h)(i) = '1' then
       if unsigned(fu_rd_ptr(i)) = THREAD_POOL_SIZE - 2 then -- increment the read pointer
        fu_rd_ptr(i) <= (others => '0');
       else
        fu_rd_ptr(i) <= std_logic_vector(unsigned(fu_rd_ptr(i)) + 1);
       end if;
      end if;
     when others =>
      null;
    end case;
   end loop;
  end loop;
 end if;
end process;
FU_Issue_Buffer_comb : process(all)
begin
 for h in accl_range loop
  fu_gnt_wire(h) <= (others => '0');
  fu_gnt_en(h)   <= (others => '0');
  if add_en_pending_wire(h) = '1' and busy_add_wire = '0' then
   fu_gnt_en(h)(0) <= '1';
  end if;
  if shift_en_pending_wire(h) = '1' and busy_shf_wire = '0' then
   fu_gnt_en(h)(1) <= '1';
  end if;
  if mul_en_pending_wire(h) = '1' and busy_mul_wire = '0' then
   fu_gnt_en(h)(2) <= '1';
  end if;
  if accum_en_pending_wire(h) = '1' and busy_acc_wire = '0' then
   fu_gnt_en(h)(3) <= '1';
  end if;
  if relu_en_pending_wire(h) = '1' and busy_rel_wire = '0' then
   fu_gnt_en(h)(4) <= '1';
  end if;
  case state_DSP(h) is
   when dsp_halt_hart =>
    for i in 0 to 4 loop
     if fu_gnt_en(h)(i) = '1' then
      fu_gnt_wire(to_integer(unsigned(fu_issue_buffer(i)(to_integer(unsigned(fu_rd_ptr(i)))))))(i) <= '1'; -- give a grant to fu_gnt(h)(i), such that the 'h' index points to the thread in "fu_issue_buffer"
     end if;
    end loop;
   when others =>
    null;
  end case;
 end loop;
end process;
DSP_BUSY_FU_SYNC : process(clk_i, rst_ni)
begin
 if rst_ni = '0' then
 elsif rising_edge(clk_i) then
  busy_add <= busy_add_wire;
  busy_mul <= busy_mul_wire;
  busy_shf <= busy_shf_wire;
  busy_acc <= busy_acc_wire;
  busy_rel <= busy_rel_wire;
```

fu_gnt <= (others => (others => '0'));


end if;

end if; end process;

end generate FU_HANDLER_MT;
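
The functional-unit handler listed above can be read as one circular queue of hart IDs per functional unit: a hart that finds the unit busy stores its ID at the write pointer and halts, and when the unit becomes free the hart stored at the read pointer receives the grant. The following is a minimal behavioural sketch in Python, not the thesis RTL. The class `IssueBuffer`, its methods `request` and `grant`, and the `depth` parameter are hypothetical names introduced only for illustration, and the pending flags that keep the hardware from granting on an empty queue are left out.

```python
class IssueBuffer:
    """Circular queue of hart IDs waiting for one functional unit (behavioural model)."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth  # plays the role of fu_issue_buffer(i)
        self.wr_ptr = 0              # plays the role of fu_wr_ptr(i)
        self.rd_ptr = 0              # plays the role of fu_rd_ptr(i)

    def request(self, hart_id):
        """A hart found the unit busy: store its ID, then advance and wrap the write pointer."""
        self.slots[self.wr_ptr] = hart_id
        self.wr_ptr = (self.wr_ptr + 1) % self.depth

    def grant(self):
        """The unit is free: return the oldest waiting hart, then advance and wrap the read pointer."""
        hart_id = self.slots[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % self.depth
        return hart_id


# Two harts queue up for the same unit; a third request reuses the freed slot.
buf = IssueBuffer(depth=2)
buf.request(1)
buf.request(2)
first = buf.grant()
buf.request(0)
order = [first, buf.grant(), buf.grant()]
```

In this sketch the grants come back in request order, which is the property the read and write pointers are there to provide; in the listing the wrap-around happens when a pointer reaches `THREAD_POOL_SIZE - 2`.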

```
busy_add_wire <= '1' when multithreaded_accl_en = 1 and add_en_wire   /= (accl_range => '0') else '0';
busy_mul_wire <= '1' when multithreaded_accl_en = 1 and mul_en_wire   /= (accl_range => '0') else '0';
busy_shf_wire <= '1' when multithreaded_accl_en = 1 and shift_en_wire /= (accl_range => '0') else '0';
busy_acc_wire <= '1' when multithreaded_accl_en = 1 and accum_en_wire /= (accl_range => '0') else '0';
busy_rel_wire <= '1' when multithreaded_accl_en = 1 and relu_en_wire  /= (accl_range => '0') else '0';
MULTICORE_OUT_MAPPER : if multithreaded_accl_en = 0 generate
MAPPER_replicated : for h in fu_range generate
 MAPPING_OUT_UNIT_comb : process(all)
 begin
   dsp_sc_data_write_wire_int(h) <= (others => '0');
   dsp_sc_data_write_wire(h)     <= dsp_sc_data_write_wire_int(h);
   SIMD_RD_BYTES_wire(h)         <= SIMD*(Data_Width/8);
   if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
     case state_DSP(h) is
      when dsp_init =>
       -- Set signals to enable correct virtual parallelism operation
       if (decoded_instruction_DSP(KDOTP_bit_position)    = '1' or
           decoded_instruction_DSP(KDOTPPS_bit_position)  = '1' or
           decoded_instruction_DSP(KVRED_bit_position)    = '1' or
           decoded_instruction_DSP(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP(KSVMULSC_bit_position) = '1') and
          MVTYPE(h)(3 downto 2) = "00" then
         SIMD_RD_BYTES_wire(h) <= SIMD*(Data_Width/8)/2;
       end if;
      when dsp_exec =>
       -- Set signals to enable correct virtual parallelism operation
       if (decoded_instruction_DSP_lat(h)(KDOTP_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position)  = '1' or
           decoded_instruction_DSP_lat(h)(KVRED_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          (MVTYPE_DSP(h) = "00") then
         SIMD_RD_BYTES_wire(h) <= SIMD*(Data_Width/8)/2;
       end if;
       if decoded_instruction_DSP_lat(h)(KDOTP_bit_position)   = '1' or
          decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KDOTP_bit_position)   = '1' or
          decoded_instruction_DSP_lat(h)(KVRED_bit_position)   = '1' then
         dsp_sc_data_write_wire_int(h)(31 downto 0) <= dsp_out_accum_results(h); -- AAA add a mask in order to store the lower half word when 16-bit or entire word when 32-bit
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          MVTYPE_DSP(h) = "00" then
         for i in 0 to 2*SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(7+8*(i) downto 8*(i)) <= dsp_out_mul_results(h)(7+8*(2*i) downto 8*(2*i));
         end loop;
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          (MVTYPE_DSP(h) = "01" or MVTYPE_DSP(h) = "10") then
         dsp_sc_data_write_wire_int(h) <= dsp_out_mul_results(h);
       end if;
       if decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KSRLV_bit_position) = '1' then
         dsp_sc_data_write_wire_int(h) <= dsp_out_shifter_results(h);
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDSC_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KSVADDRF_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KADDV_bit_position)    = '1' or
          decoded_instruction_DSP_lat(h)(KSUBV_bit_position)    = '1' or
          decoded_instruction_DSP_lat(h)(KVCP_bit_position)     = '1' then
         dsp_sc_data_write_wire_int(h) <= dsp_out_adder_results(h);
       end if;
       if decoded_instruction_DSP_lat(h)(KRELU_bit_position) = '1' then
         dsp_sc_data_write_wire_int(h) <= dsp_out_relu_results(h);
       end if;
       if decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' and MVTYPE_DSP(h) = "10" then
         for i in 0 to SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(31+32*(i) downto 32*(i)) <= RS1_Data_IE_lat(h);
         end loop;
       elsif decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' and MVTYPE_DSP(h) = "01" then
         for i in 0 to 2*SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(15+16*(i) downto 16*(i)) <= RS1_Data_IE_lat(h)(15 downto 0);
         end loop;
       elsif decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' and MVTYPE_DSP(h) = "00" then
         for i in 0 to 4*SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(7+8*(i) downto 8*(i)) <= RS1_Data_IE_lat(h)(7 downto 0);
         end loop;
       end if;
       if halt_dsp(h) = '0' and halt_dsp_lat(h) = '1' then
         dsp_sc_data_write_wire(h) <= dsp_sc_data_write_int(h);
       end if;
      when others =>
       null;
     end case;
   end if;
 end process;
end generate;
end generate;
MULTITHREAD_OUT_MAPPER : if multithreaded_accl_en = 1 generate
 MAPPING_OUT_UNIT_comb : process(all)
 begin
  for h in 0 to (ACCL_NUM - FU_NUM) loop
   dsp_sc_data_write_wire_int(h) <= (others => '0');
   dsp_sc_data_write_wire(h)     <= dsp_sc_data_write_wire_int(h);
   SIMD_RD_BYTES_wire(h)         <= SIMD*(Data_Width/8);
   if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
    case state_DSP(h) is
      when dsp_init =>
       -- Set signals to enable correct virtual parallelism operation
       if (decoded_instruction_DSP(KDOTP_bit_position)    = '1' or
           decoded_instruction_DSP(KDOTPPS_bit_position)  = '1' or
           decoded_instruction_DSP(KVRED_bit_position)    = '1' or
           decoded_instruction_DSP(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP(KSVMULSC_bit_position) = '1') and
          MVTYPE(h)(3 downto 2) = "00" then
         SIMD_RD_BYTES_wire(h) <= SIMD*(Data_Width/8)/2;
       end if;
      when dsp_exec =>
       -- Set signals to enable correct virtual parallelism operation
       if (decoded_instruction_DSP_lat(h)(KDOTP_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position)  = '1' or
           decoded_instruction_DSP_lat(h)(KVRED_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and MVTYPE_DSP(h) = "00" then
         SIMD_RD_BYTES_wire(h) <= SIMD*(Data_Width/8)/2;
       end if;
       if decoded_instruction_DSP_lat(h)(KDOTP_bit_position)   = '1' or
          decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KVRED_bit_position)   = '1' then
         dsp_sc_data_write_wire_int(h)(31 downto 0) <= dsp_out_accum_results(0); -- AAA add a mask in order to store the lower half word when 16-bit or entire word when 32-bit
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          MVTYPE_DSP(h) = "00" then
         for i in 0 to 2*SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(7+8*(i) downto 8*(i)) <= dsp_out_mul_results(0)(7+8*(2*i) downto 8*(2*i));
         end loop;
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          (MVTYPE_DSP(h) = "01" or MVTYPE_DSP(h) = "10") then
         dsp_sc_data_write_wire_int(h) <= dsp_out_mul_results(0);
       end if;
       if decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KSRLV_bit_position) = '1' then
         dsp_sc_data_write_wire_int(h) <= dsp_out_shifter_results(0);
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDSC_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KSVADDRF_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KADDV_bit_position)    = '1' or
          decoded_instruction_DSP_lat(h)(KSUBV_bit_position)    = '1' or
          decoded_instruction_DSP_lat(h)(KVCP_bit_position)     = '1' then
         dsp_sc_data_write_wire_int(h) <= dsp_out_adder_results(0);
       end if;
       if decoded_instruction_DSP_lat(h)(KRELU_bit_position) = '1' then
         dsp_sc_data_write_wire_int(h) <= dsp_out_relu_results(0);
       end if;
       if decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' and MVTYPE_DSP(h) = "10" then
         for i in 0 to SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(31+32*(i) downto 32*(i)) <= RS1_Data_IE_lat(h);
         end loop;
       elsif decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' and MVTYPE_DSP(h) = "01" then
         for i in 0 to 2*SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(15+16*(i) downto 16*(i)) <= RS1_Data_IE_lat(h)(15 downto 0);
         end loop;
       elsif decoded_instruction_DSP_lat(h)(KBCAST_bit_position) = '1' and MVTYPE_DSP(h) = "00" then
         for i in 0 to 4*SIMD-1 loop
          dsp_sc_data_write_wire_int(h)(7+8*(i) downto 8*(i)) <= RS1_Data_IE_lat(h)(7 downto 0);
         end loop;
       end if;
       if halt_dsp(h) = '0' and halt_dsp_lat(h) = '1' then
         dsp_sc_data_write_wire(h) <= dsp_sc_data_write_int(h);
       end if;
      when others =>
       null;
     end case;
   end if;
  end loop;
 end process;
end generate;
--FU_IN_MAPPER_replicated : for f in accl_range generate
--FU_IN_MAPPER : if (multithreaded_accl_en = 0 or (multithreaded_accl_en = 1 and f = 0)) generate
FU_replicated : for f in fu_range generate
 DSP_MAPPING_IN_UNIT_comb : process(all)
 variable h : integer;
 begin
  dsp_in_mul_operands(f)    <= (others => (others => '0'));
  dsp_in_adder_operands(f)  <= (others => (others => '0'));
  dsp_in_shift_amount(f)    <= (others => '0');
  dsp_in_shifter_operand(f) <= (others => '0');
  dsp_in_relu_operands(f)   <= (others => '0');
  dsp_in_accum_operands(f)  <= (others => '0');
  for g in 0 to (ACCL_NUM - FU_NUM) loop
   if multithreaded_accl_en = 1 then
    h := g; -- set the spm rd/wr ports equal to the "for-loop"
   elsif multithreaded_accl_en = 0 then
    h := f; -- set the spm rd/wr ports equal to the "for-generate"
   end if;
   if dsp_instr_req(h) = '1' or busy_DSP_internal_lat(h) = '1' then
     case state_DSP(h) is
      when dsp_exec =>
       if (decoded_instruction_DSP_lat(h)(KDOTP_bit_position)   = '1' or
           decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1') and
          MVTYPE_DSP(h) = "00" then
         for i in 0 to 2*SIMD-1 loop
          dsp_in_mul_operands(f)(0)(15+16*(i) downto 16*(i)) <= (x"00" & (dsp_sc_data_read(h)(0)(7+8*(i) downto 8*(i)) and dsp_sc_data_read_mask(h)(7+8*(i) downto 8*(i))));
          dsp_in_mul_operands(f)(1)(15+16*(i) downto 16*(i)) <= (x"00" & (dsp_sc_data_read(h)(1)(7+8*(i) downto 8*(i)) and dsp_sc_data_read_mask(h)(7+8*(i) downto 8*(i))));
          if dotp(h) = '1' then
           dsp_in_accum_operands(f) <= dsp_out_mul_results(f);
          elsif dotpps(h) = '1' then
           dsp_in_shift_amount(f)    <= MPSCLFAC_DSP(h);
           dsp_in_shifter_operand(f) <= dsp_out_mul_results(f);
           dsp_in_accum_operands(f)  <= dsp_out_shifter_results(f);
          end if;
         end loop;
       end if;
       if (decoded_instruction_DSP_lat(h)(KDOTP_bit_position)   = '1' or
           decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1') and
          (MVTYPE_DSP(h) = "01" or MVTYPE_DSP(h) = "10") then
         dsp_in_mul_operands(f)(0) <= dsp_sc_data_read(h)(0) and dsp_sc_data_read_mask(h);
         dsp_in_mul_operands(f)(1) <= dsp_sc_data_read(h)(1) and dsp_sc_data_read_mask(h);
         if dotp(h) = '1' then
          dsp_in_accum_operands(f) <= dsp_out_mul_results(f);
         elsif dotpps(h) = '1' then
          dsp_in_shift_amount(f)    <= MPSCLFAC_DSP(h);
          dsp_in_shifter_operand(f) <= dsp_out_mul_results(f);
          dsp_in_accum_operands(f)  <= dsp_out_shifter_results(f);
         end if;
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and MVTYPE_DSP(h) = "00" then
         for i in 0 to 2*SIMD-1 loop
          if vec_read_rs2_DSP(h) = '0' then
           if rf_rs2(h) = '1' then
            dsp_in_mul_operands(f)(1)(15+16*(i) downto 16*(i)) <= x"00" & RS2_Data_IE_lat(h)(7 downto 0); -- map the scalar value
           elsif rf_rs2(h) = '0' then
            dsp_in_mul_operands(f)(1)(15+16*(i) downto 16*(i)) <= x"00" & dsp_sc_data_read(h)(1)(7 downto 0); -- map the scalar value
           end if;
          else
           dsp_in_mul_operands(f)(1)(15+16*(i) downto 16*(i)) <= x"00" & dsp_sc_data_read(h)(1)(7+8*(i) downto 8*(i));
          end if;
          dsp_in_mul_operands(f)(0)(15+16*(i) downto 16*(i)) <= x"00" & dsp_sc_data_read(h)(0)(7+8*(i) downto 8*(i));
         end loop;
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          MVTYPE_DSP(h) = "01" then
         if vec_read_rs2_DSP(h) = '0' then
          if rf_rs2(h) = '1' then
           for i in 0 to 2*SIMD-1 loop
            dsp_in_mul_operands(f)(1)(15+16*(i) downto 16*(i)) <= RS2_Data_IE_lat(h)(15 downto 0); -- map the scalar value
           end loop;
          elsif rf_rs2(h) = '0' then
           for i in 0 to 2*SIMD-1 loop
            dsp_in_mul_operands(f)(1)(15+16*(i) downto 16*(i)) <= dsp_sc_data_read(h)(1)(15 downto 0); -- map the scalar value
           end loop;
          end if;
         else
          dsp_in_mul_operands(f)(1) <= dsp_sc_data_read(h)(1);
         end if;
         dsp_in_mul_operands(f)(0) <= dsp_sc_data_read(h)(0);
       end if;
       if (decoded_instruction_DSP_lat(h)(KVMUL_bit_position)    = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULRF_bit_position) = '1' or
           decoded_instruction_DSP_lat(h)(KSVMULSC_bit_position) = '1') and
          MVTYPE_DSP(h) = "10" then
         if vec_read_rs2_DSP(h) = '0' then
          if rf_rs2(h) = '1' then
           for i in 0 to SIMD-1 loop
            dsp_in_mul_operands(f)(1)(31+32*(i) downto 32*(i)) <= RS2_Data_IE_lat(h)(31 downto 0); -- map the scalar value
           end loop;
          elsif rf_rs2(h) = '0' then
           for i in 0 to SIMD-1 loop
            dsp_in_mul_operands(f)(1)(31+32*(i) downto 32*(i)) <= dsp_sc_data_read(h)(1)(31 downto 0); -- map the scalar value
           end loop;
          end if;
         else
          dsp_in_mul_operands(f)(1) <= dsp_sc_data_read(h)(1);
         end if;
         dsp_in_mul_operands(f)(0) <= dsp_sc_data_read(h)(0);
       end if;
       if decoded_instruction_DSP_lat(h)(KADDV_bit_position) = '1' then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         dsp_in_adder_operands(f)(1) <= dsp_sc_data_read(h)(1);
       end if;
       if decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or
          decoded_instruction_DSP_lat(h)(KSRLV_bit_position) = '1' then
         dsp_in_shifter_operand(f) <= dsp_sc_data_read(h)(0);
         dsp_in_shift_amount(f)    <= RS2_Data_IE_lat(h)(4 downto 0); -- map the scalar value (shift amount)
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDSC_bit_position) = '1' and MVTYPE_DSP(h) = "10" then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         for i in 0 to SIMD-1 loop
          dsp_in_adder_operands(f)(1)(31+32*(i) downto 32*(i)) <= dsp_sc_data_read(h)(1)(31 downto 0);
         end loop;
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDSC_bit_position) = '1' and MVTYPE_DSP(h) = "01" then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         for i in 0 to 2*SIMD-1 loop
          dsp_in_adder_operands(f)(1)(15+16*(i) downto 16*(i)) <= dsp_sc_data_read(h)(1)(15 downto 0);
         end loop;
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDSC_bit_position) = '1' and MVTYPE_DSP(h) = "00" then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         for i in 0 to 4*SIMD-1 loop
          dsp_in_adder_operands(f)(1)(7+8*(i) downto 8*(i)) <= dsp_sc_data_read(h)(1)(7 downto 0);
         end loop;
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDRF_bit_position) = '1' and MVTYPE_DSP(h) = "10" then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         for i in 0 to SIMD-1 loop
          dsp_in_adder_operands(f)(1)(31+32*(i) downto 32*(i)) <= RS2_Data_IE_lat(h)(31 downto 0);
         end loop;
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDRF_bit_position) = '1' and MVTYPE_DSP(h) = "01" then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         for i in 0 to 2*SIMD-1 loop
          dsp_in_adder_operands(f)(1)(15+16*(i) downto 16*(i)) <= RS2_Data_IE_lat(h)(15 downto 0);
         end loop;
       end if;
       if decoded_instruction_DSP_lat(h)(KSVADDRF_bit_position) = '1' and MVTYPE_DSP(h) = "00" then
         dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
         for i in 0 to 4*SIMD-1 loop
          dsp_in_adder_operands(f)(1)(7+8*(i) downto 8*(i)) <= RS2_Data_IE_lat(h)(7 downto 0);
         end loop;
       end if;
```
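
The input mapper above prepares 8-bit operands for the 16-bit multiplier lanes in two ways: vector elements are zero-extended byte by byte (the `x"00" & ...(7+8*(i) downto 8*(i))` assignments), and a scalar operand is replicated into every lane (the lines commented "map the scalar value"). The Python sketch below models only that bit placement. The function names `widen_bytes` and `broadcast_byte` and the `lanes` argument are hypothetical, chosen for illustration; the read mask applied in the dot-product case is omitted.

```python
def widen_bytes(word, lanes):
    """Zero-extend byte i of `word` into 16-bit lane i of the result."""
    out = 0
    for i in range(lanes):
        byte = (word >> (8 * i)) & 0xFF   # source slice (7+8*i downto 8*i)
        out |= byte << (16 * i)           # destination slice (15+16*i downto 16*i)
    return out


def broadcast_byte(scalar, lanes):
    """Copy the low byte of `scalar` into every 16-bit lane, upper byte zero."""
    out = 0
    for i in range(lanes):
        out |= (scalar & 0xFF) << (16 * i)
    return out


vec = widen_bytes(0x04030201, lanes=4)   # four packed 8-bit elements
scl = broadcast_byte(0x1FF, lanes=4)     # only bits 7..0 of the scalar are used
```

Widening before the multiplication gives each 8-bit product a full 16-bit lane, which is consistent with the output mapper later picking the low byte of every 16-bit product.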


```
       if decoded_instruction_DSP_lat(h)(KSUBV_bit_position) = '1' then
        dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
        dsp_in_adder_operands(f)(1) <= (not dsp_sc_data_read(h)(1));
       end if;
       if decoded_instruction_DSP_lat(h)(KVRED_bit_position) = '1' and MVTYPE_DSP(h) = "00" then
        for i in 0 to 2*SIMD-1 loop
         dsp_in_accum_operands(f)(15+16*(i) downto 16*(i)) <= x"00" & (dsp_sc_data_read(h)(0)(7+8*(i) downto 8*(i)) and dsp_sc_data_read_mask(h)(7+8*(i) downto 8*(i)));
        end loop;
       end if;
       if decoded_instruction_DSP_lat(h)(KVRED_bit_position) = '1' and (MVTYPE_DSP(h) = "01" or MVTYPE_DSP(h) = "10") then
        dsp_in_accum_operands(f) <= dsp_sc_data_read(h)(0) and dsp_sc_data_read_mask(h);
       end if;
       if decoded_instruction_DSP_lat(h)(KRELU_bit_position) = '1' then
        dsp_in_relu_operands(f) <= dsp_sc_data_read(h)(0);
       end if;
       if decoded_instruction_DSP_lat(h)(KVCP_bit_position) = '1' then
        dsp_in_adder_operands(f)(0) <= dsp_sc_data_read(h)(0);
       end if;
      when others =>
       null;
     end case;
   end if;
  end loop;
 end process;
--end generate;
--end generate;
--FU_IN_MAPPER : if (multithreaded_accl_en = 0 or (multithreaded_accl_en = 1 and f = 0) generate
 fsm_DSP_adder_stage_1 : process(all)
 variable h : integer;
 begin
  dsp_add_8_0_wire(f)  <= dsp_add_8_0(f);
  dsp_add_16_8_wire(f) <= dsp_add_16_8(f);
  for g in 0 to (ACCL_NUM - FU_NUM) loop
   if multithreaded_accl_en = 1 then
    h := g; -- set the spm rd/wr ports equal to the "for-loop"
   elsif multithreaded_accl_en = 0 then
    h := f; -- set the spm rd/wr ports equal to the "for-generate"
   end if;
   -- Addition in SIMD Virtual Parallelism is executed here, if the carries are blocked, we will have a chain of 8-bit or 16-bit adders, else we have 32-bit adders
   for i in 0 to SIMD-1 loop
     if (adder_stage_1_en(h) = '1' or recover_state_wires(h) = '1') then
      -- Unwinding the loop:
      -- (1) the term "8*(4*i)" is used to jump between the 32-bit words, inside the 128-bit values read by the DSP
      -- (2) Each addition results in an 8-bit value, and the 9th bit being the carry, depending on the instruction (KADDV32, KADDV16, KADDV8) we either pass or block the carries.
      -- (3) CARRIES:
      -- (a) If we pass all the carries in the 32-bit word, we will have executed KADDV32 (4*32-bit parallel additions)
      -- (b) If we pass the 9th and 25th carries we would have executed KADDV16 (8*16-bit parallel additions)
      -- (c) If we pass none of the carries then we would have executed KADDV8 (16*8-bit parallel additions)
      dsp_add_8_0_wire(f)(i)  <= std_logic_vector('0' & unsigned(dsp_in_adder_operands(f)(0)(7+8*(4*i) downto 8*(4*i))) + unsigned(dsp_in_adder_operands(f)(1)(7+8*(4*i) downto 8*(4*i))) + twos_complement(h)(0+(4*i)));
      dsp_add_16_8_wire(f)(i) <= std_logic_vector('0' & unsigned(dsp_in_adder_operands(f)(0)(15+8*(4*i) downto 8+8*(4*i))) + unsigned(dsp_in_adder_operands(f)(1)(15+8*(4*i) downto 8+8*(4*i))) + carry_8_wire(f)(i) + twos_complement(h)(1+(4*i)));
      -- All the 8-bit adders are lumped into one output write signal that will write to the scratchpads
      -- Carries are either passed or blocked for the 9-th, 17-th, and 25-th bits
      carry_8_wire(f)(i)  <= dsp_add_8_0_wire(f)(i)(8) and carry_pass(h)(0);
      carry_16_wire(f)(i) <= dsp_add_16_8_wire(f)(i)(8) and carry_pass(h)(1);
     end if;
   end loop;
  end loop;
 end process;
 fsm_DSP_adder_stage_2 : process(all)
 variable h : integer;
 begin
  carry_24_wire(f)      <= (others => '0');
  dsp_add_24_16_wire(f) <= (others => (others => '0'));
  dsp_add_32_24_wire(f) <= (others => (others => '0'));
  for g in 0 to (ACCL_NUM - FU_NUM) loop
   if multithreaded_accl_en = 1 then
    h := g; -- set the spm rd/wr ports equal to the "for-loop"
   elsif multithreaded_accl_en = 0 then
    h := f; -- set the spm rd/wr ports equal to the "for-generate"
   end if;
   -- Addition is here
   if halt_dsp_lat(h) = '0' then
    -- Addition in SIMD Virtual Parallelism is executed here, if the carries are blocked, we will have a chain of 8-bit or 16-bit adders, else we have 32-bit adders
    for i in 0 to SIMD-1 loop
     if (adder_stage_2_en(h) = '1' or recover_state_wires(h) = '1') then
      dsp_add_24_16_wire(f)(i) <= std_logic_vector('0' & unsigned(dsp_in_adder_operands_lat(f)(0)(7+8*(2*i) downto 8*(2*i))) + unsigned(dsp_in_adder_operands_lat(f)(1)(7+8*(2*i) downto 8*(2*i))) + carry_16(f)(i) + twos_complement(h)(2+(4*i)));
      dsp_add_32_24_wire(f)(i) <= std_logic_vector('0' & unsigned(dsp_in_adder_operands_lat(f)(0)(15+8*(2*i) downto 8+8*(2*i))) + unsigned(dsp_in_adder_operands_lat(f)(1)(15+8*(2*i) downto 8+8*(2*i))) + carry_24_wire(f)(i) + twos_complement(h)(3+(4*i)));
      -- All the 8-bit adders are lumped into one output write signal that will write to the scratchpads
      -- Carries are either passed or blocked for the 9-th, 17-th, and 25-th bits
      carry_24_wire(f)(i) <= dsp_add_24_16_wire(f)(i)(8) and carry_pass(h)(2);
     end if;
    end loop;
   end if;
  end loop;
 end process;
 fsm_DSP_adder : process(clk_i, rst_ni)
 variable h : integer;
 begin
  if rst_ni = '0' then
  elsif rising_edge(clk_i) then
   for g in 0 to (ACCL_NUM - FU_NUM) loop
    if multithreaded_accl_en = 1 then
     h := g; -- set the spm rd/wr ports equal to the "for-loop"
    elsif multithreaded_accl_en = 0 then
     h := f; -- set the spm rd/wr ports equal to the "for-generate"
    end if;
    -- Addition is here
    if add_en(h) = '1' and halt_dsp_lat(h) = '0' then
     carry_16(f)     <= carry_16_wire(f);
     dsp_add_8_0(f)  <= dsp_add_8_0_wire(f);
     dsp_add_16_8(f) <= dsp_add_16_8_wire(f);
     -- Addition in SIMD Virtual Parallelism is executed here, if the carries are blocked, we will have a chain of 8-bit or 16-bit adders, else we have normal 32-bit adders
     for i in 0 to SIMD-1 loop
      if (adder_stage_2_en(h) = '1' or recover_state_wires(h) = '1') then
       -- All the 8-bit adders are lumped into one output signal that will write to the scratchpads
       dsp_out_adder_results(f)(31+32*(i) downto 32*(i)) <= dsp_add_32_24_wire(f)(i)(7 downto 0) & dsp_add_24_16_wire(f)(i)(7 downto 0) & dsp_add_16_8(f)(i)(7 downto 0) & dsp_add_8_0(f)(i)(7 downto 0);
      end if;
     end loop;
    end if;
    for i in 0 to SIMD-1 loop
     for j in 0 to 1 loop
      dsp_in_adder_operands_lat(f)(j)(15+16*(i) downto 16*(i)) <= dsp_in_adder_operands(f)(j)(31+32*(i) downto 16+32*(i));
     end loop;
    end loop;
   end loop;
  end if;
 end process;
 fsm_DSP_shifter_stg_1 : process(clk_i, rst_ni)
 variable h : integer;
 begin
  if rst_ni = '0' then
  elsif rising_edge(clk_i) then
   for g in 0 to (ACCL_NUM - FU_NUM) loop
    if multithreaded_accl_en = 1 then
     h := g; -- set the spm rd/wr ports equal to the "for-loop"
    elsif multithreaded_accl_en = 0 then
     h := f; -- set the spm rd/wr ports equal to the "for-generate"
    end if;
    if shift_en(h) = '1' and (shifter_stage_1_en(h) = '1' or recover_state_wires(h) = '1') and halt_dsp_lat(h) = '0' then
     for i in 0 to SIMD-1 loop
      dsp_int_shifter_operand(f)(31+32*(i) downto 32*(i)) <= to_stdlogicvector(to_bitvector(dsp_in_shifter_operand(f)(31+32*(i) downto 32*(i))) srl to_integer(unsigned(dsp_in_shift_amount(f))));
     end loop;
```
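
The comments in the two adder stages describe a single 32-bit adder built from four 8-bit adders whose inter-byte carries are either passed or blocked, so the same hardware performs 32-bit, 16-bit or 8-bit SIMD additions. The Python sketch below models that carry gating for one 32-bit word. It is an illustrative model under stated assumptions, not the thesis RTL: the function name `simd_add32` and the tuples `ADD32`, `ADD16` and `ADD8` are hypothetical, `carry_pass` is assumed to hold one bit per byte boundary as in the listing, and the `twos_complement` carry-in used for subtraction as well as the two-stage pipelining are omitted.

```python
def simd_add32(a, b, carry_pass):
    """Add two 32-bit words as four chained 8-bit adders.

    carry_pass[k] == 1 lets the carry out of byte k ripple into byte k+1;
    carry_pass[k] == 0 blocks it, splitting the word into independent lanes.
    """
    result = 0
    carry = 0
    for k in range(4):
        byte_a = (a >> (8 * k)) & 0xFF
        byte_b = (b >> (8 * k)) & 0xFF
        total = byte_a + byte_b + carry          # 9-bit sum, bit 8 is the carry out
        result |= (total & 0xFF) << (8 * k)      # keep the low 8 bits of this byte
        if k < 3:
            carry = (total >> 8) & carry_pass[k]  # gate the carry into the next byte
    return result


ADD32 = (1, 1, 1)  # pass every carry: one 32-bit addition
ADD16 = (1, 0, 1)  # block the carry at bit 16: two 16-bit additions
ADD8 = (0, 0, 0)   # block every carry: four 8-bit additions

a, b = 0x0000FFFF, 0x00000001
r32 = simd_add32(a, b, ADD32)
r16 = simd_add32(a, b, ADD16)
r8 = simd_add32(a, b, ADD8)
```

With these operands the 32-bit mode carries into the upper half-word, the 16-bit mode wraps the lower half-word to zero, and the 8-bit mode wraps only the lowest byte, which mirrors cases (a) to (c) in the comments of the first adder stage.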

1755 1755 1755 1755 1755 1755 1755 1755

758

```
          -- for i in 0 to 4*SIMD-1 loop -- latch the sign bits
          --   dsp_in_sign_bits(f)(i) <= dsp_in_shifter_operand(f)(7+8*(i));
          -- end loop;
          if MVTYPE_DSP(h) = "00" then
            for i in 0 to 4*SIMD-1 loop -- latch the sign bits
              dsp_in_shifter_operand_lat(f)(7+8*i downto 8*i) <= (others => dsp_in_shifter_operand(f)(7+8*i));
            end loop;
          elsif MVTYPE_DSP(h) = "01" then
            for i in 0 to 2*SIMD-1 loop -- latch the sign bits
              dsp_in_shifter_operand_lat(f)(15+16*i downto 16*i) <= (others => dsp_in_shifter_operand(f)(15+16*i));
            end loop;
          elsif MVTYPE_DSP(h) = "10" then
            for i in 0 to SIMD-1 loop -- latch the sign bits
              dsp_in_shifter_operand_lat(f)(31+32*i downto 32*i) <= (others => dsp_in_shifter_operand(f)(31+32*i));
            end loop;
          end if;
        end if;
      end loop;
    end if;
  end process;

  fsm_DSP_shifter_stg_2 : process(clk_i, rst_ni)
    variable h : integer;
  begin
    if rst_ni = '0' then
    elsif rising_edge(clk_i) then
      for g in 0 to (ACCL_NUM - FU_NUM) loop
        if multithreaded_accl_en = 1 then
          h := g; -- set the spm rd/wr ports equal to the "for-loop"
        elsif multithreaded_accl_en = 0 then
          h := f; -- set the spm rd/wr ports equal to the "for-generate"
        end if;
        if shift_en(h) = '1' and (shifter_stage_2_en(h) = '1' or recover_state_wires(h) = '1') and halt_dsp_lat(h) = '0' then
          if MVTYPE_DSP(h) = "10" then
            for i in 0 to SIMD-1 loop
              dsp_out_shifter_results(f)(31+32*(i) downto 32*(i)) <= dsp_in_shifter_operand_lat_wire(f)(31+32*(i) downto 32*(i)) or
                                                                     dsp_int_shifter_operand(f)(31+32*(i) downto 32*(i));
            end loop;
          elsif MVTYPE_DSP(h) = "01" or (decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1' and MVTYPE_DSP(h) = "00") then -- KDOTPPS8 has been added here because the number of elements loaded for mul operations is equal for 8-bit and 16-bit instr
            for i in 0 to 2*SIMD-1 loop
              dsp_out_shifter_results(f)(15+16*(i) downto 16*(i)) <= dsp_in_shifter_operand_lat_wire(f)(15+16*(i) downto 16*(i)) or
                                                                     (dsp_int_shifter_operand(f)(15+16*(i) downto 16*(i)) and dsp_shift_enabler(h)(15 downto 0));
            end loop;
          elsif MVTYPE_DSP(h) = "00" then
            for i in 0 to 4*SIMD-1 loop
              dsp_out_shifter_results(f)(7+8*(i) downto 8*(i)) <= dsp_in_shifter_operand_lat_wire(f)(7+8*(i) downto 8*(i)) or
                                                                  (dsp_int_shifter_operand(f)(7+8*(i) downto 8*(i)) and dsp_shift_enabler(h)(7 downto 0));
            end loop;
          end if;
        end if;
      end loop;
    end if;
  end process;

  fsm_DSP_shifter_comb : process(all)
    variable h : integer;
  begin
    dsp_in_shifter_operand_lat_wire(f) <= (others => '0');
    for g in 0 to (ACCL_NUM - FU_NUM) loop
      if multithreaded_accl_en = 1 then
        h := g; -- set the spm rd/wr ports equal to the "for-loop"
      elsif multithreaded_accl_en = 0 then
        h := f; -- set the spm rd/wr ports equal to the "for-generate"
      end if;
      dsp_shift_enabler(h) <= (others => '0');
      if shift_en(h) = '1' and halt_dsp_lat(h) = '0' then
        if MVTYPE_DSP(h) = "01" then
          dsp_shift_enabler(h)(15 - to_integer(unsigned(dsp_in_shift_amount(h)(3 downto 0))) downto 0) <= (others => '1');
        elsif MVTYPE_DSP(h) = "00" then
          dsp_shift_enabler(h)(7 - to_integer(unsigned(dsp_in_shift_amount(h)(2 downto 0))) downto 0) <= (others => '1');
        end if;
        if (decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1') and MVTYPE_DSP(h) = "10" then -- 32-bit sign extension for srl in stage 1
          for i in 0 to SIMD-1 loop
            --dsp_in_shifter_operand_lat(f)(31+32*(i) downto 31 - to_integer(unsigned(dsp_in_shift_amount(h)(4 downto 0)))+32*(i)) <= (others => dsp_in_sign_bits(h)(3+4*(i)));
            dsp_in_shifter_operand_lat_wire(f)(31+32*(i) downto 31 - to_integer(unsigned(dsp_in_shift_amount(f)(4 downto 0)))+32*(i)) <=
              dsp_in_shifter_operand_lat(f)(31+32*(i) downto 31 - to_integer(unsigned(dsp_in_shift_amount(f)(4 downto 0)))+32*(i));
          end loop;
        elsif (decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1') and MVTYPE_DSP(h) = "01" then -- 16-bit sign extension for srl in stage 1
          for i in 0 to 2*SIMD-1 loop
            --dsp_in_shifter_operand_lat(f)(15+16*(i) downto 15 - to_integer(unsigned(dsp_in_shift_amount(h)(3 downto 0)))+16*(i)) <= (others => dsp_in_sign_bits(h)(1+2*(i)));
            dsp_in_shifter_operand_lat_wire(f)(15+16*(i) downto 15 - to_integer(unsigned(dsp_in_shift_amount(f)(3 downto 0)))+16*(i)) <=
              dsp_in_shifter_operand_lat(f)(15+16*(i) downto 15 - to_integer(unsigned(dsp_in_shift_amount(f)(3 downto 0)))+16*(i));
          end loop;
        elsif (decoded_instruction_DSP_lat(h)(KSRAV_bit_position) = '1' or decoded_instruction_DSP_lat(h)(KDOTPPS_bit_position) = '1') and MVTYPE_DSP(h) = "00" then -- 8-bit sign extension for srl in stage 1
          for i in 0 to 4*SIMD-1 loop
            --dsp_in_shifter_operand_lat(f)(7+8*(i) downto 7 - to_integer(unsigned(dsp_in_shift_amount(h)(2 downto 0)))+8*(i)) <= (others => dsp_in_sign_bits(h)(i));
            dsp_in_shifter_operand_lat_wire(f)(7+8*(i) downto 7 - to_integer(unsigned(dsp_in_shift_amount(f)(2 downto 0)))+8*(i)) <=
              dsp_in_shifter_operand_lat(f)(7+8*(i) downto 7 - to_integer(unsigned(dsp_in_shift_amount(f)(2 downto 0)))+8*(i));
          end loop;
        end if;
      end if;
    end loop;
  end process;

  -- STAGE 1
  fsm_MUL_STAGE_1 : process(clk_i, rst_ni)
    variable h : integer;
  begin
    if rst_ni = '0' then
    elsif rising_edge(clk_i) then
      for g in 0 to (ACCL_NUM - FU_NUM) loop
        if multithreaded_accl_en = 1 then
          h := g; -- set the spm rd/wr ports equal to the "for-loop"
        elsif multithreaded_accl_en = 0 then
          h := f; -- set the spm rd/wr ports equal to the "for-generate"
        end if;
        if halt_dsp_lat(h) = '0' then
          if mul_en(h) = '1' and (mul_stage_1_en(h) = '1' or recover_state_wires(h) = '1') then
            for i in 0 to SIMD-1 loop
              -- Unwinding the loop:
              -- (1) The implementation in the loop does multiplication for KDOTP32, and KDOTP16 using only 16-bit multipliers.
              --     "A*B" = [Ahigh*(2^16) + Alow]*[Bhigh*(2^16) + Blow]
              -- (2) Expanding this equation "[Ahigh*(2^16) + Alow]*[Bhigh*(2^16) + Blow]" gives:
              --     "Ahigh*Bhigh*(2^32) + Ahigh*Blow*(2^16) + Alow*Bhigh*(2^16) + Alow*Blow" which are the terms being stored in dsp_out_mul_results
              -- (3) Partial Multiplication
              --     (a) "dsp_mul_a" <= Ahigh*Bhigh
              --     (b) "dsp_mul_b" <= Ahigh*Blow
              --     (c) "dsp_mul_c" <= Alow*Bhigh
              --     (d) "dsp_mul_d" <= Alow*Blow
              -- (4) "dsp_mul_a" is shifted by 32 bits to the left, "dsp_mul_b" and "dsp_mul_c" are shifted by 16 bits to the left, "dsp_mul_d" is not shifted
              -- (5) For 16-bit and 8-bit muls, the FUNCT_SELECT_MASK is set to x"00000000", blocking the terms in "dsp_mul_b" and "dsp_mul_c".
              --     For executing 32-bit muls, we set the mask to x"FFFFFFFF"
              dsp_mul_a(f)(31+32*(i) downto 32*(i)) <= std_logic_vector(unsigned(dsp_in_mul_operands(f)(0)(15+16*(2*i+1) downto 16*(2*i+1))) *
                                                                        unsigned(dsp_in_mul_operands(f)(1)(15+16*(2*i+1) downto 16*(2*i+1))));
              dsp_mul_b(f)(31+32*(i) downto 32*(i)) <= std_logic_vector((unsigned(dsp_in_mul_operands(f)(0)(16*(2*i+1) - 1 downto 16*(2*i))) *
                                                                         unsigned(dsp_in_mul_operands(f)(1)(15+16*(2*i+1) downto 16*(2*i+1)))) and unsigned(FUNCT_SELECT_MASK(h)));
              dsp_mul_c(f)(31+32*(i) downto 32*(i)) <= std_logic_vector((unsigned(dsp_in_mul_operands(f)(0)(15+16*(2*i+1) downto 16*(2*i+1))) *
                                                                         unsigned(dsp_in_mul_operands(f)(1)(16*(2*i+1) - 1 downto 16*(2*i)))) and unsigned(FUNCT_SELECT_MASK(h)));
              dsp_mul_d(f)(31+32*(i) downto 32*(i)) <= std_logic_vector(unsigned(dsp_in_mul_operands(f)(0)(16*(2*i+1) - 1 downto 16*(2*i))) *
                                                                        unsigned(dsp_in_mul_operands(f)(1)(16*(2*i+1) - 1 downto 16*(2*i))));
            end loop;
          end if;
        end if;
      end loop;
    end if;
  end process;

  fsm_MUL_STAGE_1_COMB : process(all)
    variable h : integer;
  begin
    mul_tmp_a(f) <= (others => (others => '0'));
    mul_tmp_b(f) <= (others => (others => '0'));
    mul_tmp_c(f) <= (others => (others => '0'));
    mul_tmp_d(f) <= (others => (others => '0'));
    for g in 0 to (ACCL_NUM - FU_NUM) loop
      if multithreaded_accl_en = 1 then
        h := g; -- set the spm rd/wr ports equal to the "for-loop"
      elsif multithreaded_accl_en = 0 then
        h := f; -- set the spm rd/wr ports equal to the "for-generate"
      end if;
      -- KDOTP and KSVMUL instructions are handled here
      -- this part shifts the intermediate results appropriately, and then accumulates them in order to get the final mul result
      if mul_en(h) = '1' and (mul_stage_2_en(h) = '1' or recover_state_wires(h) = '1') then
        for i in 0 to SIMD-1 loop
          if MVTYPE_DSP(h) /= "10" then
            mul_tmp_a(f)(i) <= (dsp_mul_a(f)(15+16*(2*i) downto 16*(2*i)) & x"0000");
            mul_tmp_d(f)(i) <= (x"0000" & dsp_mul_d(f)(15+16*(2*i) downto 16*(2*i)));
          elsif MVTYPE_DSP(h) = "10" then
            -- mul_tmp_a(f)(i) <= (dsp_mul_a(f)(31+32*(2*i) downto 31*(2*i)) & x"0000"); -- The upper 32-bit results of the multiplication are discarded (Ah*Bh)
            mul_tmp_b(f)(i) <= (dsp_mul_b(f)(15+16*(2*i) downto 16*(2*i)) & x"0000"); -- Modified to only add the partial result to the lower 32 bits (Ah*Bl)
            mul_tmp_c(f)(i) <= (dsp_mul_c(f)(15+16*(2*i) downto 16*(2*i)) & x"0000"); -- Modified to only add the partial result to the lower 32 bits (Al*Bh)
            mul_tmp_d(f)(i) <= (dsp_mul_d(f)(31+32*(i) downto 32*(i)));               -- This is the lower 32-bit result of the partial multiplication (Al*Bl)
          end if;
        end loop;
      end if;
    end loop;
  end process;

  -- STAGE 2 --
  fsm_MUL_STAGE_2 : process(clk_i, rst_ni)
    variable h : integer;
  begin
    if rst_ni = '0' then
    elsif rising_edge(clk_i) then
      for g in 0 to (ACCL_NUM - FU_NUM) loop
        if multithreaded_accl_en = 1 then
          h := g; -- set the spm rd/wr ports equal to the "for-loop"
        elsif multithreaded_accl_en = 0 then
          h := f; -- set the spm rd/wr ports equal to the "for-generate"
        end if;
        -- Accumulate the partial multiplications to make up bigger multiplications
        if mul_en(h) = '1' and (mul_stage_2_en(h) = '1' or recover_state_wires(h) = '1') and halt_dsp_lat(h) = '0' then
          for i in 0 to SIMD-1 loop
            dsp_out_mul_results(f)((Data_Width-1)+Data_Width*(i) downto Data_Width*(i)) <= (std_logic_vector(unsigned(mul_tmp_a(f)(i)) + unsigned(mul_tmp_b(f)(i)) +
                                                                                                             unsigned(mul_tmp_c(f)(i)) + unsigned(mul_tmp_d(f)(i))));
          end loop;
        end if;
      end loop;
    end if;
  end process;

  fsm_RELU : process(clk_i, rst_ni)
    variable h : integer;
  begin
    if rst_ni = '0' then
    elsif rising_edge(clk_i) then
      for g in 0 to (ACCL_NUM - FU_NUM) loop
        if multithreaded_accl_en = 1 then
          h := g; -- set the spm rd/wr ports equal to the "for-loop"
        elsif multithreaded_accl_en = 0 then
          h := f; -- set the spm rd/wr ports equal to the "for-generate"
        end if;
        if relu_en(h) = '1' then
          if (relu_stage_1_en(h) = '1' or recover_state_wires(h) = '1') and halt_dsp_lat(h) = '0' then
            if MVTYPE_DSP(h) = "10" then
              for i in 0 to SIMD-1 loop
                if dsp_in_relu_operands(f)(31+32*(i)) = '1' then
                  dsp_out_relu_results(f)(31+32*(i) downto 32*(i)) <= (others => '0');
                else
                  dsp_out_relu_results(f)(31+32*(i) downto 32*(i)) <= dsp_in_relu_operands(f)(31+32*(i) downto 32*(i));
                end if;
              end loop;
            elsif MVTYPE_DSP(h) = "01" then
              for i in 0 to 2*SIMD-1 loop
                if dsp_in_relu_operands(f)(15+16*(i)) = '1' then
                  dsp_out_relu_results(f)(15+16*(i) downto 16*(i)) <= (others => '0');
                else
                  dsp_out_relu_results(f)(15+16*(i) downto 16*(i)) <= dsp_in_relu_operands(f)(15+16*(i) downto 16*(i));
                end if;
              end loop;
            elsif MVTYPE_DSP(h) = "00" then
              for i in 0 to 4*SIMD-1 loop
                if dsp_in_relu_operands(f)(7+8*(i)) = '1' then
                  dsp_out_relu_results(f)(7+8*(i) downto 8*(i)) <= (others => '0');
                else
                  dsp_out_relu_results(f)(7+8*(i) downto 8*(i)) <= dsp_in_relu_operands(f)(7+8*(i) downto 8*(i));
                end if;
              end loop;
            end if;
          end if;
        end if;
      end loop;
    end if;
  end process;
end generate FU_replicated;

ACCUM_STG : ACCUMULATOR
  port map(
    clk_i                       => clk_i,
    rst_ni                      => rst_ni,
    MVTYPE_DSP                  => MVTYPE_DSP,
    accum_stage_1_en            => accum_stage_1_en,
    accum_stage_2_en            => accum_stage_2_en,
    recover_state_wires         => recover_state_wires,
    halt_dsp_lat                => halt_dsp_lat,
    state_DSP                   => state_DSP,
    decoded_instruction_DSP_lat => decoded_instruction_DSP_lat,
    dsp_in_accum_operands       => dsp_in_accum_operands,
    dsp_out_accum_results       => dsp_out_accum_results
  );
end DSP;
-- END of DSP architecture -----
```
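The comments in `fsm_MUL_STAGE_1` describe how wider multiplications are assembled from 16-bit partial products. The following Python sketch is a behavioural model of that arithmetic only, not part of the RTL: the function names are hypothetical, and it assumes `Data_Width` is 32 so that each lane keeps only the lower 32 bits of the result. It may help when checking the listing by hand.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32_from_16bit_partials(a: int, b: int) -> int:
    """Lower 32 bits of a*b for one lane, modelling the MVTYPE "10" path."""
    a &= MASK32
    b &= MASK32
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    # Cross terms are shifted left by 16 bits, so only their low
    # 16 bits can land inside the 32-bit result.
    cross_1 = ((a_hi * b_lo) & MASK16) << 16
    cross_2 = ((a_lo * b_hi) & MASK16) << 16
    # Alow*Blow is used as a full 32-bit product and is not shifted.
    low = a_lo * b_lo
    # Ahigh*Bhigh would be shifted left by 32 bits and is discarded.
    return (cross_1 + cross_2 + low) & MASK32


def mul16_pair(a: int, b: int) -> int:
    """Two independent 16-bit products packed in one 32-bit lane (16-bit mode)."""
    a &= MASK32
    b &= MASK32
    hi = ((a >> 16) * (b >> 16)) & MASK16   # upper halves, kept in bits 31..16
    lo = ((a & MASK16) * (b & MASK16)) & MASK16  # lower halves, kept in bits 15..0
    return (hi << 16) | lo
```

For any pair of 32-bit operands, `mul32_from_16bit_partials(a, b)` should equal `(a * b) & 0xFFFFFFFF`, which is the property the stage-2 accumulation relies on.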

## 3. Scratchpad Memory Interface (SCI)

```
-- SCI pinout -----
entity Scratchpad_memory_interface is
  port (
    clk_i, rst_ni          : in  std_logic;
    data_rvalid_i          : in  std_logic;
    state_LS               : in  fsm_LS_states;
    sc_word_count_wire     : in  integer;
    spm_bcast              : in  std_logic;
    harc_LS_wire           : in  accl_range;
    dsp_we_word            : in  array_2d(accl_range)(SIMD-1 downto 0);
    ls_sc_data_write_wire  : in  std_logic_vector(Data_Width-1 downto 0);
    dsp_sc_data_write_wire : in  array_2d(accl_range)(SIMD_Width-1 downto 0);
    ls_sc_read_addr        : in  std_logic_vector(Addr_Width-(SIMD_BITS+3) downto 0);
    ls_sc_write_addr       : in  std_logic_vector(Addr_Width-(SIMD_BITS+3) downto 0);
    dsp_sc_write_addr      : in  array_2d(accl_range)(Addr_Width-1 downto 0);
    ls_sci_req             : in  std_logic_vector(SPM_NUM-1 downto 0);
    ls_sci_we              : in  std_logic_vector(SPM_NUM-1 downto 0);
    dsp_sci_req            : in  array_2d(accl_range)(SPM_NUM-1 downto 0);
    dsp_sci_we             : in  array_2d(accl_range)(SPM_NUM-1 downto 0);
    kmemld_inflight        : in  std_logic_vector(SPM_NUM-1 downto 0);
    kmemstr_inflight       : in  std_logic_vector(SPM_NUM-1 downto 0);
    dsp_to_sc              : in  array_3d(accl_range)(SPM_NUM-1 downto 0)(1 downto 0);
    dsp_sc_read_addr       : in  array_3d(accl_range)(1 downto 0)(Addr_Width-1 downto 0);
    dsp_sc_data_read       : out array_3d(accl_range)(1 downto 0)(SIMD_Width-1 downto 0);
    ls_sc_data_read_wire   : out std_logic_vector(Data_Width-1 downto 0);
    ls_sci_wr_gnt          : out std_logic;
    dsp_sci_wr_gnt         : out std_logic_vector(accl_range);
    ls_data_gnt_i          : out std_logic_vector(SPM_NUM-1 downto 0);
    dsp_data_gnt_i         : out std_logic_vector(accl_range)
  );
end entity; --
```
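The port widths above suggest how a scratchpad byte address is split: the LSU ports carry `Addr_Width-(SIMD_BITS+2)` bits, which matches the slice `Addr_Width-1 downto SIMD_BITS+2` that the SCI architecture takes from the DSP addresses, while the DSP keeps the low `SIMD_BITS+2` bits to check the word access. The Python sketch below illustrates that split. It is only a model under stated assumptions: the function name is hypothetical, `SIMD` is assumed to be a power of two with `SIMD_BITS = log2(SIMD)`, words are assumed to be 4 bytes, and reading bits `SIMD_BITS+1 downto 2` as the sub-scratchpad (bank) index is an interpretation, not something the listing states.

```python
def split_spm_address(addr: int, simd: int, addr_width: int = 32):
    """Split a byte address into (row, bank, byte_offset).

    Assumptions (not taken from the thesis): simd is a power of two,
    SIMD_BITS = log2(simd), and each bank word is 4 bytes wide.
    """
    if simd < 1 or simd & (simd - 1):
        raise ValueError("simd must be a power of two")
    simd_bits = simd.bit_length() - 1
    addr &= (1 << addr_width) - 1
    byte_offset = addr & 0b11            # bits 1..0: byte within the 32-bit word
    bank = (addr >> 2) & (simd - 1)      # bits SIMD_BITS+1..2: assumed bank index
    row = addr >> (simd_bits + 2)        # bits Addr_Width-1..SIMD_BITS+2: row address
    return row, bank, byte_offset


if __name__ == "__main__":
    # With SIMD = 4, address 0x1234 maps to row 0x123, bank 1, byte 0.
    print(split_spm_address(0x1234, simd=4))
```

With `SIMD = 1` the bank field is empty and every word lives in bank 0, so the row is simply the word address.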

```
architecture SCI of Scratchpad_memory_interface is

signal dsp_sc_data_write_int_wire      : array_2d(accl_range)(SIMD_Width-1 downto 0);
signal ls_sc_data_read_int_wire        : array_2d(accl_range)(Data_Width-1 downto 0);
signal rd_offset                       : array_3d(accl_range)(1 downto 0)(SIMD-1 downto 0);
signal wr_offset                       : array_2d(accl_range)(SIMD-1 downto 0);
signal dsp_sc_data_read_int_wire       : array_3d(accl_range)(1 downto 0)(SIMD_Width-1 downto 0);
signal dsp_sc_read_addr_lat            : array_3d(accl_range)(1 downto 0)(SIMD_BITS+1 downto 0); -- Only need the lower part to check for the word access
signal dsp_sci_req_lat                 : array_2d(accl_range)(SPM_NUM-1 downto 0);
signal dsp_to_sc_lat                   : array_3d(accl_range)(SPM_NUM-1 downto 0)(1 downto 0);
signal dsp_sc_data_read_wire           : array_3d(accl_range)(1 downto 0)(SIMD_Width-1 downto 0);
signal ls_sc_data_read_replicated      : array_2d(accl_range)(Data_Width-1 downto 0);
signal ls_sc_data_read_wire_replicated : array_2d(accl_range)(Data_Width-1 downto 0);
signal dsp_sci_wr_gnt_lat              : std_logic_vector(accl_range);
signal ls_sci_wr_gnt_replicated        : std_logic_vector(accl_range);
signal ls_sci_wr_gnt_lat_replicated    : std_logic_vector(accl_range);
signal halt_dsp                        : std_logic_vector(accl_range);
signal sc_word_count                   : array_2d_int(accl_range);
signal sc_we                           : array_2d(accl_range)(SIMD*SPM_NUM-1 downto 0);
signal sc_addr_wr                      : array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Addr_Width-(SIMD_BITS+3) downto 0);
signal sc_addr_rd                      : array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Addr_Width-(SIMD_BITS+3) downto 0);
signal sc_data_wr                      : array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Data_Width-1 downto 0);
signal sc_data_rd                      : array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Data_Width-1 downto 0);

component Scratchpad_memory
  port(
    clk_i      : in  std_logic;
    sc_we      : in  array_2d(accl_range)(SIMD*SPM_NUM-1 downto 0);
    sc_addr_wr : in  array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Addr_Width-(SIMD_BITS+3) downto 0);
    sc_addr_rd : in  array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Addr_Width-(SIMD_BITS+3) downto 0);
    sc_data_wr : in  array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Data_Width-1 downto 0);
    sc_data_rd : out array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Data_Width-1 downto 0)
  );
end component;

----- SCI BEGIN ------
begin

  SC : Scratchpad_memory
    port map(
      sc_we      => sc_we,
      clk_i      => clk_i,
      sc_addr_rd => sc_addr_rd,
      sc_addr_wr => sc_addr_wr,
      sc_data_wr => sc_data_wr,
      sc_data_rd => sc_data_rd
    );

  SPM_replicated : for h in accl_range generate

  SCI_Exec_Unit : process(clk_i, rst_ni) -- single cycle unit, fully synchronous
  begin
    if rst_ni = '0' then
      dsp_sc_read_addr_lat(h) <= (others => (others => '0'));
      dsp_to_sc_lat(h)        <= (others => (others => '0'));
      ls_data_gnt_i           <= (others => '0');
      dsp_sci_req_lat(h)      <= (others => '0');
      sc_word_count(h)        <= 0;
    elsif rising_edge(clk_i) then
      halt_dsp(h)                     <= '0';
      dsp_sci_wr_gnt_lat(h)           <= dsp_sci_wr_gnt(h);
      ls_sci_wr_gnt_lat_replicated(h) <= ls_sci_wr_gnt_replicated(h);
      dsp_sci_req_lat(h)              <= dsp_sci_req(h);
      dsp_to_sc_lat(h)                <= dsp_to_sc(h);
      if harc_LS_wire = h or spm_bcast = '1' then
        sc_word_count(h) <= sc_word_count_wire;
      end if;
      if unsigned(ls_data_gnt_i) /= 0 then
        ls_sc_data_read_replicated(h) <= ls_sc_data_read_wire_replicated(h);
      end if;
      if (dsp_sci_wr_gnt(h) = '0' and dsp_sci_we(h) /= (0 to SPM_NUM-1 => '0')) then
        halt_dsp(h) <= '1';
      end if;
      if halt_dsp(h) = '0' then
        dsp_sc_data_read(h) <= dsp_sc_data_read_wire(h);
      end if;
      for i in 0 to SPM_NUM-1 loop
        if ls_sci_req(i) = '1' then -- AAA most probably useless
          ls_data_gnt_i(i) <= '1';
        elsif ls_sci_req(i) = '0' then
          ls_data_gnt_i(i) <= '0';
        end if;
        if dsp_sci_req(h)(i) = '1' then
          for k in 0 to 1 loop
            dsp_sc_read_addr_lat(h)(k) <= dsp_sc_read_addr(h)(k)(SIMD_BITS+1 downto 0);
          end loop;
        end if;
      end loop;
    end if;
  end process;

  ls_sc_data_read_wire <= ls_sc_data_read_wire_replicated(harc_LS_wire);
  ls_sci_wr_gnt        <= ls_sci_wr_gnt_replicated(harc_LS_wire);

  SCI_Exec_Unit_comb : process(all)
  begin
    dsp_data_gnt_i(h) <= '0';
    for l in 0 to (SIMD*SPM_NUM)-1 loop
      sc_we(h)(l)      <= '0';
      sc_addr_rd(h)(l) <= (others => '0');
      sc_addr_wr(h)(l) <= (others => '0');
      sc_data_wr(h)(l) <= (others => '0');
    end loop;
    rd_offset(h)                       <= (others => (others => '0'));
    dsp_sc_data_read_int_wire(h)       <= (others => (others => '0'));
    wr_offset(h)                       <= (others => '0');
    ls_sci_wr_gnt_replicated(h)        <= ls_sci_wr_gnt_lat_replicated(h);
    dsp_sci_wr_gnt(h)                  <= dsp_sci_wr_gnt_lat(h);
    ls_sc_data_read_wire_replicated(h) <= ls_sc_data_read_replicated(h);
    dsp_sc_data_write_int_wire(h)      <= (others => '0');
    dsp_sc_data_read_wire(h)           <= dsp_sc_data_read(h);

    for i in 0 to SPM_NUM-1 loop -- Loop through scratchpads A,B,C,D

      if data_rvalid_i = '1' then -- LS write port
        if ls_sci_req(i) = '1' and ls_sci_we(i) = '1' and ls_sci_wr_gnt = '1' then
          if harc_LS_wire = h or spm_bcast = '1' then
            sc_we(h)((SIMD)*i + sc_word_count(h))      <= '1';
            sc_data_wr(h)(sc_word_count(h) + (SIMD)*i) <= ls_sc_data_write_wire(31 downto 0);
            sc_addr_wr(h)(sc_word_count(h) + (SIMD)*i) <= ls_sc_write_addr;
          end if;
        end if;
      end if;

      if ls_data_gnt_i(i) = '1' then
        if harc_LS_wire = h then
          ls_sc_data_read_wire_replicated(h) <= sc_data_rd(h)((SIMD)*i + sc_word_count(h)); -- sc_word_count because data being read is delayed one cycle after the request
        end if;
      end if;

      if ls_sci_req(i) = '1' then -- LS read port
        if harc_LS_wire = h then
          sc_addr_rd(h)(sc_word_count_wire + (SIMD)*i) <= ls_sc_read_addr;
        end if;
      end if;

      if dsp_sci_we(h)(i) = '1' and dsp_sci_wr_gnt(h) = '1' then -- DSP write port
        for j in 0 to SIMD-1 loop -- Loop through the sub-scratchpads
          sc_we(h)((SIMD)*i+j)      <= dsp_we_word(h)(j);
          sc_addr_wr(h)((SIMD)*i+j) <= std_logic_vector(unsigned(dsp_sc_write_addr(h)(Addr_Width - 1 downto SIMD_BITS+2)) + wr_offset(h)(j));
          sc_data_wr(h)((SIMD)*i+j) <= dsp_sc_data_write_int_wire(h)(31+32*j downto 32*j);
        end loop;
      end if;

      if dsp_sci_req(h)(i) = '1' and dsp_to_sc(h)(i)(0) = '1' and dsp_data_gnt_i(h) = '1' then -- DSP read port 1
        for j in 0 to SIMD-1 loop -- Loop through the sub-scratchpads
          sc_addr_rd(h)((SIMD)*i+j) <= std_logic_vector(unsigned(dsp_sc_read_addr(h)(0)(Addr_Width - 1 downto SIMD_BITS+2)) + rd_offset(h)(0)(j));
        end loop;
      end if;
      for j in 0 to SIMD-1 loop -- Loop through the sub-scratchpads
        if dsp_sci_req_lat(h)(i) = '1' and dsp_to_sc_lat(h)(i)(0) = '1' then -- DSP read port 1
          dsp_sc_data_read_int_wire(h)(0)(31+32*j downto 32*j) <= sc_data_rd(h)((SIMD)*i+j);
        end if;
      end loop;
```

```
if dsp\_sci\_req(h)(i) = '1' and dsp\_to\_sc(h)(i)(1) = '1' and dsp\_data\_gnt\_i(h) = '1' then -- DSP read port 2
```

195 196 197 198 199 200 for j in 0 to SIMD-1 loop sc\_addr\_rd(h)((SIMD)\*i+j) <= std\_logic\_vector(unsigned(dsp\_sc\_read\_addr(h)(1)(Addr\_Width - 1 downto SIMD\_BITS+2)) +
rd\_offset(h)(1)(j));</pre> end loop; er fc  $\tilde{2}\check{0}\check{1}$ er --if --el --el er if --el -el --el er if DSP el el eı enc ----------------for if er enc for if 32\*j) eı ene for fc

| end if;<br>for j in 0 to SIMD-1 loop Loop through the sub-scratchpads                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| if dsp_sci_req_lat(h)(i) = '1' and dsp_to_sc_lat(h)(i)(1) = '1' then -- DSP read port 2<br>dsp_sc_data_read_int_wire(h)(1)(31+32*j downto 32*j) <= sc_data_rd(h)((SIMD)*i+j); |
| end if;<br>end loop; |
| -- Allow a DSP read only if the SPM(i) being loaded belongs to another thread and the instruction is not a broadcast load (data hazard)<br>if kmemld_inflight(i) = '1' and h /= harc_LS_wire and spm_bcast = '0' then<br>dsp_data_gnt_i(h) <= '1'; |
| -- Allow a DSP read only when it is not currently being read by a kmemstr, because we only have one read port (structural hazard)<br>elsif kmemstr_inflight(i) = '1' and dsp_sci_req(h)(i) = '1' and h /= harc_LS_wire then<br>dsp_data_gnt_i(h) <= '1'; |
| -- Allow a DSP read if there are no current LSU accesses to SPM(i)<br>elsif kmemld_inflight(i) = '0' and kmemstr_inflight(i) = '0' and dsp_sci_req(h)(i) = '1' then<br>dsp_data_gnt_i(h) <= '1';<br>end if; |
| if dsp_sci_we(h) = (0 to SPM_NUM-1 => '0') then<br>dsp_sci_wr_gnt(h) <= '0';<br>-- Allow the DSP to write only if the kmemld is filling the SPM(i) of another thread<br>elsif kmemld_inflight(i) = '1' and dsp_sci_we(h)(i) = '1' and h /= harc_LS_wire and spm_bcast = '0' then |
| dsp_sci_wr_gnt(h) <= '1';<br>-- Allow the DSP to write only when the kmemstr is reading SPM(i) of another thread<br>elsif kmemstr_inflight(i) = '1' and dsp_sci_we(h)(i) = '1' and h /= harc_LS_wire then<br>dsp_sci_wr_gnt(h) <= '1'; |
| -- Allow the DSP to write if there are no current LSU accesses to SPM(i)<br>elsif kmemld_inflight(i) = '0' and kmemstr_inflight(i) = '0' and dsp_sci_we(h)(i) = '1' then<br>dsp_sci_wr_gnt(h) <= '1';<br>end if; |
| if kmemld_inflight(i) = '1' and dsp_sci_we(h)(i) = '0' then -- One LSU write enable request will put the ls_sci_wr_gnt to '1' if there are no ongoing DSP writes to the same scratchpad<br>ls_sci_wr_gnt_replicated(h) <= '1';<br>elsif kmemld_inflight(i) = '1' and dsp_sci_we(h)(i) = '1' and (h /= harc_LS_wire) and spm_bcast = '0' then<br>ls_sci_wr_gnt_replicated(h) <= '1';<br>elsif unsigned(kmemld_inflight) = 0 then -- All the ls_sci_we must be zero in order to switch the ls_sci_wr_gnt back to '0'<br>ls_sci_wr_gnt_replicated(h) <= '0';<br>end if; |
| end loop; |

-- Loop through the sub-scratchpads

```
            if (to_integer(unsigned(dsp_sc_read_addr_lat(h)(k))) = 4*i) then
              for j in 0 to SIMD-1 loop
                if j >= i then
                  -- Lanes at or above lane i shift down by i positions
                  dsp_sc_data_read_wire(h)(k)(31+32*(j-i) downto 32*(j-i)) <= dsp_sc_data_read_int_wire(h)(k)(31+32*j downto 32*j);
                elsif j < i then
                  -- Lanes below lane i wrap around to the upper positions
                  dsp_sc_data_read_wire(h)(k)(31+32*((SIMD-1)-i+(j+1)) downto 32*((SIMD-1)-i+(j+1))) <= dsp_sc_data_read_int_wire(h)(k)(31+32*j downto 32*j);
                end if;
              end loop;
            end if;
          end loop;
        end loop;
      end process;
  end generate SPM_replicated;
end SCI;
-- END of SCI architecture -----
```
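
The lane realignment at the end of the SCI process can be examined in isolation. The following is a minimal, self-checking sketch and not part of the Klessydra sources: the testbench `lane_rotate_tb`, the function `rotate_lanes`, and the value `SIMD = 4` are assumptions introduced only for illustration. It also assumes that a read address equal to `4*i` selects 32-bit lane `i` as the first element, so that lane `i` of the raw scratchpad word moves to lane 0 and the lanes below it wrap around to the top.

```vhdl
-- Illustrative sketch only; all names here are hypothetical.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lane_rotate_tb is
end entity lane_rotate_tb;

architecture sim of lane_rotate_tb is
  constant SIMD : natural := 4;  -- assumed number of 32-bit lanes

  -- Rotate the lanes so that input lane i becomes output lane 0.
  function rotate_lanes(data_in : std_logic_vector(32*SIMD-1 downto 0);
                        i       : natural) return std_logic_vector is
    variable data_out : std_logic_vector(32*SIMD-1 downto 0);
    variable dst      : natural;
  begin
    for j in 0 to SIMD-1 loop
      if j >= i then
        dst := j - i;          -- lanes at or above lane i shift down
      else
        dst := SIMD - i + j;   -- lanes below lane i wrap to the top
      end if;
      data_out(31+32*dst downto 32*dst) := data_in(31+32*j downto 32*j);
    end loop;
    return data_out;
  end function;
begin
  check : process
    variable word : std_logic_vector(32*SIMD-1 downto 0);
    variable res  : std_logic_vector(32*SIMD-1 downto 0);
  begin
    -- Lane j holds the value j, so the rotation is easy to read back.
    for j in 0 to SIMD-1 loop
      word(31+32*j downto 32*j) := std_logic_vector(to_unsigned(j, 32));
    end loop;
    for i in 0 to SIMD-1 loop
      res := rotate_lanes(word, i);
      for j in 0 to SIMD-1 loop
        -- Output lane j must carry what was stored in input lane (i+j) mod SIMD.
        assert to_integer(unsigned(res(31+32*j downto 32*j))) = (i + j) mod SIMD
          report "unexpected lane order" severity failure;
      end loop;
    end loop;
    report "lane rotation check passed";
    wait;
  end process;
end architecture sim;
```

The index expression `(SIMD-1)-i+(j+1)` used in the listing is equal to `SIMD-i+j`, which is the form used in the sketch for the wrapped lanes.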

## 4. Scratchpad Memories

```
entity Scratchpad_memory is
port(
   clk_i       : in  std_logic;
   sc_we       : in  array_2d(accl_range)(SIMD*SPM_NUM-1 downto 0);
   sc_addr_wr  : in  array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Addr_Width-(SIMD_BITS+3) downto 0);
   sc_addr_rd  : in  array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Addr_Width-(SIMD_BITS+3) downto 0);
   sc_data_wr  : in  array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Data_Width-1 downto 0);
   sc_data_rd  : out array_3d(accl_range)(SIMD*SPM_NUM-1 downto 0)(Data_Width-1 downto 0)
   );
end Scratchpad_memory;

architecture SC of Scratchpad_memory is
signal mem : array_3d(ACCL_NUM*SIMD*SPM_NUM-1 downto 0)(2**(Addr_Width-(SIMD_BITS+2))-1 downto 0)(Data_Width-1 downto 0);
signal h : std_logic_vector(ACCL_NUM*SIMD*SPM_NUM downto 0);
attribute ram_style : string;
attribute ram_style of mem : signal is "block";
begin
----- replicate logic three times -----
spm_replicas : for g in accl_range generate
 spm_banks : for h in 0 to SIMD*SPM_NUM-1 generate
  write_logic : process(clk_i)
  begin
   if (clk_i'event and clk_i = '1') then
    -- Registered read: returns the value stored before any write made in this cycle
    sc_data_rd(g)(h) <= mem(g*SIMD*SPM_NUM + h)(to_integer(unsigned(sc_addr_rd(g)(h))));
    if sc_we(g)(h) = '1' then -- write mode
     mem(g*SIMD*SPM_NUM + h)(to_integer(unsigned(sc_addr_wr(g)(h)))) <= sc_data_wr(g)(h);
    end if; -- we
   end if; -- clk
  end process;
 end generate spm_banks;
end generate spm_replicas;
----- end of replicated logic -----
end SC;
```
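
Each scratchpad bank in the listing above registers its read data and performs the write in the same clocked process, so a read and a write to the same address in one cycle return the previously stored word. The following self-checking sketch shows this read-first behaviour on a single bank. It is only an illustration under stated assumptions: the testbench `spm_bank_tb`, the constants `ADDR_BITS` and `DATA_BITS`, the local `mem_t` type, and the 10 ns clock period are hypothetical and do not come from the Klessydra sources.

```vhdl
-- Illustrative sketch only; all names here are hypothetical.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity spm_bank_tb is
end entity spm_bank_tb;

architecture sim of spm_bank_tb is
  constant ADDR_BITS : natural := 4;   -- assumed bank depth of 16 words
  constant DATA_BITS : natural := 32;
  type mem_t is array (0 to 2**ADDR_BITS-1) of std_logic_vector(DATA_BITS-1 downto 0);
  signal mem     : mem_t := (others => (others => '0'));
  signal clk     : std_logic := '0';
  signal we      : std_logic := '0';
  signal addr_wr : unsigned(ADDR_BITS-1 downto 0) := (others => '0');
  signal addr_rd : unsigned(ADDR_BITS-1 downto 0) := (others => '0');
  signal data_wr : std_logic_vector(DATA_BITS-1 downto 0) := (others => '0');
  signal data_rd : std_logic_vector(DATA_BITS-1 downto 0);
begin
  -- Same structure as one bank of the listing: registered read, then write.
  bank : process(clk)
  begin
    if rising_edge(clk) then
      data_rd <= mem(to_integer(addr_rd));
      if we = '1' then
        mem(to_integer(addr_wr)) <= data_wr;
      end if;
    end if;
  end process;

  stimulus : process
    -- Generate one full clock period.
    procedure tick is
    begin
      clk <= '1'; wait for 5 ns;
      clk <= '0'; wait for 5 ns;
    end procedure;
  begin
    -- Write and read address 3 in the same cycle.
    we      <= '1';
    addr_wr <= to_unsigned(3, ADDR_BITS);
    addr_rd <= to_unsigned(3, ADDR_BITS);
    data_wr <= x"0000ABCD";
    tick;
    assert data_rd = x"00000000"
      report "expected the old word on a same-cycle read" severity failure;
    -- One cycle later the written word is visible on the read port.
    we <= '0';
    tick;
    assert data_rd = x"0000ABCD"
      report "expected the written word one cycle later" severity failure;
    report "read-first check passed";
    wait;
  end process;
end architecture sim;
```

This is why logic that consumes scratchpad data in the cycle after a write to the same address needs to account for one cycle of latency on the read port.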

## Glossary

**ANN:** Artificial Neural Networks

**CNN:** Convolutional Neural Networks

**CSR:** Control and Status Registers

**DCNN:** Deep Convolutional Neural Networks

**DLP:** Data Level Parallelism

**FPGA:** Field Programmable Gate Array

**FU:** Functional unit (general name for any arithmetic or logic unit)

**F0x:** Fault tolerant version of the T0 cores designed to make the Klessydra cores reliable in space environments prone to faults

**Harc:** (hardware context) a positive integer number identifying a hardware thread in the processing core

**Hart:** hardware thread

**IMT:** Interleaved Multithreading

**IoT:** Internet of Things

**ILP:** Instruction Level Parallelism

**IPC:** Instructions per Cycle

**IRQ:** interrupt request

**ISA:** Instruction Set Architecture

**Klessydra:** the name of the family of processing cores reported in this manual

**MIPS:** Millions of Instructions Per Second

**NT:** Number of active harts in the core

**Modelsim:** RTL Simulator

**OOO:** Out-of-order architecture

**PULP:** an open-source multi-core processor architecture

**PULPino:** an open-source System-on-Chip single-core microcontroller architecture

**ReLu:** Rectified Linear Unit, it rectifies negative values to zero

**RI5CY:** Generic four-stage pipeline Riscy core from Pulpino

**RISC:** Reduced Instruction Set Computing

**RISC-V:** Open RISC instruction set architecture

**S0:** a core belonging to the Klessydra family featuring single-thread execution at minimum hardware cost

**SIMD:** Single Instruction Multiple Data

**SPE:** Special Purpose Engine, the engine that executes the SPMU instructions

**SPI:** Scratchpad Memory Interface, the interface that manages the communications between the SPE, LSU, and SPMs

**SPM:** Scratchpad memory, a local memory accessed by the LSU and SPE

**SPMU:** Special Purpose Mathematical Unit, the hardware accelerator of the T13, which integrates two entities: the SPE and the SPI

**T0:** an IMT implementation in the Klessydra family, supporting interleaved multiple thread execution

**T1:** upgraded version of the T0 core designed to widen the target applications of Klessydra through hardware acceleration

**TLP:** Thread Level Parallelism

**TPS:** Thread Pool Size, the number of hardware threads in the core

**TPB:** Thread Pool Baseline, the minimum baseline required to avoid any pipeline stalls

**Vivado:** Software suite for synthesizing RTL on Xilinx FPGAs

**VGG16:** A deep fully connected convolutional neural network, used for image recognition

**Zero-Riscy:** Generic two-stage pipeline Riscy core from Pulpino

## Bibliography

[1] Shilov, Anton. "Samsung Completes Development of 5nm EUV Process Technology". www.anandtech.com.

[2] Shilov, Anton. "TSMC: First 7nm EUV Chips Taped Out, 5nm Risk Production in Q2 2019".

[3] Moore, Gordon E. (1965-04-19). "Cramming more components onto integrated circuits". *Electronics*.

[4] Omura, Yasuhisa, Abhijit Mallik, and Naoto Matsuo. *MOS Devices for Low-voltage and Low-energy Applications*. John Wiley & Sons, 2017.

[5] Ge, Fen, Ning Wu, Hao Xiao, Yuanyuan Zhang, and Fang Zhou. "Compact Convolutional Neural Network Accelerator for IoT Endpoint SoC." *Electronics* 8, no. 5 (2019): 497.

[6] Samie, F.; Bauer, L.; Henkel, J. "From Cloud Down to Things: An Overview of Machine Learning in Internet of Things". IEEE Internet Things J. **2019**, 4662, 1.

[7] A. Waterman, K. Asanovic, Ed., The RISC-V Instruction Set Manual - Volume I: User-Level ISA - Document Version 2.2, May 2017. [Online] https://riscv.org/specifications/

[8] A. Waterman, K. Asanovic, Ed., The RISC-V Instruction Set Manual - Volume II: Privileged ISA - Document Version 1.10, May 2017. [Online] https://riscv.org/specifications/

[9] "RISC-V Cores and SoC Overview". RISC-V. 25 September 2019. Retrieved 5 October 2019.

[10] Rossi, Davide, Francesco Conti, Andrea Marongiu, Antonio Pullini, Igor Loi, Michael Gautschi, Giuseppe Tagliavini, Alessandro Capotondi, Philippe Flatresse, and Luca Benini. "PULP: A parallel ultra low power platform for next generation IoT applications." In *2015 IEEE Hot Chips 27 Symposium (HCS)*, pp. 1-39. IEEE, 2015.

[11] Cheikh, A., Cerutti, G., Mastrandrea, A., Menichelli, F., Olivieri, M., "The microarchitecture of a multi-threaded RISC-V compliant processing core family for IoT end-nodes", Proc. of APPLEPIES 2017, *Lecture Notes in Electrical Engineering*, 2018, Springer.

[12] Abbas, Z.; Mastrandrea, A.; Olivieri, M., A Voltage-Based Leakage Current Calculation Scheme and its Application to Nanoscale MOSFET and FinFET Standard-Cell Designs, *IEEE Trans. on Very Large Scale Integration (VLSI) Systems*, 22(12), pp. 2549-2560, Dec. 2014.

[13] M. Makni, M. Baklouti, S. Niar, M. W. Jmal and M. Abid, "A comparison and performance evaluation of FPGA soft-cores for embedded multi-core systems," *11th Int. Design & Test Symposium (IDT)*, Hammamet, 2016, pp. 154-159.

[14] Trevor Martin, Ed., *Designer's Guide to the Cortex-M Processor Family*, 2nd Edition, Elsevier, 2016.

[15] Olivieri, M., Cheikh, A., Cerutti, G., Mastrandrea, A., & Menichelli, F., Investigation on the optimal pipeline organization in RISC-V multi-threaded soft processor cores. In Proc. of 2017 New Generation of CAS (NGCAS), (pp. 45-48). IEEE.

[16] Bechara, C., Berhault, A., Ventroux, N., Chevobbe, S., Lhuillier, Y., David, R. and Etiemble, D., 2011, December. A small footprint interleaved multithreaded processor for embedded systems. In *2011 18th IEEE International Conference on Electronics, Circuits, and Systems* (pp. 685-690). IEEE.

[17] Traber, A., Zaruba, F., Stucki, S., Pullini, A., Haugou, G., Flamand, E., Gurkaynak, F.K. and Benini, L., 2016, January. PULPino: A small single-core RISC-V SoC. In *3rd RISCV Workshop*.

[18] Conti, F. "An open-source microcontroller system based on RISC-V Pulpino free open source GitHub repository"

[19] Pulpino custom RI5CY toolchain "ri5cy\_gnu\_toolchain on GitHub featuring patches for zero\_riscy and riscy cores"

[20] Blasi, L., Vigli, F., Cheikh, A., Mastrandrea, A., Menichelli, F., Olivieri, M., A RISC-V Fault-Tolerant Microcontroller Core Architecture Based on a Hardware Thread Full-Weak protection and a Thread-Controlled Watch-Dog Timer, In: *Applications in Electronics Pervading Industry, Environment and Society. ApplePies.* 2019.

[21] S. Gupta, N. Gala, G. S. Madhusudan and V. Kamakoti, "SHAKTI-F: A Fault Tolerant Microprocessor Architecture," in 2015 IEEE 24th Asian Test Symposium, 2015.

[22] F. Menichelli and M. Olivieri, "Static Minimization of Total Energy Consumption in Memory Subsystem for Scratchpad-Based Systems-on-Chips," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 2, pp. 161-171, Feb. 2009

[23] Olivieri, M., Menichelli, F., Mastrandrea, A., Optimal pipeline stage balancing in the presence of large isolated interconnect delay (2017) Electronics Letters, 53 (4), pp. 229-231.

[24] Malavenda, C.S., Menichelli, F., Olivieri, M., "Delay-tolerant, low-power protocols for large security-critical wireless sensor networks", (2012) Journal of Computer Networks and Communications.

[25] Malavenda, C.S., Menichelli, F., Olivieri, M., "A regulation-based security evaluation method for data link in wireless sensor network", (2014) Journal of Computer Networks and Communications.

[26] Malavenda, C.S., Menichelli, F., Olivieri, M., "Wireless and Ad Hoc sensor networks: An industrial example using delay tolerant, low power protocols for security-critical applications", (2014) Lecture Notes in Electrical Engineering, 289, pp. 153-162.

[27] Jim Duffy, "8 Internet things that are not IoT", https://www.networkworld.com/, June 26, 2014.

[28] Sun, Yi, Ding Liang, Xiaogang Wang, and Xiaoou Tang. "Deepid3: Face recognition with very deep neural networks." *arXiv preprint arXiv:1502.00873* (2015).

[29] Genesys 2 Reference Manual by Digilent, [Online] https://reference.digilentinc.com/reference/programmable-logic/genesys-2/reference-manual

[30] XILINX 7-Series User Guide and reference manual https://www.xilinx.com/video/fpga/7-series-fpga-overview.html

[31] Cheikh, A., Klessydra-T02, "A multi-threaded microprocessor interleaving as minimum two harts, which is pin-to-pin compatible with pulpino riscy cores"

[32] Cheikh, A., Klessydra-T03, "A multi-threaded microprocessor interleaving as minimum three harts, which is pin-to-pin compatible with pulpino riscy cores"

[33] Cheikh, A., Klessydra-T13, "An Extended Version of the T0x multithreaded cores, with custom vector instructions, and superscalar execution. The core is pin-to-pin compatible with the pulpino riscy cores"

[34] Blasi, L., Vigli, F., Klessydra-F03, "A fault tolerant version of the T03x core, using triple redundancy approach to ensure fault tolerance"

[35] RISC-V GNU Toolchain "https://github.com/riscv/riscv-gnu-toolchain"

[36] Steinke, Stefan; Lars Wehmeyer; Bo-Sik Lee; Peter Marwedel (2002). "Assigning Program and Data Objects to Scratchpad for Energy Reduction" (PDF). University of Dortmund. Retrieved 3 October 2013. "3.2 Scratchpad model .. The scratchpad memory uses software to control the location assignment of data."

[37] Rajeshwari Banakar, "Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems", CODES'02, May 6–8, 2002.

[38] Huthmann, Jens, Julian Oppermann, and Andreas Koch. "Automatic high-level synthesis of multi-threaded hardware accelerators." In 2014 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1-4. IEEE, 2014.

[39] Lindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. "NVIDIA Tesla: A unified graphics and computing architecture." *IEEE micro* 28, no. 2 (2008): 39-55.

[40] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." *arXiv preprint arXiv:1409.1556* (2014).

[41] Tindall, Lucas, Cuong Luong, and Andrew Saad. "Plankton classification using vgg16 network." (2015).

[42] Liu, Bin, Xiaoyun Zhang, Zhiyong Gao, and Li Chen. "Weld Defect Images Classification with VGG16-Based Neural Network." In *International Forum on Digital TV and Wireless Multimedia Communications*, pp. 215-223. Springer, Singapore, 2017.

[43] Rezende, Edmar, Guilherme Ruppert, Tiago Carvalho, Antonio Theophilo, Fabio Ramos, and Paulo de Geus. "Malicious software classification using VGG16 deep neural network's bottleneck features." In *Information Technology-New Generations*, pp. 51-59. Springer, Cham, 2018.

[44] 7 Series FPGAs Memory Resources User Guide, Xilinx "https://www.xilinx.com/support/documentation/user\_guides/ug473\_7Series\_Memory\_Resources.pd"

[45] UltraScale Architecture DSP Slice User Guide, Xilinx "https://www.xilinx.com/support/documentation/user\_guides/ug579-ultrascale-dsp.pdf"

[46] SIMD Instructions Considered Harmful "https://www.sigarch.org/simd-instructions-considered-harmful"

[47] Vector vs SIMD: Dynamic Power Efficiency "https://massivebottleneck.com/2019/02/17/vector-vs-simd-dynamic-power-efficiency/"

[48] Vivado Design Suite User Guide: Using Constraints "https://www.xilinx.com/support/documentation/sw\_manuals/xilinx2018\_1/ug903-vivado-using-constraints.pdf"

[49] R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal of Research and Development.

[50] J. A. Farrell; T. C. Fischer, "Issue logic for a 600-MHz out-of-order execution microprocessor", IEEE Journal of Solid-State Circuits (Volume: 33, Issue: 5, May 1998).

[51] B. A. Gieseke; R. L. Allmon; D. W. Bailey; B. J. Benschneider; S. M. Britton; J. D. Clouser; H. R. Fair, "A 600 MHz superscalar RISC microprocessor with out-of-order execution", 1997 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[52] Gautschi, Michael, Pasquale Davide Schiavone, Andreas Traber, Igor Loi, Antonio Pullini, Davide Rossi, Eric Flamand, Frank K. Gürkaynak, and Luca Benini. "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, no. 10 (2017): 2700-2713.

[53] Garofalo, Angelo, Manuele Rusci, Francesco Conti, Davide Rossi, and Luca Benini. "PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors." *Philosophical Transactions of the Royal Society A* 378, no. 2164 (2020): 20190155.