A guide to accelerating applications with just-right RISC-V custom instructions - Embedded.com

The open instruction set architecture (ISA) of RISC-V permits broad flexibility in implementation and offers optional features that can enable fresh approaches to resolving hardware-software design tradeoffs. Based on a modular structure, a number of standard extensions and options can be used to configure the base processor as a starting point. Yet the true value actually lies in the opportunities that RISC-V offers developers to create new extensions, instructions and configurations that uniquely meet the needs of their innovative application ideas.

The software challenge for fixed ISAs

Traditionally, ISAs have been the intellectual property (IP) of commercial organizations that either want to sell microprocessors or microcontrollers, or want to license their designs for others to use. Embedded developers are left to run benchmarking software to determine which solution best suits their application needs. Due to the cost of developing an independent ISA with all the necessary ecosystem, semiconductor vendors have increasingly adopted the standard fixed ISAs offered by the mainstream IP providers, relying on Moore’s Law and integrated peripherals to deliver differentiation, such as ultra-low power, to their customers.

The challenge here is that the instructions used to execute code cannot be changed. Therefore, efficiencies that could potentially be gained by, for example, an optimized instruction for an encryption algorithm, cannot be realized. This may mean that the developer’s application is too slow, potentially uses too much power, or regularly misses a hard-real-time deadline in a control loop. Even with the best will in the world, these are factors that are challenging to resolve purely with semiconductor fabrication improvements or process shrinks.

The RISC-V ISA started as a project at the University of California, Berkeley and is now maintained by the RISC-V International Association, a non-profit group with over 300 members. These members contribute to the ISA specifications, software tools such as simulators and compilers, and the rest of the ecosystem needed to support such an undertaking. Whether it makes sense to use RISC-V depends on whether one of two factors can be leveraged: that it is free in terms of licensing, or the freedom it affords.

Being open and freely available, it provides a basic processing platform that can easily be used both by academia for teaching and research and in commercial applications. An open ISA also supports a number of business models for developers looking to source semiconductor IP, from the commercial IP provider through to open source projects and clean sheet, self-built designs. Commercial organizations also find this attractive, using it in FPGAs, SoCs, or even as the core of a microcontroller or standard product offering.

Thanks to the freedom it affords, academia can investigate new approaches to address compute challenges, implementing anything from new instructions and other accelerators to multi-core and many-core heterogeneous designs and different microarchitecture options. Many of these options are also attractive to startups and businesses looking to tackle complex challenges, such as low-power artificial intelligence (AI) chipsets that operate at the edge, by adding custom instructions tuned to end-application requirements.

As the ecosystem has been established with RISC-V flexibility built in, any standard configuration or custom extension should be able to leverage the tools and software within the ISA-compliant framework.

Understanding the flexibility of the open RISC-V ISA

Thanks to the accessibility of the RISC-V ISA and associated tools, it is simple to kick off an investigative project to assess its suitability for a specific application. Simulation tools allow a standard base ISA to be trialled to determine the out-of-the-box performance. For example, a good starting point would be a 32-bit RISC-V configuration combining the base integer instruction set “I” with the multiply “M” extension (referred to as RV32IM); more options are available but this is sufficient for this example. This is then instantiated together with a simulated memory including access delays and wait states.

An application written in C/C++ can then be cross-compiled using standard tool chains. It could run on bare metal or as part of a (real-time) operating system (RTOS/OS). The resulting binary code is then executed using tools such as an instruction set simulator (ISS) that allows the chosen base processor model to be integrated and simulated (figure 1). This environment also provides standard input/output functionality and access to the host file system. Standard integrated development environment (IDE) tools such as Eclipse can then be used to control code execution, interfacing via the GNU debugger GDB.

From here, through a process of profiling and analysis, instruction candidates are identified, designed, and modelled. By using the original application code as the basic functional model, the resulting improvement can be quickly tested and verified, and the performance compared. This rapid iteration of profiling and analysis allows fast selection and optimization of instructions that are worth implementing. The documentation can be generated from the model and forms the basis of a functional specification for the register-transfer level (RTL) design and an optimized model.

Figure 1: New instructions can be developed and evaluated in a simulation based upon the needs of existing application code.

As an example, an encryption algorithm such as ChaCha20 may be critical to a particular application. Available source code can be compiled for a “vanilla” RV32IM base, executed, and then analyzed with estimated instruction cycle timing using basic block profiling to determine how much time was spent in which sections of code. The core of the ChaCha20 algorithm consists of operations known as quarter-rounds, which make heavy use of XOR and rotation (figure 2). The results of block profiling immediately highlight that the majority of the execution time is spent in these functions.

Figure 2: The ChaCha20 algorithm makes extensive use of XOR and rotate instructions.
Image source: Wikimedia Commons
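
For reference, the quarter-round can be written in a few lines of portable C. The following is a minimal sketch based on the public ChaCha20 description in RFC 8439, not the source code profiled in this article; the names rotl32 and quarter_round are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; valid for 0 < n < 32. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha20 quarter-round on four words of the cipher state.
 * Each line is an add, followed by an XOR and a left rotation. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

Every line of the quarter-round ends in the same XOR-then-rotate pattern, which is why this pattern dominates a profile of the algorithm.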

Graphical visualisation of these hotspots can also be generated by using a verification, analysis and profiling (VAP) tool. Rather than a textual output, the tool presents execution time as a tree of collapsible statistics, making hotspots with a high proportion of executed instructions easier to see. This can be seen in figure 3 where the function processWord() implements the ChaCha20 algorithm, calling in turn the four qrx_c functions to implement the required quarter-round functions.

Figure 3: The Imperas VAP profiling tool highlights that the functions associated with the ChaCha20 algorithm consume around 90% of the processor’s time

By reviewing the assembler code generated by the compiler and/or by running basic block profiling, it is then possible to determine which instructions and instruction combinations have been used to implement the algorithm. From here the next step is to determine what custom instructions, within the confines of the specifications of the ISA, could potentially increase execution speed.

Determining the potential improvement RISC-V could deliver

The ChaCha20 algorithm makes heavy use of an XOR coupled with a left-rotate of 7, 8, 12 or 16 bits. The RV32IM base specification has no rotate instruction, so each of these steps requires an XOR instruction followed by a short sequence of shift and OR instructions. This means there is potential to fuse each XOR-and-rotate step into one of four dedicated instructions that implement an XOR together with 7, 8, 12 or 16 bits of left rotation.
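
To make the cost of this step concrete, the sketch below spells out one XOR-and-rotate as the individual operations a base RV32IM core has to perform, since the base ISA has no rotate instruction. The function name and the register names in the comment are illustrative assumptions, and the exact sequence a given compiler emits may differ.

```c
#include <stdint.h>

/*
 * Computes rotl(d ^ a, n) for 0 < n < 32 using only operations that map
 * one-to-one onto base integer instructions, for example:
 *
 *   xor  t0, a0, a1      # t0 = d ^ a
 *   slli t1, t0, n       # t1 = t0 << n
 *   srli t0, t0, 32-n    # t0 = t0 >> (32 - n)
 *   or   a0, t1, t0      # result
 */
static inline uint32_t xor_rotl_rv32im(uint32_t d, uint32_t a, unsigned n)
{
    uint32_t x  = d ^ a;          /* xor  */
    uint32_t hi = x << n;         /* slli */
    uint32_t lo = x >> (32 - n);  /* srli */
    return hi | lo;               /* or   */
}
```

A dedicated instruction with a fixed rotate distance would replace this whole sequence with a single register-to-register operation.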

Because RISC-V is a load-store architecture, any custom instruction must assume that the data to be manipulated already resides in the processor’s 32‑bit registers. This means an R-type (register) instruction is needed, which can be located in the custom-1 decode space (figure 4).

Figure 4: Instructions working on data in registers need to utilize the R-type format to encode them
Image source: The RISC-V Instruction Set Manual Vol I

The ISA provides a clear structure for such instructions. By following these rules, we can quickly determine how to encode our new instructions. The lower seven bits are defined as the opcode, which is assigned a value that marks it as a custom instruction in the custom-1 decode space. The existing XOR and shift-left instructions, by contrast, use the OP and OP-IMM opcodes respectively.

Three predefined blocks of bits are reserved in the ISA definition for specifying the two source registers and the destination register for the result. Among the remaining bits is a three-bit field known as funct3. These three bits provide room to encode eight possible instructions, four of which will be used in this example.
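
The R-type layout can be sketched as a small encoder. The field positions and the custom-1 major opcode value (0x2B) follow the RISC-V Instruction Set Manual; the function name, the choice of funct7 = 0 and the funct3 values used for the new instructions are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_CUSTOM_1 0x2Bu  /* binary 0101011: custom-1 major opcode */

/* Pack the six R-type fields into a 32-bit instruction word:
 * funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static uint32_t encode_rtype(uint32_t opcode, uint32_t rd, uint32_t funct3,
                             uint32_t rs1, uint32_t rs2, uint32_t funct7)
{
    assert(opcode < 128 && rd < 32 && funct3 < 8);
    assert(rs1 < 32 && rs2 < 32 && funct7 < 128);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}
```

With these assumptions, encode_rtype(OPCODE_CUSTOM_1, 10, 0, 10, 11, 0) produces the word for a custom instruction with funct3 = 0 that reads registers x10 and x11 and writes x10.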

Without requiring the detailed hardware implementation of the RTL for these instructions, it is possible to simulate them in the ISS environment to see whether they would be of any benefit to the challenge being faced. The four new instructions are modelled using the open virtual platforms (OVP) VMI application programming interface (API). This enables the developer to iterate rapidly on the design of the new instructions until they deliver the desired outcome for the target application. Only once this has been achieved is it necessary to commit resources to an RTL implementation.

For the purpose of initial functional evaluation of the instructions, there are two possible approaches. The first is to call the original C/C++ version of the algorithm, binding the new instructions to this function (figure 5a). The second is to implement them as VMI morph instructions that create the same behaviour (figure 5b). The second approach is more efficient and is the recommended one.

Figure 5a: The new quarter-round instruction implemented in C
Figure 5b: The new quarter-round instruction implemented in VMI morph code
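
Independently of any particular simulator API, the behaviour being modelled can be sketched in plain C as a decode-and-execute step over a register file. This is not VMI code and does not reproduce figures 5a or 5b; the function name and the mapping of funct3 values 0 to 3 onto rotate distances of 16, 12, 8 and 7 bits are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define OPCODE_CUSTOM_1 0x2Bu

/* Hypothetical mapping from funct3 to left-rotate distance. */
static const unsigned qr_rotate[4] = { 16, 12, 8, 7 };

/* Execute one instruction word if it is one of the four custom
 * XOR-and-rotate instructions; return false if it is not.
 * funct7 is ignored in this sketch. */
static bool exec_custom_qr(uint32_t insn, uint32_t regs[32])
{
    uint32_t opcode = insn & 0x7Fu;
    uint32_t rd     = (insn >> 7)  & 0x1Fu;
    uint32_t funct3 = (insn >> 12) & 0x07u;
    uint32_t rs1    = (insn >> 15) & 0x1Fu;
    uint32_t rs2    = (insn >> 20) & 0x1Fu;

    if (opcode != OPCODE_CUSTOM_1 || funct3 > 3)
        return false;

    unsigned n = qr_rotate[funct3];
    uint32_t x = regs[rs1] ^ regs[rs2];
    if (rd != 0)  /* x0 is hard-wired to zero and must not be written */
        regs[rd] = (x << n) | (x >> (32 - n));
    return true;
}
```

A model of this kind is enough to check functional correctness against the original C code before any timing information is added.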

Of course, the mere existence of new instructions does not mean that a compiler can immediately make use of them. Therefore, the original C/C++ application needs to be rewritten to invoke the new instructions, using inline assembler wrapped in intrinsic functions, and then cross-compiled. As the profiling and analysis of candidate instructions can be an iterative task, this intrinsic-based approach provides the most efficient way to adapt the original C application to use the new custom instructions.
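
One way to wrap such instructions as intrinsics is the GNU assembler's `.insn r` directive, which emits an R-type instruction from explicit field values. The sketch below is an assumption-laden illustration rather than the code shown in the figures: the names qr1 to qr4, the funct3 assignments, and the HAVE_QR_CUSTOM macro (a hypothetical build flag indicating that the target implements the instructions) are all invented here. On any other target it falls back to portable C so the same source still builds and runs.

```c
#include <stdint.h>

#if defined(__riscv) && defined(HAVE_QR_CUSTOM)
/* GNU as syntax: .insn r opcode, funct3, funct7, rd, rs1, rs2
 * 0x2b is the custom-1 major opcode. */
#define DEFINE_QR(name, f3, n)                                      \
    static inline uint32_t name(uint32_t rs1, uint32_t rs2)         \
    {                                                               \
        uint32_t rd;                                                \
        __asm__(".insn r 0x2b, " #f3 ", 0, %0, %1, %2"              \
                : "=r"(rd) : "r"(rs1), "r"(rs2));                   \
        return rd;                                                  \
    }
#else
/* Portable fallback with the same semantics: rotl(rs1 ^ rs2, n). */
#define DEFINE_QR(name, f3, n)                                      \
    static inline uint32_t name(uint32_t rs1, uint32_t rs2)         \
    {                                                               \
        uint32_t x = rs1 ^ rs2;                                     \
        return (x << (n)) | (x >> (32 - (n)));                      \
    }
#endif

DEFINE_QR(qr1, 0, 16)
DEFINE_QR(qr2, 1, 12)
DEFINE_QR(qr3, 2, 8)
DEFINE_QR(qr4, 3, 7)

/* Quarter-round rewritten so that each XOR-and-rotate is one intrinsic. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d = qr1(*d, *a);
    *c += *d; *b = qr2(*b, *c);
    *a += *b; *d = qr3(*d, *a);
    *c += *d; *b = qr4(*b, *c);
}
```

Keeping the fallback alongside the inline assembler means the rewritten application can still be checked against the original on a host machine or an unmodified RV32IM model.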

The RISC-V base implementation loaded into the simulator also needs to be made aware of the new instructions in order to benefit from them. This is achieved by including them in the model prior to re-running the simulation. In this particular example, repeating the profiling indicates less overall time spent performing the algorithm (figure 6a). The Imperas VAP profiling tool shows that the processWord() function, using the dedicated in-lined instructions, now accounts for 66% of the overall algorithm execution, while the simulation statistics show that the overall execution time for the algorithm is greatly reduced (figure 6b).

Figure 6a: Relative time in processing the ChaCha20 algorithm is greatly reduced when using the new instructions

Results with original C implementation
Info   Simulated instructions: 316,709,013
Info   Simulated time          : 5.15 seconds

Results with custom instructions
Info   Simulated instructions: 60,474,426
Info   Simulated time          : 1.38 seconds

Figure 6b: Simulation statistics for the algorithm implementation, showing roughly 5.2x fewer simulated instructions and about 3.7x less simulated time

Once correct functionality has been determined, the model is further refined by declaring the execution time for each instruction in processor cycles. Further rounds of simulation can then be used to determine any performance improvement, even taking into account wait states associated with memory accesses that may occur in an eventual hardware implementation.
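
The kind of estimate this refinement makes possible can be sketched as a simple sum over instruction classes plus a memory penalty. This is a simplified illustration, not how the simulator computes timing; the structure, the function name and the flat wait-state model are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* One class of instructions and its declared cost. */
struct insn_class {
    uint64_t count;   /* number of instructions executed in this class */
    uint32_t cycles;  /* declared execution time per instruction */
};

/* Total cycles = sum(count * cycles) + memory accesses * wait states.
 * Assumes every memory access pays the same number of wait states. */
static uint64_t estimate_cycles(const struct insn_class *classes, size_t n,
                                uint64_t mem_accesses, uint32_t wait_states)
{
    uint64_t total = mem_accesses * wait_states;
    for (size_t i = 0; i < n; i++)
        total += classes[i].count * classes[i].cycles;
    return total;
}
```

Comparing such an estimate for the original and the modified instruction mix shows whether a gain in instruction count survives once memory latency is included.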

Thanks to the tight integration with common open source IDEs and GDB tools, full debugging of the solution can be undertaken in conjunction with the optimized RISC-V design before committing the design to a hardware implementation (figure 7).

Figure 7: Debugging in Eclipse with GDB shows the inline assembler (left) and the decoded instructions in the disassembly (right)

Moving from simulation to implementation

With the potential performance improvement determined, the next step requires the implementation of the four new instructions in RTL. Thanks to the preliminary work undertaken, the simulation model becomes the functional specification that defines the requirements, and it can also be used as a golden reference model in the RTL verification test plan. While the use of intrinsic functions in the C application helped the profiling and analysis of custom instructions, this approach can also be utilized for future production code development or can be considered for potential compiler tool chain enhancements.

The other remaining essential task, documentation, is also a simple process. All open virtual platforms (OVP) fast processor models include documentation that can be extended to cover the functionality of changes and modifications. Following the template given, the new instructions can be declared and described, allowing the developer community to discover their capability and make use of them. The documentation is then converted into a TeX file from which a PDF can be generated (figure 8).

Figure 8: An explanation of two of the four custom quarter-round instructions in the new, extended documentation for the chosen RISC-V base.

Summary

With the freedoms of the open ISA of RISC-V, in addition to the standard options and features defined in the specification, users can develop further custom extensions and instructions. At its simplest, this openness enables new and creative business models, including commercial and open source implementations, and gives wider freedom to explore value-added features beyond the mainstream traditional approaches.

However, the true value comes from taking a fully-fledged, documented and supported base core and modifying it to meet specific application needs. Through careful application analysis, profiling of code and simulation, significant performance improvements can be attained that could not be realized through fixed ISAs. All this can be developed and profiled with real application workloads before commencing the detailed hardware implementation.


Lee Moore and Duncan Graham are senior applications engineers at Imperas Software
