- Optional interrupts and exception handling with Machine, [Supervisor] and [User] modes as defined in the [RISC-V Privileged ISA Specification v1.10](https://riscv.org/specifications/privileged-isa/).
- There are very few fixed things. Nearly everything is plugin based. The PC manager is a plugin, the register file is a plugin, the hazard controller is a plugin, ...
- There is an automatic a tool which allows plugins to insert data in the pipeline at a given stage, and allows other plugins to read it in another stage through automatic pipelining.
- There is a service system which provides a very dynamic framework. For instance, a plugin could provide an exception service which can then be used by other plugins to emit exceptions from the pipeline.
The following numbers were obtained by synthesizing the CPU as toplevel on the fastest speed grade without any specific synthesis options to save area or to get better maximal frequency (neutral).<br>
All the cached configurations have some cache trashing during the dhrystone benchmark except the `VexRiscv full max perf` one. This, of course, reduces the performance. It is possible to produce
VexRiscv small (RV32I, 0.52 DMIPS/MHz, no datapath bypass, no interrupt) ->
Artix 7 -> 243 MHz 504 LUT 505 FF
Cyclone V -> 174 MHz 352 ALMs
Cyclone IV -> 179 MHz 731 LUT 494 FF
iCE40 -> 92 MHz 1130 LC
VexRiscv small (RV32I, 0.52 DMIPS/MHz, no datapath bypass) ->
Artix 7 -> 240 MHz 556 LUT 566 FF
Cyclone V -> 194 MHz 394 ALMs
Cyclone IV -> 174 MHz 831 LUT 555 FF
iCE40 -> 85 MHz 1292 LC
VexRiscv small and productive (RV32I, 0.82 DMIPS/MHz) ->
Artix 7 -> 232 MHz 816 LUT 534 FF
Cyclone V -> 155 MHz 492 ALMs
Cyclone IV -> 155 MHz 1,111 LUT 530 FF
iCE40 -> 63 MHz 1596 LC
VexRiscv small and productive with I$ (RV32I, 0.70 DMIPS/MHz, 4KB-I$) ->
Artix 7 -> 220 MHz 730 LUT 570 FF
Cyclone V -> 142 MHz 501 ALMs
Cyclone IV -> 150 MHz 1,139 LUT 536 FF
iCE40 -> 66 MHz 1680 LC
VexRiscv full no cache (RV32IM, 1.21 DMIPS/MHz 2.30 Coremark/MHz, single cycle barrel shifter, debug module, catch exceptions, static branch) ->
Artix 7 -> 216 MHz 1418 LUT 949 FF
Cyclone V -> 133 MHz 933 ALMs
Cyclone IV -> 143 MHz 2,076 LUT 972 FF
VexRiscv full (RV32IM, 1.21 DMIPS/MHz 2.30 Coremark/MHz with cache trashing, 4KB-I$,4KB-D$, single cycle barrel shifter, debug module, catch exceptions, static branch) ->
Artix 7 -> 199 MHz 1840 LUT 1158 FF
Cyclone V -> 141 MHz 1,166 ALMs
Cyclone IV -> 131 MHz 2,407 LUT 1,067 FF
VexRiscv full max perf (HZ*IPC) -> (RV32IM, 1.38 DMIPS/MHz 2.57 Coremark/MHz, 8KB-I$,8KB-D$, single cycle barrel shifter, debug module, catch exceptions, dynamic branch prediction in the fetch stage, branch and shift operations done in the Execute stage) ->
Artix 7 -> 200 MHz 1935 LUT 1216 FF
Cyclone V -> 130 MHz 1,166 ALMs
Cyclone IV -> 126 MHz 2,484 LUT 1,120 FF
VexRiscv full with MMU (RV32IM, 1.24 DMIPS/MHz 2.35 Coremark/MHz, with cache trashing, 4KB-I$, 4KB-D$, single cycle barrel shifter, debug module, catch exceptions, dynamic branch, MMU) ->
Artix 7 -> 151 MHz 2021 LUT 1541 FF
Cyclone V -> 124 MHz 1,368 ALMs
Cyclone IV -> 128 MHz 2,826 LUT 1,474 FF
VexRiscv linux balanced (RV32IMA, 1.21 DMIPS/MHz 2.27 Coremark/MHz, with cache trashing, 4KB-I$, 4KB-D$, single cycle barrel shifter, catch exceptions, static branch, MMU, Supervisor, Compatible with mainstream linux) ->
Note that, recently, the capability to remove the Fetch/Memory/WriteBack stage was added to reduce the area of the CPU, which ends up with a smaller CPU and a better DMIPS/MHz for the small configurations.
We now have twenty-two CPU configurations [in this directory](./src/main/scala/vexriscv/demo). Look at the files called Gen*.scala. Here is the [full configuration](./src/main/scala/vexriscv/demo/GenFull.scala), and the [smallest configuration](./src/main/scala/vexriscv/demo/GenSmallest.scala).
To run basic simulation with stdout and no tracing, loading a binary directly is supported with the `RUN_HEX` variable of `src/test/cpp/regression/makefile`. This has a significant performance advantage over using GDB over OpenOCD with JTAG over TCP. VCD tracing is supported with the makefile variable `TRACE`.
Then, you can use the [OpenOCD RISC-V](https://github.com/SpinalHDL/openocd_riscv) tool to create a GDB server connected to the target (the simulated CPU), as follows:
You can use the Eclipse + Zylin embedded CDT plugin to do it (http://opensource.zylin.com/embeddedcdt.html). Tested with Helios Service Release 2 (http://www.Eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/helios/SR2/Eclipse-cpp-helios-SR2-linux-gtk-x86_64.tar.gz) and the corresponding zylin plugin.
See https://drive.google.com/drive/folders/1NseNHH05B6lmIXqQFVwK8xRjWE4ydeG-?usp=sharing to import a makefile project and create a debug configuration.
You can find some FPGA projects which instantiate the Briey SoC here (DE1-SoC, DE0-Nano): https://drive.google.com/drive/folders/0B-CqLXDTaMbKZGdJZlZ5THAxRTQ?usp=sharing
A top level simulation testbench with the same features + a GUI are implemented with SpinalSim. You can find it in `src/test/scala/vexriscv/MuraxSim.scala`.
There is currently no SoC to run it on hardware, it is WIP. But the CPU simulation can already boot linux and run user space applications (even python).
//Decoding specification when the 'key' pattern is recognized in the instruction
List(
IS_SIMD_ADD -> True,
REGFILE_WRITE_VALID -> True, //Enable the register file write
BYPASSABLE_EXECUTE_STAGE -> True, //Notify the hazard management unit that the instruction result is already accessible in the EXECUTE stage (Bypass ready)
BYPASSABLE_MEMORY_STAGE -> True, //Same as above but for the memory stage
The first one (`CustomCsrDemoPlugin`) adds an instruction counter and a clock cycle counter into the CSR mapping (and also do tricky stuff as a demonstration).
VexRiscv is implemented via a 5 stage in-order pipeline on which many optional and complementary plugins add functionalities to provide a functional RISC-V CPU.
This approach is completely unconventional and only possible through meta hardware description languages (SpinalHDL, in the current case) but has proven its advantages
If you generate the CPU without any plugin, it will only contain the definition of the 5 pipeline stages and their basic arbitration, but nothing else,
More recent versions of gdb will not detect the FPU. Also, the openocd_riscv.vexriscv can't read CSR/FPU registers, so to have visibility on the floating points values, you need to compile your code in -O0, which will force values to be stored in memory (and so, be visible)
| catchAccessFault | Boolean | When true, an instruction read response with read error asserted results in a CPU exception trap. |
| resetVector | BigInt | Address of the program counter after the reset. |
| cmdForkOnSecondStage | Boolean | When false, branches immediately update the program counter. This minimizes branch penalties but might reduce FMax because the instruction bus address signal is a combinatorial path. When true, this combinatorial path is removed and the program counter is updated one cycle after a branch is detected. While FMax may improve, an additional branch penalty will be incurred as well. |
| cmdForkPersistence | Boolean | When false, requests on the iBus can disappear/change before they are acknowledged. This reduces area but isn't safe/supported by many arbitration/slaves. When true, once initiated, iBus requests will stay until they are acknowledged. |
| busLatencyMin | Int | Specifies the minimal latency between the iBus.cmd and iBus.rsp. A corresponding number of stages are added to the frontend to keep the IPC to 1.|
| injectorStage | Boolean | When true, a stage between the frontend and the decode stage of the CPU is added to improve FMax. (busLatencyMin + injectorStage) should be at least two. |
| prediction | BranchPrediction | Can be set to NONE/STATIC/DYNAMIC/DYNAMIC_TARGET to specify the branch predictor implementation. See below for more details. |
| historyRamSizeLog2 | Int | Specify the number of entries in the direct mapped prediction cache of DYNAMIC/DYNAMIC_TARGET implementation. 2 pow historyRamSizeLog2 entries. |
**Important** : check out the cmdForkPersistence parameter, because if it is not set, it can break the iBus compatibility with your memory system (unless you externaly add some buffers).
The stage argument specifies the stage from which the jump is asked. This allows the PcManagerSimplePlugin plugin to manage priorities between jump requests from
| resetVector | BigInt | Address of the program counter after the reset. |
| relaxedPcCalculation | Boolean | When false, branches immediately update the program counter. This minimizes branch penalties but might reduce FMax because the instruction bus address signal is a combinatorial path. When true, this combinatorial path is removed and the program counter is updated one cycle after a branch is detected. While FMax may improve, an additional branch penalty will be incurred as well. |
| prediction | BranchPrediction | Can be set to NONE/STATIC/DYNAMIC/DYNAMIC_TARGET to specify the branch predictor implementation. See below for more details. |
| historyRamSizeLog2 | Int | Specify the number of entries in the direct mapped prediction cache of DYNAMIC/DYNAMIC_TARGET implementation. 2 pow historyRamSizeLog2 entries |
| config.cacheSize | Int | Total storage capacity of the cache in bytes. |
| config.bytePerLine | Int | Number of bytes per cache line |
| config.wayCount | Int | Number of cache ways |
| config.twoCycleRam | Boolean | Check the tags values in the decode stage instead of the fetch stage to relax timings |
| config.asyncTagMemory | Boolean | Read the cache tags in an asynchronous manner instead of syncronous one |
| config.addressWidth | Int | CPU address width. Should be 32 |
| config.cpuDataWidth | Int | CPU data width. Should be 32 |
| config.memDataWidth | Int | Memory data width. Could potentialy be something else than 32, but only 32 is currently tested |
| config.catchIllegalAccess | Boolean | Catch when a memory access is done on non-valid memory address (MMU) |
| config.catchAccessFault | Boolean | Catch when the memeory bus is responding with an error |
| config.catchMemoryTranslationMiss | Boolean | Catch when the MMU miss a TLB |
Note: If you enable the twoCycleRam option and if wayCount is bigger than one, then the register file plugin should be configured to read the regFile in an asynchronous manner.
For instance, for a given instruction, the pipeline hazard plugin needs to know if it uses the register file source 1/2 in order to stall the pipeline until the hazard is gone.
BYPASSABLE_EXECUTE_STAGE -> True, //Notify the hazard management unit that the instruction result is already accessible in the EXECUTE stage (Bypass ready)
BYPASSABLE_MEMORY_STAGE -> True, //Same as above but for the memory stage
| regFileReadyKind | RegFileReadKind | Can be set to ASYNC or SYNC. Specifies the kind of memory read used to implement the register file. ASYNC means zero cycle latency memory read, while SYNC means one cycle latency memory read which can be mapped into standard FPGA memory blocks |
| zeroBoot | Boolean | Load all registers with zeroes at the beginning of the simulation to keep everything deterministic in logs/traces|
If you get a `Missing inserts : INSTRUCTION_ANTICIPATE` error, that's because the RegFilePlugin is configured to use SYNC memory read ports to access the register file, but the IBus plugin configuration can't provide the instruction's register file read address one cycle before the decode stage. To workaround that you can :
- Configure the RegFilePlugin to implement the register file read in a asyncronus manner (ASYNC), if your target device support such things
- If you use the IBusSimplePlugin, you need to enable the injectorStage configuration
- If you use the IBusCachedPlugin, you can either enable the injectorStage, or set twoCycleCache + twoCycleRam to false.
This plugin checks the pipeline instruction dependencies and, if necessary or possible, will stop the instruction in the decoding stage or bypass the instruction results
This plugin muxes different input values to produce SRC1/SRC2/SRC_ADD/SRC_SUB/SRC_LESS values which are common values used by many plugins in the execute stage (ALU/Branch/Load/Store).
| separatedAddSub | RegFileReadKind | By default SRC_ADD/SRC_SUB are generated from a single controllable adder/substractor, but if this is set to true, it use separate adder/substractors |
| executeInsertion | Boolean | By default SRC1/SRC2 are generated in the Decode stage, but if this parameter is true, it is done in the Execute stage (It will relax the bypassing network) |
This plugin implements all ADD/SUB/SLT/SLTU/XOR/OR/AND/LUI/AUIPC instructions in the execute stage by using the SrcPlugin outputs. It is a really simple plugin.
| earlyInjection | Boolean | By default the result of the shift is injected into the pipeline in the Memory stage to relax timings, but if this option is true it will be done in the Execute stage |
This plugin implements all branch/jump instructions (JAL/JALR/BEQ/BNE/BLT/BGE/BLTU/BGEU) with primitives used by the cpu frontend plugins to implement branch prediction. The prediction implementation is set in the frontend plugins (IBusX).
| earlyBranch | Boolean | By default the branch is done in the Memory stage to relax timings, but if this option is set it's done in the Execute stage|
| catchAddressMisaligned | Boolean | If a jump/branch is done in an unaligned PC address, it will fire an trap exception |
Each miss predicted jumps will produce between 2 and 4 cycles penalty depending the `earlyBranch` and the `PcManagerSimplePlugin.relaxedPcCalculation` configurations
In the decode stage, a conditional branch pointing backwards or a JAL is branched speculatively. If the speculation is right, the branch penalty is reduced to a single cycle,
Same as the STATIC prediction, except that to do the prediction, it uses a direct mapped 2 bit history cache (BHT) which remembers if the branch is more likely to be taken or not.
This predictor uses a direct mapped branch target buffer (BTB) in the Fetch stage which stores the PC of the instruction, the target PC of the instruction and a 2 bit history to remember
if the branch is more likely to be taken or not. This is actually the most efficient branch predictor implemented on VexRiscv, because when the branch prediction is right, it produces no branch penalty.
The downside is that this predictor has a long combinatorial path coming from the prediction cache read port to the programm counter, passing through the jump interface.
| catchAddressMisaligned | Boolean | If a memory access is done to an unaligned memory address, it will fire a trap exception |
| catchAccessFault | Boolean | If a memory read returns an error, it will fire a trap exception |
| earlyInjection | Boolean | By default, the memory read values are injected into the pipeline in the WriteBack stage to relax the timings. If this parameter is true, it's done in the Memory stage |
There is at least one cycle latency between a cmd and the corresponding rsp. The rsp.ready flag should be false after a read cmd until the rsp is present.
You can invalidate the whole cache via the 0x500F instruction, and you can invalidate a address range (single line size) via the instruction 0x500F | RS1 <<15whereRS1shouldnotbeX0andpointtoonebyteofthedesiredaddresstoinvalidate.
Implements the multiplication instruction from the RISC-V M extension. Its implementation was done in a FPGA friendly way by using 4 17*17 bit multiplications.
The processing is fully pipelined between the Execute/Memory/Writeback stage. The results of the instructions are always inserted in the WriteBack stage.
Implements the division/modulo instruction from the RISC-V M extension. It is done in a simple iterative way which always takes 34 cycles. The result is inserted into the
This plugin implements the multiplication, division and modulo of the RISC-V M extension in an iterative way, which is friendly for small FPGAs that don't have DSP blocks.
If an interrupt occurs, before jumping to mtvec, the plugin will stop the Prefetch stage and wait for all the instructions in the later pipeline stages to complete their execution.
If an exception occur, the plugin will kill the corresponding instruction, flush all previous instructions, and wait until the previously killed instructions reach the WriteBack
| ioRange | UInt => Bool | Function reference which eat an address and return true if the address should be uncached. ex : ioRange= _(31 downto 28) === 0xF => all 0xFXXXXXXX will be uncached|
Hardware refilled MMU implementation. Allows other plugins such as DBusCachedPlugin/IBusCachedPlugin to instanciate memory address translation ports. Each port has a small dedicated
This is a physical memory protection (PMP) plugin which conforms to the v1.12 RISC-V privilege specification, without ePMP (`Smepmp`) extension support. PMP is configured by writing two special CSRs: `pmpcfg#` and `pmpaddr#`. The former contains the permissions and addressing modes for four protection regions, and the latter contains the encoded start address for a single region. Since the actual region bounds must be computed from the values written to these registers, writing them takes a few CPU cylces. This delay is necessary in order to centralize all of the decoding logic into a single component. Otherwise, it would have to be duplicated for each region, even though the decoding operation happens only when PMP is reprogrammed (e.g., on some context switches).
##### PmpPluginNapot
The `PmpPluginNapot` is a specialized PMP implementation, providing only the `NAPOT` (naturally-aligned poser-of-2 regions) addressing mode. It requires fewer resources and has a less significant timing impact compared to the full `PmpPlugin`.
This plugin implements enough CPU debug features to allow comfortable GDB/Eclipse debugging. To access those debug features, it provides a simple memory bus interface.
| debugClockDomain | ClockDomain | As the debug unit is able to reset the CPU itself, it should use another clock domain to avoid killing itself (only the reset wire should differ) |
The EmbeddedRiscvJtag plugin can also be used with tunneled JTAG. This allows debugging with the same cable used to configure an FPGA.
This uses an FPGA-specific primitive for JTAG access (e.g. Xilinx BSCANE2):
```scala
val xilJtag = BSCANE2(userId = 4) // must be userId = 4
val jtagClockDomain = ClockDomain(
clock = xilJtag.TCK
)
```
Then, the EmbeddedRiscvJtag plugin must be configured for tunneling without a TAP. Note, the debug clock domain must have a separate reset from the CPU clock domain.
```scala
// in plugins
new EmbeddedRiscvJtag(
p = DebugTransportModuleParameter(
addressWidth = 7,
version = 1,
idle = 7
),
withTunneling = true,
withTap = false,
debugCd = debugClockDomain,
jtagCd = jtagClockDomain
)
```
Then connect the EmbeddedRiscvJtag to the FPGA-specific JTAG primitive:
This plugin offers a service to other plugins to generate a useful Yaml file describing the CPU configuration. It contains, for instance, the sequence of instructions required
This plugin allow to accelerate AES encryption/decryption by using an internal ROM to solve SBOX and permutations, allowing in practice to execute one AES round in about 21 cycles.
For more documentation, check src/main/scala/vexriscv/plugin/AesPlugin.scala, a software C driver can be found here : <https://github.com/SpinalHDL/SaxonSoc/blob/dev-0.3/software/standalone/driver/aes_custom.h>
It was also ported on libressl via the following patch :