ECE 5745 Tutorial 9: TinyRV2 Accelerators

Table of Contents

Introduction

The infrastructure for the ECE 5745 lab assignments and projects has support for implementing medium-grain accelerators. Fine-grain accelerators are tightly integrated within the processor pipeline (e.g., a specialized functional unit for bit-reversed addressing useful in implementing an FFT), while coarse-grain accelerators are loosely integrated with a processor through the memory hierarchy (e.g., a graphics rendering accelerator sharing the last-level cache with a general-purpose processor). Medium-grain accelerators are often integrated as co-processors: the processor can directly send/receive messages to/from the accelerator with special instructions, but the co-processor is relatively decoupled from the main processor pipeline and can also independently interact with memory.

This tutorial will use the vector-vector-add (vvadd) microbenchmark as an example. We will explore the area and timing of a baseline TinyRV2 pipelined processor and the energy and performance when this processor is used to execute a pure-software version of the vvadd microbenchmark. We will then implement a vvadd accelerator, integrate it with the TinyRV2 pipelined processor, and determine the potential benefit of hardware acceleration for this simple microbenchmark. This tutorial assumes you have already completed the tutorials on Linux, Git, PyMTL, Verilog, the Synopsys ASIC tools, the PyHFlow automated ASIC flow, and SRAM generators.

The first step is to source the setup script, clone this repository from GitHub, and define an environment variable to keep track of the top directory for the project.

 % source setup-ece5745.sh
 % mkdir -p $HOME/ece5745
 % cd $HOME/ece5745
 % git clone git@github.com:cornell-ece5745/ece5745-tut9-xcel
 % cd ece5745-tut9-xcel
 % TOPDIR=$PWD

Baseline TinyRV2 Processor FL and RTL Models

The following figure illustrates the overall system we will be using with our TinyRV2 processors. The processor includes eight latency insensitive val/rdy interfaces. The mngr2proc/proc2mngr interfaces are used for the test harness to send data to the processor and for the processor to send data back to the test harness. The imem master/minion interface is used for instruction fetch, and the dmem master/minion interface is used for implementing load/store instructions. The system includes both instruction and data caches. The xcel master/minion interface is used for the processor to send messages to the accelerator. The mngr2proc/proc2mngr and memreq/memresp interfaces were all introduced in ECE 4750. For now we will largely ignore the accelerator, and we will defer discussion of the xcel master/minion interfaces to later in this tutorial.

We provide two implementations of the TinyRV2 processor. The FL model in sim/proc/ProcFL.py is essentially an instruction-set-architecture (ISA) simulator; it simulates only the instruction semantics and makes no attempt to model any timing behavior. As a reminder, the TinyRV2 instruction set is defined here:

The RTL model in sim/proc/ProcPRTL.py is similar to the alternative design for lab 2 in ECE 4750. It is a five-stage pipelined processor that implements the TinyRV2 instruction set and includes full bypassing/forwarding to resolve data hazards. There are two important differences from the alternative design for lab 2 of ECE 4750. First, the new processor design uses a single-cycle integer multiplier. We can push the design through the flow and verify that the single-cycle integer multiplier does not adversely impact the overall processor cycle time. Second, the new processor design includes the ability to handle new CSRs for interacting with medium-grain accelerators. The datapath diagram for the processor is shown below.

We should run all of the unit tests on both the FL and RTL processor models to verify that we are starting with a working processor.

 % mkdir -p $TOPDIR/sim/build
 % cd $TOPDIR/sim/build
 % pytest ../proc
 % pytest ../proc --test-verilog

See the handout for lab 2 from ECE 4750 for more information about how we use pytest and the mngr2proc/proc2mngr interfaces to test the TinyRV2 processor.

We also provide an RTL model in sim/cache/BlockingCachePRTL.py which is very similar to the alternative design for lab 3 of ECE 4750. It is a two-way set-associative cache with 16B cache lines, a write-back/write-allocate write policy, and an LRU replacement policy. There are three important differences from the alternative design for lab 3 of ECE 4750. First, the new cache design is larger with a total capacity of 8KB. Second, the new cache design carefully merges states to enable a single-cycle hit latency for both reads and writes. Note that writes have a two-cycle occupancy (i.e., back-to-back writes will only be able to be serviced at half throughput). Third, the previous cache design used combinational-read SRAMs, while the new cache design uses synchronous-read SRAMs. Combinational-read SRAMs mean the read data is valid on the same cycle we set the read address. Synchronous-read SRAMs mean the read data is valid on the cycle after we set the read address. Combinational-read SRAMs simplify the design, but they are not realistic. Almost all real SRAM memory generators used in ASIC toolflows produce synchronous-read SRAMs, and indeed the CACTI memory compiler discussed in the previous tutorial also produces synchronous-read SRAMs. Using synchronous-read SRAMs requires non-trivial changes to both the datapath and the control logic. The cache FSM must make sure all control signals for the SRAM are ready the cycle before we need the read data. The datapath and FSM diagrams for the new cache are shown below. Notice in the FSM how we are able to stay in the TAG_CHECK_READ_DATA state if another request is ready.

We should run all of the unit tests on the cache RTL model to verify that we are starting with a working cache, and we should also use the --test-verilog option to ensure the translated Verilog is correct.

 % mkdir -p $TOPDIR/sim/build
 % cd $TOPDIR/sim/build
 % pytest ../cache
 % pytest ../cache --test-verilog

Cross-Compiling and Executing TinyRV2 Microbenchmarks

We will write our microbenchmarks in C. Take a closer look at the vvadd function which is located in app/ubmark/ubmark-vvadd.c:

 __attribute__ ((noinline))
 void vvadd_scalar( int *dest, int *src0, int *src1, int size )
 {
   for ( int i = 0; i < size; i++ )
     dest[i] = src0[i] + src1[i];
 }

We then have a set of tests that call this function and check to make sure it produces the right results. You can see an example test in app/ubmark/ubmark-vvadd-test1.c:

 void test_size3()
 {
   wprintf(L"test_size3\n");

   int src0[3] = { 1, 2, 3 };
   int src1[3] = { 4, 5, 6 };
   int dest[3] = { 0, 0, 0 };
   int ref[3]  = { 5, 7, 9 };

   vvadd_scalar( dest, src0, src1, 3 );

   for ( int i = 0; i < 3; i++ ) {
     if ( !( dest[i] == ref[i] ) )
       test_fail( i, dest[i], ref[i] );
   }
 }

 ...
 int main( int argc, char* argv[] )
 {
   test_size3();
   test_size8();
   test_pass();
   return 0;
 }
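
The test_size8 function is elided above; it follows the same pattern as test_size3. Here is a sketch of what it might look like (the specific input and reference values below are made up for illustration and are not from the repository):

 void test_size8()
 {
   wprintf(L"test_size8\n");

   int src0[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
   int src1[8] = { 8, 7, 6, 5, 4, 3, 2, 1 };
   int dest[8] = { 0, 0, 0, 0, 0, 0, 0, 0 };
   int ref[8]  = { 9, 9, 9, 9, 9, 9, 9, 9 };

   vvadd_scalar( dest, src0, src1, 8 );

   for ( int i = 0; i < 8; i++ ) {
     if ( !( dest[i] == ref[i] ) )
       test_fail( i, dest[i], ref[i] );
   }
 }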

Some simple test functions such as test_fail and test_pass are provided in app/common/common.h. We have a build system that can compile these tests natively for x86 and can also cross-compile these tests for TinyRV2 so they can be executed on our simulators. When developing and testing microbenchmarks, we should always try to compile them natively to ensure the microbenchmark is functionally correct before we attempt to cross-compile the microbenchmark for TinyRV2. Debugging a microbenchmark natively is much easier compared to debugging a microbenchmark on our simulators. Here is how we compile and execute the pure-software vvadd test natively:

 % mkdir -p $TOPDIR/app/build-native
 % cd $TOPDIR/app/build-native
 % ../configure
 % make ubmark-vvadd-test1
 % ./ubmark-vvadd-test1

The test should display passed. Once you are sure your test is working correctly natively, you can cross-compile the test for TinyRV2.

 % mkdir -p $TOPDIR/app/build
 % cd $TOPDIR/app/build
 % ../configure --host=riscv32-unknown-elf
 % make ubmark-vvadd-test1

This will create a ubmark-vvadd-test1 binary which contains TinyRV2 instructions and data. You can disassemble a TinyRV2 binary (i.e., turn a compiled binary back into an assembly text representation) with the riscv32-objdump command like this:

 % cd $TOPDIR/app/build
 % riscv32-objdump ubmark-vvadd-test1 | less
 00000604 <vvadd_scalar(int*, int*, int*, int)>:
   604:  bge   x0,  x13, 630
   608:  slli  x13, x13, 0x2
   60c:  add   x13, x11, x13
   610:  lw    x15, 0(x11)   # <-.
   614:  lw    x14, 0(x12)   #   |
   618:  addi  x11, x11, 4   #   |
   61c:  addi  x12, x12, 4   #   |
   620:  add   x15, x15, x14 #   |
   624:  sw    x15, 0(x10)   #   |
   628:  addi  x10, x10, 4   #   |
   62c:  bne   x11, x13, 610 # --'
   630:  jalr  x0,  x1,  0

Recall that you can search with less by simply pressing the forward slash key and typing in your search term. So you can find the assembly code for the vvadd_scalar function by using less to search for vvadd_scalar. You can also redirect the output from riscv32-objdump to a text file for viewing with your favorite text editor. The disassembly shows the address, bytes, and assembly text for each instruction in the binary.

The assembly code for the vvadd_scalar function is similar to what we saw in ECE 4750 although with some additional optimizations. I have added some comments to show the backwards branch for the vvadd loop. The loop has eight instructions. Four instructions do useful work (i.e., two LW instructions, the actual ADD instruction, one SW instruction), three ADDI instructions generate the array addresses by bumping the array pointers, and the final BNE instruction implements the loop back-edge. Notice that there is no explicit loop counter. The compiler has instead calculated the address of one past the last element in the first source array and placed this value in x13. On each iteration, the BNE instruction compares the current src0 pointer (x11) against this end address to see if we have reached the end of the array.
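
In other words, the compiler has effectively transformed the loop into a pointer-bumping form. A rough C equivalent of the generated code is sketched below (for illustration only; this function is not in the repository):

 __attribute__ ((noinline))
 void vvadd_scalar_ptr( int *dest, int *src0, int *src1, int size )
 {
   int *end = src0 + size;  // one past the last element of src0 (x13)
   while ( src0 != end ) {
     int a = *src0++;       // lw  x15, 0(x11); addi x11, x11, 4
     int b = *src1++;       // lw  x14, 0(x12); addi x12, x12, 4
     *dest++ = a + b;       // add x15, x15, x14; sw x15, 0(x10); addi x10, x10, 4
   }                        // bne x11, x13, <loop>
 }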

We have provided you with a simulator that composes a processor, cache, memory, and accelerator and is capable of executing TinyRV2 binaries. The simulator enables flexibly choosing the processor implementation (FL vs. RTL), the cache implementation (no cache vs. RTL), and the type and implementation of the accelerator. By default, the simulator uses the processor FL model, no cache model, and a null accelerator which we will discuss later. So let’s execute the vvadd test on the instruction-set simulator:

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim ../../app/build/ubmark-vvadd-test1

After a few seconds the simulator should display passed which means the test successfully executed on the ISA simulator. Now that we know that our vvadd function is correct, we can start to evaluate its performance using an evaluation program which serves as the actual microbenchmark. Take a look at the microbenchmark in app/ubmark/ubmark-vvadd-eval.c:

 ...
 int main( int argc, char* argv[] )
 {
   int dest[size];

   for ( int i = 0; i < size; i++ )
     dest[i] = 0;

   test_stats_on();
   vvadd_scalar( dest, src0, src1, size );
   test_stats_off();

   for ( int i = 0; i < size; i++ ) {
     if ( !( dest[i] == ref[i] ) )
       test_fail( i, dest[i], ref[i] );
   }

   test_pass();
   return 0;
 }

The src0, src1, and ref arrays are all defined in the app/ubmark/ubmark-vvadd.dat file. The microbenchmark first initializes the destination array to be all zeros, turns stats on, does the actual vvadd computation, turns stats off, and finally verifies that the results are as expected. We need the test_stats_on() and test_stats_off() functions to make sure we can keep track of various statistics (e.g., the number of cycles) only during the important part of the microbenchmark. We do not want to count time spent in initialization or verification when comparing the performance of our various microbenchmarks. These two functions are defined in app/common/common-misc.h as follows:

 inline void test_stats_on()
 {
   int status = 1;
   asm( "csrw 0x7c1, %0" :: "r" (status) );
 }

 inline void test_stats_off()
 {
   int status = 0;
   asm( "csrw 0x7c1, %0" :: "r" (status) );
 }

We are using the GCC inline assembly extensions to enable us to directly insert a specific assembly instruction into our C code. You can find out more about inline assembly syntax here:

At a high level, %0 acts as a placeholder for whatever register specifier the compiler ends up allocating for the status variable. The TinyRV2 instruction set defines CSR number 0x7c1 as the stats_en control/status register, which is why we use 0x7c1 in the inline assembly. Refer to the TinyRV2 instruction set for a list of the CSRs.

The idea is that the microarchitecture and/or simulator can monitor for writes to the stats_en register to determine when to start and stop keeping statistics. For more on writing microbenchmarks, please review the handout for lab 5 from ECE 4750.

Before cross-compiling and analyzing this microbenchmark on our TinyRV2 processor, we always want to check that it works natively.

 % cd $TOPDIR/app/build-native
 % make ubmark-vvadd-eval
 % ./ubmark-vvadd-eval

The microbenchmark should display passed. Once you are sure your microbenchmark is working correctly natively, you can cross-compile the microbenchmark for TinyRV2.

 % mkdir -p $TOPDIR/app/build
 % cd $TOPDIR/app/build
 % ../configure --host=riscv32-unknown-elf
 % make ubmark-vvadd-eval

This will create a ubmark-vvadd-eval binary which contains TinyRV2 instructions and data. As with the test, you can disassemble a TinyRV2 binary with the riscv32-objdump command like this:

 % cd $TOPDIR/app/build
 % riscv32-objdump ubmark-vvadd-eval | less
 000002c0 <vvadd_scalar(int*, int*, int*, int)>:
    2c0:  bge   x0,  x13, 2ec
    2c4:  slli  x13, x13, 0x2
    2c8:  add   x13, x11, x13
    2cc:  lw    x15, 0(x11)   # <-.
    2d0:  lw    x14, 0(x12)   #   |
    2d4:  addi  x11, x11, 4   #   |
    2d8:  addi  x12, x12, 4   #   |
    2dc:  add   x15, x15, x14 #   |
    2e0:  sw    x15, 0(x10)   #   |
    2e4:  addi  x10, x10, 4   #   |
    2e8:  bne   x11, x13, 2cc # --'
    2ec:  jalr  x0,  x1, 0

 00000330 <main>:
    ...
    374:  sw    x0,  0(x15)               # <-. initialize
    378:  addi  x14, x14, 1               #   | dest
    37c:  addi  x15, x15, 4               #   | array
    380:  bne   x13, x14, 374             # --'
    384:  addi  x15, x0, 1
    388:  csrw  0x7c1, x15                # turn stats on
    38c:  addi  x18, x0, 1168             #
    390:  addi  x11, x18, 400             #
    394:  addi  x12, x0, 1168             #
    398:  addi  x10, x9, 0                #
    39c:  jal   x1,  2c0 <vvadd_scalar()> # call vvadd_scalar
    3a0:  addi  x15, x0, 0                #
    3a4:  csrw  0x7c1, x15                # turn stats off
    3a8:  lw    x11, -1728(x19)
    3ac:  bge   x0,  x11, 3f0 <main+0xc0>
    ...

You can see the CSRW instructions to set and clear the stats_en bit have been inserted in the main function around the call to vvadd_scalar. Now let’s verify that the microbenchmark produces the right answer on the instruction-set simulator:

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim ../../app/build/ubmark-vvadd-eval

After a few seconds the simulator should display passed which means the microbenchmark successfully executed on the ISA simulator. The --trace command line option will display each instruction as it is executed on the ISA simulator.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --trace ../../app/build/ubmark-vvadd-eval > ubmark-vvadd-fl.trace

When dumping out large line traces, it is usually much faster to save them to a file and then open the file in your favorite text editor. You can search in the line trace for the CSRW instruction to quickly jump to where the actual vvadd_scalar function starts executing. Here is what the line trace looks like for one iteration of the vvadd loop:

       PC       instruction           imemreq           imemresp          dmemreq                   dmemresp
----------------------------------------------------------------------------------------------------------------------
1671: -#                             |                 |                  rd:00:00000624:0:        |
1672: -#                             |                 |                                           |rd:00:0:0:00000000
1673: -000002d0 lw   x15, 0x000(x11) |                 |                                           |
1674: -#                             |rd:00:000002d0:0:|                                           |
1675: -#                             |                 |rd:00:0:0:00062703                         |
1676: -#                             |                 |                  rd:00:00000494:0:        |
1677: -#                             |                 |                                           |rd:00:0:0:00000047
1678: -000002d4 lw   x14, 0x000(x12) |                 |                                           |
1679: -#                             |rd:00:000002d4:0:|                                           |
1680: -#                             |                 |rd:00:0:0:00458593                         |
1681: -000002d8 addi x11, x11, 0x004 |                 |                                           |
1682: -#                             |rd:00:000002d8:0:|                                           |
1683: -#                             |                 |rd:00:0:0:00460613                         |
1684: -000002dc addi x12, x12, 0x004 |                 |                                           |
1685: -#                             |rd:00:000002dc:0:|                                           |
1686: -#                             |                 |rd:00:0:0:00e787b3                         |
1687: -000002e0 add  x15, x15, x14   |                 |                                           |
1688: -#                             |rd:00:000002e0:0:|                                           |
1689: -#                             |                 |rd:00:0:0:00f52023                         |
1690: -#                             |                 |                  wr:00:000ffe40:0:00000047|
1691: -#                             |                 |                                           |wr:00:0:0:
1692: -000002e4 sw   x15, 0x000(x10) |                 |                                           |
1693: -#                             |rd:00:000002e4:0:|                                           |
1694: -#                             |                 |rd:00:0:0:00450513                         |
1695: -000002e8 addi x10, x10, 0x004 |                 |                                           |
1696: -#                             |rd:00:000002e8:0:|                                           |
1697: -#                             |                 |rd:00:0:0:fed592e3                         |
1698: -000002cc bne  x11, x13, 0x1fe4|                 |                                           |
1699: -#                             |rd:00:000002cc:0:|                                           |
1700: -#                             |                 |rd:00:0:0:0005a783                         |

Since this is an ISA simulator, instructions can functionally execute in a single cycle, although technically they take multiple cycles to interact with the memory system. These cycles are not really modeling any kind of realistic timing, but can instead be thought of as the “steps” required for functional simulation.

Now that we have verified the microbenchmark works correctly on the ISA simulator, we can run the microbenchmark on the baseline TinyRV2 pipelined processor RTL model:

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl \
     --stats ../../app/build/ubmark-vvadd-eval
 num_cycles =  1318
 num_insts_on_processor =  811

We use the --proc-impl command line option to choose the processor RTL model and the --cache-impl command line option to choose the cache RTL model. The reported number of cycles and instructions only counts the region where stats are enabled. This works out to roughly 1.6 cycles per instruction. You can use the --trace command line option to understand how the processor is performing in more detail.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl \
     --trace ../../app/build/ubmark-vvadd-eval > ubmark-vvadd-rtl.trace

This is the line trace for two iterations of the vvadd loop in the steady state.

       F stage  D stage               X    M    W     I$   D$  D$->mem           mem->D$
---------------------------------------------------------------------------------------------------------------------------
971:  -000002cc|                     |    |bne |addi[(TC)|(I )]                 |.
972:  -000002d0|lw   x15, 0x000(x11) |    |    |bne [(TC)|(I )]                 |.
973:  -000002d4|lw   x14, 0x000(x12) |lw  |    |    [(TC)|(I )]                 |.
974:  -#       |#                    |#   |#   |    [(TC)|(TC)]                 |.
975:  -#       |#                    |#   |#   |    [(I )|(RR)]rd:00:00000630:0:|.
976:  -#       |#                    |#   |#   |    [(I )|(RW)]                 |rd:00:0:0:00000039000000550000004700000019
977:  -#       |#                    |#   |#   |    [(I )|(RU)]                 |.
978:  -#       |#                    |#   |#   |    [(I )|(RD)]                 |.
979:  -000002d8|addi x11, x11, 0x004 |lw  |lw  |    [(I )|(Wm)]                 |.
980:  -#       |#                    |#   |#   |lw  [(TC)|(I )]                 |.
981:  -#       |#                    |#   |#   |    [(I )|(TC)]                 |.
982:  -#       |#                    |#   |#   |    [(I )|(RR)]rd:00:000004a0:0:|.
983:  -#       |#                    |#   |#   |    [(I )|(RW)]                 |rd:00:0:0:000000200000005f000000040000002b
984:  -#       |#                    |#   |#   |    [(I )|(RU)]                 |.
985:  -#       |#                    |#   |#   |    [(I )|(RD)]                 |.
986:  -000002dc|addi x12, x12, 0x004 |addi|lw  |    [(I )|(Wm)]                 |.
987:  -000002e0|add  x15, x15, x14   |addi|addi|lw  [(TC)|(I )]                 |.
988:  -000002e4|sw   x15, 0x000(x10) |add |addi|addi[(TC)|(I )]                 |.
989:  -000002e8|addi x10, x10, 0x004 |sw  |add |addi[(TC)|(I )]                 |.
990:  -000002ec|bne  x11, x13, 0x1fe4|addi|sw  |add [(TC)|(TC)]                 |.
991:  -~       |~                    |bne |addi|sw  [(TC)|(WD)]                 |.
992:  -000002cc|                     |    |bne |addi[(TC)|(I )]                 |.
993:  -000002d0|lw   x15, 0x000(x11) |    |    |bne [(TC)|(I )]                 |.
994:  -000002d4|lw   x14, 0x000(x12) |lw  |    |    [(TC)|(I )]                 |.
995:  -000002d8|addi x11, x11, 0x004 |lw  |lw  |    [(TC)|(TC)]                 |.
996:  -000002dc|addi x12, x12, 0x004 |addi|lw  |lw  [(TC)|(TC)]                 |.
997:  -000002e0|add  x15, x15, x14   |addi|addi|lw  [(TC)|(I )]                 |.
998:  -000002e4|sw   x15, 0x000(x10) |add |addi|addi[(TC)|(I )]                 |.
999:  -000002e8|addi x10, x10, 0x004 |sw  |add |addi[(TC)|(I )]                 |.
1000: -000002ec|bne  x11, x13, 0x1fe4|addi|sw  |add [(TC)|(TC)]                 |.
1001: -~       |~                    |bne |addi|sw  [(TC)|(WD)]                 |.
1002: -000002cc|                     |    |bne |addi[(TC)|(I )]                 |.

We see the first and second LW instructions miss in the data cache, causing the data cache FSM to issue refill requests to main memory (notice the data cache FSM moving through the TC->RR->RW->RU->RD->W states). The BNE instruction is resolved in the X stage and causes a two-cycle bubble. Notice that the SW instruction hits in the data cache. If we look at the C code, we can see that the vvadd microbenchmark initializes the destination array to all zeros. This essentially prefetches the destination array into the cache so that all SW instructions in the vvadd loop will hit in the cache. The two LW instructions hit in the data cache during the second iteration. This is due to spatial locality: the LW instructions in the first iteration bring in the cache lines for the source arrays, and the LW instructions in the second iteration access these same cache lines.

We know that every fourth iteration should look like the first iteration (21 cycles), since each 16B cache line holds four 4B elements and so the LW instructions miss once every four iterations; the remaining iterations should look like the second iteration (10 cycles). Since there are 100 iterations, the total number of cycles should be about (25 x 21) + (75 x 10) = 1275 cycles, but our simulator reported 1318 cycles. The discrepancy is due to the extra cycles required to call and return from the vvadd_scalar function.

Sometimes it is useful to run experiments where we assume a perfect cache to help isolate the impact of the memory system. This is easy to do with our simulator. We simply avoid inserting any cache at all and instead have the processor directly communicate with the test memory. This means all memory requests take a single cycle.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl \
     --stats ../../app/build/ubmark-vvadd-eval
 num_cycles =  1012
 num_insts_on_processor =  811

We can look at the line trace to see how the processor is executing with a perfect cache:

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl \
     --trace ../../app/build/ubmark-vvadd-eval > ubmark-vvadd-rtl.trace

This is the line trace for two iterations of the loop in the steady state:

      F stage  D stage                 X    M    W    dmemreq                   dmemresp
--------------------------------------------------------------------------------------------------
688: -000002cc|                       |    |bne |addi|                         |.
689: -000002d0|lw     x15, 0x000(x11) |    |    |bne |                         |.
690: -000002d4|lw     x14, 0x000(x12) |lw  |    |    |rd:00:00000624:0:        |.
691: -000002d8|addi   x11, x11, 0x004 |lw  |lw  |    |rd:00:00000494:0:        |rd:00:0:0:00000000
692: -000002dc|addi   x12, x12, 0x004 |addi|lw  |lw  |                         |rd:00:0:0:00000047
693: -000002e0|add    x15, x15, x14   |addi|addi|lw  |                         |.
694: -000002e4|sw     x15, 0x000(x10) |add |addi|addi|                         |.
695: -000002e8|addi   x10, x10, 0x004 |sw  |add |addi|wr:00:000ffe40:0:00000047|.
696: -000002ec|bne    x11, x13, 0x1fe4|addi|sw  |add |                         |wr:00:0:0:
697: -~       |~                      |bne |addi|sw  |                         |.
698: -000002cc|                       |    |bne |addi|                         |.
699: -000002d0|lw     x15, 0x000(x11) |    |    |bne |                         |.
700: -000002d4|lw     x14, 0x000(x12) |lw  |    |    |rd:00:00000628:0:        |.
701: -000002d8|addi   x11, x11, 0x004 |lw  |lw  |    |rd:00:00000498:0:        |rd:00:0:0:00000055
702: -000002dc|addi   x12, x12, 0x004 |addi|lw  |lw  |                         |rd:00:0:0:00000059
703: -000002e0|add    x15, x15, x14   |addi|addi|lw  |                         |.
704: -000002e4|sw     x15, 0x000(x10) |add |addi|addi|                         |.
705: -000002e8|addi   x10, x10, 0x004 |sw  |add |addi|wr:00:000ffe44:0:000000ae|.
706: -000002ec|bne    x11, x13, 0x1fe4|addi|sw  |add |                         |wr:00:0:0:
707: -~       |~                      |bne |addi|sw  |                         |.
708: -000002cc|                       |    |bne |addi|                         |.

Notice how there is no cache, so all instruction fetches and data accesses go straight to the test memory. There are 10 cycles per iteration for a total of 1000 cycles. The simulator reported 1012 cycles with the extra 12 cycles due to the extra instructions required to call and return from the vvadd_scalar function.

TinyRV2 Processor ASIC

We can now push the baseline TinyRV2 pipelined processor through the ASIC flow to better characterize the area and timing of the processor as well as the energy when running the vvadd microbenchmark.

NOTE: The cache is not quite ready to be pushed through the flow, so for now the tutorial will just focus on pushing the processor through the flow without the cache. Eventually, you would replace --cache-impl null with --cache-impl rtl below to push the processor and cache through the flow, and everywhere you see ProcXcel you would instead see ProcMemXcel. We will also include some results from previously pushing the processor, cache, and accelerator through the flow.

We can use the pmx-sim simulator to translate the processor and to generate Verilog test benches that we can use for 4-state and gate-level simulation.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null \
     --translate --dump-vtb ../../app/build/ubmark-vvadd-test1
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null \
     --stats --translate --dump-vtb ../../app/build/ubmark-vvadd-eval
 num_cycles = 1012
 % ls
 ...
 ProcMemXcel_null_rtl__pickled.v
 ProcMemXcel_null_rtl_pmx-sim-null-rtl-ubmark-vvadd-test1_tb.v
 ProcMemXcel_null_rtl_pmx-sim-null-rtl-ubmark-vvadd-eval_tb.v

One problem with the evaluation test bench file is that it is for the entire microbenchmark including the code for initialization and verification. We want to isolate the energy for just doing the real computation, so we will also run a dummy microbenchmark that does all of the initialization and verification but then doesn’t actually do any real work. We can subtract the energy for the dummy microbenchmark from the energy for the real microbenchmark to estimate the energy to do the desired computation. Here is what the dummy microbenchmark looks like.

 int main( int argc, char* argv[] )
 {
   int dest[size];

   for ( int i = 0; i < size; i++ )
     dest[i] = ref[i];

   test_stats_on();
   // dummy to estimate energy outside function
   test_stats_off();

   for ( int i = 0; i < size; i++ ) {
     if ( !( dest[i] == ref[i] ) )
       test_fail( i, dest[i], ref[i] );
   }

   test_pass();
   return 0;
 }

We modify the initialization code to copy the values from the reference to the destination (so the ubmark will still pass) and then we comment out the call to vvadd_scalar. We can cross-compile the dummy microbenchmark, and we can use riscv32-objdump to confirm that the dummy microbenchmark still does the initialization and verification.

 % cd $TOPDIR/app/build
 % make ubmark-vvadd-dummy
 % riscv32-objdump ubmark-vvadd-dummy | less
 000002c0 <main>:
    ...
    2f8: lw   x13, 0(x15)    # <-.
    2fc: addi x14, x14, 4    #   | initialize dest array
    300: addi x15, x15, 4    #   |
    304: sw   x13, -4(x14)   #   |
    308: bne  x15, x12, 2f8  # --'

    30c: addi x15, x0, 1     #
    310: csrw 0x7c1,x15      # turn stats on
    314: addi x15, x0, 0     #
    318: csrw 0x7c1,x15      # turn stats off

    31c: bge  x0, x10, 944   # start of verification code
    320: lui  x16, 0x20      #
    324: addi x14, x11, 0    #
    328: addi x13, x0,  944  #
    32c: addi x15, x0,  0    #
    ...

Now we can run this dummy microbenchmark on the processor RTL model.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null \
     --stats --translate --dump-vtb ../../app/build/ubmark-vvadd-dummy
 num_cycles = 2
 % ls
 ...
 ProcMemXcel_null_rtl__pickled.v
 ProcMemXcel_null_rtl_pmx-sim-null-rtl-ubmark-vvadd-dummy_tb.v
 ProcMemXcel_null_rtl_pmx-sim-null-rtl-ubmark-vvadd-eval_tb.v

Not surprisingly, there are only a few cycles in the timing loop since the dummy microbenchmark does essentially no work between turning stats on and off. We are now ready to push the baseline TinyRV2 pipelined processor through the ASIC flow.

 % mkdir -p $TOPDIR/asic/build-tut9-pmx-null
 % cd $TOPDIR/asic/build-tut9-pmx-null
 % mflowgen run --design ../tut9-pmx-null

Now we can push the design through the flow step-by-step:

 % cd $TOPDIR/asic/build-tut9-pmx-null
 % make brg-rtl-4-state-vcssim
 % make brg-synopsys-dc-synthesis
 % make post-synth-gate-level-simulation
 % make post-synth-power-analysis
 % make brg-cadence-innovus-signoff
 % make brg-flow-summary

Here is the summary:

==========================================================================
  Summary
==========================================================================

design_name = ProcXcel_null_rtl

area & timing
  design_area   = 16898.182 um^2
  stdcells_area = 16898.182 um^2
  macros_area   = 0.0 um^2
  chip_area     = 53328.425 um^2
  core_area     = 24329.424 um^2
  constraint    = 3.0 ns
  slack         = 0.376 ns
  actual_clk    = 2.624 ns

==========================================================================
 4-State Sim Results
==========================================================================
[PASSED]: pmx-sim-null-rtl-ubmark-vvadd-test1
[PASSED]: pmx-sim-null-rtl-ubmark-vvadd-eval
[PASSED]: pmx-sim-null-rtl-ubmark-vvadd-dummy

==========================================================================
 Fast-Functional Sim Results
==========================================================================
[PASSED]: pmx-sim-null-rtl-ubmark-vvadd-test1
[PASSED]: pmx-sim-null-rtl-ubmark-vvadd-eval
[PASSED]: pmx-sim-null-rtl-ubmark-vvadd-dummy

==========================================================================
 Fast-Functional Power Analysis
==========================================================================

 power & energy

  pmx-sim-null-rtl-ubmark-vvadd-eval.vcd
      exec_time = 2892 cycles
      power     = 5.037 mW
      energy    = 36.4175 nJ

  pmx-sim-null-rtl-ubmark-vvadd-dummy.vcd
      exec_time = 1976 cycles
      power     = 5.168 mW
      energy    = 25.5299 nJ

You can run all the steps at once by just using the default make target (i.e., just enter make). Our processor passes 4-state and gate-level simulation and meets timing. Note that you cannot use the value for exec_time from this final summary. The summary is reporting 2892 cycles because that includes the time for initialization and verification, while the cycle count from pmx-sim (i.e., 1012) is only when stats are enabled.

Unfortunately, this also means the energy estimate is not correct, since the summary is reporting the energy for initialization, the vvadd kernel, and verification together. As mentioned above, the dummy ubmark does the initialization and verification steps but does not do any real work. It looks like approximately 1976 cycles and ~25.5 nJ are spent on initialization and verification. If we subtract the exec_time of the dummy ubmark from the real ubmark, we get 916 cycles, which is lower than 1012 but in the right ballpark. If we subtract the energy of the dummy ubmark from the real ubmark, we can estimate the energy for just the vvadd kernel to be 36.4 - 25.5 = 10.9 nJ.

NOTE: The rest of this section in the tutorial provides some results from a previous version which includes the cache. It is included here to give students some insight about a more realistic design that includes the processor and cache together pushed through the ASIC flow.

The processor has a cycle time of 3.0ns, or a peak clock frequency of roughly 333MHz. We might want to start by taking a closer look at the critical path. Remember that we need to look at the QOR report first to find out which path group has the critical path.

 % cd $TOPDIR/asic/icc-par/current-icc/reports
 % more chip_finish_icc.qor.rpt
 # we note critical path is in the ideal_clock1 path group
 % less -p "Path Group: ideal_clock1" chip_finish_icc.timing.rpt

  Point                                                                 Path
  ------------------------------------------------------------------------------
  clock ideal_clock1 (rise edge)                                        0.0000
  clock network delay (propagated)                                      0.4614
  dmem/dpath/tag_array_1/sram/CE1 (SRAM_32x256_1P)                      0.4614 r
  dmem/dpath/tag_array_1/sram/O1[11] (SRAM_32x256_1P)                   1.1848 f
  dmem/dpath/tag_compare_1/in1[11] (EqComparator_2)                     1.1848 f
  dmem/dpath/tag_compare_1/out[0] (EqComparator_2)                      1.4684 r
  dmem/dpath/tag_match_1[0] (BlockingCacheDpathPRTL_1)                  1.4684 r
  dmem/ctrl/tag_match_1[0] (BlockingCacheCtrlPRTL_1)                    1.4684 r
  dmem/ctrl/way_sel_comb[0] (BlockingCacheCtrlPRTL_1)                   1.5716 r
  dmem/dpath/way_sel_comb[0] (BlockingCacheDpathPRTL_1)                 1.5716 r
  dmem/dpath/data_read_mux/sel[0] (Mux_2)                               1.7304 r
  dmem/dpath/data_read_mux/out[35] (Mux_2)                              1.8657 r
  dmem/dpath/read_byte_sel_mux/in__001[3] (Mux_3)                       1.9717 r
  dmem/dpath/read_byte_sel_mux/out[3] (Mux_3)                           2.0790 r
  dmem/dpath/cacheresp_msg[3] (BlockingCacheDpathPRTL_1)                2.1429 r
  dmem/resp_bypass/enq_msg[3] (SingleElementBypassQueue_1)              2.1429 r
  dmem/resp_bypass/dpath/enq_bits[3] (SingleElementBypassQueueDpath_1)  2.1429 r
  dmem/resp_bypass/dpath/bypass_mux/in__001[3] (Mux_1)                  2.1429 r
  dmem/resp_bypass/dpath/bypass_mux/out[3] (Mux_1)                      2.1928 r
  dmem/resp_bypass/dpath/deq_bits[3] (SingleElementBypassQueueDpath_1)  2.1928 r
  dmem/resp_bypass/deq_msg[3] (SingleElementBypassQueue_1)              2.1928 r
  dmem/cacheresp_msg[3] (BlockingCachePRTL_1)                           2.1928 r
  router/in__msg[3] (Router)                                            2.1928 r
  router/out_000_msg[3] (Router)                                        2.2483 r
  proc/dmemresp_msg[3] (ProcPRTL)                                       2.2483 r
  proc/dpath/dmemresp_msg_data[3] (ProcDpathPRTL)                       2.2483 r
  proc/dpath/wb_result_sel_mux_M/in__001[3] (Mux_0)                     2.2483 r
  proc/dpath/wb_result_sel_mux_M/out[3] (Mux_0)                         2.3016 r
  proc/dpath/op2_byp_mux_D/in__002[3] (Mux_6)                           2.3016 r
  proc/dpath/op2_byp_mux_D/out[3] (Mux_6)                               2.3949 r
  proc/dpath/op2_sel_mux_D/in__000[3] (Mux_3)                           2.3949 r
  proc/dpath/op2_sel_mux_D/out[3] (Mux_3)                               2.4700 r
  proc/dpath/op2_reg_X/in_[3] (RegEnRst_9)                              2.4700 r
  proc/dpath/op2_reg_X/out_reg_3_/D (DFFX1)                             2.5360 r
  data arrival time                                                     2.5360

  clock ideal_clock1 (rise edge)                                        2.0000
  clock network delay (propagated)                                      2.4509
  clock reconvergence pessimism                                         2.4539
  proc/dpath/op2_reg_X/out_reg_3_/CLK (DFFX1)                           2.4539 r
  library setup time                                                    2.4082
  data required time                                                    2.4082
  ------------------------------------------------------------------------------
  data required time                                                    2.4082
  data arrival time                                                    -2.5360
  ------------------------------------------------------------------------------
  slack (VIOLATED)                                                     -0.1278

I have cleaned up the path a bit by removing some columns and mostly only showing the ports of modules. We can see that the critical path starts in the tag array of the data cache, goes through the tag comparator, into the control logic, back into the datapath to control the data read mux select, through the read byte mux, and through a bypass queue which is placed at the cache response interface. The critical path continues through the router and into the processor, goes to the writeback result mux in the M stage, around through the bypass path to the bypass and operand muxes at the end of the D stage, and finally ends at the op2 register in between the D and X stages. This path is highlighted on the cache and processor datapath diagrams below. Notice that the datapath diagram is not quite accurate since there is a mux that bypasses the read_data_reg in the actual RTL.

The following instruction sequence would use the bypass path which is the critical path:

 lw x1, 0(x2)   F  D  X (M) W
 opA               F  D  X  M  W
 addi                 F (D) X  M  W

The bypass path goes from the end of the M stage of the LW instruction to the end of the D stage of the ADDI instruction. Bypass paths are often on the critical path, and this specific path is always difficult to deal with. We could eliminate this critical path by removing the bypass path from M to D, but this would really only reduce the cycle time by 100-200ps. A more promising option would be to try and optimize the data cache design. We could try and optimize some of the logic on the critical path (e.g., the tag comparator or various muxes). We could also pipeline the data cache, which would reduce the cycle time but increase the cache hit latency to two cycles.

We might also want to take a more detailed look at the power breakdown for the complete microbenchmark, which is shown below. Since this includes time spent in initialization and verification, we should be careful how we interpret these results.

                                               Int      Switch   Leak     Total
 Hierarchy                                     Power    Power    Power    Power    %
 ---------------------------------------------------------------------------------------
 ProcMemXcel_null_rtl                          5.22e-02 1.44e-02 2.39e-03 6.89e-02 100.0
   proc (ProcPRTL)                             1.17e-02 6.02e-03 6.89e-04 1.84e-02  26.7
    ctrl (ProcCtrlPRTL)                        1.02e-03 5.20e-04 3.74e-05 1.58e-03   2.3
    dpath (ProcDpathPRTL)                      9.56e-03 4.72e-03 5.90e-04 1.49e-02  21.6
      alu_X (AluPRTL)                          1.06e-03 9.82e-04 6.96e-05 2.11e-03   3.1
      imul (IntMulScycleRTL)                   2.38e-03 1.42e-03 1.43e-04 3.95e-03   5.7
      rf (RegisterFile)                        1.49e-03 3.98e-04 1.94e-04 2.08e-03   3.0
    imemreq_queue (TwoElementBypassQueue)      6.68e-04 1.43e-04 2.64e-05 8.37e-04   1.2
    xcelreq_queue (SingleElementBypassQueue)   6.85e-05 2.07e-05 8.67e-06 9.79e-05   0.1
  xcel (NullXcel)                              8.93e-05 3.48e-05 1.88e-05 1.43e-04   0.2
  imem (BlockingCachePRTL_0)                   2.76e-02 2.91e-03 7.65e-04 3.13e-02  45.4
    resp_bypass (SingleElementBypassQueue_0)   5.34e-04 1.79e-04 1.84e-05 7.31e-04   1.1
    ctrl (BlockingCacheCtrlPRTL_0)             2.46e-03 1.04e-03 3.13e-04 3.81e-03   5.5
    dpath (BlockingCacheDpathPRTL_0)           2.46e-02 1.63e-03 4.30e-04 2.66e-02  38.6
      tag_array_0 (SRAM_32x256_1P_wrapper_1)   2.41e-03 1.48e-05 4.77e-05 2.47e-03   3.6
      tag_array_1 (SRAM_32x256_1P_wrapper_0)   2.39e-03 1.10e-05 4.72e-05 2.45e-03   3.5
      data_array_0 (SRAM_128x256_1P_wrapper_0) 8.54e-03 4.16e-05 1.02e-04 8.69e-03  12.6
      data_array_1 (SRAM_128x256_1P_wrapper_1) 8.49e-03 5.50e-06 1.05e-04 8.60e-03  12.5
      read_data_reg (RegEnRst_1)               1.68e-03 7.85e-04 2.46e-05 2.49e-03   3.6
  funnel (Funnel)                              6.74e-05 4.88e-05 5.90e-06 1.22e-04   0.2
  router (Router)                              4.63e-05 4.49e-05 8.46e-06 9.96e-05   0.1
  dmem (BlockingCachePRTL_1)                   1.20e-02 3.92e-03 8.84e-04 1.68e-02  24.3
    resp_bypass (SingleElementBypassQueue_1)   1.27e-04 2.94e-05 2.39e-05 1.80e-04   0.3
    ctrl (BlockingCacheCtrlPRTL_1)             3.39e-03 2.67e-03 3.33e-04 6.40e-03   9.3
    dpath (BlockingCacheDpathPRTL_1)           7.97e-03 9.80e-04 5.20e-04 9.47e-03  13.7
      tag_array_0 (SRAM_32x256_1P_wrapper_3)   7.59e-04 2.17e-05 4.44e-05 8.25e-04   1.2
      tag_array_1 (SRAM_32x256_1P_wrapper_2)   5.29e-04 2.69e-06 4.44e-05 5.76e-04   0.8
      data_array_0 (SRAM_128x256_1P_wrapper_2) 3.06e-03 2.38e-05 1.02e-04 3.19e-03   4.6
      data_array_1 (SRAM_128x256_1P_wrapper_3) 2.10e-03 7.34e-07 1.02e-04 2.20e-03   3.2
      read_data_reg (RegEnRst_3)               3.44e-04 2.21e-04 3.39e-05 5.99e-04   0.9

At a high level, we can see that 45% of the power is spent in the instruction cache, 25% of the power is spent in the data cache, and 27% of the power is spent in the processor itself. Notice that more power is spent in the instruction cache since it is accessed almost every cycle, while the data cache is only accessed on LW/SW instructions. Almost 70% of the power is spent in the memory system. This is not too surprising. SRAMs often consume a significant fraction of the power in many ASICs. Also note that the multiplier consumes ~6% of the power, but the multiplier is hardly used in this microbenchmark! This is a great example where we might want to use either data or clock gating to reduce the power consumption of an unused functional unit.

The area breakdown is shown below.

                          Global         Local
                          Cell Area      Cell Area
                          -------------- ------------------------
 Hierarchical cell        Abs                    Non     Black-
                          Total    %     Comb    Comb    boxes
 ------------------------ -------- ----- ------- ------- --------  --------------------------
 ProcMemXcel_null_rtl     893919.8 100.0  1781.4     0.0      0.0  ProcMemXcel_null_rtl
 proc                     126491.4  14.2   381.5     0.0      0.0  ProcPRTL
 proc/ctrl                  6028.1   0.7  3123.3  1232.1      0.0  ProcCtrlPRTL
 proc/dpath               111337.5  12.5   246.0     0.0      0.0  ProcDpathPRTL
 proc/dpath/alu_X          14574.1   1.6 14574.1     0.0      0.0  AluPRTL
 proc/dpath/imul           25702.5   2.9 21805.0     0.0      0.0  IntMulScycleRTL
 proc/dpath/rf             41919.8   4.7 16134.4 24684.1      0.0  RegisterFile
 xcel                       3446.7   0.4   251.5     0.0      0.0  NullXcel
 imem                     373501.9  41.8   371.4     0.0      0.0  BlockingCachePRTL_0
 imem/ctrl                 68185.4   7.6  1158.4    99.5      0.0  BlockingCacheCtrlPRTL_0
 imem/dpath               302905.5  33.9  4264.2     0.0      0.0  BlockingCacheDpathPRTL_0
 imem/dpath/tag_array_0    24259.1   2.7   141.9     0.0  24117.2  SRAM_32x256_1P_wrapper_1
 imem/dpath/tag_array_1    24242.5   2.7   125.3     0.0  24117.2  SRAM_32x256_1P_wrapper_0
 imem/dpath/data_array_0  118078.9  13.2   120.7     0.0 117958.2  SRAM_128x256_1P_wrapper_0
 imem/dpath/data_array_1  118400.6  13.2   442.3     0.0 117958.2  SRAM_128x256_1P_wrapper_1
 funnel                      863.5   0.1   768.6     0.0      0.0  Funnel
 router                      794.4   0.1   794.4     0.0      0.0  Router
 dmem                     387040.2  43.3   491.2     0.0      0.0  BlockingCachePRTL_1
 dmem/ctrl                 70902.3   7.9  1293.0    99.5      0.0  BlockingCacheCtrlPRTL_1
 dmem/dpath               313110.4  35.0  8994.8     0.0      0.0  BlockingCacheDpathPRTL_1
 dmem/dpath/tag_array_0    24174.3   2.7    57.1     0.0  24117.2  SRAM_32x256_1P_wrapper_3
 dmem/dpath/tag_array_1    24146.7   2.7    29.4     0.0  24117.2  SRAM_32x256_1P_wrapper_2
 dmem/dpath/data_array_0  118081.7  13.2   123.4     0.0 117958.2  SRAM_128x256_1P_wrapper_2
 dmem/dpath/data_array_1  118086.3  13.2   128.1     0.0 117958.2  SRAM_128x256_1P_wrapper_3
 ------------------------ -------- ----- ------- ------- --------  --------------------------

Notice that over 80% of the area is in the caches, and more specifically, approximately 60% of the area is in the generated SRAMs. This is not unusual. SRAMs often consume a significant fraction of the area in many ASICs.

The following figure shows an amoeba plot of the baseline processor. The instruction cache is shown in green while the data cache is shown in red. The large rectangles correspond to the various SRAMs used in the caches. Since the caches are two-way set associative, there are two smaller SRAMs to store the tags and two larger SRAMs to store the data. The processor is shown in blue/purple. The register file, single-cycle multiplier, and integer ALU are also explicitly shown. Notice how the tools try to group cells from the same module together (e.g., the cells for the integer ALU are all clumped together), although sometimes the tools will spread a module apart (e.g., a portion of the data cache ends up in the lower-right corner).

VVADD Accelerator FL, CL, and RTL Models

We will take an incremental approach when designing, implementing, testing, and evaluating accelerators. We can use test sources, sinks, and memories to create a test harness that will enable us to explore the accelerator cycle-level performance and the ASIC area, energy, and timing in isolation. Only after we are sure that we have a reasonable design-point should we consider integrating the accelerator with the processor.

All accelerators have an xcel minion interface along with a standard mem master interface. The messages sent over the xcel minion interface allow the test harness or processor to read and write accelerator registers. These accelerator registers can be real registers that hold configuration information and/or results, or they can just be used to trigger certain actions. The messages sent over the xcel.req interface from the test harness or processor to the accelerator have the following format:

   1b     5b      32b
 +------+-------+-----------+
 | type | raddr | data      |
 +------+-------+-----------+

The 1-bit type field indicates if this message is for reading (0) or writing (1) an accelerator register, the 5-bit raddr field specifies which accelerator register to read or write, and the 32-bit data field is the data to be written. For every accelerator request, the accelerator must send back a corresponding accelerator response over the xcel.resp interface. These response messages have the following format:

   1b     32b
 +------+-----------+
 | type | data      |
 +------+-----------+

The 1-bit type field again indicates if this response is for reading (0) or writing (1) an accelerator register, and the 32-bit data field is the data read from the corresponding accelerator register. Every accelerator is free to design its own accelerator protocol by defining the meaning of reading/writing the 32 accelerator registers.
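
For illustration, here is a small C sketch of how a test harness might pack an accelerator request and unpack an accelerator response, assuming the fields are packed from most-significant to least-significant bit exactly as drawn above (the helper names are hypothetical and not part of the repository):

 #include <stdint.h>
 #include <stdio.h>

 // xcel.req  = { type[1], raddr[5], data[32] } : 38 bits total
 // xcel.resp = { type[1], data[32] }           : 33 bits total

 static inline uint64_t xcelreq_pack( unsigned type, unsigned raddr, uint32_t data )
 {
   return ( (uint64_t)(type  & 0x1 ) << 37 )
        | ( (uint64_t)(raddr & 0x1f) << 32 )
        |   (uint64_t)data;
 }

 static inline uint32_t xcelresp_data( uint64_t resp )
 {
   return (uint32_t)( resp & 0xffffffffu );
 }

 int main( void )
 {
   // Writing 0x0000000a to accelerator register 0 (xr0); this is the same
   // message that shows up as wr:00:0000000a in the null accelerator line
   // trace later in this section
   printf( "%010llx\n", (unsigned long long) xcelreq_pack( 1, 0, 0x0000000a ) );
   return 0;
 }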

We have implemented a null accelerator which we can use when we don’t want to integrate a “real” accelerator, but this null accelerator is also useful in illustrating the basic accelerator interface. The null accelerator has a single accelerator register (xr0) which can be read and written. Take a closer look at this null accelerator in sim/proc/NullXcelRTL.py.

    @update
    def block():

      # Mux to force xcelresp data to zero on a write
      # Enable xr0 only upon write requests and both val/rdy on resp side

      if s.xcelreq_q.send.msg.type_ == XCEL_TYPE_WRITE:
        s.xr0.en @= s.xcel.resp.val & s.xcel.resp.rdy
        s.xcel.resp.msg.data @= 0
      else:
        s.xr0.en @= 0
        s.xcel.resp.msg.data @= s.xr0.out

The null accelerator simply waits for an xcel.req message to arrive. If that message is a read, then it reads the xr0 register into the xcelresp message. If that message is a write, then it sets the enable of the xr0 register so that the new value is flopped in at the end of the cycle. Here is a unit test which writes a value to the null accelerator's xr0 register and then reads it back:

 % cd $TOPDIR/sim/build
 % pytest ../proc/test/NullXcelRTL_test.py -k basic -s

  1r                >               |.           > .
  2r                >               |.           > .
  3:                >               |.           > .
  4:                >               |.           > .
  5:                >               |.           > .
  6:                >               |.           > .
  7: wr:00:0000000a > wr:00:0000000a|            >
  8: rd:00:         > rd:00:        |wr:         > wr:
  9:                >               |rd:0000000a > rd:0000000a
 10:                >               |.           > .

From the line trace, you can see the write request message (with write data 0x0a) going into the accelerator, and then the write response being returned on the next cycle. You can also see the read request message going into the accelerator, and then the read response being returned (with read data 0x0a) again on the next cycle.

The vvadd accelerator is obviously more sophisticated. Accelerator protocols are usually defined as a comment at the top of the FL model, so take a closer look at the vvadd accelerator FL model in sim/tut9_xcel/VvaddXcelFL.py. The vvadd accelerator protocol defines the accelerator registers as follows:

  xr0 : go/done (write to start the computation; read returns 1 when done)
  xr1 : base address of the src0 array
  xr2 : base address of the src1 array
  xr3 : base address of the dest array
  xr4 : number of elements in the arrays

The actual protocol involves the following steps:

  1. Write the base address of src0 to xr1
  2. Write the base address of src1 to xr2
  3. Write the base address of dest to xr3
  4. Write the number of elements in the array to xr4
  5. Tell accelerator to go by writing xr0
  6. Wait for accelerator to finish by reading xr0, result will be 1

A close look at the vvadd accelerator FL model shows that most of the work is really in managing this accelerator protocol. The accelerator waits for accelerator requests, updates its internal state registers, and when it receives a write to xr0 it starts doing the actual vvadd computation. The FL model makes use of method-based interfaces to simplify interacting with the memory system. Let’s run the unit tests on the FL model first:

 % cd $TOPDIR/sim/build
 % pytest ../tut9_xcel/test/VvaddXcelFL_test.py -v

The vvadd accelerator CL model is actually very close to the RTL implementation largely due to the need to carefully interact with the latency insensitive memory interface. CL modeling may or may not be useful in this context. The vvadd accelerator RTL model is in sim/tut9_xcel/VvaddXcelPRTL.py and (roughly) implements the following FSM:

While the accelerator is in the XCFG state, it will update its internal registers when it receives accelerator requests. When the accelerator receives a write to xr0 it moves into the M_RD state. While in the M_RD state, the accelerator will send out two memory read requests to read the current element from each source array. In the ADD state, the accelerator will do the actual addition, and in the M_WR state, the accelerator will send out the memory write request to write the result to the destination array. The accelerator will wait in the final WAIT state until it receives the memory write response, and then will either move back into the M_RD state if there is another element to be processed, or move into the XCFG state if we have processed all elements in the array.
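
To summarize the FSM in code, here is a rough C sketch of the work the accelerator performs once it sees a write to xr0 (illustration only; the RTL issues explicit memory request messages and waits for the corresponding responses rather than dereferencing pointers):

 void vvadd_xcel_behavior( int *dest, int *src0, int *src1, int size )
 {
   for ( int i = 0; i < size; i++ ) {
     int a   = src0[i];  // M_RD : send two memory read requests
     int b   = src1[i];
     int sum = a + b;    // ADD  : do the actual addition
     dest[i] = sum;      // M_WR : send the memory write request
   }                     // WAIT : wait for the write response, then either
 }                       //        loop back to M_RD or return to XCFG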

The accelerator is not implemented with a control/datapath split because the accelerator is almost entirely control logic; it was just simpler to implement the accelerator as a single model. When a model is almost all control logic or almost all datapath logic, then a control/datapath split may be more trouble than it is worth.

Let’s run the unit tests for all of the vvadd accelerator models:

 % cd $TOPDIR/sim/build
 % pytest ../tut9_xcel
 % pytest ../tut9_xcel --test-verilog

We have also included a simulator for just the vvadd accelerator in isolation which can be used to evaluate its performance.

 % cd $TOPDIR/sim/build
 % ../tut9_xcel/vvadd-xcel-sim --impl rtl --input multiple --stats
 num_cycles = 1058

We could use the simulator to help evaluate the cycle-level performance of the accelerator on various different datasets as we try out various optimizations.

VVADD Accelerator ASIC

We can now push the vvadd accelerator through the ASIC flow in isolation to get a feel for the area, timing, and energy. We first need to run all of the tests to generate appropriate Verilog test benches for 4-state and gate-level simulation. We can then use the evaluation simulator to translate the accelerator into Verilog and to generate the VCD file we need for energy analysis.

 % cd $TOPDIR/sim/build
 % pytest ../tut9_xcel --test-verilog --dump-vtb
 % ../tut9_xcel/vvadd-xcel-sim --impl rtl --input multiple \
     --stats --translate --dump-vtb
 ...
 VvaddXcelRTL__pickled.v
 VvaddXcelRTL_vvadd-xcel-rtl-multiple_tb.v

We will use a rather conservative clock constraint of 3.0ns since that is what we are hoping to hit for the baseline processor. Our goal in designing these accelerators is just to ensure the accelerator doesn’t increase the critical path of the processor; there is no benefit in pushing the accelerator to have a cycle time which is less than the processor since this will not help the overall cycle time of the processor, memory, accelerator composition. We can now push the accelerator through the flow.

 % mkdir -p $TOPDIR/asic/build-vvadd-xcel
 % cd $TOPDIR/asic/build-vvadd-xcel
 % mflowgen run --design ../tut9-vvadd-xcel
 % make

The accelerator has no trouble meeting timing. If the accelerator did have trouble meeting the 3.0ns clock constraint, then we could do timing optimizations on the accelerator in isolation before moving on to the processor, memory, accelerator composition. We can also do area and energy optimization on the accelerator in isolation.

Integrating the TinyRV2 Processor and the VVADD Accelerator

Now that we have unit tested and evaluated both the baseline TinyRV2 pipelined processor and the vvadd accelerator in isolation, we are finally ready to compose them. The processor will send messages to the accelerator by reading and writing 32 special CSRs using the standard CSRW and CSRR instructions. These 32 special CSRs are as follows:

  0x7e0 : accelerator register  0 (xr0)
  0x7e1 : accelerator register  1 (xr1)
  0x7e2 : accelerator register  2 (xr2)
  ...
  0x7ff : accelerator register 31 (xr31)
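The mapping from CSR address to accelerator register index is just a fixed offset from 0x7e0. As a small illustration (plain Python, not code from the repository):

 XCEL_CSR_BASE = 0x7e0   # CSR address of accelerator register 0 (xr0)

 def csr_to_xr( csr_addr ):
   # Map a CSR address in 0x7e0-0x7ff to its accelerator register index
   assert XCEL_CSR_BASE <= csr_addr <= 0x7ff
   return csr_addr - XCEL_CSR_BASE

 # csr_to_xr(0x7e0) == 0, csr_to_xr(0x7e4) == 4, csr_to_xr(0x7ff) == 31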

When the processor uses a CSRW instruction to write an accelerator register, it first reads the general-purpose register file to get the source value, creates a new accelerator request message, then sends this message to the accelerator through the xcel.req interface in the X stage. The processor waits for the response message to be returned through the xcel.resp interface in the M stage. The processor uses a CSRR instruction to read an accelerator register in a similar way, except that when the response message is returned in the M stage, the data from the accelerator is sent down the pipeline and written into the general-purpose register file in the W stage.

Here is a simple assembly sequence which will write the value 1 to the null accelerator’s only accelerator register, read that value back from the accelerator register, and write the value to general-purpose register x2.

  addi x1, x0, 1
  csrw 0x7e0, x1
  csrr x2, 0x7e0

You can run a simple test of using the CSRW/CSRR instructions to write/read an accelerator register like this:

 % cd $TOPDIR/sim/build
 % pytest ../proc/test/ProcFL_xcel_test.py
 % pytest ../proc/test/ProcRTL_xcel_test.py
 % pytest ../proc/test/ProcRTL_xcel_test.py -k bypass -s

     src        F-stage  D-stage         X     M     W     xcelreq         xcelresp    sink
 -----------------------------------------------------------------------------------------------
  2: .        >         |               |     |     |     |              |.          > .
  3: .        >         |               |     |     |     |              |.          > .
  4: .        > 00000200|               |     |     |     |              |.          > .
  5: .        > #       |#              |     |     |     |              |.          > .
  6: .        > #       |#              |     |     |     |              |.          > .
  7: deadbeef > 00000204|csrr x02, 0xfc0|     |     |     |              |.          >
  8: #        > 00000208|nop            |csrr |     |     |              |.          >
  9: #        > 0000020c|nop            |nop  |csrr |     |              |.          >
 10: #        > 00000210|nop            |nop  |nop  |csrr |              |.          >
 11: #        > 00000214|csrw 0x7e0, x02|nop  |nop  |nop  |              |.          >
 12: #        > 00000218|csrr x03, 0x7e0|csrw |nop  |nop  |wr:00:deadbeef|.          >
 13: #        > 0000021c|nop            |csrrx|csrw |nop  |rd:00:        |wr:        >
 14: #        > 00000220|nop            |nop  |csrrx|csrw |              |rd:deadbeef>
 15: #        > 00000224|nop            |nop  |nop  |csrrx|              |.          >
 16: #        > 00000228|csrw 0x7c0, x03|nop  |nop  |nop  |              |.          >
 17: deadbe00 > 0000022c|csrr x02, 0xfc0|csrw |nop  |nop  |              |.          >
 18: #        > 00000230|nop            |csrr |csrw |nop  |              |.          >
 19: #        > 00000234|nop            |nop  |csrr |csrw |              |.          > deadbeef
 20: #        > 00000238|csrw 0x7e0, x02|nop  |nop  |csrr |              |.          >
 21: #        > 0000023c|csrr x03, 0x7e0|csrw |nop  |nop  |wr:00:deadbe00|.          >
 22: #        > 00000240|nop            |csrrx|csrw |nop  |rd:00:        |wr:        >
 23: #        > 00000244|nop            |nop  |csrrx|csrw |              |rd:deadbe00>
 24: #        > 00000248|csrw 0x7c0, x03|nop  |nop  |csrrx|              |.          >
 25: 00adbe00 > 0000024c|csrr x02, 0xfc0|csrw |nop  |nop  |              |.          >
 26: #        > 00000250|nop            |csrr |csrw |nop  |              |.          >
 27: #        > 00000254|csrw 0x7e0, x02|nop  |csrr |csrw |              |.          > deadbe00
 28: #        > 00000258|csrr x03, 0x7e0|csrw |nop  |csrr |wr:00:00adbe00|.          >
 29: #        > 0000025c|nop            |csrrx|csrw |nop  |rd:00:        |wr:        >
 30: #        > 00000260|csrw 0x7c0, x03|nop  |csrrx|csrw |              |rd:00adbe00>
 31: dea00eef > 00000264|csrr x02, 0xfc0|csrw |nop  |csrrx|              |.          >
 32: .        > 00000268|csrw 0x7e0, x02|csrr |csrw |nop  |              |.          >
 33: .        > 0000026c|csrr x03, 0x7e0|csrw |csrr |csrw |wr:00:dea00eef|.          > 00adbe00
 34: .        > #       |#              |csrrx|csrw |csrr |rd:00:        |wr:        >
 35: .        > 00000270|csrw 0x7c0, x03|     |csrrx|csrw |              |rd:dea00eef>
 36: .        > 00000274|               |csrw |     |csrrx|              |.          >
 37: .        > 00000278|               |???? |csrw |     |              |.          >
 38: .        > 0000027c|               |???? |???? |csrw |              |.          > dea00eef
 39: .        > 00000280|               |???? |???? |???? |              |.          > .

I have cleaned up the line trace a bit to annotate the columns and make it more compact. You can see the processor executing CSRW/CSRR instructions to 0x7e0 which is accelerator register 0. This results in the processor sending accelerator requests to the null accelerator, and then the accelerator sending the corresponding accelerator responses back to the processor.

Also notice that the processor needs new RAW dependency stall logic. A CSRR instruction that reads an accelerator register sends out the xcel.req message in the X stage and receives the xcel.resp message in the M stage. This means we cannot bypass the data from a CSRR instruction while it is in the X stage, since the data has not yet returned from the accelerator. In cycle 34, the CSRW instruction in the decode stage needs to stall to wait for the CSRR instruction in the X stage to move into the M stage.
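As a rough sketch of this new stall condition, consider the following plain-Python pseudocode; the signal and field names here are hypothetical and do not come from ProcCtrlPRTL:

 # Hypothetical sketch of the extra RAW stall in the decode stage. A CSRR that
 # reads an accelerator register does not have its result until the M stage, so
 # an instruction in D that reads the same destination register cannot use the
 # usual X-stage bypass and must stall until the CSRR reaches the M stage.

 def stall_xcel_raw_D( inst_D, inst_X ):
   return ( inst_X.is_csrr_xcel                 # CSRR to 0x7e0-0x7ff in X stage
            and inst_X.rd != 0                  # it actually writes a register
            and ( inst_D.rs1 == inst_X.rd       # instruction in D reads that register
               or inst_D.rs2 == inst_X.rd ) )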

Accelerating a TinyRV2 Microbenchmark

To use an accelerator from a C microbenchmark, we can use the same GCC inline assembly extensions we used to write the stats_en CSR earlier in the tutorial. Take a closer look at the app/ubmark/ubmark-null-xcel.c example:

 __attribute__ ((noinline))
 unsigned int null_xcel( unsigned int in )
 {
   unsigned int result;
   asm volatile (
     "csrw 0x7E0,     %[in];\n"
     "csrr %[result], 0x7E0;\n"

     // Outputs from the inline assembly block

     : [result] "=r"(result)

     // Inputs to the inline assembly block

     : [in] "r"(in)

   );
   return result;
 }

We are inserting a CSRW instruction to copy the value passed to this function through the in argument into the accelerator register, and then we are using a CSRR instruction to retrieve the same value from the null accelerator. Notice that unlike the inline assembly we used when setting the stats_en CSR, here we also need to handle outputs from the assembly block. Again, you can find out more about inline assembly syntax here:

Let’s cross-compile this example. Note that you cannot natively compile a microbenchmark that makes use of an accelerator, since x86 does not have any accelerators!

 % cd $TOPDIR/app/build
 % make ubmark-null-xcel
 % riscv32-objdump ubmark-null-xcel | less
 000002c0 <null_xcel(unsigned int)>:
    2c0:  csrw 0x7e0, x10
    2c4:  csrr x10, 0x7e0
    2c8:  jalr x0,  x1, 0

It is always a good idea to use riscv32-objdump to verify your C code is compiling as expected. Here we can see that the null_xcel function compiles into a CSRW, a CSRR, and a JALR instruction as expected. We should now run this microbenchmark on our ISA simulator to verify it works, and then we can run it on our RTL simulator.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim ../../app/build/ubmark-null-xcel
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl --xcel-impl null-rtl \
     --trace ../../app/build/ubmark-null-xcel

Let's turn our attention to our vvadd accelerator. Take a closer look at the accelerated version of the vvadd microbenchmark in app/ubmark/ubmark-vvadd-xcel.c:

 __attribute__ ((noinline))
 void vvadd_xcel( int *dest, int *src0, int *src1, int size )
 {
   asm volatile (
     "csrw 0x7E1, %[src0];\n"
     "csrw 0x7E2, %[src1];\n"
     "csrw 0x7E3, %[dest];\n"
     "csrw 0x7E4, %[size];\n"
     "csrw 0x7E0, x0     ;\n"
     "csrr x0,    0x7E0  ;\n"

     // Outputs from the inline assembly block

     :

     // Inputs to the inline assembly block

     : [src0] "r"(src0),
       [src1] "r"(src1),
       [dest] "r"(dest),
       [size] "r"(size)

     // Tell the compiler this accelerator read/writes memory

     : "memory"
   );
 }

Notice that our use of the CSRW/CSRR instructions corresponds exactly to the accelerator protocol described above. We first write the source base pointers, the destination base pointer, and the size, then start the accelerator by writing to xr0, and finally wait for the accelerator to finish by reading xr0. We need the final "memory" clobber argument in our inline assembly block to tell the compiler that this accelerator reads and writes memory. Let's cross-compile the accelerated version of the vvadd microbenchmark with a test program similar to the one we used earlier in this tutorial.

 % cd $TOPDIR/app/build
 % make ubmark-vvadd-xcel-test1
 % riscv32-objdump ubmark-vvadd-xcel-test1 | less
 00000604 <vvadd_xcel(int*, int*, int*, int)>:
    604:  csrw 0x7e1, x11
    608:  csrw 0x7e2, x12
    60c:  csrw 0x7e3, x10
    610:  csrw 0x7e4, x13
    614:  csrw 0x7e0, x0
    618:  csrr x0, 0x7e0
    61c:  jalr x0, x1, 0

Everything looks as expected, so we can now test our accelerated vvadd on the ISA simulator.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --xcel-impl vvadd-fl ../../app/build/ubmark-vvadd-xcel-test1

Notice that we needed to specify the accelerator implementation as a command line option. If we forgot to include this option, then the simulator would use the null accelerator and clearly the accelerated vvadd microbenchmark does not work with the null accelerator! Finally, we can run the same test on the RTL implementation of the processor augmented with the RTL implementation of the vvadd accelerator:

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl --xcel-impl vvadd-rtl \
     ../../app/build/ubmark-vvadd-xcel-test1

Now that we know our accelerated implementation of vvadd is passing its tests, we can evaluate its performance using an evaluation program that serves as the actual microbenchmark, just as with our pure-software implementation. We build it, make sure it works on the ISA simulator, and then evaluate its performance on the RTL implementation of the processor augmented with the RTL implementation of the vvadd accelerator:

 % cd $TOPDIR/app/build
 % make ubmark-vvadd-xcel-eval

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --xcel-impl vvadd-fl ../../app/build/ubmark-vvadd-xcel-eval
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl --xcel-impl vvadd-rtl \
    --stats ../../app/build/ubmark-vvadd-xcel-eval
 num_cycles =  1109
 num_insts_on_processor =  14

Recall that the pure-software vvadd microbenchmark required 1318 cycles. So our accelerator results in a cycle-level speedup of 1.19x. We might ask, where did this speedup come from? Why isn’t the speedup larger? Let’s look at the line trace.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl --xcel-impl vvadd-rtl \
     --trace ../../app/build/ubmark-vvadd-xcel-eval > ubmark-vvadd-xcel.trace

Here is what the line trace looks like for the initial configuration of the accelerator and the first two iterations of the vvadd loop:

cyc   F-stage  D-stage              X     M     W      I$   D$  xcelreq        ST              xcel->memreq            xcel<-memresp    xcelresp
------------------------------------------------------------------------------------------------------------------------------------------------
818:  00000308|csrw 0x7c1, x15     |addi |bne  |addi [(TC)|(I )]              (X  0:0:00000000|.                                       ).
819:  0000030c|addi x18, x00, 0x400|csrw |addi |bne  [(TC)|(I )]              (X  0:0:00000000|.                                       ).
820:  #       |addi x11, x18, 0x190|addi |csrw |addi [(TC)|(I )]              (X  0:0:00000000|.                                       ).
821:  #       |                    |addi |addi |csrw [(RR)|(I )]              (X  0:0:00000000|.                                       ).
822: -#       |                    |     |addi |addi [(RW)|(I )]              (X  0:0:00000000|.                                       ).
823: -#       |                    |     |     |addi [(RU)|(I )]              (X  0:0:00000000|.                                       ).
824: -#       |                    |     |     |     [(RD)|(I )]              (X  0:0:00000000|.                                       ).
825: -00000310|                    |     |     |     [(Wm)|(I )]              (X  0:0:00000000|.                                       ).
826: -#       |addi x12, x00, 0x400|     |     |     [(I )|(I )]              (X  0:0:00000000|.                                       ).
827: -00000314|                    |addi |     |     [(TC)|(I )]              (X  0:0:00000000|.                                       ).
828: -00000318|addi x10, x09, 0x000|     |addi |     [(TC)|(I )]              (X  0:0:00000000|.                                       ).
829: -~       |jal  x01, 0x1fff30  |addi |     |addi [(TC)|(I )]              (X  0:0:00000000|.                                       ).
830: -00000248|                    |jal  |addi |     [(TC)|(I )]              (X  0:0:00000000|.                                       ).
831: -0000024c|csrw 0x7e1, x11     |     |jal  |addi [(TC)|(I )]              (X  0:0:00000000|.                                       ).
832: -#       |csrw 0x7e2, x12     |csrw |     |jal  [(TC)|(I )]wr:01:00000590(X  0:0:00000000|.                                       ).
833: -#       |                    |csrw |csrw |     [(RR)|(I )]wr:02:00000400(X  0:0:00000000|.                                       )wr:
834: -#       |                    |     |csrw |csrw [(RW)|(I )]              (X  0:0:00000000|.                                       )wr:
835: -#       |                    |     |     |csrw [(RU)|(I )]              (X  0:0:00000000|.                                       ).
836: -#       |                    |     |     |     [(RD)|(I )]              (X  0:0:00000000|.                                       ).
837: -00000250|                    |     |     |     [(Wm)|(I )]              (X  0:0:00000000|.                                       ).
838: -#       |csrw 0x7e3, x10     |     |     |     [(I )|(I )]              (X  0:0:00000000|.                                       ).
839: -00000254|                    |csrw |     |     [(TC)|(I )]wr:03:000ffe3c(X  0:0:00000000|.                                       ).
840: -00000258|csrw 0x7e4, x13     |     |csrw |     [(TC)|(I )]              (X  0:0:00000000|.                                       )wr:
841: -0000025c|csrw 0x7e0, x00     |csrw |     |csrw [(TC)|(I )]wr:04:00000064(X  0:0:00000000|.                                       ).
842: -#       |csrr x00, 0x7e0     |csrw |csrw |     [(TC)|(I )]wr:00:00000000(X  0:0:00000000|.                                       )wr:
843: -#       |                    |csrrx|csrw |csrw [(RR)|(I )]rd:00:        (X  0:0:00000000|.                                       )wr:
844: -#       |                    |     |#    |csrw [(RW)|(I )].             (RD 0:0:00000000|rd:00:00000590:                         ).
845: -#       |                    |     |#    |     [(RU)|(TC)].             (RD 0:0:00000000|#                                       ).
846: -#       |                    |     |#    |     [(RD)|(RR)].             (RD 0:0:00000000|#                                       ).
847: -#       |                    |     |#    |     [(Wm)|(RW)].             (RD 0:0:00000000|#                                       ).
848: -#       |                    |     |#    |     [(Wm)|(RU)].             (RD 0:0:00000000|#                                       ).
849: -#       |                    |     |#    |     [(Wm)|(RD)].             (RD 0:0:00000000|#                                       ).
850: -#       |                    |     |#    |     [(Wm)|(Wm)].             (RD 0:0:00000000|#                       rd:01:0:00000017).
851: -#       |                    |     |#    |     [(Wm)|(I )].             (RD 1:1:00000000|rd:00:00000400:                         ).
852: -#       |                    |     |#    |     [(Wm)|(TC)].             (RD 0:1:00000000|.                                       ).
853: -#       |                    |     |#    |     [(Wm)|(RR)].             (RD 0:1:00000000|.                                       ).
854: -#       |                    |     |#    |     [(Wm)|(RW)].             (RD 0:1:00000000|.                                       ).
855: -#       |                    |     |#    |     [(Wm)|(RU)].             (RD 0:1:00000000|.                                       ).
856: -#       |                    |     |#    |     [(Wm)|(RD)].             (RD 0:1:00000000|.                                       ).
857: -#       |                    |     |#    |     [(Wm)|(Wm)].             (RD 0:1:00000000|.                       rd:01:0:00000033).
858: -#       |                    |     |#    |     [(Wm)|(I )].             (RD 1:2:00000000|.                                       ).
859: -#       |                    |     |#    |     [(Wm)|(I )].             (+  0:2:00000033|.                                       ).
860: -#       |                    |     |#    |     [(Wm)|(I )].             (WR 0:2:00000033|wr:00:000ffe3c:0000004a                 ).
861: -#       |                    |     |#    |     [(Wm)|(TC)].             (W  0:0:00000033|.                       wr:01:1:        ).
862: -#       |                    |     |#    |     [(Wm)|(WD)].             (W  1:0:00000000|.                                       ).
863: -#       |                    |     |#    |     [(Wm)|(I )].             (RD 0:0:00000000|rd:00:00000594:                         ).
864: -#       |                    |     |#    |     [(Wm)|(TC)].             (RD 0:0:00000000|rd:00:00000404:         rd:01:1:00000000).
865: -#       |                    |     |#    |     [(Wm)|(TC)].             (RD 1:1:00000000|.                       rd:01:1:00000047).
866: -#       |                    |     |#    |     [(Wm)|(I )].             (RD 1:2:00000000|.                                       ).
867: -#       |                    |     |#    |     [(Wm)|(I )].             (+  0:2:00000047|.                                       ).
868: -#       |                    |     |#    |     [(Wm)|(I )].             (WR 0:2:00000047|wr:00:000ffe40:00000047                 ).
869: -#       |                    |     |#    |     [(Wm)|(TC)].             (W  0:0:00000047|.                       wr:01:1:        ).
870: -#       |                    |     |#    |     [(Wm)|(WD)].             (W  1:0:00000000|.                                       ).

I have cleaned up the line trace a bit to annotate the columns and make it more compact. The ST column is the current state of the vvadd accelerator FSM. You can see the processor executing the CSRW instructions to configure the accelerator, and these instructions then turn into messages over the xcel.req interface. The accelerator is in the XCFG state receiving these messages until it receives the write to xr0, which causes the accelerator to move into the M_RD state (shown as RD in the trace). The accelerator sends the first memory read request into the memory system, but this causes a data cache miss, so the accelerator stalls in the M_RD state. Once the data from the first read is returned, the accelerator sends out a second memory read request, but again this causes a cache miss. Once the data from both reads has returned, the accelerator does the addition, and finally sends a memory write request with the result into the memory system. The second iteration is faster for the same reason the second iteration of the pure-software vvadd microbenchmark was faster: we are exploiting spatial locality in the data cache, so the sequential reads hit in the same cache line.

We know that every fourth iteration should look like the first iteration (19 cycles) and the remaining iterations should look like the second iteration (8 cycles). Since there are 100 iterations, the total number of cycles should be about 1075, but our simulator reported 1109 cycles. Again, the discrepancy is due to the extra cycles required to configure the accelerator and to call and return from the vvadd_xcel function. So the accelerator is a little faster than the processor since it requires fewer cycles per iteration, but notice that the execution time is largely dominated by the miss penalty. The accelerator does not really help reduce or hide this miss latency.
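The back-of-the-envelope estimate above can be written out explicitly; this is just a quick calculation, not output from the simulator:

 # Rough cycle-count estimate for the accelerated vvadd loop: every fourth
 # iteration misses in the data cache (~19 cycles), and the remaining
 # iterations hit in the cache line brought in by that miss (~8 cycles).

 num_iters  = 100
 miss_iters = num_iters // 4            # 25 iterations with a miss
 hit_iters  = num_iters - miss_iters    # 75 iterations that hit

 print( miss_iters*19 + hit_iters*8 )   # 25*19 + 75*8 = 1075 cycles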

There is certainly room for improvement. We can probably remove some of the bubbles and improve the accelerator performance by a couple more cycles. The accelerator could also potentially use wide accesses to the data cache to retrieve four words at a time and then process all four words in parallel. The accelerator could also potentially achieve better performance by issuing multiple memory requests to a non-blocking cache. Eventually we should be able to optimize such an accelerator so that it is memory bandwidth limited (i.e., we are doing a memory request every cycle).

As with the pure-software vvadd microbenchmark, sometimes it is useful to run experiments assuming a perfect cache to help isolate the impact of the memory system. Again, this is easy to do with our simulator. We simply avoid inserting any cache at all and instead have the processor and accelerator communicate directly with the test memory, so every memory request takes a single cycle.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --xcel-impl vvadd-rtl \
     --stats ../../app/build/ubmark-vvadd-xcel-eval
 num_cycles =  817

Recall the pure-software vvadd microbenchmark took 1012 cycles without any cache misses, so the accelerator is now able to improve the performance by 1.23x.

TinyRV2 Processor + VVADD Accelerator ASIC

NOTE: The cache is not quite ready to be pushed through the flow, so for now the tutorial will just focus on pushing the processor and accelerator through the flow without the cache. Eventually, you would replace --cache-impl null with --cache-impl rtl below to push the processor, cache, and accelerator through the flow. Everywhere you see ProcXcel, it would instead be ProcMemXcel. We also include some results from previously pushing the processor, cache, and accelerator through the flow.

Now that we have some results suggesting our vvadd accelerator is able to improve the cycle-level performance by 1.19x, we can push the processor, memory, accelerator composition through the flow to see the overall impact on area, energy, and timing. We can use the pmx-sim simulator to translate the design into Verilog and to generate the Verilog test benches we need for 4-state and gate-level simulation.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null --xcel-impl vvadd-rtl \
     --translate --dump-vtb ../../app/build/ubmark-vvadd-xcel-test1
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null --xcel-impl vvadd-rtl \
     --stats --translate --dump-vtb ../../app/build/ubmark-vvadd-xcel-eval
 % ls
 ...
 ProcMemXcel_vvadd__pickled.v
 ProcMemXcel_vvadd_rtl_pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-test1_tb.v.cases
 ProcMemXcel_vvadd_rtl_pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-eval_tb.v.cases

Then we create a build directory for the ASIC toolflow.

 % mkdir -p $TOPDIR/asic/build-tut9-pmx-vvadd
 % cd $TOPDIR/asic/build-tut9-pmx-vvadd
 % mflowgen run --design ../tut9-pmx-vvadd

Now we can push the design through the flow step-by-step:

 % cd $TOPDIR/asic/build-tut9-pmx-vvadd
 % make brg-rtl-4-state-vcssim
 % make brg-synopsys-dc-synthesis
 % make post-synth-gate-level-simulation
 % make post-synth-power-analysis
 % make brg-cadence-innovus-signoff
 % make brg-flow-summary

Here is the summary:

==========================================================================
  Summary
==========================================================================

design_name = ProcXcel_vvadd_rtl

area & timing
  design_area   = 20059.858 um^2
  stdcells_area = 20059.858 um^2
  macros_area   = 0.0 um^2
  chip_area     = 59958.688 um^2
  core_area     = 28870.842 um^2
  constraint    = 3.0 ns
  slack         = 0.376 ns
  actual_clk    = 2.624 ns

==========================================================================
 4-State Sim Results
==========================================================================
[PASSED]: pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-test1
[PASSED]: pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-eval

==========================================================================
 Fast-Functional Sim Results
==========================================================================
[PASSED]: pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-test1
[PASSED]: pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-eval

==========================================================================
 Fast-Functional Power Analysis
==========================================================================

 power & energy

  pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-test1.vcd
      exec_time = 652 cycles
      power     = 3.809 mW
      energy    = 7.4504 nJ

  pmx-sim-vvadd-rtl-ubmark-vvadd-xcel-eval.vcd
      exec_time = 2697 cycles
      power     = 3.79 mW
      energy    = 30.6649 nJ

You can run all the steps at once by just using the default make target (i.e., just enter make). Our processor and accelerator pass 4-state and gate-level simulation and meet timing. Note that as before you cannot use the value for exec_time from this final summary. The summary reports 2697 cycles because that includes the time for initialization and verification, while the cycle count reported by pmx-sim (i.e., 817) only counts cycles while stats are enabled.

We know from our earlier experiment with the dummy ubmark that ~25.5nJ is spent on initialization and verification, so we can estimate the energy for just the vvadd accelerator version to be 30.7 - 25.5 = 5.2nJ. Recall that the energy for the pure-software implementation was 10.9nJ. This means the accelerator reduces the energy of the vvadd kernel by roughly 10.9/5.2 = 2.1x!
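Written out, this estimate is just a subtraction followed by a ratio; the 25.5nJ and 10.9nJ figures come from the earlier experiments in this tutorial:

 # Estimate the energy of just the accelerated vvadd kernel by subtracting the
 # approximate initialization/verification energy measured with the dummy ubmark.

 total_energy_nJ = 30.7   # from the power analysis summary above
 init_verif_nJ   = 25.5   # from the earlier dummy ubmark experiment
 pure_sw_nJ      = 10.9   # pure-software vvadd kernel energy from earlier

 xcel_vvadd_nJ = total_energy_nJ - init_verif_nJ   # ~5.2 nJ
 print( pure_sw_nJ / xcel_vvadd_nJ )               # ~2.1x energy reduction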

NOTE: The rest of this section provides some results from a previous version of this tutorial that included the cache. It is included here to give students some insight into a more realistic design in which the processor, cache, and accelerator are pushed through the ASIC flow together.

Notice that the area of the processor, cache, and accelerator composition is 988,840 um^2, while the area of the baseline processor and cache composition is 964,955 um^2. Adding the vvadd accelerator to the baseline design results in an area overhead of just 2%, mostly due to the area consumed by the accelerator's registers and queues. The area breakdown confirms that most of the area stays the same except for the addition of the accelerator.

                          Global         Local
                          Cell Area      Cell Area
                          -------------- ------------------------
 Hierarchical cell        Abs                    Non     Black-
                          Total    %     Comb    Comb    boxes
 ------------------------ -------- ----- ------- ------- --------  --------------------------
 ProcMemXcel_vvadd_rtl    914452.1 100.0  1264.4     0.0      0.0  ProcMemXcel_vvadd_rtl
 proc                     124253.7  13.6   166.8     0.0      0.0  ProcPRTL
 proc/ctrl                  5759.0   0.6  2955.5  1219.2      0.0  ProcCtrlPRTL
 proc/dpath               108875.0  11.9   359.4     0.0      0.0  ProcDpathPRTL
 proc/dpath/alu_X          13868.2   1.5 13868.2     0.0      0.0  AluPRTL
 proc/dpath/imul           25097.0   2.7 21593.0     0.0      0.0  IntMulScycleRTL
 proc/dpath/rf             41881.1   4.6 16082.8 24684.1      0.0  RegisterFile
 xcel                      24461.1   2.7  6559.9  7196.7      0.0  VvaddXcelPRTL
 imem                     372201.5  40.7   182.4     0.0      0.0  BlockingCachePRTL_0
 imem/ctrl                 67863.8   7.4  1500.3    99.5      0.0  BlockingCacheCtrlPRTL_0
 imem/dpath               302570.9  33.1  4712.1     0.0      0.0  BlockingCacheDpathPRTL_0
 imem/dpath/tag_array_0    24130.1   2.6    12.9     0.0  24117.2  SRAM_32x256_1P_wrapper_1
 imem/dpath/tag_array_1    24130.1   2.6    12.9     0.0  24117.2  SRAM_32x256_1P_wrapper_0
 imem/dpath/data_array_0  117996.9  12.9    38.7     0.0 117958.2  SRAM_128x256_1P_wrapper_0
 imem/dpath/data_array_1  118013.5  12.9    55.2     0.0 117958.2  SRAM_128x256_1P_wrapper_1
 funnel                     1434.9   0.2  1166.7     0.0      0.0  Funnel
 router                     1217.4   0.1  1217.4     0.0      0.0  Router
 dmem                     389618.8  42.6   435.9     0.0      0.0  BlockingCachePRTL_1
 dmem/ctrl                 70591.7   7.7  1550.1    99.5      0.0  BlockingCacheCtrlPRTL_1
 dmem/dpath               315555.4  34.5  8992.9     0.0      0.0  BlockingCacheDpathPRTL_1
 dmem/dpath/tag_array_0    24138.4   2.6    21.1     0.0  24117.2  SRAM_32x256_1P_wrapper_3
 dmem/dpath/tag_array_1    24135.6   2.6    18.4     0.0  24117.2  SRAM_32x256_1P_wrapper_2
 dmem/dpath/data_array_0  118225.5  12.9   267.2     0.0 117958.2  SRAM_128x256_1P_wrapper_2
 dmem/dpath/data_array_1  118286.3  12.9   328.0     0.0 117958.2  SRAM_128x256_1P_wrapper_3
 ------------------------ -------- ----- ------- ------- --------  --------------------------

A close look at the critical path shows it to be similar to what we saw in the baseline processor, so the accelerator does not impact the overall cycle time.

We can use our earlier results for the energy overhead of initialization and verification to estimate the energy of using the vvadd accelerator to be 426-354 = 72nJ. This is about 50% of the energy required when using the pure-software vvadd microbenchmark. We can dig into the power reports, although this is somewhat complicated by the fact that the power reports are for the entire program execution as opposed to only when stats are enabled.

                                               Int      Switch   Leak     Total
Hierarchy                                      Power    Power    Power    Power    %
----------------------------------------------------------------------------------------
ProcMemXcel_vvadd_rtl                          4.53e-02 1.30e-02 2.51e-03 6.08e-02 100.0
  proc (ProcPRTL)                              9.74e-03 4.34e-03 6.67e-04 1.48e-02  24.3
    ctrl (ProcCtrlPRTL)                        6.96e-04 3.50e-04 3.22e-05 1.08e-03   1.8
    dpath (ProcDpathPRTL)                      8.21e-03 3.80e-03 5.65e-04 1.26e-02  20.7
      imul (IntMulScycleRTL)                   1.88e-03 1.08e-03 1.36e-04 3.10e-03   5.1
      alu_X (AluPRTL)                          8.59e-04 6.99e-04 6.01e-05 1.62e-03   2.7
      rf (RegisterFile)                        1.13e-03 2.86e-04 1.93e-04 1.61e-03   2.6
    imemreq_queue (TwoElementBypassQueue)      4.91e-04 1.00e-04 3.66e-05 6.27e-04   1.0
    xcelreq_queue (SingleElementBypassQueue)   6.99e-05 1.69e-05 8.22e-06 9.50e-05   0.2
  xcel (VvaddXcelPRTL)                         3.13e-03 1.08e-04 1.14e-04 3.36e-03   5.5
    idx (Reg_1)                                3.60e-04 7.73e-07 3.62e-06 3.65e-04   0.6
    base_dest (Reg_2)                          3.89e-04 5.60e-09 3.69e-06 3.93e-04   0.6
    sum_val (Reg_3)                            3.12e-04 2.94e-07 3.57e-06 3.15e-04   0.5
    base_src0 (Reg_0)                          3.32e-04 9.88e-10 3.43e-06 3.35e-04   0.6
    base_src1 (Reg_4)                          3.14e-04 2.59e-10 3.35e-06 3.18e-04   0.5
    size (Reg_5)                               3.59e-04 3.22e-09 3.62e-06 3.63e-04   0.6
    num_src0 (Reg_6)                           3.17e-04 7.83e-07 3.60e-06 3.21e-04   0.5
    num_src1 (Reg_7)                           3.15e-04 8.08e-07 3.59e-06 3.19e-04   0.5
    memreq_sent (Reg_1)                        4.53e-05 7.08e-07 4.63e-07 4.65e-05   0.1
    memrsp_recv (Reg_0)                        4.38e-05 4.96e-07 4.57e-07 4.48e-05   0.1
    xcelreq_q (SingleElementPipelinedQueue)    3.79e-05 6.67e-06 5.00e-06 4.96e-05   0.1
    memreq_q (SingleElementBypassQueue_2)      1.06e-04 2.40e-05 1.34e-05 1.44e-04   0.2
    memresp_q (SingleElementPipelinedQueue)    5.10e-05 1.26e-05 4.02e-06 6.76e-05   0.1
  imem (BlockingCachePRTL_0)                   1.90e-02 2.10e-03 7.46e-04 2.18e-02  35.9
    resp_bypass (SingleElementBypassQueue_0)   3.32e-04 8.55e-05 9.29e-06 4.27e-04   0.7
    dpath (BlockingCacheDpathPRTL_0)           1.68e-02 1.12e-03 4.17e-04 1.83e-02  30.1
      read_data_reg (RegEnRst_1)               1.14e-03 4.60e-04 2.46e-05 1.63e-03   2.7
      tag_array_0 (SRAM_32x256_1P_wrapper_1)   1.63e-03 3.11e-06 4.40e-05 1.67e-03   2.8
      tag_array_1 (SRAM_32x256_1P_wrapper_0)   1.61e-03 4.60e-07 4.40e-05 1.65e-03   2.7
      data_array_0 (SRAM_128x256_1P_wrapper_0) 5.82e-03 2.40e-05 1.02e-04 5.94e-03   9.8
      data_array_1 (SRAM_128x256_1P_wrapper_1) 5.74e-03 2.38e-06 1.02e-04 5.85e-03   9.6
  funnel (Funnel)                              1.06e-04 7.65e-05 1.40e-05 1.96e-04   0.3
  router (Router)                              2.55e-05 3.68e-05 1.64e-05 7.87e-05   0.1
  dmem (BlockingCachePRTL_1)                   1.24e-02 4.19e-03 9.34e-04 1.76e-02  28.9
    dpath (BlockingCacheDpathPRTL_1)           8.50e-03 1.02e-03 5.59e-04 1.01e-02  16.6
      read_data_reg (RegEnRst_3)               3.52e-04 1.88e-04 3.65e-05 5.77e-04   0.9
      data_array_0 (SRAM_128x256_1P_wrapper_2) 3.26e-03 4.23e-05 1.04e-04 3.41e-03   5.6
      data_array_1 (SRAM_128x256_1P_wrapper_3) 2.26e-03 1.67e-05 1.05e-04 2.38e-03   3.9
      tag_array_0 (SRAM_32x256_1P_wrapper_3)   7.95e-04 1.26e-05 4.41e-05 8.51e-04   1.4
      tag_array_1 (SRAM_32x256_1P_wrapper_2)   5.58e-04 2.95e-06 4.43e-05 6.05e-04   1.0
    ctrl (BlockingCacheCtrlPRTL_1)             3.66e-03 3.01e-03 3.34e-04 7.01e-03  11.5
    resp_bypass (SingleElementBypassQueue_1)   1.63e-04 3.62e-05 3.75e-05 2.36e-04   0.4

Both the pure-software vvadd microbenchmark and the accelerator likely access the data cache about the same amount, but the accelerator would have many fewer instruction cache accesses, and this in turn is probably one reason for the reduced energy when using the accelerator. You can see this in the report. In the pure-software vvadd microbenchmark, the instruction cache accounts for 45% of the power, but when we use the accelerator, the instruction cache only accounts for 36% of the power. The accelerator consumes 5.5% of the total power, with this power being evenly distributed across various registers and queues.

We can also generate the obligatory amoeba plot. The accelerator is shown in the middle of the amoeba plot. Notice how the tools have used a different SRAM placement than for the baseline processor. Also notice how the instruction and data cache are spread out and intermingled a bit. The tools use all kinds of heuristics for placement, and it is not always obvious why they have decided to place a certain module in a specific place.

Here is the final summary of the performance, area, energy, and timing comparison between the pure-software vvadd microbenchmark and the accelerated vvadd microbenchmark. The accelerator seems to be a nice win in terms of energy, but we would probably need to invest further effort to improve the overall performance benefit.

 metric        pure-software    accelerated      relative
 ------------  ---------------  ---------------  --------
 cycles        1310             1111             1.18x
 cycles*       1012             817              1.23x
 cycle time    2.13 ns          2.13 ns          1.00x
 area          964,955 um^2     988,840 um^2     1.02x
 energy        156 nJ           72 nJ            0.46x

(*) assuming no cache misses

Using Verilog Accelerator RTL Models

Up until this point, we have been exclusively using PyMTL models for the processor, cache, and accelerator. PyMTL enables composition of pure-PyMTL and Verilog components, so it is also possible to compose a pure-PyMTL implementation of the processor and cache with a Verilog accelerator.

You can see a Verilog version of the vvadd accelerator in sim/tut9_xcel/VvaddXcelVRTL.v. To switch to using this version, simply change the rtl_language variable in sim/tut9_xcel/VvaddXcelRTL.py to look like this:

rtl_language = 'verilog'

Then we rerun the unit tests for the vvadd accelerator:

 % cd $TOPDIR/sim/build
 % pytest ../tut9_xcel

Then we can rerun the test and evaluation programs on the simulator with the Verilog version of the vvadd accelerator.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl --xcel-impl vvadd-rtl \
     ../../app/build/ubmark-vvadd-xcel-test1
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl rtl --xcel-impl vvadd-rtl \
     --stats ../../app/build/ubmark-vvadd-xcel-eval
 num_cycles =  1111

The number of cycles is essentially the same as what we saw with the PyMTL version of the vvadd accelerator, since both accelerators use basically the same design. Feel free to examine the line trace, which should be similar in spirit to what we saw with the PyMTL vvadd accelerator.

Finally, we can push the pure-PyMTL implementation of the processor and cache along with the Verilog implementation of the vvadd accelerator through the ASIC toolflow.

 % cd $TOPDIR/sim/build
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null --xcel-impl vvadd-rtl \
     --translate --dump-vtb ../../app/build/ubmark-vvadd-xcel-test1
 % ../pmx/pmx-sim --proc-impl rtl --cache-impl null --xcel-impl vvadd-rtl \
     --stats --translate --dump-vtb ../../app/build/ubmark-vvadd-xcel-eval

 % mkdir -p $TOPDIR/asic/build-tut9-pmx-vvadd
 % cd $TOPDIR/asic/build-tut9-pmx-vvadd
 % mflowgen run --design ../tut9-pmx-vvadd
 % make

The results should be similar to what we saw for the PyMTL version of the vvadd accelerator. You are free to implement your accelerators in this course using either PyMTL or Verilog.