User experiences with HLS and AutoESL's AutoPilot

Thought I'd pass along some of our results using AutoESL's AutoPilot for
high level synthesis (HLS) -- or what you like to call "C synthesis".

We've been tracking HLS claims for several years, but only recently has it
begun to look like a viable option for chip design. I've check-pointed HLS
in the past, and generally returned to RTL, since the QoR with direct RTL
is much higher. (We're in the high-volume chip business and really can't
accept a 10% overhead.)

We recently ran a benchmark through AutoESL's AutoPilot -- which takes an
untimed C model (in C/C++/SystemC) and generates various fixed architectures
of our design with timing. We already had a VHDL RTL version of our design
along with a fixed-point C model of it (still module level though) to
compare against.

During our initial evaluation, we gave the AutoESL AE the C model for our
design. The AutoESL team was ultimately able to automatically generate RTL
that beat the QoR for our optimized hand-coded RTL in terms of performance,
area and power -- the most significant was 12% lower power in the design.

We did a comparison on the final design with Synopsys Design Compiler and
Magma Talus P&R where we built both designs in-house from the source
(hand-coded VHDL RTL and the AutoPilot-generated RTL), then measured power
with real activity vectors.

Block A: AutoESL results vs. hand-coded (both results after running through
Synopsys Design Compiler + Talus P&R)

             AutoPilot 1st result    AutoPilot final result
             --------------------    ----------------------
   Area      2% larger               1% smaller
   Power     2% higher               12% lower
   Latency   40% higher              6% lower

   Original design:            800 lines of ANSI C code
   Hand-coded design:          4,000 lines of VHDL RTL code
   AutoESL-generated design:   100,000 lines of Verilog RTL code

There were two significant things about the benchmark.

1. First, the initial results from AutoESL were in the ballpark but
slightly worse in all categories. But after the AutoESL team tweaked
some of the design constraints (and I'm assuming improved some of
AutoPilot's internal optimizations), they were able to significantly
improve on the result.

2. The second, more surprising item was how little the C model had to
change for AutoESL to get the final results; it was more about
properly constraining the design. They limited the design changes to
replacing some functions with constant coefficients, partitioning a few
functions into smaller blocks, and putting in place some input/output
packing functions. I would estimate that less than 5% of the code
changed, and for the most part the core arithmetic code stayed the same.

I don't have a full accounting of all the time spent, but in this early
ramp-up period, I would think we spent similar amounts of time generating
the C model version of the design as writing our hand-coded VHDL RTL model.
Our hand-coded VHDL RTL included a considerable amount of configurability
with generics, variable arrays, and generate statements. We were trying
to make our VHDL as configurable as possible to allow better architecture
decisions, and hence the line count is probably higher than
if we had simply coded the final design architecture. But the cool thing
we found out, after all the front-end work was done with the C model and
AutoPilot, was that we could use AutoPilot to search a large design space
with very little additional effort. AutoESL handed over all the
scripts containing constraints and directives, and we were able to run
the design in-house.

Our hand-coded RTL was based on concrete specs for throughput and latency,
and had a particular architecture in mind. AutoESL's generated Verilog RTL
design was more flexible, and allowed us to quickly answer these questions:

- what happens if we relax the latency constraint?
- what if we want to halve the clock rate, and run more things in
parallel?
- what if we want to double the clock rate, and halve the parallelism?
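
As a back-of-the-envelope illustration of the clock rate versus
parallelism questions above, here is a rough C++ sketch. The struct, the
function names and the numbers are made up for illustration and are not
taken from the benchmark; it also assumes each parallel unit accepts one
input per clock, which a real design may not.

```cpp
// Hypothetical design point; fields and values are illustrative only.
struct DesignPoint {
    double clock_mhz;    // clock rate in MHz
    int parallel_units;  // datapath copies working in parallel
    int latency_cycles;  // input-to-output latency in clock cycles
};

// Results per microsecond, assuming one input per unit per clock.
double throughput_per_us(const DesignPoint& d) {
    return d.clock_mhz * d.parallel_units;
}

// Latency in nanoseconds: cycle count times the clock period.
double latency_ns(const DesignPoint& d) {
    return d.latency_cycles * 1000.0 / d.clock_mhz;
}
```

With a hypothetical 200 MHz, 4-unit, 10-cycle design, going to 400 MHz
with 2 units and 20 cycles gives the same 800 results per microsecond and
the same 50 ns latency under these assumptions. What the tool actually
produces for area, power and achievable pipelining is exactly what has to
be measured.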

The tool payback starts to become very apparent at this point. In the case
of the hand-coded RTL, we could make some architecture changes because of
the configurability of the VHDL code, but it invariably required some
pipelining adjustments, as well as another round of verification simulation
to get the design working again. AutoESL allows us to change the clock
period or pipelining, and immediately spit out a new working design with
the new constraints.

There was a 25X difference in lines of RTL code generated by AutoPilot
versus our hand-coded VHDL design. It appears that AutoESL's automatically
generated Verilog RTL code is much closer to a netlist-style approach than
hand-coded RTL.


Some gotchas we found using HLS and AutoPilot:

1. One interesting point we have found in the HLS development strategy
is that being able to make sure the code is free of errors is very
important. Uninitialized variables are a big problem, since they can
cause problems in downstream verification.

2. One of our requirements is to be able to run Valgrind on our original
C/C++ source code in order to track down problems before the code goes
through HLS. Initially, our plan for code development was to use the
standard SystemC fixed-point datatypes (sc_fixed, sc_int) for
detailed fixed-point design. But after spending a considerable amount
of time compiling the SystemC libraries into our designs, we found
too many problems. For example, once you include the systemc.h header
file, it pulls in many more header files under the hood. This gave us
several conflicts with our internal datatypes. Furthermore,
we could never get Valgrind to work, since it reported too many
problems in the SystemC internals for us to even begin to debug our
code.

3. Ultimately, we have standardized on the SystemC formats for our coding
style, but we are actually mapping directly to AutoESL's implementation
of the datatypes. For example, we can write the C code with:

sc_fixed<10,2>

but it is remapped to the

ap_fixed<10,2>

with the correct include files. With the AP types, we can easily get
the code Valgrind clean, and it matches the synthesizable code
that AutoPilot will generate. Since AutoPilot supports all 3 languages
(C, C++ and SystemC), we can use this kind of hybrid approach for our
high level synthesis language input.

4. The key to getting a good design is knowing your design targets and
getting the constraints set properly. My experience is that a loosely
constrained design in Design Compiler still gets a reasonable design
output. However, AutoPilot can drastically change the microarchitecture
(pipelining, for example), so it is important to constrain designs
judiciously in order to get expected results, especially for latency and
throughput. We have seen results where you get a larger design and
higher latency if you don't have the constraints set properly.

5. Furthermore, knowing the expected outcome is important to know when to
stop exploring the design. In our case, we had a hand-coded RTL design
to compare against, but in the future the HLS output will be the only
thing you have. Our conclusion is that HLS doesn't relieve the designer
from understanding the intimate details of the block; rather, HLS is a
way to not have to worry about the details of pipelining and RTL
construction, while focusing on architecture tradeoffs. You still need
to have at least a back-of-the-envelope estimate for the design to
sanity check the results.

6. We need to instrument our code with some non-synthesizable constructs
for debug and design creation. Right now in AutoESL, non-synthesizable
constructs cannot be in the code, and we need to use #ifdef
statements to hide them. That leads to more cumbersome
code, and we lose some C++ design styles (abstract base classes,
automatic hierarchy parsing, file I/O). It would be much better for
development if AutoPilot could automatically ignore non-synthesizable
statements.

7. Fundamentally, the output RTL from the tool is very close to a netlist
design, so it will be very challenging to track down a bug in
AutoPilot's RTL. Its cross-probing ability is very limited at this
point, and ECOs will be challenging. It is likely that bug fixes will
require a full resynthesis - which can be a tough problem when an entire
chip is close to tapeout and we need just an ECO to fix a small
problem. This will re-open the netlist, RTL synthesis, P&R, etc.

8. The HLS methodology requires that formal verification of C mature very
quickly. Right now, we assume the output of AutoPilot is correct by
construction, and verify it with directed simulations of the C models
against the output RTL. In order for HLS to become mainstream,
formally verifying the original C model against the RTL is critical in
order to trust the design will work under all conditions.
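
A rough C++ sketch of gotchas 1, 3 and 6 follows. It is illustrative
only: toy_fixed is a made-up stand-in for a tool type like ap_fixed (it
has no rounding or overflow modes), the alias is just one way a
remapping from the sc_fixed spelling could be done, and HLS_SYNTH is a
hypothetical macro name, not something AutoPilot is known to define.

```cpp
// Sketch only. toy_fixed and HLS_SYNTH are hypothetical names for this
// example; they are not AutoPilot or SystemC APIs.
#include <cassert>
#include <cstdint>
#include <cstdio>

// Stand-in for a tool type like ap_fixed<W, I>: W total bits, I integer
// bits (sign included), W - I fractional bits.
template <int W, int I>
struct toy_fixed {
    static_assert(I >= 1 && W > I && W <= 30, "toy type: 1 <= I < W <= 30");
    std::int32_t raw = 0;  // always initialized, so nothing undefined is read

    toy_fixed() = default;
    explicit toy_fixed(double v) {
        // Simulation-only range check: values must lie in
        // [-2^(I-1), 2^(I-1)).
        assert(v >= -double(1 << (I - 1)) && v < double(1 << (I - 1)));
        raw = static_cast<std::int32_t>(v * (1 << (W - I)));
    }
    double to_double() const { return double(raw) / (1 << (W - I)); }
};

// Design code keeps the SystemC-style spelling; this alias decides what it
// maps to. In a real flow the right-hand side would be the tool's own type.
template <int W, int I>
using sc_fixed = toy_fixed<W, I>;

// Example datapath function: halves its input.
sc_fixed<10, 2> scale_half(sc_fixed<10, 2> x) {
    sc_fixed<10, 2> y;  // starts at zero rather than uninitialized
    y.raw = x.raw / 2;
#ifndef HLS_SYNTH
    // Debug-only output, hidden from the synthesis build by the macro.
    std::printf("scale_half: in=%f out=%f\n", x.to_double(), y.to_double());
#endif
    return y;
}
```

Because the design code only ever names sc_fixed, swapping the alias
target changes the underlying type everywhere at once, and building with
the (hypothetical) HLS_SYNTH macro defined drops the debug print without
touching the arithmetic.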

Overall we've had a good experience with AutoPilot. Our evaluation was
strong enough that we purchased AutoPilot to expand on our C-level design
methodology. We are in the process of building several different types of
designs to push the HLS methodology, and ensure an end-to-end methodology
is possible, including making certain that all the backend results are
unaffected.


http://www.deepchip.com/items/0485-04.html