HLS Notes (1)


There is very little material about HLS online, so here are some notes.



Verilog design is mostly concerned with how memories are used and how many resources are consumed.


HLS design is mostly concerned with whether the algorithm can be pipelined, run as a dataflow, or unrolled, so it usually trades area for time.


Also, parallelism in Verilog mostly comes from module instantiation and always blocks. HLS has no always, so functions are used to generate modules and get parallelism.

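However, functions that are too small are inlined automatically by HLS, so check the synthesis result, and turn inlining off (inline off) if it is not what you want.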
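A minimal sketch of the point above, assuming two hypothetical helper functions (`scale`, `offset`) and a hypothetical top function `top`. The `#pragma HLS INLINE off` directive asks the tool to keep each helper as its own module; a normal C++ compiler just ignores the pragma, so the code can still be simulated on a PC.

```cpp
// All names and sizes here are hypothetical, chosen only for illustration.
const int N = 8;

// Small helper: HLS would normally inline a function this small.
// INLINE off asks the tool to keep it as a separate module.
int scale(int x) {
#pragma HLS INLINE off
    return x * 3 + 1;
}

// Second small helper, also kept as its own module.
int offset(int x) {
#pragma HLS INLINE off
    return x - 5;
}

// Top-level function: the two helper calls do not depend on each other,
// so the scheduler is free to place them side by side.
void top(const int in[N], int out_a[N], int out_b[N]) {
    for (int i = 0; i < N; ++i) {
        out_a[i] = scale(in[i]);
        out_b[i] = offset(in[i]);
    }
}
```

Whether the helpers really end up as separate modules still has to be checked in the synthesis report.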


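So when writing the C algorithm, think first about how the data is stored and accessed, rather than writing the kind of C code that is efficient on a PC.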



Update 0314 ------------------------------


Static variables can be used. They are synthesized into registers, because the value has to be kept across multiple invocations.
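A small sketch of a static variable holding state between calls. The function name `running_sum` and the `reset` argument are made up for this example.

```cpp
// Hypothetical running-sum block; name and reset input are illustrative only.
int running_sum(int sample, bool reset) {
    // 'static' keeps the value between calls, so the tool has to hold it
    // in a register rather than recompute it.
    static int acc = 0;
    if (reset) {
        acc = 0;
    }
    acc += sample;
    return acc;
}
```

Calling it as `running_sum(5, true)` and then `running_sum(3, false)` returns 5 and then 8, since the second call sees the value left by the first.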


Constants (for example, const arrays) are synthesized into ROM.
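As an illustration, a const lookup table like the one below is the kind of thing that can end up as a ROM. The table contents and the name `square_lookup` are assumptions for this sketch.

```cpp
// Hypothetical 8-entry table of squares. The values never change,
// so the array is a candidate for a ROM.
const int SQUARES[8] = {0, 1, 4, 9, 16, 25, 36, 49};

int square_lookup(unsigned idx) {
    // Mask the index so it always stays inside the 8-entry table.
    return SQUARES[idx & 7u];
}
```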


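Global variables are synthesized into registers by default. They can be exported as I/O through a configuration setting, but Xilinx does not recommend heavy use of global variables.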


HLS supports pointers and arrays of pointers:

Vivado HLS supports pointers to pointers for synthesis but does not support them on the
top-level interface, that is, as argument to the top-level function. If you use a pointer to
pointer in multiple functions, Vivado HLS inlines all functions that use the pointer to
pointer. Inlining multiple functions can increase run time.


Arrays of pointers can also be synthesized. See the following code example in which an
array of pointers is used to store the start location of the second dimension of a global
array. The pointers in an array of pointers can point only to a scalar or to an array of scalars.
They cannot point to other pointers.
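The code example that the quoted passage refers to is not reproduced in this post. Below is a sketch along the same lines, with hypothetical names and sizes (`A`, `ROWS`, `COLS`, `pointer_array`): an array of pointers stores the start of each row of a global 2-D array.

```cpp
// Sizes and names are assumptions for this sketch.
const int ROWS = 4;
const int COLS = 10;

// Global 2-D array.
int A[ROWS][COLS];

int pointer_array(const int B[ROWS * COLS]) {
    // Array of pointers: each entry points at the first element of one row.
    // The pointers point to scalars only, never to other pointers.
    int* row_ptr[ROWS];
    for (int i = 0; i < ROWS; ++i) {
        row_ptr[i] = &A[i][0];
    }
    // Copy the flat input into A through the row pointers.
    for (int i = 0; i < ROWS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            *(row_ptr[i] + j) = B[i * COLS + j];
        }
    }
    // Sum the stored values, again through the row pointers.
    int sum = 0;
    for (int i = 0; i < ROWS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            sum += *(row_ptr[i] + j);
        }
    }
    return sum;
}
```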


Wire, handshake, or FIFO interfaces can be used only on streaming data. It cannot be used
in conjunction with pointer arithmetic (unless it indexes the data starting at zero and then
proceeds sequentially).


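As for branches, there is no need to worry about their cost on an FPGA. A branch on an FPGA does not throw away prefetched instructions the way it does on a CPU: the circuits for all branches already exist, and the branch only selects which one is used.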

In a CPU architecture, conditional or branch operations are often avoided. When the
program needs to branch it loses any instructions stored in the CPU fetch pipeline. In an
FPGA architecture, a separate path already exists in the hardware for each conditional
branch and there is no performance penalty associated with branching inside a pipelined
task. It is simply a case of selecting which branch to use.
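A minimal sketch of a branch inside a pipelined loop. The function name, the threshold of 100, and the array size are all made up; the `PIPELINE` pragma is ignored by an ordinary C++ compiler.

```cpp
// Hypothetical example: clip large values, double the rest.
const int N = 8;

void clip_or_double(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        int x = in[i];
        // Both arms exist as hardware at the same time;
        // the condition only selects which result is used.
        if (x > 100) {
            out[i] = 100;
        } else {
            out[i] = x * 2;
        }
    }
}
```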



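To make the data processing in a program match the FPGA structure better, HLS provides the class hls::stream<T> for processing data as a stream, but this class can only be used in C++.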



Some suggestions for writing hardware-efficient code with HLS

Summary of C for Efficient Hardware


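Buffer data in a cache built from flip-flops (FFs). This allows concurrent access to the data, gets around the RAM bottleneck, makes data access more efficient, and reduces the clock cycles spent reading and writing data.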
Minimize data input reads. Once data has been read into the block it can easily feed many
parallel paths but the input ports can be bottlenecks to performance. Read data once and
use a local cache if the data must be reused.
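One way to picture "read once, reuse from a local cache" is a 3-tap moving sum. This is only a sketch under assumed names (`moving_sum3`, `window`): each input is read once and then reused from a small window that is fully partitioned into registers.

```cpp
// Hypothetical 3-tap moving sum; names and sizes are illustrative.
const int N = 8;

void moving_sum3(const int in[N], int out[N]) {
    // Small local cache. Complete partitioning turns each element
    // into its own register, so all three can be read in one cycle.
    int window[3] = {0, 0, 0};
#pragma HLS ARRAY_PARTITION variable=window complete
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        // Shift the window and read the input exactly once per iteration.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = in[i];
        // All three taps come from the local window, not from the input port.
        out[i] = window[0] + window[1] + window[2];
    }
}
```

Without the window, the loop body would read `in[i-2]`, `in[i-1]` and `in[i]` on every iteration, which is three reads on the same input port.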

Reduce accesses to arrays, and keep array reads and writes concentrated in one place.
Minimize accesses to arrays, especially large arrays. Arrays are implemented in block RAM
which like I/O ports only have a limited number of ports and can be bottlenecks to
performance. Arrays can be partitioned into smaller arrays and even individual registers but
partitioning large arrays will result in many registers being used. Use small localized caches
to hold results such as accumulations and then write the final result to the array.
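A sketch of the "accumulate locally, write once" advice, with hypothetical names and sizes. The running total lives in a local variable, and the output array is written only once per row.

```cpp
// Hypothetical row-sum example; sizes are assumptions.
const int ROWS = 4;
const int COLS = 8;

void row_sums(const int in[ROWS][COLS], int sums[ROWS]) {
    for (int r = 0; r < ROWS; ++r) {
        // Local accumulator: a register, not a location in the output array.
        int acc = 0;
        for (int c = 0; c < COLS; ++c) {
#pragma HLS PIPELINE II=1
            acc += in[r][c];
        }
        // Single write to the output array per row.
        sums[r] = acc;
    }
}
```

Writing `sums[r] += in[r][c]` inside the inner loop would compute the same result, but with a read and a write of `sums` on every iteration.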

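Put branch code inside functions and for loops, not outside them. That way the branch gets pipelined together with everything else, which makes the code more efficient.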
Seek to perform conditional branching inside pipelined tasks rather than conditionally
execute tasks, even pipelined tasks. Conditionals will be implemented as separate paths in
the pipeline. Allowing the data from one task to flow into with the conditional performed
inside the next task will result in a higher performing system.
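A rough sketch of the difference, using hypothetical stages `stage1` and `stage2`. Instead of writing `if (enable) stage2(...)` in the caller, the condition is moved inside the second stage, so the stage always runs and the branch becomes part of its pipeline.

```cpp
// Names, sizes and the 'enable' flag are assumptions for this sketch.
const int N = 8;

// Stage 1 always runs.
void stage1(const int in[N], int mid[N]) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        mid[i] = in[i] + 1;
    }
}

// Stage 2 also always runs. The decision is taken per element
// inside the pipelined loop rather than around the call.
void stage2(const int mid[N], bool enable, int out[N]) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = enable ? mid[i] * 2 : mid[i];
    }
}

void top(const int in[N], bool enable, int out[N]) {
    int mid[N];
    stage1(in, mid);
    stage2(mid, enable, out);
}
```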


Minimize output writes for the same reason as input reads: ports are bottlenecks.
Replicating additional ports simply pushes the issue further out into the system.

Use the hls::stream class to stream data.
For C code which processes data in a streaming manner, consider using hls::streams as
these will enforce good coding practices. It is much more productive to design an algorithm
in C which will result in a high-performance FPGA implementation than debug why the
FPGA is not operating at the performance required.
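A sketch of the streaming style. The hls_stream.h header is not available to an ordinary C++ compiler, so this example uses a small stand-in class, `fake_stream`, which is purely an assumption of this sketch. It only mimics the `read()`, `write()` and `empty()` members of hls::stream<T>. In a real HLS project you would include hls_stream.h and pass hls::stream<int> objects instead.

```cpp
#include <queue>

// PC-only stand-in (hypothetical, not part of any Xilinx library).
// It mimics the read()/write()/empty() members of hls::stream<T>.
template <typename T>
class fake_stream {
public:
    void write(const T& v) { q_.push(v); }
    T read() {
        T v = q_.front();
        q_.pop();
        return v;
    }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};

// Streaming worker: every input is read exactly once, in order,
// and every result is written exactly once, in order.
// There is no random access and no pointer arithmetic.
template <typename Stream>
void add_offset(Stream& in, Stream& out, int offset, int count) {
    for (int i = 0; i < count; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() + offset);
    }
}
```

The stream interface makes it impossible to go back and re-read an earlier element, which is the "good coding practice" the quoted passage refers to.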



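The HLS top level must be a function. For C++ classes, neither the class nor its member functions can be the top level; the class has to be instantiated inside a top-level function.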
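For example, a class can be wrapped like this. The class `Accumulator` and the top function `acc_top` are hypothetical names.

```cpp
// Hypothetical class; it cannot be the top level by itself.
class Accumulator {
public:
    Accumulator() : total_(0) {}
    void add(int x) { total_ += x; }
    int total() const { return total_; }
private:
    int total_;
};

const int N = 4;

// The top level is a plain function; the object is instantiated inside it.
int acc_top(const int in[N]) {
    Accumulator acc;
    for (int i = 0; i < N; ++i) {
        acc.add(in[i]);
    }
    return acc.total();
}
```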


HLS supports C++ templates, but templates cannot be used for the top-level function.
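A minimal sketch of the usual workaround, with made-up names: the template lives in a helper, and a non-template top function fixes the template arguments.

```cpp
// Hypothetical template helper: element type and length
// are resolved at compile time.
template <typename T, int LEN>
T sum_array(const T data[LEN]) {
    T sum = 0;
    for (int i = 0; i < LEN; ++i) {
        sum += data[i];
    }
    return sum;
}

// Non-template top-level function that picks concrete template arguments.
int sum_top(const int in[8]) {
    return sum_array<int, 8>(in);
}
```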



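To reduce latency, the HLS scheduler runs logic operations and functions in parallel as much as it can, but it does not do this for for loops. To get for loops to run in parallel, put them in different functions.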

Vivado HLS schedules logic and functions as early as possible to reduce latency. To
perform this, it schedules as many logic operations and functions as possible in parallel. It
does not schedule loops to execute in parallel.
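A sketch of moving two independent loops into separate functions. All names are hypothetical. `INLINE off` is added on the assumption that the tool would otherwise inline such small functions, which would put the two loops back into one function body.

```cpp
// Names and sizes are assumptions for this sketch.
const int N = 8;

// First loop, wrapped in its own function.
void sum_loop(const int a[N], int& sum) {
#pragma HLS INLINE off
    int acc = 0;
    for (int i = 0; i < N; ++i) {
        acc += a[i];
    }
    sum = acc;
}

// Second loop, independent of the first.
void max_loop(const int b[N], int& max_val) {
#pragma HLS INLINE off
    int best = b[0];
    for (int i = 1; i < N; ++i) {
        if (b[i] > best) {
            best = b[i];
        }
    }
    max_val = best;
}

// Written as two loops directly in this function, they would run
// one after the other. As two independent function calls, the
// scheduler is allowed to start them together.
void two_loops_top(const int a[N], const int b[N], int& sum, int& max_val) {
    sum_loop(a, sum);
    max_loop(b, max_val);
}
```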



Places where HLS can parallelize:

Inside loops

Expressions

Between functions
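For the "inside loops" case, a small sketch using the `UNROLL` pragma on a hypothetical 4-element dot product. With the loop fully unrolled, the four multiplications no longer have to wait for each other, although how many really run at once still depends on how many array ports are available.

```cpp
// Hypothetical 4-element dot product.
const int N = 4;

int dot4(const int a[N], const int b[N]) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {
        // Full unroll: the loop body is copied N times in hardware.
#pragma HLS UNROLL
        acc += a[i] * b[i];
    }
    return acc;
}
```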
