[chbot] Xilinx HLS any good?

hamster hamster at snap.net.nz
Thu Aug 9 22:54:29 BST 2018


 

(sorry in advance if formatting breaks) 

I have only played with it for hobby projects. It does what is on the
label... however...

The "tweak a bit" is an understatement. You need to use quite a few
funky #pragmas, special datatypes, and magic interfaces. What works and
what is a complete fail is not obvious at the C language level.

It works pretty well for hierarchical data pipelines (e.g. a DSP data
pipeline, with samples in and processed data out), as all the timing
for latency and so on is encapsulated inside your function, but
coordination between parallel modules is as much of a nightmare as
always.
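As a rough illustration of the "samples in, processed data out" shape,
here is a minimal sketch of a 4-tap moving average. The function name,
the tap count and the 16-bit sample width are all assumptions made for
the example, and no HLS pragmas are shown; it is plain C that can be
compiled and exercised on a PC.

```c
#include <stdint.h>

#define TAPS 4  /* assumed filter length, a power of two so the divide is a shift */

/* Hypothetical example, not from the original post:
 * one sample in, one averaged sample out per call. */
uint16_t moving_average(uint16_t sample)
{
    /* State kept between calls; a synthesis tool would typically
     * turn this into a small chain of registers. */
    static uint16_t delay_line[TAPS];
    uint32_t sum = 0;
    int i;

    /* Shift the history along by one and store the new sample */
    for (i = TAPS - 1; i > 0; i--) {
        delay_line[i] = delay_line[i - 1];
    }
    delay_line[0] = sample;

    /* Add up the last TAPS samples */
    for (i = 0; i < TAPS; i++) {
        sum += delay_line[i];
    }

    return (uint16_t)(sum >> 2);  /* divide by TAPS (4) with a shift */
}
```

All the state and timing lives inside the one function, which is the
property that makes this style of code a comfortable fit.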

As everything is fully inferred, seemingly trivial changes can suddenly
cause dramatic H/W resource and performance differences, so results are
far less stable from build to build.

As an example that I haven't tried with HLS, but had to hand, here is a
three-digit binary to BCD conversion as a C coder might do it:

unsigned my_module(unsigned input) {
  unsigned rtn = 0;
  rtn = (input/100)%10;
  rtn *= 16;
  rtn += (input/10)%10;
  rtn *= 16;
  rtn += input%10;
  return rtn;
}

It would be absolutely horrific in HLS. It has five divide/modulus
operations, the inputs and outputs don't have explicit sizes so will
default to 32 bits (before static bits get optimized away), and
everything is done in one cycle, so it will be a big mush of LUTs with
no registers/pipelining. But it will work - you will get a working
design that runs very slowly and takes a good chunk of FPGA resources.
If you set a target timing constraint, Vivado HLS will insert registers
as needed to trade off latency and clock rate.

This is how you might choose to implement it in an HLS-friendly way
(explicit sizes, explicit latency, avoiding '/' and '%', lots of bit
shifts), converting one value every 1000 cycles (as maybe it is for a
user display). Note that it is all standard 'C', but very odd 'C':

unsigned_12bits my_module2(unsigned_10bits input) {
  static unsigned_12bits last = 0;
  static unsigned_10bits count = 0;
  static unsigned_4bits ones = 0;
  static unsigned_4bits tens = 0;
  static unsigned_4bits hundreds = 0;
  static unsigned_10bits converting = 0;
  static unsigned_12bits rtn = 0;

  /* Set the outputs */
  if(count == converting) {
    last = (hundreds<<8) | (tens<<4) | ones;
    rtn = (hundreds<<8) | (tens<<4) | ones;
  } else {
    rtn = last;
  }

  if(hundreds == 9 && tens == 9 && ones == 9) {
    /* Reset the counter */
    count = ones = tens = hundreds = 0;
    /* Sample the input, so it can't change mid conversion */
    converting = input;
  } else {
    count++;
    if(tens == 9 && ones == 9) {
      hundreds++;
      tens = ones = 0;
    } else if(ones == 9) {
      tens++;
      ones = 0;
    } else {
      ones++;
    }
  }
  return rtn;
}
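For comparison, another well-known way to avoid '/' and '%' is the
shift-and-add-3 ("double dabble") algorithm. It is a different technique
from the counter above and is not from the original post; the function
name and the use of <stdint.h> types are assumptions for this sketch.
It produces a result after 10 loop iterations rather than waiting for a
1000-cycle counter.

```c
#include <stdint.h>

/* Hypothetical helper, not from the original post: converts a 10-bit
 * binary value (0..1023) to four packed BCD digits using only shifts,
 * compares and small additions. */
uint16_t bin10_to_bcd(uint16_t input)
{
    /* Binary value sits in bits 9..0; the BCD digits build up in
     * bits 25..10 as the value is shifted left one bit at a time. */
    uint32_t scratch = input & 0x3FFu;
    int i;

    for (i = 0; i < 10; i++) {
        /* Before each shift, add 3 to any BCD digit that is 5 or more,
         * so that doubling it carries correctly into the next digit.
         * The thousands digit never exceeds 1, so it needs no check. */
        if (((scratch >> 10) & 0xFu) >= 5u) scratch += 3u << 10; /* ones */
        if (((scratch >> 14) & 0xFu) >= 5u) scratch += 3u << 14; /* tens */
        if (((scratch >> 18) & 0xFu) >= 5u) scratch += 3u << 18; /* hundreds */
        scratch <<= 1;
    }

    return (uint16_t)(scratch >> 10);
}
```

Whether this or the counter is cheaper in a given device depends on how
the tool unrolls the loop, so treat it as something to try rather than a
recommendation.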

But here is how you would most likely end up doing it, just for
simplicity:

unsigned_16bits my_module2(unsigned_10bits input) {
  static unsigned_16bits table[1024] = {0x0000, 0x0001, .... 0x1023};
  return table[input];
}

That last code will work like a champ with HLS. :-)
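Nobody wants to type 1024 table entries by hand, so the initialiser
would normally be generated. A minimal sketch of a host-side generator
follows; the names bcd_entry and print_bcd_table are made up for this
example, and the output layout (eight values per line) is arbitrary.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: packed BCD for one table index (0..1023).
 * Using '/' and '%' is fine here because this code only runs on the
 * PC that generates the table, never in the FPGA. */
static uint16_t bcd_entry(unsigned value)
{
    return (uint16_t)((((value / 1000) % 10) << 12) |
                      (((value / 100)  % 10) << 8)  |
                      (((value / 10)   % 10) << 4)  |
                        (value % 10));
}

/* Hypothetical helper: writes the 1024 initialiser values to 'out',
 * eight per line, separated by commas. */
static void print_bcd_table(FILE *out)
{
    unsigned i;

    for (i = 0; i < 1024; i++) {
        const char *sep = (i == 1023) ? "\n" : (i % 8 == 7) ? ",\n" : ", ";
        fprintf(out, "0x%04X%s", (unsigned)bcd_entry(i), sep);
    }
}
```

Calling print_bcd_table(stdout) from a small main() and pasting the
output between the braces of the table initialiser is one way to use it.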

It has lots of issues with shared resources (esp. on-chip and off-chip
memories) that require special incantations to work correctly.

You get exactly what you ask for, but if you are not very explicit you
get junk. It still needs an FPGA-aware monkey on the keyboard for
reasonable results.

It does allow for testing in S/W land, rather than in H/W simulation,
which is far more productive. Just compile your code and run unit
tests, then convert it to H/W.
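To make the "compile and run unit tests" step concrete, here is a
minimal sketch that checks the counter-based converter against the
naive divide/modulus version on a PC. The typedefs are plain stand-ins
for the sized types (an assumption; a real HLS flow would use the tool's
own types), and the names reference_bcd, bcd_counter and run_calls are
invented for the example. The counter logic is lightly condensed from
the code earlier in this post.

```c
#include <stdint.h>

/* Assumption: host-side stand-ins for the sized types. They are wider
 * than the names suggest, which is harmless here because every value
 * stays within its nominal range. */
typedef uint16_t unsigned_12bits;
typedef uint16_t unsigned_10bits;
typedef uint8_t  unsigned_4bits;

/* Reference model: the naive three-digit conversion. */
static unsigned reference_bcd(unsigned input)
{
    unsigned rtn = (input / 100) % 10;
    rtn = rtn * 16 + (input / 10) % 10;
    rtn = rtn * 16 + input % 10;
    return rtn;
}

/* Code under test: the counter-based version, renamed for this file. */
static unsigned_12bits bcd_counter(unsigned_10bits input)
{
    static unsigned_12bits last = 0;
    static unsigned_10bits count = 0, converting = 0;
    static unsigned_4bits ones = 0, tens = 0, hundreds = 0;

    /* Latch the BCD digits when the counter reaches the sampled input */
    if (count == converting) {
        last = (unsigned_12bits)((hundreds << 8) | (tens << 4) | ones);
    }

    if (hundreds == 9 && tens == 9 && ones == 9) {
        count = 0;
        ones = tens = hundreds = 0;
        converting = input;   /* sample the input once per 1000 calls */
    } else {
        count++;
        if (tens == 9 && ones == 9) {
            hundreds++;
            tens = ones = 0;
        } else if (ones == 9) {
            tens++;
            ones = 0;
        } else {
            ones++;
        }
    }
    return last;
}

/* Hypothetical test helper: hold 'input' steady for 'calls' calls
 * (one call standing in for one clock cycle) and return the last output. */
static unsigned run_calls(unsigned input, unsigned calls)
{
    unsigned out = 0, i;

    for (i = 0; i < calls; i++) {
        out = bcd_counter((unsigned_10bits)input);
    }
    return out;
}
```

Holding each input for 2000 calls allows up to one 1000-call period for
the value to be sampled and a second for it to be converted, after which
run_calls(v, 2000) can be compared with reference_bcd(v) for v in
0..999.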

Mike 

On 10.08.2018 09:13, Charles Manning wrote: 

> Hello All
>
> Has anyone done any real work, or know of any real work, that has
> been done with the Xilinx HLS tools.
>
> This is the toolset that takes code written in C/C++ and munches it
> down to run on an FPGA.
>
> There is obvious appeal from a 50,000 ft perspective that you can
> write a program in C, tweak it a bit and re-compile it and voilà,
> you're an FPGA programmer!
>
> The sceptic in me says that nothing comes for free. You must be
> giving something away. C lacks the expressiveness for some CPU
> operations. It surely lacks the ability to convey concepts that
> Verilog and VHDL do.
>
> This suggests to me that an FPGA executing HLS is inherently going to
> need more resources (read bigger, more expensive FPGAs) and need more
> power (ie. bigger battery, heatsinks,...) than the same function
> implemented in Verilog.
>
> Has anyone experience to either refute my assertions or back them up?
>
> Thanks
>
> Charles
>
> _______________________________________________
> Chchrobotics mailing list Chchrobotics at lists.ourshack.com
> https://lists.ourshack.com/mailman/listinfo/chchrobotics [1]
> Mail Archives: http://lists.ourshack.com/pipermail/chchrobotics/ [2]
> Meetings usually 3rd Monday each month. See http://kiwibots.org [3]
> for venue, directions and dates.
> When replying, please edit your Subject line to reflect new subjects.

 

Links:
------
[1] https://lists.ourshack.com/mailman/listinfo/chchrobotics
[2] http://lists.ourshack.com/pipermail/chchrobotics/
[3] http://kiwibots.org