The current 20% speed disadvantage of compiled code can be remedied. We also hope that closer work with the Synopsys tools will reveal ways to speed up primitive operations such as the 32-bit add.
Loop unrolling is very successful as an optimization technique, allowing a 30% speed increase over an iterative implementation. The automatic insertion of pipeline registers into an unrolled algorithm promises further speed improvement; the methods of [AS93,POA96] might prove useful.
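As a rough illustration of what unrolling means for the RC5 round structure, the sketch below contrasts an iterative and a hand-unrolled form of the encryption rounds. It is written in Python purely for exposition and is not the compiler's source language. The helper names (`rotl32`, `rc5_iterative`, `rc5_unrolled_2`), the two-round count, and the key table `S` are assumptions chosen for brevity; `S` holds arbitrary constants rather than the output of RC5 key expansion.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic within 32 bits


def rotl32(x, n):
    """Rotate the 32-bit value x left by the low five bits of n."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def rc5_iterative(a, b, s, rounds):
    """RC5-32 encryption rounds written as a loop over the round index."""
    a = (a + s[0]) & MASK
    b = (b + s[1]) & MASK
    for i in range(1, rounds + 1):
        a = (rotl32(a ^ b, b) + s[2 * i]) & MASK
        b = (rotl32(b ^ a, a) + s[2 * i + 1]) & MASK
    return a, b


def rc5_unrolled_2(a, b, s):
    """The same computation for exactly two rounds, with the loop removed.

    Every key-table index is now a constant, so each access could be a
    fixed wire rather than an indexed lookup.
    """
    a = (a + s[0]) & MASK
    b = (b + s[1]) & MASK
    a = (rotl32(a ^ b, b) + s[2]) & MASK  # round 1
    b = (rotl32(b ^ a, a) + s[3]) & MASK
    a = (rotl32(a ^ b, b) + s[4]) & MASK  # round 2
    b = (rotl32(b ^ a, a) + s[5]) & MASK
    return a, b


# Hypothetical key table for illustration only (not a real expanded key).
S = [0x01234567, 0x89ABCDEF, 0x0F1E2D3C, 0x4B5A6978, 0xDEADBEEF, 0x13579BDF]
```

In the unrolled form the straight-line sequence of round bodies is also where pipeline registers could in principle be placed between rounds.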
Remaining compiler work may include retargeting the front-end to a C subset, and implementing more optimization stages to perform strength-reduction and copy-propagation. Better support for arrays (and their decomposition into register variables) may enable us to express the RC5 algorithm in a manner that does not require substantial manual unrolling.
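The two optimizations named above can be pictured with a small before-and-after pair. This is only a sketch in Python for exposition; the function names and the multiply-by-eight example are hypothetical and do not come from the compiler itself.

```python
def scaled_offset_before(i, base):
    """Unoptimized form: a redundant copy followed by a multiplication."""
    t = i            # plain copy of i
    offset = t * 8   # multiplication by a power of two
    return base + offset


def scaled_offset_after(i, base):
    """After copy-propagation (t replaced by i) and strength-reduction
    (multiply by 8 replaced by a left shift of 3)."""
    return base + (i << 3)
```

In hardware terms, a shift by a constant amounts to rewiring, so replacing the multiplication removes a multiplier from the datapath altogether.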
Finally, the possibility of compiling directly to structural VHDL remains to be considered; φ-functions correspond neatly to the multiplexors required in a hardware implementation.
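The correspondence can be sketched as follows, assuming the functions meant are the φ-functions of static single assignment (SSA) form. The example is in Python for exposition only, and the names `mux`, `abs_diff_branching` and `abs_diff_ssa` are hypothetical.

```python
def mux(sel, when_true, when_false):
    """Two-input multiplexor: pass one input through according to sel."""
    return when_true if sel else when_false


def abs_diff_branching(x, y):
    """Ordinary control flow: r is assigned on both branches."""
    if x > y:
        r = x - y
    else:
        r = y - x
    return r


def abs_diff_ssa(x, y):
    """SSA-style form: each name is assigned exactly once, and the merge
    point r3 = phi(r1, r2) is realized as a multiplexor on the condition."""
    c = x > y
    r1 = x - y           # value from the "then" branch
    r2 = y - x           # value from the "else" branch
    r3 = mux(c, r1, r2)  # phi-function as a multiplexor
    return r3
```

Both branch values are computed unconditionally here, which mirrors hardware, where the two subtractors would operate in parallel and the multiplexor selects the result.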