* optional `register=False` to decrease latency by 1 cycle
* require explicit `cycles`, as it influences latency (`min_cycles` can still be used)
* add unit tests