Optimizing your simulation for vcomp
vcomp is a Verilog® compiler, not an interpreter - this means that things
that can be determined at compile time work especially well while
things that require runtime resolution tend to be much slower.
This section gives an overview of what performs well and what does not,
so that you can get the fastest possible simulation.
Most of these issues also apply to other similar Verilog® compilers.
Bad things
- scalared multi-bit nets - see the note below under 'good things'
- expressions other than a simple
variable name in an event expression ( @(...) ) sensitivity list
require the generation of extra code, approximately equivalent to creating
a wire and an 'assign' to evaluate the expression.
- force/assign - forcing or assigning to a register-type variable
is rather expensive - it takes a lot more time than a simple assignment.
- <= vs = - use of <= means that the compiler cannot perform
inter-assignment optimizations past the point where the non-blocking assignment
is performed, so, if possible, you should group any non-blocking assignments
together. On the other hand, '<= #N' is usually only slightly more expensive
than '= #N'. '<= #N' can require the creation of arbitrary amounts of
storage if the non-blocking assignments are performed at a higher rate
than the delay, but vcomp optimizes for the much more common case of a single
outstanding non-blocking store.
- disable is expensive if you attempt to disable something that's not
above you in your hierarchy. Using disable to quit a loop is cheap; disabling
something in another always block is much more expensive; disabling
a task is even more expensive (and, if it's a possibility, requires extra
book-keeping on the part of the compiler every time the task is called).
- recursive task/function calls - Verilog® is blessed with a memory model that
is mostly static - i.e. at the time a simulation is created you can mostly
figure out how much storage will be required. vcomp works really hard to keep
memory footprints as small as possible - thread stack spaces are usually in
the 10s of bytes - which helps keep simulation sizes down as well as
giving better cache/TLB performance. However, one of the few places in the language
where stack space can be of arbitrary size is when you have recursive
task or function calls. Fortunately these are seldom used in Verilog®,
nor are they of much use (due to Verilog®'s curious globalized parameters).
When vcomp detects a recursive subroutine call it allocates a 4k stack segment;
if your program crashes and you need more, you can use the -s compile-time flag
to allocate more stack space.
- vcomp highly optimizes the value-change mechanism that's used for PLI VCDs
as well as internally for event expressions etc. If a register variable
appears in an event sensitivity list or equivalent (for example the RHS
of an assign statement) then extra memory must be allocated for it and
assignment to it will be much more expensive - many compiler optimizations
will also be disabled across the assignment. Things like $dumpvars, or
equivalent routines that require waveform viewers/dumpers to be able to
attach VCD callbacks to every object in a simulation, are VERY expensive
when running - and still somewhat expensive if compiled in and not called.
For example, if you include $dumpvars in your simulation and don't call it,
vcomp must still allocate all the VCD overhead for every variable, disable
all inter-assignment optimizations and add the VCD checks to every
assignment. This can slow things down a lot, so if you want to have
$dumpvars, `ifdef it out and recompile when you need it (perhaps build
2 binaries every time you do a compile). Wires don't suffer from this VCD
overhead as much as register-type variables do.
- wires are more expensive than regs - this is true in general - wires need more
support (storage etc) than register variables and are not able to take
part in as many optimizations as register variables can.
- hierarchical references - assignments to things outside the scope
of the current module are somewhat more expensive - mostly due to the overhead
of SMP synchronization
- tran/tranif0/tranif1 etc - frankly trans are evil .... they slow down the
evaluation of the nets they are attached to by a factor of 2-3
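Two of the points above, grouping non-blocking assignments and keeping $dumpvars out of the default build, can be sketched as follows. This is only an illustration under assumptions: the module name tb, the signal names and the macro name DUMP_WAVES are all made up for the example, and how the macro gets defined depends on your own build setup (for instance a `define in an included file).

```verilog
// Hypothetical testbench; tb, count, next, wrapped and DUMP_WAVES are
// illustrative names, not anything defined by vcomp.
module tb;
    reg       clk;
    reg [7:0] count, next;
    reg       wrapped;

    initial begin
        clk = 0; count = 0; wrapped = 0;
        repeat (20) #5 clk = ~clk;
    end

    always @(posedge clk) begin
        // Blocking work on a local temporary comes first ...
        next = count + 8'd1;
        // ... and the non-blocking assignments are grouped together at the
        // end, so they do not sit between other assignments in the block.
        count   <= next;
        wrapped <= (next == 8'd0);
    end

`ifdef DUMP_WAVES
    // Only compiled in when DUMP_WAVES is defined, so the normal build
    // contains no $dumpvars at all.
    initial begin
        $dumpfile("tb.vcd");
        $dumpvars(0, tb);
    end
`endif
endmodule
```

Building once with the macro defined and once without gives the "2 binaries" arrangement mentioned above: a fast one for regular runs and a slower one for when you need waveforms.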
Good things
- Wires are optimized for the single driver case
- We attempt to keep vectored (multi-bit) wires vectored wherever possible,
because they simulate much faster. If you do not use the 'vectored'
or 'scalared' keywords then vcomp
will attempt to keep a net vectored. If you pass a selection of a wire
(i.e. one bit or a range of bits) to a module, primitive or gate then
vcomp is forced to split the vectored net into a bunch of wires - an N-bit
scalared wire will take almost N times the effort an N-bit vectored wire will take.
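As a rough illustration of the vectored/scalared point, the sketch below computes the same 4-bit AND in two ways. The module and signal names are hypothetical, and the comments about what the compiler does simply restate the description above rather than anything measured.

```verilog
// Hypothetical module; all names are illustrative only.
module vec_example(a, b, y_split, y_vec);
    input  [3:0] a, b;
    output [3:0] y_split, y_vec;

    // Bit-selects passed to gate primitives: according to the text, this
    // forces a, b and y_split to be split into individual per-bit wires.
    and g0(y_split[0], a[0], b[0]);
    and g1(y_split[1], a[1], b[1]);
    and g2(y_split[2], a[2], b[2]);
    and g3(y_split[3], a[3], b[3]);

    // Whole-vector expression: no bit-selects are handed to a gate here,
    // so this form by itself gives no reason to split the net.
    assign y_vec = a & b;
endmodule
```

Both outputs carry the same value. Note that a and b feed both forms in this sketch, so they would be split here regardless; in real code you would pick one style per net.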
The main problem you have to look out for is code that depends on event ordering.
This is almost always somewhat different between Verilog® implementations, and
because of our SMP implementation it can be even more unpredictable (even from simulation run
to simulation run, because of the randomness of who wins during SMP locking). If you think you are
suffering from such an event ordering problem, try running your simulation
with the -Do flag and see if the randomness goes away (i.e. it always succeeds or always
breaks the same way, rather than each simulation working differently). Note: such a simulation problem
is almost always an indication of a 0-time race in your design - you're probably not pipelining
something correctly, or something similar. This can be a genuine bug, and it
should be tracked down and fixed, not ignored. The fact that SMP operation allows you
to find these bugs is a good thing, even though they are often really hard to find and fix.
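For readers who have not met one, a minimal sketch of a 0-time race looks like the following. The names are hypothetical and the example is deliberately tiny; real races are usually spread across modules and much harder to spot.

```verilog
// Hypothetical example; all names are illustrative only.
module race_example;
    reg clk;
    reg a, b_racy;      // racy pair, written with blocking assignments
    reg a_nb, b_safe;   // same logic written with non-blocking assignments

    initial begin
        clk = 0; a = 0; b_racy = 0; a_nb = 0; b_safe = 0;
        #5 clk = 1;
        #5 $display("b_racy=%b b_safe=%b", b_racy, b_safe);
    end

    // Racy: both blocks run on the same clock edge in the same time step.
    // Whether b_racy sees the old or the new value of 'a' depends on which
    // block the simulator happens to run first.
    always @(posedge clk) a = 1;
    always @(posedge clk) b_racy = a;

    // Race-free: non-blocking assignments sample their right-hand sides
    // before any of them update, so b_safe always gets the old a_nb.
    always @(posedge clk) a_nb <= 1;
    always @(posedge clk) b_safe <= a_nb;
endmodule
```

Under standard Verilog scheduling b_safe is the same on every run, while b_racy is allowed to differ between simulators, and by the description above between runs of an SMP simulation.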
The one situation where you might find a genuine event ordering difference in a design
is a place where signals are crossing a clock boundary and the clock periods are
not synchronized with respect to each other. In this situation in the real world the order in which
signals are resolved is undefined anyway - since this is the simulator equivalent of
metastability you probably want some randomness to check your own synchronizer
circuits (you don't have any? time to raise a red flag!).
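If "synchronizer circuit" is unfamiliar, the most common form is a two-flop synchronizer on the receiving clock, sketched below. This is a generic textbook structure, not something specific to vcomp, and all names are hypothetical.

```verilog
// Hypothetical two-flop synchronizer; all names are illustrative only.
module sync2(clk_dst, d_async, q_sync);
    input  clk_dst;   // clock of the receiving domain
    input  d_async;   // signal arriving from an unrelated clock domain
    output q_sync;    // version of d_async that is safe to use on clk_dst

    reg meta;         // first stage: may capture d_async mid-change
    reg stable;       // second stage: gives 'meta' a full cycle to settle

    assign q_sync = stable;

    always @(posedge clk_dst) begin
        meta   <= d_async;
        stable <= meta;
    end
endmodule
```

Logic in the receiving domain should only ever look at q_sync, never at d_async directly.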
Finally, we've also seen people have problems with the following sort of construct:
always @(...) begin
    x = ....;    // 'x' is both an output and a 'reg', and is read again later in this block
end
This happens to work as expected with some simulators and not with others; it probably doesn't
with ours (at least in the first release). It's actually buggy code and not portable. The problem is that
when an output of a module is also declared as a 'reg', the language really defines it to be:
assign x_ext = x_int;
always @(...) begin
    x_int = ....;
end
All writes to the 'reg' go to the 'x_int' value and are propagated to the external net,
though not always immediately; reads of 'x', however, always go to the net. If you write to
a registered output and read it again in the same always statement, the value may not
yet have been propagated and the result you read may be stale.
It's easy to miss this sort of bug. It's better to avoid registered outputs if you
want to read them in the same block, and instead explicitly make a 'reg'
and an 'output' and assign them together - that way you can read the internal
registered value directly, which is probably what you intended.
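The workaround just described might look like the sketch below. The module name, ports and the seen_high flag are hypothetical, chosen only to show a value being written and then read back in the same block.

```verilog
// Hypothetical module; all names are illustrative only.
module reg_out(clk, d, x);
    input  clk, d;
    output x;           // plain net output, not declared 'reg'

    reg x_int;          // explicit internal register
    reg seen_high;      // example of state that depends on reading the value back

    // The output net is driven from the internal register explicitly.
    assign x = x_int;

    initial seen_high = 1'b0;

    always @(posedge clk) begin
        x_int = d;
        // This reads x_int, the register itself, rather than the net x,
        // so it sees the value written on the line above.
        if (x_int)
            seen_high = 1'b1;
    end
endmodule
```

Code outside the module still sees x as an ordinary output; only the read-back inside the block changes.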