josuah.net | panoramix-labs.fr
While writing HDL to teach a chip new tricks, it is best to avoid drowning in the complexity.
The famous divide and rule helps: splitting the design into modules that, like functions in a programming language, reduce the scope of what is being worked on and hide the complexity from the parent module that uses them.
But it quickly ends up in a sea of modules communicating in many different ways.
Adding another layer of organisation becomes necessary: using a bus that acts as a central spine for communication across the whole design.
Multiple bus protocols are used, with Wishbone the simplest and most widely used one for open source cores.
The Wishbone bus comes in multiple variants:
CTI signal: Classic or Registered Feedback;
ACK: Synchronous or Asynchronous;
STB and CYC: Standard or Pipelined.
I suppose the aim was to offer the largest coverage of all use-cases, so that Wishbone could be used in a standard way in most situations.
This large range of options also makes it harder to support every combination, some being incompatible with each other, and it seems common to use the most basic Wishbone in every case.
What is left is to decide which combination is the simplest.
At first, I wanted to avoid the Pipelined mode, to keep things as simple as possible. But my opinion changed after having a look at how both work:
In Standard mode, when a master issues a request with STB_O, as long as the slave has not answered, the master keeps STB_O high until it sees ACK_I held high by the slave. CYC_O and STB_O are both still high on the clock where ACK_I is received, and it is only on the next clock that it is possible to issue a new request.
           ___     ___     ___     ___     ___
CLK_I   __/   \___/   \___/   \___/   \___/   \__
           _______________________
CYC_O   __/                       \______________
           _______________________
STB_O   __/                       \______________
                           _______
ACK_I   __________________/       \______________
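The Standard handshake can be sketched as a small behavioural model. This is Python rather than HDL, and only an illustration: the function name `standard_cycle` and its `slave_latency` parameter are made up for this example, and the data, address and write signals are left out as in the diagram.

```python
def standard_cycle(slave_latency):
    """Return per-clock (CYC_O, STB_O, ACK_I) values for one Standard request.

    slave_latency is a hypothetical parameter: the number of clocks the
    slave needs, counting the clock on which it raises ACK_I.
    """
    trace = []
    clocks_waited = 0
    while True:
        clocks_waited += 1
        ack = 1 if clocks_waited == slave_latency else 0
        # The master keeps CYC_O and STB_O high until it sees ACK_I.
        trace.append((1, 1, ack))
        if ack:
            break
    # Bus released: this is the first clock where a new request could start.
    trace.append((0, 0, 0))
    return trace


for clock, (cyc, stb, ack) in enumerate(standard_cycle(3)):
    print(f"clock {clock}: CYC_O={cyc} STB_O={stb} ACK_I={ack}")
```

With a latency of 3, STB_O stays high for three clocks and ACK_I shows up on the third one, as in the diagram.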
In Pipelined mode, a master issues a request by taking STB_O high, and instead of waiting for ACK_I to take it back low, it checks STALL_I: if high, it waits; if low, it considers the request queued by the slave, and may submit another one right away. In that case, ACK_I only tells the master that a queued request has finished.
           ___     ___     ___     ___     ___
CLK_I   __/   \___/   \___/   \___/   \___/   \__
           _______________________________
CYC_O   __/                               \______
           _______________
STB_O   __/               \______________________
                                   _______
ACK_I   __________________________/       \______
           _______
STALL_I __/       \______________________________
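The Pipelined handshake can be sketched the same way, as a Python behavioural model rather than HDL. The function name `pipelined_cycle` and both of its parameters are assumptions made for this example, and only a single request is modelled.

```python
def pipelined_cycle(stall_clocks, ack_delay):
    """Return per-clock (CYC_O, STB_O, STALL_I, ACK_I) for one Pipelined request.

    Both parameters are hypothetical:
    stall_clocks: clocks during which the slave holds STALL_I high.
    ack_delay: clocks between the request being accepted and ACK_I (>= 1).
    """
    trace = []
    # Phase 1: the master holds STB_O high while it samples STALL_I high.
    for _ in range(stall_clocks):
        trace.append((1, 1, 1, 0))
    # STALL_I is low: the request is considered queued on this clock.
    trace.append((1, 1, 0, 0))
    # Phase 2: STB_O is back low, CYC_O stays high until ACK_I arrives.
    for _ in range(ack_delay - 1):
        trace.append((1, 0, 0, 0))
    trace.append((1, 0, 0, 1))
    return trace


for clock, signals in enumerate(pipelined_cycle(stall_clocks=1, ack_delay=2)):
    print(clock, dict(zip(("CYC_O", "STB_O", "STALL_I", "ACK_I"), signals)))
```

With one stalled clock and an ACK two clocks after acceptance, the trace follows the diagram: STB_O is high for two clocks, STALL_I for the first one only, and ACK_I comes on the fourth.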
In both cases, CYC_O stays high through the whole transaction, and ACK_I announces that the request is done.
Other signals, such as data, read/write or address have been omitted for clarity.
Implementing a Pipelined slave does not turn out to be more complex in practice:
STALL_I can be tied low (STALL_I = 0) and ignored.
STALL_I would have been needed in Standard mode anyway, in the form of an internal busy register.
A Standard master is a bit simpler to implement, though: it does not have to wait for the request to be queued and then wait again for the slave to provide an answer. It only has to wait for ACK_I.
In the timing examples above, the slave takes 3 cycles to work on the request, and then sets the ACK_I signal.
The Pipelined mode seems to take one more clock cycle to operate, but it still has a higher throughput: it is not necessary to wait for the result to be available before submitting a new request.
This only works if the slave has a buffer, a FIFO to queue the incoming requests and work on them later.
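A back-of-the-envelope comparison shows where the throughput difference comes from. This is a rough Python sketch under strong assumptions: the slave is idealised, never raises STALL_I and answers queued requests in order with a constant latency. The function names and the figures used are hypothetical.

```python
def standard_clocks(requests, latency):
    """Clocks needed in Standard mode: each request waits for its own ACK_I."""
    return requests * latency


def pipelined_clocks(requests, latency):
    """Clocks needed in Pipelined mode with a slave that never stalls.

    One request is accepted per clock, and each ACK_I arrives `latency`
    clocks after its request was issued (counting the issue clock), so
    only the last request pays the full latency.
    """
    return (requests - 1) + latency


# Hypothetical figures: 3 clocks per request in Standard mode, 4 in
# Pipelined mode (one extra clock), and 8 requests in a row.
print(standard_clocks(8, 3))    # 24
print(pipelined_clocks(8, 4))   # 11
```

Even with one more clock of latency per request, the Pipelined total grows by one clock per extra request instead of a full latency.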
Pipelined mode may seem more difficult to implement, since it suggests that a complex queuing mechanism has to be written for it, but a pipeline is entirely optional even in Pipelined mode.
Only ACK_I needs to be shifted by one clock, which is done by using a register instead of a wire for it. This adds the needed delay, since registers apply changes on the next clock.
That way, it is still possible to write very simple modules that do everything in a single clock.
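A minimal sketch of such a module, again as a Python behavioural model and not as HDL: the class name `RegisteredAckSlave` and its `clock` method are invented for this example, STALL_I is assumed to be tied low, and data signals are left out.

```python
class RegisteredAckSlave:
    """Slave that handles every request in one clock, with ACK in a register."""

    def __init__(self):
        self.ack = 0  # register content, visible during the current clock

    def clock(self, cyc, stb):
        """Process one clock; return the ACK value visible during that clock."""
        visible_ack = self.ack
        # Like a register, the new value only becomes visible on the next clock.
        self.ack = 1 if (cyc and stb) else 0
        return visible_ack


slave = RegisteredAckSlave()
stb_sequence = [1, 1, 0, 0]          # two back-to-back requests
acks = [slave.clock(1, stb) for stb in stb_sequence]
print(acks)                          # [0, 1, 1, 0]
```

The ACK sequence is the STB sequence shifted by one clock, and the slave still accepts a new request on every clock.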
One extra clock cycle is indeed consumed by Wishbone in its Pipelined mode. This could lead to an overall higher latency, in particular if multiple Wishbone buses are chained together.
If operations that are too complex are done in a single clock cycle, it may take too much time for all the signals to settle down and stabilise before the next clock tick.
With too long a chain of logic, the timing constraint (the clock rate) might be missed.
A long chain of logic can be broken down into two steps with a register, so that part of the steps are done before the register and the rest after it, leaving roughly half of the work to be done in a single clock tick.
If Wishbone is used in Standard mode, the signals have to propagate inside the master, then to the slave, then inside the slave, then back to the master, all of that probably in a single clock tick.
Placing a register in the bus, by making ACK_O a register, breaks the long chain from master to slave and back to master by introducing an intermediate step (the register) where the signal pauses before going back to the master, making sure it had time to settle down in the slave.
That way, if the timings of the slave are fine with one master, they have better chances of being fine with any other master, since the timings of the slave and master do not add up anymore.
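The effect on the clock rate can be put into numbers with a small worked example. All delays below are invented for the sake of illustration and do not come from any real device; the helper `max_clock_mhz` is likewise hypothetical.

```python
def max_clock_mhz(path_delays_ns):
    """Highest clock rate (MHz) at which the slowest path still settles."""
    return 1000.0 / max(path_delays_ns)


# Hypothetical propagation delays, in nanoseconds.
MASTER_REQUEST_NS = 2.0   # logic in the master that drives the request
SLAVE_NS = 5.0            # logic in the slave that produces the answer
MASTER_ANSWER_NS = 3.0    # logic in the master that consumes ACK

# ACK as a wire: master, slave and master again form a single path.
wire_ack_paths = [MASTER_REQUEST_NS + SLAVE_NS + MASTER_ANSWER_NS]

# ACK as a register: the path is cut in two at the register.
registered_ack_paths = [MASTER_REQUEST_NS + SLAVE_NS, MASTER_ANSWER_NS]

print(max_clock_mhz(wire_ack_paths))        # 100.0
print(max_clock_mhz(registered_ack_paths))  # about 142.9
```

With these made-up figures, the slowest path shrinks from 10 ns to 7 ns once ACK is registered, and the delay of the master's answer logic no longer adds to the delay of the slave.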
While the Standard Wishbone seems more frequently used, the Pipelined mode seems to be a bit friendlier to timing, and most of its drawbacks, like the extra clock for ACK or the extra signal, would likely also appear in the Standard mode.
I am still new to Wishbone, and very curious about what you think about it: Which variant do you use? Anything that I would have missed about the Standard mode? me@josuah.net
Among notable Pipelined mode users is ZipCPU.
Looking at this ZipCPU article, it seems that its motivation for using Pipelined mode is expressed in these sentences:
A reminder of the way logic gates may "solve maths":
One solution to sequencing operations is to create a giant state machine. The reality, though, is that an FPGA tends to create all the logic for every state at once, and then only select the correct answer at the end of each clock tick. In this fashion, a state machine can be very much like the simple ALU we've discussed.
And the conclusion of what makes more sense:
On the other hand, if the FPGA is going to implement all of the logic for the operation anyway, why not arrange each of those operations into a sequence, where each stage does something useful? This approach rearranges the algorithm into a pipeline.
And its use of Wishbone is extensively explained in https://raw.githubusercontent.com/ZipCPU/zipcpu/master/doc/orconf.pdf.