
Wishbone B4: Standard or Pipelined?

While writing HDL to teach a chip new tricks, it is best to avoid drowning in the complexity.

The famous divide and rule helps: splitting the design into modules that, like functions in a programming language, reduce the scope of what is worked on and hide the complexity from the parent module that uses them.

But this quickly ends up as a sea of modules communicating in many different ways.

Organising communication with a bus

Adding another layer of organisation becomes necessary: using a bus that acts as a central spine for communication across the whole design.

Several bus protocols exist. Wishbone is among the simplest, and it is widely used for open source cores.

What flavor?

The Wishbone bus comes in multiple variants:

I suppose the aim was to cover as many use cases as possible, so that Wishbone could be used in a standard way in most situations.

This large range of options also makes it harder to support every combination, some being incompatible with each other, and it seems common to use the most basic Wishbone in every case.

What is left is to decide which combination is the simplest.

Standard and Pipelined

At first, I wanted to avoid the Pipelined mode, to keep things as simple as possible. But my opinion changed when I had a look at how both work:

In Standard mode, when a master issues a request with STB_O, it keeps STB_O high for as long as the slave has not answered, until it sees ACK_I raised by the slave. CYC_O and STB_O are both released on the clock where ACK_I is received, and it is only on the next clock that it is possible to issue a new request.

	   ___     ___     ___     ___     ___   
CLK_I	__/   \___/   \___/   \___/   \___/   \__
CYC_O	__/                       \______________
STB_O	__/                       \______________
ACK_I	__________________/       \______________
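
To check my reading of this diagram, here is a minimal sketch in Python (not HDL) that replays the handshake clock by clock. The function name and the slave_delay parameter are made up for the illustration, and the slave is assumed to answer after a fixed number of clocks.

```python
def simulate_standard(slave_delay, n_cycles):
    """Replay one Standard-mode request, one tuple (CYC_O, STB_O, ACK_I) per clock.

    slave_delay is a hypothetical number of clocks the slave needs
    before it raises ACK_I.
    """
    trace = []
    cyc = stb = 1              # the master raises CYC_O and STB_O together
    waited = 0
    for _ in range(n_cycles):
        ack = 1 if (stb and waited >= slave_delay) else 0
        trace.append((cyc, stb, ack))
        if ack:
            cyc = stb = 0      # ACK_I seen: both go low on the next clock
        elif stb:
            waited += 1        # no answer yet: keep STB_O high and wait
    return trace


for cyc, stb, ack in simulate_standard(slave_delay=2, n_cycles=5):
    print(f"CYC_O={cyc} STB_O={stb} ACK_I={ack}")
```

With slave_delay=2, CYC_O and STB_O stay high for three clocks and ACK_I is high during the third one, which is what the waveform shows.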

In Pipelined mode, a master issues a request by taking STB_O high, and instead of waiting for ACK_I to take it back low, it checks STALL_I: if high, then it waits; if low, it considers the request queued by the slave, and may submit another one right away. In that case, ACK_I only tells the master that a queued request has finished.

	   ___     ___     ___     ___     ___
CLK_I	__/   \___/   \___/   \___/   \___/   \__
CYC_O	__/                               \______
STB_O	__/               \______________________
ACK_I	__________________________/       \______
STALL_I	__/       \______________________________
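
The same kind of sketch for this diagram, again in Python rather than HDL. The names stall_cycles and ack_delay are assumptions of mine: the slave stalls for a fixed number of clocks, then acknowledges a fixed number of clocks after accepting the request.

```python
def simulate_pipelined(stall_cycles, ack_delay, n_cycles):
    """Replay one Pipelined-mode request.

    Returns one tuple (CYC_O, STB_O, ACK_I, STALL_I) per clock.
    stall_cycles and ack_delay are hypothetical slave timings.
    """
    trace = []
    cyc, stb = 1, 1
    accepted_at = None                         # clock at which the slave took the request
    for t in range(n_cycles):
        stall = 1 if t < stall_cycles else 0
        ack = 1 if (accepted_at is not None and t == accepted_at + ack_delay) else 0
        trace.append((cyc, stb, ack, stall))
        if stb and not stall:
            accepted_at = t                    # request queued by the slave
            stb = 0                            # drop STB_O (or issue another request)
        if ack:
            cyc = 0                            # last answer received: end of the cycle
    return trace


for cyc, stb, ack, stall in simulate_pipelined(stall_cycles=1, ack_delay=2, n_cycles=5):
    print(f"CYC_O={cyc} STB_O={stb} ACK_I={ack} STALL_I={stall}")
```

With stall_cycles=1 and ack_delay=2, STB_O is high for two clocks, STALL_I for the first one only, and ACK_I comes on the fourth clock, as in the waveform.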

In both cases, CYC_O stays up through the whole transaction, and ACK_I announces that the request is done.

Other signals, such as data, read/write or address have been omitted for clarity.

Standard uses one less signal

Implementing a Pipelined slave does not turn out to be more complex in practice:

However, a Standard master is a bit simpler to implement: it does not have to wait for the request to be queued and then wait again for the slave to provide an answer. It only has to wait for ACK_I.

Pipelined for better throughput

In the timing examples above, the slave takes 3 cycles to work on the request, and then sets the ACK_I signal.

Pipelined mode seems to take one more clock cycle to complete a request, but it still has a higher throughput: it is not necessary to wait until the result is available before submitting a new request.

This will only work if the slave has a buffer, such as a FIFO, to queue the incoming requests and work on them later.
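
A back-of-the-envelope model of that difference, as a Python sketch. It rests on optimistic assumptions of mine: the Standard master issues its next request right after each ACK_I, and the Pipelined slave accepts one request per clock without stalling. Both function names are made up.

```python
def standard_clocks(n_requests, latency):
    """Clocks to finish n_requests when each one must be acknowledged
    before the next one can be issued (latency counts the clocks from
    STB_O going high up to and including the ACK_I clock)."""
    return n_requests * latency


def pipelined_clocks(n_requests, latency):
    """Clocks to finish n_requests when a new one is accepted on every
    clock: the first answer arrives after `latency` clocks, then one
    more answer on each following clock."""
    return latency + (n_requests - 1)


# Figures read from the two diagrams: 3 clocks in Standard, 4 in Pipelined.
print(standard_clocks(8, 3))    # 24
print(pipelined_clocks(8, 4))   # 11
```

Under these assumptions, the extra clock of latency is paid once, while the saving grows with every additional request.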

Pipelined as easy to implement as Standard

Pipelined mode may seem more difficult to implement, since it suggests that a complex queuing mechanism has to be written for it, but a pipeline is entirely optional even in Pipelined mode.

Only ACK_I needs to be shifted by one clock, which is done by using a register instead of a wire for it. This adds the needed delay, since registers apply changes on the next clock.

That way, it is still possible to write very simple modules that do everything in a single clock.
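
To picture the difference between a wire and a register for the ACK, here is a small Python sketch (not HDL, and both function names are made up). It assumes a slave that never stalls and answers every strobe.

```python
def wired_ack(stb_samples):
    """ACK as a wire: it follows STB within the same clock."""
    return list(stb_samples)


def registered_ack(stb_samples):
    """ACK as a register: it shows the STB value latched on the previous clock."""
    ack = 0                 # register content, cleared at reset
    out = []
    for stb in stb_samples:
        out.append(ack)     # value visible during this clock
        ack = stb           # latched at the clock edge, visible on the next clock
    return out


stb = [1, 1, 0, 0]
print(wired_ack(stb))       # [1, 1, 0, 0]
print(registered_ack(stb))  # [0, 1, 1, 0]
```

The registered version gives the same answers, one clock later, without any queue behind it.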

Standard has a 1-clock better latency

A single clock cycle is indeed consumed by Wishbone in its Pipelined mode. This could lead to an overall higher latency, in particular if there are multiple Wishbone buses chained together.

Pipelined may help with timing

If operations that are too complex are done in a single clock cycle, it may take too much time for all the signals to settle down and stabilise before the next clock tick.

With too long a chain of logic, the timing constraint (the clock rate) might be missed.

A long chain of logic can be broken down into two steps with a register, so that half of the steps are done before the register and half after it, leaving roughly half of the work to be done in a single clock tick.

If Wishbone is used in Standard mode, the signals would have to propagate inside the master, then to the slave, then inside the slave, then back to the master, all of that probably within a single clock tick.

Placing a register in the bus, by making ACK_O a register, breaks the long chain from master to slave and back to master. It introduces an intermediate step (the register) where the signal takes a pause before going back to the master, making sure it had time to settle down in the slave.

That way, if the timings of the slave are fine with one master, they have better chances of being fine with any other master, since the timings of the slave and the master do not add up anymore.
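
The arithmetic behind this can be sketched in Python. The delays below are invented numbers in nanoseconds, not measurements, and the function name is made up. The only rule used is that the clock period must be at least as long as the slowest path between two registers.

```python
def max_clock_mhz(path_delays_ns):
    """Highest clock rate allowed by the slowest register-to-register path."""
    return 1000.0 / max(path_delays_ns)


# Hypothetical delays, in ns.
master_logic = 4.0   # inside the master, up to the bus
slave_logic = 4.0    # through the slave, up to its ACK
return_path = 2.0    # from the slave ACK back into the master

# ACK as a wire: one long path, master -> slave -> master.
print(max_clock_mhz([master_logic + slave_logic + return_path]))   # 100.0

# ACK as a register: two shorter paths, before and after the register.
print(max_clock_mhz([master_logic + slave_logic, return_path]))    # 125.0
```

With these made-up figures the register raises the reachable clock rate from 100 MHz to 125 MHz. Real numbers depend entirely on the design and on the FPGA.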


While Standard Wishbone seems more frequently used, the Pipelined mode seems to be a bit friendlier to timing, and most of its drawbacks, like the extra clock for ACK or the extra signal, would likely also appear in Standard mode.

I am still new to Wishbone, and very curious about what you think about it: Which variant do you use? Is there anything I have missed about the Standard mode?

Among notable Pipelined mode users is ZipCPU.


Looking at this ZipCPU article, it seems that its motivation for using Pipelined mode is expressed in these sentences:

First, a reminder of the way logic gates may "solve maths":

One solution to sequencing operations is to create a giant state machine. The reality, though, is that an FPGA tends to create all the logic for every state at once, and then only select the correct answer at the end of each clock tick. In this fashion, a state machine can be very much like the simple ALU we've discussed.

And the conclusion of what makes more sense:

On the other hand, if the FPGA is going to implement all of the logic for the operation anyway, why not arrange each of those operations into a sequence, where each stage does something useful? This approach rearranges the algorithm into a pipeline.

And its use of Wishbone is extensively explained in