soc/cores: rename spiopi to spi_opi
parent
f58e8188b7
commit
62f3537db0
|
@ -9,13 +9,13 @@ class SpiOpi(Module, AutoCSR, AutoDoc):
|
||||||
self.intro = ModuleDoc("""
|
self.intro = ModuleDoc("""
|
||||||
Intro
|
Intro
|
||||||
********
|
********
|
||||||
|
|
||||||
SpiOpi implements a dual-mode SPI or OPI interface. OPI is an octal (8-bit) wide
|
SpiOpi implements a dual-mode SPI or OPI interface. OPI is an octal (8-bit) wide
|
||||||
variant of SPI, which is unique to Macronix parts. It is concurrently interoperable
|
variant of SPI, which is unique to Macronix parts. It is concurrently interoperable
|
||||||
with SPI. The chip supports "DTR mode" (double transfer rate, i.e. DDR) where data
|
with SPI. The chip supports "DTR mode" (double transfer rate, i.e. DDR) where data
|
||||||
is transferred on each edge of the clock, and there is a source-synchronous DQS
|
is transferred on each edge of the clock, and there is a source-synchronous DQS
|
||||||
associated with the input data.
|
associated with the input data.
|
||||||
|
|
||||||
The chip by default boots into SPI-only mode (unless NV bits are burned otherwise)
|
The chip by default boots into SPI-only mode (unless NV bits are burned otherwise)
|
||||||
so to enable OPI, a config register needs to be written using SPI mode. Note that once
|
so to enable OPI, a config register needs to be written using SPI mode. Note that once
|
||||||
the config register is written, the only way to return to SPI mode is to change
|
the config register is written, the only way to return to SPI mode is to change
|
||||||
|
@ -23,32 +23,32 @@ class SpiOpi(Module, AutoCSR, AutoDoc):
|
||||||
reconfiguring the FPGA: a simple JTAG command to reload from SPI will not yank PROG_B low,
|
reconfiguring the FPGA: a simple JTAG command to reload from SPI will not yank PROG_B low,
|
||||||
and so the SPI ROM will be in DOPI, and SPI loading will fail. Thus, system architects
|
and so the SPI ROM will be in DOPI, and SPI loading will fail. Thus, system architects
|
||||||
must take into consideration a hard reset for the ROM whenever a bitstream reload
|
must take into consideration a hard reset for the ROM whenever a bitstream reload
|
||||||
is demanded of the FPGA.
|
is demanded of the FPGA.
|
||||||
|
|
||||||
The SpiOpi architecture is split into two levels: a command manager, and a
|
The SpiOpi architecture is split into two levels: a command manager, and a
|
||||||
cycle manager. The command manager is responsible for taking the current wishbone
|
cycle manager. The command manager is responsible for taking the current wishbone
|
||||||
request and CSR state and unpacking these into cycle-by-cycle requests. The cycle
|
request and CSR state and unpacking these into cycle-by-cycle requests. The cycle
|
||||||
manager is responsible for coordinating the cycle-by-cycle requests.
|
manager is responsible for coordinating the cycle-by-cycle requests.
|
||||||
|
|
||||||
In SPI mode, this means marshalling byte-wide requests into a series of 8 serial cycles.
|
In SPI mode, this means marshalling byte-wide requests into a series of 8 serial cycles.
|
||||||
|
|
||||||
In OPI [DOPI] mode, this means marshalling 16-bit wide requests into a pair of back-to-back DDR
|
In OPI [DOPI] mode, this means marshalling 16-bit wide requests into a pair of back-to-back DDR
|
||||||
cycles. Note that because the cycles are DDR, this means one 16-bit wide request must be
|
cycles. Note that because the cycles are DDR, this means one 16-bit wide request must be
|
||||||
issued every cycle to keep up with the interface.
|
issued every cycle to keep up with the interface.
|
||||||
|
|
||||||
For the output of data to ROM, the module expects a clock called "spinor_delayed", which is a delayed
|
For the output of data to ROM, the module expects a clock called "spinor_delayed", which is a delayed
|
||||||
version of "sys". The delay is necessary to get the correct phase relationship between
|
version of "sys". The delay is necessary to get the correct phase relationship between
|
||||||
the SIO and SCLK in DTR/DDR mode, and it also has to compensate for the special-case
|
the SIO and SCLK in DTR/DDR mode, and it also has to compensate for the special-case
|
||||||
difference in the CCLK pad vs other I/O.
|
difference in the CCLK pad vs other I/O.
|
||||||
|
|
||||||
For the input, DQS signal is independently delayed relative to the DQ signals using
|
For the input, DQS signal is independently delayed relative to the DQ signals using
|
||||||
an IDELAYE2 block. At a REFCLK frequency of 200 MHz, each delay tap adds 78ps, so up
|
an IDELAYE2 block. At a REFCLK frequency of 200 MHz, each delay tap adds 78ps, so up
|
||||||
to a 2.418ns delay is possible between DQS and DQ. The goal is to delay DQS relative
|
to a 2.418ns delay is possible between DQS and DQ. The goal is to delay DQS relative
|
||||||
to DQ, because the SPI chip launches both with concurrent rising edges (to within 0.6ns),
|
to DQ, because the SPI chip launches both with concurrent rising edges (to within 0.6ns),
|
||||||
but the IDDR register needs the rising edge of DQS to be centered inside the DQ eye.
|
but the IDDR register needs the rising edge of DQS to be centered inside the DQ eye.
|
||||||
|
|
||||||
In DOPI mode, there is a prefetch buffer. It will read `prefetch_lines` cache lines of
|
In DOPI mode, there is a prefetch buffer. It will read `prefetch_lines` cache lines of
|
||||||
data into the prefetch buffer. A cache line is 256 bits (or 8x32-bit words). The maximum
|
data into the prefetch buffer. A cache line is 256 bits (or 8x32-bit words). The maximum
|
||||||
value is 63 lines (one line is necessary for synchronization margin). The downside of
|
value is 63 lines (one line is necessary for synchronization margin). The downside of
|
||||||
setting prefetch_lines high is that the prefetcher is running constantly and burning
|
setting prefetch_lines high is that the prefetcher is running constantly and burning
|
||||||
power, while throwing away most data. In practice, the CPU will typically consume data
|
power, while throwing away most data. In practice, the CPU will typically consume data
|
||||||
|
@ -57,12 +57,12 @@ class SpiOpi(Module, AutoCSR, AutoDoc):
|
||||||
1-3 lines read-ahead of the CPU. Any higher than 3 lines probably just wastes power.
|
1-3 lines read-ahead of the CPU. Any higher than 3 lines probably just wastes power.
|
||||||
In short simulations, 1 line of prefetch seems to be enough to keep the prefetcher
|
In short simulations, 1 line of prefetch seems to be enough to keep the prefetcher
|
||||||
ahead of the CPU even when it's simply running straight-line code.
|
ahead of the CPU even when it's simply running straight-line code.
|
||||||
|
|
||||||
Note the "sim" parameter exists because there seems to be a bug in xvlog that doesn't
|
Note the "sim" parameter exists because there seems to be a bug in xvlog that doesn't
|
||||||
correctly simulate the IDELAY machines. Setting "sim" to True removes the IDELAY machines
|
correctly simulate the IDELAY machines. Setting "sim" to True removes the IDELAY machines
|
||||||
and passes the data through directly, but in real hardware the IDELAY machines are
|
and passes the data through directly, but in real hardware the IDELAY machines are
|
||||||
necessary to meet timing between DQS and DQ.
|
necessary to meet timing between DQS and DQ.
|
||||||
|
|
||||||
dq_delay_taps probably doesn't need to be adjusted; it can be tweaked for timing
|
dq_delay_taps probably doesn't need to be adjusted; it can be tweaked for timing
|
||||||
closure. The delays can also be adjusted at runtime.
|
closure. The delays can also be adjusted at runtime.
|
||||||
""")
|
""")
|
||||||
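As a rough check on the IDELAYE2 figures quoted in the intro docstring, the sketch below works out the delay range from the tap size. It takes the 78 ps per-tap figure from the text and assumes tap settings run from 0 to 31, which is consistent with the 2.418 ns maximum stated there. The names `TAP_PS`, `MAX_TAP`, `idelay_ps` and `taps_for_delay` are illustrative only and are not part of the SpiOpi module.

```python
TAP_PS = 78    # per-tap delay quoted in the docstring for a 200 MHz REFCLK
MAX_TAP = 31   # assumed highest tap setting (taps numbered 0..31)


def idelay_ps(taps: int) -> int:
    """Return the delay in picoseconds added by `taps` delay taps."""
    if not 0 <= taps <= MAX_TAP:
        raise ValueError(f"tap count must be in 0..{MAX_TAP}")
    return taps * TAP_PS


def taps_for_delay(target_ps: float) -> int:
    """Return the tap setting closest to `target_ps`, clamped to the valid range."""
    taps = round(target_ps / TAP_PS)
    return max(0, min(MAX_TAP, taps))


# Maximum DQS-to-DQ delay available, in nanoseconds.
max_delay_ns = idelay_ps(MAX_TAP) / 1000
print(max_delay_ns)  # 2.418
```

A helper like `taps_for_delay` could be used to pick a starting tap value for a desired DQS shift, with the final value still tuned against real timing results.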
|
@ -227,13 +227,13 @@ class SpiOpi(Module, AutoCSR, AutoDoc):
|
||||||
self.architecture = ModuleDoc("""
|
self.architecture = ModuleDoc("""
|
||||||
Architecture
|
Architecture
|
||||||
**************
|
**************
|
||||||
|
|
||||||
The machine is split into two separate pieces, one to handle SPI, and one to handle OPI.
|
The machine is split into two separate pieces, one to handle SPI, and one to handle OPI.
|
||||||
|
|
||||||
SPI
|
SPI
|
||||||
=====
|
=====
|
||||||
The SPI machine architecture is split into two levels: MAC and PHY.
|
The SPI machine architecture is split into two levels: MAC and PHY.
|
||||||
|
|
||||||
The MAC layer is responsible for:
|
The MAC layer is responsible for:
|
||||||
- receiving requests via CSR register to perform config/status/special command sequences,
|
- receiving requests via CSR register to perform config/status/special command sequences,
|
||||||
and dispatching these to the SPI PHY
|
and dispatching these to the SPI PHY
|
||||||
|
@ -242,69 +242,69 @@ class SpiOpi(Module, AutoCSR, AutoDoc):
|
||||||
- managing the chip select to the chip, and ensuring that one dummy cycle is inserted after
|
- managing the chip select to the chip, and ensuring that one dummy cycle is inserted after
|
||||||
chip select is asserted, or before it is de-asserted; and that the chip select "high" times
|
chip select is asserted, or before it is de-asserted; and that the chip select "high" times
|
||||||
are adequate (1 cycle between reads, 4 cycles for all other operations)
|
are adequate (1 cycle between reads, 4 cycles for all other operations)
|
||||||
|
|
||||||
On boot, the interface runs in SPI; once the wakeup sequence is executed, the chip permanently
|
On boot, the interface runs in SPI; once the wakeup sequence is executed, the chip permanently
|
||||||
switches to OPI mode unless the CR2 registers are written to fall back, or the
|
switches to OPI mode unless the CR2 registers are written to fall back, or the
|
||||||
reset to the chip is asserted.
|
reset to the chip is asserted.
|
||||||
|
|
||||||
The PHY layers are responsible for the following tasks:
|
The PHY layers are responsible for the following tasks:
|
||||||
- Serializing and deserializing data, standardized on 8 bits for SPI and 16 bits for OPI
|
- Serializing and deserializing data, standardized on 8 bits for SPI and 16 bits for OPI
|
||||||
- counting dummy cycles
|
- counting dummy cycles
|
||||||
- managing the clock enable
|
- managing the clock enable
|
||||||
|
|
||||||
PHY cycles are initiated with a "req" signal, which is only sampled for
|
PHY cycles are initiated with a "req" signal, which is only sampled for
|
||||||
one cycle and then ignored until the PHY issues an "ack" that the current cycle is complete.
|
one cycle and then ignored until the PHY issues an "ack" that the current cycle is complete.
|
||||||
Thus holding "req" high can allow the PHY to back-to-back issue cycles without pause.
|
Thus holding "req" high can allow the PHY to back-to-back issue cycles without pause.
|
||||||
|
|
||||||
OPI
|
OPI
|
||||||
=====
|
=====
|
||||||
The OPI machine is split into three parts: a command controller, a Tx PHY, and an Rx PHY.
|
The OPI machine is split into three parts: a command controller, a Tx PHY, and an Rx PHY.
|
||||||
|
|
||||||
The Tx PHY is configured with a "dummy cycle" count register, as there is a variable length
|
The Tx PHY is configured with a "dummy cycle" count register, as there is a variable length
|
||||||
delay for dummy cycles in OPI.
|
delay for dummy cycles in OPI.
|
||||||
|
|
||||||
In OPI mode, read data is `mesochronous`, that is, it returns at precisely the same frequency
|
In OPI mode, read data is `mesochronous`, that is, it returns at precisely the same frequency
|
||||||
as SCLK, but with an unknown phase relationship. The DQS strobe is provided as a "hint" to
|
as SCLK, but with an unknown phase relationship. The DQS strobe is provided as a "hint" to
|
||||||
the receiving side to help retime the data. The mesochronous nature of the read data is
|
the receiving side to help retime the data. The mesochronous nature of the read data is
|
||||||
why the Tx and Rx PHY must be split into two separate machines, as they are operating in
|
why the Tx and Rx PHY must be split into two separate machines, as they are operating in
|
||||||
different clock domains.
|
different clock domains.
|
||||||
|
|
||||||
DQS is implemented on the ROM as an extra data output that is guaranteed to change polarity with
|
DQS is implemented on the ROM as an extra data output that is guaranteed to change polarity with
|
||||||
each data byte; the skew mismatch of DQS to data is within +/-0.6ns or so. It turns out the mere
|
each data byte; the skew mismatch of DQS to data is within +/-0.6ns or so. It turns out the mere
|
||||||
act of routing the DQS into a BUFR buffer before clocking the data into an IDDR primitive
|
act of routing the DQS into a BUFR buffer before clocking the data into an IDDR primitive
|
||||||
is sufficient to delay the DQS signal and meet setup and hold time on the IDDR.
|
is sufficient to delay the DQS signal and meet setup and hold time on the IDDR.
|
||||||
|
|
||||||
Once captured by the IDDR, the data is fed into a dual-clock FIFO to make the transition
|
Once captured by the IDDR, the data is fed into a dual-clock FIFO to make the transition
|
||||||
from the DQS to sysclk domains cleanly.
|
from the DQS to sysclk domains cleanly.
|
||||||
|
|
||||||
Because of the latency involved in going from pin->IDDR->FIFO, excess read cycles are
|
Because of the latency involved in going from pin->IDDR->FIFO, excess read cycles are
|
||||||
required beyond the end of the requested cache line. However, there is virtually no
|
required beyond the end of the requested cache line. However, there is virtually no
|
||||||
penalty in pre-filling the FIFO with data; if a new cache line has to be fetched,
|
penalty in pre-filling the FIFO with data; if a new cache line has to be fetched,
|
||||||
the FIFO can simply be reset and all pointers zeroed. In fact, pre-filling the FIFO
|
the FIFO can simply be reset and all pointers zeroed. In fact, pre-filling the FIFO
|
||||||
can lead to great performance benefits if sequential cache lines are requested. In
|
can lead to great performance benefits if sequential cache lines are requested. In
|
||||||
simulation, a cache line can be filled in 10 bus cycles if it happens to be prefetched
|
simulation, a cache line can be filled in 10 bus cycles if it happens to be prefetched
|
||||||
(as opposed to 49 bus cycles for random reads). Either way, this compares favorably to
|
(as opposed to 49 bus cycles for random reads). Either way, this compares favorably to
|
||||||
288 cycles for random reads in 100MHz SPI mode (or 576 for the spimemio.v, which runs at
|
288 cycles for random reads in 100MHz SPI mode (or 576 for the spimemio.v, which runs at
|
||||||
50MHz).
|
50MHz).
|
||||||
|
|
||||||
The command controller is responsible for sequencing all commands other than fast reads. Most
|
The command controller is responsible for sequencing all commands other than fast reads. Most
|
||||||
commands have some special-case structure to them, and as more commands are implemented, the
|
commands have some special-case structure to them, and as more commands are implemented, the
|
||||||
state machine is expected to grow fairly large. Fast reads are directly handled in "tx_run"
|
state machine is expected to grow fairly large. Fast reads are directly handled in "tx_run"
|
||||||
mode, where the TxPhy and RxPhy run a tight loop to watch incoming read bus cycles, check
|
mode, where the TxPhy and RxPhy run a tight loop to watch incoming read bus cycles, check
|
||||||
the current address, fill the prefetch fifo, and respond to bus cycles.
|
the current address, fill the prefetch fifo, and respond to bus cycles.
|
||||||
|
|
||||||
Writes to ROM might lock up the machine; a TODO is to test this and do something more sane,
|
Writes to ROM might lock up the machine; a TODO is to test this and do something more sane,
|
||||||
like ignore writes by sending an ACK immediately while discarding the data.
|
like ignore writes by sending an ACK immediately while discarding the data.
|
||||||
|
|
||||||
Thus, an OPI read proceeds as follows:
|
Thus, an OPI read proceeds as follows:
|
||||||
|
|
||||||
- When BUS/STB are asserted:
|
- When BUS/STB are asserted:
|
||||||
TxPhy:
|
TxPhy:
|
||||||
|
|
||||||
- capture bus_adr, and compare against the *next read* address pointer
|
- capture bus_adr, and compare against the *next read* address pointer
|
||||||
- if they match, allow the PHYs to do the work
|
- if they match, allow the PHYs to do the work
|
||||||
|
|
||||||
- if bus_adr and next read address don't match, save to next read address pointer, and
|
- if bus_adr and next read address don't match, save to next read address pointer, and
|
||||||
cycle wr/rd clk for 5 cycles while asserting reset to reset the FIFO
|
cycle wr/rd clk for 5 cycles while asserting reset to reset the FIFO
|
||||||
- initiate an 8DTRD with the read address pointer
|
- initiate an 8DTRD with the read address pointer
|
||||||
- wait the specified dummy cycles
|
- wait the specified dummy cycles
|
||||||
|
@ -312,24 +312,24 @@ class SpiOpi(Module, AutoCSR, AutoDoc):
|
||||||
- greedily pre-fill the FIFO by continuing to clock DQS until either:
|
- greedily pre-fill the FIFO by continuing to clock DQS until either:
|
||||||
- the FIFO is full
|
- the FIFO is full
|
||||||
- pre-fetch is aborted because bus_adr and next read address don't match and FIFO is reset
|
- pre-fetch is aborted because bus_adr and next read address don't match and FIFO is reset
|
||||||
|
|
||||||
RxPHY:
|
RxPHY:
|
||||||
- while CTI==2, assemble data into 32-bit words as soon as EMPTY is deasserted,
|
- while CTI==2, assemble data into 32-bit words as soon as EMPTY is deasserted,
|
||||||
present a bus_ack, and increment the next read address pointer
|
present a bus_ack, and increment the next read address pointer
|
||||||
- when CTI==7, ack the data, and wait until the next bus cycle with CTI==2 to resume
|
- when CTI==7, ack the data, and wait until the next bus cycle with CTI==2 to resume
|
||||||
reading
|
reading
|
||||||
|
|
||||||
- A FIFO_SYNC_MACRO is used to instantiate the FIFO. This is chosen because:
|
- A FIFO_SYNC_MACRO is used to instantiate the FIFO. This is chosen because:
|
||||||
- we can specify RAMB18's, which seem to be under-utilized by the auto-inferred memories by migen
|
- we can specify RAMB18's, which seem to be under-utilized by the auto-inferred memories by migen
|
||||||
- the XPM_FIFO_ASYNC macro claims no instantiation support, and also looks like it has weird
|
- the XPM_FIFO_ASYNC macro claims no instantiation support, and also looks like it has weird
|
||||||
requirements for resetting the pointers: you must check the reset outputs, and the time to
|
requirements for resetting the pointers: you must check the reset outputs, and the time to
|
||||||
reset is reported to be as high as around 200ns (anecdotally -- could be just that the sim I
|
reset is reported to be as high as around 200ns (anecdotally -- could be just that the sim I
|
||||||
read on the web is using a really slow clock, but I'm guessing it's around 10 cycles).
|
read on the web is using a really slow clock, but I'm guessing it's around 10 cycles).
|
||||||
- the FIFO_SYNC_MACRO has a well-specified fixed reset latency of 5 cycles.
|
- the FIFO_SYNC_MACRO has a well-specified fixed reset latency of 5 cycles.
|
||||||
- The main downside of FIFO_SYNC_MACRO over XPM_FIFO_ASYNC is that XPM_FIFO_ASYNC can automatically
|
- The main downside of FIFO_SYNC_MACRO over XPM_FIFO_ASYNC is that XPM_FIFO_ASYNC can automatically
|
||||||
allow for output data to be read at 32-bit widths, with writes at 16-bit widths. However, with a
|
allow for output data to be read at 32-bit widths, with writes at 16-bit widths. However, with a
|
||||||
bit of additional logic and pipelining, we can aggregate data into 32-bit words going into a
|
bit of additional logic and pipelining, we can aggregate data into 32-bit words going into a
|
||||||
32-bit FIFO_SYNC_MACRO, which is what we do in this implementation.
|
32-bit FIFO_SYNC_MACRO, which is what we do in this implementation.
|
||||||
""")
|
""")
|
||||||
self.bus = wishbone.Interface()
|
self.bus = wishbone.Interface()
|
||||||
|
|
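The OPI read sequence described in the architecture docstring (compare `bus_adr` against the next-read pointer, reset the FIFO on a mismatch, otherwise keep pre-filling) can be sketched as a small behavioral model. This is a plain-Python toy, not the migen implementation: it ignores dummy cycles, CTI handling, the 5-cycle FIFO reset and the DQS/sysclk clock domains. The class `PrefetchModel` and its attributes are hypothetical names introduced for illustration.

```python
from collections import deque

WORDS_PER_LINE = 8  # a cache line is 8 x 32-bit words, per the docstring


class PrefetchModel:
    """Toy model of the next-read-address check; not the real hardware."""

    def __init__(self, rom, prefetch_lines=1):
        self.rom = rom                 # list of words standing in for the ROM
        self.depth = prefetch_lines * WORDS_PER_LINE
        self.fifo = deque()
        self.next_read = None          # word address the FIFO head corresponds to
        self.fetch_ptr = 0             # next word address to pull from the ROM
        self.restarts = 0              # FIFO resets, i.e. new read commands issued

    def _prefill(self):
        # Greedily fill the FIFO until it is full or the ROM runs out.
        while len(self.fifo) < self.depth and self.fetch_ptr < len(self.rom):
            self.fifo.append(self.rom[self.fetch_ptr])
            self.fetch_ptr += 1

    def read(self, adr):
        if adr != self.next_read:
            # Mismatch: drop prefetched data and restart at the new address.
            self.fifo.clear()
            self.fetch_ptr = adr
            self.restarts += 1
        self._prefill()
        word = self.fifo.popleft()
        self.next_read = adr + 1
        return word


rom = list(range(100, 164))            # 64 dummy words
model = PrefetchModel(rom)
sequential = [model.read(a) for a in range(8)]
restarts_after_line = model.restarts   # sequential reads need a single restart
jumped = model.read(32)                # a non-sequential read forces a second one
```

In this model, sequential reads only pay the restart cost once, which mirrors the docstring's point that a prefetched cache line fills far faster than a random read.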