Conceptual understanding with AXI Protocol and Cache Coherency

The AXI protocol and cache coherency are used in almost every complex SoC today, so it is worth understanding how both work.

If an AXI slave takes many clock cycles to respond to the master and complete a transfer,
the component is said to have high initial access latency.

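The AXI protocol provides a signal called WSTRB (write strobe) that indicates which byte
lanes of the write data bus carry valid data. There is one strobe bit per byte lane:
WSTRB[n] corresponds to WDATA[(8n)+7:(8n)].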
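
The effect of the write strobes can be sketched as below. This is only an illustrative model, not part of any AXI library: the helper name apply_wstrb and the byte-array representation of a memory word are assumptions made for the example.

```python
def apply_wstrb(old_word: bytes, wdata: bytes, wstrb: int) -> bytes:
    """Merge wdata into old_word, one byte lane at a time.

    Bit n of wstrb enables byte lane n; index 0 is the least
    significant byte lane (WDATA[7:0]).
    """
    assert len(old_word) == len(wdata)
    return bytes(
        new if (wstrb >> lane) & 1 else old
        for lane, (old, new) in enumerate(zip(old_word, wdata))
    )


# 32-bit bus, only lanes 0 and 1 enabled (WSTRB = 0b0011).
merged = apply_wstrb(bytes([0x11, 0x22, 0x33, 0x44]),
                     bytes([0xAA, 0xBB, 0xCC, 0xDD]),
                     0b0011)
print(merged.hex())  # aabb3344
```

Lanes 2 and 3 keep their old contents because their strobe bits are low.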

The responses from the slave can be sent out of order. The slave is not
required to complete transactions in the order in which it received them,
provided the transactions carry different ID tags. Transactions with the
same ID must complete in the order in which they were issued.

No. Read and write addresses are generated only by the AXI master. The slave owns the READ data channel and the WRITE response channel.
The slave sends only READ data, READ responses and WRITE responses.

The statement means that WRITE and READ bursts each have an associated
last signal, WLAST and RLAST respectively, which marks the final transfer of the burst.

The AWSIZE signal gives the number of bytes in each transfer of the burst,
encoded as 2^AWSIZE. The maximum value is 128 bytes (AWSIZE = 0b111).

AXI has 5 channels: write address, write data, write response, read address and read data.
Each channel has its own VALID and READY signals.
WRITE operation has both address and data channels
AWADDR: Write address
AWVALID: Write address valid: Source is Master
AWREADY: Write address ready: Source is Slave
WDATA: Write data
WVALID: VALID write data: Source is Master
WREADY: Write ready: Source is Slave

READ operation has both data and address channels
ARADDR: READ address
ARVALID: READ address valid: source is Master
ARREADY: READ address ready: source is Slave
RDATA: READ data: Source is Slave
RVALID: VALID READ data: Source is Slave
RREADY: READ ready: Source is Master

WRITE response channel:
Owned by slave
BVALID : Source is a slave
BREADY : Source is Master
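
On every channel a transfer completes only in a cycle where VALID and READY are both high. A minimal sketch of that rule, assuming each signal is sampled once per clock cycle into a list (the helper name handshake_cycles is made up for this example):

```python
def handshake_cycles(valid, ready):
    """Return the cycle indices in which VALID and READY are both high."""
    return [cycle
            for cycle, (v, r) in enumerate(zip(valid, ready))
            if v and r]


# The master holds AWVALID high from cycle 1; the slave raises AWREADY in cycle 3.
awvalid = [0, 1, 1, 1, 0]
awready = [0, 0, 0, 1, 0]
print(handshake_cycles(awvalid, awready))  # [3]
```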

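The maximum transfer size (AWSIZE) is 128 bytes and the maximum burst length in AXI3 is 16 transfers.
So the maximum is the product 128 * 16 = 2048 bytes. AXI4 extends INCR bursts to 256 transfers, but a burst must never cross a 4KB address boundary.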
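
The arithmetic can be checked with a short sketch. The function name burst_bytes is hypothetical; the encodings used are the standard ones (bytes per transfer = 2^AxSIZE, burst length = AxLEN + 1).

```python
def burst_bytes(axsize: int, axlen: int) -> int:
    """Total bytes moved by a burst, from the encoded AxSIZE and AxLEN."""
    bytes_per_transfer = 1 << axsize   # 2 ** AxSIZE
    burst_length = axlen + 1           # AxLEN encodes length minus one
    return bytes_per_transfer * burst_length


print(burst_bytes(0b111, 15))  # 2048 (128 bytes x 16 transfers)
print(burst_bytes(2, 3))       # 16   (4 bytes x 4 transfers)
```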

Let us take the following system scenario:

------------------------------------
|processor with internal cache (L1)| AXI Master
-----------------------------------
|
|---------|L2 cache|
|
----------------
AXI interconnect
----------------
|
|
|AXI Slave(DDR Memory)|

The L2 cache sits in the path between the processor and the interconnect.
Any transfer that is allowed to access the cache checks the cache contents
(called a cache lookup) before potentially accessing the downstream memory,
which in this case is the DDR memory.

INCR is the simplest burst type: it starts at a lower address and steps
sequentially up through memory to a higher address. This burst type can also be used to perform a cache linefill, but the problem is that you might need to complete almost the whole linefill before the data you want is stored in the cache and made available to the processor. This is where the WRAP burst has an advantage.

A WRAP burst fetches the important data first (which the processor actually
wants) and then completes the cache line fill around that important data.

In system-level terminology, this important data, which the processor actually
wants from a particular location in the cache line, is called the "critical word".

As an example, if we had an 8-word cache line and the processor wanted to
read data from address 0x18 (the 7th entry on a cache line if that data was
cached), an INCR burst would need to fetch data for
0x00, 0x04, 0x08, 0x0C, 0x10, 0x14 before finally getting the 0x18 data the
processor wants (at which point the processor is no longer stalled), and then the
final 0x1C cache line entry is filled.

Instead, if we use a WRAP burst, this burst can start at 0x18 (so the processor
is no longer stalled), and the cache line then fills up around this "critical word", with accesses to 0x1C, 0x00, 0x04, 0x08, 0x0C, 0x10 and 0x14.

There will still be 8 memory accesses to perform the cache linefill but in most
cases the WRAP burst type will stall the requesting processor for fewer cycles than the INCR burst type.
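
The two address sequences in this example can be reproduced with a small sketch. The function names are made up for illustration, and the WRAP version assumes the usual AXI conditions: the start address is aligned to the transfer size and the burst length is 2, 4, 8 or 16.

```python
def incr_addresses(start, size_bytes, length):
    """Addresses of an INCR burst: each transfer steps up by size_bytes."""
    return [start + i * size_bytes for i in range(length)]


def wrap_addresses(start, size_bytes, length):
    """Addresses of a WRAP burst: wraps at size_bytes * length."""
    total = size_bytes * length           # bytes covered by the burst
    lower = (start // total) * total      # lower wrap boundary
    return [lower + (start - lower + i * size_bytes) % total
            for i in range(length)]


incr = incr_addresses(0x00, 4, 8)
wrap = wrap_addresses(0x18, 4, 8)
print([hex(a) for a in incr])
print([hex(a) for a in wrap])
# Position of the critical word 0x18 in each burst:
print(incr.index(0x18), wrap.index(0x18))  # 6 0
```

With INCR the critical word arrives as the 7th transfer; with WRAP it arrives as the 1st.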

NO. Early burst termination is NOT supported in AXI. An AXI master can disable
further writing by deasserting all the write strobes, but it must still complete
the remaining transfers of the burst. For reads, the master can discard data it
does not need, but it must still complete all the transfers, and discarding READ
data can lose data when accessing a read-sensitive device such as a FIFO.

Cache coherency is a mechanism that keeps every cache holding the same data for a given address, using the additional extensions provided by the AMBA AXI4 ACE (AXI Coherency Extensions) protocol.
The L1 cache is specific to each core.
The L2 cache is specific to the processor sub-system.
Example: each core has its own L1 cache, and all the cores in a subsystem share one L2 cache.

TRUE.
The burst length is AWLEN + 1, so if AWLEN = 3 the burst length is 4.

System s/w will decide which address is cacheable & which address is non-cacheable.
Accordingly, the processor will generate the signal AWCACHE in such a way that the address will be cached.

The processor will go and create an entry in the cache and will fetch the data & put it into the cache.

If the address is not present in the cache, the processor creates an entry in the cache, fetches the data and puts it into the cache. During this process, L1 and L2 can go out of sync.

For example, suppose address 'h1000 is present in the DDR memory, in L1 and in L2. If the copy in the L1 cache is updated and the copy in L2 is NOT, there must be a mechanism to bring them back in sync. Such a mechanism is called cache coherency.

Prefetching refers to retrieving & storing data into buffer memory (cache) before the processor requires the data.

When the processor wants to process the data, it is readily available and can be processed within a short period of time.

Without a cache memory, the processor would have to fetch the data directly from the memory address, which could introduce a delay.

Cache prefetching is a speed-up technique used by the processors where instructions/data are fetched before they are needed.
AWCACHE[1]: for writes, this means that a number of different writes can be merged together.
ARCACHE[1]: for reads, this means that the location can be prefetched, or can be fetched just once for multiple read transactions.

System s/w will decide which address is cacheable & which address is non-cacheable. Accordingly, the processor will generate the necessary
attributes over the signals AWCACHE/ARCACHE to provide support to system-level caches about the transaction types.

If requested data is found in cache memory then it is called cache HIT.
If the requested data is not present in the cache, then it can be termed a cache MISS.
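
A toy model makes the HIT/MISS distinction concrete. This is a deliberately simplified sketch: the class SimpleCache, the dict used as DDR memory and the allocate-on-every-miss policy are all assumptions for illustration, not how a real cache controller is built.

```python
class SimpleCache:
    """Toy cache in front of a backing memory (a dict of addr -> data)."""

    def __init__(self, backing):
        self.backing = backing
        self.lines = {}
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:      # cache HIT: data already present
            self.hits += 1
        else:                       # cache MISS: fetch from memory and allocate
            self.misses += 1
            self.lines[addr] = self.backing[addr]
        return self.lines[addr]


ddr = {0x1000: 0xCAFE}
cache = SimpleCache(ddr)
first = cache.read(0x1000)    # miss, filled from DDR
second = cache.read(0x1000)   # hit, served from the cache
print(cache.hits, cache.misses)  # 1 1
```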

RA: if high, a read transfer that misses in the cache may be allocated in the cache.
WA: if high, a write transfer that misses in the cache may be allocated in the cache.
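
The cache attribute bits can be decoded as in the sketch below. It assumes the AXI3 bit naming (bit 0 bufferable, bit 1 cacheable, bit 2 RA, bit 3 WA); AXI4 renames some of these bits, and the function name decode_axcache is hypothetical.

```python
def decode_axcache(axcache: int) -> dict:
    """Split a 4-bit AxCACHE value into its attribute bits (AXI3 naming)."""
    return {
        "bufferable":     bool(axcache & 0b0001),
        "cacheable":      bool(axcache & 0b0010),
        "read_allocate":  bool(axcache & 0b0100),  # RA
        "write_allocate": bool(axcache & 0b1000),  # WA
    }


attrs = decode_axcache(0b0111)
print(attrs["read_allocate"], attrs["write_allocate"])  # True False
```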

In the form of different variants of accesses.
a. privileged
b. normal
c. secure
d. non-secure

The deemed accesses are 1. privileged 2. secure
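
In AXI these access variants are carried on the 3-bit AxPROT signal (AWPROT/ARPROT). A small decoding sketch follows; the function name decode_axprot is made up, and bit 2 (data versus instruction access) is included for completeness although it is not in the list above.

```python
def decode_axprot(axprot: int) -> dict:
    """Decode the 3-bit AxPROT protection attributes."""
    return {
        "privileged":  bool(axprot & 0b001),      # 0 = normal, 1 = privileged
        "secure":      not bool(axprot & 0b010),  # 0 = secure, 1 = non-secure
        "instruction": bool(axprot & 0b100),      # 0 = data, 1 = instruction
    }


prot = decode_axprot(0b001)
print(prot["privileged"], prot["secure"])  # True True
```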

EX access fails. If a master doesn't complete the write portion of an exclusive operation, a subsequent EX-RD changes the address that is
being monitored for exclusivity.

EX fails. In such a case, to avoid the memory being overwritten, the slave virtually reserves a memory resource for M1, as indicated
by the earlier EX-RD request from M1. This is the fundamental advantage of exclusive access in AXI.

The AXI slave starts monitoring the address on which the EX READ was initiated, together with the ARID supplied by the master, until either a write occurs to that location or another EX READ with the same ARID value resets the monitoring logic in the slave to a different address.
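
The monitoring behaviour can be sketched as a small model. This is a simplification under stated assumptions: one monitored address per ID, whole-address matching, and a hypothetical class name ExclusiveMonitor. It relies on the AXI response rule that a successful exclusive write returns EXOKAY and a failed one returns OKAY.

```python
class ExclusiveMonitor:
    """Toy exclusive-access monitor: one monitored address per ID."""

    def __init__(self):
        self.monitored = {}  # ID -> monitored address

    def exclusive_read(self, axid, addr):
        # A new EX READ with the same ID moves the monitor to the new address.
        self.monitored[axid] = addr

    def write(self, addr):
        # Any write to a monitored location clears the monitors on it.
        for axid in [i for i, a in self.monitored.items() if a == addr]:
            del self.monitored[axid]

    def exclusive_write(self, axid, addr):
        ok = self.monitored.get(axid) == addr
        if ok:
            self.write(addr)  # the write takes place and clears the monitor
        return "EXOKAY" if ok else "OKAY"


mon = ExclusiveMonitor()
mon.exclusive_read(1, 0x100)
mon.write(0x100)                     # another master writes the location
r1 = mon.exclusive_write(1, 0x100)   # fails
mon.exclusive_read(1, 0x100)
r2 = mon.exclusive_write(1, 0x100)   # succeeds
mon.exclusive_read(1, 0x100)
mon.exclusive_read(1, 0x200)         # same ID, monitor moves to 0x200
r3 = mon.exclusive_write(1, 0x100)   # fails
print(r1, r2, r3)  # OKAY EXOKAY OKAY
```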

if (start address % transfer size == 0)
    the address is aligned

start address = 0x4
AWSIZE = 2
transfer size = 2 ^ AWSIZE = 4 bytes

0x4 % 4 == 0, hence the address is aligned

If the start address is 0x7:

0x7 % 4 != 0, hence the address is unaligned
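
The same check written as a runnable sketch (the function names are chosen for this example only):

```python
def is_aligned(start_address: int, axsize: int) -> bool:
    """True if the address is a multiple of the transfer size 2 ** AxSIZE."""
    return start_address % (1 << axsize) == 0


def aligned_address(start_address: int, axsize: int) -> int:
    """Round the address down to the nearest transfer-size boundary."""
    transfer_size = 1 << axsize
    return (start_address // transfer_size) * transfer_size


print(is_aligned(0x4, 2))            # True
print(is_aligned(0x7, 2))            # False
print(hex(aligned_address(0x7, 2)))  # 0x4
```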

No. The address for an exclusive operation must be aligned to the total number of bytes in the transaction.

Yes. The number of bytes to be transferred in an exclusive access must be a power of 2, i.e. 1, 2, 4, 8, 16, 32, 64 or 128 bytes.
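
The two restrictions on exclusive accesses (total size a power of 2 up to 128 bytes, address aligned to the total size) can be combined into one check. This is a sketch that covers only those two rules; the function name is hypothetical.

```python
def exclusive_access_is_legal(addr: int, axsize: int, axlen: int) -> bool:
    """Check the size and alignment rules for an exclusive access."""
    total = (1 << axsize) * (axlen + 1)          # total bytes in the transaction
    power_of_two = (total & (total - 1)) == 0
    return power_of_two and total <= 128 and addr % total == 0


print(exclusive_access_is_legal(0x40, 2, 3))  # True: 16 bytes, aligned to 16
print(exclusive_access_is_legal(0x44, 2, 3))  # False: not aligned to 16
print(exclusive_access_is_legal(0x00, 2, 2))  # False: 12 bytes is not a power of 2
```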

Can an exclusive transaction be a cacheable one? The answer is NO. The slave that monitors for exclusivity MUST see the transactions, so they must not be serviced by a cache on the way.

FOR WRITE: there is just one response for the entire burst, not one for each individual data item within the burst.

FOR READ: the slave can provide different responses for different transfers within a burst.

For example:
in a burst of 16 read transfers, the slave might return an OKAY response for 15 of them and a SLVERR response for the 16th item.

The interconnect can enable transactions with fast responding slaves to
complete in advance compared to earlier transactions with slower slaves.
Complex slaves can return read data out of order. For example, a data item
for later access might be available from an internal buffer before the data
for earlier access is available.
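
Out-of-order completion is still bounded by the AXI ordering rule that transactions with the same ID complete in the order they were issued. A sketch of a checker for that rule, with transactions represented as hypothetical (ID, tag) pairs:

```python
def same_id_order_kept(issued, completed):
    """True if, for every ID, completion order matches issue order."""
    for axid in {i for i, _ in issued}:
        sent = [tag for i, tag in issued if i == axid]
        done = [tag for i, tag in completed if i == axid]
        if sent != done:
            return False
    return True


issued = [(0, "A"), (1, "B"), (0, "C")]
# B (ID 1) overtakes A (ID 0): allowed, the IDs differ.
ok = same_id_order_kept(issued, [(1, "B"), (0, "A"), (0, "C")])
# C overtakes A although both use ID 0: not allowed.
bad = same_id_order_kept(issued, [(0, "C"), (1, "B"), (0, "A")])
print(ok, bad)  # True False
```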

In a multi-master system, the interconnect appends additional information to the
ID tag to ensure that the ID tags from all the masters are unique. The ID tag is
similar to a master number, with the extension that each master can
implement multiple virtual masters within the same port by supplying an ID tag to indicate the virtual master number.
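
One common way to make the IDs unique is to place the master port number in extra upper bits of the ID. The sketch below assumes exactly that layout; the function names and the bit positions are illustrative assumptions, since the placement is an interconnect implementation choice.

```python
def extend_id(master_index: int, axid: int, id_width: int) -> int:
    """Prefix the master's own ID with its port number in the upper bits."""
    return (master_index << id_width) | axid


def split_id(extended: int, id_width: int):
    """Recover (master_index, original ID) to route a response back."""
    return extended >> id_width, extended & ((1 << id_width) - 1)


ext = extend_id(2, 0b0101, 4)   # master port 2, 4-bit ID 0b0101
print(bin(ext))                 # 0b100101
print(split_id(ext, 4))         # (2, 5)
```

Two masters that both use ID 0b0101 now appear to the slave with different extended IDs.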

Write data is treated as buffered so that the master can perform write
transactions without slave acknowledgment of previous writes.

WLAST is asserted for the last data item of the burst by the AXI Master.

Yes. RVALID cannot be driven HIGH until both ARVALID and ARREADY have been seen HIGH. This is a READ transaction: the slave cannot return data until the master has driven the address from which to fetch it, so without a valid read address there cannot be a READ.

Wrap boundary = transfer size * burst length

Burst length: 4 (AWLEN = 3)
AWSIZE: 4 bytes (32 bits)

Address boundary is 4 * 4 = 16 bytes
Address range: 0x0 - 0xF
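
The boundary calculation can be sketched as follows. The function name wrap_boundaries is hypothetical, and the transfer size is given in bytes (already decoded from AWSIZE).

```python
def wrap_boundaries(start: int, size_bytes: int, burst_length: int):
    """Return (lower, upper) addresses of the wrap window containing start."""
    total = size_bytes * burst_length     # wrap boundary size in bytes
    lower = (start // total) * total
    return lower, lower + total - 1


print([hex(a) for a in wrap_boundaries(0x4, 4, 4)])   # ['0x0', '0xf']
print([hex(a) for a in wrap_boundaries(0x34, 4, 4)])  # ['0x30', '0x3f']
```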

By default, the WSTRB member of the master transaction is random, so every bit takes a random value when the transaction is randomized. To set all the bits for the active byte lanes to '1', add a constraint during randomization as follows:
foreach (wstrb[i])
  wstrb[i] == (1 << (1 << this.burst_size)) - 1;
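
The value this expression produces can be checked outside the testbench. A Python sketch of the same arithmetic, assuming burst_size holds the encoded AWSIZE value (the function name full_strobe is made up):

```python
def full_strobe(burst_size: int) -> int:
    """All-ones strobe for a transfer of 2 ** burst_size bytes."""
    num_bytes = 1 << burst_size      # bytes per transfer
    return (1 << num_bytes) - 1      # one '1' bit per byte


print(bin(full_strobe(0)))  # 0b1        (1 byte)
print(bin(full_strobe(2)))  # 0b1111     (4 bytes)
print(bin(full_strobe(3)))  # 0b11111111 (8 bytes)
```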