In the world of modern computing systems, efficient data transfer and memory management are crucial for optimal performance. Two key concepts that play a vital role in this domain are the Advanced eXtensible Interface (AXI) protocol and cache coherency. This article aims to provide a comprehensive understanding of these concepts, their importance, and how they work together in modern system-on-chip (SoC) designs.
Understanding the AXI Protocol
What is AXI?
The Advanced eXtensible Interface (AXI) is a part of the Advanced Microcontroller Bus Architecture (AMBA) family of protocols developed by ARM. AXI is designed to provide high-performance, high-frequency system designs for complex SoCs.
Key Features of AXI
- Separate Address/Control and Data Phases: AXI separates the address/control and data phases, allowing for more efficient pipelining of transactions.
- Multiple Outstanding Transactions: AXI supports multiple outstanding transactions, enabling higher throughput in complex systems.
- Out-of-Order Transaction Completion: Transactions can complete out of order, which can significantly improve system performance.
- Burst-based Transactions: AXI supports burst-based transactions, allowing for efficient transfer of large data blocks.
- Flexible Arbitration: the AXI protocol does not mandate a particular arbitration scheme, allowing system designers to optimize the interconnect for specific application requirements.
AXI Channels
AXI uses five different channels for communication:
- Read Address Channel
- Read Data Channel
- Write Address Channel
- Write Data Channel
- Write Response Channel
This channel-based approach allows for simultaneous read and write operations, enhancing overall system performance.
Cache Coherency: Ensuring Data Consistency
What is Cache Coherency?
Cache coherency refers to the consistency of data stored in local caches of a shared resource. In a multi-processor or multi-core system, ensuring that all caches have a consistent view of memory is crucial for correct system operation.
The Cache Coherency Problem
When multiple processors or cores have their own local caches, there’s a risk of data inconsistency. For example, if one processor modifies data in its cache, other processors might still have the old version of that data in their caches, leading to potential errors.
Cache Coherency Protocols
To maintain coherency, various protocols have been developed:
- MESI Protocol: This protocol defines four states for a cache line – Modified, Exclusive, Shared, and Invalid.
- MOESI Protocol: An extension of MESI that adds an Owned state for improved performance in some scenarios.
- Snooping Protocols: These protocols involve continuously monitoring the system bus for any memory operations that might affect the local cache.
- Directory-based Protocols: These protocols maintain a centralized directory of the state of cache lines, which can be more scalable for large systems.
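To make the MESI states listed above concrete, here is a minimal, illustrative SystemVerilog sketch (not a complete coherency implementation) of the snoop-side transitions for a local cache line: a remote write invalidates the local copy, while a remote read downgrades a Modified or Exclusive line to Shared.

  module mesi_demo;
    typedef enum {MODIFIED, EXCLUSIVE, SHARED, INVALID} mesi_state_e;

    // Next state of a local cache line when a snooped request from another
    // core hits the same line (simplified plain-MESI rule).
    function automatic mesi_state_e on_snoop(mesi_state_e cur, bit remote_is_write);
      if (remote_is_write)
        return INVALID;                 // remote write invalidates our copy
      else if (cur == MODIFIED || cur == EXCLUSIVE)
        return SHARED;                  // remote read downgrades to Shared
      else
        return cur;                     // SHARED stays SHARED, INVALID stays INVALID
    endfunction

    initial begin
      mesi_state_e s;
      s = on_snoop(MODIFIED, 1'b0);
      $display("MODIFIED  + remote read  -> %s", s.name());
      s = on_snoop(EXCLUSIVE, 1'b1);
      $display("EXCLUSIVE + remote write -> %s", s.name());
    end
  endmodule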
AXI and Cache Coherency: Working Together
The AXI protocol includes features that support cache coherency in multi-processor systems:
- AXI Coherency Extensions (ACE): ACE adds new transaction types and signals to the AXI protocol to support hardware-managed cache coherency.
- Snoop Channels: ACE introduces additional channels for snoop transactions, allowing coherent masters to maintain cache coherency.
- Distributed Virtual Memory (DVM) Transactions: ACE supports DVM transactions, which allow for efficient management of virtual memory across multiple processors.
Conclusion
Understanding the AXI protocol and cache coherency is essential for designing and optimizing modern computing systems. The AXI protocol provides a flexible and high-performance interface for SoC designs, while cache coherency mechanisms ensure data consistency across multiple processors or cores. Together, they form a crucial foundation for efficient and reliable computing systems.
What does it mean for a device to have high initial latency?
If an AXI slave takes a large number of clock cycles to respond to the master and complete a transfer, that slave is said to have high initial access latency.
What is a byte strobe in AXI?
The AXI protocol provides the WSTRB signal, with one strobe bit per byte lane, to indicate which byte lanes of the write data bus carry valid data for the transfer.
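As a minimal sketch (module and function names are illustrative, not from the specification), the mapping of WSTRB bits to byte lanes on a 32-bit data bus can be modelled as follows: one strobe bit per byte of WDATA, set only for the lanes that carry valid data in a beat.

  module wstrb_demo #(parameter int BUS_BYTES = 4);  // 32-bit (4-byte) data bus
    function automatic logic [BUS_BYTES-1:0] make_wstrb(int unsigned addr,
                                                        int unsigned num_bytes);
      logic [BUS_BYTES-1:0] strb = '0;
      int unsigned lane = addr % BUS_BYTES;           // starting byte lane
      for (int i = 0; i < num_bytes; i++)
        strb[lane + i] = 1'b1;
      return strb;
    endfunction

    initial begin
      // 2-byte write to address 0x2 on a 4-byte bus -> upper two lanes valid
      $display("WSTRB = %b", make_wstrb('h2, 2));     // prints 1100
    end
  endmodule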
What is an out-of-order response?
Responses from the slave can be sent out of order: there is no requirement that the slave completes responses in the order in which the transactions were received, provided the transactions use different ID tags. Transactions with the same ID tag must complete in order.
Can we generate address information from the slave?
No. Addresses (read and write) are generated only by the AXI Master. The read data and write response channels are owned by the AXI slave: the slave only sends read data, read responses, and write responses.
Both the read data channel and the write data channel also include a LAST signal to indicate when the transfer of the final data item within a transaction takes place. Elaborate this statement.
The statement means that for both WRITE and READ there are associated WLAST and RLAST signals, which indicate that the current transfer is the last data item within the transaction.
Explain the significance of AWSIZE.
The AWSIZE signal specifies how many bytes are transferred in each individual transfer (beat) of the burst, encoded as bytes = 2**AWSIZE. The maximum is 128 bytes per transfer.
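As a quick illustrative sketch of that encoding (bytes per transfer = 2**AWSIZE, up to 128 bytes at AWSIZE = 3'b111):

  module awsize_demo;
    initial begin
      // AWSIZE is a 3-bit field: 0 -> 1 byte ... 7 -> 128 bytes per transfer
      for (int unsigned awsize = 0; awsize <= 7; awsize++)
        $display("AWSIZE = %0d -> %0d bytes per transfer", awsize, 1 << awsize);
    end
  endmodule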
Difference between RVALID, ARADDR, and ARVALID?
AXI has 5 channels, and each channel has its own VALID and READY handshake signals.
Write operation (address and data channels):
AWADDR: write address
AWVALID: write address valid; source is Master
AWREADY: write address ready; source is Slave
WDATA: write data
WVALID: write data valid; source is Master
WREADY: write data ready; source is Slave
Read operation (address and data channels):
ARADDR: read address
ARVALID: read address valid; source is Master
ARREADY: read address ready; source is Slave
RDATA: read data; source is Slave
RVALID: read data valid; source is Slave
RREADY: read data ready; source is Master
Write response channel (owned by the slave):
BVALID: write response valid; source is Slave
BREADY: write response ready; source is Master
What is the maximum amount of allowable data that can be sent in a single Write transaction from an AXI Master as per the protocol?
The maximum allowable transfer size (AWSIZE) is 128 bytes per transfer and the maximum allowable burst length is 16 transfers (AXI3).
So the maximum is the product 128 * 16 = 2048 bytes.
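A one-line check of that arithmetic (assuming the AXI3 limits quoted above):

  module max_write_burst_demo;
    initial begin
      // 2**7 = 128 bytes per transfer, 16 transfers per burst (AXI3)
      $display("max bytes per write burst = %0d", (1 << 7) * 16);   // 2048
    end
  endmodule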
Explain how a WRAP burst is an example of cache line access?
Let us take the following system scenario:
+------------------------------------+
| Processor with internal cache (L1) |   (AXI Master)
+------------------------------------+
                  |
            +----------+
            | L2 cache |
            +----------+
                  |
         +------------------+
         | AXI interconnect |
         +------------------+
                  |
      +-------------------------+
      | AXI Slave (DDR memory)  |
      +-------------------------+
The L2 cache sits in the path between the processor and the interconnect.
Any transfer that can access the cache will check the cache contents (a cache lookup) before potentially accessing the downstream memory, which in this case is DDR memory.
INCR is the simplest burst type, starting at a lower address and stepping up sequentially to higher addresses. INCR bursts can also be used to perform a cache linefill, but the problem with that burst type is that you might need to complete the entire linefill before the data you actually want is stored in the cache and made available to the processor. This is where the WRAP burst has an advantage.
A WRAP burst fetches the important data first (which the processor actually
wants) and then completes the cache line fill around that important data.
In system-level terminology, this important data that the processor actually wants from the accessed location of the cache line is called the "critical word".
As an example, with an 8-word cache line, if the processor wanted to read data from address 0x18 (the 7th entry in that cache line, if the data were cached), an INCR burst would need to fetch the data for 0x00, 0x04, 0x08, 0x0C, 0x10, and 0x14 before finally getting the 0x18 data the processor wants (at which point the processor is no longer stalled), and then the final 0x1C cache line entry is filled.
Instead, if we use a WRAP burst, the burst can start at 0x18 (so the processor is no longer stalled), and the cache line then fills up around this "critical word", with accesses to 0x1C, 0x00, 0x04, 0x08, 0x0C, 0x10, and 0x14.
There are still 8 memory accesses to perform the cache linefill, but in most cases the WRAP burst type stalls the requesting processor for fewer cycles than the INCR burst type.
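The beat-address sequence described above can be reproduced with a small sketch (function name illustrative). It applies the wrap rule: the burst wraps back to the boundary aligned to (bytes per transfer x burst length).

  module wrap_burst_demo;
    function automatic void print_wrap_addrs(int unsigned start_addr,
                                             int unsigned number_bytes,
                                             int unsigned burst_length);
      int unsigned total       = number_bytes * burst_length;
      int unsigned lower_bound = (start_addr / total) * total;   // wrap boundary
      int unsigned addr        = start_addr;
      for (int i = 0; i < burst_length; i++) begin
        $display("beat %0d : 0x%0h", i, addr);
        addr += number_bytes;
        if (addr >= lower_bound + total)   // wrap back to the boundary
          addr = lower_bound;
      end
    endfunction

    initial begin
      // 8-beat WRAP burst of 4-byte transfers starting at 0x18:
      // prints 0x18, 0x1C, 0x00, 0x04, 0x08, 0x0C, 0x10, 0x14
      print_wrap_addrs(32'h18, 4, 8);
    end
  endmodule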
Is EBT supported in AXI? How can the AXI Master disable further writing of the transfer? How can this be handled in a READ transfer?
No, early burst termination is NOT supported in AXI. An AXI Master can disable further writing by deasserting all the write strobes, but it must still complete the remaining transfers of the burst. For reads, the master must accept all the remaining read data; simply discarding read data that is not required can result in lost data when accessing a read-sensitive device such as a FIFO.
What is the simple definition of cache coherency?
Cache coherency is a mechanism that keeps all the caches in a system consistent, so that every cache observes the same data; in AMBA systems this is supported by the additional extensions provided by the ACE (AXI Coherency Extensions) protocol.
L1 cache is specific to each core, while the L2 cache is specific to a processor subsystem.
Example: each core has its own L1 cache, and all the cores within a subsystem share a single L2 cache.
Is this statement true? Burst length = AxLEN + 1
TRUE. For example, AWLEN = 3 gives a burst length of 4.
What is the role of system software with respect to the cache address allotment?
System software decides which addresses are cacheable and which are non-cacheable.
Accordingly, the processor drives the AWCACHE signal so that accesses to cacheable addresses are cached.
What will happen if the address is not present in the cache?
On a miss, the processor allocates an entry in the cache, fetches the data from memory, and places it in the cache.
Explain the need for cache coherency.
If an address is not present in the cache, the processor allocates a cache entry, fetches the data, and places it in the cache. During this process there is a chance that L1 and L2 go out of sync.
For example, suppose address 'h1000 is present in DDR memory, in L1, and in L2. If the copy in L1 is updated but L2 is not, there must be a mechanism to bring them back into sync. Such a mechanism is called cache coherency.
Explain cache prefetching.
Prefetching refers to retrieving & storing data into buffer memory (cache) before the processor requires the data.
When the processor wants to process the data, it is readily available and can be processed within a short period of time.
Without a cache, the processor would have to fetch the data directly from the memory address, which could introduce a delay.
Cache prefetching is a speed-up technique used by the processors where instructions/data are fetched before they are needed.
AWCACHE[1]: for writes, this means that a number of writes can be merged together.
ARCACHE[1]: for reads, this means that the location can be prefetched, or that a single fetch can serve multiple read transactions.
System software decides which addresses are cacheable and which are non-cacheable. Accordingly, the processor generates the necessary attributes on the AWCACHE/ARCACHE signals to tell system-level caches how the transaction may be handled.
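A minimal, illustrative decode of the AXI3-style AxCACHE bits referred to above (AXI4 renames bits [1] to [3]); this is not a legality check of bit combinations.

  module axcache_decode_demo;
    function automatic void decode_axcache(logic [3:0] axcache);
      $display("Bufferable     (bit 0): %0b", axcache[0]);
      $display("Cacheable      (bit 1): %0b", axcache[1]);
      $display("Read-Allocate  (bit 2): %0b", axcache[2]);
      $display("Write-Allocate (bit 3): %0b", axcache[3]);
    endfunction

    initial decode_axcache(4'b1111);   // e.g. write-back, read and write allocate
  endmodule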
What is the meaning of cache HIT / MISS?
If requested data is found in cache memory then it is called cache HIT.
If the requested data is not present in the cache, then it can be termed a cache MISS.
What is the purpose of RA and WA?
RA (Read-Allocate): if HIGH, it means that if the transfer is a read and it misses in the cache, it could be allocated.
WA (Write-Allocate): if HIGH, it means that if the transfer is a write and it misses in the cache, it could be allocated.
How is a protection mechanism provided in the AXI protocol?
Through the AxPROT attributes, which identify different variants of accesses:
a. privileged
b. normal
c. secure
d. non-secure
What is the default access that an AMBA-based system generates?
The default accesses are: 1. privileged 2. secure
Master1 performs an EX-READ to a slave address. Before Master1's EX-WRITE, Master2 performs an EX-READ on the same address of the same slave. What will happen in this scenario in terms of the exclusive access result?
Master1's exclusive access fails. If a master does not complete the write portion of an exclusive operation, a subsequent exclusive read changes the address that is being monitored for exclusivity.
M1 performs an EX-RD to a slave address, M2 performs a normal write to the same address. What will happen if M1 later tries an EX-WR to the same location?
The exclusive write fails. The slave has been monitoring the location on M1's behalf since M1's earlier EX-RD; because M2 has written to that location in the meantime, M1's EX-WR does not update memory and receives OKAY instead of EXOKAY. Preventing this kind of memory-overwrite hazard is the fundamental advantage of exclusive access in AXI.
M1 performs an EX-RD to an AXI slave address, and then M1 performs a normal write to the same address. What will happen?
The normal write to the monitored location clears the slave's exclusive monitor, so a subsequent exclusive write by M1 to that address will fail.
How does the slave treat the EX RD operation initiated by the master?
The AXI slave starts monitoring the address on which the exclusive read was initiated, together with the ARID provided by the master, until either a write occurs to that location or another exclusive read with the same ARID value resets the exclusive-access monitoring logic in the slave to a different address.
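A single-entry sketch of that monitoring behaviour (illustrative only; a real slave may implement several monitors):

  module ex_monitor_demo;
    bit          monitoring;
    logic [31:0] mon_addr;
    logic [3:0]  mon_id;

    // Exclusive read: (re)arm the monitor for this address/ID pair.
    function automatic void ex_read(logic [31:0] addr, logic [3:0] arid);
      monitoring = 1;
      mon_addr   = addr;
      mon_id     = arid;
    endfunction

    // Any write to the monitored address clears the monitor.
    function automatic void normal_write(logic [31:0] addr);
      if (monitoring && addr == mon_addr) monitoring = 0;
    endfunction

    // Exclusive write: succeeds (EXOKAY) only if the monitor is still armed
    // for the same address/ID; otherwise it fails (OKAY, memory not updated).
    function automatic bit ex_write(logic [31:0] addr, logic [3:0] awid);
      return monitoring && (addr == mon_addr) && (awid == mon_id);
    endfunction

    initial begin
      ex_read(32'h1000, 4'h1);
      normal_write(32'h1000);                             // another master writes
      $display("EXOKAY? %0b", ex_write(32'h1000, 4'h1));  // 0 -> exclusive fails
    end
  endmodule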
How are aligned addresses calculated?
An address is aligned if (start address % transfer size) == 0, where transfer size = 2**AxSIZE bytes.
Example 1: start address = 0x4, AWSIZE = 2, so transfer size = 2**2 = 4 bytes. 0x4 % 4 == 0, hence the address is aligned.
Example 2: start address = 0x7. 0x7 % 4 != 0, hence the address is unaligned.
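The same check as a small runnable sketch (function name illustrative):

  module align_check_demo;
    function automatic bit is_aligned(int unsigned start_addr, int unsigned axsize);
      int unsigned transfer_size = 1 << axsize;   // bytes per beat = 2**AxSIZE
      return (start_addr % transfer_size) == 0;
    endfunction

    initial begin
      $display("0x4 with AWSIZE=2 aligned? %0b", is_aligned('h4, 2));  // 1
      $display("0x7 with AWSIZE=2 aligned? %0b", is_aligned('h7, 2));  // 0
    end
  endmodule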
There is one byte lane strobe for every eight data bits, indicating which bytes of the data bus are valid. Explain this statement.
The write data bus carries one WSTRB bit per byte lane, i.e. WSTRB[n] corresponds to WDATA[(8n)+7 : 8n]. When a strobe bit is HIGH, the corresponding byte lane holds valid data for the transfer; lanes whose strobe is LOW are ignored, which is how narrow or partial writes are expressed.
Is it possible to have exclusive access for unaligned transfers?
No. The address for an exclusive operation must be aligned to the total number of bytes in the transaction.
Is an AWSIZE/ARSIZE value of 5 allowed in exclusive access?
Yes. AxSIZE = 5 corresponds to 32 bytes per transfer. The number of bytes to be transferred in an exclusive access must be a power of 2, i.e. 1, 2, 4, 8, 16, 32, 64, or 128 bytes, so 32 bytes is allowed.
What is the dependency of AWCACHE[3:0] / ARCACHE[3:0] signals on exclusive access?
Can a transaction be cacheable and still be exclusive? The answer is NO. An exclusive access must not be cacheable, because the slave (or monitor) that tracks exclusivity must see the transactions.
In the response signalling mechanism, what is the difference between the responses for READ & WRITE?
For WRITE: there is just one response for the entire burst, not one for each individual data item within the burst.
For READ: the slave can provide a different response for each transfer within a burst.
For example:
in a burst of 16 read transfers, the slave might return an OKAY response for 15 of them and a SLVERR response for the 16th item.
How can out-of-order transactions improve system performance?
The interconnect can enable transactions with fast responding slaves to
complete in advance compared to earlier transactions with slower slaves.
Complex slaves can return read data out of order. For example, a data item
for a later access might be available from an internal buffer before the data
for an earlier access is available.
Is it possible for an AXI Master to have its transactions completed in the same order? If yes, how is it possible?
Yes, provided the transactions all carry the same ID tag.
How does the AXI interconnect ensure that ID tags from all the masters are unique?
In a multi-master system, the interconnect appends additional information to the ID tag to ensure that ID tags from all the masters are unique. The ID tag is similar to a master number, with the extension that each master can implement multiple virtual masters within the same port by using the ID tag to indicate the virtual master number.
Why is write data treated as buffered?
Write data is treated as buffered so that the master can perform write
transactions without slave acknowledgment of previous writes.
Must the slave always provide the responses to bufferable transactions in a system?
No. The interconnect can provide the responses.
When is WLAST asserted?
WLAST is asserted for the last data item of the burst by the AXI Master.
An AXI slave MUST NOT give read data unless the read address phase completes. Is this TRUE? Explain how.
Yes. Unless both ARVALID and ARREADY have been seen HIGH, RVALID cannot be driven HIGH. It is a READ transaction: unless the master drives the address for fetching the data, the read transaction cannot be performed. Without a valid read address there cannot be a read.
What is the address boundary calculation?
The total number of bytes in the burst is transfer size * burst length.
Example: burst length = 4 (AWLEN = 3), AWSIZE of 4 bytes (32 bits).
The burst therefore spans 16 bytes, i.e. addresses 0x0 to 0xF.
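The same calculation as a small sketch:

  module burst_span_demo;
    initial begin
      int unsigned awlen  = 3;                       // burst length = AWLEN + 1 = 4
      int unsigned awsize = 2;                       // 2**2 = 4 bytes per transfer
      int unsigned total  = (1 << awsize) * (awlen + 1);
      $display("burst spans %0d bytes (0x0 - 0x%0h)", total, total - 1);
    end
  endmodule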
How to set all the WSTRB bits to ‘1’?
By default, the WSTRB member of the master transaction is random and would take random bit values when randomized. To force all strobe bits to '1', add a constraint such as the following during randomization:
  constraint wstrb_all_ones_c {
    foreach (wstrb[i])
      wstrb[i] == (1 << (1 << this.burst_size)) - 1;
  }
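For context, here is an illustrative (hypothetical) transaction class showing where such a constraint could live and how it would be exercised; the field names mirror the snippet above and are not from any particular VIP.

  class axi_write_txn;
    rand int unsigned burst_size;        // AxSIZE-style encoding (bytes = 2**burst_size)
    rand logic [7:0]  wstrb[];           // one strobe word per beat (up to 8 lanes here)
    constraint size_c    { burst_size inside {[0:3]}; wstrb.size() inside {[1:16]}; }
    constraint strb_on_c {
      foreach (wstrb[i])
        wstrb[i] == (1 << (1 << burst_size)) - 1;
    }
  endclass

  module wstrb_constraint_demo;
    initial begin
      axi_write_txn txn = new();
      void'(txn.randomize());
      foreach (txn.wstrb[i]) $display("wstrb[%0d] = %b", i, txn.wstrb[i]);
    end
  endmodule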