Most current multi-core processors have an L2 cache that is shared among all cores. Cache contention is the major problem in such architectures: it occurs when one core evicts parts of the shared L2 cache that are in use by other cores. Cache partitioning is a technique implemented in the operating system to address cache contention, and page coloring is a software-based technique, implemented in the OS, that divides the available cache among the applications competing for it. Cache partitioning can be static or dynamic. This paper discusses software-based cache partitioning using page coloring.
Chip multiprocessors (CMPs) are highly prevalent in today's technology, which demands high bandwidth and low power. A CMP replicates multiple processor cores on a single chip and uses a common directory for sharing data. Fig 1 depicts a multi-core cache hierarchy: each processor core is connected to its own private L1 cache, while the L2 cache is shared among all cores. Allowing multiple cores to access the shared L2 cache can result in cache contention, which in turn increases L2 cache misses. A solution to cache contention is cache partitioning, which can be software based or hardware based. This paper first discusses the multi-core cache hierarchy, then describes the problem of cache contention, and finally presents software-based L2 cache partitioning using the technique of page coloring.
Directory-based cache coherence protocol architecture
The CMP system consists of multiple processor cores, each with its own private L1 cache, and uses the L1 and L2 caches for storing instructions and data. Each core has a dedicated L1 cache, while the L2 cache is shared among all cores. The L2 cache is inclusive: any data present in an L1 cache is also present in the L2 cache. A processor core and its L1 cache can be powered down, but the L2 cache must always remain on. A directory controller associated with the L2 cache keeps a record of where each block is cached and forwards data to the L1 caches whenever shared data is updated. Every cached block in the L2 cache carries the following bits:
Valid: set if the block contains valid data.
Dirty: set if the data in the block has been written after it was fetched from main memory.
Shared: an N-bit vector, where N is the number of L1 caches (processor cores). The ith bit is set to 1 if the ith L1 cache shares the block; the directory controller uses this vector to broadcast updates to the active caches sharing the block.
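The per-block directory state described above can be sketched as follows. This is a minimal illustrative model, not the paper's implementation; the class and field names are assumptions.

```python
from dataclasses import dataclass, field

N_CORES = 4  # assumed number of processor cores / private L1 caches

@dataclass
class DirectoryEntry:
    """Per-block state kept by the L2 directory controller."""
    valid: bool = False   # block contains valid data
    dirty: bool = False   # written since it was fetched from main memory
    # N-bit sharing vector: bit i is set if L1 cache i shares the block
    sharers: list = field(default_factory=lambda: [False] * N_CORES)

    def add_sharer(self, core_id):
        # set the ith bit when core i caches this block
        self.sharers[core_id] = True

    def sharer_ids(self):
        # cores the controller must notify on updates/invalidations
        return [i for i, s in enumerate(self.sharers) if s]

entry = DirectoryEntry(valid=True)
entry.add_sharer(0)
entry.add_sharer(2)
```

With cores 0 and 2 caching the block, the controller would broadcast updates only to those two L1 caches.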
If data is present in any L1 cache, it is also present in the L2 cache system; accordingly, the L2 cache must be at least as large as the combined capacity of all L1 caches. Each L1 cache has dual tag directories for its cached blocks, one of which is always on. While a processor is in sleep mode, the powered tag directory receives the "invalidate" messages sent by the L2 directory controller for blocks that are updated. The L1 and L2 caches remain consistent, since any update a processor makes to its cached blocks is sent to the shared L2 cache.
Updates to any cached block are sent promptly by the processor to the shared L2 cache; hence L1 and L2 are considered consistent. The L2 directory controller forwards the updated data to all "active" processors sharing it, and sends "invalidate" messages to processors that are powered down, since it is not known whether a sleeping processor will need the data when it leaves "sleep" mode. If the processor requires the data after it wakes up, it can obtain the updated copy from the L2 cache. When a shared block held by a sleeping processor is invalidated, the L2 directory removes that processor from the block's sharing information, preventing further messages about the block between the processor and the L2 cache controller.
An L1 cache line can be in one of two states: "Invalid" or "Valid". On a write, the corresponding entry in the L2 cache is updated, and the L2 cache controller forwards the data, via the directory, to the L1 caches of other processors if required. This immediate "write-through" arrangement between the L1 and L2 caches may increase traffic towards the L2 cache, but it assures consistency between the two levels and allows a processor to be put into "sleep" mode without first having to flush updates to the L2 cache.
An L2 cache line can be in the states "Valid/Invalid", "Clean/Dirty", and "Shared/Non-shared". The Shared state is represented by an N-bit vector, where N is the number of L1 caches (processor cores); the ith bit is set to 1 if the ith processor's L1 cache shares the block. Every cache entry starts in the "Invalid" state. When data is first loaded from main memory by the L2 system, the line enters the "Valid and Clean" state; this initial load fills the L2 cache and the L1 cache of the requesting processor concurrently. The L1 read and write operations are:
Read Hit: the requested data is available in the L1 cache; no state change occurs.
Read Miss: the requested data is not in the L1 cache. The request is sent concurrently to main memory and to the L2 cache system. If the data is present in the L2 cache, it is supplied to the L1 cache, the corresponding bit of the sharing vector is set, and the outstanding memory request is aborted. If the data is not in the L2 cache, it is brought from main memory into the L2 cache in the "Clean" state and then supplied to the L1 cache after the sharing vector is updated.
Write Hit: the requested data is in the L1 cache. The processor writes the data, and the updated value is sent to the L2 cache, which marks the block "Dirty" and consults the sharing vector to determine which other processors share the data. The L2 cache forwards the new data only to those processors, updating the sharing vector if required. The copies sent to the L1 caches are in the "Clean" state, since they are consistent with the data in the L2 cache.
Write Miss: handled like a "Read Miss": the request is sent concurrently to main memory and to the L2 cache system, and if the data is present in the L2 cache the memory request is aborted. The data is supplied to the requesting L1 cache, the write is performed and propagated, and the sharing vector is updated appropriately.
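The four transactions above can be modeled with a toy simulation. This is a simplified sketch under the write-through assumption described earlier; the data structures (`l1`, `l2`, the `"sharers"` set) are illustrative, not taken from the paper.

```python
def l1_read(core_id, addr, l1, l2, memory):
    """Read Hit / Read Miss handling (simplified)."""
    if addr in l1[core_id]:                  # Read Hit: no state change
        return l1[core_id][addr]
    if addr not in l2:                       # L2 miss: fill from memory, "Clean"
        l2[addr] = {"data": memory[addr], "dirty": False, "sharers": set()}
    l2[addr]["sharers"].add(core_id)         # update the sharing vector
    l1[core_id][addr] = l2[addr]["data"]     # fill the requesting L1
    return l1[core_id][addr]

def l1_write(core_id, addr, value, l1, l2, memory):
    """Write Hit / Write Miss: write through to L2, which marks the block
    Dirty and forwards the new value to the other sharing L1 caches."""
    if addr not in l1[core_id]:              # Write Miss: fetch first, as in a read
        l1_read(core_id, addr, l1, l2, memory)
    l1[core_id][addr] = value
    l2[addr]["data"], l2[addr]["dirty"] = value, True
    for s in l2[addr]["sharers"]:
        if s != core_id:                     # forward the update to active sharers
            l1[s][addr] = value

memory = {0x100: 7}
l1 = [dict(), dict()]                        # two cores, two private L1 caches
l2 = {}
l1_read(0, 0x100, l1, l2, memory)            # core 0: Read Miss fills L2 and its L1
l1_read(1, 0x100, l1, l2, memory)            # core 1: Read Miss, hits in L2
l1_write(0, 0x100, 9, l1, l2, memory)        # Write Hit: L2 updated, core 1 forwarded
```

After the write, core 1's L1 copy and the L2 block both hold the new value, matching the consistency argument above.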
The proposed protocol ensures that data integrity is maintained even while a processor is powered down: any update made by a processor is also reflected in the L2 cache system. This integrity follows from the simple cache-coherence transactions and from the consistency of the L1 and L2 caches. In a powered-down processor, the only component that remains on is a tag directory. When the processor wakes up, it can continue using the data in its L1 cache; any shared data that was updated during its sleep period will have been invalidated by the L2 cache controller, and the processor can obtain a fresh copy from the L2 cache if it needs it. Under this protocol a processor can therefore be shifted to "sleep" mode without any loss of data or inconsistency, and the overhead of coherence operations remains low.
Cache contention is the major drawback of this shared L2 cache architecture. It occurs when multiple CPU cores compete for the single shared L2 cache: with uncontrolled sharing, one core can evict useful L2 content that belongs to another core. Such contention increases L2 cache misses, which degrades application performance. Uncontrolled sharing of the L2 cache also reduces the system's ability to enforce priorities and to provide Quality-of-Service (QoS). For instance, a low-priority application running on one core that streams rapidly through the L2 cache can consume the entire cache, evicting the data of high-priority applications scheduled on other cores.
Page coloring, also called "cache coloring", is a software technique that controls the mapping between physical memory pages and processor cache lines. In a physically indexed shared last-level cache, each physical page maps to a contiguous group of cache lines; therefore, all pages with the same color map to the same group of cache lines. By applying page coloring, the OS can modify its physical page allocation mechanism so that each processor is able to cache the maximum number of its pages.
The page coloring technique can be applied to provide cache partitioning in the L2 cache. When a core requests a new physical page, the operating system allocates a page that maps to the slots in the L2 cache assigned to that application, which isolates its L2 cache usage. Fig 2 depicts the page coloring process: every physical page has a fixed mapping to a physically adjacent group of cache lines in the physically indexed L2 cache.
Fig 2. Page and cache line mapping
The figure shows that the physical pages labeled Color A all map to the same group of physically adjacent L2 cache lines, also labeled Color A. Similarly, the group of cache lines labeled Color B maps to the physical pages labeled Color B, which in turn are mapped to the virtual pages of their respective processors. In general, several physical pages map to the same color group of cache lines in the physically indexed L2 cache, and these pages back the virtual pages of their respective cores.
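The color of a page follows directly from the cache geometry. A minimal sketch of the computation, using illustrative cache parameters (the sizes below are assumptions, not taken from the paper):

```python
# Page-color computation for a physically indexed L2 cache.
CACHE_SIZE = 2 * 1024 * 1024   # 2 MiB shared L2 (assumed)
ASSOC      = 8                 # 8-way set associative (assumed)
PAGE_SIZE  = 4096              # 4 KiB physical pages

# One color per page-sized region of one cache way; pages of the same
# color map to the same group of L2 cache lines.
NUM_COLORS = CACHE_SIZE // (ASSOC * PAGE_SIZE)   # 64 colors here

def page_color(phys_addr):
    """Color of the physical page containing phys_addr."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS
```

With these parameters the OS has 64 colors to distribute among applications; giving an application 16 of them confines it to a quarter of the L2 cache.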
In this section we show the motivation for application-based cache partitioning.
Why application cache partitioning?
The runtime behavior of multithreaded applications shows characteristics that call for application-based dynamic cache partitioning:
Different threads belonging to the same application may have different cache requirements from one another.
Threads of the same application interact in the cache in different ways.
Problems in current cache partitioning schemes
The current partitioning schemes are either throughput based or fairness based. However, when the threads belong to the same application, a throughput-based scheme may, in the process of improving throughput, end up improving the performance of non-critical threads, which has little impact on application performance. In other words, most existing schemes try to improve a global metric such as throughput without caring about the thread relationships; there is no targeted effort to improve the performance of an individual application locally, even though overall processor throughput may improve.

Fairness-oriented schemes, on the other hand, try to ensure that the impact of cache sharing is uniform for all the threads in a shared cache environment with LRU. This is akin to mimicking a private cache configuration. These schemes ensure that all threads, including the critical-path thread, make balanced progress. Chang and Sohi allocate a bigger partition of the cache to one of the threads; however, each thread receives that bigger partition for a fixed quantum in round-robin fashion, thereby improving fairness. In comparison, speeding up the critical-path thread can be more beneficial in the intra-application case.
Required hardware support
In this section, we discuss the underlying hardware enhancements required to implement this technique. There are at least two options for partitioning a shared cache. The first approach is to use reconfigurable caches, where the cache hardware structures are modified at runtime. This approach may lose considerable data during reconfiguration; the cache also remains unavailable while reconfiguration is in progress, and the hardware complexity increases. The second approach is to partition the cache implicitly by modifying the replacement algorithm used by the shared cache. In this case there is no sudden reconfiguration but a gradual move towards the intended partition, which also does away with the problems of cache unavailability during reconfiguration and heavy hardware complexity.

When a thread suffers a cache miss and the number of cache ways belonging to it is less than its assigned number of partition ways, a cache line belonging to some other thread is chosen for replacement. If, on the other hand, the number of ways belonging to the thread is greater than or equal to its assigned number, one of the thread's own cache lines is chosen for replacement. In this way, the cache is incrementally partitioned via the replacement policy. To implement this strategy, each cache set is assigned four counters that record the current assignment of ways among the four threads, plus four further counters holding the current target assignment for each thread. Upon a miss, if the thread's current-assignment counter is below its target, a way belonging to some other thread is replaced; otherwise, one of the thread's own ways is replaced. Note that least-recently-used (LRU) replacement is still applied, but now in a per-thread sense. Essentially, cache partitions are maintained by controlling which thread can evict which cache line: a thread/core can access a cache line present in another thread's partition, but it cannot evict a cache line belonging to another thread's partition.

Current Scenario
Currently there are only a few algorithms, and not equally efficient ones, implemented in shared L2 caches for dynamic allocation with cache coloring. Static allocation in the L2 cache leads to cache contention and to some wastage of cache memory.

In the current shared L2 cache, more than one thread can contend for cache space at the same time, and the contending threads can be adversely affected. The least-recently-used (LRU) policy can let some threads occupy most of the shared cache for only a small performance gain, while other threads that would exhibit good cache behavior starve. Many techniques have been proposed to limit this performance impact and improve overall throughput. Some of them partition the L2 cache among concurrently executing applications to restrain contention, without targeting individual application performance; many further schemes address fairness and quality of service in cache partitioning.
We propose a model of shared L2 cache partitioning based on page coloring, in which the most recently used pages are added to the cache based on clock cycles per instruction (CPI). In the CPI-based page coloring method, the most heavily used pages are kept in the L2 cache, where the throughput is higher.
Dynamic partitioning scheme
In the dynamic partitioning scheme, the cache is repartitioned at fixed intervals of execution, each of roughly 15 million instructions. During each interval the scheme gathers information such as cache hits, cache misses, instruction counts and cycle counts; at the end of the interval, the shared L2 cache is partitioned according to thread performance and space is allocated for the next interval. This speeds up the critical-path thread. The dynamic partitioning scheme also saves the cache capacity that is mostly wasted under the static allocation scheme.
Clock cycles per instruction (CPI) based partitioning
In the CPI partitioning scheme, the cache is partitioned according to the clock cycles per instruction of each thread. A thread with a higher CPI is allotted a larger amount of cache, and a thread with a lower CPI a smaller amount, so that a high-CPI thread can improve its performance with a larger cache allocation.
The formula for cache partitioning is given below. Initially:

No. of partitions = Total cache ways / No. of cores

At the end of each interval, the CPI of each thread is noted and the cache partitions are assigned in proportion to the CPIs:

partition = (CPI / ∑CPI) × Total cache ways

where CPI is the cycles per instruction of a single thread, computed as CPI = CCI / NI, with NI the number of instructions executed in the interval, CCI the clock cycles consumed in the interval, and IC the total instruction count.
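The proportional allocation above can be sketched directly. One detail the formula leaves open is rounding, since way counts must be integers; the sketch below hands leftover ways to the threads with the largest fractional shares, which is an assumption on our part.

```python
def cpi_partitions(cpis, total_ways):
    """Split total_ways cache ways among threads in proportion to their
    CPIs: partition_i = (CPI_i / sum of CPIs) * total_ways, rounded so
    that all ways are used (rounding rule is an assumption)."""
    total_cpi = sum(cpis)
    raw = [cpi / total_cpi * total_ways for cpi in cpis]
    ways = [int(share) for share in raw]          # integer part of each share
    leftover = total_ways - sum(ways)
    # give leftover ways to the threads with the largest fractional parts
    by_fraction = sorted(range(len(cpis)),
                         key=lambda i: raw[i] - ways[i], reverse=True)
    for i in by_fraction[:leftover]:
        ways[i] += 1
    return ways
```

For example, with thread CPIs of 2.0, 1.0 and 1.0 and an 8-way cache, the high-CPI thread receives 4 ways and the other two receive 2 ways each.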
As shown above, the partitions can be recalculated at the end of every interval. The procedure is:

At the end of each interval:
Note the CPI of each thread and determine the new cache partitions by redistributing cache ways based on thread performance.
Step 1: Reassign partitions. Let Maxthread be the thread with the highest CPI and Minthread the thread with the lowest CPI.
Step 2: Recalculate the thread CPIs after reassignment using a single-thread performance model, and let newThreadMax be the thread with the highest CPI. If newThreadMax ≠ Maxthread, go back to Step 1; otherwise, assign the newly calculated cache partitions.
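The Step 1 / Step 2 loop can be sketched as follows. The `assign_partitions` and `remeasure` callbacks are hypothetical stand-ins for the partitioning formula and the single-thread performance model, which the text does not specify in detail.

```python
def reassign_until_stable(initial_cpis, assign_partitions, remeasure, max_rounds=10):
    """Reassign cache ways from the current CPIs, re-estimate the CPIs,
    and stop once the highest-CPI thread no longer changes."""
    cpis = list(initial_cpis)                    # CPIs from the interval just ended
    partitions = assign_partitions(cpis)
    for _ in range(max_rounds):
        max_thread = cpis.index(max(cpis))       # Step 1: highest-CPI thread
        partitions = assign_partitions(cpis)     # reassign cache ways
        cpis = remeasure(partitions)             # Step 2: recalculate CPIs
        if cpis.index(max(cpis)) == max_thread:  # stable: apply the partitions
            return partitions
    return partitions                            # give up after max_rounds
```

The `max_rounds` bound is our addition, to guarantee termination if the highest-CPI thread keeps alternating between reassignments.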