Reconfigurable Hybrid Cache Hierarchy in 3D Chip-Multi Processors

By Matt Swarbrick

✅ Paper Type: Free Essay	✅ Subject: Computer Science
✅ Wordcount: 4385 words	✅ Published: 23 Sep 2019

Reference this

Share this: Facebook Twitter Reddit LinkedIn WhatsApp

Reconfigurable Hybrid Cache Hierarchy in 3D Chip-Multi Processors Based on a Convex Optimization Method

Abstract—Chip-Multi Processor (CMP) design process has been confronted by a number of serious insurmountable technological challenges such as memory wall, memory static power and limited memory bandwidth as potential bottlenecks of the system performance. To overcome these bottlenecks in future multi-core architectures, emerging memory technologies such as STT-RAM, PCRAM, eDRAM and resistive-RAM due to their many attractive features like high density, low leakage, and non-volatility, are being examined as potential replacements to existent main memories and on-chip caches. In this paper, we propose a novel reconfigurable hybrid cache configuration with hybrid memory technologies, in which STT-RAM, a type of Non-Volatile Memories, is incorporated in the last-level cache with SRAM. Based on the predicted bandwidth demands of different applications, the reconfiguration mechanism dynamically adapts the last-level cache space during run-time. We formulate this problem as a novel online convex optimization-based approach that solves the reconfiguration problem to choose the high performance last-level cache configuration during run-time. We accomplish experiments on a 16-core CMP which shows that the proposed architecture achieves on average and energy saving over non-reconfigurable SRAM-only cache and non-reconfigurable hybrid cache architectures.

Keywords— Reconfigurable Hybrid Cache, Non-Volatile Memory (NVM), Chip-Multi Processor (CMP), Convex Optimization, Leakage Power.

I. Introduction

Technology scaling has enabled increasing number of cores on a chip in chip-multiprocessors (CMPs) by continuously shrinking transistors sizes to improve performance. In this context, power consumption emerges as the main constraint for designing many-core systems. The ever growing power consumption leads to reduced chip reliability, an increase in the burden of heat dissipation and a decrease in the lifetime of mobile systems. Therefore, architecting heterogeneous CMPs and integrating cores and cache hierarchy made up of different materials on the same die emerges as an attractive design option to alleviate the power constraint. Emerging memory technologies, Non-Volatile Memories phenomena (NVMs), (e.g. Spin-Torque Transfer RAM (STT-RAM), Phase-Change RAM (PCRAM), and Resistive RAM (RRAM)) with benefits such as higher density and lower leakage (near to zero). Ccomparing with the traditional SRAM-based on-chip, storage architectures are potentially attractive to architect the new classes of memory modules and cache subsystem.

STT-RAM as one of the well-known NVM technologies, by combining the speed of SRAM, the density of DRAM and the non-volatility of Flash memory has created a new methodology for architecting new classes of memory modules. In addition, excellent scalability and very high integration with conventional CMOS logics are the other superior characteristics of STT-RAM [1]. In this work, we use emerging technologies, such as three-dimensional integrated circuits (3D ICs) [2] and NVMs [3-5] to exploit the device heterogeneity and design of dark-silicon-aware multi/many-core systems. The contributions of this paper can be summarized as below:

Using emerging memory technologies to construct a high-performance hybrid cache hierarchy to improve the energy consumption and performance of CMPs.
A bandwidth-aware hybrid reconfigurable cache hierarchy consisting of a high-performance hybrid cache with minimum power consumption during run-time.
Dynamically allocating the space of the cache banks according to the predicted demand bandwidth of various applications and endurance of STT-RAM cache ways, by utilizing a novel run-time cache reconfiguration mechanism.
A novel online convex optimization-based approach that solves the reconfiguration problem to choose a high-performance cache configuration with minimum energy consumption during run-time.

The rest of this paper is organized as follows. Section 2 reviews the prior related works. Section 3 describes the details of the proposed hybrid reconfigurable cache hierarchy consisting of a hybrid cache configuration and a proposed run-time reconfiguration mechanism. In Section 4, the details of convex optimization-based reconfiguration regulator and formulation of its fundamental processes are investigated. In Sections 5, experimental evaluation is presented. Finally, section 6 concludes the paper.

II. RELATED WORK

A number of researchers proposed CMP architectures with 3D stacked cache hierarchy/memory system to improve performance and reduce power consumption. The researches in [6-9] worked on synthesis of heterogeneous CMPs to extract better energy efficiency and performance in the dark-silicon era. In paper [6] proposed a design time framework for synthesis of heterogeneous dark silicon CMPs. The study in [7] exploited process variation to evaluate the benefits of selecting a more suitable subset of cores for an application in a given fixed dark silicon power budget to maximize performance. In order to maximize performance under power constraint for architectural synthesizing, in [8] and [9], cores that are homogeneous but synthesized with different power/performance targets have been exploited. All of the prior works [6-9] on the dark silicon phenomena over the past 6 years mainly focus on cores rather than un-core components. In this work, we efficiently consider the power impact of un-core components in parallel with that of cores unlike the previous researches on power management techniques and dark silicon which consider only the portion of power consumption related to on-chip cores. To improve performance and reduce power consumption of CMPs, study in [2] proposed CMPs with 3D stacked large SRAM cache or DRAM memory on a core layer, up to sixteen layers, without exceeding the maximum thermal limit. In [10] to build a hybrid cache architecture, 3D stacking STT-RAM is used on top of the SRAM. A number of previous researches proposed some proactive techniques to reduce the power consumption in multicore systems such as dynamic voltage and frequency scaling (DVFS) technique, thread scheduling, thread mapping, shutting-down schemes, and migration policies [11-14]. Also, these approaches limit their scope just to cores and provide power management for platforms implemented by using technology nodes in which dark silicon issue does not practically exists (e.g. 45nm CMOS technology) and leakage power is not too problematic. These proposed techniques are inappropriate for multithreaded applications. With increasing core count and parallelization of applications from emerging domains such as recognition, mining, synthesis and especially mobile applications, this issue has become important that in future many-core CMPs, workloads are expected to be multithreaded applications. The overall strategy used in our proposed hybrid cache designs is that the heavily written cache blocks in NVMs will be dynamically migrated to SRAM blocks for power efficiency.

Fig. 1. Overview of the dynamic hybrid reconfigurable last- level cache.

III. Proposed 3D CMP with reconfigurable heterogeneou

IV. s cache hierarchy

Fig. 1 shows the block diagram of the proposed CMP system that consists of the hybrid run-time reconfigurable last-level cache made from four memory technology regions. In this figure, CMP system consists of multiple cores. The SRAM is used in the L2 cache level, eDRAM in the L3 cache level, STT-RAM in the L4 cache level.The L1 caches are private to each core. The regulator monitors the 3D CMP state determined by number of misses. The convex optimization-based regulator monitors the CMP state by miss per second at level 1 of cache (L1) online. Then the optimal controller estimates the bandwidth demands of applications at the end of each execution time interval and then predicts the future bandwidth demands for the next time interval. The length of the time interval which reconfiguration mechanism is applied can be fixed, or depends on the operating system context switch. The optimization process can be performed online by each core which has the longest idle time among other cores at each time interval. It is considerable that this optimization process requires a small computational complexity. Furthermore, this process can be implemented by operating system during context switch without extra performance overhead.

V. Problem Formatlation

The details of the convex optimization-based reconfiguration regulator with formulate it’s defined in following two phases;

A. Demanded bandwidth prediction

The length of the evaluation time interval in which reconfiguration mechanism is applied, depends on the operating system context switch or can be fixed. At the end of each time interval as in Fig.2, reconfiguration mechanism is applied and the optimal on-line regulator predicts the future bandwidth demands of running applications for the next time interval. Therefore, based on (1), the predicted demand bandwidth (dbw) is measured by miss per second (t) at L1 cache.

$dbw = \frac{{L 1}_{m \times} {L 1}_{ls}}{t}$

(1)

Where L1_m,and L1_ls ; are the number of misses at L1 cache and line size of L1 cache , respectively.

In this context, the on-line regulator by taking into account the temperature distribution of ways of SRAM region and the number of writes which has been done onto the ways of STT-RAM region. These factors determine the physical location of the allocated cache ways to applications at each memory type region. To make an accurate prediction on future demand bandwidth, history information about the miss per second of L1 cache is exploited by the proposed technique.

The prediction is made by using a linear model to perform a best (d^th) order polynomial fit.The proposed policy can indeed forecast future trends of the demand bandwidth. In this technique, the linear model is used because the prediction length, L, is short and ranges usually from 1 to 10 samples.

For fitting the polynomial, the error is minimized within the observed window of context switch by using (2):

${||dbw - \overset{̌}{A} x||}_{2}^{2}, \forall t = 1, 2, \dots, N$

(2)

In the polynomial interpolation process, vector

$x_{t} \in R^{d + 1}$

and matrix

$\overset{̌}{A} \in R^{(d + 1) \times (N + L)}$

are used.

${dbw}_{t}$

contains the bandwidth requirements

$\forall t = 1, 2, . . ., N,$

and

$N$

is the length of the observation window of historical data. The structure of matrix

$\overset{̌}{A}$

is shown in (3).

By solving (2) as a least square’s minimization problem, vector x is derived. It is assumed that the derived linear model will hold for the next L data samples justly. The prediction on the future demand bandwidth is performed as result.

$\overset{̌}{A} = [\begin{matrix} 1 & 0 & 0 & 0 & \dots & 0 \\ 1 & 1 & 1 & 1 & \dots & 1 \\ 1 & 2 & 4 & 8 & \dots & 2^{d} \\ 1 & 3 & 9 & 27 & \dots & 3^{d} \\ . & . & . & . & \dots & . \\ 1 & N + L & {(N + L)}^{2} & {(N + L)}^{3} & \dots & {(N + L)}^{d} \end{matrix}]$

(3)

Therefore, the future demand bandwidth is given by (4):

${dbw}_{t} = \overset{̌}{A} x_{t}, \forall N \leq tN + L$

(4)

Where

${dbw}_{t} \in R^{p}$

for

$t > N$

is the demand bandwidth at time

$t$

and

$p$

is the number of different applications at specific time

$t$

. A weighting vector

$ω_{t}$

is computed to calculate the accuracy of this prediction. It is defined according to (5):

$ω_{t} = ρ_{t} - ||{dbw}_{t} - {\hat{dbw}}_{t}||, \forall N - L \leq t \leq N$

(5)

Where

$ρ_{t} \in R^{p}$

is a penalty vector. Penalty function

$ρ_{t}$

can be chosen to be linear, quadratic and exponential or in any other schemes, according to the impact that inaccurate prediction has on performance parameters.

${\hat{dbw}}_{t} \in R^{p}$

is the predicted demand bandwidth at time

$t$

by the on-line optimal regulator. The expression

${dbw}_{t} \in R^{p}$

acts its actual value. For choosing N and d, empirical studies have been used.

The absolute value of difference between

${dbw}_{t}$

and

${\hat{dbw}}_{t}$

represents the prediction error in (5). In fact, the smaller prediction error means the more reliable and accurate prediction. Since the prediction must be reliable in our formulation, we can define a cost function J to minimize this error to accurate prediction of the demand bandwidth corresponding to the future time frames. Function J is a weighted sum of the amount of predicted demand bandwidth that related target workload has not been executed. It is formulated as the following convex optimization problem:

$minimize J = \sum_{t = 1}^{L} ω_{t} r_{t} (6)$

$subject to : r_{t} = {dbw}_{t} - {pdbw}_{t}, \forall t (7)$

$r_{t} ≽ 0, \forall t (8)$

$0 ≼ {pdbw}_{t} ≼ {dbw}_{\max}, \forall t (9)$

Function

$J$

is minimized during

$L$

future time frames by the optimal regulator. Therefore, the demand bandwidth of running applications in

$L$

future time steps is computed.

In our model

$pdbw$

is optimization variable. As described in (9), it is assumed that

$pdbw$

to be continuous and ranging from zero to the maximum bandwidth that an application can demand. Maximum bandwidth that an application can demand relates to maximum frequency of occurred misses in L1 cache.

The vector

$r_{t} \in R^{p}$

as described formally in (8), is used to measure the performance of the optimal regulator in achieving the requested bandwidth at time

$t$

. Every entry of

$r_{t}$

, as described in (8), has to be equal or greater than zero because the system cannot execute program workloads that will arrive in future time steps. In (8) and (9) the symbols

$≽$

and

$≼$

mean element-wise inequality relations.

Since all equations in the proposed problem are convex models, the proposed problem is convex. Our convex problem presented in this section can be solved with polynomial time complexity using interior point method as in [15].As mentioned earlier, according to the predicted demand bandwidth at each time interval which is obtained by minimizing of the convex function

$J$

, the cache space is configured by the optimal regulator to the available size that is the nearest to the capacity point as shown in Fig.3. Note that, in the case of a set-based bank division, the number of cache banks assigned to a core must be power of two since the number of sets must be power of two.

B. Physical placemnt based on number of writes

We use an endurance model to determine the exact location of the allocated capacity in phase 1 based on limited write endurance of STT-RAM for optimal placement. This endurance constraint can be expressed as follows:

$\frac{\sum_{i = 1}^{P} {FREQ}_{i, st, w}}{{Endurance}_{STT - line}} \times {ST}_{st, x, y} < \frac{N}{2}, \forall x, y, st (10)$

Since STT-RAM has an endurable write threshold, we can only write a limited number of times in each line of STT-RAM. If the number of writes into one line is more than the threshold, that line will be destroyed. We assume a worst case scenario that all of write operations are written in one line until destroying that line and after that a new line is selected for rest of write operations. When 50% of lines in a STT-RAM memory bank have been destroyed, a new write operation only has half chance to go to a valid line which was not destroyed so far.

More specifically, there is equal chance of successful or unsuccessful write to the STT-RAM bank. If more than half lines of a STT-RAM banks is destroyed, chance of successful write to this bank is even less than half. Thus, the maximum tolerable destroyed lines for us to use a special STT-RAM bank is N/2. Note that, we assume the number of lines for a STT-RAM bank is equal to N. Thus, in our endurance constraint model, if placing a STT-RAM memory bank in the special position leads to destruction of more than half lines of that memory due to writing frequency of cores, STT-RAM bank is not chosen for that position.

In addition, our endurance model distributes write operations between STT-RAM banks and prevents from destruction of one STT-RAM bank due to many number of write operations when other banks are slightly or completely unused. Therefore, distribution of write operations is uniform between different STTRAM banks in our model. This phenomenon can increase life time of the proposed hybrid memory design. To determine the exact physical location of the specified capacity determined in phase 1, we will start from the first bank (ST1). If the related bank does not satisfy in (10), we cannot select that as one of the target bank for physical allocation and we will move in the counter clock wise direction to select other banks which satisfy the equation 10 as shown in Fig.4.

VI. Experimental Evaluation

A. Experimental setup

GEM5 simulator [16] is used to run our proposed architectures. This design is validated, by run our experimental framework driven by traces extracted from real application workloads.

Fig.5 shows platform setup are used in this work. The NVSim and CACTI [17] tools are used to evaluate the energy consumption and cache capacities of cache memories respectively, while 3D Noxim with ORION [18] are used to calculate the performance parameters and power consumption of the architecture.

Fig.5 presents a full proposed 16-core CMP architecture. In this design, each core is connected to a router. Each core tile has a TSV bus consisting of TSVs for address, data, and control signals to upper cache banks. The area of the core tile (consisting of a processing core, private 32KB L1 instruction and data caches, and the cache controller with cache tags) is 3.5 mm2 estimated by McPAT [19] and CACTI 6.0. The detailed system configurations are given in Table I.

TABLE.I. Parameters for a proposed CMP configurations.

Item	Description
No. of cores	16,4×4 mesh
Configuration	Alpha21164, 3GHz, area 3.5mm2, 32nm
Private L1 cache	SRAM, 4 way, 32B line, size 32KB per core
Shared caches	SRAM/ STT-RAM/ eDRAM 32B line, 1 to 3 levels, size of 256KB to 32MB
Main memory	4GB, 320 cycle access, 4 on-chip Memory Controllers at each corner node
Network Router	Two-stage wormhole switched, XYZ routing, virtual channel flow control, 2 VCs per port, a buffer with depth of 5 flits per each VC, 8 flits per Data Packet, 1 flit per address packet, each flit is set to be 16-byte long
Network Topology	3D network, each layer is an 4×4 mesh, each node in layer 1 has a router, 16 TSV links which are128b bi-directional in each layer

For experimental evaluation, we assumed that the area of each cache layer is equal to the area of the core layer. As shown in Fig.6, in a core-stack, at the first level of the cache hierarchy, there are four SRAM cache banks, each of which has 256KB capacity, placed directly on top of one core. Based on estimates from CACTI 6.0, a 1MB SRAM cache bank and associated router/controller have an area roughly equal to the area of one core. At the second and third levels of the cache hierarchy, there is a 4MB eDRAM and a 4MB STT-RAM cache bank placed directly on top of one core in each layer, respectively. We consider three layers in each cache level of the stacked hierarchy for the 16-core CMP. In each cache layer, there are 4 regions and total number of regions in each cache level is 12.So that, in the hybrid scenario, we have capacity for L2 cache is 16 × 1 MB SRAM bank, the capacity of each layer of L3 is 16 × 4 MB eDRAM cache banks, and the capacity of each layer of L4 is assigned as 16 × 4 MB STT-RAM cache banks.

The interface material between two adjacent silicon layers are connected by homogeneous TSVs. We use synchronization based multithreaded workloads selected from PARSEC [20], SPLASH-2 [21] and Phoenix [22], as shown in Table II, for performing our experiments.

B. Experimental results

In this work, the objective of our proposed architecture is reducing the total power consumption. For that purpose, we evaluate a 3D CMP with stacked cache hierarchy, in four different scenarios:

1) The CMP with fixed core frequencies in the core layer and pure SRAM-based L2 cache with fixed capacity (Baseline1 2MB SRAM l2 (16 bank SRAM, each of which has 128MB).

2) The CMP with fixed core frequencies in the core layer and reconfigurable SRAM-only cache levels in the stacked hierarchy (Baseline2 : L2 :2MB SRAM (16 bank SRAM, each of which has 128MB, L3 : 8MB SRAM (16 bank SRAM, each of which has 512kB).

3) The CMP with fixed core frequencies and hybrid L2/L3/L4 caches with fixed maximum available capacity at each level (Baseline3 : L2:2MB SRAM(16 bank SRAM, each of which has 128MB,L3 : 32MB STT-RAM (64 bank STT-RAM, each of which has 512kB).

4) The proposed CMP with heterogeneous cache hierarchy and heterogeneous frequency/voltage in each core specified based on design-time optimization problem (Reconfigurable Hybrid-Proposed) (Proposed: 2MB L2 (16 bank SRAM, each of which has 128kB), 32 MB L2 (64 bank STT-RAM, each of which has 512kB).

We compared the experimental results of the baseline designs with the proposed architecture to evaluate our work.

Fig.7 compares the energy delay product (EDP) of the proposed architectures normalized with the baseline 1. As shown in this figure, owing to the higher leakage power consumption, the fixed SRAM(baseline1) and reconfigurable SRAM (baseline2) demonstrate a higher EDP compared with our proposed architectures. The proposed architecture improves the EDP by an average of 61%, 91%, and 76% compared with baseline1, baseline2, and baseline3, respectively.

Fig. 8, the proposed power model estimates the power consumption of heterogeneous (hybrid) and other scenarios at a 3D CMPs under baseline2 normalized. Compared with baselines CMP, the proposed architecture improves the power consumption of baseline 1, baseline 2, and baseline 3 by about 99%, 95% and 90% on average under execution of multi-threaded workload.

Fig.9 compares the number of instructions per cycle (IPC) normalized with the baseline 3. In the average results, the proposed architecture shows an improvement about 17% compared with baseline3 and an average degradation of about 28% and 29% compared with baseline1 and baseline2 architecture, respectively. The reason of this degradation in is the reconfiguration time. The reconfiguration controller consumes additional cycles at each time interval.

Fig.7: Comparison of Energy Delay Product (EDP)

normalized to baseline1.

Workload	Benchmark suite
blackScholes	PARSEC
bodytrack	PARSEC
canneal	PARSEC
fluidanimate	PARSEC
streamcluster	PARSEC
swaptions	PARSEC
FFT	SPLASH-2
ocean	SPLASH-2
radix	SPLASH-2
barnes	SPLASH-2
LU.cont	SPLASH-2
LU.ncont	SPLASH-2
water.nsq	SPLASH-2
water.sp	SPLASH-2
kmeans	Phoenix

TABLE.II. Barrier synchronization based multithreadedbenchmarks.

Fig.8: Comparison of power consumption normalized to baseline2.

Fig.9: Comparison of IPC normalized to SRAM‐baseline3.

Fig.10. Contribution of SRAM vs STT-RAM banks.

In fig.10 shows the percentage of SRAM with STT-RAM contribution used in our proposed design under execution of PARSEC benchmark. In applications which are read-intensive workload such as fluidanimate and heavily memory accessed such as canneal , In our design uses 42% of SRAM and 58% of STT-RAM memories on average for PARSEC benchmark applications. So, the percentage of STT-RAM contribution is greater than SRAM. The reason for that, the proposed technique turns on STT-RAM banks incorporated with SRAM banks to solve the shortage problem of last level cache for these types of applications. However, in a write‐intensive workload, the contribution of SRAM banks is increased.Such as blackScholes workload, about 69% of banks are SRAM, while is about 31% for STT-RAM banks. These differences in the results belongs to the small amount of write-intensive data blocks or lower leakage energy of STT-RAM in comparison with SRAM.

VII. Conclusion

In this work, we propose a novel hybrid reconfigurable last-level cache architecture. This reconfigurable last-level cache consists of a hybrid cache reconfiguration and a run-time reconfiguration mechanism. Based on the predicted bandwidth demands of different applications, the reconfiguration mechanism dynamically adapts the last-level cache space during run-time. We formulate this problem as a novel online convex optimization based method that solves the reconfiguration problem to choose the high performance last-level cache configuration with minimum energy consumption during run-time. This online convex optimization method solves the reconfiguration problem based on the predicting of bandwidth demands of different applications and considering temperature distribution of cache ways to minimize the energy consumption.

We evaluate the proposed design method using Simics as the simulator to run our experiments. Results show that the proposed architecture improves performance and achieves an average

$77 %$

and

$52 %$

energy saving over non-reconfigurable SRAM-only cache and non-reconfigurable hybrid cache architectures.

References

[1]

[2] D. Shelepov, J. C Scale Integr. (VLSI) Syst.,vol.20,no.1,pp.191-196,2012.

[3] X. Dong, C. Xu, N. Jouppi, and Y. Xie, “NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Non-volatile

Matt Swarbrick

Matt holds a BA and MA certificate from Cambridge, and is an subject-matter expert in Business and Management. Matt also writes about subjects like Finance, Economics and Computing/ICT.

Share this: Facebook Twitter Reddit LinkedIn WhatsApp

Cite This Work

To export a reference to this article please select a referencing stye below:

Related Services

View all

Essay Writing Service

From £99

Report Writing Service

From £99

Student reading and using laptop to study

Assignment Writing Service

From £99

DMCA / Removal Request

If you are the original writer of this essay and no longer wish to have your work published on UKEssays.com then please click the following link to email our support team:

Request essay removal