# Study On The Multicore Processor Technology Computer Science Essay

Published:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Processors were originally developed with only one core. The core is the part of the processor that actually performs the reading and execution of instructions. Single-core processors can process only one instruction at a time. (To improve efficiency, processors commonly utilize pipelines internally, which allow several instructions to be processed together; however, they are still fed into the pipeline one at a time).

Multicore architectures [10], [12], [13] provide a solution to increase the performance capability on a single chip without requiring a complex system and increasing the power requirements. They allow for faster execution of applications by taking advantage of parallelism, or the ability to work on multiple problems simultaneously. A multicore processor is composed of two or more independent cores. All the cores are integrated onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package. Different multicore architectures use different ways of the caching and cache [14] sharing among the cores. The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at much higher clock-rates than is possible if the signals have to travel off-chip.

Each process has private memory for local data items and shared memory for globally shared data values. While the shared memory is partitioned among the cooperating processes (each process will contribute memory to the shared global memory), a process can directly access any item within the global address space with a single address.

PGAS combined the performance and data locality (partitioning) features of distributed memory with the programmability and data referencing simplicity of a shared memory (global address space) model. The PGAS programming model support for distributed data

structures. A collection of threads [10] operating in a partitioned global address space is logically distributed among cores.

We come to a problem of efficient implementation of a program on a parallel computer which requires the coordinated generation of low-level machine language instructions to exploit instruction level parallelism as well as a high-level run time to commands support such operations as load balancing [4]. Our work focuses on distribution [5] of the nodes of a binary search tree an efficient manner in the Multicore Programming Language X10's architecture.

The fundamental goal of X10 is to enable scalable, high-performance, high-productivity programming [7] in high-end computers. X10 introduces concurrency, distribution and locality. X10 is designed to increase the performance in case of supercomputing domains using the partitioned global address space. A computation is divided among a set of virtual processors known as Places and every task is assigned to a place.

The X10 language maps these places to processors based on the architecture of the system. The main advantage of the language is that internal architectural details of the machine are hidden from the user and working interface similar to a high level programming language.

A tree is nothing but more specifically arborescence [6]. An arborescence is an acyclic connected graph in which each node has zero or more child nodes and at most one parental node. Here our focus is on binary search trees (ordered binary trees) which have the following properties:

(i) The left sub-tree of a node contains only nodes with keys having values lesser than the parent node's key value.

(ii)The right sub-tree of a node contains only nodes with keys whose values are greater than the node's

key value.

(iii)Both the left and right sub-trees must also be binary search trees.

In general a node of a tree consists of a record rather than a single data element. The major advantage of choosing trees over other data structure is key efficiency in searching for a record which represents a real world entity.

Representation [6] of trees is done by using either a linked list or an array. The linked list representation is efficient when the tree is unbalanced whereas the array representation overtakes this when the tree is complete.

Basically the Binary Search Tree is distributed based on either the height of tree or level by level [18]. This approach leads to an increase in the number of steps taken in X10 to search for a node. This is because no two nodes in any path of the tree are located in the same place. Hence this requires more references to other distinct places.

## 2. X10 Architecture

X10 is a container-based object-oriented language similar to Java and C++, and more recent language such as Scala. Its basic components are places, activities, clocks and distributed multidimensional arrays [1].

A place is an abstraction for a computational unit with a locally synchronous view of shared memory. An X10 place is a repository for data and activities, corresponding loosely to a process or a processor. An X10 computation [10] runs over a large collection of places. The set of places available to a computation is determined at the time that the program is started, and remains fixed through the running of the program. Every object created by the program has a home place. Multiple memory locations in multiple places cannot be accessed atomically.

An activity is a statement being executed, independently, with its own local variables and it may be thought of as a very light-weight thread. An activity may synchronously and atomically use one or more memory locations in the place in which it resides. An activity may shift to another place to execute a statement block. The notion of asynchronous activities is the foundation for lightweight threads that can be created locally or remotely. Asynchronous activities address the requirements of both thread based parallelism and asynchronous data transfer.

Activities coordinate their execution by using various control and data structures. An activity may be running, blocked on some condition or

terminated. Every activity save the initial main activity is that spawned it. Thus, at any instant, the activities in a program form a tree. X10 uses this tree in crucial ways. At any time t, several activities may work independently, but synchronize together before proceeding to time t+1. X10 supports this communication structure with the generalization of barriers called clocks. Clocks [10] are designed so that programs which follows a simple syntactic rules will not have either deadlock and race conditions.

An X10 array object is a collection of multiple elements that can be indexed by multidimensional points that form the underlying index region for the array. Generally arrays are indexed by points, which are n dimensional tuples of integers.

Points comprises of rank [1], which is nothing but the number of coordinates they have. For example, the rank of a one dimensional array is one. A region is a set of points that have the same rank. All the points in a region have the same rank and a region itself. For example, the region for the identity matrix has a rank equal to two.

An array's distribution is a mapping from a region to a set of places, which describes where each element of a distributed array is kept. An array's distribution specifies how its elements are distributed across multiple places in the PGAS [7]. This distribution remains unchanged throughout the program's execution. There are number of factory methods making different distributions of X10, such as an unique distribution, a constant distribution and a block distribution.

X10 supports Class, Structure and other primitive data types. In this paper we provide an efficient mapping function for a binary search tree which will become an additional data structure to this already powerful language.

## 3.Distribution of Tree Nodes across Processor:

A node may be used to represent a heterogeneous packing of data. A tree comprises

of a collection of nodes linked in some way based on the property of the tree. It will not be an efficient way to place all the tree nodes at a single place which will be shown in our performance evaluation. To reduce the overall load at a single place, tree nodes are distributed across places. Our focus is on mapping a Binary Search Tree in X10 in which every node's left child has a key value less than the node's key value and every right child's key value greater than that.

## 3.1 Previous Works

Global Trees System [2] provides a multi-layered interface to the global address space view of distributed tree data structures, while providing scalable performance on distributed memory systems. The Global Trees System utilizes coarse-grained data movement to enhance locality and communication efficiency. This shared memory programming model can be provided at different levels of abstraction. At the highest level, modern parallel programming languages such as Chapel and Fortress offer support for a global address view of distributed data structures.

The key benefits of using this system include efficient shared memory style programming of distributed tress, tree specific optimizations for data access and computation.

## 3.2 Our Work

Our idea is based in terms of the working set model. Working set is assigned based on the size of subtrees of the tree. Locality of Reference [2] is on the sub tree whose size is comparatively greater than the sizes of other sub trees. The nodes of a sub tree are mapped to four number of places (default number of places in X10) based on our mapping scheme. The mapping is such that the first four larger sub trees are mapped to four places individually provided that the size of each subtree is not greater than half of the size of the tree. If the above condition fails then further partitioning on that subtree is done based on the mapping scheme.

Fig 1: Design for Mapping

Each node of the tree has two elements to maintain the sizes of its left and right subtree. When a node is to be inserted in the tree, a position is found by traversing the tree based on BST property from the root. When traversed either the left or the right element of the node is incremented by one based on the direction taken to find the position. This approach gives the size of the left and right subtrees of the entire node in the tree. The subtrees are mapped to places based on the Load Utilization Factor [4] and Working Set of our mapping scheme.

Tree data is grouped as per the scheme and searching the nodes in the tree can be serviced with a minimum inter-place communication [10]. Each process involved in the parallel computation may independently and asynchronously access the tree nodes with minimal explicit cooperation from other processes. While it provides fine grained node-level data access similar to shared memory programming models, internally the system is aware of the data distribution and is able to exploit data structure specific knowledge to improve locality and yield better performance.

Figure 1 shows various aspects for mapping the nodes of tree in our scheme. The function Find Subtree gets a tree as input and finds various subtrees. The function Allocation of subtree finds the respective places for the subtrees based on the Utilization factor and the working set. Finally each nodes of the sub tree are mapped to their respective place. The mapped results can be shown by taking advantage of the Programming Language Memory model which can tell where a node resides in terms of its place.

The structure of the BST for our work will lie in the form of one of the following two cases:

Case 1: In the simplest case, n(Pi)<n(T)/2 ∀ i∈ [1,4]

Here, number of nodes of all the sub trees is less than half the number of nodes in the tree. Each sub tree is mapped to a place. The top sub tree SR (Refer fig2) is mapped to a place which has minimum number of nodes.

Figure 2 shows the mapping process for case 1. Let the total number of nodes in the tree be n(T). The triangular box specifies the subtrees of the tree T. Let us define a cluster set which comprises of the nodes of the tree from level l=0 to l=1. Subtree which contain the cluster set is always mapped into the place Pj where Pj = minimum {n(P1), n(P2), n(P3), n(P4)}.

Fig 2: Mapping Design for Case (1)

Case 2: n(Pi)>=n(T)/2 ∀ i∈ [1,4]

Here, at least the size of one sub tree is greater than or equal to half the number of nodes in the tree.

Subtree Si is in place Pi ∀ i∈ [1,4]. Let us assume i=1 for which the above condition fails. Hence the subtree S1 is split into right and left subtrees. Let the maximum sized subtree is said as S11 and the other as S12. S11 is now mapped to the place P1 where S1 resided.

The mapping of the subtree S12 is now viewed in two subcases which follow:

Case i: n(S12)<minimum{n(Si})∀ i∈ [1,4]

Let Sk be the subtree which contains minimum {n(Si)} i ∈ [1,4]. If n(Sk) + n(S12) < n(T) / 2, then S12 is mapped to the place where Sk resides.

Otherwise split the subtree S12 into left and right subtrees. Let the maximum sized subtree is said as S12a and the other as S12b. Subtree S12a is mapped to the place where Sk. Place for the subtree S12b is found by applying the case i.

Case ii: n(S12)>minimum(n(Si))∀ i∈ [1,4]

Let Sk be the minimum sized subtree such that n(Sk)<n(S12). Now S12 and Sk interchange their roles (S12 is mapped to the place Where Sk resided and Sk finds a place as per the cases given).

Figure 3 shows mapping process for case 2. Here, the number of nodes in the subtree S1 is greater than half the number of nodes in the tree. So, find subtrees of subtree S1. Let S11, S12 be the subtrees of subtree S1. Map the maximum sized subtree (S11, S12) in the place where S1 resided and place for the other subtree is found by considering the cases.

Fig 3: Mapping Design for Case (2)

Applications that use tree data structures will usually access nodes within the tree in a fine-grained manner. This is because it is crucial to accommodate element-wise access in an efficient manner; hence nodes in the larger subtrees are grouped together.

This subtree is of sufficient size to offset the overhead of performing a communication. This mapping is transparent to the application, so it sufficiently produces better performance.

## 4. Performance evaluation

## 4.1 Performance in terms of number of place reference

To evaluate performance, let us define a binary search tree T containing n(T) nodes of height h(T). At level l=2 in the tree, the maximum number of sub trees available is four. Let the number of nodes in the ith sub tree Si be n(Si ) .

We now show the number of distinct place reference made while searching for a random node r.

For this we have to choose a initial level l for finding the subtree based on the below formula.

l = log2 P

where P represent the number of places available.

So for mapping in X10, we choose initial level l=2. One can also choose any level as the initial level based on the number of places available.

Let us define a probability function in which a successful search for every node r in T where 1<r<n(T) has a probability value Pr = 1/n(T).

Our evaluation [19] is based upon performing a search for all nodes in T and evaluating the performance in terms of distinct number of place references which increases the delay [1].

Since our model places the Sub tree Si in the place Pi where 1 < i < 4 and the cluster set in some place Px such that n(Px) < n(Pi) where 1 < i < 4.

The structure of the BST for our work will lie in the form of one of the two cases which follows:

Case1: In the simplest case, n(Pi)<n(T)/2 ∀ i

∈ [1,4]

Searching for all the nodes in T can be categorized in one of these two ways:

1a) Searching the nodes in the cluster set, number of distinct place reference required is

Eq.1

1b) Searching the nodes in all the sub trees, the number of distinct place required is

+ Eq.2

Where i ∉ some x such that place Px contains the Cluster set.

We have the cluster set in a single place, so in case 1a only one place reference is required to access any node within it. But in case 1b, one place reference is required to reach the root of the sub trees and a one more place reference for accessing the nodes in the sub trees except for some Sx which we have discussed earlier.

In effect, the number of distinct place reference required to search all the nodes of the tree is obtained by adding Eq.1 and Eq.2

## +

## +

## =

n(T) - + 2 * { - n(Sx) } + 1 * n(Sx) .

= n(T) - + 2 * - n(Sx)

Case2: n(Pi)>=n(T)/2 ∀ i∈ [1,4]

Here we further partition the sub tree Si available in the place Pi (for which the above condition holds) into two sub trees Si1 and Si2 .

For searching the nodes in T can be categorized in one of these three ways.

2a) Searching the nodes in the cluster set, number of distinct place reference required is

Eq.3

2b) Searching the nodes of all the subtrees except Si2 (which we have defined earlier) is further viewed as two sub cases.

(i) When the partitioned tree Si1(which we have defined earlier) and the cluster set resides in different places, searching the nodes in all the subtrees except Si2, number of distinct place reference required is

++ Eq.4

where i ∉ some x such that Sx has been partitioned and i ∉ some y such that Py contains the Cluster set.

(ii) When the partitioned tree Si1 and cluster set resides in a same place, number of distinct place reference required is

+ Eq.5

where i ∉ some x such that Sx has been partitioned.

2c) Searching the nodes in the subtree Si2, whether Si2 resides in the place of the cluster set or in a different place, number of distinct place reference required is

Eq.6

Total number of references required in the worst case for searching all the elements of the tree can be obtained by adding the number of distinct place references calculated from Eq.3, Eq.4 and Eq.6

n(T) - + 2 * - 2 * n(Sy) - 2 * n(Sx) + 2 * n(Sx1) + n(Sy) + 3 * n(Sx2)

=> n(T) + - n(Sy) - n(Sx) - 2 * n(Sy) - 2 * n(Sx) + 2 * n(Sx1) + n(Sy) + 3 * n(Sx2).

=> n(T) + - 3 * n(Sy) - 3 * n(Sx) + 2 * n(Sx1) + n(Sy) + 3 * n(Sx2).

=> n(T) + - 2 * n(Sy) - 3 * n(Sx) + 2 * n(Sx1) + 2 * n(Sx2) + n(Sx2) .

=> n(T) + - 2 * n(Sy) - 3 * n(Sx) + 2* n(Sx) + n(Sx2) .

=> n(T) + - 2 * n(Sy) - n(Sx1).

Total number of references required in the best case for searching all the elements of the tree can be obtained by adding the number of distinct place reference calculated from Eq.3, Eq.5 and

Eq.6

n(T) - + 2 * + n(Sx1) + 3 * n(Sx2).

=> n(T) + - n(Sx) + n(Sx1) + 3 * n(Sx2).

=> n(T) + - n(Sx) + n(Sx1) + n(Sx2)

+ 2 * n(Sx2).

=> n(T) + + 2* n(Sx2).

## 4.2 Performance with Global mapping:

One would assign all the nodes of the tree to the global address space where all activities can access the complete tree without creating any extra activities. This leads to a serious drawback, since global data can only be read and can't be updated. Also the overhead in maintaining the global data is high since every access should be checked to see if it satisfies the constraints placed on activities that access global data. So when the number of nodes in global place increases, the overhead also increases. But our model, provides an efficient distribution to achieve parallelism with the least amount of communication delay.

## 5.Conclusion

In this paper, we have proposed an efficient way of mapping a Binary Search Tree's nodes into multiple computation places. By this mapping the overall delay involved in searching the nodes in the tree gets reduced by limiting the number of inter-place communication. When the inter-place communication gets reduced, cost and time spent in implementing advanced communication methods are minimized which in turn can be used to further increase the levels of parallelism by providing the fine real multi cores.