# Matrix Multiplication Operations Representation Using Karnaugh Map Computer Science Essay


This paper discusses a Karnaugh map representation of multidimensional matrix multiplication operations. The Karnaugh map is a method for simplifying Boolean algebra expressions; it reduces the need for extensive calculation by exploiting the human ability to recognize patterns, and it enables quick identification and elimination of potential race conditions. In a Karnaugh map, the Boolean variables are transferred, usually from a truth table, and arranged according to the principles of Gray code. Multidimensional matrix multiplication and array operations are used in a large number of important scientific codes, including finite element methods, molecular dynamics, and climate modeling, and various methods have been proposed for implementing them efficiently. We design data parallel algorithms for matrix-matrix addition and matrix multiplication array operations, based on the row, column, and 2D mesh distribution schemes, in both the traditional matrix representation and the Karnaugh map representation schemes for multidimensional arrays. Data parallel algorithms were also designed for six Fortran 90 array intrinsic functions: All, Cshift, Merge, Maxval, Pack, and Sum. We compare the time of the data distribution, local computation, and result collection phases of these array operations under the two representation schemes. The experimental results show that the algorithms based on the Karnaugh map representation scheme performed better than those based on the traditional matrix representation scheme for all test cases.

## Introduction

Multidimensional matrix multiplication and array operations are used in a large number of important scientific codes, including finite element methods, molecular dynamics, and climate modeling. Various methods have been proposed for the efficient implementation of these operations, but most of them focus on 2-dimensional arrays. "When extended to higher dimensional arrays, these methods usually do not perform well. Hence, designing efficient algorithms for multidimensional matrix operations becomes an important issue." (G.H. Golub and C.F. Van Loan, 1989)

This paper discusses a Karnaugh map representation of multidimensional matrix multiplication operations. The Karnaugh map is a method for simplifying Boolean algebra expressions; it reduces the need for extensive calculation by exploiting the human ability to recognize patterns, and it enables quick identification and elimination of potential race conditions. In a Karnaugh map, the Boolean variables are transferred, usually from a truth table, and arranged according to the principles of Gray code, in which only one variable changes between adjacent squares. The data is arranged into the largest possible groups containing 2^n cells (n = 0, 1, 2, 3, ...).[1] Once the table is generated and the output possibilities are transcribed, the minterms are derived by means of the axiom laws of Boolean algebra.

Karnaugh maps generally become more cluttered and harder to interpret as variables are added. As a general rule, they work well for up to four variables and should not be used at all for more than six. The Quine-McCluskey algorithm can be used for expressions with larger numbers of variables. Nowadays minimization is generally carried out by computer, and a standard minimization program has been developed for this purpose: the Espresso heuristic logic minimizer.

In theory, the K-map method may be applied to simplify any Boolean expression regardless of its number of variables, but it is mostly used when fewer than six variables are involved, because K-maps of expressions with more than six variables become complex and tedious to simplify. Each variable contributes two possibilities, its initial value and its inverse, so the map organizes all possibilities of the system. The variables are arranged in Gray code, so that only one variable changes between two adjacent grid boxes. Once the variables are defined, the output possibilities are transcribed according to the grid location provided by the variables. An output possibility is defined for every combination of the Boolean inputs.
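The Gray code ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original study; the function names `gray_code` and `differs_in_one_bit` are chosen here for the example.

```python
def gray_code(n_bits):
    """Return the 2**n_bits reflected Gray code values in sequence."""
    return [i ^ (i >> 1) for i in range(2 ** n_bits)]


def differs_in_one_bit(a, b):
    """True when a and b differ in exactly one bit position."""
    diff = a ^ b
    return diff != 0 and diff & (diff - 1) == 0


codes = gray_code(2)
labels = [format(c, "02b") for c in codes]
print(labels)  # ['00', '01', '11', '10']

# Every label differs from its neighbour in one bit, including the
# wrap-around from the last label back to the first.
wraps = all(
    differs_in_one_bit(codes[i], codes[(i + 1) % len(codes)])
    for i in range(len(codes))
)
print(wraps)  # True
```

The two-bit sequence 00, 01, 11, 10 is the usual labelling of the rows and columns of a four-variable map.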

Upon completion of the Karnaugh map, a minimized function is derived by grouping the "1s", or desired outputs, into the largest possible rectangular groups in which the number of grid boxes (output possibilities) is a power of 2.[1] For example, the groups may be 4 boxes in a line, 2 boxes high by 4 boxes long, 2 boxes by 2 boxes, and so on. "Don't care" possibilities (generally represented by an "X") are included in a group only if the resulting group is larger than the group formed with the "Don't care" excluded. Boxes can be used in more than one group if doing so yields the smallest number of groups. Each "1", or desired output possibility, must be contained within at least one group.

The groups are converted to a Boolean expression by locating and transcribing the variable possibility attributed to each box and by applying the axiom laws of Boolean algebra, under which a variable term is removed if both the (initial) variable possibility and its inverse are contained within the same group. Each group provides a "product", and together the products form a "sum-of-products" Boolean expression.
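The conversion of a group into a product term can be sketched as follows. This is an illustrative sketch only: the helper name `group_to_term`, the use of a trailing apostrophe for an inverted variable, and the cell numbering (row indices with A as the most significant bit) are assumptions made for the example, and the function does not check that the cells form a valid rectangular group.

```python
def group_to_term(cells, names="ABCD"):
    """Turn a group of K-map cells into a product term.

    A variable that takes both values inside the group is dropped;
    a variable fixed at 1 appears as is, one fixed at 0 appears inverted.
    """
    n = len(names)
    literals = []
    for pos, name in enumerate(names):
        shift = n - 1 - pos  # the first name is the most significant bit
        values = {(cell >> shift) & 1 for cell in cells}
        if values == {1}:
            literals.append(name)
        elif values == {0}:
            literals.append(name + "'")
        # values == {0, 1}: the variable and its inverse are both
        # present in the group, so the variable is eliminated.
    return "".join(literals) or "1"


print(group_to_term([8, 9, 12, 13]))  # AC'
print(group_to_term([6, 14]))         # BCD'
```

In the first group, B and D each take both values and are eliminated, leaving a two-literal product; the second, smaller group eliminates only A.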

To determine the inverse of the Karnaugh map, the "0s" are grouped instead of the "1s". The two expressions are non-complementary.

A four variable minterm Karnaugh map

## Related literature

The cost of a parallel algorithm is defined as the product of the running time of the parallel algorithm and the number of processors used:

Cost = Running time × Number of processors

If the cost of the parallel algorithm matches the lower bound of the best known sequential algorithm to within a constant multiplicative factor, the algorithm is said to be cost optimal.

The algorithm for adding n numbers takes O(log n) steps on a tree of n-1 processors. The cost of this parallel algorithm is therefore O(n log n), whereas the sequential algorithm takes O(n) time, so this parallel algorithm is not cost optimal. The efficiency of a parallel algorithm is defined as the ratio of the worst case running time of the sequential algorithm to the cost of the parallel algorithm.

Parallel processing can be achieved using two public domain message passing systems, namely PVM and MPI. Message Passing Interface (MPI) - MPI is a standard specification for a library of message passing functions. It specifies a public domain, platform independent message passing library standard which is portable. An MPI library consists of the names, calling sequences and results of subroutines to be called from FORTRAN 77 programs and of functions to be called from C programs. Users write their programs in FORTRAN 77 or C and compile them with ordinary compilers; the programs are then linked with the MPI libraries.
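The cost and efficiency figures for the tree summation discussed above can be illustrated with a small sequential simulation. This is a sketch under stated assumptions: each level of the tree is counted as one unit-time parallel step, the sequential algorithm is taken to need n-1 additions, and the name `tree_sum` is hypothetical.

```python
def tree_sum(values):
    """Pairwise (tree) summation of a non-empty sequence.

    Returns (total, steps), where steps is the number of tree levels,
    i.e. the number of parallel steps if every pair is added at once.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired element moves up unchanged
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps


n = 16
total, steps = tree_sum(range(1, n + 1))
processors = n - 1                  # processors in the tree
cost = steps * processors           # Cost = running time x processors
sequential_time = n - 1             # additions needed by one processor
efficiency = sequential_time / cost

print(total, steps, cost, efficiency)  # 136 4 60 0.25
```

With these assumptions the efficiency equals 1/log2(n), so it shrinks as n grows, which is another way of seeing that the tree algorithm is not cost optimal.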

A parallel program written in FORTRAN 77 or C using the MPI library can run without any change on a single PC, a workstation, a network of workstations, or a parallel computer from any vendor, under any operating system. The design of MPI is based on four orthogonal concepts: message data types, communicators, communication operations and virtual topology.

Compiling an MPI program - A program written using MPI should be compiled using the parallel C compiler mpcc: mpcc functionname.c -o myprog.o

Message Passing in MPI - Processes in MPI are heavyweight and single threaded, with separate address spaces. Since one process cannot directly access variables in another process's address space, message passing is used for interprocess communication. The routines MPI_Send and MPI_Recv are used to send a message to, or receive a message from, a process. A message has two parts, namely the content of the message (the message buffer) and the destination of the message (the message envelope).

The MPI_Send routine has six parameters. The first three specify the message address, the message count and the message datatype. MPI introduces the datatype identifier to support heterogeneous computing and to allow messages from non-contiguous memory locations. The last three parameters specify the id of the destination process, the tag and the communicator respectively. Together these three parameters constitute the message envelope.

Point to Point Communications - MPI provides both blocking and non-blocking operations; the completion of the non-blocking versions can be tested for and waited for explicitly. MPI also has multiple communication modes.

The standard mode corresponds to current practice in message passing systems. The synchronous mode requires a send to block until the corresponding receive has occurred. In buffered mode, a send assumes the availability of a certain amount of buffer space, which must be specified beforehand by the user program through a call to the routine MPI_Buffer_attach(buffer, size), which attaches a user buffer. The ready mode is a way for the programmer to notify the system that the corresponding receive has already been posted, so that the underlying system can use a faster protocol. When all the processes in a group participate in a global communication operation, the resulting communication is called a collective communication.

In most parallel programs, each process communicates with only a few other processes, and the pattern of communication among these processes is called an application topology. MPI allows the user to define a virtual topology. Communication within this topology takes place in the hope that the underlying network topology will correspond to it and so expedite the message transfer. An example of a virtual topology is the Cartesian or mesh topology.
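As a rough illustration of what a mesh topology means for process numbering, the sketch below maps process ranks to grid coordinates and lists each rank's neighbours. It does not use MPI's own Cartesian topology routines; the row-major rank ordering, the non-periodic boundaries and the function names are all assumptions made for this example.

```python
def rank_to_coords(rank, cols):
    """Row-major mapping from a process rank to (row, col) in the mesh."""
    return divmod(rank, cols)


def coords_to_rank(row, col, cols):
    """Inverse of rank_to_coords."""
    return row * cols + col


def mesh_neighbours(rank, rows, cols):
    """Ranks directly above, below, left and right of `rank` (no wrap-around)."""
    row, col = rank_to_coords(rank, cols)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < rows and 0 <= c < cols:
            result.append(coords_to_rank(r, c, cols))
    return sorted(result)


# A 2 x 3 mesh of six processes:
#   0 1 2
#   3 4 5
print(rank_to_coords(4, 3))      # (1, 1)
print(mesh_neighbours(4, 2, 3))  # [1, 3, 5]
```

A process in such a mesh exchanges messages mainly with the few ranks returned by `mesh_neighbours`, which is the application topology the paragraph above refers to.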


There are 16 positions available in the Karnaugh map, so the input variables can be combined in 16 different ways, and the map is therefore arranged as a 4 × 4 grid. The binary digits in the map represent the function's output for each combination of inputs. The upper leftmost corner of the map holds 0 because ƒ = 0 when A = 0, B = 0, C = 0, D = 0. In the same way, the bottom right corner is marked 1 because A = 1, B = 0, C = 1, D = 0 gives ƒ = 1. Note that the values are ordered in a Gray code, so that precisely one variable changes between any pair of adjacent cells.

The Random Access Machine (RAM) - The schematic diagram of the RAM is as shown below. The basic functional units of the RAM are:

A memory unit containing m locations.


A processor that operates under the control of a sequential algorithm. The processor can read data from a memory location and can perform basic arithmetic and logical operations.

A memory access unit (MAU) that creates a path from the processor to an arbitrary location in the memory.

The processor provides the MAU with the address of the location it wishes to access and the read/write operation it wishes to perform. The MAU uses this address to establish a direct connection between the processor and the memory location. Each step of an algorithm for the RAM model consists of three basic phases, namely:

Read: The processor reads a datum from the memory which is stored in one of its local registers.

Execute: The processor can perform basic arithmetical and logical operations on the content of one or two of its registers.

Write: The processor writes the contents of one of its registers into a memory location.

The Parallel Random Access Machine (PRAM) - The PRAM is one of the popular models for designing parallel algorithms. The PRAM consists of the following:

A set of N identical processors (P1, P2, P3, ..., PN).

A memory with m locations which is shared by all the N processors.

An MAU which allows the processors to access the shared memory.

It is to be noted that the shared memory also functions as a communication medium for the processors. Here each step of an algorithm consists of the following phases:

Read: Up to N processors read simultaneously, in parallel, from N memory locations and store the values in their local registers.

Compute: N processors perform basic arithmetic or logical operations on the value in their registers.

Write: N processors can write simultaneously into N memory locations from their registers.

The PRAM model can be further subdivided into four categories based on the way simultaneous memory accesses are handled:

Exclusive Read, Exclusive Write (EREW) PRAM. In this model every access to a memory location (Read/Write) has to be exclusive.

Concurrent Read, Exclusive Write (CREW) PRAM. In this model only write operations to a memory location are exclusive, whereas two or more processors can read from the same memory location concurrently.

Exclusive Read, Concurrent Write (ERCW) PRAM. This model allows multiple processors to write concurrently to the same location, whereas the read operations are exclusive.

Concurrent Read, Concurrent Write (CRCW) PRAM. This model allows both multiple read and multiple write operations to a memory location, and it is the most powerful of the four models.

During read operations, all processors reading from a particular memory location read the same value, whereas during a write operation several processors may try to write different values to the same memory location. The model therefore has to specify precisely which value is written, and this is done through protocols that identify the value to be written to a memory location. They are:

Priority CW: Here only the processor with the highest priority succeeds in writing its value to the memory location.

The main idea of the Karnaugh map representation scheme here is to represent a multidimensional matrix by a set of 2-dimensional arrays, thereby reducing the complexity of designing efficient algorithms for multidimensional matrix operations. "The main idea of Karnaugh map representation is to represent any nD matrix by 2D matrices. Hence, efficient algorithms design for nD matrices becomes less complicated. Parallel matrix operation algorithms based on Karnaugh map representation and traditional matrix representation are presented. Analysis and experiments are conducted to assess their performance. Both our analysis and experimental result show that parallel algorithms based on Karnaugh map representation outperform those based on traditional matrix representation." (B.B. Fraguela, R. Doallo, E.L. Zapata, Cache 1998)
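The idea of holding an nD matrix in 2D form can be sketched for the 3D case. The essay does not state the exact index mapping, so the layout used here, in which element a[k][i][j] is stored at b[i][j * nk + k], is an assumption chosen for illustration, as are the function names. The sketch shows that element-wise addition can then run over plain 2D arrays and give the same result as the 3D operation.

```python
def to_2d(a3):
    """Lay out a 3D array a3[k][i][j] as a 2D array b[i][j * nk + k].

    The index mapping is an assumed one, used only for illustration.
    """
    nk, ni, nj = len(a3), len(a3[0]), len(a3[0][0])
    b = [[0] * (nj * nk) for _ in range(ni)]
    for k in range(nk):
        for i in range(ni):
            for j in range(nj):
                b[i][j * nk + k] = a3[k][i][j]
    return b


def add_2d(x, y):
    """Element-wise sum of two 2D arrays of equal shape."""
    return [[p + q for p, q in zip(row_x, row_y)] for row_x, row_y in zip(x, y)]


def add_3d(x, y):
    """Element-wise sum of two 3D arrays, one 2D plane at a time."""
    return [add_2d(plane_x, plane_y) for plane_x, plane_y in zip(x, y)]


# Two small 2 x 2 x 3 arrays (k, i, j) with recognisable entries.
x = [[[100 * k + 10 * i + j for j in range(3)] for i in range(2)] for k in range(2)]
y = [[[1 for _ in range(3)] for _ in range(2)] for _ in range(2)]

print(to_2d(x)[0])  # [0, 100, 1, 101, 2, 102]
print(to_2d(add_3d(x, y)) == add_2d(to_2d(x), to_2d(y)))  # True
```

In the 2D form the addition needs two nested loops regardless of the original number of dimensions.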

f(A,B,C,D) = ∑(6, 8, 9, 10, 11, 12, 13, 14)

Note: The values inside ∑ are the minterms to map (i.e. the rows that have output 1 in the truth table).

## Truth table

Using the defined minterms, the truth table can be created:

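| # | A | B | C | D | f(A,B,C,D) |
|---|---|---|---|---|------------|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 1 | 0 |
| 6 | 0 | 1 | 1 | 0 | 1 |
| 7 | 0 | 1 | 1 | 1 | 0 |
| 8 | 1 | 0 | 0 | 0 | 1 |
| 9 | 1 | 0 | 0 | 1 | 1 |
| 10 | 1 | 0 | 1 | 0 | 1 |
| 11 | 1 | 0 | 1 | 1 | 1 |
| 12 | 1 | 1 | 0 | 0 | 1 |
| 13 | 1 | 1 | 0 | 1 | 1 |
| 14 | 1 | 1 | 1 | 0 | 1 |
| 15 | 1 | 1 | 1 | 1 | 0 |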
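The step from the truth table to the 4 × 4 map can be sketched as follows. The sketch assumes that rows of the map are indexed by AB and columns by CD, both in Gray code order, which agrees with the corner values described earlier. The sum-of-products expression checked at the end (AC' + AB' + BCD') is not given in the essay; it is one cover that the code verifies against the truth table.

```python
GRAY = [0, 1, 3, 2]  # two-bit Gray code order: 00, 01, 11, 10
MINTERMS = [6, 8, 9, 10, 11, 12, 13, 14]  # rows of the truth table with f = 1


def kmap(minterms):
    """Build the 4 x 4 map: rows follow AB, columns follow CD."""
    ones = set(minterms)
    return [[1 if ((ab << 2) | cd) in ones else 0 for cd in GRAY] for ab in GRAY]


def candidate(a, b, c, d):
    """Candidate minimised form AC' + AB' + BCD' (verified below)."""
    return int((a and not c) or (a and not b) or (b and c and not d))


grid = kmap(MINTERMS)
for row in grid:
    print(row)
# [0, 0, 0, 0]
# [0, 0, 0, 1]
# [1, 1, 0, 1]
# [1, 1, 1, 1]

# Compare the candidate expression with all 16 rows of the truth table.
matches = all(
    candidate((m >> 3) & 1, (m >> 2) & 1, (m >> 1) & 1, m & 1) == (m in MINTERMS)
    for m in range(16)
)
print(matches)  # True
```

The printed grid has 0 in the upper left corner and 1 in the bottom right corner, as stated in the description of the map.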

## Experimental results and analysis

This section presents the experimental results for the performance of the parallel string matching implementation, which is based on a static master-worker model. The algorithm is implemented in the ANSI C programming language using the MPI library [5, 10, 11] for the point-to-point and collective communication operations. The target platform for our experimental study is a cluster of personal computers connected by a 100 Mb/s Fast Ethernet network. More specifically, the cluster consists of 6 PCs based on 100 MHz Intel Pentium processors, each with 64 MB RAM. The MPI implementation used on the network is MPICH version 1.2.

During all experiments, the cluster of personal computers was dedicated. To obtain reliable performance results, each experiment was executed 10 times and the reported values are the averages. The number of processors, the pattern length and the text size can influence the performance of parallel string matching significantly, so these parameters are varied in our experimental study. Tables 1 and 2 show the execution times in seconds of the BF string matching algorithm for four pattern lengths, two total English text sizes and different numbers of processors. Further, Figure 2 presents the speedup factor with respect to the number of processors for English text of various sizes and for the BF string matching algorithm. We define the speedup Sp in the usual form, Sp = T1 / Tp, where T1 and Tp are the execution times of the same algorithm (implemented for sequential and parallel execution) on 1 and p processors respectively. It is important to note that the speedup plotted in the figure is the average over the four pattern lengths.

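| Processors (p) \ Pattern length (m) | 5 | 10 | 30 | 60 |
|---|---|---|---|---|
| 1 | 1.155 | 1.111 | 1.087 | 1.182 |
| 2 | 0.622 | 0.596 | 0.584 | 0.631 |
| 3 | 0.419 | 0.405 | 0.396 | 0.428 |
| 4 | 0.312 | 0.301 | 0.296 | 0.318 |
| 5 | 0.252 | 0.247 | 0.242 | 0.266 |
| 6 | 0.203 | 0.205 | 0.197 | 0.212 |

Table 1: Experimental execution times (in secs) for text size 3MB using several pattern lengths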
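The speedup definition Sp = T1 / Tp can be applied directly to the Table 1 timings. The sketch below is only an illustration: it averages the per-pattern-length speedups for each processor count, which is one reading of "the average over the four pattern lengths" and is an assumption, as are the names `TABLE1` and `average_speedup`.

```python
# Execution times in seconds from Table 1 (text size 3MB).
# Keys are processor counts; values follow pattern lengths 5, 10, 30, 60.
TABLE1 = {
    1: [1.155, 1.111, 1.087, 1.182],
    2: [0.622, 0.596, 0.584, 0.631],
    3: [0.419, 0.405, 0.396, 0.428],
    4: [0.312, 0.301, 0.296, 0.318],
    5: [0.252, 0.247, 0.242, 0.266],
    6: [0.203, 0.205, 0.197, 0.212],
}


def average_speedup(table):
    """Return {p: mean of T1/Tp over the pattern lengths}."""
    base = table[1]
    result = {}
    for p, times in table.items():
        speedups = [t1 / tp for t1, tp in zip(base, times)]
        result[p] = sum(speedups) / len(speedups)
    return result


for p, s in sorted(average_speedup(TABLE1).items()):
    # Speedup divided by p is the fraction of ideal linear speedup achieved.
    print(f"p={p}: speedup={s:.2f}, speedup/p={s / p:.2f}")
```

For this table the average speedup stays close to the number of processors over the whole range measured.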

| Processors (p) \ Pattern length (m) | 5 | 10 | 30 | 60 |
|---|---|---|---|---|
| 1 | 9.237 | 8.724 | 8.513 | 9.284 |
| 2 | 4.642 | 4.46 | 4.375 | 4.769 |
| 3 | 3.095 | 2.969 | 2.926 | 3.201 |
| 4 | 2.334 | 2.26 | 1.801 | 2.421 |
| 5 | 1.875 | 1.801 | 1.785 | 1.962 |
| 6 | 1.558 | 1.492 | 1.472 | 1.631 |

Table 2: Experimental execution times (in secs) for text size 24MB using several pattern lengths

Figure: Matrix-vector multiplication problem. Execution time of the matrix-vector multiplication operation with 3 processors for the traditional matrix representation and the extended Karnaugh map representation (matrix size versus time).

Interconnection networks and combinational circuits differ according to whether the processors communicate among themselves through a shared memory or through an interconnection network. We also looked at two parallel programming models, message passing programming and shared memory programming, and at two standard libraries, PVM and MPI, which are implemented on almost all types of parallel computers.

To solve a problem on a standalone system, we need to design an algorithm for it. The algorithm gives the sequence of steps which the sequential computer has to execute in order to solve the problem, and is known as a sequential algorithm. Similarly, the algorithms used for solving problems on a parallel computer are known as parallel algorithms. A parallel algorithm defines how a given problem can be solved on a given parallel computer, i.e. how the problem is divided into sub-problems, how the processes communicate and how the partial solutions are combined to produce the final result. Algorithms of this type are generally machine dependent. To simplify the design and analysis of parallel algorithms, parallel computers are represented by various abstract machine models. These models make simplifying assumptions about the parallel computer. Even though some assumptions may not be practical, they are justified in the following sense: in designing algorithms for these models, one can learn about the parallelism inherent in the given problem. The models help us compare the relative powers of the various computers. They also help us determine the kind of parallel architecture that is best suited to a problem.

Consider the problem of marking n examination scripts, each containing m questions. In order to design a parallel solution to this problem, it must first be decomposed into smaller tasks which can be executed simultaneously. This is referred to as the partitioning stage and can be done in one of two ways. Each script could be marked by a different marker, which would require n markers. Alternatively, marking each question could be viewed as a task. This would result in m such tasks, each of which could be tackled by a separate marker, implying that every script passes through every marker. In the first approach, the data (scripts) is first decomposed and then the computation (marking) is associated with it. This technique is called domain decomposition. In the second approach, the computation to be performed (marking) is first decomposed and then the data (scripts) is associated with it. This technique is called functional decomposition. The partitioning technique chosen often depends on the nature of the problem. Suppose one needs to compute the average mark of the n scripts.

If domain decomposition is chosen, then the marks from each of the markers are required. If the markers are at different physical locations, some form of communication is needed in order to obtain the sum of the marks. The nature of the information flow is specified in the communication analysis stage of the design. In this case, each marker can proceed independently and communicate the marks at the end. However, other situations would require communication between two concurrent tasks before computation can proceed. It may be the case that the time to communicate the marks between two markers is much greater than the time to mark a question. In that case, it is more efficient to reduce the number of markers and have each marker work on a number of scripts, thereby decreasing the amount of communication.

Effectively, several small tasks are combined to produce larger ones, which results in a more efficient solution. This is called granularity control. For example, k markers could mark n/k scripts each. The problem here is to determine the best value of k. The mapping stage specifies where each task is to execute. In this example, all tasks are of equal size and the communication is uniform, so any task can be mapped to any marker. However, in more complex situations, mapping strategies may not be obvious, requiring the use of more sophisticated techniques. Parallel algorithm design is an interesting and challenging area of computer science which requires a combination of creative and analytical skills.
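The domain decomposition with granularity control described above, in which k markers each handle about n/k scripts and only communicate a partial result at the end, can be sketched as follows. The sketch runs sequentially and merely mimics the division of work; the mark values and the function names are made up for illustration.

```python
def split(items, k):
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    quotient, remainder = divmod(len(items), k)
    chunks, start = [], 0
    for worker in range(k):
        size = quotient + (1 if worker < remainder else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def average_mark(marks, k):
    """Average of all marks, computed from k per-marker partial results."""
    # Each marker works independently on its own chunk of scripts ...
    partials = [(sum(chunk), len(chunk)) for chunk in split(marks, k)]
    # ... and only the (sum, count) pairs are communicated and combined.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


marks = [55, 70, 62, 48, 90, 81, 66]   # hypothetical marks for n = 7 scripts
print([len(c) for c in split(marks, 3)])  # [3, 2, 2]
print(average_mark(marks, 3))
```

Only k pairs of numbers are exchanged however many scripts there are, which is the reduction in communication that motivates choosing a smaller k.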

A sequential algorithm is evaluated in terms of two parameters, i.e. its running time complexity and its space complexity. For evaluating parallel algorithms we consider three principal criteria: running time, number of processors, and cost. Running time - Since speeding up the solution of a problem is the main reason for building parallel computers, an important measure in evaluating a parallel algorithm is its running time. This is defined as the time taken by the algorithm to solve a problem on a parallel computer. A parallel algorithm is made up of two kinds of steps: computational steps and communication steps. In a computational step a processor performs a local arithmetical or logical operation, whereas in a communication step data is exchanged between the processors via the shared memory or through the interconnection network. The running time of a parallel algorithm thus includes the time spent in both computational and communication steps. The worst case running time is defined as the maximum running time of the algorithm taken over all inputs, whereas the average case running time is the average of the running time of the algorithm over all inputs.

Another important criterion for evaluating a parallel algorithm is the number of processors required. Given a problem of input size n, the number of processors required by an algorithm is a function of n, denoted by P(n). Sometimes the number of processors is a constant independent of n. Cost - The cost of a parallel algorithm is defined as the product of the running time of the parallel algorithm and the number of processors used: Cost = Running time × Number of processors. If the cost of the parallel algorithm matches the lower bound of the best known sequential algorithm to within a constant multiplicative factor, the algorithm is said to be cost optimal.


The remaining concurrent write (CW) protocols are:

Common CW: Here the processors are allowed to write to a memory location if and only if they are all writing the same value.

Arbitrary CW: Here the processor that succeeds in writing to the memory location is chosen arbitrarily, without affecting the correctness of the algorithm.

Combining CW: Here a function maps the multiple values that the processors try to write to a single value, which is the one actually written into the memory location.

Interconnection Networks - In the PRAM, all exchanges of data among processors take place through the shared memory. There is also another way for the processors to communicate, i.e. via direct links. In this method, instead of a shared memory, the M locations of memory are distributed among the N processors, so the local memory of each processor contains M/N locations.

Combinational Circuits - A combinational circuit can be viewed as a device that has a set of input lines at one end and a set of output lines at the other. Such circuits are made of interconnected components arranged in columns called stages. Each component has a fixed number of input lines, called its fan-in, and a fixed number of output lines, called its fan-out. After a component receives its inputs, it performs a simple arithmetical or logical operation in one unit of time and produces the result as output.
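The priority, common, arbitrary and combining concurrent write protocols of the CRCW PRAM can be sketched as a single conflict-resolution function. This is a toy model rather than a PRAM simulator: the convention that a lower processor id means higher priority, the choice to leave memory unchanged when a common write is attempted with differing values, and the function name `resolve_write` are all assumptions.

```python
import random


def resolve_write(old_value, requests, protocol, combine=sum):
    """Return the new content of one memory cell after concurrent writes.

    requests is a list of (processor_id, value) pairs targeting the cell.
    """
    if not requests:
        return old_value
    values = [value for _, value in requests]
    if protocol == "priority":
        # Assumption: the lowest processor id has the highest priority.
        return min(requests, key=lambda request: request[0])[1]
    if protocol == "common":
        # The write happens only if every processor writes the same value.
        return values[0] if len(set(values)) == 1 else old_value
    if protocol == "arbitrary":
        return random.choice(values)
    if protocol == "combining":
        return combine(values)
    raise ValueError(f"unknown protocol: {protocol}")


requests = [(3, 30), (1, 10), (2, 20)]
print(resolve_write(0, requests, "priority"))        # 10
print(resolve_write(0, requests, "common"))          # 0 (values differ)
print(resolve_write(0, requests, "combining"))       # 60
print(resolve_write(0, requests, "combining", max))  # 30
```

Passing a different `combine` function, such as `max` in the last call, shows how the combining protocol covers reductions other than the sum.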

Parallel Virtual Machine (PVM) - PVM is a public domain software system that was originally designed to enable a collection of heterogeneous UNIX computers to be used co-operatively as one virtual message passing parallel computer. Unlike MPI, PVM is a self-contained system: while MPI depends on the underlying platform to provide process management and I/O functions, PVM does not. On the other hand, PVM is not a standard, which means it can undergo version changes more frequently than MPI. The PVM system is composed of two parts: a PVM daemon (pvmd3) that resides on all the computers which make up the virtual machine, and a user-callable library (libpvm3.a) that is linked to the user application for message passing, process management and modifying the virtual machine.

To run a PVM application, the user first creates a virtual machine by starting up PVM. Multiple users can configure overlapping virtual machines in a UNIX system, and each user can execute several PVM applications simultaneously. A general method of programming an application with PVM is as follows. The user writes one or more sequential programs in FORTRAN 77 or C which contain calls to the PVM library. These programs are compiled for the machines in the host pool, and the resulting object files are placed in a location accessible from machines in the host pool. To execute an application, the user starts one copy of one task from a machine within the host pool. This task subsequently starts other PVM tasks, which compute locally and exchange messages with each other to solve the problem. All PVM tasks are identified by an integer task identifier (tid), which is assigned by the PVM system.

## Conclusions and future work

The design and analysis of sequential algorithms is a well developed field, with a large body of commonly accepted results and techniques. This consensus is built upon the fact that the methodology and notation of asymptotic analysis (the so-called "big-O" notation) deliver results which are applicable across all sequential computers, programming languages, compilers and so on. This generality is achieved at the expense of a certain degree of blurring, in which constant factors and non-dominating terms in the analysis are simply ignored. In spite of this, the approach produces results which allow useful comparisons of the essential performance characteristics of different algorithms which are reflected in practice when implemented on real machines, in real languages through real compilers.

The diversity of proposed and implemented parallel architectures is such that it is not clear that such a model will ever emerge. Worse than this, the variations in architecture capabilities and associated costs mean that no such model can emerge, unless we are prepared to forgo certain tricks or shortcuts exploitable on one machine but not another. An algorithm designed in some abstract model of parallelism may have asymptotically different performance on two different architectures (rather than just the varying constant factors of different sequential machines). Secondly, our notion of "better", even in the context of a single architecture, must surely take into account the number of processors involved as well as the run time. The trade-offs here will need careful consideration. In this course we will not attempt to unify the irretrievably diverse. Thus we will have a small number of machine models and will design algorithms for our chosen problems for some or all of these. However, in doing so we still hope to emphasize common principles of design which transcend the differences in architecture.

"The experimental results show that the execution time of array operations based on the Karnaugh map representation scheme is less than that based on the traditional matrix representation scheme in the data distribution, the local computation, and the result collection phases for all test cases. The results encourage us to use the Karnaugh map representation scheme for multidimensional array representation on distributed memory multi-computers." (B.B. Fraguela, R. Doallo, E.L. Zapata, Cache 1998)

In this paper, loop re-permutation and other concepts were used to design algorithms for multidimensional array operations. All programs for array operations based on the traditional matrix representation and Karnaugh map representation schemes were derived by hand. There have also been previous proposals of automated methods for generating efficient parallel code for 2-dimensional matrix array operations based on the traditional matrix representation scheme. It would be interesting to see whether these methods can be applied to multidimensional array operations. Future work will try to extend them to multidimensional array operations based on the traditional matrix representation and Karnaugh map representation schemes. Equally, in some instances, we will exploit particular features of one model where that leads to a novel or particularly effective algorithm. Similarly, better notions should be investigated, alongside continued use of the notation of asymptotic analysis. It must be noted that particular care should be taken over constant factors. In the parallel case, a constant factor discrepancy of 32 in an asymptotically optimal algorithm on a 64 processor machine is a serious matter.