High Performance Computing keeps evolving, and Java's adoption in High Performance Computing has clear prospects owing to its significant characteristics: it is object oriented, platform independent, portable and secure, offers built-in networking and multithreading support, and has an extensive API. Another observable trend is the move toward multi-core systems, and consequently multi-core programming tools such as OpenMP are becoming popular. Above and beyond this, MPI is today's leading HPC programming model. The main objective of this paper is to review and analyze the literature on MPI implementations in Java for shared and distributed memory models, to critically evaluate the reviewed MPI implementations, and, where possible, to identify new developments in Java for HPC.
Nowadays, there is a persistent demand for greater computational power to process large amounts of data in complex ways. High Performance Computing makes previously unachievable calculations possible. Today's Grand Challenge Problems include, but are not limited to, Climate Modeling, Fluid Turbulence, Pollution Dispersion, the Human Genome and Ocean Circulation. To attain computing performance, modern computer architectures rely more and more on hardware-level parallelism through multiple execution units, pipelined instructions and multi-core processors. Nowadays, the largest and fastest computers use both shared and distributed memory architectures (see Figure 1). Contemporary trends indicate that this hybrid type of memory architecture will continue to prevail.
Figure 1: Hybrid Distributed Shared Memory Architecture.
The modern trend toward multi-core clusters accentuates the importance of parallelism and multithreading. Java is an appropriate choice for High Performance Computing applications as it is a multithreaded language and provides built-in networking support. This shows that Java is competent to exploit shared as well as distributed memory architectures. Several parallel programming models exist that can be implemented on any core hardware; these include Shared Memory, Threads, Message Passing, Data Parallel and Hybrid. An example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). Two popular messaging approaches are Parallel Virtual Machine (PVM) and Message Passing Interface (MPI). MPI is the leading model used in HPC at present. It supports both point-to-point and collective communication, and it aims at high performance, scalability, and portability. Interoperability of objects defined in MPI facilitates mixed-language message passing programming. Though MPI is not approved by any standards body, it has become a de facto standard for communication among processes.
Figure 2: Message Passing Model
The Message Passing Interface standard was established in 1994 . It supports the Single Program Multiple Data (SPMD) model of parallel computing. In the message passing model (see Figure 2) each task uses its own local memory during computation. Multiple tasks can reside on the same physical machine or on an arbitrary number of machines. Data transfer requires cooperative operations to be performed by each process; specifically, a send operation must have a matching receive operation. The communication modes offered by MPI are blocking, non-blocking, buffered and synchronous. The MPI standard documents presented a language-independent specification as well as C and Fortran bindings , whereas MPI-2 , an extended standard, added a C++ binding. A Java binding is not part of any MPI standard release.
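The cooperative send/receive pairing can be sketched in plain Java (our own illustration, not MPI code; a SynchronousQueue stands in for the rendezvous, so the send blocks until a matching receive is posted, as in MPI's synchronous mode):

```java
import java.util.concurrent.SynchronousQueue;

// Minimal sketch of the message passing model: a send completes only when a
// matching receive is posted; each task otherwise works on its own local memory.
public class RendezvousDemo {
    static double[] sendAndReceive(double[] data) throws InterruptedException {
        SynchronousQueue<double[]> channel = new SynchronousQueue<>();
        Thread sender = new Thread(() -> {
            try {
                channel.put(data);   // blocks until the receiver calls take()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        double[] received = channel.take();   // the matching receive operation
        sender.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sendAndReceive(new double[] {1.0, 2.0, 3.0}).length);
    }
}
```

Buffered and non-blocking modes differ only in when the send is allowed to return; the matching requirement stays the same.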
This paper provides a literature review of existing MPI implementations in Java. We discuss different aspects of Java message passing models, including socket implementations, high-speed network support, API support and performance evaluations. The remainder of the paper is organized as follows: Section 2 covers the literature overview, a critical evaluation of the MPI models is presented in Section 3, and concluding remarks and future directions are covered in Section 4.
II. Literature review
The purpose of this literature review is to emphasize the significance of MPI models in Java for HPC environments. A brief overview of the different implementations and their analysis is given below.
JavaMPI  proposes flexible and portable programming paradigms for both the shared memory and the distributed memory models. It achieves this through Java itself, which provides built-in classes and methods for developing multithreaded programs for shared memory models, and through automatic binding of native MPI using JCI for distributed memory or hybrid applications. JavaMPI uses the Java Native Interface (JNI), which lets C functions access Java data and perform format conversion. JNI is convenient only for writing new C code that is to be called from Java.
In JavaMPI (see Figure 3) an additional interface layer, the Java-to-C interface generator (JCI), binds a native MPI library to Java. JCI takes as input a header file containing the C function prototypes of the native library and outputs a file of C stub-functions, files of Java class and native method declarations, and shell scripts for compilation and linking. JCI generates a C stub-function and a Java native method declaration for each exported function of the MPI library.
Figure 3: The binding of native MPI library to Java using JNI
MPIJ  is a Java-based implementation of MPI. It runs as part of the Distributed Object Group Metacomputing Architecture (DOGMA) system. MPIJ provides MPI-like functionality and includes point-to-point communication, intra-communicator operations, groups, and user-defined reduction operations.
MPIJ communication uses native primitive Java types. MPIJ can also function on applet-based nodes, which provides flexibility in creating clusters without installing any system or user software related to the message-passing environment on the participating nodes.
Message Passing Java, or MPJava , is a pure-Java message passing framework built on the java.nio package introduced in Java 1.4. MPJava follows the Single Program Multiple Data (SPMD) model of MPI. Each MPJava instance has a unique processor identification tag (PID) and knows the total number of nodes in the computation, so the programmer can easily use this information to decide how to split up shared data.
The MPJava API provides point-to-point send() and recv() functions along with collective communication operations, for example an all-to-all broadcast() that can be used to recreate an entire array of n elements on each node. By default, an array of n elements is split among p nodes, with each node holding n/p elements and the last node holding n/p + n mod p elements. A series of start-up scripts creates a network of MPJava processes: the scripts read a list of hostnames, perform the necessary remote logins to each machine and start the MPJava processes on each machine with special arguments that allow each MPJava process to find the others.
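The default block decomposition can be sketched as plain Java helpers (method names are ours, not part of the MPJava API):

```java
// Sketch of MPJava-style SPMD data decomposition: node `pid` of `p` nodes
// receives n/p elements, and the last node additionally takes the n mod p
// remainder, exactly as described in the text.
public class BlockDecomposition {
    /** Number of elements held by node pid (0-based) out of p nodes. */
    static int localSize(int n, int p, int pid) {
        int base = n / p;
        return (pid == p - 1) ? base + n % p : base;
    }

    /** First global index owned by node pid. */
    static int localStart(int n, int p, int pid) {
        return (n / p) * pid;
    }

    public static void main(String[] args) {
        int n = 10, p = 4;
        for (int pid = 0; pid < p; pid++) {
            System.out.println("node " + pid + ": start=" + localStart(n, p, pid)
                    + " size=" + localSize(n, p, pid));
        }
    }
}
```

With n = 10 and p = 4 the first three nodes hold 2 elements each and the last node holds 4.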
Each MPJava process has a TCP connection to every other process in the network; these connections support both point-to-point and collective communication. Accordingly, MPJava provides two different all-to-all broadcast algorithms: a multithreaded concurrent algorithm in which all pairs of nodes exchange data in parallel, and a parallel prefix algorithm that uses a single thread. In the parallel prefix implementation the data exchange proceeds in log2(n) rounds, with 2^(r-1) pieces of data sent in round r, where r is the current round number. In the concurrent algorithm each node has a separate send and receive thread, and a select() routine handles the complex communication with all the other processors.
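The parallel prefix exchange can be illustrated with a single-threaded simulation (our own sketch, assuming the node count is a power of two; this is recursive doubling, not MPJava's actual code): in round r each node forwards the 2^(r-1) pieces it currently holds, so after log2(p) rounds every node holds all p pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simulation of a parallel prefix (recursive doubling) all-to-all broadcast:
// in round r, node i exchanges with partner i XOR 2^(r-1), sending the
// 2^(r-1) pieces it holds so far.
public class ParallelPrefixAllToAll {
    static List<Set<Integer>> run(int p) {
        List<Set<Integer>> known = new ArrayList<>();
        for (int i = 0; i < p; i++) known.add(new TreeSet<>(Set.of(i)));
        for (int dist = 1; dist < p; dist <<= 1) {          // rounds r = 1..log2(p)
            List<Set<Integer>> next = new ArrayList<>();
            for (int i = 0; i < p; i++) next.add(new TreeSet<>(known.get(i)));
            for (int i = 0; i < p; i++) {
                int partner = i ^ dist;                      // pairwise exchange
                next.get(partner).addAll(known.get(i));      // 2^(r-1) pieces sent
            }
            known = next;
        }
        return known;   // every node now holds all p pieces
    }

    public static void main(String[] args) {
        System.out.println(run(8));
    }
}
```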
The MPJava framework builds on java.nio features that are important for high-performance computing, namely channels, select(), and buffers, which together help maximize performance notably. Channels are an abstraction over TCP sockets that allow non-blocking I/O calls, can be polled and selected via java.nio's select() mechanism, and operate on java.nio.ByteBuffers; they are easier to employ than raw sockets. The use of java.nio buffers in the MPJava framework leads to two views of each buffer: a ByteBuffer and a DoubleBuffer. Maintaining multiple "views" of the same piece of data is required for the following reasons. A ByteBuffer supports operations to read or write other primitive types such as doubles or longs, and all socket I/O calls require ByteBuffer parameters; however, each ByteBuffer operation requires checks for alignment and endianness in addition to the bounds checks typical of Java. A DoubleBuffer provides bulk transfer methods for doubles that do not require alignment and endianness checks, and these transfers are visible through the ByteBuffer "view" without expensive conversions. Overlap of simultaneous I/O calls on the same data is prevented to avoid the resulting race conditions. The MPJava implementation takes normal Java arrays as parameters, which means the data is copied from arrays into buffers before being passed to system-level OS calls.
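The two "views" can be demonstrated with a short self-contained sketch (illustrative only; MPJava would hand the byte view to a SocketChannel rather than read it back):

```java
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.util.Arrays;

// One ByteBuffer for socket I/O and a DoubleBuffer view of the same storage
// for cheap bulk transfers of doubles: writes through one view are visible
// through the other without any format conversion.
public class BufferViews {
    static double[] roundTrip(double[] data) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(data.length * Double.BYTES);
        DoubleBuffer doubles = bytes.asDoubleBuffer();   // same storage, no copy

        doubles.put(data);   // bulk transfer, no per-element alignment checks
        // A SocketChannel would consume `bytes` here; instead we read the
        // bytes back through a fresh double view to show the data is shared.
        double[] out = new double[data.length];
        bytes.asDoubleBuffer().get(out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new double[] {1.5, -2.0, 3.25})));
    }
}
```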
The MPJava benchmarks are based on the Java Grande Forum's ping-pong benchmark, the Java Grande Forum's JGFAlltoAllBench.java and the NAS PB Conjugate Gradient (CG) benchmark. The all-to-all microbenchmark measures bandwidth use in a more practical way than ping-pong. For the ping-pong benchmark, two different java.io implementations were used: java.io (doubles) and java.io (bytes). The MPJava framework outperforms the java.io (doubles) implementation because that implementation's conversions are extremely inefficient, and it also beats the java.io (bytes) implementation for data sizes larger than about 2000 doubles, a behavior attributed to inefficiencies in java.io's buffering of data.
The native LAM-MPI implementation provides superior performance to MPJava for message sizes up to about 1000 doubles, while MPJava provides better performance for sizes larger than 7000 doubles.
The CG benchmark evaluates the suitability of Java for high performance scientific computation because it contains significant floating point arithmetic. The results demonstrate that MPJava can deliver performance competitive with native MPI applications written in Fortran and C. Work on asynchronous messages is deferred to future work by the authors.
Jcluster  is a pure Java implementation for heterogeneous clusters. The Jcluster API is based on the UDP protocol and Java threads. Its automatic load balancing is achieved by a relatively new load-balancing algorithm, Transitive Random Stealing (TRS).
Jcluster consists of two parts: a console and a runtime environment. The console lets users start and monitor their programs on the cluster. The runtime environment (see Figure 4) comprises a communication layer, a resource manager, a task scheduler and a PVM/MPI-like message-passing interface.
Figure 4: The architecture of the runtime environment.
The communication layer provides a point-to-point message-passing interface based on asynchronous multithreaded transmission over the UDP protocol. The resource manager monitors the state of the nodes, maintains a list of available nodes and, if it detects a failure, deletes the failed node from the list. The task scheduler implements the TRS algorithm, which helps schedule tasks with efficient load balancing on the cluster. Programmers run their programs in the runtime environment as one or more threads, each called a tasklet. A tasklet can be local or remote; it communicates with other local or remote tasklets through the PVM/MPI-like message-passing interface and can also submit subtasks to the task scheduler. The task queue is a double-ended queue.
Jcluster's design philosophy for dynamic load balancing is to reduce the idle time of all nodes; a node is idle when it has no tasks to execute. The workload of a running application is distributed by sending tokens to the schedulers on remote nodes, where a token carries all the information needed to create a new tasklet.
Jcluster takes full advantage of the network bandwidth by implementing asynchronous multithreaded transmission over UDP while retaining the order of messages between sender and receiver. The sender first decomposes a message into blocks; a sending thread then sends a block to the receiver and waits for an acknowledgment. If there is no response and the number of resends exceeds the value set in Jcluster, the sender reports the failure to the resource manager. When the receiver receives a block, it replies with an acknowledgment to the corresponding sending thread and passes the block to a processor responsible for sorting and reconstructing the message by the block header.
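The decompose-and-reassemble step can be sketched in memory (class and method names are ours, not Jcluster's; the acknowledgment and retransmission machinery is omitted): each block carries a sequence-number header, and the receiver reconstructs the message by sorting on that header, so blocks may arrive in any order over UDP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Message decomposition into sequence-numbered blocks and order-insensitive
// reassembly by the block header.
public class BlockReassembly {
    static final class Block {
        final int seq;            // the block header: sequence number
        final byte[] payload;
        Block(int seq, byte[] payload) { this.seq = seq; this.payload = payload; }
    }

    static List<Block> decompose(byte[] msg, int blockSize) {
        List<Block> blocks = new ArrayList<>();
        for (int off = 0, seq = 0; off < msg.length; off += blockSize, seq++) {
            blocks.add(new Block(seq,
                    Arrays.copyOfRange(msg, off, Math.min(off + blockSize, msg.length))));
        }
        return blocks;
    }

    static byte[] reassemble(List<Block> blocks, int totalLength) {
        blocks.sort(Comparator.comparingInt(b -> b.seq));   // order by header
        byte[] msg = new byte[totalLength];
        int off = 0;
        for (Block b : blocks) {
            System.arraycopy(b.payload, 0, msg, off, b.payload.length);
            off += b.payload.length;
        }
        return msg;
    }

    public static void main(String[] args) {
        byte[] msg = "jcluster block protocol sketch".getBytes();
        List<Block> blocks = decompose(msg, 7);
        Collections.shuffle(blocks);   // simulate UDP reordering
        System.out.println(new String(reassemble(blocks, msg.length)));
    }
}
```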
To ease programming, Jcluster provides four simple interfaces: the classes JTasklet, JEnvironment, JMessage and JException. The PVM-like message-passing interface supports dynamic group creation and barrier and broadcast operations within the group for collective communication. The MPI-like message-passing interface implements barrier, broadcast, scatter, scatterv, gather, allGather, allToAll, reduce and allReduce for collective communication.
Two approaches were followed to test the Jcluster environment. In the first, TRS is compared with Shis and RS for five different load distributions. On a small-scale cluster the performance of the algorithms is almost identical, but as the number of nodes increases TRS shows better performance. For its communication evaluation, Jcluster is compared with LAM-MPI and mpiJava using the Java Grande Forum's ping-pong benchmark, with the following results:
The LAM-MPI C primitives have the lowest communication overhead.
mpiJava over LAM-MPI has larger latencies, due to the overhead of calling the native LAM-MPI routines through JNI.
The Jcluster environment has somewhat smaller latencies than mpiJava.
Thanks to its asynchronous multithreaded transmission, the Jcluster environment makes better use of the network bandwidth than LAM-MPI and mpiJava over LAM-MPI. Jcluster achieves larger bandwidths for larger message sizes, coming very close to the theoretical bandwidth of Fast Ethernet.
The second approach is a divide-and-conquer application, the 16-Queens problem, for which the speedup efficiency reaches up to 91.73% on the real platform.
Parallel Java (PJ)  combines features offered by OpenMP and by MPI. PJ is a unified shared memory and message passing parallel programming library written in Java. Using the same PJ API, a programmer can write parallel programs for SMP machines, clusters, and hybrid SMP clusters. PJ also includes its own middleware for managing a queue of PJ jobs and launching processes on the cluster machines.
Figure 5: Parallel Java Architecture
The PJ architecture (see Figure 5) consists of a cluster parallel computer with one frontend processor and multiple backend processors connected by a high-speed network. PJ has a Job Launcher Daemon and a Job Scheduler Daemon. The Job Scheduler keeps track of each backend processor's status and maintains a queue of PJ jobs. To run a PJ job on the cluster, the JVM connects to the Job Scheduler, which waits until the requisite number of backend processors is idle and then tells the job frontend which backend processors to use. The job frontend connects to the Job Launcher on each backend processor and tells it to spawn a JVM. As the PJ program runs in the job backend processes, the backends set up connections among themselves for message passing, which is implemented using Java New I/O (NIO) direct byte buffers and socket channels.
The PJ API is exemplified with Floyd's algorithm for SMP programming, cluster programming and hybrid SMP cluster programming. Floyd's algorithm exhibits poor scalability for smaller problem sizes because it is not a massively parallel problem. A PJ program for block cipher key search was then tested; the key search program's scalability on this massively parallel problem is better than that of the Floyd's algorithm program. When PJ's message passing is compared with mpiJava using a ping-pong program, PJ's performance is found to be slightly better.
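The SMP parallelization of Floyd's algorithm can be sketched with a plain ExecutorService standing in for PJ's parallel team API (our own sketch, not PJ code): for each intermediate vertex k the independent rows are divided among threads, and the barrier required after every k round is one reason small problem sizes scale poorly.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Thread-parallel Floyd's all-pairs shortest-path algorithm: rows of the
// distance matrix are strided across threads; a latch acts as the per-round
// barrier so round k+1 only starts once all rows of round k are done.
public class ParallelFloyd {
    static void floyd(double[][] d, int nThreads) throws InterruptedException {
        int n = d.length;
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            for (int k = 0; k < n; k++) {
                final int kk = k;
                CountDownLatch round = new CountDownLatch(nThreads);
                for (int t = 0; t < nThreads; t++) {
                    final int tt = t;
                    pool.execute(() -> {
                        for (int i = tt; i < n; i += nThreads)   // this thread's rows
                            for (int j = 0; j < n; j++)
                                d[i][j] = Math.min(d[i][j], d[i][kk] + d[kk][j]);
                        round.countDown();
                    });
                }
                round.await();   // barrier: all rows must finish before the next k
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        double INF = Double.POSITIVE_INFINITY;
        double[][] d = {{0, 3, INF}, {INF, 0, 1}, {2, INF, 0}};
        floyd(d, 2);
        System.out.println(Arrays.deepToString(d));
    }
}
```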
The mpiJava  API follows the model of the C++ binding specified in the MPI-2 standard. The major classes of mpiJava are MPI, Group, Comm, Datatype, Status and Request (see Figure 6). The class MPI has static members and contains global services. Comm is the communication class; its communication functions are either its own members or members of its subclasses, and the standard send and receive operations of MPI are members of Comm. Datatype describes the type of the elements in the message buffers. mpiJava provides all the derived datatype constructors of standard MPI except MPI_TYPE_STRUCT, and it has its own predefined datatype MPI_Object.
There is no conflict between the Java runtime and the interrupt mechanisms of the MPI implementations, because the JDK allows the use of either green or native threads. mpiJava works stably on NT platforms using WMPI and JDK 1.1 or later, and on UNIX platforms using MPICH and JDK 1.2.
Figure 6: Classes of mpiJava
For functionality tests, a C test suite originally developed by IBM was translated to mpiJava. The basic inter-processor communication performance tests are based on PingPong. High startup latencies are observed due to the relatively poor performance of the JVM. In both Shared Memory (SM) mode and Distributed Memory (DM) mode the mpiJava wrapper adds some extra microseconds compared to WMPI and MPICH. The performance of mpiJava in SM mode under WMPI is good, and the MPICH results for mpiJava show reasonable performance. In DM mode under WMPI the C and mpiJava codes display very similar performance characteristics throughout the range tested, while under MPICH a distinct performance difference between C and mpiJava is observed.
The middleware P2P-MPI  has been designed to facilitate the deployment of parallel programs on grids. It provides automatic discovery of resources, a message passing programming model, a fault-detection system and replication mechanisms.
The API implements appropriate message handling and relies on the following three modules (see Figure 7): the Message Passing Daemon (MPD), the File Transfer Service (FT) and the Fault Detection Service (FD).
Figure 7: P2P-MPI structure
In P2P-MPI, an up-to-date directory of participating nodes and the discovery of the execution platform use the discovery service of JXTA. P2P-MPI has a replication mechanism to increase the robustness of the execution platform, with the exception of the root process. Coordinating the replicas requires a protocol ensuring that the communication scheme remains coherent with the semantics of the original MPI program. Of the two broad classes of such protocols, passive replication and active replication, P2P-MPI follows the latter strategy; moreover, specific agreement protocols are added on both the sender and receiver sides.
For fault detection and recovery, a gossip-style protocol is adopted in P2P-MPI instead of the classic push and pull models of fault detectors. Two experiments were set up to test communication and computation performance, how applications scale, and the impact of replication on performance.
In the first experiment, the gap between P2P-MPI and some reference MPI implementations is measured using two test programs with opposite characteristics from the NAS benchmarks: IS (Integer Sorting) and EP (Embarrassingly Parallel). The hardware platform for this experiment is a student computer room. The prototype achieves its goals at the expense of an overhead caused by extra communications; P2P-MPI has simpler optimizations, and the use of a virtual machine (Java) instead of processor-native code leads to slower computation.
The IS benchmark shows that P2P-MPI performs approximately as well as LAM/MPI up to 16 processors, while MPICH-1.2.6 is noticeably slower on this platform. IS shows a linear increase of execution time with the replication degree, and the P2P-MPI performance curve tends to approach the LAM and MPICH ones. The EP benchmark shows that P2P-MPI is slower for computations because it uses Java; this test is twice as slow as when the EP programs use Fortran. The EP test shows very little difference with or without replication.
In the second experiment, the ray-tracer application from the Java Grande Forum MPJ benchmark is used to test P2P-MPI. After several series of executions it was observed that the application scales well up to 64 processors on a single site, while with 128 processors the scalability decreases markedly and turns into a slowdown with distant sites, because the computation-to-communication ratio does not allow more parallelism. Nevertheless, the experiment confirms the good scalability of the application on a single site.
MPJ Express  is a thread-safe Java messaging library that provides a full implementation of the mpiJava 1.2 API specification. Two communication devices are implemented as part of the library: the first, niodev, is based on the Java New I/O package, and the second, mxdev, is based on the Myrinet eXpress (MX) library. MPJ Express comes with an experimental runtime that allows portable bootstrapping of Java Virtual Machines across a cluster or network of computers.
Figure 8: The MPJ Express layered design
The design philosophy of MPJ Express is similar to that of Ibis. MPJ Express has a layered design that allows incremental development and provides the capability to update and swap layers in or out as needed. Figure 8 gives a layered view of the messaging system, showing the MPJ Express levels: high-level, base-level, mpjdev, and xdev. The high and base levels rely on the mpjdev and xdev levels for actual communication and for interaction with the underlying networking hardware.
There are two implementations of the mpjdev level. The first provides JNI wrappers to native MPI implementations, making it possible to use higher-level features of the native MPI library instead of relying on the pure Java ones implemented at the top levels of the design. The second uses the lower-level device, xdev, to provide access to Java sockets or specialized communication libraries. xdev is not needed by the wrapper implementation because the native MPI is responsible for selecting and switching between the different communication protocols.
An important component of a Java messaging system is the mechanism used for bootstrapping MPJ Express processes across various platforms; the challenge is to cope with heterogeneous platforms. If the compute nodes run a UNIX-based OS, commands can be executed remotely using SSH, but if they run Microsoft Windows such utilities are not universally available. The MPJ Express runtime therefore provides a unified way of starting processes on compute nodes irrespective of the operating system. The runtime consists of two modules. The daemon module executes on the compute nodes and listens for requests to start MPJ Express processes; the daemon is a Java application listening on an IP port, which starts a new JVM whenever there is a request to execute an MPJ Express process. MPJ Express uses the Java Service Wrapper Project  software to install the daemons as a native OS service. The mpjrun module acts as a client to the daemon module.
The Java Grande Forum proposed a next-generation MPI for Java called MPJ (Message Passing in Java), based on the lessons learnt from previous implementations and on the requirements of contemporary scientific computing. MPJ/Ibis  is the first available implementation of MPJ; it is written in pure Java but can exploit high-speed networks through some native code and optimization techniques for performance. The MPJ  specification builds directly on the MPI-1 infrastructure provided by the MPI Forum, together with language bindings motivated by the C++ bindings of MPI-2.
The MPJ/Ibis paper discusses the design choices made in implementing the MPJ API. Micro-benchmarks from the Java Grande benchmark suite show the following results:
On a Myrinet cluster, MPJ/Ibis communication is slower than C-based MPICH, but MPJ/Ibis outperforms mpiJava, a Java wrapper for MPI.
Using TCP on Fast Ethernet, MPJ/Ibis is significantly faster than C-based MPICH; mpiJava does not run in this configuration.
The benchmark applications from the Java Grande benchmark suite  show that MPJ/Ibis is either on par with mpiJava or even outperforms it. MPJ/Ibis is thus a promising message-passing platform for Java, combining competitive performance with portability ranging from high-performance clusters to grids.
A key problem in making Java suitable for grid programming is designing a system that obtains high communication performance while retaining Java's portability. Current Java runtime environments are heavily biased toward either portability or performance. The Ibis strategy achieves both goals simultaneously: with Ibis, grid applications can run on a variety of different machines, using optimized software where possible (e.g., Myrinet) and standard software (e.g., TCP) where necessary.
MPJ/Ibis is written completely in Java on top of the Ibis Portability Layer and matches the MPJ specification mentioned in . The architecture of MPJ/Ibis, shown in Figure 9, is divided into three layers:
Figure 9: The architecture of MPJ/Ibis
The Communication Layer provides the low-level communication operations. The MPJObject class stores MPJ/Ibis messages and the information needed to identify them, i.e. the tag and context id. To avoid serialization overhead the MPJObject is not sent directly but is split into a header and a data part. When the header and message arrive at the destination, MPJ/Ibis decides either to put the message directly into the receive buffer or into a queue, where the retrieved message waits for further processing.
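The match-or-queue decision can be sketched as follows (class and field names are ours, not MPJ/Ibis source): an arriving header carries a tag and context id; if a receive with those values is posted the data goes straight to its buffer, otherwise the message is parked in an unexpected-message queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of header-based message matching: deliver directly when a matching
// receive is posted, otherwise queue the message for a later receive.
public class MessageMatcher {
    static final class Posted {
        final int tag, context;
        final List<double[]> buffer = new ArrayList<>();   // the receive buffer
        Posted(int tag, int context) { this.tag = tag; this.context = context; }
    }

    final List<Posted> postedReceives = new ArrayList<>();
    final Deque<double[]> unexpectedQueue = new ArrayDeque<>();

    /** Called when a header (tag, context) plus data arrives from the wire. */
    boolean onArrival(int tag, int context, double[] data) {
        for (Posted p : postedReceives) {
            if (p.tag == tag && p.context == context) {
                p.buffer.add(data);        // deliver directly into the receive buffer
                return true;
            }
        }
        unexpectedQueue.add(data);         // no match yet: queue it (copy mandatory)
        return false;
    }
}
```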
The Base Communication Layer takes care of the basic sending and receiving operations of the MPJ specification, including the blocking and non-blocking send and receive operations and the various test and wait statements. It is also responsible for group and communicator management.
The Collective Communication Layer implements the collective operations on top of the Base Communication Layer.
The MPJ/Ibis implementation avoids expensive operations such as buffer copying, serialization and threads where possible. On the sender side, MPJ/Ibis analyses the message to determine whether it needs to be copied into a temporary buffer, which is necessary when displacements are used. On the receiver side, MPJ/Ibis has to decide for which communicator the message is intended. The receive operation uses a blocking downcall receive to the Ibis receive port, where it waits for a message to arrive. When the message header comes in, MPJ/Ibis determines whether the message was expected; if not, the whole message including the header is packed into an MPJObject and moved into a queue, where copying is mandatory. Otherwise, MPJ/Ibis decides either to receive the message directly into the user's receive buffer or into a temporary buffer from which it is copied to its final destination. There is no need to use threads for the blocking send and receive operations in MPJ/Ibis. MPJ also supports non-blocking communication operations, such as isend and irecv, which are built on top of the blocking operations using Java threads.
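Building a non-blocking isend on top of a blocking send with a Java thread can be sketched as follows (the Request and channel here are our own illustration, not MPJ/Ibis internals): isend returns immediately with a handle, and awaiting the handle corresponds to MPJ's Wait().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Non-blocking send built from a blocking one: the blocking put() runs on a
// worker thread while the caller continues; await() joins the transfer.
public class NonBlockingSend {
    static final BlockingQueue<double[]> wire = new LinkedBlockingQueue<>();
    static final ExecutorService pool = Executors.newCachedThreadPool();

    static final class Request {
        private final Future<?> done;
        Request(Future<?> done) { this.done = done; }
        void await() throws Exception { done.get(); }   // plays the role of Wait()
    }

    static Request isend(double[] data) {
        // The blocking operation is delegated to a thread; isend returns at once.
        return new Request(pool.submit(() -> { wire.put(data); return null; }));
    }

    static double[] blockingRecv() throws InterruptedException {
        return wire.take();
    }

    public static void main(String[] args) throws Exception {
        Request r = isend(new double[] {42.0});
        double[] msg = blockingRecv();
        r.await();
        pool.shutdown();
        System.out.println(msg[0]);
    }
}
```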
Some open issues remain. Since Java provides derived datatypes natively, there is no real need to implement derived datatypes in MPJ/Ibis; nevertheless, contiguous derived datatypes are supported to achieve the functionality of the reduce operations MINLOC and MAXLOC, which need at least a pair of values inside a given one-dimensional array. At the moment MPJ/Ibis supports one-dimensional arrays; multidimensional arrays can be sent as an object, but in-place receive is not possible in that case. MPJ/Ibis supports creating and splitting new communicators, but intercommunication is not implemented yet, and virtual topologies are not supported at this time.
MPJ/Ibis is evaluated in comparison with MPICH and mpiJava. For the Java measurements, two different JVMs are used, one from Sun and one from IBM, both version 1.4.2. For C, MPICH/GM is used on Myrinet and MPICH/P4 on Fast Ethernet.
For roundtrip latency: on Myrinet, Java has considerably higher latencies than C, partly caused by switching from Java to C through the JNI. On Fast Ethernet, Ibis and MPJ are a lot faster than MPICH/P4; in this case only Java code is used, so the JNI is not involved.
For throughput with 64 KByte arrays of doubles, the data is received into preallocated arrays, no new objects are allocated and no garbage collection is done by the JVM. The numbers show that the IBM JVM is much faster than the Sun JVM in this case, because the Sun JVM makes a copy of the array when going through the JNI.
JMPI (Java Message Passing Interface)  is an implementation of a message passing library that complies with MPJ. The library provides a graphical user interface and is based on two typical distributed-system communication mechanisms, sockets and RMI.
Sockets are complicated to implement against, and RMI (Remote Method Invocation) is the appropriate alternative. The JMPI libraries include the classes that communicate and manage process ports, share information among processes, provide communication with daemons and offer methods for initialisation and finalisation. The core library of the JMPI package consists of four classes: JMPIApp, Comm, Slave and MsgHandler. The JMPIApp class includes the initialization method and registers the RMI-based message passing system for communication. The Comm class manages communication with processes, the Slave class manages processes and their port numbers, and the MsgHandler class identifies and orders messages in the asynchronous send/receive methods.
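An RMI-style message delivery interface in the spirit of JMPI can be sketched as follows (the interface and method names are hypothetical, not JMPI's actual API; a real deployment would export the receiver with UnicastRemoteObject and look it up through an RMI registry, a bootstrap omitted here, where a local call stands in for the remote invocation):

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

// RMI's appeal for message passing: the remote call serializes the array
// automatically, so no hand-written socket framing code is needed.
public class RmiMessaging {
    interface MessageEndpoint extends Remote {
        // Invoked by a remote sender through its RMI stub.
        void deliver(int sourceRank, int tag, double[] data) throws RemoteException;
    }

    static final class Endpoint implements MessageEndpoint {
        final List<double[]> inbox = new ArrayList<>();
        public void deliver(int sourceRank, int tag, double[] data) {
            inbox.add(data);   // a MsgHandler-like class would order messages by tag here
        }
    }

    public static void main(String[] args) throws Exception {
        Endpoint receiver = new Endpoint();
        receiver.deliver(0, 1, new double[] {3.14});   // stands in for a remote call
        System.out.println(receiver.inbox.size());
    }
}
```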
JMPI provides a set of GUI tools that enable users to easily set up distributed computing environments and monitor the state of distributed applications.
The performance of each message passing system (socket- and RMI-based) is evaluated by measuring its processing speed with respect to an increasing number of computers, by executing three well-known applications: CPI, ASP and SOR. CPI is a simple application calculating the value of pi; ASP is an All-pairs Shortest Paths application computing the shortest path between any two nodes of a given graph; and SOR is a Red-Black Successive Over-Relaxation application for solving partial differential equations.
The comparison of the two message passing systems is performed by measuring the processing speed as the number of computers increases. The observations are as follows: the CPI application runs efficiently on the socket-based system on the 8-node cluster and on the RMI-based system on the 12-node cluster; the ASP application also runs efficiently.
The performance results show that the two types of JMPI systems obtain their speedups differently, depending on both the communication patterns of the applications and the number of computers used. This indicates that the most efficient processing speed can be obtained by choosing the number of computers in consideration of the network traffic generated by the applications.
F-MPJ is a scalable and efficient Java message-passing library. It provides efficient non-blocking MPJ communication based on Java IO sockets, allowing communication overlap, and is tightly coupled with JFS, a high-performance Java IO sockets implementation that supports shared memory and high-speed networks and avoids the serialization of primitive data type arrays. F-MPJ also avoids the use of communication buffers and implements scalable Java message-passing collective primitives.
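The point of non-blocking primitives is that communication proceeds while the caller keeps computing, with a final wait to complete the transfer (the Isend/Irecv-then-Wait pattern of MPJ APIs). Since running F-MPJ itself requires its runtime, the following sketch simulates the pattern with a `Future` standing in for an MPJ `Request`; all names here are illustrative:

```java
import java.util.concurrent.*;

public class OverlapDemo {
    // Simulates a non-blocking "Isend": hands the payload to a background
    // sender and immediately returns a handle the caller can wait on.
    static Future<Void> isend(ExecutorService comm, int[] payload) {
        return comm.submit(() -> {
            Thread.sleep(50);   // stand-in for the network transfer time
            return null;
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService comm = Executors.newSingleThreadExecutor();
        int[] buffer = new int[1024];

        Future<Void> req = isend(comm, buffer);  // communication starts...

        long sum = 0;                            // ...while we keep computing
        for (int i = 0; i < 1_000_000; i++) sum += i % 7;

        req.get();                               // "Wait": completes the send
        comm.shutdown();
        System.out.println("overlap done, sum=" + sum);
    }
}
```

When the computation phase is long enough to hide the transfer time, the communication cost effectively disappears from the critical path, which is why non-blocking support matters for the communication-intensive benchmarks discussed below.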
Figure 10: Overview of F-MPJ communication layers on HPC hardware
F-MPJ has been evaluated on an InfiniBand multi-core cluster, significantly outperforming two representative MPJ libraries, MPJ Express and MPJ/Ibis. Micro-benchmarking showed performance increases of up to 60 times for F-MPJ, and subsequent kernel and application benchmarking obtained speedup increases of up to seven times on 128 cores, depending on the communication intensiveness of the analyzed MPJ benchmarks. F-MPJ has improved MPJ performance scalability, allowing Java message-passing codes whose speedups previously levelled off at 8-16 cores to take advantage of 128 cores, significantly improving on the performance and scalability of current MPJ libraries.
Message-passing is the most widely used parallel programming paradigm, as it is highly portable, scalable and usually provides good performance. It is the preferred choice for parallel programming on distributed memory systems such as clusters, which can provide higher computational power than shared memory systems. For languages compiled to native code (e.g., C and Fortran), MPI is the standard message-passing interface.
The Message-passing in Java (MPJ) libraries can be implemented:
using Java RMI
wrapping an underlying native messaging library like MPI through Java Native Interface (JNI)
using Java sockets.
Each solution fits specific situations but presents associated trade-offs. The use of Java RMI, a pure Java approach, as the basis for MPJ libraries ensures portability, but it may not be the most efficient solution, especially in the presence of high-speed communication hardware. The use of JNI has portability problems, although usually in exchange for higher performance. The use of a low-level API, Java sockets, requires considerable programming effort, especially to provide scalable solutions, but it significantly outperforms RMI-based communication libraries. Although most Java communication middleware is based on RMI, MPJ libraries seeking efficient communication have followed the latter two approaches.
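Much of the performance gap between the socket and RMI approaches comes from serialization: a socket-based library can write a primitive array as raw bytes with `DataOutputStream`, whereas RMI marshals it through object serialization. A minimal loopback sketch of the socket approach (illustrative only; this is not any library's actual wire format):

```java
import java.io.*;
import java.net.*;

public class SocketMsgDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // ephemeral port
            // "Receiver" process, here just another thread on loopback.
            Thread receiver = new Thread(() -> {
                try (Socket s = server.accept();
                     DataInputStream in = new DataInputStream(s.getInputStream())) {
                    int len = in.readInt();          // message header: length
                    int[] data = new int[len];
                    for (int i = 0; i < len; i++) data[i] = in.readInt();
                    long sum = 0;
                    for (int v : data) sum += v;
                    System.out.println("received sum=" + sum);
                } catch (IOException e) { throw new UncheckedIOException(e); }
            });
            receiver.start();

            // "Sender": writes the primitive array as raw ints; no object
            // serialization is involved (the idea behind JFS's optimization).
            try (Socket s = new Socket("127.0.0.1", server.getLocalPort());
                 DataOutputStream out = new DataOutputStream(
                         new BufferedOutputStream(s.getOutputStream()))) {
                int[] data = {1, 2, 3, 4, 5};
                out.writeInt(data.length);
                for (int v : data) out.writeInt(v);
                out.flush();
            }
            receiver.join();
        }
    }
}
```

The programming effort the text mentions is visible even here: framing, byte order, and buffer management are all on the library author, which RMI would otherwise handle automatically.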
Currently, mpiJava and MPJ Express are the most active projects in terms of uptake by the HPC community, presence in academia and production environments, and available documentation. Both projects are stable and publicly available along with their source code.
Each of the selected Java-based MPI implementations has its own features, strengths and weaknesses (see Table 1). To make the comparative study tangible, several parameters are identified, as follows:
Implementation Style (pure Java / Java wrapper)
Resource Environment (homogeneous / heterogeneous)
Communication Libraries (native MPI, Java I/O, Java new I/O, RMI, Sockets)
Communication Type (blocking, non-blocking, synchronous, asynchronous)
Memory Model (shared, distributed)
Multiple Protocol Support
High-Speed Networks Support
Table 1: Comparison of the reviewed implementations across these parameters. [The table's cells were flattened during extraction; recoverable entries include Parallel Java and MPJ Express using Java new I/O; blocking and non-blocking communication modes; Java I/O, RMI and Sockets; shared, distributed and hybrid memory models; and Java I/O with JFS.]
The implementation style plays an important role: a pure Java implementation assures portability. The resource environment parameter captures flexibility with respect to hardware as well as software resources.
Communication libraries must be flexible and support different implementations to assure interoperability. In addition, the communication layer must be independent of the underlying hardware.
Communication type is an important parameter for HPC frameworks: at a minimum, synchronous, blocking and non-blocking communication should be provided. Asynchronous communication is increasingly desirable in HPC frameworks, as it improves message delivery reliability even when the recipient is not available at the time the message is sent.
There are two major classes of memory model: shared memory and distributed memory. A unified approach clearly provides scalability and performance for clusters of multi-core systems.
Java I/O over TCP is sufficient for basic MPI communication, but the current trend towards high-speed Internet demands that MPI support message passing over the Internet as well. The framework should therefore support different network protocols according to the layout of the network (LAN, WAN, etc.), which is why multiple protocol support is included in the parameter list.
Communication speed is crucial to achieving the best performance over the network. Most MPI implementations support high-speed networks such as Myrinet, InfiniBand and SCI. Currently, there is no pure Java support for these high-speed networks; the only solution is to use native libraries.
Table 1 provides the comparison, similarities and dissimilarities between the various implementations.
JavaMPI, like mpiJava but unlike the other models presented in the literature review, is not a pure Java implementation; it is based on native communication libraries. Like MPIJ and MPJava, it does not provide support for high-speed network libraries. The difference between JavaMPI and JCluster or Parallel Java is that both JCluster and Parallel Java support shared and distributed memory models, whereas JavaMPI supports only the distributed memory model. Communication in P2P-MPI is based on Java I/O and Java new I/O, whereas JavaMPI's communication libraries are native; still, the communication performance of the two models is comparable.
MPIJ provides pure Java communication based on Java I/O; in contrast, MPJ Express supports Java new I/O, Java I/O and native libraries for high-speed networks. Compared with MPJ/Ibis, both support the distributed memory model, but the communication libraries of MPJ/Ibis are based on Ibis. The communication modes of MPIJ are blocking, non-blocking and asynchronous, while JCluster is the only model that provides an asynchronous mode of communication using the UDP protocol.
MPJava is another pure Java implementation, targeting homogeneous resources. It is the only implementation that does not support non-blocking message communication.
P2P-MPI and MPJ/Ibis are the only two models in the reviewed literature that support grid computing. JMPI is unique in having a communication layer based on Java I/O, RMI and Sockets.
F-MPJ is a promising implementation. Unlike MPJ Express, MPJ/Ibis and mpiJava, which use native libraries, it uses JFS for high-speed network support. However, mpiJava and F-MPJ are the only implementations that provide high-speed network support for Myrinet, InfiniBand and SCI.
This paper has studied the existing status of Java for HPC. It has been observed that most of these projects remain experimental; consequently, general adoption of any of them has not yet been seen.
The performance overhead imposed by Java still raises concerns for researchers, although there is no doubt about the features Java provides for parallel programming of multi-core architectures. Indeed, Java is an obvious choice for programmers in the future.
The study concludes with the firm view that the use of Java for HPC still requires an easy-to-use client-side API as well as a framework for programmers and researchers: a framework which provides system management, data distribution and messaging facilities for HPC applications, together with portability, performance, handling of irregular data structures, transparency of execution, multiple programming language support, automatic scheduling, and asynchronous communication in a heterogeneous environment.