A SEMINAR REPORT ON
Submitted by: MD NASIM ALAM
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
OpenMosix is a Linux kernel extension for single-system image clustering. This kernel extension turns a network of ordinary computers into a cluster computer for Linux applications. Once we have installed OpenMosix, the nodes in the cluster start talking to one another and the cluster adapts itself to the workload. Processes originating from any one node, if that node is too busy compared to others, can migrate to any other node. With OpenMosix' Auto Discovery, a new node can be added while the cluster is running and the cluster will automatically begin to use the new resources. There is no need to program applications specifically for OpenMosix. Since all OpenMosix extensions are inside the kernel, every Linux application automatically and transparently benefits from the distributed computing concept of OpenMosix. OpenMosix is a free cluster management system was originally forked from MOSIX by Moshe Bar. It allows program processes (not threads) to migrate to machines in the node's network that would be able to run that process faster (process migration). It is particularly useful for running parallel and intensive input/output (I/O) applications. It is released as a Linux kernel patches, but is also available on specialized Live CDs.
OpenMosix is a kernel extension for Single System Image Clustering. It is a tool for Linux kernel consisting of adaptive resource sharing algorithms. It allows multiple uniprocessors (UP) and symmetric multiprocessors (SMPnodes) running in the same kernel to work in close co-operation. The goal is to improve the cluster-wide performance and to create a convenient mutiuser, time sharing environment for the execution of both sequential and parallel applications. The standard run time environment of OpenMosix is a computing cluster, in which the cluster wide resources are available to each node. The current implementation of OpenMosix is designed to run on clusters of X86/Pentium based workstations, both Ups and SMPs, that are connected by standard LANs.Possible configuration may range from a small cluster of PCs that are connected by 10Mbps Ethernet, to a high performance system, with a large number of high-end, Pentium based SMP servers that are connected by Gigabit LAN or Myrinet. Where Myrinet is a cost- effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations. OpenMosix is a free cluster management system was originally forked from MOSIX by Moshe Bar e.g. automatic work distribution among nodes. It allows program processes (not threads) to migrate to machines in the node's network that would be able to run that process faster (process migration). It is particularly useful for running parallel and intensive input/output (I/O) applications. It is released as a Linux kernel patches, but is also available on specialized Live CDs. If we have two or more computers, chances are that at any given time, at least one of them is doing nothing. Unfortunately, when you really do need CPU power - during a C++ compile, or encoding music files - we need a lot of it at once. The idea behind clustering is to spread these loads among all available computers, using the resources that are free on other machines.
The basic unit of a cluster is a single computer, also called a "node". Clusters can grow in size i.e. they scale by adding more machines. A cluster as a whole will be more powerful the faster the individual computers and the faster their connection speeds are. In addition, the operating system of the cluster must make the best use of the available hardware in response to changing conditions. This becomes more of a challenge if the cluster is composed of different hardware types (a "heterogeneous" cluster), if the configuration of the cluster changes unpredictably (machines joining and leaving the cluster), and the loads cannot be predicted ahead of time.
The clustering refers to technologies that allow multiple computers to work together to solve common computing problems. The computing problems in question can be anything from complex CPU-intensive scientific computations to a horde of miscellaneous processes with no underlying commonality. A cluster is a group of locally connected computers that work together as a unit. Clusters provide powerful and faster computation. It is scalable and ensures automatic recovery from failure. When a node fails the cluster automatically re runs the failed job. There are different types of Linux based clusters. Beowulf cluster and Single System Image (SSI) cluster. Probably the best known type of Linux â€œbased cluster is the Beowulf cluster.
A Beowulf cluster consists of multiple machines connected to one another on a high speed LAN.In order for these systems to be able to pool their computing resources ,special cluster enabled applications must be written using clustering libraries. The most popular clustering libraries are PVM and MPI; both are very mature and work well. By using PVM or MPI, programmers can design applications that can span across an entire clusterâ„¢s computing resources rather than being confined to the resources of a single machine. For many applications, PVM and MPI allow computing problems to be solved at a rate that scales almost linearly in relation to the number of machines in the cluster While Beowulf clusters are extremely powerful, they arenâ„¢t for everyone. The primary drawback of Beowulf clusters is that they require specially designed software (written with explicit PVM or MPI support) in order to take advantage of cluster resources. This is not a problem for those in the scientific and research communities who are used to writing their own special purpose applications from scratch; since they write their code in-house, they can use the PVM or MPI libraries to create cluster aware applications.
Beowulf clusters are able to run specially written parallel programs. Suppose we have to encode Ëœ.wavâ„¢ file into Ëœa .mp3 fileâ„¢. On Linux we can use the Ëœlameâ„¢ program to do this .On a Beowulf cluster of about 10 nodes ,we have to modify the Ëœlameâ„¢ program to split the Ëœ.wav fileâ„¢ into 10 chunks and run parallel instances of the lame on each of the nodes. Once done the lame process can then consolidate the results and give one single .mp3 file. Since the work is split and done among the nodes simultaneously, it will take one-tenth the time it would take lame to run on a single PC.This is the kind of work Beowulf clusters are capable of, but we need to modify the Ëœlame Ëœprogram to take advantage of the Beowulf cluster .The Beowulf cluster requires specially designed software in order to take advantage of the cluster resources. Also the implementation is complicated.
2.2Single System Image cluster
In Single System Image cluster configuration, the whole cluster looks like single machine that has many CPUs and large amount of RAM No special hardware is required for this type of cluster. We can string together normal PCs and build a cluster that presents the consolidated power of these system. The PCs need to be networked .So the network interface card (NIC) on each of the systems is needed. For a cluster consisting of more than two systems a hub/switch is required to connect them. If there is only two nodes we can use a cross cable to link the two NICs.The primary goal of the SSI cluster is load balancing and optimal use of resources.SSI cluster make use of the system resources transparent and will offer improved system response time and performance. It simplifies the management because the system administrator does not have to know the underlying system architecture to use the machines effectively.
SSI cluster or OpenMosix cluster are not used for the type of the parallel computing found in the Beowulf cluster. Considering the encoding of Ëœ.wav fileâ„¢, 10 instances of Ëœlame programâ„¢ is run on 10 nodes. This convert 10 Ëœ.wav fileâ„¢ to 10 different Ëœ.mp3 filesâ„¢. There is no need to modify the Ëœlame programâ„¢ for this.
3. WHAT IS OPENMOSIX?
OpenMosix is a Linux kernel extension for SSI clustering. The kernel extension turns a network of ordinary computers into a super computer for Linux applications. Once you installed OpenMosix, the nodes in the cluster will start talking to each other by exchanging messages. The cluster adapts itself to the wok load. OpenMosix adds cluster functionality to any Linux flavor. OpenMosix uses adaptive load balancing techniques; processes that run on a node can transparently be distributed to another. Due to the complete transparency of OpenMosix, a process does not have to know where it is running .The process thinks that it is running locally.
The transparency means that no additional programming is needed to take advantage of the OpenMosix load balancing technology. OpenMosix turns multiple Linux hosts into one large virtual SMP.Real SMP systems with two or more physical processors can exchange large amounts of data, in practice this means that real SMP systems much faster. With OpenMosix, the speed at which the nodes can exchange data is limited to the speed of the LAN connection. Using a high bandwidth connection will increase the effectiveness of the OpenMosix cluster. Another great advantage of OpenMosix is the ability to build a cluster out of inexpensive hardware giving you a traditional supercomputer. OpenMosix can also be used with performance enhancing techniques like Hyper-Threading available on Intel Pentium 4 and Xeon processors. Using this technique we can enhance the performance of a node. The node can handle multiple co operating threads that cannot be separated and distributed among OpenMosix nodes.
4. OPENMOSIX ROLE
The OpenMosix software package automatically balances the load between different nodes of the cluster, and nodes can join or leave the running cluster without disruption of the service. The load is spread out among nodes according to their connection and CPU speeds. Since OpenMosix is part of the kernel and maintains full compatibility with Linux a user's programs, files, and other resources will all work as before without any further changes. The casual user will not notice the difference between a Linux and an OpenMosix system. To us the whole cluster will function as one fast GNU/Linux system. OpenMosix is a Linux-kernel patch which provides full compatibility with standard Linux for IA32-compatible platforms. The internal load-balancing algorithm transparently migrates processes to other cluster members. The advantage is a better load- sharing between the nodes. The cluster itself tries to optimize utilization at any time. This transparent process-migration feature makes the whole cluster look like a BIG SMP-system with as many processors as available cluster-nodes (of course multiplied with X for X-processor systems such as dual/quad systems and so on). OpenMosix also provides a powerful optimized File System (oMFS) for HPC-applications, which unlike NFS provide cache, time stamp and link consistency.
5. OPENMOSIX TECHNOLOGY
The OpenMosix technology consists of two parts: a Preemptive Process Migration (PPM) mechanism and a set of algorithms for adaptive resource sharing. Both parts are implemented at the kernel level, using a loadable module, such that the kernel interface remains unmodified. Thus they are completely transparent to the application level. The PPM can migrate any process, at anytime, to any available node. Usually, migrations are based on information provided by one of the resource sharing algorithms, but users may override any automatic system-decisions and migrate their processes manually. Each process has a Unique Home-Node (UHN) where it was created. Normally this is the node to which the user has logged-in. The single system image model of OpenMosix is a CC (cache coherent) cluster, in which every process seems to run at its UHN, and all the processes of a user's session share the execution environment of the UHN. Processes that migrate to other (remote) nodes use local (in the remote node) resources whenever possible, but interact with the user's environment through the UHN. The PPM is the main tool for the resource management algorithms. As long as the requirements for resources, such as the CPU or main memory are below a certain threshold, the user's processes are confined to the UHN. When the requirements for resources exceed some threshold levels, then some processes may be migrated to other nodes to take advantage of available remote resources. The overall goal is to maximize the performance by efficient utilization of the network-wide resources. The granularity of the work distribution in OpenMosix is the process. Users can run parallel applications by initiating multiple processes in one node, and then allow the system to assign these processes to the best available nodes at that time. If during the execution of the processes new resources become available, the resource sharing algorithms are designed to utilize these new resources by possible reassignment of the processes among the nodes. This capability to assign and reassign processes is particularly important for ease-of-use and to provide an efficient multi-user, timesharing execution environment. OpenMosix has no central control or master/slave relationship between nodes: each node can operate as an autonomous system, and it makes all its control decisions independently. This design allows a dynamic configuration, where nodes may join or leave the network with minimal disruptions. Additionally, this allows for a very great scalability and ensures that the system runs well on large configurations as it does on small configurations. Scalability is achieved by incorporating randomness in the system control algorithms, where each node bases its decisions on partial knowledge about the state of the other nodes, and does not even attempt to determine the overall state of the cluster or any particular node. For example, in the probabilistic information dissemination algorithm, each node sends, at regular intervals, information about its available resources to a randomly chosen subset of other nodes. At the same time it maintains a small window, with the most recently arrived information. This scheme supports scaling, even information dissemination and dynamic configurations.
5.1The Resource Sharing Algorithms
The main resource sharing algorithms of OpenMosix are the load-balancing and the memory ushering. The dynamic load-balancing algorithm continuously attempts to reduce the load differences between pairs of nodes, by migrating processes from higher loaded to less loaded nodes. This scheme is decentralized all the nodes execute the same algorithms and the reduction of the load differences is performed independently by pairs of nodes. The number of processors at each node and their speed are important factors for the load balancing algorithm. This algorithm responds to changes in the loads of the nodes or the runtime characteristics of the processes. It prevails as long as there is no extreme shortage of other resources such as free memory or empty process slots. There are two main resource-sharing algorithms in OpenMosix: The first one, the memory ushering (depletion prevention) algorithm is geared to place the maximal number of processes in the cluster-wide RAM, to avoid as much as possible thrashing or the swapping out of processes. The algorithm is triggered when a node starts excessive paging due to shortage of free memory. In this case the algorithm overrides the load-balancing algorithm and attempts to migrate a process to a node that has sufficient free memory, even if this migration would result in an uneven load distribution. Lately, OpenMosix was given a new algorithm to select on which node a given program should run. The mathematical model for this scheduling algorithm comes from the field of economics research. Determining the optimal location for a job is a complicated problem. The most important complication is that the resources available on a cluster of Linux computers are heterogeneous. In effect, the costs for memory, CPU, process communication, and so forth are incomparable. They are not even measured in the same units. Communication resources are measured in terms of bandwidth, memory in terms of space, and CPU in terms of cycles. The natural greedy strategy, balancing the resources across all of the machines, is not even well defined. The new algorithm employed by OpenMosix is very interesting because it tries to reconcile these differences (and maybe it could be applied to non-cluster schedulers as well) based on economic principles and competitive analysis. The key idea of this strategy is to convert the total usage of several heterogeneous resources, such as memory and CPU, into a single homogeneous "cost". Jobs are then assigned to the machine where they have the lowest cost. Just like in a market oriented economy. This economic strategy provides a unified algorithm framework for allocation of computation, communication, memory, and I/O resources. It allows the development of near-optimal on-line algorithms for allocating and sharing these resources.
5.2 Pre- emptive Process Migration
OpenMosix supports preemptive and completely transparent process migration (PPM).After migration, a process continues to interact with its environment regardless of its location. To implement the PPM, the migrating process is divided into two contexts: the user context, which can be migrated, and the system context, which is UHN dependent and may not be migrated. The user context, called the remote, contains the program code, stack, data, memory maps, and registers of the process. The remote encapsulates the process when it is running in the user level. The system context, called the deputy, contains a description of the resources that the process is attached to, and a kernel-stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel. It holds the site dependent part of the system context of the process; hence it must remain in the UHN of the process. While the process can migrate many times between different nodes, the deputy is never migrated. The interface between the user-context and the system context is well defined. Therefore it is possible to intercept every interaction between these contexts, and forward this interaction across the network. This is implemented at the link layer, with a special communication channel for interaction. The migration time has a fixed component, for establishing a new process frame in the new (remote) site, and a linear component, proportional to the number of memory s to be transferred. To minimize the migration overhead, only the tables and the process' dirty s are transferred. In the execution of a process in OpenMosix, location transparency is achieved by forwarding site-dependent system calls to the deputy at the UHN. System calls are a synchronous form of interaction between the two process contexts. All system calls that are executed by the process are intercepted by the remote site's link layer. If the system call is site-independent it is executed by the remote locally (at the remote site). Otherwise, the system call is forwarded to the deputy, which executes the system call on behalf of the process in the UHN. The deputy returns the result(s) back to the remote site, which then continues to execute the user's code. Other forms of interaction between the two process contexts are signal delivery and process wakeup events, such as when network data arrives. These events require that the deputy asynchronously locate and interact with the remote. This location requirement is met by the communication channel between them. In a typical scenario, the kernel at the UHN informs the deputy of the event. The deputy checks whether any action needs to be taken, and if so, inform the remote. The remote monitors the communication channel for reports of asynchronous events, like signals, just before resuming user-level execution. This approach is robust, and is not affected even by major modifications of the kernel. It relies on almost no machine dependent features of the kernel, and thus does not hinder porting to different architectures. One drawback of the deputy approach is the extra overhead in the execution, of system calls. Additional overhead is incurred on file and network access operations.
6. OPENMOSIX DESIGN
Process Unique Home Node Remote Node
Before Migration UHN DS R Process Remote
After Migration DS-Deputy Stub R-Remote Stub UHN-Unique Home Node
The deputy is the representative of the remote process at the UHN. Because the entire user space memory resides at the remote node, the deputy does not hold a memory map of its own. Instead, it shares the main kernel map similarly to a kernel thread. In many kernel activities, such as the execution of system calls, it is necessary to transfer data between the user space and the kernel This is normally done by the copy to user(), copy from user() kernel primitives. In OpenMosix, any kernel memory operation that involves access to user space requires the deputy to communicate with its remote to transfer the necessary data. The overhead of the communication due to remote copy operations, which may be repeated several times within a single system call, could be quit substantial, mainly due to the network latency. In order to eliminate excessive remote copies, which are very common, a special cache was implemented that reduces the number of require interactions by pre- fetching as much data as possible during the initial system call request, while buffering partial data at the deputy to be returned to the remote at the end of the system call. To prevent the deletion or overriding of memory-mapped files (for demand-paging) in the absence of a memory map, the deputy holds a special table of such files that are mapped to the remote memory. The user registers of migrated processes are normally under the responsibility of the remote context. However, each register or combination of registers may become temporarily owned for manipulation by the deputy. Remote (guest) processes are not accessible to the other processes that run at the same node (locally or originated from other nodes) and vice versa. They do not belong to any particular user (on the remote node, where they run) nor can they be sent signals or otherwise manipulated by local processes. Their memory cannot be accessed and they can only be forced, by the local system administrator, to migrate out. A process may need to perform some OpenMosix functions while logically stopped or sleeping. Such processes would run OpenMosix functions "in their sleep," and then resume sleeping, unless the event they were waiting for has meanwhile occurred. An example is process migration, possibly done while the process is sleeping. For this purpose, maintains a logical state, describing how other processes should see the process, as opposed to its immediate state.
7. FILE SYSTEM
OpenMosix has its own new Cluster File System for Linux that gives a shared cluster wide view of all the file systems. Cluster developers saw that all current solutions for a cluster- wide file system relied on a central file server, but that some new file system technologies were being developed addressing the very needs of a single system image cluster (SSI) like OpenMosix. OpenMosix uses Direct File System Access (DFSA). DFSA was designed to reduce the extra overhead of executing I/O oriented system-calls of a migrated process. This was done by allowing the execution of most such system-calls locally - in the process's current node. In addition to DFSA, a new algorithm that takes into account I/O operation was added to the OpenMosix process distribution (load-balancing) policy. The outcome of these provisions is that a process that performs moderate to high volume of I/O is encouraged to migrate to the node in which it does most of its I/O. one obvious advantage is that I/O-bound processes have greater flexibility to migrate from their respective home- nodes for better load-balancing. So, unlike all existing network file systems (say, NFS) which bring the data from the file server to the client node over the network, the cluster attempts to migrate the process to the node in which the file actually resides.
Statistics about a process's behavior are collected regularly, such as at every system call and every time the process accesses user data. This information is used to assess whether the process should be migrated from the UHN. These statistics decay in time, to adjust for processes that change their execution profile. They are also cleared completely on the execve() system calls because the process is likely to change its nature. Each process has some control over the collection and decay of its statistics. For instance, a process may complete a stage knowing that its characteristics are about to change, or it may cyclically alternate between a combination of computation and I/O.
8. OPENMOSIX API
The OpenMosix API has been traditionally implemented via a set of reserved system calls that were used to configure, query, and operate OpenMosix. In line with the Linux convention, the API was modified to be interfaced via the /proc file system. This also prevents possible binary incompatibilities of user programs between different Linux versions. The API was implemented by extending the Linux /proc file system tree with a new directory: /proc/OpenMosix. The calls to OpenMosix via /proc include: synchronous and asynchronous migration requests; locking a process against automatic migrations; finding where the process currently runs; finding out about migration constraints; system setup and administration; controlling statistic collection and decay; information about available resources on all configured nodes; and information about remote processes.
9. MIGRATION CONSTRAINTS
Certain functions of the Linux kernel are not compatible with process context division. Some obvious examples are direct manipulations of I/O devices, such as direct access to privileged bus-I/O instructions, or direct access to device memory. Other examples include writable shared memory and real-time scheduling. The last case is not allowed because one cannot guarantee it while migrating, as well as being unfair towards processes of other nodes. A process that uses any of these functions is automatically confined to its UHN. If the process has already been migrated, it is first migrated back to the UHN. The OpenMosix Performance Unlike MPPs, which allow a single user per partition, CCs are geared for multiuser, timesharing environments. In order to make CC systems as easy to program, manage, and use as an SMP, it is necessary to develop means for global (cluster-wide) resource allocation and sharing that can respond to resource availability, distribute the workload dynamically, and utilize the available, cluster-wide resources efficiently and transparently. Such mechanisms are necessary for performance scalability in clusters of servers and to support a flexible use of workstations because the overall available resources in such systems are expected to be much larger than the available resources at any workstation or server. The development of such mechanisms is particularly important to support multiuser, time- sharing parallel execution environments, where it is necessary to share the resources and at the same time distribute the workload dynamically, to utilize the global resources efficiently.
OpenMosix is a kernel extension available as a patch to the 2.4 kernel series. An extension to
the 2.6 series is in works. Since there is no configuration involved installing OpenMosix is just a matter of compiling the kernel and its user land programs. To install OpenMosix at first need to acquire the OpenMosix 2.4.26 patch, the OpenMosix utilities and the 2.4.26 kernel from the sites listed at the reference section. Then follow these steps:
First untar the kernel sources and patch it with the $tar xvjf linux-2.4.26.tar. bz2
$mv openMosix â€œ2.4.26 â€œ1 linux â€œ2.4.26
$patch â€œpl < openMosix â€œ2.4.26 â€œ1
Then configure, compile© the kernel
$ make menuconfig
$make dep bzImage modules
$sudo make modules install
$sudo cp arch/i386/boot/bzImage/boot/kernel â€œ2.4.26 â€œom
$sudo cp .config /boot/config â€œ2.4.26 â€œom
$sudo cp System.map /boot/System.map â€œ 2.4.26 -om
Compile the other tools available in $tar xvzf openMosix â€œ tools â€œ0.3.6 â€œ2.tar.gz
$cd openMosix â€œtools â€œ0.3.6 â€œ2
$make && sudo make install
$tar xvzf openMosixview â€œ1.5.tar.gz
$cd openMosix â€œ1.5
$make && sudo make install
We should know what drivers we need for our system and we have to compile them as modules or include them in kernel directly.
11. Graphical User Interface of 1. OpenMosixView:Monitoring
2. OpenMosixViewProcs:Managing Processes
4. OpenMosixViewAnalyzer:Analyse the logs
5. OpenMosixViewHistory:history of the processes
6. OpenMosixMigmon:Drag and Drop migration monitoring tool
12. OPENMOSIX VIEW
The functionality is explained in the following. OpenMosix view displays a row with a lamp, a button, a slider, a lcd-number, two progress bars and some labels for each cluster-member. The lights at the left are displaying the openMosix-Id and the status of the cluster-node. Red if down, green for available. If you click on a button displaying the ip-address of one node a configuration-dialog will pop up. It shows buttons to execute the most common used "mosctl"- commands. (Described later in this How To) With the "speed-sliders" you can set the openMosix-speed for each host. The current speed is displayed by the lcd-number. You can influence the load-balancing of the whole cluster by changing these values. Processes in a OpenMosix-Cluster are migrating easier to a node with more OpenMosix-speed than to nodes with less speed. Sure it is not the physically speed you can set but it is the speed OpenMosix "thinks" a node has. E.g. a cpu-intensive job on a cluster-node which speed is set to the lowest value of the whole cluster will search for a better processor for running on and migrate away easily. The progress bars in the middle gives an overview of the load on each cluster-member. It displays in percent so it does not represent exactly the load written to the file /proc/hpc/nodes/x/load (by openMosix), but it should give an overview. The next progress bar is for the used memory the nodes. It shows the currently used memory in percent from the available memory on the hosts (the label to the right displays the available memory). How many CPUs your cluster has is written in the box to the right. The first line of the main windows contains a configuration button for "all-nodes". we can configure all nodes in your cluster similar by this option. How good the load-balancing works is displayed by the progress bar in the top left. 100% is very good and means that all nodes nearly have the same load.
The OpenMosix migmon
The OpenMosix migmon is a monitor for migrations in your openMosix-cluster.It displays all your nodes as little penguins sitting in a circle.
The main penguin is the node on which openMosixmigmon runs and around this node it shows its processes also in a circle of small black squares.
-> main process-circle
If a process migrates to one of the nodes the node gets an own process-circle and the process moved from the main process-circle to the remote process-circle. Then the process is