|
M O S I X
Cluster and Multi-Cluster Management
|
|---|
| Home About Distributions Clouds Wiki HUGI FAQ Pubs Contact |
|
MOSIX Frequently Asked Questions - Flat listing
|
Copyright © 1999 - 2010 A. Barak. All rights reserved.
|
General
|
Its main feature is to provide users and applications with the illusion of running on a single computer with multiple processors, without changing the interface and the run-time environment of their respective login nodes.
More information can be found in the
About web page and
"The MOSIX2 Management System for Linux Clusters and
Multi-clusters" white paper.
MOSIX® is a registered
trademark of Amnon Barak and Amnon Shiloh.
In a MOSIX cluster/multi-cluster there is no need to modify or to link applications with any library, copy files or login to remote nodes, or even assign processes to different nodes, including nodes in different clusters - it is all done automatically.
The outcome is ease of use, better utilization of resources and
near maximal performance.
Users can run their regular sequential and parallel applications as if they use one computer (node), while MOSIX automatically (and transparently) seek resources and migrate processes among nodes to improve the overall performance.
This is accomplished by on-line algorithms that monitor the state of the system-wide resources and the running processes, then, whenever appropriate, initiate process migration to:
Distributions are provided as RPMs for openSUSE, for use in native Linux
and as a pre-installed virtual-disk image that can be used to create
a MOSIX virtual cluster on Windows and/or Linux computers.
MOSIX version 2 (MOSIX2) for Linux-2.6 can manage x86-based clusters and multi-clusters.
MOSIX version 1 for Linux-2.4 can manage a single cluster.
Note that guest processes run in a sandbox, which prevents such
processes from accessing local resources in the hosting nodes.
|
MOSIX2 - conceptual
|
This means that:
MOSIX provides the following features to manage VOs:
However, individual nodes may have different number of processors (cores),
different speed, different memory size or I/O devices.
Linux processes are not affected by MOSIX - they run as they do on any Linux system and can not be migrated.
MOSIX processes run in an environment that allows them to migrate from one node to another.
Linux processes usually include administrative and other tasks that are not suitable for migration, whereas MOSIX processes are selected user-applications that are suitable and can benefit from migration.
Apart from process-migration that is available only to MOSIX processes, MOSIX includes batch mechanisms that can queue and assign new jobs to start on the best available nodes: these batch mechanisms are available for both Linux and MOSIX jobs.
MOSIX processes are invoked by the "mosrun" command. If you want to make use of the MOSIX batch mechanisms for Linux (non-migratable) processes, use the "mosrun -E" option.
This can be summarized in the following table:
| Process type | Migratable (MOSIX) | Non-Migratable (Linux) |
|---|---|---|
| Batch | mosrun -M [-b] | mosrun -E [-b] |
| Fully-interactive | mosrun [-b] | (do not use "mosrun") |
where the "-b" selects the best location to run it.
When a checkpoint is performed, the image of the processes is saved to a file. The process can later recover itself from that file and continue to run from that point.
For successful checkpoint and recovery, a process must not depend heavily on its Linux environment. For example, for security reasons processes with setuid/setgid privileges or processes with open pipes or sockets can't be checkpointed.
Checkpoints can be triggered by a program, by a manual request
and/or automatically - at regular time intervals, see the next question.
/proc/self/checkpoint
/proc/self/checkpointfile
/proc/self/checkpointlimit
/proc/self/checkpointinterval
These files are private to each process. They allow the process to
modify its checkpoint parameters and to trigger a checkpoint operation.
#include < stdlib.h>
#include < unistd.h>
#include < string.h>
#include < stdio.h>
#include < fcntl.h>
#include < sys/stat.h>
#include < sys/types.h>
// Setting the checkpoint file from withing the process
// This can also be done via the -C argument to mosrun
int setCheckpointFile(char *file) {
int fd;
fd = open("/proc/self/checkpointfile", 1|O_CREAT, file);
if (fd == -1) {
return 0;
}
return 1;
}
// Triggering a checkpoint from within the process
int triggerCheckpoint() {
int fd;
fd = open("/proc/self/checkpoint", 1|O_CREAT, 1);
if(fd == -1) {
fprintf(stderr, "Error doing self checkpoint \n");
return 0;
}
printf("Checkpoint was done successfuly\n");
return 1;
}
int main(int argc, char **argv) {
int j, unit, t;
char *checkpointFileName;
int checkpointUnit = 0;
if(argc < 3) {
fprintf(stderr, "Usage %s < checkpoint-file> < unit> \n", argv[0]);
exit(1);
}
checkpointFileName = strdup(argv[1]);
checkpointUnit = atoi(argv[2]);
if(checkpointUnit < 1 || checkpointUnit > 100) {
fprintf(stderr, "Checkpoint unit should be > 0 and < 100\n");
exit(1);
}
printf("Checkpoint file: %s\n", checkpointFileName);
printf("Checkpoint unit: %d\n", checkpointUnit);
// Setting the checkpoint file from within the process (can also be done using
// the -C argument of mosrun
if(!setCheckpointFile(checkpointFileName)) {
fprintf(stderr, "Error setting the checkpoint filename from within the process\n");
fprintf(stderr, "Make sure you are running this program via mosrun\n");
return 0;
}
// Main loop ... running for 100 units. checnge this loop if you wish
// the program to run do more loops
for( unit = 0; unit < 100 ; unit++ ) {
// Consuming some cpu time (simulating the run of the application)
// Change the number below to cause each loop to consume more (or) less time
for( t=0, j = 0; j < 1000000 * 500; j++ ) {
t = j+unit*2;
}
printf("Unit %d done\n", unit);
// Trigerring a checkpoint request from within the process
if(unit == checkpointUnit) {
if(!triggerCheckpoint())
return 0;
}
}
return 1;
}
To compile: gcc -o checkpoint_demo checkpoint_demo.c
To run: mosrun checkpoint_demo
A typical run:
> mosrun ./checkpoint_demo ccc 5
Checkpoint file: ccc
Checkpoint unit: 5
Unit 0 done
Unit 1 done
Unit 2 done
Unit 3 done
Unit 4 done
Unit 5 done
Checkpoint was done successfuly
Unit 6 done
Unit 7 done
Unit 8 done
^C
The program triggered a checkpoint after unit 5.
The checkpointed file was saved in ccc.1.
After unit 8 the program was killed.
To restart:
> mosrun -R ccc.1
Checkpoint was done successfuly
Unit 6 done
Unit 7 done
Unit 8 done
Unit 9 done
Unit 10 done
...
The program was restarted from the point right after it was checkpointed.
The queuing system includes tools for tracing queued jobs, setting
and changing their priorities or the order of execution, and for running
parallel jobs.
The number of jobs that can be placed in the queue is limited by the number of Linux processes (about 30000 for all users). To queue a larger number of jobs, there is an option to run multiple command-lines from a file, each with its own arguments. This option is commonly used to run the same program with many different sets of arguments. Another option allows to set an upper limit on the number of simultaneous jobs that are allowed to run. This option combines well with the queuing system which run jobs based on the availability of cluster/multi-cluster resources.
There is an argument to inform the queuing system that the job may
split into a number of parallel processes, so that more resources
are reserved for it. Another argument allows bundling for easy
identification of several instances of a job by a single job-ID.
Jobs can also be handled as a group and be killed collectively.
There are two types of batch jobs: Linux and MOSIX. Linux batch processes do not migrate, while MOSIX batch processes can migrate, but their home-node can be different than their dispatching node. MOSIX can assist both types by:
Batch jobs are started from binaries in another node and preserve only some of the caller's environment: they receive the environment variables; they can read from their standard-input and write to their standard output and error, but not from/to other open files; they receive signals, but if they fork, signals are delivered to the whole process-group rather than just the parent; they can not communicate with other processes on the calling node using pipes and sockets (other than standard input/output/error), semaphores, messages, etc. and can only receive signals, but not send them to processes on the calling node.
The main advantage of batch jobs is that
they save time by not needing to refer to the dispatching-node to perform
system-calls, and that temporary files can be created on the node where
they start, preventing the dispatching node from becoming a bottleneck.
This approach is therefore recommended for programs that
perform a significant amount of I/O.
MOSIX can run in a virtual machine in any platform that supports virtualization (including Windows).
The MOSIX web provides a free evaluation copy of MOSIX on a
pre-installed virtual-disk image
that can be used to create a MOSIX virtual cluster
on Linux and/or Windows computers.
Note that the total number of processors used by the VMs should not
exceed the number of physical processors.
Specifically:
Question:
Is it possible to install and run more than one VM with MOSIX on the same node
Answer:
Yes, this is especially useful on multi-core computers.
Question:
Can MOSIX run on an unmodified Linux kernel
Answer:
Yes, within a Virtual Machine.
Question:
Why migrate processes when one can move a whole VM with a process inside
Answer:
Mainly because it is expensive, both in terms of time and the required
memory, to create a VM for each process.
|
MOSIX Reach the Clouds (MRC)
|
MRC can run on both Linux nodes and MOSIX clusters.
MRC applications run in a hybrid environment, where some of their
files are on their launching (local) node and the rest are on
target (remote) nodes.
With a proper choice of directories, this allows to achieve:
If the target node is part of a MOSIX cluster, then MRC jobs can benefit
from all the MOSIX features. If a target node runs Linux (but not MOSIX),
then MRC jobs can only run there as native Linux jobs.
|
MOSIX2 - technical
|
If you use a common NFS root directory for your cluster, you can install MOSIX in that directory.
Otherwise, on a small cluster, you can install MOSIX node by node.
You must use only the official Linux kernel sources for your specific MOSIX distribution.
Note: do not use the kernel sources supplied with commercial Linux
distributions - they were modified and could cause the MOSIX patch
to fail.
When you use a standard Linux package (such as RedHat, SuSe, or Debian), your kernel (and/or kernel modules) would already be configured by that package, but when you compile your own kernel - as you do when installing MOSIX, you need to make sure that the kernel configuration suits your hardware and contains all the necessary device-drivers and file-systems that you are using.
One tool that often helps in constructing the correct kernel configuration is to use the output of "gzip -cd < /proc/config.gz", produced on the originally-supplied kernel, as a basis for the new configuration (but note that not every Linux distribution has "/proc/config.gz"). This output may not be totally accurate because it comes from a different (usually older) Linux kernel-version, but is a good place to start: place it in the file ".config" of the kernel-source directory, then adjust it by running "make menuconfig".
Another tip that may help to configure the kernel correctly, is that
unless you are a very experienced Linux system-administrator, you should
probably avoid the "initrd" hassles and configure all the drivers and
file-systems that you need in order to get the system to start within
the kernel itself rather than as kernel modules.
Once you modify configuration files, the changes will take effect
within a minute. After editing the list of nodes in your cluster
("/etc/mosix/mosix.map") you need to run "setpe", but if you are
using "mosconf" to modify the local configuration, then there is
no need to run "setpe".
mosrun -e awk 'BEGIN {for(i=0;i<100000;i++)for(j=0;j<100000;j++);}'
First you should see an increase of the load in one node. After a few seconds, if the process migration works you will see how the load is spread among the nodes.
If your nodes are not of the same speed then more processes
will run in the faster nodes.
UDP ports 249 - 250.
The "mosctl block" command prevents new remote processes from migrating
to that workstation.
The "mosctl expel &" move out MOSIX guest processes.
Note that an & is used after the expel command, since
expelling processes may take some time and we don't want the user login
process to hang. The processes are expelled while the user logs in.
This command allows remote processes to migrate to the workstation.
On a Debian system using GDM the appropriate file to add this command is /etc/gdm/PostSession/Default .
Note that when adding the mosctl commands to the GDM script you shouldn't
interfere with the correct work of gdb.
|
32-bit and 64-bit applications
|
|
Running applications
|
In MOSIX it is possible to run threaded applications as standard Linux processes. Such applications can not be migrated, but can still benefit from MOSIX features such as queuing and best initial-assignment.
To launch threaded applications use "mosrun -E".
> native {threaded_program} [program-args]...
Note that the amount and frequency of I/O is taken into account and weighted against other considerations in making such a decision.
The direct-communication (migratable socket) can reduce this slow-down
affect for I/O between communicating processes.
Otherwise, MOSIX is not different from Linux: depending on the particular needs of the process, whatever approach (other than shared-memory) that is best in Linux is best on MOSIX. It could be pipes, SYSV-messages, UNIX-sockets, TCP-sockets and files.
Obviously files can be slow when they usually require writing on
a physically-moving surface and/or networking. On the other hand,
Linux has very good caching mechanisms for local files.
Direct communication allows processes to exchange messages
directly between migrated processes, bypassing their home-nodes.
First, tune MATLAB to MOSIX by the following 3 steps:
the result should be :
#LD_ASSUME_KERNEL=2.4.1
#export LD_ASSUME_KERNEL
You can now run MATLAB jobs in a cluster/multi-cluster using mosrun.
Example: to run the following MATLAB test.m program:
a=randn(3000);
b=svd(a);
use:
> mosrun -e
The MOSIX version should be at least MOSIX-2.24.0.0 and jobs should be started by:
> mosrun -E -b -i matlab ....
MOSIX will assign each job to the best node in the local cluster.
Example: to run the following MATLAB test.m program:
a=randn(3000);
b=svd(a);
use:
> mosrun -E -b -i matlab -nojvm -nodesktop -nodisplay < test.m
A JAVA job should be started by:
> mosrun -E -b java job
MOSIX will assign each job to the best node in the local cluster.
MPI allocates processes to slave nodes of a cluster in a Round-Robin fashion, without checking the state of the resources, e.g. speed, current load and available memory.
Process migration can improve the performance by load-balancing, by
migration of processes from slower to faster nodes and to nodes with
sufficient free memory, as well as by migration of MPI processes to
nodes in remote clusters, which are not part of the user's cluster.
|
HUGI
|
Most clusters are private. They are made of production servers that belong to research groups in various departments. Four clusters are made of workstations in student labs.
Processes of users are allowed to migrate to idle workstations and among nodes in private clusters, subject to the priorities among the different groups. For example, since the workstations belong to the CS department, processes that are started in a CS private cluster has a higher priority to move to a workstation over already running processes from the Chemistry cluster.
Due to the increased computing demands by our researchers,
the amount of installed memory in the workstations was increased
(beyond the needs of the students), to allow large guest processes
from the private clusters to run in these workstations.
User-files and home directories are located on central NFS servers.