Parallel Computing

From NEST

Jump to: navigation, search

[ Software:Documentation ]

Contents

Introduction

As of version 2.0, NEST is capable of running simulations on multi-core, multi-processor machines and computer clusters using two ways of parallelization: thread-parallel simulation and distributed simulations. The first is implemented using POSIX threads, while the second is implemented on top of the Message Passing Interface (MPI). Using threads allows to take advantage of multi-core and multi-processor computers without the need for additional software. Using distributed computing has the additional benefit of allowing larger simulations, than would fit into the memory of a single machine. Additionally, the overall connect time and the runtime of the simulation can be reduced, as the network is built on multiple machines in parallel. The following paragraphs describe the facilities for parallel and distributed computing in detail.

See Plesser et al (2007) for more information on NEST parallelization and be sure to check the documentation on Random numbers in NEST.

Concepts and definitions

Basic scheme for counting threads (T), virtual processes (VP) and MPI processes (P) in NEST
Basic scheme for counting threads (T), virtual processes (VP) and MPI processes (P) in NEST

In order to ease the handling of node distribution with both thread and process based parallelization, we use the concept of local and remote threads, called virtual processes. A virtual process (VP) is a thread living in one of NEST's MPI processes. Virtual processes are distributed according to a modulo-based algorithm onto the MPI processes and are counted continuously over all processes. The concept is visualized in the figure to the right.

The status dictionary of each node contains three entries that are related to parallel computing:

  • local: boolean that indicates if a node exists on the local process or not
  • thread: the id of the local thread a node is assigned to
  • vp: the id of the virtual process a node is assigned to.

Node distribution

The distribution of nodes is based on the type of the node. Neurons are assigned automatically to one of the virtual processes again using a modulo-based algorithm. On all other virtual processes, a proxy node is allocated. The assignment is given by idVP = idNode mod NVP, where idVP is the id of the virtual process a neuron is assigned to, idNode is the global id of the node and NVP is the total number of virtual processes.

Illustration of node distribution. sg=spike_generator, iaf=iaf_neuron, sd=spike_detector. Numbers to the left and right indicate global ids.
Illustration of node distribution. sg=spike_generator, iaf=iaf_neuron, sd=spike_detector. Numbers to the left and right indicate global ids.

Devices for the stimulation and observation of the network are replicated on each thread in order to balance the load of the different threads and minimize the interaction between the different threads. Devices thus do not have proxies in remote virtual processes. For recording devices that are configured to record to a file (property to_file set to true), this means that each thread will write its own data file with the measurements of the neurons on its own threads. The files names are composed according to the following scheme

[model|label]-gid-vp.[dat|gdf]

The first part is the name of the model (e.g. voltmeter or spike_detector) or, if set, the label of the recording device. The second part is the global id (GID) of the recording device. The third part is the id of the virtual process the recorder is assigned to, counted from 0. The extension is gdf for spike files and dat for analog recordings. The label and file_extension of a recording device can be set like any other parameter of a node using SetStatus.

The default assignment of nodes to virtual processes is a good initial choice with respect to load balancing, as it corresponds to a random distribution for networks with random connectivity. However, for networks where the flow of activity is known, better distributions always exist. To allow a more fine-grained distribution, the model subnet has a boolean parameter children_on_same_vp, which forces all nodes in the subnet to be assigned to the same virtual process.

The node distribution for a small network consisting of spike_generator, four iaf_neurons, and a spike_detector in a scenario with two processes with two threads each is shown in the figure to the right.

Using multiple threads

Thread-parallelism is compiled into NEST by default and should work on all MacOS and Linux machines and without additional requirements. To use multiple threads for the simulation, you have to set the desired number of threads before you create any nodes or connections. The SLI command for this is

0 << /local_num_threads T >> SetStatus

In PyNEST, the number of threads is set using

SetKernelStatus({"local_num_threads": T})

Usually, a good choice for T is the number of processor cores available on your machine. However, recent observations showed that it can be advantageous to oversubscribe the cores with up to 2 threads per core yielding 20-30% improvement in simulation speed (extensive testing to determine the upper bound where oversubscribing is still beneficial has not been performed as yet). The reason for this is most likely heavy optimizations built into most of the current processors (e.g. simultaneous multithreading dubbed hyper-threading in Intel chips) and different kernel scheduling strategies used in extreme resource starvation scenarios.

Using distributed computing

Build requirements

To compile NEST for distributed computing, you need a library implementation of MPI on your system. If you are on a cluster, you most likely have this already. We currently support MPICH, OpenMPI and ScaliMPI. Several other implementations are known to work, but may require a little tweaking. Most Linux distributions provide packaged versions of MPICH and OpenMPI. Note, that in the case of a pre-packaged MPI library you will need both, the library and the development packages.

Please see the Installation instructions for general information on installing NEST.

Compilation

If the MPI library and header files are installed to the standard directories of your system, it is likely that a simple

$NEST_SOURCE_DIR/configure --with-mpi

will find them ($NEST_SOURCE_DIR is the directory holding the NEST sources). If your MPI installation resides in a non-standard location, you may use

$NEST_SOURCE_DIR/configure --with-mpi=/path/to/mpi

to specify the base directory, where the MPI bin, include and lib directories are found. This variant is more flexible and can also be used if MPI has been installed from sources. In some cases, you may need to specify the MPI compiler wrappers explicitly, e.g.,

$NEST_SOURCE_DIR/configure CC=mpicc CXX=mpicxx --with-mpi

For additional information concerning MPI on OSX, please see here.

Running distributed simulations

Distributed simulations cannot be run interactively, which means that the simulation has to be provided as a script. However, the script does not have to be changed compared to the script for serial simulation: inter-process communication and node distribution is managed transparently inside of NEST.

To run a simulation distributed on 128 processes, simply run

mpirun -np 128 nest script.sli

If using PyNEST, the command is

mpirun -np 128 python script.py

Note that PyNEST will only work together nicely with OpenMPI. For details on the usage of mpirun please refer to the documentation that came with your library.

MPI related commands

In MPI programs, each process has a unique identifier, its rank. Normally, you will not have to care about this number. However, in special cases it may be necessary to use it (e.g. opening files with a different names for each process). The rank of a process is available via the command

Rank

The number of MPI processes in a simulation can be obtained by

NumProcesses

To find out the name of the machine on each machine, use

MPIProcessorName

Reproducibility

To achieve the same simulation results even when using different parallelization strategies, you have to keep the number of virtual processes constant. A simulation with a specific number of virtual processes will always yield the same results, no matter how they are distributed over threads and processes, given that the seeds for the random number generators of the different virtual processes are the same (see Random numbers in NEST). In order to achieve a constant number of virtual processes, NEST provides the property total_num_virtual_procs.

The following listing contains a complete simulation in SLI with four neurons that are connected in a chain. The first neuron receives random input from a poisson_generator and the spikes of all four neurons are recorded to files.

0 << /total_num_virtual_procs 4 >> SetStatus
/poisson_generator << /rate 50000.0 >> Create /pg Set
/iaf_neuron 4 Create ;
/spike_detector << /to_file true >> Create /sd Set
pg 2 1000.0 1.0 Connect
2 3 1000.0 1.0 Connect
3 4 1000.0 1.0 Connect
4 5  1000.0 1.0 Connect
[ 2 3 4 5 ] sd ConvergentConnect
100.0 Simulate

Let's now assume that we saved the script above as simulation.sli, so we can run the simulation using mpirun and different numbers of processes:

mkdir 4vp_1p 4vp_2p 4vp_4p
cd 4vp_1p
mpirun -np 1 nest ../simulation.sli
cd ../4vp_2p
mpirun -np 2 nest ../simulation.sli
cd ../4vp_4p
mpirun -np 4 nest ../simulation.sli
cd ..
diff 4vp_1p 4vp_2p
diff 4vp_1p 4vp_4p

All variants of the experiment produce four data files (spike_detector-6-0.gdf, spike_detector-6-1.gdf, spike_detector-6-2.gdf, and spike_detector-6-3.gdf). Using diff on the three directories shows that they all contain the same spikes, which means that the simulation results are the same independently of the details of parallelization.

Using PyNEST instead of the SLI code above works all the same. The only differences are that simulation.sli in above description has to be replaced by simulation.py, and the python executable has to be used instead of the nest executable. The PyNEST code for the simulation is shown in the following listing:

from nest import *
SetKernelStatus({"total_num_virtual_procs": 4})
pg = Create("poisson_generator", params={"rate": 50000.0})
n = Create("iaf_neuron", 4)
sd = Create("spike_detector", params={"to_file": True})
Connect(pg, [n[0]], 1000.0, 1.0)
Connect([n[0]], [n[1]], 1000.0, 1.0)
Connect([n[1]], [n[2]], 1000.0, 1.0)
Connect([n[2]], [n[3]], 1000.0, 1.0)
ConvergentConnect(n, sd)
Simulate(100.0)

Running the experiment looks as shown in the following listing:

mkdir 4vp_1p 4vp_2p 4vp_4p
cd 4vp_1p
mpirun -np 1 python ../simulation.py
cd ../4vp_2p
mpirun -np 2 python ../simulation.py
cd ../4vp_4p
mpirun -np 4 python ../simulation.py
cd ..
diff 4vp_1p 4vp_2p
diff 4vp_1p 4vp_4p
Views
Personal tools