@hackage hdph0.0.1

Haskell distributed parallel Haskell

Haskell Distributed Parallel Haskell

Haskell distributed parallel Haskell (HdpH) is a Haskell DSL for parallel computation on distributed-memory architectures. HdpH is implemented entirely in Haskell but does make use of a few GHC extensions, most notably TemplateHaskell.

HdpH is described in some detail in the paper Implementing a High-level Distributed-Memory Parallel Haskell in Haskell. The paper on The Design of SymGridPar2 paints the bigger picture of what HdpH was designed for and where future developments will lead.

This release is considered alpha stage.

Building HdpH

Should be straightforward from the cabalised package hdph. Note that hdph depends on hdph-closure, an independent package that factors out the closure representation of HdpH.

Running HdpH

Launch via ssh

HdpH comes with a few sample applications. The simplest one, a distributed Hello World, is good for testing whether HdpH works. For example, the following Bourne shell command will run Hello World on three nodes, distributed over two hosts, bwlf01 and bwlf02.

$> for host in bwlf01 bwlf02 bwlf01; do ssh $host hdph/dist/build/hello/hello -numProcs=3 & done

This will launch three instances of the Hello World binary hello (in directory hdph/dist/build/hello relative to the user's home directory), two on bwlf01 and one on bwlf02. Note that the obligatory option -numProcs must match the number of instances; HdpH will likely hang otherwise.

Assuming that ssh has been set up to enable password-less logins, the output should be something like this:

Master Node:137.195.143.101:19754:0 wants to know: Who is here?
Hello from Node:137.195.143.101:19754:0
Hello from Node:137.195.143.102:38939:0
Hello from Node:137.195.143.101:30062:0

Launch via mpiexec

Shell scripts and ssh are quite a cumbersome way of launching HdpH. More convenient are dedicated parallel job launchers such as mpiexec, which comes with any recent MPI distribution. Even though this version of HdpH does not use MPI, it can still be launched by mpiexec. An MPICH installation, for example, might launch the Hello World application with the following command:

$> mpiexec -hosts bwlf01,bwlf02,bwlf01 ./hello -numProcs=3

This will launch the Hello World binary hello (in the current working directory) 3 times, twice on bwlf01 and once on bwlf02. The expected output should look somewhat like this:

Hello from Node:137.195.143.102:12701:0
Hello from Node:137.195.143.101:15353:0
Master Node:137.195.143.101:14247:0 wants to know: Who is here?
Hello from Node:137.195.143.101:14247:0

Parallel launch via mpiexec

To test parallelism, there is a sample application computing the nth number of the Fibonacci sequence using a naive parallel divide-and-conquer algorithm. For example, the following command will compute the 45th Fibonacci number in parallel on 3 nodes, switching to sequential execution for subproblems below a threshold of 30; a wall clock runtime of about 16 seconds is reported here.

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 v2 45 30
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=16.157814s}

When passed the switch -d1 HdpH will produce some summary output per node relating to parallelism, eg. number of sparks generated per node, maximum number of sparks residing on a node, number of times a node requested work, and number of times work was scheduled in response. An example run with -d1 might look like this:

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -d1 v2 45 30
137.195.143.101:9759:0 #SPARK=1017   max_SPARK=154   max_THREAD=[3,1]
137.195.143.101:9759:0 #FISH_sent=1   #SCHED_rcvd=0
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=16.449183s}
137.195.143.102:12551:0 #SPARK=308   max_SPARK=4   max_THREAD=[1,1]
137.195.143.102:12551:0 #FISH_sent=215   #SCHED_rcvd=214
137.195.143.103:27402:0 #SPARK=271   max_SPARK=4   max_THREAD=[0,1]
137.195.143.103:27402:0 #FISH_sent=225   #SCHED_rcvd=224

This shows that bwlf01 (IP address 137.195.143.101) was the master node, generating 1017 sparks in total, requesting work once (probably at the end of the computation) and not receiving any. The other nodes did generate some sparks themselves (308 and 271, respectively) but they also both received a substantial number of sparks (214 and 224, respectively). Most importantly, each node requested work exactly once more often than receiving it, which means no node was ever idling, except probably at the very end of the computation.

Parallel launch via mpiexec on homogeneous multicores

So far HdpH ran in a single thread on each node. To make use of multicores mpiexec can place several single-threaded nodes on the same host, eg. the following examples will launch two and four nodes per host, respectively; note that the number following the -n switch of mpiexec must match the number following the -numProcs switch of fib.

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 6 ./fib -numProcs=6 v2 45 30
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=8.026417s}

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 12 ./fib -numProcs=12 v2 45 30
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=4.845796s}

The other possibility is to run HdpH itself in a multi-threaded mode. The following two examples run HdpH on two and four threads per node, respectively, expecting that the OS will bind each thread to a core.

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=2 v2 45 30 +RTS -N2
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=10.648305s}

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=4 v2 45 30 +RTS -N4
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=6.507969s}

Note that the GHC RTS switch -N determines the number of threads (or HECs in GHC terminology) whereas the fib switch -scheds determines how many of these threads run the HdpH scheduler loop. Whether there should be as many schedulers as threads or less depends on the GHC version and probably on the OS. Using one scheduler less than the number of threads has been observed to reduce variability, and sometimes even bring performance gains:

$>  mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=3 v2 45 30 +RTS -N4
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=5.831932s}

Parallel launch via mpiexec on heterogeneous multicores

So far all hosts were assumed to have the same number of cores. However, mpiexec can be used to launch on heterogeneous clusters. For instance, the following call will launch HdpH with two threads each on bwlf01 and bwlf02, and with four threads on bwlf03; please see the man page of mpiexec for an explanation of the command line format.

$> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 2 ./fib -numProcs=3 -scheds=2 -d1 v2 45 30 +RTS -N2 : -n 1 ./fib -numProcs=3 -scheds=4 -d1 v2 45 30 +RTS -N4
137.195.143.101:21394:0 #SPARK=1114   max_SPARK=183   max_THREAD=[3,1,1]
137.195.143.101:21394:0 #FISH_sent=3   #SCHED_rcvd=1
137.195.143.103:22027:0 #SPARK=261   max_SPARK=2   max_THREAD=[0,1,1,1,1]
137.195.143.103:22027:0 #FISH_sent=209   #SCHED_rcvd=208
137.195.143.102:22374:0 #SPARK=221   max_SPARK=3   max_THREAD=[1,1,1]
137.195.143.102:22374:0 #FISH_sent=173   #SCHED_rcvd=172
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=10.502205s}

The switch -d1 reveals that bwlf03 (IP address 137.195.143.103) did indeed run 4 schedulers because its max_THREAD list was length 5, and the length of the max_THREAD list is always 1 plus the number of schedulers.

Launch caveats

At startup HdpH nodes open random ports and rely on UDP multicasts to discover each other. This results in a number of limitations:

  • UDP multicasts must be routed between all nodes (which may imply that all hosts must reside in the same subnet).

  • The obligatory -numProcs switch, which tells an application how many nodes to expect, must match exactly the number of nodes launched.

  • Node discovery must be completed within a certain time frame (typically 10 seconds).

  • No second HdpH application may be launched during the node discovery phase.

Any deviation from the above limitations is likely to cause HdpH to hang forever.

References

  1. Patrick Maier, Phil Trinder. Implementing a High-level Distributed-Memory Parallel Haskell in Haskell. Proc. 2011 Symposium on Implementation and Application of Functional Languages (IFL 2011), Springer LNCS 7257, pp. 35-50.

  2. Patrick Maier, Rob Stewart, Phil Trinder. Reliable Scalable Symbolic Computation: The Design of SymGridPar2. Proc. 28th ACM Symposium On Applied Computing (SAC 2013), pp. 1677-1684.

  3. HdpH development repository on github.