SYMMIC Users Manual CapeSym

mpiexec xSYMMIC

As mentioned in the Temperature Computations section, OpenMP provides shared memory parallelism on a single computer, and MPI provides distributed memory parallelism that can execute over many computers. MPI parallelism is only used when xSYMMIC is preceded by mpiexec on the command line, as demonstrated in this section.

By default, OpenMP parallelism is fully utilized by the xSYMMIC command line, as follows.

> xSYMMIC FET.xml

Starting xSYMMIC with 4 OpenMP threads...

To add the MPI parallelism and distribute the computations across multiple computers, the xSYMMIC command line must be invoked through the mpiexec launcher from an MPI library. xSYMMIC is compiled and tested with the Intel MPI Library. For differences with other libraries, see the discussion towards the end of the Remote Run section. The Remote Run and Remote Jobs dialogs use the same command line for MPI computation as described here for execution on a local cluster.

Note: Both xSYMMIC and the Intel MPI Library should already be installed, and environment variables should be configured so that these commands can be used at the command prompt without giving their full paths. In these examples, the > symbol indicates a Windows command prompt.

> mpiexec -n 1 xSYMMIC FET.xml

Starting xSYMMIC with 4 OpenMP threads...

This is the minimal command line for testing MPI with xSYMMIC. The mpiexec command always comes first, followed by the mpiexec flags (-n, -ppn, -hostfile, etc.), then the xSYMMIC executable, the template file, and finally the xSYMMIC command line options, if any. This example should produce exactly the same execution and parallelism as the non-MPI command line above, because the -n 1 option specifies a single MPI process, in other words, no additional MPI parallelism.

Since most CPUs are hyper-threaded, with two threads per core, the number of physical cores is usually half the number of (logical) processors or CPUs shown in a task manager or system monitor application. If the Intel MPI Library is installed on the system, the cpuinfo utility may be used at the command prompt to view the hyper-thread and physical core information.

> cpuinfo

=====  Processor composition  =====
Processor name    : Intel(R) Xeon(R)  E5-1620 0
Packages(sockets) : 1
Cores             : 4
Processors(CPUs)  : 8
Cores per package : 4
Threads per core  : 2
=====  Processor identification  =====
Processor       Thread Id.      Core Id.        Package Id.
0               0               0               0
1               1               0               0
2               0               1               0
3               1               1               0
4               0               2               0
5               1               2               0
6               0               3               0
7               1               3               0
=====  Placement on packages  =====
Package Id.     Core Id.        Processors
0               0,1,2,3         (0,1)(2,3)(4,5)(6,7)
=====  Cache sharing  =====
Cache   Size            Processors
L1      32  KB          (0,1)(2,3)(4,5)(6,7)
L2      256 KB          (0,1)(2,3)(4,5)(6,7)
L3      10  MB          (0,1,2,3,4,5,6,7)
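The arithmetic behind this output can be sketched in a few lines of Python (a hypothetical helper, not part of xSYMMIC or the MPI library): the physical core count is the logical CPU count divided by the threads-per-core factor.

```python
# Sketch: recover the physical core count from cpuinfo-style figures,
# assuming a uniform Hyper-Threading factor across all cores.
def physical_cores(logical_cpus: int, threads_per_core: int) -> int:
    return logical_cpus // threads_per_core

# The Xeon E5-1620 above: 8 logical CPUs at 2 threads per core.
print(physical_cores(8, 2))  # -> 4 physical cores
```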

To run using two MPI parallel processes, increase the -n flag to 2.

> mpiexec -n 2 xSYMMIC FET.xml

Starting xSYMMIC with 2 MPI processes and 2 OpenMP threads each...

Since the command line did not list any hosts, the logical CPUs of the local host are divided up between the two MPI processes running on the local machine. To see the distribution of logical CPUs to processes on the local machine, use the -env flag to set the I_MPI_DEBUG environment variable during the run.

> mpiexec -n 2 -env I_MPI_DEBUG=4 xSYMMIC FET.xml
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[1] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       4200     Cape29     {0,1,2,3}
[0] MPI startup(): 1       7868     Cape29     {4,5,6,7}

Starting xSYMMIC with 2 MPI processes and 2 OpenMP threads each...

The last two lines of debug information report that two processes (rank=0 and rank=1) are being used, both on the same machine (Cape29). The rank 0 process has access to four logical CPUs (0,1,2,3) while rank 1 has access to the four other logical CPUs (4,5,6,7). As reported by cpuinfo, logical CPUs 0 and 1 reside on physical core 0, logical CPUs 2 and 3 on core 1, logical CPUs 4 and 5 on core 2, and logical CPUs 6 and 7 on core 3. There are two logical CPUs per physical core on this machine because each Intel core has Hyper-Threading technology. xSYMMIC automatically chooses the correct total number of threads to use such that one and only one thread resides on each physical core (e.g. four in the example above). If one thread per logical CPU is prescribed then the performance will be reduced.
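The pinning in the debug output above can be mimicked with a short Python sketch (hypothetical code, not the actual xSYMMIC logic): the logical CPUs of the machine are split evenly across the MPI ranks, and each rank then runs one OpenMP thread per physical core in its share.

```python
# Sketch: divide logical CPUs across MPI ranks, then run one OpenMP
# thread per physical core within each rank's share (assumes a uniform
# Hyper-Threading factor of 2 threads per core).
def pin_ranks(logical_cpus, n_ranks, threads_per_core=2):
    per_rank = len(logical_cpus) // n_ranks
    pins = [logical_cpus[r * per_rank:(r + 1) * per_rank]
            for r in range(n_ranks)]
    omp_threads = per_rank // threads_per_core
    return pins, omp_threads

pins, omp = pin_ranks(list(range(8)), n_ranks=2)
print(pins)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(omp)   # 2 OpenMP threads per rank
```

This reproduces the example above: two ranks pinned to {0,1,2,3} and {4,5,6,7}, with 2 OpenMP threads each, so exactly one thread lands on each of the four physical cores.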

To run on a cluster, the machines of the cluster all need to have xSYMMIC and the MPI library installed, and the template(s) to be solved should reside on a shared network file system. Define a host file or machine file that lists the names of the machines in the cluster. (On Linux the host names are defined in the /etc/hosts file.) For example, a hosts file containing two machines might look like:

$ more hosts
HPCL8
HPCL7

Here, the dollar sign ($) signifies the Linux command prompt. Use this hosts file as follows.

$ mpiexec -n 2 -hostfile hosts -env I_MPI_DEBUG=4 xSYMMIC FET.xml
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank  Pid  Node name       Pin cpu
[0] MPI startup(): 0    30223   HPCL8 {0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23}
[0] MPI startup(): 1    30224   HPCL8 {8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31}

Starting xSYMMIC with 2 MPI processes and 8 OpenMP threads each...

Although the hosts file lists multiple machines, process pinning was left up to the MPI library, which chose to put both processes on the first machine in the file. Rank 0 is assigned 8 of the 16 physical cores on HPCL8, while rank 1 is assigned the other 8 cores. The number of processes per node may be specified with the -ppn flag, as follows.

$ mpiexec -n 2 -ppn 1 -hostfile hosts -env I_MPI_DEBUG=4 xSYMMIC FET.xml
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): tcp data transfer mode
[1] MPI startup(): tcp data transfer mode
[0] MPI startup(): Rank  Pid  Node name       Pin cpu
[0] MPI startup(): 0   30777   HPCL8    {0,1,2,3,4,5,6,7,8,9,10,11,12,13,...,30,31}
[0] MPI startup(): 1   26715   HPCL7    {0,1,2,3,4,5,6,7,8,9,10,11,12,13,...,30,31}

Starting xSYMMIC with 2 MPI processes and 16 OpenMP threads each...

Now the two ranks are divided between two machines, with 16 physical cores (32 hyper-threads) per process. The same placement can also be achieved with a machine file, in which each node name may be augmented with the desired number of processes per node.

$ more machines
HPCL8:1
HPCL7:1

$ mpiexec -n 2 -machinefile machines -print-rank-map xSYMMIC FET.xml
(HPCL8:0)
(HPCL7:1)

Starting xSYMMIC with 2 MPI processes and 16 OpenMP threads each...

In this last example, the -print-rank-map flag (instead of the I_MPI_DEBUG environment variable) is used to display the process pinning. Rank 0 is assigned to machine HPCL8, while rank 1 is assigned to HPCL7. MPI will assign all of the available hyper-threads on the machine to the process.
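The block assignment produced by -ppn or a HOST:ppn machine file can be sketched as follows (a hypothetical helper mirroring the -print-rank-map output, not part of mpiexec):

```python
# Sketch: assign MPI ranks to hosts in blocks of ppn (processes per
# node), cycling through the host list in order.
def rank_map(hosts, ppn, n_ranks):
    mapping = {h: [] for h in hosts}
    for rank in range(n_ranks):
        mapping[hosts[(rank // ppn) % len(hosts)]].append(rank)
    return mapping

print(rank_map(["HPCL8", "HPCL7"], ppn=1, n_ranks=2))
# {'HPCL8': [0], 'HPCL7': [1]}
```

With ppn=1 each host gets one rank before the list wraps, matching the rank map printed above; a larger ppn fills each host with a block of consecutive ranks first.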

Level 2 Superposition

As described in the Parallel Computations section, Level 2 superposition solves a layout in parallel by giving each core a separate part to solve independently, whereas Level 1 superposition solves each part in sequence by dividing each solve up over all of the cores (i.e. all of the cores work together to solve each part of the layout).
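A rough timing model (an illustration only, not xSYMMIC internals) shows why level 2 can win: the independent per-core solves of level 2 scale with no parallel overhead, while a direct solver rarely achieves perfect efficiency when all cores cooperate on one solve. The 60% solver efficiency below is an assumed figure for illustration.

```python
import math

# Rough model: P parts, C cores, each part takes t seconds on one core.
def level1_time(parts, cores, t_one, solver_eff=0.6):
    # Parts solved in sequence, all cores cooperating on each solve;
    # solver_eff (assumed) models the solver's imperfect scaling.
    return parts * t_one / (cores * solver_eff)

def level2_time(parts, cores, t_one):
    # Parts solved independently, one core each, in cores-wide waves.
    return math.ceil(parts / cores) * t_one

# 32 parts on 4 cores: level 2 finishes in 8 waves of solves.
print(level2_time(32, 4, 10.0))  # 80.0
print(level1_time(32, 4, 10.0))  # ~133.3 at 60% solver efficiency
```

With perfect solver scaling (solver_eff=1.0) the two levels would tie; any efficiency loss in the cooperative solve tilts the balance toward level 2, consistent with the timings reported below.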

Here is a simple test of level 2 superposition that can be performed on any desktop with at least four cores and 8 GB of RAM. Open the mesaResistor.xml template in SYMMIC and use Create layout... from the File menu to make an array of 16 mesa resistors.

In the Device Layout Table dialog that follows, set the Length and Width of the MMIC to 12 mm (12000). Save the layout to a file named mesaResistor_4x4layout.xml. Although the solution mesh for the layout will contain almost two million temperature points, the individual solutions of the superposition are much smaller and should require less than 0.5 GB of RAM each. Thus, it is reasonable to run 4 MPI processes in parallel on a machine with 8 GB of RAM for level 2 superposition of this problem.

Level 2 superposition is requested by the -s flag on the command line with mpiexec and xSYMMIC. To run the layout over 4 cores on the local machine, the command would be as follows.

> mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=2

Starting xSYMMIC with 4 MPI processes...
:
Parallel Direct Solver for Steady State
:
Total run time was 237.262 seconds.

The solver being used is announced during the run as "Parallel Direct Solver", which indicates PARDISO. Level 1 superposition would have a total run time of about 369 seconds on the same machine, since it must perform all 32 solves sequentially, one at a time, instead of in parallel. Furthermore, the speed-up provided by the added MPI parallelism is partly offset by the loss of OpenMP threads available to the PARDISO solver in each process, because the machine has only four physical cores.

Moving to a Linux machine with 16 cores, we get the following result.

$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=2

Starting xSYMMIC with 4 MPI processes and 4 OpenMP threads each...
:
Parallel Direct Solver for Steady State
:
Total run time was 56.044 seconds.

Solving the same problem on the same cores with level 1 superposition takes longer using the built-in PCG solver distributed over 4 MPI processes.

$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=1

Starting xSYMMIC with 4 MPI processes and 4 OpenMP threads each...
:
Solving part 1 of 1...

Cluster PCG Solver for Steady State
:
Total run time was 78.721 seconds.

As predicted from the Choosing Parallel Computing Methods section, the best performance for level 1 superposition is realized by using all of the cores for the OpenMP parallelism in the direct solver rather than using the hybrid approach and the PCG solver.

$ xSYMMIC mesaResistor_4x4layout.xml

Starting xSYMMIC with 16 OpenMP threads...
:
Parallel Direct Solver for Steady State
:
Total run time was 63.194 seconds.

Accessing the PETSc Solver

An iterative solution method may be substituted for the direct PARDISO solver for single computer simulations. For Linux, the iterative PETSc solver is available by adding the -usePETSc flag to the command line.

$ xSYMMIC GaNSi_FET5million.xml -usePETSc

Starting xSYMMIC with PETSc and 16 OpenMP threads...
:
Solving part 1 of 1...

Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 462.499 seconds.

On Windows, the -usePETSc flag has no effect. Instead, a message will appear saying that PETSc is not available:

> xSYMMIC mesaResistor.xml -usePETSc
The PETSc library and solver are not available. Using Pardiso.
Starting xSYMMIC with 4 OpenMP threads...

:
Total run time was 0.421 seconds.

When PETSc is not available, the computation will revert to the default non-PETSc solver. This is usually the PARDISO solver except when the mpiexec command is used, in which case the built-in PCG solver will be used.
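The fallback rules just described, together with the superposition examples earlier in this section, can be summarized in a small decision sketch (hypothetical code, not the actual implementation):

```python
# Sketch of solver selection as described in this section:
# -usePETSc is honored only where the PETSc library exists (Linux);
# otherwise a level-1 solve distributed over MPI uses the built-in
# Cluster PCG solver, and everything else defaults to PARDISO
# (including level-2 runs, where each process solves independently).
def solver_for_run(use_petsc, petsc_available, under_mpiexec,
                   superposition_level=1):
    if use_petsc and petsc_available:
        return "PETSc"
    if under_mpiexec and superposition_level == 1:
        return "PCG"
    return "PARDISO"

print(solver_for_run(True, False, False))  # Windows: falls back to PARDISO
print(solver_for_run(False, True, True))   # mpiexec, level 1: PCG
```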

PETSc is most advantageous for cluster computing, where the mpiexec command is used to distribute the problem over multiple cores on multiple machines. For example, creating a machine file that assigns up to 8 processes to each host results in much better performance than running PETSc as a single process on a single host.

$ more machines
HPCL8:8
HPCL7:8
HPCL5:8
HPCL3:8

$ mpiexec -n 32 -machinefile machines -print-rank-map xSYMMIC GaNSi_FET5million.xml -usePETSc
(HPCL8:0,1,2,3,4,5,6,7)
(HPCL7:8,9,10,11,12,13,14,15)
(HPCL5:16,17,18,19,20,21,22,23)
(HPCL3:24,25,26,27,28,29,30,31)

Starting xSYMMIC with PETSc and 32 MPI processes and 2 OpenMP threads each...
:
Solving part 1 of 1...

Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 47.319 seconds.

On Linux, PETSc MPI parallelism can even be combined with superposition. The earlier level 1 superposition example is repeated here to allow direct comparison.

$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=1 -usePETSc

Starting xSYMMIC with 4 MPI processes and 4 OpenMP threads each...
:
Solving part 1 of 32...

Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 71.726 seconds.

Using PETSc with level 2 superposition is ideal for large problems, when memory is insufficient to use PARDISO.

$ mpiexec -n 4 xSYMMIC mesaResistor_4x4layout.xml -s=2 -usePETSc

Starting xSYMMIC with 4 MPI processes and 4 OpenMP threads each...
:

Parallel Iterative PETSc Solver (ConjugateGradient-ILU) for Steady State
:
Total run time was 50.224 seconds.

© Copyright 2007-2019 CapeSym, Inc. | 6 Huron Dr. Suite 1B, Natick, MA 01760, USA | +1 (508) 653-7100