SYMMIC Users Manual CapeSym

Choosing Parallel Computing Methods

SYMMIC's cloud computing and command line facilities offer user-configurable parallel computations. This section provides a quick guide for choosing the best configuration in some common circumstances.

Choosing the Best Solver

Table 1. How to Access the Best Solver by Platform


The PARDISO solver (the default) is recommended for solving most problems on a Windows desktop. For cluster computing, use the PETSc solver. The built-in PCG solver has the smallest random-access memory requirement per process. (For detailed descriptions of the linear solvers, see the previous section.)

The PCG and PETSc solvers require xSYMMIC to be run through the MPI executor mpiexec. To use the PETSc solver on Linux, add the -usePETSc option to the end of the xSYMMIC command line.

	> mpiexec -n 2 xSYMMIC FET.xml -usePETSc
	Starting xSYMMIC with PETSc and 2 MPI processes and 2 OpenMP threads each...
	:

PARDISO is usually the fastest solver, but it requires a large amount of memory. Use the formulas in the following table to estimate the amount of physical RAM required for each solver. In these formulas, x is the problem size in millions of nodes, where a node is a unique mesh vertex as reported to the console during meshing. The variable n is the number of parallel processes or threads involved in the computation. For PARDISO, the number of threads is set equal to the number of physical cores by default. For the iterative solvers, the user specifies the number of MPI processes with the -n option on the mpiexec command line.

Table 2. Estimated Solver Memory Requirement

Linear Solver                  *Peak RAM Usage (GB)

PARDISO (in-core)

PARDISO (out-of-core)

PCG (default under mpiexec)

PETSc (similar PCG algorithm)

†Will increase faster than linear with large n. *Lower bounds for large problems (> 1 million nodes).

Workstation Example: The X-Band Amplifier in Chapter 4 is exported without simplification to create a device template with 17.4 million unique nodes. This problem size is reported in the console when the Show Mesh button is used in the Model Check dialog.

	Mesh Generation
	...30660418 nodes...16499124 elements...17393996 unique nodes
	Meshing took 2.371 s

Applying the first formula of Table 2 to the millions of unique nodes yields

The desktop workstation has only 4 cores and 32 GB of RAM, so in-core PARDISO will clearly fail due to insufficient memory. The PARDISO out-of-core solver is not a viable option either.

Windows Task Manager shows only 27 GB available, but choosing the solver with the lowest memory requirement (PCG) with two MPI processes (n=2) will likely consume less than this amount.

The problem is therefore successfully solved from the command prompt by

	> mpiexec -n 2 xSYMMIC XbandAmp_export.xml
	Starting xSYMMIC with 2 MPI processes and 2 OpenMP threads each...
	:
	Cluster PCG Solver for Steady State
	:
	Writing solution file XbandAmp_export.rst...

	Total run time was 5016.84 seconds.

Cluster Example: The exported X-Band Amplifier device template with 17.4 million unique nodes is to be solved on a cluster of 8 machines, each machine having 16 cores and 32 GB of RAM. How many parallel processes should be used, and how should these MPI processes be distributed to the machines in the cluster?

Applying the PCG formula of Table 2 yields

The entire cluster has only 8×32 = 256 GB of RAM, so using 64 or more processes (ranks) is not possible for this problem. For 32 processes, we next calculate the per-process memory requirement.

Always divide by n+1 because process 0 will need twice as much memory as the other ranks for the built-in PCG solver. (For PETSc, the additional rank 0 overhead is less than 50%, so use (n+0.5) instead.)
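
The division rule above can be sketched as a small helper. This is an illustrative function, not part of xSYMMIC; it assumes the total peak requirement from Table 2 has already been computed:

```python
def per_process_gb(total_gb, n, solver="pcg"):
    """Estimate per-rank memory from a total peak requirement (GB).

    Rank 0 needs about twice the memory of the other ranks for the
    built-in PCG solver, so the total is divided by n + 1; for PETSc
    the rank 0 overhead is under 50%, so divide by n + 0.5 instead.
    """
    divisor = n + 1 if solver == "pcg" else n + 0.5
    return total_gb / divisor

# For example, a 99 GB total spread over 32 PCG ranks:
print(per_process_gb(99, 32))  # prints 3.0 (so rank 0 needs ~6 GB)
```

Note that the machine hosting rank 0 must accommodate the doubled (PCG) or 1.5× (PETSc) requirement of that rank.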

Now we need to assign processes to machines, keeping the 32 GB per machine limit in mind.

So there must be no more than 6 processes per machine, and the machine running rank 0 will be able to handle only 5 processes. We need only 32 processes total over 8 machines, and 4 processes per machine is clearly within the memory capacity of the cluster. Therefore, this problem is successfully solved with:

  > mpiexec -print-rank-map -hostfile hosts -n 32 -ppn 4 xSYMMIC XbandAmp_export.xml
  (HPCL8:0,1,2,3)
  (HPCL7:4,5,6,7)
  (HPCL6:8,9,10,11)
  (HPCL5:12,13,14,15)
  (HPCL4:16,17,18,19)
  (HPCL3:20,21,22,23)
  (HPCL2:24,25,26,27)
  (HPCL1:28,29,30,31)
  :
  Starting xSYMMIC with 32 MPI processes and 4 OpenMP threads each...
  :
  Cluster PCG Solver for Steady State
  :
  Writing solution file XbandAmp_export.rst...

  Total run time was 955.47 seconds.

Choosing Parallel Methods for Layouts

The following table suggests the best number of MPI processes and OpenMP threads for solving layout templates. Layouts of multiple devices will be solved using superposition. Level 1 superposition solves each device one-at-a-time in series, while level 2 superposition solves many devices at the same time using MPI parallel processes. Choose level 2 for solving layouts with large numbers of relatively small devices.
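
The level 1 scheme can be sketched in a few lines of Python; solve_device here is a hypothetical stand-in for a single-device thermal solve, since the actual solver is internal to xSYMMIC:

```python
import numpy as np

def superpose_level1(solve_device, devices, mesh_size):
    """Level 1 superposition: solve each device one-at-a-time in
    series and sum the temperature-rise fields on the shared mesh."""
    total = np.zeros(mesh_size)
    for dev in devices:
        total += solve_device(dev)  # one full solve per device
    return total

# With level 2, the same loop is instead split across MPI ranks,
# each rank solving a subset of the devices before the partial
# sums are combined.
```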

Table 3. What Kind of Parallel Computing Is Best?

Template Type              Device Size  OMP Threads  MPI Ranks*  Superposition Level  Preferred Platform/Solver

Single device              small        n            1           (n/a)                workstation/PARDISO
                           large        1            n           (n/a)                Linux cluster/PETSc
Layout with few devices    small        n            1           1                    workstation/PARDISO
                           large        1            n           1                    Linux cluster/PETSc
Large layout of d devices  small        1            min(d,n)    2                    cluster/PARDISO

n is the number of physical cores in the workstation or cluster.
*or maximum allowed by the memory requirements of the solver and solution mesh.

Note: Layouts consisting of several large devices are unlikely to benefit much from level 2 parallelism. Multiple cores are probably better used to speed up individual device solutions, either by multi-threading during PARDISO or by using the PCG or PETSc parallel solvers with level 1 superposition.

For level 2 superposition, one should set the number of MPI processes to the total number of physical cores available in the cluster. However, the number of processes must not be more than the number of devices in the layout. Also beware of the memory expense of the MPI approach. Each process (rank) must fully duplicate the memory for problem setup as well as the memory required to hold the solution mesh. If several ranks are placed on the same machine, the sum of the memory required for each parallel computation must fit within the physical RAM of the machine. For the case of a large layout containing a very large number of small devices (Table 4), this constraint will depend on the number of nodes in the layout's solution mesh (as displayed in the console by Show Mesh).

Table 4. Non-Solver Memory Required for Level 2 Superposition

Template Type

Millions of Nodes in Layout Mesh

MPI Ranks

*RAM Usage (GB)

Large layout with lots of small devices

n

*b is memory (GB) required for problem setup of the layout before meshing.

Example: Suppose a Linux workstation with 30 GB of available (free) RAM and 16 cores is to solve a layout containing 5,034 small devices. How many parallel processes should be used for level 2 superposition?

Each of the n solutions running in parallel must consume less than 30/n GB so that together they do not exhaust the available RAM. Each process will use memory equal to the sum of the non-solver (Table 4) and solver (Table 2) memory requirements. For the PARDISO solver, this total is

where x1 is the problem size, in millions of nodes, of one part of the superposition. To estimate x1 using the SYMMIC GUI, delete all but one of the devices from the layout and then use the Model Check dialog to Show Mesh. (If the devices are not all the same size, choose the largest device in terms of the number of mesh nodes for worst-case memory usage.) The size of the mesh to be solved will be reported in the console.

	Mesh Generation
	...194319 nodes...134385 elements...143420 unique nodes
	Meshing took 0.016 s

To estimate b, close the layout template file and note the memory usage shown on the SYMMIC statusbar when no template is open. Then re-open the full layout template and note the change in the statusbar memory usage after the template has finished loading. This increase in memory is the amount to use for b.

To get the millions of nodes x2 in the layout solution mesh, use the Show Mesh button on the whole layout, noting the number of unique nodes reported in the console.

	Mesh Generation for Layout
	...34624876 nodes...28934836 elements...34057708 unique nodes
	Meshing layout took 25.662 s

The allowable number of processes is determined by inverting the above formula for total RAM required.
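
The inversion can be sketched as a search over n. Here per_process_gb is a hypothetical callable standing in for the combined Table 2 and Table 4 formulas (per-rank GB as a function of the rank count), since those formulas are given in the tables above:

```python
def max_ranks(available_gb, per_process_gb, limit=1024):
    """Return the largest n such that n ranks, each using
    per_process_gb(n) GB, still fit within available_gb."""
    best = 0
    for n in range(1, limit + 1):
        if n * per_process_gb(n) <= available_gb:
            best = n
    return best

# e.g. if each rank needs a flat 11 GB, a 30 GB machine supports:
print(max_ranks(30, lambda n: 11.0))  # prints 2
```

The exhaustive search (rather than a closed-form inverse) also handles the case where per-rank memory is not monotone in n.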

Because of the non-solver memory requirements, no more than two parallel processes may be used with level 2 superposition on this machine. This solution is computed by the following command.

    $ mpiexec -n 2 ./xSYMMIC layout5034.xml -s=2
    Starting xSYMMIC with 2 MPI processes and 8 OpenMP threads each...

    Running L2 superposition of layout...

    Mesh Generation for Layout
    ...34624876 nodes...28934836 elements...34057708 unique nodes
    Meshing layout took 25.104 s
    :
    Parallel Direct Solver for Steady State
    :
    Superposition onto 34057708 nodes took 14.412 s.

    Summing solutions after iteration 2
    Writing solution file /home/Matt/nfs/layout5034.rst...

    Total run time was 104342 seconds.

For a cluster of computers with 30 GB of RAM each, the same analysis would apply. The maximum number of processes per computer is two, but any number of computers could be included in the superposition to speed up the computation. For example, distributing 16 processes to eight computers yields

    $ mpiexec -print-rank-map -hostfile hosts -n 16 -ppn 2 ./xSYMMIC layout5034.xml -s=2
    (HPCL8:0,1)
    (HPCL7:2,3)
    (HPCL6:4,5)
    (HPCL5:6,7)
    (HPCL4:8,9)
    (HPCL3:10,11)
    (HPCL2:12,13)
    (HPCL1:14,15)
    :
    Starting xSYMMIC with 16 MPI processes and 8 OpenMP threads each...

    Running L2 superposition of layout...
    :
    Writing solution file /home/Matt/nfs/layout5034.rst...

    Total run time was 15106.1 seconds.

If a level 2 superposition runs out of memory during the computation, the execution will stall and the MPI executor may eventually exit with an error message of the form:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4553 RUNNING AT HPCL8
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
:
===================================================================================

When this happens, try reducing the number of MPI ranks assigned per machine so that sufficient memory is available to the solver running on each rank.

Environment Variables Not Recommended

Unless there is a special need to do so, setting the environment variables MKL_NUM_THREADS (for PARDISO) or OMP_NUM_THREADS (for the other OpenMP computations) is not recommended. In the absence of these environment variables, xSYMMIC makes its own determination of how many threads to use: it chooses the same number of threads as physical cores, since this generally gives optimum performance. When SYMMIC or xSYMMIC is started, it prints the number of OpenMP threads and MPI ranks that it is using to the console window, as shown in the above examples.

Choosing the Best Hardware

If there is a choice between running SYMMIC on a fast desktop versus a Xeon workstation, choose the workstation, since its architecture is better suited to solver computations. As the next figure shows, the parallel speedup due to multi-threading is greater for a dual-Xeon workstation than for a desktop with an Intel i7-series processor, even though the workstation's CPU clock speed is slower than the desktop's. The PARDISO solver is able to take advantage of the many more cores available in the Xeon workstation to run the solver's parallel threads.




If you do not have a Xeon workstation with a large number of cores, SYMMIC provides easy access to high-performance hardware through commercial cloud services. Both OpenMP and MPI parallel computations are available on state-of-the-art workstations and clusters in the Elastic Compute Cloud of Amazon Web Services (AWS). CapeSym provides the xSYMMIC software and cloud licenses to all users through a pre-configured Amazon Machine Image that may be directly accessed from the SYMMIC desktop. For details see the SYMMIC in the Cloud section of the users manual.

Table 5 shows some of the current machine types available to SYMMIC users through AWS. The memory-optimized r5.24xlarge instances are recommended as single workstations for solving single device problems up to 60 million nodes. For larger problems, try the higher-memory x1 machine types for solving problems up to about 250 million nodes. Any of the machine instance types may also be configured into a cluster of up to 200 machines for running xSYMMIC through mpiexec. The newer Skylake and Cascade Lake instances, with superior interconnect speeds, are excellent choices for high-performance cluster computing.

Table 5. State-of-the-Art Hardware Available to SYMMIC Users

Instance Type  Processor                                   Cores  RAM      Network Speed
c5.24xlarge    3.6 GHz Intel Xeon Scalable (Cascade Lake)  48     192 GB   25 Gbps
r5.24xlarge    3.1 GHz Intel Xeon Platinum 8175            48     768 GB   25 Gbps
x1.32xlarge    2.3 GHz Intel Xeon E7-8880 v3 (Haswell)     64     1952 GB  25 Gbps
x1e.32xlarge   2.3 GHz Intel Xeon E7-8880 v3 (Haswell)     64     3904 GB  25 Gbps
i3.16xlarge    2.3 GHz Intel Xeon E5-2686 v4 (Broadwell)   32     488 GB   25 Gbps
i3en.24xlarge  3.1 GHz Intel Xeon Scalable (Skylake)       48     768 GB   100 Gbps
r5n.16xlarge   3.1 GHz Intel Xeon Scalable (Cascade Lake)  32     512 GB   75 Gbps
c5n.18xlarge   3.0 GHz Intel Xeon Scalable (Skylake)       36     192 GB   100 Gbps

CapeSym > SYMMIC > Users Manual
© Copyright 2007-2019 CapeSym, Inc. | 6 Huron Dr. Suite 1B, Natick, MA 01760, USA | +1 (508) 653-7100