SYMMIC Users Manual CapeSym

SYMMIC in the Cloud

CapeSym provides a low-cost, cloud-based machine image for running xSYMMIC on compute instances in the Elastic Compute Cloud (EC2) of Amazon Web Services (AWS). This makes it easy to use Remote Run to solve problems in the cloud with faster turnaround than on the desktop, as shown in the following table.

Performance of xSYMMIC in the Cloud Version 3.0.0

| Compute Location | Instance Type | Processor | Cores | RAM (GB) | Spot ($/hr) | Solve (sec) | Wall Time (sec) |
|---|---|---|---|---|---|---|---|
| AWS us-east (N. VA) | r4.large | Xeon E5-2686v4 2.3 GHz | 1 | 15 | 0.04 | 167 | 180 |
| AWS us-east (N. VA) | r4.xlarge | Xeon E5-2686v4 2.3 GHz | 2 | 30 | 0.08 | 94.2 | 106 |
| Desktop (Natick, MA) | -- | Xeon E5-1620 3.6 GHz | 4 | 32 | -- | 92.4 | 95 |
| AWS us-east (N. VA) | m4.2xlarge | Xeon E5-2686v4 2.3 GHz | 4 | 32 | 0.13 | 56 | 68 |
| AWS us-east (N. VA) | m5.2xlarge | Xeon Platinum 8175 2.3 GHz | 4 | 32 | 0.14 | 42.8 | 58 |
| AWS us-east (N. VA) | r4.2xlarge | Xeon E5-2686v4 2.3 GHz | 4 | 64 | 0.15 | 56.7 | 72 |
| AWS us-east (N. VA) | c4.2xlarge | Xeon E5-2666v3 2.9 GHz | 4 | 15 | 0.13 | 48.7 | 62 |
| AWS us-east (N. VA) | c5.2xlarge | Xeon Platinum 8124 3 GHz | 4 | 16 | 0.16 | 48 | 58 |
| AWS us-east (N. VA) | r4.8xlarge | Xeon E5-2686v4 2.3 GHz | 16 | 244 | 0.65 | 28.3 | 42 |
| AWS us-east (N. VA) | c4.8xlarge | Xeon E5-2666v3 2.9 GHz | 18 | 60 | 0.55 | 26.3 | 40 |
| AWS us-east (N. VA) | c5.9xlarge | Xeon Platinum 8124 3 GHz | 18 | 72 | 0.55 | 23.7 | 36 |
| AWS us-east (N. VA) | m4.10xlarge | Xeon E5-2686v4 2.3 GHz | 20 | 160 | 2* | 30.1 | 43 |
| AWS us-east (N. VA) | m4.16xlarge | Xeon E5-2686v4 2.3 GHz | 32 | 256 | 3.2* | 28.6 | 43 |
| AWS us-east (N. VA) | r4.16xlarge | Xeon E5-2686v4 2.3 GHz | 32 | 488 | 1 | 27.3 | 42 |
| AWS us-east (N. VA) | x1.16xlarge | Xeon E7-8880v3 2.3 GHz | 32 | 976 | 2 | 31.5 | 42 |
| AWS us-east (N. VA) | m5.24xlarge | Xeon Platinum 8175 2.3 GHz | 48 | 384 | 4.6* | 20.5 | 33 |

* On-demand pricing was used because spot pricing was not available at the time for these instances.

This table contains execution time measurements made during the week of May 14-18, 2018, at random times between 8 AM and 6 PM. Jobs were launched using the Remote Run dialog from a desktop workstation in Natick, MA, onto instances running in Amazon's Northern Virginia facilities. All cases solve FETbig.xml (1.8 million nodes) in steady-state using xSYMMIC with PARDISO. Wall time includes file upload, computation, and solution download to Natick from N. Virginia. The solution file was 106 Mbytes in size. Provisioning an Amazon Linux 2 instance, from the start of the launch wizard until the instance was running, took about a minute. Provisioning time is not included in the wall time measurements shown in the table because the instance was already running when the Remote Run job was launched.

Amazon classifies instances as general purpose (e.g. m4 and m5), memory-optimized (e.g. r4 and x1) and compute-optimized (e.g. c4 and c5), based on the hardware configuration. The instances listed in the table are just a sample of the available hardware configurations. As of this writing, instances as big as 64 cores with 3904 GB of RAM are available. Estimating the memory required for a direct solver in physical RAM on a machine this size is difficult, but these largest instances might be capable of solving device templates with 250 million unknowns (or unique mesh vertices). We solved a device template with over 206 million mesh vertices using the PARDISO solver on an x1e.32xlarge instance (64 cores, 3904 GB RAM). This solve completed in under 7 hours, consuming about 2615 GB of the memory on that machine.
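The reported x1e.32xlarge solve provides one data point from which a rough estimate can be made. The sketch below is only a back-of-envelope check, not CapeSym's sizing method, and assumes PARDISO memory grows roughly linearly with vertex count:

```python
# Back-of-envelope extrapolation (assumption: roughly linear scaling) from the
# one PARDISO data point reported in the text.
REPORTED_VERTICES = 206e6    # vertices in the template solved on x1e.32xlarge
REPORTED_MEMORY_GB = 2615.0  # memory that solve consumed, per the text

GB_PER_MILLION = REPORTED_MEMORY_GB / (REPORTED_VERTICES / 1e6)  # ~12.7 GB

def pardiso_memory_gb(vertices):
    """Rough linear estimate of PARDISO memory for a given vertex count."""
    return GB_PER_MILLION * vertices / 1e6

# A 250-million-unknown template comes out near 3170 GB, under the 3904 GB
# available on the largest instances mentioned above.
estimate = pardiso_memory_gb(250e6)
```

This simple extrapolation is consistent with the text's suggestion that the largest instances might handle 250 million unknowns.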

To start using xSYMMIC in the Cloud, you must first create an AWS account and then subscribe to the product at: https://aws.amazon.com/marketplace/pp/B07DDPVWX6. The free subscription gives you access to the SYMMIC in the Cloud machine image that can be used to run xSYMMIC on EC2 instances. SYMMIC in the Cloud is currently available only in the US East (N. Virginia), US East (Ohio), US West (N. California), and US West (Oregon) facilities. Running an instance of SYMMIC in the Cloud costs a small amount per hour in addition to Amazon's base rate for running the EC2 instance. All payment is through your own AWS account. A detailed walk-through of the sign up and run process is available on CapeSym's website at: http://www.capesym.com/cloud.html.

AWS Clusters

SYMMIC is now available on AWS clusters, providing nearly unlimited computing power! As an example of the speed-up possible on an AWS cluster, consider the following example solved on a cluster of twenty c5.18xlarge compute nodes, each having 36 physical cores and 144 GB of RAM, and communicating over a 25 Gbit interconnect.

[Figure: parallel speed-up of the solution and mesh generation on the twenty-node c5.18xlarge cluster]
The cluster is solving a FET template with 20 million unknowns as described in the section on Managing Jobs on a Remote Cluster. These runs use 12 MPI ranks (processes) per node, rather than all 36 cores, because that is the largest number of ranks the installed memory could support for 20 million unknowns. Although mesh generation does not speed up as much, it is a small part of the computation performed only at the start of the solution. Overall the solution computation is 70 times faster than a serial solve!

As with most finite element codes, parallel performance is greatly affected by the speed of the interconnect. This is the communications channel that connects all of the nodes (i.e. computers), and over which the MPI communications travel. The 1 Gbit per second (Gbps) ethernet connection typical of many local area networks is inadequate for high performance computing. Ten Gbps ethernet is the minimum workable speed, and in the case of SYMMIC, this interconnect is best reserved for Superposition Level 2 runs that require less communication. AWS offers 25 Gbps interconnects for its largest instance types, and those should be used if at all possible. At the time of this writing, the 25 Gbps instances available include the c5.18xlarge, i3.16xlarge, m4.16xlarge, m5.24xlarge, r4.16xlarge, r5.24xlarge, x1.32xlarge and x1e.32xlarge. The largest instance of each type (in terms of the number of CPUs) has the fastest speed, because the entire machine (i.e. all sockets and cores of the computer) is being reserved. AWS's current pricing is based on the number of cores, without any penalty for the larger instances, so one should always use the largest possible instance of the desired type when configuring a cluster.

When one creates an AWS account, the initial limits on how many EC2 instances one can use may be quite low for the largest types (possibly zero). These limits can be seen by going to the EC2 Dashboard and selecting the Limits page (currently a link on the top left-hand side). On this page, choose one or more of the largest instance types that you want to use and request an increase, with the reason that the xSYMMIC product requires it, and hopefully AWS will approve it within a day or two.

SYMMIC clusters on AWS are licensed under the same xSYMMIC in the Cloud product subscription on the AWS Marketplace described above, so no additional subscription is required. The cost per hour is small for both serial and parallel computations. Clusters are easily created using the AWS CloudFormation Service. CapeSym provides a launch template to facilitate configuring and starting a CloudFormation cluster (or stack). This template is publicly available at the address below.

https://s3.amazonaws.com/symmic-cloudformation-templates/SYMMIC-AWScluster.json

After logging into the AWS Console, navigate to the CloudFormation console, select the “Create new stack” button and enter this S3 template address in the Amazon S3 template URL box as shown below, then click “Next” and proceed to configure the cluster, following the instructions on CapeSym's website.

[Figure: CloudFormation “Create new stack” page with the Amazon S3 template URL entered]
After the cluster has been created and is ready to use, the CloudFormation console will provide an IP address to the master node (see the Outputs tab) that can be used to connect the Remote Run dialog to the cluster. A full tutorial on configuring an AWS cluster is now available on the web at: http://www.capesym.com/cloudformation.html.

When done using a stack, it should be deleted by returning to the CloudFormation console, selecting the stack and using the Action pull-down menu to “Delete the stack”. Do not terminate the individual instances using the EC2 console. CloudFormation will take care of shutting down the instances and deleting the network components in an orderly manner.

Estimating Cluster Performance

The availability of remote clusters of differing instance types and nearly unlimited sizes presents the potential user with a myriad of cluster configuration choices. Here are a few guidelines to aid in these decisions.

Memory: As listed elsewhere in this manual, the PCG solvers used with clusters require about 0.5 GB of RAM per one million equations being solved, per core (i.e. MPI process) used. The number of equations being solved is equal to the number of vertices (the “unique nodes” value given during meshing) that do not have a fixed temperature assigned to them (usually all of them).
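This guideline can be turned into a quick sizing calculation. The helpers below are a sketch based only on the 0.5 GB-per-million-equations-per-rank rule of thumb stated above; the function names are ours, not part of any SYMMIC tooling:

```python
# Sizing sketch based on the rule of thumb above: the PCG solvers need about
# 0.5 GB of RAM per one million equations, per MPI rank.
GB_PER_MEQ_PER_RANK = 0.5

def ram_needed_gb(n_equations, ranks_per_node):
    """RAM each node must have for this many ranks on this problem size."""
    return GB_PER_MEQ_PER_RANK * (n_equations / 1e6) * ranks_per_node

def max_ranks_per_node(n_equations, node_ram_gb):
    """Most MPI ranks a node's RAM can support by this rule."""
    return int(node_ram_gb // (GB_PER_MEQ_PER_RANK * n_equations / 1e6))

# The 20-million-unknown cluster example: 12 ranks on a 144 GB node needs
# 0.5 * 20 * 12 = 120 GB, which fits.
needed = ram_needed_gb(20e6, 12)  # 120.0
```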

Cost: If the only constraint is AWS cost, then one would always solve with the fewest cores possible, since parallel efficiency degrades with increasing numbers of cores. However, since the AWS cost per node per hour is far less than a user's time is worth, one should weigh the trade-off between cost and the speed of the calculation.

Performance: The solution calculation speed can be estimated with the following procedure. Example values are given based on steady-state runs of the five- and twenty-million-vertex amplifier templates described elsewhere. Note that mesh generation time is not included in the following estimates. Meshing time is generally much less than solution time, and although meshing currently doesn't scale very well (as the figures above show), big improvements are under development and should be released soon.

  1. Run a one-million-equation problem on the instance type of choice, in serial, using the (PETSc) PCG solution algorithm. This solution time is denoted “t0”. The following table gives the t0 values for four common AWS instance types.

| Instance Type | Cores | RAM / Core (GiB) | t0 (sec) |
|---|---|---|---|
| m4.16xlarge | 32 | 8 | 31.3 |
| m5.24xlarge | 48 | 8 | 27.2 |
| r5.24xlarge | 48 | 16 | 26.2 |
| c5.18xlarge | 36 | 4 | 28.9 |

  2. Run a larger version of the same template used for Step 1, and fit the exponent “a” in the scaling law t = t0 (Neq/Neq0)^a, where “Neq” is the number of equations. For the m4.16xlarge instance we found a = 1.75.

  3. The parallel solution time can now be calculated as t_parallel = t0 (Neq/Neq0)^a / (n × Efficiency), where “n” is the number of cores (MPI ranks) employed and “Neq0” is 1×10^6 from Step 1.

  4. Parallel efficiency (“Efficiency” in the equation above), for the current PETSc PCG solution implementation on the AWS instance types listed in the table above, is approximated as follows.
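The steps above can be sketched in a few lines of Python. The power-law form of the serial time is taken from Step 2; the parallel efficiency is left as a caller-supplied value because the Step 4 approximation is instance-specific. The function names are ours, not part of any SYMMIC tooling:

```python
# Sketch of the estimation procedure above. Serial time is assumed to follow
# the power law t = t0 * (Neq / Neq0)**a from Step 2; efficiency is supplied
# by the caller rather than computed from the Step 4 approximation.
NEQ0 = 1e6  # the one-million-equation reference problem of Step 1

def serial_time_s(t0, neq, a):
    """Estimated serial PCG solution time for neq equations (Step 2)."""
    return t0 * (neq / NEQ0) ** a

def parallel_time_s(t0, neq, a, n_cores, efficiency):
    """Estimated time on n_cores ranks at the given efficiency (Step 3)."""
    return serial_time_s(t0, neq, a) / (n_cores * efficiency)

# m4.16xlarge (t0 = 31.3 s, a = 1.75), ten million equations on 64 cores at
# an illustrative 60% efficiency: about 46 seconds.
t = parallel_time_s(31.3, 10e6, 1.75, 64, 0.60)
```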

The following figure compares the solution times calculated with the above procedure for the slowest (m4) and fastest (r5) AWS instance types listed in the table above. The problem is a ten-million-vertex template.

[Figure: predicted solution time versus number of cores for m4 and r5 clusters on a ten-million-vertex problem]
This figure shows that, although the new r5 instances are faster and have twice as much memory per core as the common m4 instances, a cluster of m4 instances can provide performance that is just as satisfactory. Therefore, in regions where r5 and m5 instances are not yet available, SYMMIC users can build clusters from m4 instances and still achieve great results.

The next two figures show the performance of m4 and r5 instances for problems with 20 to 100 million equations to solve.

[Figures: predicted solution times for m4 and r5 clusters on problems of 20 to 100 million equations]
In summary, for a 100-million-equation problem, the serial solution time is predicted to be 23 hours on r5 and 27.5 hours on m4, using 50 GiB of RAM. If 256-core clusters are used instead (32 r5 nodes or 64 m4 nodes, to satisfy memory requirements), then the predicted solution time reduces to just 11 minutes on r5 and 13 minutes on m4!
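These figures can be checked against the t0 table and the a = 1.75 exponent; applying the m4's exponent to the r5 as well is an assumption of this check:

```python
# Check of the figures quoted above, using the table's t0 values and the
# a = 1.75 exponent reported for the m4.16xlarge. (Applying the same
# exponent to the r5 is an assumption of this check.)
A = 1.75

def serial_seconds(t0_s, million_eq):
    """Serial solve time via the Step 2 power law, t0 * Neq_millions**a."""
    return t0_s * million_eq ** A

m4_hours = serial_seconds(31.3, 100) / 3600  # ~27.5 hours, as quoted
r5_hours = serial_seconds(26.2, 100) / 3600  # ~23.0 hours, as quoted

# The quoted 13- and 11-minute times on 256 cores imply roughly 50%
# parallel efficiency at that scale.
m4_eff = serial_seconds(31.3, 100) / (256 * 13 * 60)
r5_eff = serial_seconds(26.2, 100) / (256 * 11 * 60)
```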

CapeSym > SYMMIC > Users Manual
© Copyright 2007-2019 CapeSym, Inc. | 6 Huron Dr. Suite 1B, Natick, MA 01760, USA | +1 (508) 653-7100