## Temperature Computations

### Heat Conduction Theory

SYMMIC calculates the temperature field in a device by solving the energy conservation equation.

The term on the left multiplies volume density (ρ), specific heat capacity (c), and the change in temperature with time. The first term on the right hand side is the divergence of the product of the conductivity tensor (κ) and the temperature gradient. (The tensor embodies material thermal conductivity that may vary with temperature and direction.) The second term on the right (Qv) is the heat flux per unit volume.

In steady-state, the equation reduces to

such that volume density and specific heat capacity are not required for solution.

Four types of boundary conditions (BCs) can be applied when solving the energy equation: the Dirichlet BC, in which the temperature is fixed at a known value; the flux BC, a constant or time-varying heat flux on a surface or volume; the film BC, in which a heat transfer coefficient and sink temperature are specified; and the radiation BC that models radiative heat loss to the ambient environment. The default BC is the adiabatic or insulated condition, essentially a zero-flux BC. Device templates typically have all external surfaces as adiabatic, except for a film BC on the bottom surface (backside) of the device.

The film boundary condition obeys Newton's law of cooling:

where the surface flux QA is equal to a heat transfer coefficient (h) times the difference between the surface temperature and a constant environmental temperature (T). The heat transfer coefficient and temperature are intended to capture the thermal resistance from the backside of the device or layout to the reference (e.g. cold plate) temperature. A very large value for the heat transfer coefficient (e.g. 1 W/µm2 °C) can be used to effectively turn the film BC into a Dirichlet BC.

For semiconductor transistors, power dissipation is typically due to joule heating which is modeled as a surface or volume heat flux BC applied in the active epi-layers. Power parameters allow the user to set the magnitude of this flux. The power dissipation and film parameters of a device template are usually modifiable in a parameter list. See the documentation for a particular device template for details. Materials properties and thermal conductivities are also modifiable through the Materials list.

### Finite Element Method

SYMMIC uses the Finite Element Method (FEM) to solve the energy conservation equation. The finite element method reduces the continuous problem to a discrete problem with a finite number of degrees of freedom. The starting point of the method is to transform the strong form of the continuous problem (i.e. the partial differential equations) to a weak variational form (principle of virtual work in structural mechanics) defined in appropriate functional spaces. The second step of the finite element approximation is the subdivision of the problem domain into conforming finite elements. The last step is to discretize the weak form of the equations based on the finite dimensional subspaces associated with the elements. This process is detailed in many books (e.g.[1]). The end result is a matrix equation of the form

where x is the vector of unknown temperatures at the nodes of the mesh.

The mesh resolution is defined in the device template. Details on the specification of the mesh are given in the appendix, in the Points Section of the Device Template File Format. SYMMIC automatically generates the finite element mesh and solves the energy conservation equation based on this specification, so that the template user does not need to be concerned with such details. In addition, the hexahedral mesh generation algorithms in SYMMIC have been custom designed for this application, to ensure robustness and enable high accuracy without creating a mesh that is too big to solve quickly on desktop hardware.

Template designers should take advantage of the biased meshing capabilities built into the specification of the geometric features (Points and Layers), which allow each feature to be meshed finely in some parts or directions, and meshed more coarsely in other regions or directions. The best way to mesh for thermal analysis is to choose a size attribute for the smallest mesh elements at the edges of the power dissipation regions, along with bias attributes that gradually increase the mesh size away from high temperature gradients. These size and bias attributes should be defined as template parameters so that they can be easily modified from the SYMMC user interface. A simple example of this type of meshing design is the gResistor.xml device template.

Once this specfication is established, the template designer should conduct a mesh resolution study in which the problem is repeatedly solved and the mesh size within and near the heat flux incrementally decreased. The mesh resolution may be considered sufficient when the peak temperature stops increasing as the size parameters are decreased. Another indicator of poor mesh resolution can be slow convergence of the solver and/or the nonlinear fixed-point iterations. Poor mesh resolution across a large, sharp temperature gradient will not only slow down convergence, it can also result in a visibly unstable solution in which large temperature spikes appear in place of the usual smooth transition. Note that aspect ratios of the mesh elemets are not usually an issue, because large aspect ratios are not likely to cause convergence or stability problems for the type of finite elements used by SYMMIC.

### Linear Solvers

The matrix equation is solved using one of three linear solvers. The table below shows the three solvers and the commands that may be used to access them on different platforms. The default linear solver is the Parallel Direct Sparse Solver (PARDISO). The Run simulation command in the Solve menu invokes the PARDISO solver on a Windows desktop workstation. This is represented in the table by the green area under the PARDISO heading.

PARDISO uses multi-threading to speed up the matrix inversion through parallel computations on multi-core systems. An iterative Preconditioned Conjugate Gradient (PCG) solver is also available on Windows. This PCG solver requires Message Passing Interface (MPI) technology for its parallel computations and can only be accessed through the mpiexec command operating on the xSYMMIC command line utility. The mpiexec command is only available when the MPI library has been installed on the platform. For more details on mpiexec, please see the related section in the xSYMMIC Simulations chapter of the manual.

Linear solvers from the PETSc Library are also available when that library has been installed on a Linux operating system. Use of the PETSc library is not available on a Windows platform. Linux execution is currently only available through the xSYMMIC command line utility. Both PETSc and the built-in PCG solver can operate under mpiexec to parallelize the computation across a cluster, provided the MPI library is installed on each of the compute nodes.

SYMMIC supports solving the FEM problem on a different computer system from the one running the Windows GUI, through the Remote Run and Remote Jobs dialogs. These dialogs interact with remote computer systems that already have licensed copies of xSYMMIC installed. CapeSym provides a cloud-based Linux machine image with xSYMMIC pre-installed that can be accessed for a small hourly fee. For details, please see the section entitled SYMMIC in the Cloud.

### Nonlinear Fixed-Point Iterations

The linear solver is applied in sequential iterations to resolve the effects of temperature-dependent properties. The fastest solutions are obtained for single device problems in which the material properties do not depend on temperature, since no additional iterations are required. To solve problems with temperature-dependent properties, after each iteration the values of the M and K matrices are updated based on the just-computed temperature solution. These nonlinear iterations will be reported in the console during the computation of a solution in the following format.

Iteration: 1

Fixed Point Residual = 4.71e-05

The "Fixed Point Residual" is the absolute maximum change in temperature at any node in the solution from the temperature at that node on the previous iteration. The first iteration (Iteration: 0) is the linear solution to the matrix equation assuming M and K are constant rather than varying with temperature.

The nonlinear iterations may be disabled by unchecking the box at the top of the Solver Settings dialog, as shown in the figure below. The convergence level for the nonlinear iterations can be adjusted through the Fixed-point residual tolerance parameter in the dialog. The problem is considered to have converged if the "Fixed Point Residual" is less than this parameter, the minimum value of which is fixed at 1.0e-4 °C. The default value is1.0e-3 for accurate results, but it may increased to reduce the number of nonlinear fixed-point iterations and possibly speed up the computation of the solution.

The Solver Settings Dialog

### Superposition Method

To solve layouts, SYMMIC uses a superposition of solutions for individual devices. This approach reduces the size of the mesh that must be solved numerically. Rather than creating a large finite element problem that fully resolves the heating of all devices simultaneously, each device is meshed and solved separately and the results recombined to get the final solution. Since each single device simulation requires less memory and computation time than the full layout, the problem can often be solved using fewer resources.

For a layout involving N devices with individual solutions Ti(t), the full solution T(t) is computed by

where TØ is the “null” solution of the non-powered layout. The null solution captures the influence of both the initial and environmental temperatures on the full solution, allowing the individual device solutions to be computed at realistic temperatures.

Solving a layout of devices using superposition uses a different mesh for each part Ti in the above equation. The total solution T is represented by a mesh different from any of the Ti, as illustrated in the following graphic. In this example, a layout of four mesa resistors is solved by summing together the null solution mesh and the meshes for each device simulated alone. The total solution is summed on a coarse mesh between the devices and a finer mesh within each device region. The fine meshes are determined by the device template meshing specifications, while the uniform spacing of the coarse mesh between devices is specified by the Mesh Size parameter in the Layout Editor. The Mesh Size also determines the grid spacing of the null solution mesh TØ.

As each individual device is solved during superposition, a biased (non-uniform) mesh Ti is created that grows outward from the device to cover the entire MMIC. This mesh is used to determine the heat flow from the device to adjacent devices in the layout. The growth rate of the mesh elements is specified by the Bias parameter in the Layout Editor. The user does not see the individual meshes used to solve the layout and so the effect of the Bias parameter may not be readily apparent. To view the biased mesh for a device, delete all of the other devices in the layout and then use Show Mesh in the Model Check dialog.

Note: For a layout of only one device, the Show Mesh preview is different from the solution mesh.

Superposition of individual meshes are used to solve a layout.

Adjacent devices in a layout will, in general, cause the temperature of a device to be elevated above the level obtained from single-device simulation. When temperature-dependent material properties are involved, this may alter the solution for that device. Strict linear superposition will not be accurate in this case. Once the magnitude of the effect is known, however, each device solution can be recalculated taking into account the heating from adjacent devices. The user has control over whether this second iteration is performed through the “Repeat superposition” check box in the Solver Settings dialog. The repetition essentially doubles the computation time for the problem, but may not be necessary if devices are widely separated and of sufficiently low power, or if material properties are not strongly affected by temperature.

Note: When solving a layout that contains only a single device, “Repeat superposition” is unnecessary because there are no interactions with other devices.

### Parallel Computations

Parallel computations in SYMMIC may be divided into two types, those that share a common memory space versus those that do not. Shared memory parallelism often occurs within the cores of a single processor, while separate computers usually have independent memory spaces. These different ways of distributing a problem over multiple cores or computers require different programming constructs. The Message Passing Interface (MPI) is designed for distributed memory systems, where each process has isolated memory, whereas Open Multi-Processing (OpenMP) requires shared memory. OpenMP provides parallelism for a single workstation, and MPI provides parallelism for a group (or “cluster”) of machines. MPI also works on the multiple cores within the shared memory of a single workstation.

The superposition computation offers a hybrid parallel mode, where MPI is used to distribute the problem over different workstations (or “nodes”) in a cluster, and OpenMP is used to distribute a portion of problem among the CPU cores of a node. Layouts containing a large number of devices may be more efficiently solved by distributing the individual devices in a superposition across the cores of multiple nodes in a cluster using MPI. A master-worker approach is used, where the master process distributes device templates to be solved to all of the other processes, receives the device solutions, and hands out more work as needed, until the entire layout solution is completed. Since layouts are solved by superposition, each of the device template solutions can be calculated independently of all the others, so this works very well in parallel. Speedups of over 100x have been achieved. (The speedup is the serial execution time divided by the parallel execution time.) This parallel algorithm is known as “Level 2” superposition to distinguish it from normal, “Level 1” superposition, which has been a hallmark of SYMMIC's solution approach for layouts. The figures below illustrate the two superposition algorithms.

Level 2 superposition works very well when there are many more templates in the layout than there are CPU cores, and each node has enough memory to solve a different template on every core, or at least on most of its cores. At a minimum, the number of templates in the layout should not be less than the number of cores requested from MPI. On Windows, each worker node solves for device temperatures using the PARDISO solver and may benefit from the availability of additional cores for OpenMP parallelism.

Level 1 Superposition: There are three cores (n1, n2, n3) which all work together to solve each device template (T1, T2) in sequence. At the end of each solution, the main thread (root path) sums the device into the final solution mesh.

Level 2 Superposition: There are three MPI nodes (n1, n2, n3) that work independently to solve each device template (T1, T2, T3) in parallel. The n1 process (root path) is the scheduler that doles out problems to the other nodes to be solved in parallel. When a node finishes one problem it asks for another one to solve, until all device solutions have been computed and summed to get the final solution for the layout.

Level 2 superposition execution speed scales very well with the number of nodes in the cluster. This is demonstrated for a layout with 5039 devices (depicted below) solved on the NERSC Edison supercomputer cluster. As shown in the results below, the run time is still decreasing at 252 cores spread over 42 nodes.

Although the supercomputer has a very fast interconnect between the nodes, the interconnect is not a big factor in the execution speed. A 10 Gbit interconnect will perform just about as well. Excellent performance can also be obtained by solving on high performance instances in the cloud. xSYMMIC is now available on the Amazon Marketplace, giving everyone access to as large a single workstation or cluster as desired. Amazon only charges hourly for the wall time that the instance is running, and they provide very high performance hardware, which together makes for a compelling high-performance computing option for many companies. See the SYMMIC in the Cloud section for more details.

A layout of 5039 devices on a mixed-signal MMIC.

Faster solution of the layout is achieved by level 2 superposition on more cores.

### The PARDISO Linear Solver

The default solver is the Parallel Direct Sparse Solver (PARDISO) in the Intel Math Kernel Libraries (MKL). The PARDISO algorithm was originally developed at the Department of Computer Science, University of Basel [2]. As implemented in the MKL, the PARDISO solver includes a version of the METIS algorithm originally developed in the Karypis lab at the Univeristy of Minnesota [3]. The PARDISO solver in SYMMIC does not use MPI parallelism, but it does use OpenMP parallel constructs.

The PARDISO solver is very fast, but uses a lot of memory. Fast performance is only obtained when there is sufficient physical RAM to hold the entire computation. The peak memory required for PARDISO is dependent on problem size and structure, as well as the number of threads utilized in the computation. The requirement usually lies in the range of 5-10 GigaBytes (GB) per million nodes. A simple estimate of the lower bound of the required memory for the PARDISO solver is 5.5 GB of physical RAM per million nodes. The number of unique problem nodes (temperatures) is reported to the console when the problem is meshed. Hence, it may be a good idea to mesh the problem first before trying to solve it. Use the Show Mesh button in the Model Check dialog to estimate the memory requirement. For example, if Show Mesh reports 10 million nodes then the PARDISO solver will require at least 10*5.5 = 55 GB of RAM for a fast solve. Be aware that memory consumption will grow faster than linear in the number of problem nodes due to multi-threading, as discussed further below.

PARDISO estimates the required memory to solve the problem in the analysis phase. Just before numerical factorization is attempted, this estimate is compared to the memory currently available as reported by the operating system. If there is insufficient memory, numerical factorization will not be attempted. Instead the following message sequence will appear in the console.

```  Memory available= 18172924 KB
Memory needed to solve in-core= 28294654 KB, out-of-core= 6592078 KB

Pardiso Error -2 : Not enough memory

Unable to complete the simulation run.
```

In this specific case, the problem to be solved involved 4,968,066 temperature nodes, but the available memory of 17.3 GB was less than the 27 GB that PARDISO estimated it would need to factor a problem of this size. The memory analysis is not fool proof. More memory may be consumed during the solution than initially estimated by PARDISO. Also, it is possible for all of the available memory to be consumed in the analysis phase itself for extremely large problems. However, the PARDISO analysis will generally indicate whether another approach needs to be tried to solve the problem.

PARDISO normally holds the entire matrix factorization in physical RAM, also known as in-core storage. There is a PARDISO option to store the factorization on the hard drive instead. This out-of-core storage generally requires a lot less RAM than the in-core solution, but the computation may be twice as slow. In the above message, PARDISO also estimated that 6.3 GB would be needed for out-of-core solution. Solving the whole problem out-of-core typically uses more RAM than the amount estimated. A useful rule of thumb for the lower bound requirement is 2.75 GB per million nodes, (e.g. 13.7 GB for 4.97 million nodes). If this rule-of-thumb estimate is more than the memory available, consider using one of the iterative solvers and mpiexec to distribute the memory requirement over multiple machines. For an example, see the section on Choosing Parallel Computing Methods later in this chapter.

The PARDISO out-of-core option may be selected by checking the box in the Solver Settings dialog. During out-of-core solution, a group of temporary files ending in _ooc.ind, _ooc.jal, _ooc.lup, _ooc.sin, and _ooc.sle will be written to the working directory. This directory will need to have enough disk space to hold the whole factorization. For the 4.97 million node problem, this turns out to be about 25 GB on Windows and 23 GB on Linux. In addition to the files just mentioned, a sequence of .lnz files is generated on Windows. On Linux, a single .lnz file is generated to hold the factorization. Often PARDISO generates a new error message and aborts the computation if the RAM or disk space turns out to be inadequate.

Known Issue: In some instances of Linux, the out-of-core solver will hang just before creating the .lnz file. If this happens try moving the template to a directory with lots of disk space that is not in a network file system, and then restart the computation in that directory.

SYMMIC uses OpenMP threading technology to speed up the finite element computations on multi-core systems. OpenMP recognizes the environment variable OMP_NUM_THREADS as specifying the number of threads available to the application. The PARDISO solver also uses OpenMP technology. The MKL also recognizes environment variables that are independent of OpenMP, such as MKL_NUM_THREADS, and these variables are inspected first, then the OpenMP variables are examined, and if neither is found, PARDISO chooses the default number of threads. The default number of OpenMP threads is equal to the number of physical cores on the system.

Some systems support two independent threads per physical core, effectively doubling the number of “processors” available. For instance, the Task Manager on Windows may show eight processors which are really hyper-threads running on four physical cores. Environment variables should never be defined to use more than the total number of hyper-threads, and may run faster with half that value.

The user may reduce the amount of parallelization in the PARDISO solver by setting the environment variable MKL_NUM_THREADS to a low number. This will slow down the computation but may result in less memory and CPU utilization, freeing computational resources for other tasks. The best practice is to not define any environment variables at all and let the MKL choose the fastest parallelization.

If you just want to set the environment variable locally in a command prompt before running xSYMMIC you can do that with the Windows command

In a Linux shell, use the command

To see the value of the environment variable, type the "set" command without any parameters. This will show all the defined environment variables and their values.

The amount of memory consumed by the PARDISO solver depends on the number of parallel threads it uses, as seen in the figures below. On the left is shown the memory consumed and estimated by PARDSIO on an 8-core machine, while the right image shows memory rates for the same set of problems on a 48-core machine. The analysis phase memory consumed and the estimated RAM for out-of-core factorization both increase linearly with problem size. On large problems, the analysis phase may actually use more memory than the out-of-core factorization. For example, the analysis phase needs 27 GB on the problem with 10.8 million temperatures, while the out-of-core estimate is 16 GB.

Linear estimates are not sufficient for multi-threaded solves on large problems. An exponent (on the number of nodes) larger than 1.1 is needed to keep up with the increase in memory consumption for in-core solves, and this exponent increases when more cores are made avalable to PARDISO for multi-threading.

### The PCG Linear Solver

When MPI parallelism (mpiexec) is used on the Windows command line, the linear solver is switched from PARDISO to an iterative preconditioned conjugate gradient (PCG) solver with block Jacobi preconditioning using incomplete LU factorization with zero fill-in [4]. Partitioning of equations among parallel processes is performed by the METIS library [3] prior to the PCG computation. When the number of parallel processes becomes large enough, the PCG iterative solver will usually be faster than PARDISO, with less memory usage per machine.

The next figure plots the peak memory used by the PCG solver for different problem sizes, when running on an Intel Xeon Platinum 8175 processor with 48 cores and 2 hyperthreads per core. The number of parallel MPI processes is designated by the -n flag on the command line, so “PCG -n 48” indicates that 48 parallel processes are being used in the computation. The lowest line in the plot (orange) is the total memory used by two parallel processes. For a problem with a million unknowns, the total memory used by two processes was 1.6 GB, versus the 6 GB required for PARDISO. At 48 parallel processes, the PCG memory requirement of almost 13 GB exceeds that of PARDISO. Going from 2 to 48 processes results in an 8-fold increase in the peak memory requirement at all problem sizes.

A reasonably accurate model of the relationship of problem size (designated by x million nodes) and number of parallel processes (n) to the peak memory used, is given by

This formula is represented by the black diamonds on the above figure. To find the RAM required by the PCG solver for a problem with x=5 million nodes when running on n=4 MPI processes, we would calculate just under 10 GB using this formula. Therefore, the PCG solver could solve for 5 million temperatures on a Windows workstation with 4 cores and 32 GB of RAM, while the PARDISO in-core solver would fail due to insufficient memory on this same problem.

Use the MPI executor on a command line to run the PCG solver instead of PARDISO.

> mpiexec -n 4 xSYMMIC GaNSi_FET5million.xml

For more information about accessing MPI from the command line, please see the chapter on xSYMMIC Simulations. The PCG solver is only available when using xSYMMIC and mpiexec.

### The PETSc Linear Solver

The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines for the parallel solution of partial differential equations developed at Argonne National Laboratory [5]. On Linux systems, xSYMMIC can use PETSc (or the built-in PCG solver) as the linear solver instead of PARDISO. The required libraries are included with the xSYMMIC Linux distribution, so no additional installation of PETSc is required. Neither SYMMIC nor xSYMMIC will use PETSc on Windows.

The PETSc solver is selected by using the -usePETSc command line option. The PETSc solver may be selected with or without the mpiexec command. Without the mpiexec command, the -usePETSc flag selects the serial (single process) version of the PETSc solver, instead of the PARDISO solver with full OpenMP parallelism. When the mpiexec command is used, the PETSc solver is parallelized over the number of processes (or ranks to use the MPI terminology) specified by the -n flag. The PETSc solver does not make use of OpenMP threads, so it is fastest when the number of processes equals the number of physical cores in the machine. Always choose as many processes per machine as the available memory will support, but no more than the number of physical cores.

The solver implemented for the PETSc option is an iterative preconditioned conjugate gradient (PCG) solver with block Jacobi preconditioning using incomplete LU factorization with zero fill-in. This is the same algorithm used by the built-in PCG solver and the two solvers have comparable performance. Partitioning for the PETSc solver is based on the PT-SCOTCH library [6]. Options for monitoring or configuring PETSc may be added to the end of the xSYMMIC command line following the -usePETSc flag. See the section on xSYMMIC Simulations and the PETSc users manual [5] for more details.

The peak memory required by the PETSc solver is more difficult to predict, but is consistently more than the memory required by the built-in PCG solver. An approximate model for the PETSc solver is

where n is the number of processes and x is the problem size in millions of nodes. This model indicates that the PETSc solver, when running on 48 parallel processes, requires about 8.5 times more memory than it uses when running on just 2 parallel processes. PARDISO typically requires about 3-4 times as much RAM as PETSc when only 2 cores are utilized, but when 48 cores are used on a single machine PARDISO will probably use less memory than either the PETSc or built-in PCG solvers.

PETSc parallel computation is capable of providing faster solves than PARDISO, even on a single computer, when there a enough physical cores. The graph above summarizes the solution speed on three problem sizes when running on 2-48 processes/threads on a single workstation with an Intel Xeon Platinum 8175 processor. Solution speed is problem-dependent, but in most cases the PARDISO solver is faster than the iterative solvers when running on just a few cores. When the problem is large and distributed to many cores, the PETSc solver is fastest. The PETSc solver is faster than the built-in PCG solver in this situation, because of relaxed convergence criteria that reduce the number of iterations. However, these relaxed convergence criteria could also affect the relative accuracy of the PETSc solver on some problems.

### References Cited

1. T.J.R. Hughes. The Finite Element Method. Linear Static and Dynamic Finite Element Analysis, Dover Publications, Mineola, NY, 2000.

2. O. Schenk, K. Gartner, and W. Fichtner. Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors. BIT, 40(1):158-176, 2000.

3. G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359-392, 1998.

4. Y. Saad. Iterative Methods for Sparse Linear Systems. Second Edition. Society for Industrial and Applied Mathematics, 2003.

5. S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, (2018) PETSc Users Manual, https://www.mcs.anl.gov/petsc/petsc-current/docs/manual.pdf

6. C. Chevalier and F. Pellegrini. PT-Scotch: A tool for efficient parallel graph ordering. Parallel Computing 34, 6-8 (2008) 318-331.

CapeSym > SYMMIC > Users Manual
© Copyright 2007-2019 CapeSym, Inc. | 6 Huron Dr. Suite 1B, Natick, MA 01760, USA | +1 (508) 653-7100