The Remote jobs... menu item in the Solve menu of the GUI can be used to run a simulation through a job scheduler on a remote cluster. Whereas Remote run... invokes a single interactive run on a remote system, Remote jobs... enables submission and monitoring of multiple non-interactive jobs running under a job scheduler, also known as a “batch” scheduler. Currently, SYMMIC supports the Slurm job scheduler on a Linux cluster; for more details, see https://slurm.schedmd.com/
The Remote Jobs dialog is similar to the Remote Run dialog, but there are a few differences. Before opening the dialog, one should first open a template to establish the desired local working directory. Within the dialog, if a new template file is desired, it can be selected by using the browse button (...) next to the Template edit box. Once a remote job has been submitted, the template file in the edit box can be changed to submit another job, or one can disconnect and leave the dialog altogether while waiting for the submitted job(s) to complete. And once a job has completed and been approved for download, the results will download correctly regardless of which template is showing in the dialog.
The three primary tasks of the Remote Jobs dialog are (1) to specify and submit jobs to the job scheduler on the remote system, (2) to query that job scheduler for the status of previously submitted jobs, and (3) to download the results and clean up the remote system when a job is complete. The figure below illustrates how this dialog can be used. After connecting to the remote system, one can either submit a job or check the status. Both of these operations can be done multiple times before disconnecting.
The Operational States of the Remote Jobs Dialog
The Options button opens a dialog for setting the command line options for xSYMMIC. The same Options dialog is used for the Remote Run and Background Run dialogs, except that the -usePetsc option is selected by default for Remote Jobs. If a Superposition Level 2 job is being submitted, then -usePetsc should be unselected and -s=2 selected instead. Please refer to the Background Run section for a more complete description of the Options dialog, and to the sections entitled Command Line Utility and mpiexec xSYMMIC for an explanation of the options themselves.
The following Remote Jobs example solves a very large FET array with twenty million unknowns, which is too large to solve on most single desktops or workstations. The direct solver would require well over 100 GB of RAM to solve this problem in memory. We will use the default PETSc preconditioned conjugate gradient (PCG) algorithm to solve this device, which requires much less memory per process.
The above example used an Alces Flight Cluster on Amazon Web Services (AWS). After the cluster was created, AWS provided a public IPv4 address for the login node. The login node is typically a small instance that only manages the Slurm job scheduler and does not do any of the simulation work.
The left image above shows the dialog after all of the fields have been entered. The default username of the login node in an Alces Flight Cluster is “alces”. The SSH private key file provided to AWS for creating the cluster was “awskeypair.pem”. When the Connect button is pressed the SSH connection to the AWS cluster is established. The dialog is now ready to begin submitting jobs, as shown on the right.
This cluster was created with four compute nodes, each of which has 36 physical cores (or 72 vCPUs in AWS terminology, since each core is hyper-threaded) and 144 GB of RAM (4 GB per core). As a rule of thumb, the PETSc PCG algorithm needs at least 0.5 GB of RAM per MPI rank per million vertices, so the maximum number of MPI ranks per node for this template on this hardware is 12 (12*0.5*20 = 120 GB < 144 GB). This determines the job parameters. The maximum run time should be a value greater than the run will take if successful. Slurm will kill the job if it exceeds this value, which is useful when runs malfunction. The job name can be any string up to 10 characters long. The standard console output from the run will be saved in a log file with the name <job name>.<job ID>, and downloaded with the rest of the results files.
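The memory rule of thumb above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and the 0.5 GB figure is the approximate lower bound quoted above, not a measured value.

```python
def pcg_memory_gb(ranks_per_node, million_vertices, gb_per_rank_per_million=0.5):
    """Estimated RAM (GB) needed on one node by the PETSc PCG solver."""
    return ranks_per_node * gb_per_rank_per_million * million_vertices


def fits_on_node(ranks_per_node, million_vertices, node_ram_gb):
    """True if the estimate stays below the RAM available on one node."""
    return pcg_memory_gb(ranks_per_node, million_vertices) < node_ram_gb


if __name__ == "__main__":
    # Values from the example: 20 million unknowns, 144 GB of RAM per node.
    estimate = pcg_memory_gb(12, 20)
    print(f"12 ranks per node: {estimate:.0f} GB, fits: {fits_on_node(12, 20, 144)}")
```

By the same estimate, using all 36 cores of a node as MPI ranks would need roughly 360 GB, so for this template the available memory limits the rank count, not the core count.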
Press the Submit button to send the job to the remote job scheduler. A dialog will announce when the submission has completed, and the job ID returned by Slurm will appear in the console. If the job submission fails, then most likely there is a problem with the job scheduler on the cluster.
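Slurm acknowledges an accepted job with a line of the form “Submitted batch job <ID>”, which is the line that appears in the console. For readers who script around the scheduler themselves, a minimal Python sketch of extracting the ID from that text follows. The function name is hypothetical, and this is not SYMMIC's own code.

```python
import re


def parse_job_id(console_text):
    """Return the Slurm job ID found in the console text, or None if absent."""
    match = re.search(r"Submitted batch job (\d+)", console_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(parse_job_id("Submitted batch job 2"))
    # An empty console (no acknowledgement line) yields None.
    print(parse_job_id(""))
```

A result of None corresponds to the failure case described above, where the scheduler did not accept the job.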
Now that the job has been submitted, there is no longer any need to maintain a connection to the remote cluster. The job will run if and when resources and priority constraints allow. So the user has three options at this point: (1) check the job queue status, (2) submit another job, or (3) disconnect and/or leave the dialog. To submit another job, simply browse (with the “...” button by the Template field at the top of the dialog) to the next template to solve, specify the job parameters, and press the Submit button again.
When the dialog is initially opened and connected to the remote cluster, one can proceed directly to checking the job queue status. The image below shows the status of job #2, submitted above. It is currently running on 4 nodes, with an elapsed wall-clock time of 1:15 (1 minute 15 seconds).
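The elapsed time in the status listing can be converted to seconds for comparison against the maximum run time. The sketch below assumes the value follows Slurm's usual layout of days-hours:minutes:seconds, with days and hours shown only when needed. The function name is hypothetical.

```python
def elapsed_seconds(text):
    """Convert a Slurm-style elapsed time such as '1:15' to seconds."""
    days = 0
    if "-" in text:
        # A leading 'D-' prefix carries the number of whole days.
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(field) for field in text.split(":")]
    # Pad missing leading fields (hours, minutes) with zeros.
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    print(elapsed_seconds("1:15"))  # the status shown above: 75 seconds
```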
As an aside, in the console window after the “Submitted batch job 2” line are three lines of text regarding the file “.slurm_jobs”. This file resides on the remote cluster in the working directory and contains the status of every job submitted and running. When the cluster is started and the first job has been submitted, as in this case, the file does not yet exist, so it is created. This file is updated every time the Status button is pressed, by first downloading it, rewriting it, and then uploading it again. This is how SYMMIC can determine whether a job has completed and is ready for downloading.
It is important to be consistent with the Working directory (i.e. use the same directory every time), because that is where the .slurm_jobs file is written. Although pressing the Status button will always show all of the active runs, when runs complete only those submitted in the current working directory will download.
If Status is pressed and a completed (and not-yet-downloaded) job exists, then a message box will ask the user if the job files should be downloaded now. If “No” is pressed, then the user will be asked again the next time that Status is pressed (since the job will remain listed in the .slurm_jobs file). The template and other files associated with the job are downloaded as well. All downloaded files will overwrite any local files of the same name in the current directory. After the files for a completed job have been downloaded, the run files are deleted on the remote cluster.
If the download fails for one or more files, then the job entry will remain in the .slurm_jobs file and the user will be prompted with a dialog that reads “Not all files were downloaded successfully. Clean up job anyways?”. It is usually advisable to select “Yes” so that the job is removed from the .slurm_jobs file and the user is not prompted to download the files again the next time the Status button is pressed. A typical reason for a download to fail is that xSYMMIC did not run to completion, and therefore did not produce all of the expected files. One should examine the run log file (<job name>.<job ID>, which should still have downloaded correctly) to determine the problem.
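The prompts described above amount to a small decision procedure that runs each time Status is pressed. The Python sketch below models that logic for illustration only. The Job class and the three callbacks (standing in for the user's answers and the download result) are hypothetical and do not reflect how SYMMIC is implemented.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    completed: bool


def handle_status(jobs, wants_download, download_ok, wants_cleanup):
    """Return the jobs still listed after one press of the Status button."""
    remaining = []
    for job in jobs:
        if not job.completed or not wants_download(job):
            # Still running, or the user answered "No": ask again next time.
            remaining.append(job)
        elif download_ok(job) or wants_cleanup(job):
            # All files downloaded, or the user chose to clean up anyway:
            # the entry is removed from the list.
            continue
        else:
            # Download failed and the user declined cleanup: entry stays.
            remaining.append(job)
    return remaining


if __name__ == "__main__":
    jobs = [Job(2, completed=True), Job(3, completed=False)]
    left = handle_status(jobs, lambda j: True, lambda j: False, lambda j: True)
    print([job.job_id for job in left])
```

In this usage example, job 2 fails to download but is cleaned up anyway, so only the still-running job 3 remains listed.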
Once the job files have been downloaded, the user may use Load solution... from the Solve menu to view the results. Note that the template file must be open in SYMMIC for the Load solution... menu item to be available, and for the solution to display correctly. The recorded values file will also have been downloaded to the same directory as the original template file. These values can be viewed through the Results > Record values... menu item.
© Copyright 2007-2019 CapeSym, Inc. | 6 Huron Dr. Suite 1B, Natick, MA 01760, USA | +1 (508) 653-7100