# Running COMSOL on our clusters
## Target audience
This how-to is meant for people with a working knowledge of COMSOL (e.g. on their laptop) who want to use the SCITAS clusters to run simulations larger than what is feasible on standard machines. It covers details like connecting to the clusters and launching your simulations.
This how-to is not meant for a total beginner, since it does not cover any of the basics of COMSOL and does not address creating models.
Restricted access
Access to COMSOL is restricted and the number of licenses is fairly limited. If you do need to use COMSOL you have to go through this page and choose the relevant version of COMSOL (for research or for teaching). This access is not managed by SCITAS.
## Launching COMSOL jobs on the clusters
There are two main ways of launching COMSOL on the clusters:
- in batch mode, i.e. independently from an active COMSOL session;
- from the COMSOL desktop environment on your own computer.
The first option is ideal for a cluster, since you can define and fine-tune many parameters directly on the command line that are hard or impossible to set through the GUI. As such, we will dedicate most of this document to that method: you save the model or models you want to study, send them to the cluster, run them there, and retrieve the files at the end.
A later section briefly goes through the steps to connect from the GUI directly to the cluster. As of this writing, many options are hard to change when submitting jobs from the GUI; you cannot easily change the number of cores used per job, for instance. With this method you may also find the COMSOL window blocked for the duration of the job, even though the job is running on a different machine.
### Launching COMSOL jobs in batch mode
Once you save your model, it needs to be uploaded to the cluster. Please check our documentation on how to transfer data to the cluster.
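As a minimal illustration (the user name, cluster address and target directory below are placeholders; see the data-transfer documentation for the details that apply to you), copying the model from your machine could look like this:

```bash
# copy the model to your scratch space on the cluster
scp my_model.mph <username>@<cluster>:/scratch/<username>/comsol_project/
```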
You'll also need a script to submit the job to the scheduler on the cluster. The script could be something along these lines, where `my_model.mph` is the COMSOL model and `$SLURM_CPUS_PER_TASK` is a Slurm variable that takes the value you set on `--cpus-per-task` (so you don't have to manually change two values at once):
```bash
#!/bin/bash -l
#SBATCH -J comsol-project
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30000
module load comsol/6.0
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -inputfile my_model.mph
```
After transferring the necessary files, log in to the cluster where you have your files and submit the job to the scheduler with `sbatch` (the script name below is just an example):
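```
$ sbatch comsol.run
Submitted batch job 16143086
```

(Here `comsol.run` is whatever name you gave to the submission script above.)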
If you did everything right, that last command should return a message with the batch job ID (16143086 in this example). The progress of the calculation will be written to a Slurm output file specific to the job you are running. For the job above you can take a quick look at the progress with a command like:
```
$ tail slurm-16143086.out
--------- Current Progress: 95 % - Constraint handling
Memory: 4465/4708 11477/11816
--------- Current Progress: 95 % - Creating multigrid hierarchy
Memory: 4142/4708 11151/11816
Iter SolEst Damping Stepsize #Res #Jac #Sol LinIt LinErr LinRes
--------- Current Progress: 95 % - Solving linear system
Memory: 4067/4708 11076/11816
1 0.0013 0.5000000 0.0013 22 11 11 259 0.00065 -
--------- Current Progress: 95 % - Assembling matrices
Memory: 3749/4708 10759/11816
```
This is similar to the output within the COMSOL GUI, but with some more information.
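If you would rather follow the output as COMSOL writes it, `tail -f` keeps the file open (press Ctrl-C to stop watching):

```bash
# follow the Slurm output file as it grows
tail -f slurm-16143086.out
```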
#### Optimizing your job script
While the example script above works fine, you may want to change some options. Adjusting the number of cores is trivial (just change the value on the `#SBATCH --cpus-per-task` line).
But there are other options that may be useful to change. By default COMSOL will store a lot of data in the `$HOME/.comsol` directory. This may be a problem, since at SCITAS the home directory is limited in size. Several options can be set to modify this behavior. The most important one concerns the recovery files, which are the biggest files stored in that directory. A script that stores those files directly on the node could look like this:
```bash
#!/bin/bash -l
#SBATCH -J comsol-project
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30000
module load comsol/6.0
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -recoverydir $TMPDIR/recovery -inputfile my_model.mph
```
Besides avoiding the size limit of your home, storing these files on the node also improves performance, since writing to `$TMPDIR` is faster than writing to the home directory.
Keep in mind that `$TMPDIR` is a directory that Slurm deletes at the end of the job. If the job finishes successfully the files stored there are not needed, so there's nothing to worry about. If you would rather keep the recovery files, a simple option is:
```bash
#!/bin/bash -l
#SBATCH -J comsol-project
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30000
module load comsol/6.0
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -recoverydir $PWD/recovery.@process_id -inputfile my_model.mph
```
The `@process_id` is replaced by a number specific to the job, so multiple jobs can run at the same time without the risk of overwriting each other's files.
And since you likely want to maximize the performance, you can store every temporary file related to the job on the node. Another option is to turn off the autosave feature of the recovery files, since it's unlikely the nodes of the cluster will fail. To do that you can use something like:
```bash
#!/bin/bash -l
#SBATCH -J comsol-project
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30000
module load comsol/6.0
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -autosave off -data $TMPDIR/data -configuration $TMPDIR/configuration -prefsdir $TMPDIR/prefs -tmpdir $TMPDIR/tmp -inputfile my_model.mph
```
By doing this you can gain a few percentage points of performance relative to the standard calculation.
By default COMSOL will update the model file in place, but you may also find it interesting to separate the input model from the output model. In the example below, the input file is read, but nothing is written to it:
```bash
#!/bin/bash -l
#SBATCH -J comsol-project
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30000
module load comsol/6.0
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -inputfile my_model.mph -outputfile my_model_solved.mph
```
You're all set. You can now run any COMSOL job you want. A word of caution about performance, though...
### Performance and scaling
You are perhaps tempted to think that with 40 cores per node your jobs will run 10 times faster than on your 4-core laptop. Unfortunately, things don't work that way: parallelization is never perfect and COMSOL is not the most efficient code out there.
As such, before you start submitting your jobs to full nodes on our clusters, you should perhaps test the scalability of your calculations. Some methods parallelize better than others and for those that do you can scale to more cores without a big penalty. Other methods, or parts of calculations, will run large segments on only one core or a few cores and this can significantly impact the performance. Part of being a proficient user of a code is knowing its strengths and limitations.
Submit a job that is representative of what you'll be doing (perhaps close to the final model, but with a coarser mesh) with `--cpus-per-task` set to 2, 4, 8 and perhaps even 16. If at each step the time drops to close to half of the previous test, the scaling is still good. If at some point the performance degrades significantly (e.g. rather than 50% of the time, it takes 75% of the time of the previous test), then it probably makes little sense to use that many cores.
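A minimal sketch of such a test, assuming your submission script is called `comsol.run` and passes `${SLURM_CPUS_PER_TASK}` to `-np` as in the examples above (options given on the `sbatch` command line override the `#SBATCH` directives inside the script):

```bash
# submit the same model with an increasing number of cores and compare the run times
for n in 2 4 8 16; do
    sbatch --cpus-per-task=${n} --job-name=scaling-${n}-cores comsol.run
done
```

Comparing the elapsed time of each job then shows where the scaling starts to break down.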
And as mentioned earlier, storing temporary files on the node (and not in your home), as well as deactivating unnecessary options (e.g. turning off the recovery files with `-autosave off`), helps reduce the time to solution. And if you're planning on doing many simulations, the gains add up.
### Launching COMSOL jobs from the desktop environment
If you're not ready to deal with the details described above and are willing to accept the limitations imposed by running from the GUI, COMSOL Desktop can be configured to submit jobs to the clusters. To be clear, this is COMSOL running on your computer, but launching jobs that run on the cluster. The jobs run independently of your computer, but as long as the connection to the cluster is up COMSOL will keep updating some information about the job.
As of August 2022, this page on COMSOL's website describes in detail how to configure COMSOL to connect to a cluster. That page assumes you are using a Windows machine and have done a full installation of PuTTY with one of the Windows installers. If you need help getting started with PuTTY, the first couple of sections of this documentation should be enough for COMSOL. Since we use the same scheduler as the one mentioned on the COMSOL page (i.e. Slurm), most of the instructions are valid. The section Settings for the Cluster Computing Node has a few details that are specific to each cluster. You'll want to set most of these directly in the Preferences menu, so that they become the defaults for all jobs and not just the current one.
After you open the Preferences, go to the Multicore and Cluster Computing section and change the following:
- Number of cores: 4
- Additional scheduler arguments: `--cpus-per-task=4 --time=72:00:00 --mem=30000` (choose values that fit your needs)
- User: leave this blank
- Queue name: `standard`
- Batch directory: A directory on your own computer where the files will be stored (e.g. a COMSOL directory within your Documents)
- External COMSOL batch directory path: A directory within your scratch; in my case it could be `/scratch/ddossant/comsol/outputfiles`
- External COMSOL installation directory path: For COMSOL 6.0 use `/ssoft/spack/external/comsol/6.0`
And then still within Preferences you need to edit Remote Computing:
- Activate Run Remote
- Remote invoke command: `SSH`
- SSH command: `Putty`
- SSH directory: path to your PuTTY installation (probably `C:\Program Files\PuTTY`)
- SSH key file: path to the SSH key you created on your own computer (probably stored in your Documents as a file with a `.ppk` extension). You need to add the public part of this key to the cluster!
- SSH user: your GASPAR user name
- File transfer command: `SCP`
- SCP command: `Putty`
- SCP directory: likely the same as the SSH directory above
- SCP key file: the same as the SSH key file above
- Remote hosts list: `Table`
- Remote hosts: `jed` or, if you have access to it, `helvetios`
- Remote OS: `Linux`
With these options set and after adding a Cluster Computing node to your study, you should be able to run your jobs directly on the cluster when you hit Compute.
There is no easy way to adjust the number of cores, the amount of RAM, or the total time requested from Slurm. You'll need to change the values of Additional scheduler arguments in Preferences and launch the job again.
## Troubleshooting
This section is a work in progress. If you think you have solved an issue that may be relevant to others, contact us so we can add that information here.
### Job fails with an out-of-memory error
One of the lines of your batch script is of the form `#SBATCH --mem=10000`. If the value you set was too small for the job you tried to run, you may see an error like:
```
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 4819 RUNNING AT f104
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
slurmstepd: error: Detected 1 oom-kill event(s) in step 16148135.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
There is no way of telling COMSOL how much RAM to use: it will see all the RAM available on the node and feel free to use it. However, we configure our clusters with control groups (or cgroups, as in the output above) and we limit the amount of RAM the program can use to what you asked for on the `#SBATCH --mem=10000` line. If you see this kind of message you'll need to increase the amount of RAM you ask for in your Slurm script. If you launch your jobs through the GUI you need to change the memory value in the Additional scheduler arguments in Preferences.
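If you are unsure how much to ask for, Slurm's accounting database records how much memory a finished (or killed) job actually used; the job ID below is the one from the error message above:

```bash
# ReqMem shows what was requested, MaxRSS the peak memory actually used by the step
sacct -j 16148135 --format=JobID,ReqMem,MaxRSS,Elapsed,State
```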
### SSH path is not a directory
If you've configured the COMSOL Desktop installation on your computer to be able to send jobs to a cluster (as explained above) and you then opt to submit jobs directly to the batch scheduler, some of the options on the GUI may interfere with the correct functioning of the batch job. If you encounter this error:
```
$ tail slurm-16147807.out
----- Time-Dependent Solver 1 in fpt time/Solution 2 (sol2) ------------------->
Running: Study 2
/******************/
/*****Error********/
/******************/
The SSH path is not a directory
Saving model: /scratch/ddossant/comsol_project/my_model_solved.mph
Save time: 2 s.
Total time: 681 s.
```
You need to open your model in the GUI and delete the Cluster Computing section from the study. Cluster Computing is meant for submitting jobs directly from the GUI to the cluster, not for jobs launched directly on the cluster.
### License errors
The school has a relatively small number of COMSOL licenses. Your job may fail with an error like:
```
/******************/
/*****Error********/
/******************/
Could not obtain license for COMSOL ...
License error: -5.
No such product exists.
Feature: COMSOL
License path: ...
FlexNet Licensing error:-5,414
```
One common issue is caused by having COMSOL running on your laptop/workstation using the tokens for the same features you need for your jobs. Please close any COMSOL sessions you are not actively using and try again.
The exact number of licenses depends on the specific feature you're trying to use. If the number of jobs you can run seems a bit erratic (sometimes you can launch more, sometimes fewer), remember that you are also sharing the licenses with other users. More importantly, the license server doesn't check the licenses in real time, but at regular intervals. This means that you may be able to launch a few jobs if they all start in a narrow window between two checks by the server, but if they happen to start at different moments they could fail for lack of licenses.
A possible workaround is to run several simulations within the same job. This is explained in a bit more detail on this COMSOL webpage, but suffice it to say that several calculations within one job should work. An example script for this could be:
```bash
#!/bin/bash -l
#SBATCH -J comsol-project
#SBATCH --time=24:00:00
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --mem=120000
module load comsol/6.0
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -inputfile model_1.mph -outputfile model_1_solved.mph &
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -inputfile model_2.mph -outputfile model_2_solved.mph &
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -inputfile model_3.mph -outputfile model_3_solved.mph &
comsol batch slurm -np ${SLURM_CPUS_PER_TASK} -inputfile model_4.mph -outputfile model_4_solved.mph &
wait
```
Each of these calculations would use 8 cores and, according to COMSOL, this approach should allow you to run several simulations at once.
### Submitting multiple jobs
If it seems cumbersome to launch only one job at a time you can try setting up Slurm dependencies between the jobs. Suppose you typically launch a job with something like this (the script name and job ID below are only examples):
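```
$ sbatch comsol.run
Submitted batch job 16148000
```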
When you submit your next job you can tell Slurm that it should depend on the first job having finished before the new one becomes eligible to run. You can do this with:
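```
$ sbatch --dependency=afterany:16148000 comsol.run
```

Here `afterany` means the new job waits until job 16148000 has ended, whether it succeeded or not; `afterok` would instead require the first job to finish successfully.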
This won't avoid issues if you're competing for licenses with other users, but at least you avoid the situation where your own jobs cause your subsequent calculations to fail.
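If you have a whole series of models to run, you can chain the submissions in a small loop. This is only a sketch: `--parsable` makes `sbatch` print just the job ID, which is then passed to the next submission, and the script names are placeholders:

```bash
# submit the first job, then make each following job wait for the previous one
previous=$(sbatch --parsable comsol_1.run)
for script in comsol_2.run comsol_3.run comsol_4.run; do
    previous=$(sbatch --parsable --dependency=afterany:${previous} ${script})
done
```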