How to use TensorFlow and TensorBoard#

This short guide describes how to use TensorBoard on Izar or Kuma through SSH port forwarding.

Run TensorBoard on the cluster#

First, connect to the cluster and load the necessary modules:
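For example, to connect to Izar (the login address follows the [CLUSTER NAME].hpc.epfl.ch pattern used later in this guide; replace <USERNAME> with your own username, and use kuma.hpc.epfl.ch for Kuma):

$ ssh <USERNAME>@izar.hpc.epfl.ch

Once connected, load the modules: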

$ module load gcc python openmpi py-tensorflow
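You can quickly check that the modules were loaded correctly and that TensorBoard and TensorFlow are available (the versions reported on your system may differ):

$ tensorboard --version
$ python -c "import tensorflow as tf; print(tf.__version__)"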
The script below is a template that starts TensorBoard and launches a TensorFlow Python script on a compute node. You can copy it into a file called launch_tensorboard.sh, for example. It is meant to be placed in your /home directory (otherwise adjust it accordingly). Note that every module TensorFlow may need has to be loaded in the script; here we load the same modules as above:

  • On Izar:
    #!/bin/bash -l
    #SBATCH --job-name=tensorboard-trial
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1
    #SBATCH --time=00:10:00
    #SBATCH --output tensorboard-log-%J.out
    
    module load gcc python openmpi py-tensorflow
    
    # Pick a random port between 8000 and 9999 for TensorBoard
    ipnport=$(shuf -i8000-9999 -n1)
    # Start TensorBoard in the background so that the training script below can run
    tensorboard --logdir logs --port=${ipnport} --bind_all &
    
    python example.py
    
  • On Kuma:
    #!/bin/bash -l
    #SBATCH --job-name=tensorboard-trial
    #SBATCH --nodes=1
    #SBATCH -p h100
    #SBATCH --gres=gpu:1
    #SBATCH --time=00:10:00
    #SBATCH --output tensorboard-log-%J.out
    
    module load gcc python openmpi py-tensorflow
    
    # Pick a random port between 8000 and 9999 for TensorBoard
    ipnport=$(shuf -i8000-9999 -n1)
    # Start TensorBoard in the background so that the training script below can run
    tensorboard --logdir logs --port=${ipnport} --bind_all &
    
    python example.py
    
    For testing, an example of a Python script, example.py, is given here:
    import tensorflow as tf
    import numpy as np
    
    # Model parameters
    W = tf.Variable([0.3], dtype=tf.float32, name="W")
    b = tf.Variable([-0.3], dtype=tf.float32, name="b")
    
    # Training data
    x_train = np.array([1, 2, 3, 4], dtype=np.float32)
    y_train = np.array([0, -1, -2, -3], dtype=np.float32)
    
    # Create a summary writer for TensorBoard
    log_dir = 'logs'
    writer = tf.summary.create_file_writer(log_dir)
    
    # Training loop
    for i in range(1000):
        with tf.GradientTape() as tape:
            linear_model = W * x_train + b
            loss = tf.reduce_sum(tf.square(linear_model - y_train))
    
        # Compute gradients
        gradients = tape.gradient(loss, [W, b])
    
        # Update weights
        W.assign_sub(0.01 * gradients[0])
        b.assign_sub(0.01 * gradients[1])
    
        # Log the loss to TensorBoard
        with writer.as_default():
            tf.summary.scalar('loss', loss, step=i)
    
    # Close the writer
    writer.close()
    
    # Print the final parameters and loss
    print("W: %s b: %s loss: %s" % (W.numpy(), b.numpy(), loss.numpy()))
    

Launch your job as usual:

$ sbatch launch_tensorboard.sh
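You can check that the job has started before looking for the TensorBoard address, for example with the standard Slurm command:

$ squeue -u $USER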

Once the job is running, inspect the output file named tensorboard-log-[SLURM_ID].out. Look for a line similar to the following (here on Kuma):

TensorBoard 2.16.2 at http://kh045.kuma.cluster:8263/ (Press CTRL+C to quit)

It has the form:

http://[NODE NAME].[CLUSTER NAME].cluster:<PORT NUMBER>
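Rather than scanning the file by hand, you can extract this line directly (assuming the output file name used in the script above):

$ grep "TensorBoard" tensorboard-log-*.out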

Use TensorBoard on a local machine#

On your local machine, execute the following command, filling in the node name, cluster name, and port number found in the previous step:

ssh -L <PORT NUMBER>:[NODE NAME].[CLUSTER NAME].cluster:<PORT NUMBER> -l <USERNAME> [CLUSTER NAME].hpc.epfl.ch -f -N

The -L option forwards the chosen local port to the compute node through the login node, -f sends ssh to the background, and -N tells it not to run a remote command.

In our example, this gives:

ssh -L 8263:kh045.kuma.cluster:8263 -l peyberne kuma.hpc.epfl.ch -f -N
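Since -f -N leaves the tunnel running in the background, you may want to close it once you are done. One way, assuming the port from our example, is to terminate the forwarding process:

$ pkill -f "ssh -L 8263"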

You should now be able to reach TensorBoard on the compute node from your web browser by opening the following address:

http://[NODE NAME].[CLUSTER NAME].cluster:<PORT NUMBER>/

For our example, this gives:

http://kh045.kuma.cluster:8263/