How to use TensorFlow and TensorBoard#
This short guide describes how to use TensorBoard on Izar or Kuma through SSH port forwarding.
Run TensorBoard on the cluster#
First, connect to the cluster and load the necessary modules:
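For example (a minimal sketch; replace `<USERNAME>` with your account, and use `kuma.hpc.epfl.ch` to connect to Kuma instead):

```bash
ssh <USERNAME>@izar.hpc.epfl.ch
module load gcc python openmpi py-tensorflow
```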
The script below is a template that starts TensorBoard and launches a TensorFlow Python script on a compute node. You can copy it into a file called `launch_tensorboard.sh`, for example. It has to be placed in your `/home` directory (or you have to modify the paths accordingly). Note that all the modules TensorFlow may need have to be loaded in this script; as an example, we load here the modules that let us use TensorFlow and TensorBoard:
- On Izar:
```bash
#!/bin/bash -l
#SBATCH --job-name=tensorboard-trial
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output tensorboard-log-%J.out

module load gcc python openmpi py-tensorflow

# Pick a random port, start TensorBoard in the background,
# then run the training script
ipnport=$(shuf -i8000-9999 -n1)
tensorboard --logdir logs --port=${ipnport} --bind_all &
python example.py
```
- On Kuma:
```bash
#!/bin/bash -l
#SBATCH --job-name=tensorboard-trial
#SBATCH --nodes=1
#SBATCH -p h100
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output tensorboard-log-%J.out

module load gcc python openmpi py-tensorflow

# Pick a random port, start TensorBoard in the background,
# then run the training script
ipnport=$(shuf -i8000-9999 -n1)
tensorboard --logdir logs --port=${ipnport} --bind_all &
python example.py
```
For testing, an example of a Python script, `example.py`, is given here:

```python
import tensorflow as tf
import numpy as np

# Model parameters
W = tf.Variable([0.3], dtype=tf.float32, name="W")
b = tf.Variable([-0.3], dtype=tf.float32, name="b")

# Training data
x_train = np.array([1, 2, 3, 4], dtype=np.float32)
y_train = np.array([0, -1, -2, -3], dtype=np.float32)

# Create a summary writer for TensorBoard
log_dir = 'logs'
writer = tf.summary.create_file_writer(log_dir)

# Training loop
for i in range(1000):
    with tf.GradientTape() as tape:
        linear_model = W * x_train + b
        loss = tf.reduce_sum(tf.square(linear_model - y_train))
    # Compute gradients
    gradients = tape.gradient(loss, [W, b])
    # Update weights
    W.assign_sub(0.01 * gradients[0])
    b.assign_sub(0.01 * gradients[1])
    # Log the loss to TensorBoard
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=i)

# Close the writer
writer.close()

# Evaluate training accuracy
print("W: %s b: %s loss: %s" % (W.numpy(), b.numpy(), loss.numpy()))
```
Launch your job as usual:
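```bash
sbatch launch_tensorboard.sh
```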
After the job starts, inspect the output file named `tensorboard-log-[SLURM_ID].out`. Look for the line in which TensorBoard reports the node name and port it is listening on.
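It has roughly the following form (a sketch, not verbatim output; the TensorBoard version, node name, and port are specific to your job):

```
TensorBoard 2.x.x at http://<NODE NAME>:<PORT NUMBER>/ (Press CTRL+C to quit)
```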
Use TensorBoard on a local machine#
On your local machine, execute the following command, filling in the node name, port number, cluster name, and your username from the previous step:
```bash
ssh -L <PORT NUMBER>:<NODE NAME>.<CLUSTER NAME>.cluster:<PORT NUMBER> -l <USERNAME> <CLUSTER NAME>.hpc.epfl.ch -f -N
```
In our example (with a hypothetical Kuma node kh001, port 8465, and username jdoe; substitute the values from your own job output), this gives:
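```bash
# hypothetical node, port, and username -- substitute your own values
ssh -L 8465:kh001.kuma.cluster:8465 -l jdoe kuma.hpc.epfl.ch -f -N
```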
Now you should be able to access TensorBoard on the cluster compute node through your web browser by pasting the following address:
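```
http://localhost:<PORT NUMBER>
```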
For our hypothetical example, this gives:
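```
http://localhost:8465
```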