4. Multilayer Perceptron
One of the simplest neural network architectures is the multilayer perceptron. Fashion MNIST is one of the simplest commonly available datasets. We get started with “real” neural networks in PyTorch by training a multilayer perceptron on Fashion MNIST 😎.
Before going into the actual ML-related parts, we typically have to work on our data first. Preparing the data before doing anything in a machine learning framework might turn out to be one of the most time-consuming tasks. This is especially true for “unique” datasets common in scientific machine learning, where we often have to find custom solutions. Luckily, Fashion MNIST is one of the standard examples and somebody else, i.e., Zalando SE, prepared everything for us. Still, we have to find a way to load our data into PyTorch and to potentially preprocess and postprocess it. Once all of this infrastructure is in place, we’ll define our model and train it.
Hint
The Quickstart tutorial conceptually does the same thing as this lab: Train a multilayer perceptron on Fashion MNIST.
4.1. Datasets and Data Loaders
As in Section 3, we’ll first have a brief look at our data. Then we wrap the Fashion MNIST dataset in a data loader.
Tasks
Create a training and test dataset by calling torchvision.datasets.FashionMNIST. Use the transformation torchvision.transforms.ToTensor() to convert the data to PyTorch tensors.
Visualize a few images and show their labels. Use Matplotlib for your plots. Save your visualizations in a PDF file by using PdfPages. An example is given in the Multipage PDF demo.
Wrap the datasets into data loaders. Use torch.utils.data.DataLoader for this.
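As a rough sketch, the first and third task could be approached as follows. The root directory data, the batch size of 64, and the variable names are assumptions, not requirements:

import torch
import torchvision

# training and test splits of Fashion MNIST (storage location "data" is an assumption)
l_data_train = torchvision.datasets.FashionMNIST( root      = "data",
                                                  train     = True,
                                                  download  = True,
                                                  transform = torchvision.transforms.ToTensor() )
l_data_test  = torchvision.datasets.FashionMNIST( root      = "data",
                                                  train     = False,
                                                  download  = True,
                                                  transform = torchvision.transforms.ToTensor() )

# wrap the datasets into data loaders (batch size 64 is an assumption)
l_loader_train = torch.utils.data.DataLoader( l_data_train,
                                              batch_size = 64,
                                              shuffle    = True )
l_loader_test  = torch.utils.data.DataLoader( l_data_test,
                                              batch_size = 64 )

For the second task, a few images could be written to a PDF along these lines; the file name data_samples.pdf and the number of plotted images are assumptions:

from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

with PdfPages( "data_samples.pdf" ) as l_pdf_file:
    for l_id in range( 4 ):
        l_image, l_label = l_data_train[l_id]
        l_fig = plt.figure()
        plt.imshow( l_image.squeeze(), cmap = "gray" )
        plt.title( "label: " + str(l_label) )
        l_pdf_file.savefig( l_fig )
        plt.close( l_fig )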
4.2. Training and Validation
We are ready to train our first “real” neural net in PyTorch! We’ll train a MultiLayer Perceptron (MLP). Our MLP assumes input images with 28 \(\times\) 28 pixels and 10 output classes. Initially, our MLP will use the following network architecture:
Fully-connected layer with 28 \(\times\) 28 input features and 512 output features;
ReLU activation;
Fully-connected layer with 512 input features and 512 output features;
ReLU activation; and
Fully-connected layer with 512 input features and 10 output features.
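A possible realization of this architecture as a torch.nn.Module is sketched below. Treat it as orientation only; the class name, member names, and the use of torch.nn.Sequential are assumptions, not the required implementation:

import torch

## Multilayer perceptron for Fashion MNIST (sketch, names are assumptions).
class Model( torch.nn.Module ):
    def __init__( self ):
        super().__init__()
        # flatten the 28 x 28 input images to vectors with 784 entries
        self.m_flatten = torch.nn.Flatten()
        # fully-connected layers with ReLU activations in between
        self.m_layers = torch.nn.Sequential( torch.nn.Linear( 28*28, 512 ),
                                             torch.nn.ReLU(),
                                             torch.nn.Linear( 512, 512 ),
                                             torch.nn.ReLU(),
                                             torch.nn.Linear( 512, 10 ) )

    ## Forward pass of the model.
    # @param i_input batch of input images.
    # @return logits for the ten classes.
    def forward( self, i_input ):
        l_tmp = self.m_flatten( i_input )
        return self.m_layers( l_tmp )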
For the training procedure we’ll also have to think about a loss function and an optimizer. For now:
Use torch.nn.CrossEntropyLoss as your loss function; and
Use torch.optim.SGD as your optimizer.
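Instantiating the two could look as follows; the learning rate is an assumption which you will likely want to tune:

import torch

l_model = Model()

# cross-entropy loss for the ten-class classification problem
l_loss_func = torch.nn.CrossEntropyLoss()

# stochastic gradient descent on the model's parameters (learning rate is an assumption)
l_optimizer = torch.optim.SGD( l_model.parameters(),
                               lr = 1e-3 )

Next, let’s structure the training and testing procedures. The following two listings provide templates: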
## Trains the given MLP model.
# @param i_loss_func used loss function.
# @param io_data_loader data loader containing the data to which the model is applied (single epoch).
# @param io_model model which is trained.
# @param io_optimizer used optimizer.
# @return summed loss over all training samples.
def train( i_loss_func,
           io_data_loader,
           io_model,
           io_optimizer ):
    # switch model to training mode
    io_model.train()

    l_loss_total = 0

    # TODO: finish implementation

    return l_loss_total
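For orientation, the TODO in the training template could be completed along the following lines. This is one possible sketch; it assumes that the data loader yields batches of (inputs, labels):

# possible completion of the TODO in train (assumes batches of (inputs, labels))
for l_inputs, l_labels in io_data_loader:
    # forward pass
    l_predictions = io_model( l_inputs )
    l_loss = i_loss_func( l_predictions, l_labels )
    l_loss_total += l_loss.item()

    # backward pass and parameter update
    io_optimizer.zero_grad()
    l_loss.backward()
    io_optimizer.step()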
import torch

## Tests the model.
# @param i_loss_func used loss function.
# @param io_data_loader data loader containing the data to which the model is applied.
# @param io_model model which is tested.
# @return summed loss over all test samples, number of correctly predicted samples.
def test( i_loss_func,
          io_data_loader,
          io_model ):
    # switch model to evaluation mode
    io_model.eval()

    l_loss_total = 0
    l_n_correct = 0

    # TODO: finish implementation

    return l_loss_total, l_n_correct
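The TODO in the test template could be sketched analogously; gradient tracking is disabled as recommended in the hint below:

# possible completion of the TODO in test (assumes batches of (inputs, labels))
with torch.no_grad():
    for l_inputs, l_labels in io_data_loader:
        l_predictions = io_model( l_inputs )
        l_loss_total += i_loss_func( l_predictions, l_labels ).item()

        # the class with the highest score is the prediction
        l_n_correct += ( l_predictions.argmax(1) == l_labels ).sum().item()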
Listing 4.2.1 and Listing 4.2.2 contain two templates for the modules eml.mlp.trainer and eml.mlp.tester. We’ll use these to avoid spaghetti code by separating our training and testing procedures from the main code.
Tasks
Implement the class Model in the module eml.mlp.model which contains the MultiLayer Perceptron (MLP).
Implement the training loop and print the total training loss after every epoch. For the time being, implement the training loop directly in your main function.
Move the training loop to the module eml.mlp.trainer. Use the template in Listing 4.2.1 to guide your implementation.
Implement the module eml.mlp.tester. Use the template in Listing 4.2.2 to guide your implementation. The module’s only function test simply applies the MLP to the given data and returns the obtained total loss and number of correctly predicted samples.
Hint
When testing your model, switch to evaluation mode through nn.Module.eval() and locally disable gradient tracking through torch.no_grad. Don’t forget to switch back to training mode afterwards if needed. Further information is available from the article Autograd mechanics.
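In code, the pattern from the hint boils down to a few lines; l_inputs stands for an arbitrary input batch:

io_model.eval()        # switch to evaluation mode
with torch.no_grad():  # locally disable gradient tracking
    l_predictions = io_model( l_inputs )
io_model.train()       # switch back to training mode if needed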
4.3. Visualization
Now let’s once again professionalize our visualization efforts. This means that we provide some data to our MLP and infer Fashion MNIST’s classes by applying our trained model. The input data are then plotted together with the obtained labels.
import torch
import matplotlib.pyplot as plt

## Converts a Fashion MNIST numeric id to a string.
# @param i_id numeric value of the label.
# @return string corresponding to the id.
def toLabel( i_id ):
    l_labels = [ "T-Shirt",
                 "Trouser",
                 "Pullover",
                 "Dress",
                 "Coat",
                 "Sandal",
                 "Shirt",
                 "Sneaker",
                 "Bag",
                 "Ankle Boot" ]

    return l_labels[i_id]

## Applies the model to the data and plots the data.
# @param i_off offset of the first image.
# @param i_stride stride between the images.
# @param io_data_loader data loader from which the data is retrieved.
# @param io_model model which is used for the predictions.
# @param i_path_to_pdf optional path to an output file; if given, nothing is shown at runtime.
def plot( i_off,
          i_stride,
          io_data_loader,
          io_model,
          i_path_to_pdf = None ):
    # switch to evaluation mode
    io_model.eval()

    # create pdf if required
    if( i_path_to_pdf is not None ):
        import matplotlib.backends.backend_pdf
        l_pdf_file = matplotlib.backends.backend_pdf.PdfPages( i_path_to_pdf )

    # TODO: finish implementation

    # close pdf if required
    if( i_path_to_pdf is not None ):
        l_pdf_file.close()
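The TODO could, for instance, be filled by walking over the dataset behind the data loader with the given offset and stride. The sketch below is one possibility; it assumes that the underlying dataset yields (image, label) pairs:

# possible completion of the TODO in plot
l_dataset = io_data_loader.dataset

with torch.no_grad():
    for l_id in range( i_off, len(l_dataset), i_stride ):
        l_image, _ = l_dataset[l_id]

        # predict the label (a batch dimension is added for the model)
        l_prediction = io_model( l_image.unsqueeze(0) ).argmax(1).item()

        l_fig = plt.figure()
        plt.imshow( l_image.squeeze(), cmap = "gray" )
        plt.title( "prediction: " + toLabel( l_prediction ) )
        if( i_path_to_pdf is not None ):
            l_pdf_file.savefig( l_fig )
        else:
            plt.show()
        plt.close( l_fig )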
Tasks
Implement a Fashion MNIST visualization module in eml.vis.fashion_mnist. Use the template in Listing 4.3.1 to guide your implementation. The module’s function plot takes the argument i_off for the offset of the first visualized image and the argument i_stride for the stride between images. For example, if i_off=5 and i_stride=17, the function would plot the images with ids 5, 22, 39, and so on.
Monitor your training process by visualizing the test data after every ten epochs. Use the stride feature of eml.vis.fashion_mnist.plot to keep the file sizes small.
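A call every tenth epoch could look as follows; the variable names and the chosen offset, stride, and file name are assumptions carried over from the earlier sketches:

if( (l_epoch + 1) % 10 == 0 ):
    eml.vis.fashion_mnist.plot( i_off          = 0,
                                i_stride       = 500,
                                io_data_loader = l_loader_test,
                                io_model       = l_model,
                                i_path_to_pdf  = "test_epoch_" + str(l_epoch + 1) + ".pdf" )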
4.4. Batch Jobs
All puzzle pieces for training the MLP are in place.
Further, in Section 1 we have seen that we can get a node from the cluster through the salloc command.
This is great!
We can already use dedicated resources to train our MLP.
However, interactive resources require us to monitor the training process manually.
What happens if we lose the connection to the machine?
Can we still grab a coffee, or should we wait until completion and duly release our compute node afterwards?
This is inconvenient: Our software is mature enough to run on its own 💪.
Batch jobs help us with exactly this issue! In simple words, we can write a shell script which automatically starts our training once resources of the machine are available and releases the occupied node(s) once we are done.
#!/usr/bin/env bash
##
# Example Draco job script.
##
#SBATCH --job-name=mlp_training
#SBATCH --output=mlp_training_%j.out
#SBATCH -p short
#SBATCH -N 1
#SBATCH --cpus-per-task=96
#SBATCH --time=01:00:00
#SBATCH --mail-type=all
#SBATCH --mail-user=alex.breuer@uni-jena.de

echo "submit host:"
echo $SLURM_SUBMIT_HOST
echo "submit dir:"
echo $SLURM_SUBMIT_DIR
echo "nodelist:"
echo $SLURM_JOB_NODELIST

# activate conda environment
module load tools/anaconda3/2021.05
source "$(conda info -a | grep CONDA_ROOT | awk -F ' ' '{print $2}')"/etc/profile.d/conda.sh
conda activate pytorch_x86

# train MLP
cd $HOME/mlp_fashion_mnist
export PYTHONUNBUFFERED=TRUE
python mlp_fashion_mnist.py
An exemplary job script is given in Listing 4.4.1. The script is separated into two parts:
A section sharing details of the job with the job scheduler, i.e., Slurm. Everything related to Slurm starts with #SBATCH. In this example we ask for a single node in the “short” queue for an hour. Further, we specify the name of the job, the location for the job’s output, and request emails on all changes of the job status. Now, when the job is starting (or done), we conveniently get an email. Ok, to be precise, it’s not you who gets these emails in the example script, so please do adjust this part. 😅
A section with the commands which are executed once the job starts running. This includes not only the pure call to Python but also everything else we would do in an interactive job. For example, we might want to load a customized Conda environment.
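Assuming the script is stored as mlp_fashion_mnist.slurm (the file name is an assumption), it is submitted and monitored with the standard Slurm commands:

# submit the job script to the scheduler
sbatch mlp_fashion_mnist.slurm

# check the status of your jobs
squeue -u $USER

# cancel a job if something went wrong (the job id is printed by sbatch)
scancel <job_id>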
Tasks
Write a job script which powers the training of your MLP.
Submit your job and maybe grab a coffee. ☕