4. Multilayer Perceptron

One of the simplest neural network architectures is the multilayer perceptron, and Fashion MNIST is one of the simplest commonly available datasets. We get started with “real” neural networks in PyTorch by training a multilayer perceptron on Fashion MNIST 😎.

Before going into the actual ML-related parts, we typically have to work on our data first. Preparing the data before doing anything in a machine learning framework can turn out to be one of the most time-consuming tasks. This is especially true for the “unique” datasets common in scientific machine learning, where we often have to find custom solutions. Luckily, Fashion MNIST is one of the standard examples and somebody else, i.e., Zalando SE, prepared everything for us. Still, we have to find a way to load our data into PyTorch and, potentially, to preprocess and postprocess it. Once all of this infrastructure is in place, we’ll define our model and train it.

Hint

The Quickstart tutorial conceptually does the same thing as this lab: Train a multilayer perceptron on Fashion MNIST.

4.1. Datasets and Data Loaders

As in Section 3, we’ll first have a brief look at our data. Then we wrap the Fashion MNIST dataset in a data loader.

Tasks

  1. Create a training and test dataset by calling torchvision.datasets.FashionMNIST. Use the transformation torchvision.transforms.ToTensor() to convert the data to PyTorch tensors.

  2. Visualize a few images and show their labels. Use Matplotlib for your plots. Save your visualizations in a PDF file by using PdfPages. An example is given in the Multipage PDF demo.

  3. Wrap the datasets into data loaders. Use torch.utils.data.DataLoader for this.
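
The first and third task could be approached as in the following minimal sketch. It is not the reference solution: the helper names `load_fashion_mnist` and `wrap_in_loaders`, the root directory `data`, and the batch sizes are assumptions made for illustration. The small demo at the end uses random tensors with Fashion MNIST's shapes as a stand-in, so the loader logic can be tried without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def load_fashion_mnist(i_root="data"):
    """Returns the (training, test) Fashion MNIST datasets as tensors.

    The root directory "data" is an assumption; the data is downloaded if missing.
    """
    # imported lazily so that the loader helper below works without torchvision
    import torchvision

    l_transform = torchvision.transforms.ToTensor()
    l_train = torchvision.datasets.FashionMNIST(
        root=i_root, train=True, download=True, transform=l_transform)
    l_test = torchvision.datasets.FashionMNIST(
        root=i_root, train=False, download=True, transform=l_transform)
    return l_train, l_test


def wrap_in_loaders(i_train, i_test, i_batch_size=64):
    """Wraps two datasets in data loaders; only the training data is shuffled."""
    l_loader_train = DataLoader(i_train, batch_size=i_batch_size, shuffle=True)
    l_loader_test = DataLoader(i_test, batch_size=i_batch_size, shuffle=False)
    return l_loader_train, l_loader_test


# stand-in data: 100 random 1x28x28 "images" with labels in 0..9
l_fake = TensorDataset(torch.rand(100, 1, 28, 28),
                       torch.randint(0, 10, (100,)))
l_loader_train, l_loader_test = wrap_in_loaders(l_fake, l_fake, i_batch_size=32)

# a data loader yields batches of (images, labels)
l_images, l_labels = next(iter(l_loader_train))
print(l_images.shape, l_labels.shape)
```

With the real data, `wrap_in_loaders(*load_fashion_mnist())` would return loaders over the 60,000 training and 10,000 test images.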

4.2. Training and Validation

We are ready to train our first “real” neural net in PyTorch! We’ll train a MultiLayer Perceptron (MLP). Our MLP assumes input images with 28 \(\times\) 28 pixels and 10 output classes. Initially, our MLP will use the following network architecture:

  • Fully-connected layer with 28 \(\times\) 28 input features and 512 output features;

  • ReLU activation;

  • Fully-connected layer with 512 input features and 512 output features;

  • ReLU activation; and

  • Fully-connected layer with 512 input features and 10 output features.
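
The architecture listed above can be sketched in PyTorch as follows. This is one possible layout, not necessarily the intended solution: the member names `m_flatten` and `m_layers` are assumptions, and the flatten step is added because the data loader delivers images of shape 1 \(\times\) 28 \(\times\) 28 while the first fully-connected layer expects 28 \(\times\) 28 = 784 features per sample.

```python
import torch


class Model(torch.nn.Module):
    """MLP with two hidden layers of 512 features for 28x28 input images."""

    def __init__(self):
        super().__init__()
        # turns every image of the batch into a vector with 28*28 entries
        self.m_flatten = torch.nn.Flatten()
        self.m_layers = torch.nn.Sequential(
            torch.nn.Linear(28 * 28, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 10),
        )

    def forward(self, i_input):
        l_flat = self.m_flatten(i_input)
        # one unnormalized score (logit) per class
        return self.m_layers(l_flat)


# apply the untrained model to a batch of four random "images"
l_model = Model()
l_logits = l_model(torch.rand(4, 1, 28, 28))
print(l_logits.shape)
```

The model returns raw scores for the ten classes; the predicted class of a sample is the index of its largest score.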

For the training procedure we’ll also have to think about a loss function and an optimizer. For now:

Listing 4.2.1 Template for the module eml.mlp.trainer.
## Trains the given MLP-model.
#  @param i_loss_func used loss function.
#  @param io_data_loader data loader containing the data to which the model is applied (single epoch).
#  @param io_model model which is trained.
#  @param io_optimizer optimizer which is used to update the model's parameters.
#  @return summed loss over all training samples.
def train( i_loss_func,
           io_data_loader,
           io_model,
           io_optimizer ):
  # switch model to training mode
  io_model.train()

  l_loss_total = 0

  # TODO: finish implementation

  return l_loss_total

Listing 4.2.2 Template for the module eml.mlp.tester.
import torch

## Tests the model.
#  @param i_loss_func used loss function.
#  @param io_data_loader data loader containing the data to which the model is applied.
#  @param io_model model which is tested.
#  @return summed loss over all test samples, number of correctly predicted samples.
def test( i_loss_func,
          io_data_loader,
          io_model ):
  # switch model to evaluation mode
  io_model.eval()

  l_loss_total = 0
  l_n_correct = 0

  # TODO: finish implementation

  return l_loss_total, l_n_correct

Listing 4.2.1 and Listing 4.2.2 contain two templates for the modules eml.mlp.trainer and eml.mlp.tester. We’ll use these to avoid spaghetti code by separating our training and testing procedures from the main code.
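
The TODO in Listing 4.2.1 could be filled in along the following lines. This is a hedged sketch under stated assumptions: `torch.nn.CrossEntropyLoss` and `torch.optim.SGD` with a learning rate of 0.01 are chosen here purely for illustration, the loss function is assumed to return the mean over a batch (so it is scaled by the batch size to obtain a sum over samples), and the demo trains a tiny linear model on random stand-in data instead of Fashion MNIST.

```python
import torch


def train(i_loss_func, io_data_loader, io_model, io_optimizer):
    """Trains the model for a single epoch and returns the summed loss."""
    # switch model to training mode
    io_model.train()

    l_loss_total = 0.0

    for l_inputs, l_labels in io_data_loader:
        # forward pass
        l_predictions = io_model(l_inputs)
        l_loss = i_loss_func(l_predictions, l_labels)

        # backward pass and parameter update
        io_optimizer.zero_grad()
        l_loss.backward()
        io_optimizer.step()

        # assumption: the loss is a batch mean, so scale it to a per-sample sum
        l_loss_total += l_loss.item() * l_inputs.size(0)

    return l_loss_total


# demo on random stand-in data with Fashion MNIST's shapes
torch.manual_seed(0)
l_data = torch.utils.data.TensorDataset(torch.rand(64, 1, 28, 28),
                                        torch.randint(0, 10, (64,)))
l_loader = torch.utils.data.DataLoader(l_data, batch_size=16)
l_model = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(28 * 28, 10))
l_loss_func = torch.nn.CrossEntropyLoss()
l_optimizer = torch.optim.SGD(l_model.parameters(), lr=1e-2)

l_losses = []
for l_epoch in range(5):
    l_losses.append(train(l_loss_func, l_loader, l_model, l_optimizer))
    print("epoch", l_epoch, "total training loss:", l_losses[-1])
```

Calling `zero_grad()` before `backward()` matters because PyTorch accumulates gradients across backward passes by default.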

Tasks

  1. Implement the class Model in the module eml.mlp.model which contains the MultiLayer Perceptron (MLP).

  2. Implement the training loop and print the total training loss after every epoch. For the time being, implement the training loop directly in your main function.

  3. Move the training loop to the module eml.mlp.trainer. Use the template in Listing 4.2.1 to guide your implementation.

  4. Implement the module eml.mlp.tester. Use the template in Listing 4.2.2 to guide your implementation. The module’s only function test simply applies the MLP to the given data and returns the obtained total loss and number of correctly predicted samples.

    Hint

    When testing your model, switch to evaluation mode through nn.Module.eval() and locally disable gradient tracking through torch.no_grad. Don’t forget to switch back to training mode afterwards if needed. Further information is available from the article Autograd mechanics.

4.3. Visualization

Now let’s once again professionalize our visualization efforts. This means that we provide some data to our MLP and infer Fashion MNIST’s classes by applying our trained model. The input data are then plotted together with the predicted labels.

Listing 4.3.1 Template for the module eml.vis.fashion_mnist.
import torch
import matplotlib.pyplot as plt

## Converts a Fashion MNIST numeric id to a string.
#  @param i_id numeric value of the label.
#  @return string corresponding to the id.
def toLabel( i_id ):
  l_labels = [ "T-Shirt",
               "Trouser",
               "Pullover",
               "Dress",
               "Coat",
               "Sandal",
               "Shirt",
               "Sneaker",
               "Bag",
               "Ankle Boot" ]

  return l_labels[i_id]

## Applies the model to the data and plots the data.
#  @param i_off offset of the first image.
#  @param i_stride stride between the images.
#  @param io_data_loader data loader from which the data is retrieved.
#  @param io_model model which is used for the predictions.
#  @param i_path_to_pdf optional path to an output PDF file; if given, nothing is shown at runtime.
def plot( i_off,
          i_stride,
          io_data_loader,
          io_model,
          i_path_to_pdf = None ):
  # switch to evaluation mode
  io_model.eval()

  # create pdf if required
  if i_path_to_pdf is not None:
    import matplotlib.backends.backend_pdf
    l_pdf_file = matplotlib.backends.backend_pdf.PdfPages( i_path_to_pdf )

  # TODO: finish implementation

  # close pdf if required
  if i_path_to_pdf is not None:
    l_pdf_file.close()

Tasks

  1. Implement a Fashion MNIST visualization module in eml.vis.fashion_mnist. Use the template in Listing 4.3.1 to guide your implementation. The module’s plot function takes the argument i_off for the offset of the first visualized image and the argument i_stride for the stride between images. For example, if i_off=5 and i_stride=17, the function would plot the images with ids 5, 22, 39, and so on.

  2. Monitor your training process by visualizing the test data after every ten epochs. Use the stride feature of eml.vis.fashion_mnist.plot to keep the file sizes small.
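
One way to fill in the plot function of Listing 4.3.1 is sketched below. Several choices here are assumptions rather than part of the template: the images are fetched by id from `io_data_loader.dataset` (so the ids do not depend on shuffling), every image goes on its own page, and the title format is made up. The demo uses random stand-in data, an untrained linear model, and a PDF in a temporary directory.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import torch
from matplotlib.backends.backend_pdf import PdfPages


def toLabel(i_id):
    """Converts a Fashion MNIST numeric id to a string."""
    l_labels = ["T-Shirt", "Trouser", "Pullover", "Dress", "Coat",
                "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot"]
    return l_labels[i_id]


def plot(i_off, i_stride, io_data_loader, io_model, i_path_to_pdf=None):
    """Plots every i_stride-th image, starting at i_off, with its prediction."""
    io_model.eval()
    l_pdf_file = PdfPages(i_path_to_pdf) if i_path_to_pdf is not None else None

    # assumption: ids refer to positions in the underlying dataset
    l_dataset = io_data_loader.dataset

    with torch.no_grad():
        for l_id in range(i_off, len(l_dataset), i_stride):
            l_image, l_label = l_dataset[l_id]
            # add a batch dimension before applying the model
            l_scores = io_model(l_image.unsqueeze(0))
            l_predicted = l_scores.argmax(dim=1).item()

            l_figure = plt.figure()
            plt.imshow(l_image.squeeze().numpy(), cmap="gray")
            plt.title(f"id {l_id}: predicted {toLabel(l_predicted)}, "
                      f"true {toLabel(int(l_label))}")
            plt.axis("off")

            # write a page to the PDF or show the figure interactively
            if l_pdf_file is not None:
                l_pdf_file.savefig(l_figure)
            else:
                plt.show()
            plt.close(l_figure)

    if l_pdf_file is not None:
        l_pdf_file.close()


# demo: 50 random "images"; ids 5, 22 and 39 are plotted for offset 5, stride 17
l_data = torch.utils.data.TensorDataset(torch.rand(50, 1, 28, 28),
                                        torch.randint(0, 10, (50,)))
l_loader = torch.utils.data.DataLoader(l_data, batch_size=16)
l_model = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(28 * 28, 10))
l_path = os.path.join(tempfile.mkdtemp(), "fashion_mnist.pdf")
plot(5, 17, l_loader, l_model, l_path)
```

For the second task, a call such as `plot(0, 250, l_loader_test, l_model, f"test_{l_epoch}.pdf")` inside the epoch loop, guarded by a check on the epoch number, would produce one small PDF every ten epochs; the stride of 250 and the file name are only examples.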

4.4. Batch Jobs

All puzzle pieces for training the MLP are in place. Further, in Section 1 we have seen that we can get a node from the cluster through the salloc command. This is great! We can already use dedicated resources to train our MLP. However, interactive resources require us to monitor the training process manually. What happens if we lose connection to the machine? Can we still grab a coffee or should we wait until completion and duly release our compute node afterwards? This is inconvenient: Our software is mature enough to run on its own 💪.

Batch jobs help us with exactly this issue! Put simply, we can write a shell script which automatically starts our training once resources of the machine are available and releases the occupied node(s) once we are done.

Listing 4.4.1 Example script for a batch job on the Draco cluster.
#!/usr/bin/env bash
##
# Example Draco job script.
##
#SBATCH --job-name=mlp_training
#SBATCH --output=mlp_training_%j.out
#SBATCH -p short
#SBATCH -N 1
#SBATCH --cpus-per-task=96
#SBATCH --time=01:00:00
#SBATCH --mail-type=all
#SBATCH --mail-user=alex.breuer@uni-jena.de

echo "submit host:"
echo "$SLURM_SUBMIT_HOST"
echo "submit dir:"
echo "$SLURM_SUBMIT_DIR"
echo "nodelist:"
echo "$SLURM_JOB_NODELIST"

# activate conda environment
module load tools/anaconda3/2021.05
source "$(conda info -a | grep CONDA_ROOT | awk -F ' ' '{print $2}')"/etc/profile.d/conda.sh
conda activate pytorch_x86

# train MLP
cd "$HOME/mlp_fashion_mnist"
export PYTHONUNBUFFERED=TRUE
python mlp_fashion_mnist.py

An example job script is given in Listing 4.4.1. The script is separated into two parts:

  1. A section sharing details of the job with the job scheduler, i.e., Slurm. Everything related to Slurm starts with #SBATCH. In this example we ask for a single node in the “short” queue for an hour. Further, we specify the name of the job, the location for the job’s output, and request emails on all changes of the job status. Now, when the job is starting (or done), we conveniently get an email. OK, to be precise, it’s not you who gets these emails in the example script, so please do adjust this part. 😅

  2. A section containing the commands which are executed once the job starts running. This includes not only the pure call to Python but also everything else we would do in an interactive job. For example, we might want to load a customized Conda environment.

Tasks

  1. Write a job script which runs the training of your MLP.

  2. Submit your job and maybe grab a coffee. ☕